The Windows Azure HDInsight Log Analysis Toolkit is a command-line tool with utilities for downloading and analyzing Windows Azure Storage logs. Its primary intended use is analyzing the Windows Azure Storage throttling that large HDInsight clusters can encounter (for more information, see Maximizing HDInsight throughput to Azure Blob Storage). To use the toolkit, follow these steps:

  1. Download and install the Windows Azure HDInsight Log Analysis Toolkit.
  2. Turn on logging for your storage account.
  3. Run a Hadoop job.
  4. Download your storage logs using the Log Analysis Toolkit (LAT).
  5. Analyze your storage logs using the LAT.

Details for each step are below.

Install the Log Analysis Toolkit (LAT)

Download and run the .msi installer here: Windows Azure HDInsight Log Analysis Toolkit. The .msi installs the toolkit in the C:\Program Files (x86)\Windows Azure HDInsight Log Analysis Toolkit\ directory. Note that lat.exe is not added to your Path environment variable by default.
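
Because lat.exe is not on the Path, one option is to add the install directory for the current PowerShell session. This is a minimal sketch that assumes the default install location given above; it affects only the current session, not the machine-wide setting.

```powershell
# Default install directory, per the installer note above.
$latDir = "C:\Program Files (x86)\Windows Azure HDInsight Log Analysis Toolkit"

# Append the directory to Path for this session only, if it is not already there.
if (($env:Path -split ';') -notcontains $latDir) {
    $env:Path = "$env:Path;$latDir"
}

# Check whether PowerShell can now resolve lat.exe (returns nothing if it cannot).
Get-Command lat.exe -ErrorAction SilentlyContinue
```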

Turn on Storage Account Logging

The LAT is designed to download and analyze Windows Azure Storage account logs. To turn on logging for your account, follow the instructions here. (For general information about storage analytics, see About Storage Analytics Logging.)

Run a job

For example purposes, the 10GB GraySort sample is used below. The jars for this sample are already in place when you provision an HDInsight cluster. The instructions for running this sample are here: The 10GB GraySort Sample. The PowerShell code below can be used to submit a job. A few things to note:

  • One of the arguments for the “teragen” job has been changed from “100000000” (the value used in the documentation) to “1000000000”. This increases the amount of data generated, sorted, and validated, which makes for a more realistic example.
  • For this example, a 16-node HDInsight cluster was used. Clusters with fewer nodes are not likely to run into throttling by Windows Azure Storage as they simply cannot produce enough throughput to go beyond the storage throttling limits.
  • The cluster’s self-throttling factors are manually set to “1.0”, which essentially turns self-throttling off. The intention is to produce some level of throttling by Azure Storage for the purposes of this example.

$clusterName = "bswandemo"

#-----------TERAGEN------------------
$teragenJob = New-AzureHDInsightMapReduceJobDefinition `
    -JobName "teragenJob" `
    -JarFile "wasb:///example/jars/hadoop-examples.jar" `
    -ClassName "teragen" `
    -Arguments "-Dmapred.map.tasks=50", "1000000000", "wasb:///example/data/10GB-sort-input" `
    -Defines @{"fs.azure.selfthrottling.write.factor" = "1.0"}

$teragenJob | Start-AzureHDInsightJob -Cluster $clusterName `
            | Wait-AzureHDInsightJob

#----------TERASORT-------------------
$terasortJob = New-AzureHDInsightMapReduceJobDefinition `
    -JobName "terasortjob" `
    -JarFile "/example/jars/hadoop-examples.jar" `
    -ClassName "terasort" `
    -Arguments "-Dmapred.map.tasks=50", "-Dmapred.reduce.tasks=25", "wasb:///example/data/10GB-sort-input", "wasb:///example/data/10GB-sort-output"

$terasortJob | Start-AzureHDInsightJob -Cluster $clusterName `
             | Wait-AzureHDInsightJob

#----------TERAVALIDATE----------------
$teravalidateJob = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile "/example/jars/hadoop-examples.jar" `
    -ClassName "teravalidate" `
    -Arguments "-Dmapred.map.tasks=50", "-Dmapred.reduce.tasks=25", "wasb:///example/data/10GB-sort-output", "wasb:///example/data/10GB-sort-validate" `
    -Defines @{"fs.azure.selfthrottling.read.factor" = "1.0"}

$teravalidateJob | Start-AzureHDInsightJob -Cluster $clusterName `
                 | Wait-AzureHDInsightJob
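
For a sense of scale, TeraGen writes 100-byte rows, so the row-count argument determines the data volume directly. The sketch below works through that arithmetic for the two row counts mentioned above. The function name Get-TeragenGigabytes is hypothetical, and a gigabyte is taken here as 10^9 bytes.

```powershell
# TeraGen writes fixed-size rows of 100 bytes each.
$bytesPerRow = 100

# Hypothetical helper: data volume in GB (10^9 bytes) for a given row count.
function Get-TeragenGigabytes([long]$Rows) {
    ($Rows * $bytesPerRow) / 1e9
}

Get-TeragenGigabytes 100000000     # row count from the documentation: 10
Get-TeragenGigabytes 1000000000    # row count used in this example: 100
```

So, under that assumption, the job above generates roughly ten times the data of the documented sample, even though the storage paths keep the 10GB name.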

 

Download Storage Logs

Use the Log Analysis Toolkit (LAT) that you installed earlier to download the storage logs that were generated when you ran the job above. (Note that you may have to wait up to 10 minutes for the logs to be generated before you can download them.) To use the download utility, you need 5 pieces of information:

  • -start: The start time of the job.
  • -end: The end time of the job.
  • -logcache: The directory to which the logs will be downloaded.
  • -account: Your storage account name (e.g. bswanstorage).
  • -key: Your storage account key.

With this information, you can run a command similar to this one (the -force option is only necessary if the log cache directory doesn't exist):

lat.exe download -start "2014-01-02T22:49:00Z" -end "2014-01-02T23:23:00Z" -logcache "C:\logcache" -account "Your account name" -key "Your account key" -force
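
If you submit the jobs from a script, you may prefer to capture the start and end times rather than type them by hand. The following is a sketch under some assumptions: Format-LatTime is a hypothetical helper that produces the UTC timestamp style used in the command above, and the account name, key, and log cache path are the same placeholders.

```powershell
# Hypothetical helper: format a DateTime in the UTC style shown above, e.g. 2014-01-02T22:49:00Z.
function Format-LatTime([datetime]$Time) {
    $Time.ToUniversalTime().ToString("yyyy-MM-dd'T'HH:mm:ss'Z'", [Globalization.CultureInfo]::InvariantCulture)
}

$jobStart = Get-Date    # capture just before submitting the first job
# ... submit the jobs and wait for them to finish here ...
$jobEnd = Get-Date      # capture after the last job completes

# The same options as in the command above, built as an argument list.
$latArgs = @(
    "download",
    "-start", (Format-LatTime $jobStart),
    "-end", (Format-LatTime $jobEnd),
    "-logcache", "C:\logcache",
    "-account", "Your account name",
    "-key", "Your account key",
    "-force"
)

# Run lat.exe if it can be found on the Path; otherwise just show the arguments.
if (Get-Command lat.exe -ErrorAction SilentlyContinue) {
    & lat.exe @latArgs
} else {
    "lat.exe " + ($latArgs -join ' ')
}
```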

Here’s example output:

[Screenshot: example output of the download command]

Navigating to your c:\logcache directory, you can see the downloaded files:

[Screenshot: downloaded log files in the c:\logcache directory]

 

Analyze Storage Logs

The next step is to analyze the storage logs. To do this, you need 5 pieces of information, plus an optional 6th:

  • -logcache: The folder you specified when downloading your storage logs.
  • -account: The name of your storage account.
  • -start: The start time of the job.
  • -end: The end time of the job.
  • -name: The name of the job. This name will be used as part of the name of the file to which the storage analyses are written.
  • -note: (Optional) This is a note about the job being analyzed.

With this information, you can execute a command similar to this:

lat.exe throttlinganalysis -logcache "C:\logcache" -account "Your storage account name" -start "2014-01-02T22:49:00Z" -end "2014-01-02T23:23:00Z" -name "teragen3" -note "throttling factor set to 1, 16 node cluster"

Here’s example output from the command line:

[Screenshot: command-line output of the throttlinganalysis command]

The analysis is written to two .csv files: jobname.summary.csv and jobname.details.csv (teragen3.summary.csv and teragen3.details.csv in my case). The summary file will contain information similar to this:

[Screenshot: contents of the summary .csv file]

We can see from the summary that some throttling did occur in this example. To get a better idea of when the throttling occurred, look at the details file, which contains information about what happened at each 1-second interval of the job:

[Screenshot: contents of the details .csv file]
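
As an alternative to scanning the details file by eye, you could filter it in PowerShell. The sketch below is only that: Get-NonZeroRows is a hypothetical helper, it assumes the file is a plain .csv with a single header row, and the column name in the commented usage lines is a made-up placeholder, because the real column names are not listed here. Check the file's actual headers first.

```powershell
# Hypothetical helper: return the rows of a CSV file in which the named column is greater than zero.
function Get-NonZeroRows {
    param(
        [Parameter(Mandatory)] [string]$Path,
        [Parameter(Mandatory)] [string]$Column
    )
    Import-Csv -Path $Path | Where-Object { [double]$_.$Column -gt 0 }
}

# List the real column names first, then substitute one of them below:
# (Import-Csv "teragen3.details.csv" | Select-Object -First 1).PSObject.Properties.Name
# Get-NonZeroRows -Path "teragen3.details.csv" -Column "ThrottledRequests"   # placeholder column name
```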

Because the details file is a .csv file, you can use Excel to chart the throughput for the job. To do this, select columns A through G, select INSERT from the main menu, choose Scatter from the Charts options, and then select the Straight Lines option:

[Screenshot: Excel INSERT menu with the Scatter chart options]

You should see a graph similar to this one:

[Screenshot: scatter chart of throughput over the course of the job]

As you can see, Azure Storage throttling occurred during the first job (Teragen) because of a high write load. In this particular example, the peak write loads were around 5Gbps, which happens to be the throughput quota for uploads to a geo-replicated storage account, so this example was straying into the red zone a little. One way to keep the Teragen job from issuing writes to Azure Storage above the quota rate is to adjust fs.azure.selfthrottling.write.factor so that the Hadoop tasks self-throttle their storage requests. Other solutions include increasing the storage bandwidth quota, reducing the cluster size, or reducing the cluster usage for the Teragen job.
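
To illustrate the self-throttling option, the teragen job definition from earlier could be resubmitted with a write factor below “1.0”. This is a sketch that assumes values below 1.0 increase self-throttling (since 1.0 essentially turns it off). The value “0.5” is purely illustrative, not a tuned recommendation, so expect to rerun the analysis and adjust it.

```powershell
# Same teragen definition as earlier, with an illustrative write factor below 1.0.
$teragenJob = New-AzureHDInsightMapReduceJobDefinition `
    -JobName "teragenJob" `
    -JarFile "wasb:///example/jars/hadoop-examples.jar" `
    -ClassName "teragen" `
    -Arguments "-Dmapred.map.tasks=50", "1000000000", "wasb:///example/data/10GB-sort-input" `
    -Defines @{"fs.azure.selfthrottling.write.factor" = "0.5"}   # 0.5 is an assumed starting point
```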

Last edited Jan 30, 2014 at 8:52 PM by BrianSwanMsft, version 6
