Using Job Submission cmdlets

The new cmdlets submit jobs to clusters remotely through web services. They work from any client machine where the cmdlets are installed, so there is no need to RDP into the cluster to submit jobs. The cmdlets support the following job types:
  • MapReduce
  • Hive
  • Pig
  • Streaming

Installation

Prepare work environment

Submit Hive query

  • Submitting a Hive query to a remote cluster consists of two steps. First, select the cluster you will submit jobs to by using the “Use” command (Use-AzureHDInsightCluster).
    Use-AzureHDInsightCluster "mysilverquick"
  • Then send Hive queries to the selected cluster. The query output or any errors are returned to the PowerShell console. This is similar to the standard Hadoop console environment (hive.cmd), but it works directly from your client machine.
    hive "select * from hivesampletable limit 10"

  • The command above produces the following output.
Submitting Hive query..
Started Hive query with Job Id : job_201309260027_0046
Hive query completed Successfully


8      18:54:20      en-US  Android       Samsung       SCH-i500      California       United States 13.9204007    0      0
23     19:19:44      en-US  Android       HTC    Incredible    Pennsylvania  United States NULL   0      0
23     19:19:46      en-US  Android       HTC    Incredible    Pennsylvania  United States 1.4757422     0      1
23     19:19:47      en-US  Android       HTC    Incredible    Pennsylvania  United States 0.245968      0      2
28     01:37:50      en-US  Android       Motorola      Droid X       Colorado       United States 20.3095339    1      1
28     00:53:31      en-US  Android       Motorola      Droid X       Colorado       United States 16.2981668    0      0
28     00:53:50      en-US  Android       Motorola      Droid X       Colorado       United States 1.7715228     0      1
28     16:44:21      en-US  Android       Motorola      Droid X       Utah   United States 11.6755987    2      1
28     16:43:41      en-US  Android       Motorola      Droid X       Utah   United States 36.9446892    2      0
  • Known issue: When a query contains special characters such as ‘%’, submitting it as in the example above fails, because the Templeton web service does not handle that character correctly. The workaround is to create a script file in the storage account and submit the query as a file using the -File parameter. For example:
    hive -File "wasb://yourcontainer@yourstorageaccount/yourjobfolder/query.hql"
  • The same limitation and workaround apply to the other job submission cmdlets described below. We are working on a fix for the next update.
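
The workaround above can be sketched end to end as follows. This is only an illustration under stated assumptions: it assumes the Azure Storage cmdlets (New-AzureStorageContext, Set-AzureStorageBlobContent) are installed alongside the HDInsight cmdlets, and the cluster name, storage account name, key, container, folder and the query itself are placeholders rather than values from this page.

```powershell
# Placeholder values (assumptions); replace with your own details.
$storageAccount = "yourstorageaccount"
$storageKey     = "yourstorageaccountkey"
$container      = "yourcontainer"

# Illustrative query containing '%', which fails when passed inline.
$query = "select * from hivesampletable where devicemodel like '%Droid%' limit 10"

# Write the query to a local script file (ASCII avoids a byte-order mark).
$localFile = Join-Path $env:TEMP "query.hql"
Set-Content -Path $localFile -Value $query -Encoding ASCII

# Upload the script file to the storage account.
$context = New-AzureStorageContext -StorageAccountName $storageAccount -StorageAccountKey $storageKey
Set-AzureStorageBlobContent -File $localFile -Container $container `
    -Blob "yourjobfolder/query.hql" -Context $context -Force

# Select the cluster, then submit the query by file path instead of inline text.
Use-AzureHDInsightCluster "yourclustername"
hive -File "wasb://$container@$storageAccount/yourjobfolder/query.hql"
```

Keeping the query in a file means the text never passes through the web service as a request parameter, which is where the special character is mishandled.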

Submit MapReduce job

  • Submitting a MapReduce job requires a few more steps. First, we’ll capture the cluster name in a variable.
    $clustername = "yourclustername"
  • The next step is to create a job definition object, which captures the arguments of the job about to be submitted to the cluster. In this example we’ll submit the standard word count job, which is distributed in hadoop-examples.jar. On the cluster file system the file is located at "C:\apps\Samples\hadoop-examples.jar"; copy it from there to the “/example/jars/” folder in your default storage account.
    $wordCountJob = New-AzureHDInsightMapReduceJobDefinition -JarFile "/example/jars/hadoop-examples.jar" -ClassName "wordcount" -Arguments "/example/data/gutenberg/davinci.txt", "/example/output/WordCount"
  • Now we’ll submit the job to the cluster, wait for it to complete, and print the job’s standard error output to the console.
    $wordCountJob `
        | Start-AzureHDInsightJob -Cluster $clustername `
        | Wait-AzureHDInsightJob -WaitTimeoutInSeconds 3600 `
        | Get-AzureHDInsightJobOutput -StandardError
  • The output file produced by the job is stored in the “/example/output/WordCount” location, as specified by the job arguments.
  • In a similar manner, you can create job definitions for Pig, Hive and Streaming jobs using New-AzureHDInsightPigJobDefinition, New-AzureHDInsightHiveJobDefinition and New-AzureHDInsightStreamingMapReduceJobDefinition, and then submit them to the cluster using the Start-AzureHDInsightJob, Wait-AzureHDInsightJob and Get-AzureHDInsightJobOutput cmdlets.
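
As a minimal sketch of that pattern for a Hive job, the sample query from the Hive section can be wrapped in a job definition and sent through the same pipeline. The -Query parameter and the -StandardOutput switch are used here on the assumption that they are available in your installed version of the cmdlets; the cluster name is a placeholder.

```powershell
# Placeholder cluster name (assumption); replace with your own.
$clustername = "yourclustername"

# Describe the Hive job; the query is the sample used earlier on this page.
$hiveJob = New-AzureHDInsightHiveJobDefinition -Query "select * from hivesampletable limit 10"

# Submit the job, wait up to one hour for completion, then print the query results.
$hiveJob `
    | Start-AzureHDInsightJob -Cluster $clustername `
    | Wait-AzureHDInsightJob -WaitTimeoutInSeconds 3600 `
    | Get-AzureHDInsightJobOutput -StandardOutput
```

Unlike the interactive hive command, this form does not depend on a cluster selected earlier in the session, which may make it a better fit for scripts.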

List jobs running or completed on the cluster

  • At any time you can inspect the status of jobs by using the Get-AzureHDInsightJob cmdlet.
    Get-AzureHDInsightJob $clustername | ft -a
  • You can also cancel a running job using the Stop-AzureHDInsightJob cmdlet.
    # $jobid holds the Id of the job to cancel, as listed by Get-AzureHDInsightJob
    $creds = Get-Credential
    Stop-AzureHDInsightJob $clustername -Credential $creds -JobId $jobid
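
One way to obtain the job Id for cancellation is to keep the object returned by Start-AzureHDInsightJob instead of piping it straight into Wait-AzureHDInsightJob. The sketch below assumes the returned job object exposes a JobId property and that Get-AzureHDInsightJob and Stop-AzureHDInsightJob accept named -Cluster and -JobId parameters; check this against your installed version. The cluster name is a placeholder.

```powershell
# Placeholder cluster name (assumption); replace with your own.
$clustername = "yourclustername"

# Same word count job definition as in the MapReduce section.
$wordCountJob = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile "/example/jars/hadoop-examples.jar" `
    -ClassName "wordcount" `
    -Arguments "/example/data/gutenberg/davinci.txt", "/example/output/WordCount"

# Start the job without waiting, and remember its Id.
$job   = $wordCountJob | Start-AzureHDInsightJob -Cluster $clustername
$jobid = $job.JobId

# Check the status of this specific job.
Get-AzureHDInsightJob -Cluster $clustername -JobId $jobid

# Cancel the job if it is no longer needed.
$creds = Get-Credential
Stop-AzureHDInsightJob -Cluster $clustername -Credential $creds -JobId $jobid
```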

Last edited Dec 12, 2013 at 4:54 AM by maxluk, version 4
