WebHCat Overview

A fairly common request from customers is the ability to initiate job execution on an HDInsight cluster programmatically. Some of these scenarios include:
  • Scheduled execution of a job (every night at midnight, update the recommendation database).
  • Incorporating job execution into a larger application (allow a client to configure and kick off web log processing).
  • Building end user query tools.
To enable these scenarios, an HDInsight cluster exposes a WebHCat endpoint. WebHCat is a REST API that provides metadata management and remote job submission for the Hadoop cluster. You can find updated documentation here. Note that WebHCat has also been referred to as "Templeton", so expect to see some references to that name.
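
To give a feel for what "REST API" means here, the following is a minimal sketch that probes the WebHCat status resource with a plain HttpClient. It assumes the resource lives at templeton/v1/status under the cluster address and that the endpoint accepts HTTP basic authentication with the cluster credentials; both should be verified against your cluster. WebHCatProbe and its members are hypothetical names made up for this illustration and are not part of any SDK.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

// Hypothetical helper for illustration only.
public static class WebHCatProbe
{
    // Builds the URI of the status resource (assumed path: templeton/v1/status).
    public static Uri BuildStatusUri(Uri clusterBase)
    {
        return new Uri(clusterBase, "templeton/v1/status");
    }

    // Encodes the credentials as an HTTP Basic Authorization header value.
    public static AuthenticationHeaderValue BasicAuth(string userName, string password)
    {
        byte[] raw = Encoding.UTF8.GetBytes(userName + ":" + password);
        return new AuthenticationHeaderValue("Basic", Convert.ToBase64String(raw));
    }

    // Sends a GET to the status resource and returns the raw JSON body.
    public static async Task<string> GetStatusAsync(Uri clusterBase, string userName, string password)
    {
        using (var client = new HttpClient())
        {
            client.DefaultRequestHeaders.Authorization = BasicAuth(userName, password);
            HttpResponseMessage response = await client.GetAsync(BuildStatusUri(clusterBase));
            response.EnsureSuccessStatusCode(); // throws on non-2xx answers
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

Calling WebHCatProbe.GetStatusAsync(new Uri("https://yourhadoopcluster.azurehdinsight.net:563"), "username", "password") should return a small JSON document if the server is up; the exact body depends on the WebHCat version.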

WebHCat surfaces the following capabilities:

Using the WebHCat .NET Client

Within a .NET application, you can use the Microsoft.Hadoop.WebClient client library to submit and monitor jobs:
  • Create an instance of WebHCatHttpClient, passing in your server configuration, namely:
      URL: https://yourhadoopcluster.azurehdinsight.net:563
      username/password: cluster credentials
  • Invoke the method for the job type you want. Initially, a basic reply is sent back with the job id. You can then either poll the job status, or use the WaitForJobToCompleteAsync method to obtain a task that completes when the job finishes.
The following sample code shows how you can do this in C#:
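using System;
using System.Net.Http;            // ReadAsAsync extension method
using Microsoft.Hadoop.WebHCat.Protocol;
using Newtonsoft.Json.Linq;       // JObject

// Connect to the cluster's WebHCat endpoint with the cluster credentials.
var httpClient = new WebHCatHttpClient(new Uri("https://yourhadoopcluster.azurehdinsight.net:563"), "username", "password");
string outputDir = "basichivejob";
// Submit the Hive query; the task completes once the server has replied.
var t1 = httpClient.CreateHiveJob(@"select * from awards;", null, null, outputDir, null);
var response = t1.Result;
// The response body is JSON; read the job id out of it.
var output = response.Content.ReadAsAsync<JObject>();
string id = output.Result.GetValue("id").ToString();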

CreateHiveJob submits the job and returns a response containing the job id, which is read out using Json.NET. From there, you can subscribe to the completion of the job using WaitForJobToCompleteAsync (not shown in the sample above); waiting on the task it returns blocks until the job actually completes. At that point, you can use the WebHDFS client to retrieve the output, or use any of the standard storage management tooling.
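
If you would rather poll the job status yourself, for example to report progress or to tell success from failure, the loop can be sketched as below. This is not taken from the client library: it assumes the job status is available as JSON from the WebHCat job resource (templeton/v1/jobs/<id> in recent WebHCat versions, templeton/v1/queue/<id> in older ones) and that the document carries the fields status.jobComplete, status.runState, percentComplete and exitValue as described in the Apache WebHCat reference. JobSnapshot, JobPoller and the fetchStatusJson delegate are hypothetical names for this sketch, and the success rule (run state 2 with a zero or absent exit value) is an assumption to check against your cluster.

```csharp
using System;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

// Hypothetical types for illustration; not part of Microsoft.Hadoop.WebClient.
public sealed class JobSnapshot
{
    public bool Complete;    // value of status.jobComplete
    public bool Succeeded;   // assumed rule: complete, runState == 2, exit value 0 or absent
    public string Progress;  // value of percentComplete, may be null
}

public static class JobPoller
{
    // Interprets one job status document (field names are assumptions, see above).
    public static JobSnapshot Parse(string json)
    {
        JObject doc = JObject.Parse(json);
        JObject status = doc["status"] as JObject;
        bool complete = status != null && (bool?)status["jobComplete"] == true;
        int? runState = status == null ? null : (int?)status["runState"];
        int? exitValue = (int?)doc["exitValue"];
        return new JobSnapshot
        {
            Complete = complete,
            Succeeded = complete && runState == 2 && (exitValue ?? 0) == 0,
            Progress = (string)doc["percentComplete"]
        };
    }

    // Polls until the job reports completion. fetchStatusJson is any function that
    // returns the current status JSON for the job, such as an HTTP GET on the job resource.
    public static async Task<JobSnapshot> WaitAsync(
        Func<Task<string>> fetchStatusJson, TimeSpan interval, Action<string> onProgress)
    {
        while (true)
        {
            JobSnapshot snapshot = Parse(await fetchStatusJson());
            if (onProgress != null) onProgress(snapshot.Progress);
            if (snapshot.Complete) return snapshot;
            await Task.Delay(interval); // wait before asking again
        }
    }
}
```

The delegate keeps the sketch independent of how the request is made; it could wrap HttpClient.GetStringAsync pointed at the job status URI for the id obtained above. The loop has no timeout, so a production version would likely add a maximum wait or a CancellationToken.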

Last edited Aug 5, 2013 at 4:49 PM by maxluk, version 2


Horace_Slughorn Jun 25, 2013 at 1:07 PM 
Would be interested to see a code sample of how to poll the job status, to retrieve progression of a job in HDInsight, like a "percentage complete", and a final success or failure status, since I don't think WaitForJobToCompleteAsync() will tell you if the job fails?