Overview of .NET Map/Reduce

Hadoop Streaming is a facility for writing map-reduce jobs in the language of your choice. The Hadoop API for .NET is a wrapper around Streaming that provides a convenient experience for .NET developers. Successful use of this API requires an understanding of the concepts and general functionality of Hadoop Streaming: see http://hadoop.apache.org/common/docs/r0.20.0/streaming.html for this background information.

The main facilities provided by this API are:
  1. Abstraction of job execution, which avoids manual construction of the streaming command line.
  2. Mapper, Reducer, and Combiner base classes and runtime wrappers that provide helpful abstractions. For example, the ReducerCombinerBase class provides input through (string key, IEnumerable<string> value) groups.
  3. Detection of .NET dependencies and their automatic inclusion in the streaming job.
  4. Local unit-testing support for map/combine/reduce classes via the StreamingUnit class.
  5. Support for JSON I/O and strongly typed mapper/combiner/reducer via Json* classes. The pattern used by the JSON classes can be used to create other serialization wrappers.
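
To make the grouped reducer input in item 2 concrete, here is a minimal sketch of how sorted key/value lines can be folded into (key, values) groups. It is written in Python and is not the API's own implementation; the function name `group_by_key` is hypothetical, and the sketch assumes the lines arrive already sorted by key, as they do on the reduce side of a streaming job.

```python
from itertools import groupby


def group_by_key(sorted_lines):
    """Yield (key, values) for each run of consecutive lines that share a key.

    Assumes each line is "key<TAB>value" and that lines are already sorted
    by key. A line with no tab is treated as a key with an empty value.
    """
    pairs = (line.rstrip("\r\n").split("\t", 1) for line in sorted_lines)
    for key, run in groupby(pairs, key=lambda kv: kv[0]):
        # Consume the run immediately; groupby invalidates it on the next step.
        yield key, [kv[1] if len(kv) > 1 else "" for kv in run]
```

A reducer written against this shape only ever sees one key at a time with all of its values, which is the convenience the base class is described as providing.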

Jobs can be submitted for execution via the API, to either a local or a remote Hadoop cluster. For a local cluster, a Hadoop Streaming command is generated and executed; the command is displayed on the console and can be used for direct invocation if required. For a remote cluster, the Templeton/WebHCat web service is used.
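
For readers unfamiliar with what such a streaming command looks like, the following Python sketch assembles one from its parts. It is an illustration only, not the command this API actually generates: the helper names, the jar path, and the executable names are assumptions. The `-input`, `-output`, `-mapper`, `-reducer`, and `-file` options are standard Hadoop Streaming options.

```python
import shlex


def build_streaming_command(streaming_jar, input_path, output_path,
                            mapper, reducer=None, files=()):
    """Return the argument list for a Hadoop Streaming invocation.

    All paths and executable names are supplied by the caller; nothing here
    reflects the exact arguments the .NET API emits.
    """
    args = ["hadoop", "jar", streaming_jar,
            "-input", input_path,
            "-output", output_path,
            "-mapper", mapper]
    if reducer:
        args += ["-reducer", reducer]
    for path in files:
        # Each -file entry ships a local file to the cluster with the job.
        args += ["-file", path]
    return args


def render(args):
    """Join the arguments into one display string (POSIX shell quoting)."""
    return " ".join(shlex.quote(a) for a in args)
```

Keeping the command as a list until the last moment avoids quoting mistakes; the list can be passed directly to `subprocess.run` instead of being rendered.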

Input & Output formats

The input/output format supported is line-oriented tab-separated records, staged in a Hadoop-supported file system such as HDFS or Azure Blob Storage. The input may comprise many files but each should have a consistent format: records delimited by \n\r, columns delimited by \t.
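
As a rough illustration of this record layout, the Python sketch below splits a line into columns and writes columns back out as a line. The function names are hypothetical. The parser strips any trailing combination of `\r` and `\n`, so it does not depend on which line terminator the files actually use; the writer emits `\n` as an assumption.

```python
def parse_record(line):
    """Split one input line into its tab-separated columns."""
    return line.rstrip("\r\n").split("\t")


def format_record(fields):
    """Join columns into one output line, rejecting embedded delimiters."""
    for field in fields:
        if "\t" in field or "\n" in field or "\r" in field:
            raise ValueError("field contains a tab or line break: %r" % field)
    return "\t".join(fields) + "\n"
```

Rejecting embedded tabs and line breaks at write time is a design choice for the sketch: a stray delimiter inside a field would otherwise silently shift columns for every downstream job.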

When a job comprises both a mapper and a reducer, the keys emitted by the mapper must be plain text that sorts correctly under an ordinal text comparison, such as that provided by .NET's `StringComparison.Ordinal`.
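
One practical consequence of ordinal sorting is that numeric keys sort as text, so "10" comes before "9". The Python sketch below shows the effect and one common workaround, zero-padding. Python compares strings by code point, which matches an ordinal comparison for the ASCII keys used here; the helper name and the width of 10 are assumptions for illustration.

```python
def pad_numeric_key(n, width=10):
    """Format a non-negative integer so that text order equals numeric order."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative keys")
    return "{:0{width}d}".format(n, width=width)


# Plain text order: digits compare character by character.
unpadded = sorted(["9", "10", "2"])

# Zero-padded keys sort in numeric order under the same comparison.
padded = sorted(pad_numeric_key(n) for n in (9, 10, 2))
```

Here `unpadded` is `["10", "2", "9"]`, while `padded` is in the numeric order 2, 9, 10. Ordinal order is also case-sensitive: "B" sorts before "a".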

In all other cases, the record fields may comprise formatted text such as JSON or another text representation of structured data. The API includes support for JSON fields via the classes in the `Microsoft.Hadoop.MapReduce.Json` namespace.
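
The idea of carrying structured data in a field can be sketched as follows: a plain-text key, a tab, then a JSON payload. This Python example does not use or mirror the classes in the namespace above; the function names are hypothetical. It relies on the fact that `json.dumps` escapes tabs and line breaks inside strings, so the payload never contains a raw delimiter.

```python
import json


def encode_record(key, value):
    """Serialize value as compact JSON and attach it to a plain-text key."""
    return key + "\t" + json.dumps(value, separators=(",", ":"))


def decode_record(line):
    """Split on the first tab and parse the remainder as JSON."""
    key, payload = line.rstrip("\r\n").split("\t", 1)
    return key, json.loads(payload)
```

Splitting on the first tab only keeps the key sortable as plain text while leaving the payload free to hold any JSON value.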

If data is in a binary format or document-oriented format (such as a folder full of .docx files), the input to a map-reduce job will typically be files that list the path to each real file, one path per line. The mapper can then look up the files using StorageSystem APIs.
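
A minimal Python sketch of this path-list pattern follows. The mapper receives one path per line, opens the referenced file, and emits a tab-separated record about it (here, just the size in bytes). The function name and the `open_file` parameter are hypothetical; in a real job the opener would go through the storage API for HDFS or Azure Blob Storage rather than the local file system used as a stand-in here.

```python
def map_paths(path_lines, open_file=lambda p: open(p, "rb")):
    """Yield "path<TAB>size" for every file listed in the input lines.

    open_file is injectable so that the storage backend can be swapped;
    the default reads from the local file system.
    """
    for line in path_lines:
        path = line.strip()
        if not path:
            continue  # skip blank lines in the listing
        with open_file(path) as handle:
            data = handle.read()
        yield "{}\t{}".format(path, len(data))
```

Because the listing files are small, Hadoop may assign many paths to one mapper; how the listing is split across files therefore controls the parallelism of the job.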


Last edited Mar 18, 2013 at 7:22 PM by maxluk, version 1


EgyptDevelopers Dec 14, 2014 at 11:41 AM 
I think the correct URL is

pwsmietan Jun 20, 2014 at 7:23 PM 
Has this URL been fixed somewhere? It's still broken on this page!

shyamn Apr 21, 2014 at 6:47 PM 
The http://hadoop.apache.org/common/docs/r0.20.0/streaming.html link appears broken. I get an HTTP 404 Not Found error when I click on this link.