Overview of .NET Map/Reduce
Hadoop Streaming is a facility for writing map-reduce jobs in the language of your choice. The Hadoop API for .NET is a wrapper around Streaming that provides a convenient experience for .NET developers. An understanding of the concepts and general functionality provided by Hadoop Streaming is necessary for successful use of this API; see the Apache Hadoop Streaming documentation for this background information.
The main facilities provided by this API are:
- Abstraction of job execution, avoiding manual construction of the streaming command line.
- Mapper, Reducer and Combiner base classes and runtime wrappers that provide helpful abstractions. For example, the ReducerCombinerBase class presents its input as (string key, IEnumerable<string> values) groups.
- Detection of .NET dependencies and their automatic inclusion in the streaming job.
- Local unit-testing support for map/combine/reduce classes via the StreamingUnit class.
- Support for JSON I/O and strongly typed mappers/combiners/reducers via the Json* classes. The pattern used by the JSON classes can be used to create other serialization wrappers.
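As an illustration of the base-class abstractions listed above, a minimal word-count mapper and reducer might look like the following. This is a sketch assuming the Microsoft.Hadoop.MapReduce namespace; exact member signatures may differ between SDK versions.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

// Emits one (word, "1") pair per word in each input line.
public class WordCountMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        foreach (string word in inputLine.Split(' ', '\t'))
        {
            if (word.Length > 0)
            {
                context.EmitKeyValue(word, "1");
            }
        }
    }
}

// Receives each key together with its grouped values and emits the total count.
public class WordCountReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values,
                                ReducerCombinerContext context)
    {
        context.EmitKeyValue(key, values.Count().ToString());
    }
}
```

Classes such as these can be exercised locally with the StreamingUnit class before being submitted to a cluster.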
Jobs can be submitted for execution via the API, to either a local or a remote Hadoop cluster. For a local cluster, a Hadoop Streaming command line is generated and executed; the command is displayed on the console and can be used for direct invocation if required. For a remote cluster, the Templeton/WebHCat web service is used.
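Job submission through the API can be sketched as follows. This is an illustrative sketch, not a definitive listing: MyMapper and MyReducer are hypothetical placeholder classes, the input/output paths are invented, and the exact Hadoop.Connect overloads and HadoopJobConfiguration members may vary across SDK versions.

```csharp
using Microsoft.Hadoop.MapReduce;

public class Program
{
    public static void Main(string[] args)
    {
        // Connect to the local cluster; an overload taking a cluster URI and
        // credentials targets a remote Templeton/WebHCat endpoint instead.
        IHadoop hadoop = Hadoop.Connect();

        var config = new HadoopJobConfiguration
        {
            InputPath = "input/demo",        // hypothetical paths for illustration
            OutputFolder = "output/demo"
        };

        // For a local cluster this generates and runs the streaming command
        // line (echoed to the console); for a remote cluster it issues a
        // WebHCat request.
        hadoop.MapReduceJob.Execute<MyMapper, MyReducer>(config);
    }
}
```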
Input & Output formats
The supported input/output format is line-oriented, tab-separated records, staged in a Hadoop-supported file system such as HDFS or Azure Blob Storage. The input may comprise many files, but each should have a consistent format: records delimited by newlines (\r\n) and columns delimited by tabs (\t).
When a job comprises both a mapper and a reducer, the keys emitted by the mapper must be plain text that sorts correctly under an ordinal text comparison, such as that provided by .NET's `StringComparison.Ordinal`.
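One practical consequence is that numeric keys should be formatted so that ordinal order matches numeric order (for example, zero-padded to a fixed width). A small illustration in plain .NET:

```csharp
using System;

class OrdinalKeyDemo
{
    static void Main()
    {
        // Ordinal comparison sorts character by character, so "10" sorts
        // before "9" ('1' precedes '9'), which breaks numeric ordering.
        Console.WriteLine(string.Compare("10", "9", StringComparison.Ordinal) < 0);  // True

        // Zero-padding restores numeric order under ordinal comparison.
        Console.WriteLine(string.Compare("09", "10", StringComparison.Ordinal) < 0); // True
    }
}
```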
In all other cases the record fields may comprise formatted text such as JSON or another text representation of structured data. The API includes support for JSON fields via the classes in the `Microsoft.Hadoop.MapReduce.Json` namespace.
If the data is in a binary or document-oriented format (such as a folder full of .docx files), the input to a map-reduce job will typically be files that list the path to each real file, one path per line. The mapper can then look up the files using StorageSystem.
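A mapper following this file-list pattern might be sketched as below. Note the hedges: the `context.StorageSystem` accessor and its `OpenFile` method are assumptions used for illustration; the SDK's actual file-access surface may differ.

```csharp
using System.IO;
using Microsoft.Hadoop.MapReduce;

// Each input record is the path of a real file staged in the cluster's
// storage system; the mapper opens that file and processes its contents.
public class FileListMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        string path = inputLine.Trim();

        // Hypothetical file access for illustration; the actual SDK member
        // for reading files from HDFS/Blob storage may have a different shape.
        using (Stream file = context.StorageSystem.OpenFile(path))
        {
            // ... parse the document and emit key/value pairs ...
            context.EmitKeyValue(path, file.Length.ToString());
        }
    }
}
```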
Example Map-Reduce program