Complex type serialization

The primary data format for Hadoop Streaming is line-oriented text, so the normal currency of Map and Reduce implementations is System.String. It is often convenient to transform these strings to and from .NET objects, which requires a serialization mechanism. The Microsoft.Hadoop.MapReduce.Json namespace provides a set of classes that use Json.NET as the serialization engine. As an example of their use, consider input data in which each line is a JSON-formatted value:
    {"ID":2, "Name":"Alan"}
    {"ID":3, "Name":"Bob"}

Further, assume that the following class definition represents these values:
    public class Employee {
        public int ID {get;set;}
        public string Name {get;set;}
    }

The JSON mapper classes perform the deserialization and transformation to Employee instances that convenient processing requires. Assume the output of the Mapper will be simple strings; in this case the appropriate Mapper type to use is JsonInMapperBase<>. For example:
    public class MyMapper : JsonInMapperBase<Employee> {
        // value arrives as an Employee deserialized from one input line
        public override void Map(Employee value, MapperContext context){ }
    }

JsonInMapperBase deserializes each input line and instantiates the corresponding Employee object, so the Map method you implement works with Employee inputs rather than strings.
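As a rough illustration of the work the base class takes off your hands, the sketch below deserializes input lines into Employee objects by calling Json.NET directly. It assumes the Newtonsoft.Json package is referenced. JsonLineDemo and ParseLines are hypothetical names used only for this illustration; they are not part of the SDK, and the actual internals of JsonInMapperBase may differ.

```csharp
using System;
using System.Collections.Generic;
using Newtonsoft.Json;

public class Employee
{
    public int ID { get; set; }
    public string Name { get; set; }
}

public static class JsonLineDemo
{
    // Hypothetical helper (not an SDK API): turns each JSON text line into an
    // Employee, roughly the step that happens before Map(Employee, ...) runs.
    public static IEnumerable<Employee> ParseLines(IEnumerable<string> lines)
    {
        foreach (string line in lines)
        {
            if (string.IsNullOrWhiteSpace(line)) continue; // skip blank lines
            yield return JsonConvert.DeserializeObject<Employee>(line);
        }
    }

    public static void Main()
    {
        var input = new[]
        {
            "{\"ID\":2, \"Name\":\"Alan\"}",
            "{\"ID\":3, \"Name\":\"Bob\"}"
        };

        // Print each deserialized record as tab-separated text.
        foreach (Employee e in ParseLines(input))
        {
            Console.WriteLine(e.ID + "\t" + e.Name);
        }
    }
}
```

With JsonInMapperBase<Employee>, none of this parsing code needs to appear in your mapper; only the body of Map is left to write.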

Other classes in `Microsoft.Hadoop.MapReduce.Json` support transferring object-representations between mapper and reducer and as the output of the reducer.
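To show why JSON suits object transfer over a line-oriented channel, here is a minimal round-trip sketch that again calls Json.NET directly. It assumes the Newtonsoft.Json package is referenced. JsonRoundTripDemo, ToLine and FromLine are hypothetical names for illustration, not SDK classes, and the SDK's own classes may handle this step differently.

```csharp
using System;
using Newtonsoft.Json;

public class Employee
{
    public int ID { get; set; }
    public string Name { get; set; }
}

public static class JsonRoundTripDemo
{
    // Hypothetical helper: serializes an object to a single line of text.
    // Formatting.None keeps the JSON on one line, which matters because
    // Hadoop Streaming treats each line as one record.
    public static string ToLine(Employee e)
    {
        return JsonConvert.SerializeObject(e, Formatting.None);
    }

    // Hypothetical helper: rebuilds the object from that line on the receiving side.
    public static Employee FromLine(string line)
    {
        return JsonConvert.DeserializeObject<Employee>(line);
    }

    public static void Main()
    {
        var original = new Employee { ID = 2, Name = "Alan" };

        string line = ToLine(original);   // what would travel as text
        Employee copy = FromLine(line);   // what the receiving side would see

        Console.WriteLine(line);
        Console.WriteLine(copy.ID + "\t" + copy.Name);
    }
}
```

Newline characters inside string properties are escaped by the serializer, so the serialized value still occupies exactly one line.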

Next: Debugging

Last edited Mar 18, 2013 at 7:28 PM by maxluk, version 1

