Serializing data with the Microsoft Avro Library

Nov 27, 2014 at 10:37 AM
I am experiencing problems when the struct or class to be serialized has more than one member of type Guid. When AvroContainer.CreateReader is called, a KeyNotFoundException is raised: "The given key was not present in the dictionary."

Serialization is based on Sample 5 of this article: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-dotnet-avro-serialization/#Scenario5I

Sample code to reproduce the exception:
using System; // needed for Guid
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Hadoop.Avro;
using System.Xml;
using System.Runtime.Serialization;
using Microsoft.Hadoop.Avro.Container;

namespace Avro.TestSerialization
{
  [DataContract(Namespace = "Avro.WcfHost.Model")]
  public struct TestDTO
  {
    [DataMember]
    public Guid GuidA;

    [DataMember]
    public string TestString;

    [DataMember]
    public Guid GuidB;
  }

  public class TestSerialization
  {
    private readonly IAvroSerializer<TestDTO> m_AvroSerializer;

    public bool TestEncodeDecode(TestDTO Obj)
    {
      TestDTO ObjResult;

      var memoryStreamWriter = new MemoryStream();
      
      var avroSerializerSettings = new AvroSerializerSettings();
      avroSerializerSettings.Resolver = new AvroDataContractResolver(allowNullable: true);

      using (var w = AvroContainer.CreateWriter<TestDTO>(memoryStreamWriter, Codec.Deflate))
      {
        using (var sequentialWriter = new SequentialWriter<TestDTO>(w, 24))
        {
          sequentialWriter.Write((TestDTO)Obj);
        }
      }

      using (var memoryStreamReader = new MemoryStream(memoryStreamWriter.ToArray()))
      {
        using (var sequentialReader = new SequentialReader<TestDTO>(AvroContainer.CreateReader<TestDTO>(memoryStreamReader, leaveOpen: true, settings: avroSerializerSettings, codecFactory: new CodecFactory())))
        {
          ObjResult = (TestDTO)sequentialReader.Objects.FirstOrDefault();
        }
      }

      return ObjResult.Equals(Obj);
    }
  }
}
I have tested this using Microsoft.Hadoop.Avro Version 1.4.0.0.

Is there something wrong with my implementation?

Regards,

Christiaan
Nov 27, 2014 at 2:22 PM
OK, I have downloaded the source code for the Microsoft.Hadoop.Avro library and debugged it.

I think I have found the problem. The schema for TestDTO above is serialized as follows:

"{\"type\":\"record\",\"name\":\"Avro.WcfHost.Model.TestDTO\",\"fields\":[{\"name\":\"GuidA\",\"type\":{\"type\":\"fixed\",\"name\":\"System.Guid\",\"size\":16}},{\"name\":\"TestString\",\"type\":[\"null\",\"string\"]},{\"name\":\"GuidB\",\"type\":\"System.Guid\"}]}"

It seems that a Guid is classified as a NamedSchema. The full definition is written only for GuidA, and GuidB refers to it by name. When the reader tries to deserialize the second Guid, it cannot find the definition in the namedSchemas dictionary. The exception is thrown when it tries to parse the type, because the key (name) does not exist for {\"name\":\"GuidB\",\"type\":\"System.Guid\"}; the reader needs to use the NamedSchema information contained in the GuidA definition.

JsonSchemaBuilder.cs line: 349

I have added the following code between lines 349 and 350 to register the NamedSchema if it does not already exist:
if (type is NamedSchema)
{
  NamedSchema newNamedSchema = (NamedSchema)type;
  string typeFullName = newNamedSchema.FullName;
  if (!namedSchemas.ContainsKey(typeFullName))
  {
    namedSchemas.Add(typeFullName, newNamedSchema);
  }
}
This seems to solve the problem. Can someone please confirm that my interpretation is correct?
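
The lookup failure described above can be illustrated with a small self-contained sketch. It does not use or reproduce the Microsoft.Hadoop.Avro source; NamedSchemaDemo, TryResolveAll and the registerDefinitions flag are hypothetical names invented for illustration. The sketch assumes only what the post describes: a named type is defined once, later fields refer to it by name, and a reference can be resolved only if the definition was added to a dictionary first.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a schema parser; not the library's actual code.
public static class NamedSchemaDemo
{
    // Walks the field types in order. A definition may register its name;
    // a reference must find that name in the dictionary.
    public static bool TryResolveAll(
        IEnumerable<(string Name, bool IsDefinition)> fieldTypes,
        bool registerDefinitions)
    {
        var namedSchemas = new Dictionary<string, string>();
        foreach (var (name, isDefinition) in fieldTypes)
        {
            if (isDefinition)
            {
                // Mirrors the idea of the proposed fix: remember the named type.
                if (registerDefinitions && !namedSchemas.ContainsKey(name))
                {
                    namedSchemas.Add(name, "fixed(16)");
                }
            }
            else if (!namedSchemas.ContainsKey(name))
            {
                // The post reports a KeyNotFoundException at the equivalent point.
                return false;
            }
        }
        return true;
    }

    public static void Main()
    {
        // GuidA carries the full definition, GuidB only the name.
        var fields = new[] { ("System.Guid", true), ("System.Guid", false) };

        Console.WriteLine(TryResolveAll(fields, registerDefinitions: false)); // False
        Console.WriteLine(TryResolveAll(fields, registerDefinitions: true));  // True
    }
}
```

With registration disabled, the second field cannot be resolved, which matches the reported symptom. With registration enabled, the reference to System.Guid is found, which is what the added check is intended to achieve.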