Distributed cache

Jan 24, 2013 at 10:25 AM
Edited Jan 29, 2013 at 10:36 AM

Hi, does anyone have a simple example of using the HDFS distributed cache via this SDK?

 

I need to load a machine learning model (XML) to use in each map task, and thought the distributed cache might be the best way.  If anyone knows a better way, please let me know.

P.S. Although the model is not large, it is too big to pass as a parameter to the job (I get an exception). Also, the model may change between executions of a chain of jobs.

 

Thanks, John

Jan 30, 2013 at 10:55 AM

I ended up deploying the model using HadoopJobConfiguration.FilesToInclude, along with some other native DLLs I was deploying.

I was then able to load the model in the MapperBase.Initialize override using the location of the mapper assembly  (Path.GetDirectoryName(this.GetType().Assembly.Location)).
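
Roughly what it ended up looking like, in case it helps. ScoringMapper, model.xml and the paths are just stand-ins for my own names, and the exact HadoopJob/Configure shape may differ slightly depending on the SDK version you have:

using System.IO;
using System.Xml;
using Microsoft.Hadoop.MapReduce;

public class ScoringMapper : MapperBase
{
    private XmlDocument model;   // loaded once per task process

    public override void Initialize(MapperContext context)
    {
        // FilesToInclude ships model.xml alongside the mapper assembly,
        // so resolve it relative to the assembly location.
        string folder = Path.GetDirectoryName(this.GetType().Assembly.Location);
        model = new XmlDocument();
        model.Load(Path.Combine(folder, "model.xml"));
    }

    public override void Map(string inputLine, MapperContext context)
    {
        // score the input line against the model and emit the result
        context.EmitKeyValue(inputLine, Score(model, inputLine));
    }

    private static string Score(XmlDocument model, string line)
    {
        // placeholder for the real model evaluation
        return "0";
    }
}

public class ScoringJob : HadoopJob<ScoringMapper>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        var config = new HadoopJobConfiguration();
        config.InputPath = "/input/data";
        config.OutputFolder = "/output/scored";
        config.FilesToInclude.Add(@"C:\models\model.xml");  // copied out to the nodes with the job
        return config;
    }
}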

I am guessing there is a better/preferred way of accomplishing this, so please let me know!

Thanks, John

Jul 13, 2013 at 7:39 AM
JohnGS wrote:
I ended up deploying the model using HadoopJobConfiguration.FilesToInclude along with some other native dlls I was deploying. I was then able to load the model in the MapperBase.Initialize override using the location of the mapper assembly  (Path.GetDirectoryName(this.GetType().Assembly.Location)). I am guessing there is a better/preferred way of accomplishing this, so please let me know! Thanks, John
When will MapperBase.Initialize be called? How about loading the model from an Azure storage account?

I was also wondering: if I declare a property carrying an object inside a mapper class that inherits from MapperBase, will that property be shared by the whole cluster, by a single node, or by a single process?
Coordinator
Jul 18, 2013 at 12:04 AM
To John: using FilesToInclude is the right approach for this task. You should be able to read the file from the current folder, since Hadoop copies it to the working folder of each node.
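
For example, inside Initialize a relative path is enough (model.xml is just a placeholder name, and the model field declaration lives in your mapper class):

public override void Initialize(MapperContext context)
{
    // Files listed in FilesToInclude are copied into the task's working
    // folder on each data node, so a relative path resolves to them.
    model = new XmlDocument();
    model.Load("model.xml");
}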

To Wen: the task is launched on each data node as a separate process. The assembly is copied to the data nodes, so the definition of the job class will be the same across all nodes in the cluster.
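
So a field or property declared on your mapper is scoped to one task process, something like this (CountingMapper is just an illustrative name):

using Microsoft.Hadoop.MapReduce;

public class CountingMapper : MapperBase
{
    // This field lives inside a single task process on a single data node.
    // Each task gets its own copy; nothing here is shared across the cluster.
    private int linesSeen;

    public override void Map(string inputLine, MapperContext context)
    {
        linesSeen++;   // counts only the lines handled by this task
        context.EmitKeyValue("lines", "1");
    }
}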
Jul 18, 2013 at 2:20 AM
Edited Jul 18, 2013 at 2:27 AM
Thanks, maxluk.

Besides using FilesToInclude, is there any other way to copy files to each data node? Since I am running tests right now, I may submit my jobs many times, and copying all the shared files on every submission takes considerable time. I am hoping there is a way to upload the files once to a specified folder on each data node; then I could drop FilesToInclude and just upload the assembly.
Coordinator
Jul 18, 2013 at 6:00 PM
At the moment you can add files to HDInsight only on a per-job basis. We plan to add cluster customization features that would allow creating a cluster with additional files already deployed.