Streaming Files from HDFS; Cloudera Support

Oct 25, 2012 at 3:26 PM

First off, congratulations to the team on getting this awesome SDK together! It looks very promising, and I think it will usher in a new era of big data on Windows/.NET.

After looking at the API, there are two things holding me back from using this for my specific needs: the lack of a way to stream large files from HDFS, and the lack of support for other distributions like Cloudera.

First, streaming large files. The only options the HdfsFile type offers for reading a file from HDFS are ReadAllBytes, ReadAllLines, and CopyToLocal. That is fine for smallish files, but not for large ones. The existing HDFS cluster I'm working with has many, many large files (1+ GB). ReadAllBytes and ReadAllLines are not ideal for very large files because the entire contents have to be held in RAM, and CopyToLocal takes a decent amount of time and creates a temp file that you have to clean up later. If you could provide an equivalent of the HdfsFileSystem.OpenFileStream method from the Hadoop on Azure preview, that would be awesome.
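In the meantime, here is a rough sketch of the kind of streaming I have in mind, done over plain WebHDFS rather than through the SDK. This is only a workaround idea and assumes the cluster actually exposes WebHDFS (which a 0.20.2-based CDH3 cluster may not); the host, port, and file path are placeholders, and an unsecured cluster may also need a &user.name= query parameter:

    using System;
    using System.IO;
    using System.Net;

    class WebHdfsStreamingSketch
    {
        static void Main()
        {
            // Placeholder name node host/port and HDFS path -- replace with real values.
            string url = "http://namenode.example.com:50070/webhdfs/v1/data/large-input.log?op=OPEN";

            var request = (HttpWebRequest)WebRequest.Create(url);
            // op=OPEN answers with a redirect to a data node; HttpWebRequest follows it by default.
            using (var response = request.GetResponse())
            using (var stream = response.GetResponseStream())
            using (var reader = new StreamReader(stream))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // Process one line at a time instead of pulling the whole file into RAM.
                }
            }
        }
    }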

Second, Cloudera support. The existing Hadoop cluster I'm working with runs Cloudera 0.20.2 (CDH3u3). As you may know, it reports a different protocol version for one reason or another, which prevents the standard Apache client from connecting. With other ways of interacting with the JVM, I've found this typically isn't an issue: I just point the HADOOP_HOME environment variable at the extracted Cloudera distribution. But this .NET SDK looks for hadoop.cmd, which does not ship with that distribution. I tried copying hadoop.cmd into the Cloudera bin path with HADOOP_HOME pointing to that folder, but now I get a StreamingException with "Copy to local did not succeed". It would be nice if this distribution could be supported as well.
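Just to illustrate the HADOOP_HOME swap I mean, this is roughly what I do from the calling process before touching the cluster (the path is only an example of where the Cloudera tarball was extracted):

    using System;

    class HadoopHomeSwapSketch
    {
        static void Main()
        {
            // Example path to the extracted Cloudera distribution -- adjust to your layout.
            // Setting the variable at process scope means child processes launched from here
            // inherit it, without touching the machine-wide environment.
            Environment.SetEnvironmentVariable(
                "HADOOP_HOME",
                @"C:\hadoop\hadoop-0.20.2-cdh3u3",
                EnvironmentVariableTarget.Process);

            // ... then call into the SDK / submit the job as usual ...
        }
    }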

Thanks!

Oct 25, 2012 at 8:28 PM

Thank you for the feedback and support.

re: HdfsFile/HdfsPath: I agree that they currently support only limited scenarios; they are included as the minimal set of necessary utilities while we work on getting more complete HDFS support into the product. As you mention, the short-term solution for processing large HDFS files is to use HdfsFile.CopyToLocal and System.IO.*.
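Something along these lines is what I have in mind; it's only a sketch, and the CopyToLocal signature shown here (HDFS path plus local destination) is assumed, so please check it against the build you're using:

    using System;
    using System.IO;
    using Microsoft.Hadoop.MapReduce;   // assumed SDK namespace

    class CopyToLocalStreamingSketch
    {
        static void Main()
        {
            // Placeholder HDFS path; the local copy goes to a throwaway temp file.
            string hdfsPath = "/data/large-input.log";
            string localPath = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());

            try
            {
                // Assumed shape: CopyToLocal(hdfsPath, localPath).
                HdfsFile.CopyToLocal(hdfsPath, localPath);

                // Stream the local copy line by line instead of reading it all into memory.
                using (var reader = new StreamReader(
                           new FileStream(localPath, FileMode.Open, FileAccess.Read)))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        // Process one line at a time.
                    }
                }
            }
            finally
            {
                // Clean up the temp copy so it doesn't accumulate on disk.
                if (File.Exists(localPath)) File.Delete(localPath);
            }
        }
    }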

re: Cloudera distribution: Good feedback and certainly something to consider. (And if you get things working, please share the information.)

-Mike

Oct 9, 2013 at 4:41 PM