Hadoop Streaming Job with binary input?

你离开我真会死。 提交于 2020-01-05 06:53:09

问题


I wish to convert a binary file in one format to a SequenceFile.

I have a Python script that takes that format on stdin and can output whatever I want.

The input format is not line-based. The individual records are binary themselves, hence the output format cannot be \t delimited or broken into lines with \n.

Can I use the Hadoop Streaming interface to consume a binary format? How do I produce a binary output format?

I assume the answer is "No" unless I hear otherwise.


回答1:


You may consider using NullWritable as output, and generating the SequenceFile directly inside of your python script. You can look up the hadoop-python project in github to see candidate code: though it is admittedly bit large-ish/heavy it does handle the sequencefile generation.



来源:https://stackoverflow.com/questions/15012162/hadoop-streaming-job-with-binary-input

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!