Hadoop streaming with Python: Keeping track of line numbers

Submitted by 五迷三道 on 2020-01-16 20:44:48

Question


I am trying to do what should be a simple task: I need to convert a text file to upper case using Hadoop streaming with Python.

I want to do it by using the TextInputFormat which passes file position keys and text values to the mappers. The problem is that Hadoop streaming automatically discards the file position keys, which are needed to preserve the ordering of the document.

How can I retain the file position information of the input to the mappers? Or is there a better way to convert a document to upper case using Hadoop streaming?

Thank you.


Answer 1:


If your job is just to upper-case a single file, then Hadoop isn't really going to give you anything over streaming the file to a single machine, performing the upper-casing there, and writing the contents back up to HDFS. Even with a huge file (say 1TB), you are still going to need to get everything to a single reducer so that when it is written back to HDFS it's stored in a single contiguous file.

In this case I would configure your streaming job to have a single mapper per file (set the minimum and maximum split sizes to something huge, larger than the file itself), and run a map-only job.
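A sketch of that job submission, assuming an upper-casing mapper script named `upper.py` and the usual location of the streaming jar (both are placeholders; adjust paths for your install). `mapreduce.job.reduces=0` makes the job map-only, and an oversized minimum split size forces a single mapper per input file:

```shell
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D mapreduce.job.reduces=0 \
    -D mapreduce.input.fileinputformat.split.minsize=1099511627776 \
    -D mapreduce.input.fileinputformat.split.maxsize=1099511627776 \
    -input /user/me/input.txt \
    -output /user/me/output \
    -mapper upper.py \
    -file upper.py
```

With one mapper reading the whole file sequentially and no reduce phase, the output preserves the input's line order without needing the discarded file-position keys.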



Source: https://stackoverflow.com/questions/20303448/hadoop-streaming-with-python-keeping-track-of-line-numbers
