Hadoop streaming: single file or multi file per map. Don't Split

Submitted by 放肆的年华 on 2020-01-02 10:29:32

Question


I have a lot of zip files that need to be processed by a C++ library, so I wrote my Hadoop streaming program in C++. The program reads a zip file, unzips it, and processes the extracted data. My problems are:

  1. My mapper can't get the content of exactly one file. It usually gets something like 2.4 files or 3.2 files: Hadoop sends several files to my mapper, but at least one of them is partial. Zip files can't be processed like this. Can I get exactly one file per map? I don't want to use a file list as input and read the files from my own program, because I want to keep the advantage of data locality.

  2. I can accept the contents of multiple zip files per map, as long as Hadoop doesn't split the zip files. I mean exactly 1, 2, or 3 files, not something like 2.3 files. Actually that would be even better, because my program needs to load about 800 MB of data to process the unzipped data. Can we do this?


Answer 1:


You can find the solution here:

http://wiki.apache.org/hadoop/FAQ#How_do_I_get_each_of_a_job.27s_maps_to_work_on_one_complete_input-file_and_not_allow_the_framework_to_split-up_the_files.3F

The easiest way I would suggest is to set mapred.min.split.size to a large value so that your files do not get split.
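For example, with Hadoop streaming you can pass the property as a generic option on the command line. A minimal sketch, assuming a C++ mapper binary named zip_mapper and a Hadoop 1.x-style streaming jar path (both are placeholders; adjust the paths for your installation):

# Set the minimum split size larger than any input file (here ~100 GB)
# so FileInputFormat never splits a single file across mappers.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.min.split.size=107374182400 \
    -input /user/hadoop/zips \
    -output /user/hadoop/zips-out \
    -mapper ./zip_mapper \
    -reducer NONE \
    -file zip_mapper

On newer Hadoop versions the equivalent property is mapreduce.input.fileinputformat.split.minsize.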

If this does not work, you will need to implement a custom InputFormat, which is not very difficult to do; you can find the steps at: http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat
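If you do write such an InputFormat (one whose isSplitable() returns false), you can plug it into the streaming job with -libjars and -inputformat. A rough sketch; the jar name and class name below are hypothetical placeholders for your own implementation:

# com.example.NonSplittableZipInputFormat is a hypothetical InputFormat
# whose isSplitable() returns false; package it into myformats.jar.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -libjars myformats.jar \
    -inputformat com.example.NonSplittableZipInputFormat \
    -input /user/hadoop/zips \
    -output /user/hadoop/zips-out \
    -mapper ./zip_mapper \
    -reducer NONE \
    -file zip_mapper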




Answer 2:


Rather than depending on the min split size, an easier way is to gzip your files. Gzip is not a splittable compression format, so Hadoop will not split a .gz file across mappers.

You can compress files using gzip:

http://www.gzip.org/

If you are on Linux, you can compress the extracted data with:

gzip -r /path/to/data

Then pass the compressed data as the input to your Hadoop streaming job.
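For example, after gzipping the data you would upload it to HDFS and point the job's -input at that directory. With the default TextInputFormat, Hadoop decompresses each .gz transparently and, because gzip is unsplittable, feeds each file to a single mapper in its entirety. The HDFS paths below are placeholders:

hadoop fs -mkdir /user/hadoop/gz-input
hadoop fs -put /path/to/data/*.gz /user/hadoop/gz-input/
# Then run the streaming job with -input /user/hadoop/gz-input,
# using the same invocation as in Answer 1 (no min split size override needed).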



Source: https://stackoverflow.com/questions/14027594/hadoop-streaming-single-file-or-multi-file-per-map-dont-split
