Hadoop streaming with zip input files

Submitted by 杀马特。学长 韩版系。学妹 on 2020-01-04 05:26:16

Question


I'm trying to run a streaming job where the input files are CSV files inside zip archives. I tried using this, but it doesn't seem to work with CDH4 (I get the error "class com.cotdp.hadoop.ZipFileInputFormat not org.apache.hadoop.mapred.InputFormat").

Does anyone know of an input file reader I can use for streaming with zip files? If possible, I'm looking for a multi-file reader (one that can be given the top-level directory).


Answer 1:


I ended up writing zipstream.

Note that it processes only the first file in each zip; I'll probably add support for multiple files later.
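To illustrate what "only the first file in the zip" means in practice, here is a minimal Python sketch using the standard `zipfile` module. It is not the zipstream implementation; the function name `first_member_lines` is hypothetical, and the sketch assumes the CSV data is UTF-8 text.

```python
import io
import zipfile


def first_member_lines(zip_source):
    """Yield text lines from the first file entry in a zip archive.

    zip_source may be a filesystem path or a seekable binary file object.
    Assumption: the entry is UTF-8 encoded text (e.g. a CSV file).
    """
    with zipfile.ZipFile(zip_source) as archive:
        # infolist() keeps archive order; skip pure directory entries.
        members = [info for info in archive.infolist() if not info.is_dir()]
        if not members:
            return
        # Only the first file is read; any later entries are ignored.
        with archive.open(members[0]) as raw:
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                yield line.rstrip("\n")
```

Supporting multiple files would amount to looping over every entry in `members` instead of taking only `members[0]`.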




Answer 2:


There are two Hadoop APIs for input formats: mapred.InputFormat and mapreduce.InputFormat.

mapreduce is the newer API and the one you should be using if you can.

I would check which InputFormat the ZipFileInputFormat actually implements. Your error says it is not an org.apache.hadoop.mapred.InputFormat, so if it implements the mapreduce version instead, you'll need to move your job over to that second API.

For a bit of background: in an earlier Hadoop version, 'mapred' was deprecated in favor of 'mapreduce', a newer, faster, and cleaner implementation. Unfortunately, this new API didn't include all the features of the old one, so in more recent versions of Hadoop 'mapred' was reinstated, and now there are two APIs that basically do the same thing.



Source: https://stackoverflow.com/questions/15257447/hadoop-streaming-with-zip-input-files
