hadoop streaming produces uncompressed files despite mapred.output.compress=true


Question


I run a hadoop streaming job like this:

hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
       -Dmapred.reduce.tasks=16 \
       -Dmapred.output.compres=true \
       -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
       -input foo \
       -output bar \
       -mapper "python zot.py" \
       -reducer /bin/cat

I do get 16 files in the output directory which contain the correct data, but the files are not compressed:

$ hadoop fs -get bar/part-00012
$ file part-00012
part-00012: ASCII text, with very long lines
  1. Why is part-00012 not compressed?
  2. How do I get my data set split into a small number (say, 16) of gzip-compressed files?

PS. See also "Using gzip as a reducer produces corrupt data"

PPS. This is for vw (Vowpal Wabbit).

PPPS. I guess I can do hadoop fs -get, gzip, hadoop fs -put, hadoop fs -rm 16 times, but this seems like a very non-hadoopic way.
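
That manual workaround would look roughly like the sketch below, assuming the 16 part files are named part-00000 through part-00015 under bar:

    for i in $(seq -w 0 15); do          # 00..15, matching the part-file suffixes
        hadoop fs -get bar/part-000$i .
        gzip part-000$i                  # compress locally
        hadoop fs -put part-000$i.gz bar/
        hadoop fs -rm bar/part-000$i     # drop the uncompressed original
    done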


Answer 1:


There is a typo in your mapred.output.compres parameter: it should be mapred.output.compress. Hadoop accepts unknown -D properties silently, so the misspelled key is simply stored and never read. If you look through your job history, I'll bet compression is turned off.
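
If you do need exactly 16 gzip files (question 2), that one-character fix alone is enough, with the rest of your original command unchanged:

    # before: the misspelled key is stored but never read, so output stays plain text
    -Dmapred.output.compres=true
    # after: the key FileOutputFormat actually checks
    -Dmapred.output.compress=true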

Also, you could avoid the reduce stage altogether, since it's just catting files. Unless you specifically need 16 part files, try leaving the job map-only:

hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
       -Dmapred.reduce.tasks=0 \
       -Dmapred.output.compress=true \
       -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
       -input foo \
       -output bar \
       -mapper "python zot.py"


Source: https://stackoverflow.com/questions/23767799/hadoop-streaming-produces-uncompressed-files-despite-mapred-output-compress-true
