Question
When I run hadoop streaming like this:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -Dmapred.reduce.tasks=16 \
  -input foo \
  -output bar \
  -mapper "python zot.py" \
  -reducer gzip
I get 16 files in the output directory which are, alas, corrupt:
$ hadoop fs -get bar/part-00012
$ file part-00012
gzip compressed data, from Unix
$ cat part-00012 | gunzip >/dev/null
gzip: stdin: invalid compressed data--format violated
When I inspect the output of cat part-00012 | gunzip visually, I see parts that look roughly correct, then parts that are clearly garbage, and then gunzip dies.
- Why is the file corrupt?
PS. I know I can have my data set split into a small number of gzip-compressed files using mapred.output.compress=true.
PPS. This is for vw.
Answer 1:
You'll want to enable output compression directly in the jobconf settings (the mapred.output.compress route you mention in your PS) rather than piping the data through gzip in the reducer. See my answer to your other question.
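As for why the gzip reducer produces corrupt files: Hadoop Streaming frames the reducer's stdout as line-oriented text records (splitting on newlines and tabs) before the output format writes them, so an arbitrary binary stream like gzip output gets re-framed and mangled. Below is a minimal sketch of the jobconf approach, assuming the classic mapred.* property names already used in the question and the stock org.apache.hadoop.io.compress.GzipCodec; cat stands in here as a pass-through reducer, an illustrative choice rather than anything from the original answer.

# Enable output compression in the jobconf instead of compressing in the reducer.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -Dmapred.reduce.tasks=16 \
  -Dmapred.output.compress=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -input foo \
  -output bar \
  -mapper "python zot.py" \
  -reducer cat

Each output file then comes out as part-NNNNN.gz and gunzips cleanly, because the compression is applied by the output format after the records are framed, not to a byte stream that streaming re-interprets as text lines.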
Source: https://stackoverflow.com/questions/23767971/using-gzip-as-a-reducer-produces-corrupt-data