Question
When I run hadoop streaming like this:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -Dmapred.reduce.tasks=16 \
  -input foo \
  -output bar \
  -mapper "python zot.py" \
  -reducer gzip
I get 16 files in the output directory which are, alas, corrupt:
$ hadoop fs -get bar/part-00012
$ file part-00012
gzip compressed data, from Unix
$ cat part-00012 | gunzip >/dev/null
gzip: stdin: invalid compressed data--format violated
When I inspect the output of cat part-00012 | gunzip visually, I see parts that look roughly correct, then parts that are clearly garbage, and then gunzip dies.
- Why is the file corrupt?
PS. I know I can have my data set split into a small number of gzip-compressed files using mapred.output.compress=true.
PPS. This is for vw.
Answer 1:
You'll want to enable output compression directly in the jobconf settings (the mapred.output.compress route you mention in your PS) rather than piping the data through gzip in the reducer. See my answer to your other question.
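As for why the gzip reducer produces corrupt files: Hadoop Streaming frames the reducer's stdout as line-oriented text records (splitting on newlines and tabs) before the output format writes them, so an arbitrary binary stream like gzip output gets re-framed and mangled. Below is a minimal sketch of the jobconf approach, assuming the classic mapred.* property names already used in the question and the stock org.apache.hadoop.io.compress.GzipCodec; cat stands in here as a pass-through reducer, an illustrative choice rather than anything from the original answer.

# Enable output compression in the jobconf instead of compressing in the reducer.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -Dmapred.reduce.tasks=16 \
  -Dmapred.output.compress=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -input foo \
  -output bar \
  -mapper "python zot.py" \
  -reducer cat

Each output file then comes out as part-NNNNN.gz and gunzips cleanly, because the compression is applied by the output format after the records are framed, not to a byte stream that streaming re-interprets as text lines.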
Source: https://stackoverflow.com/questions/23767971/using-gzip-as-a-reducer-produces-corrupt-data