Hadoop: compress file in HDFS?

逝去的感伤 2020-11-27 18:23

I recently set up LZO compression in Hadoop. What is the easiest way to compress a file in HDFS? I want to compress a file and then delete the original. Should I create a

7 Answers
  • 2020-11-27 18:55

    For me, it's lower overhead to write a Hadoop Streaming job to compress files.

    This is the command I run:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
      -Dmapred.output.compress=true \
      -Dmapred.compress.map.output=true \
      -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
      -Dmapred.reduce.tasks=0 \
      -input <input-path> \
      -output $OUTPUT \
      -mapper "cut -f 2"
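
    Not part of the original answer, but one way to sanity-check the result afterwards (the exact part-file names depend on the job, so a glob is used here): hadoop fs -text understands the gzip codec, so you can preview the compressed output directly:

    # list the compressed part files and preview a few decompressed lines
    hadoop fs -ls $OUTPUT
    hadoop fs -text "$OUTPUT/part-*" | head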
    

    I'll also typically stash the output in a temp folder in case something goes wrong:

    OUTPUT=/tmp/hdfs-gzip-`basename $1`-$RANDOM
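
    The question also asks to delete the original once it's compressed. That step isn't in the answer above, but a minimal sketch might look like this (assuming $1 is the HDFS file being compressed and that the job writes the standard _SUCCESS marker):

    # hypothetical follow-up: remove the source only after the job reports success;
    # the compressed data now lives in $OUTPUT as part-*.gz files
    hadoop fs -test -e $OUTPUT/_SUCCESS && hadoop fs -rm $1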
    

    One additional note: I don't specify a reducer in the streaming job, but you certainly can. Adding one forces all of the lines to be sorted, which can take a long time with a large file. There might be a way to get around that by overriding the partitioner, but I didn't bother figuring it out. The unfortunate part of this approach is that you can still end up with many small files that don't use HDFS blocks efficiently; that's one reason to look into Hadoop Archives.
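
    For reference (my addition, with made-up example paths), a Hadoop Archive is built with the hadoop archive tool and then read back through the har:// scheme:

    # pack the many small files under /user/foo/logs into a single .har
    hadoop archive -archiveName logs.har -p /user/foo/logs /user/foo/archived
    # browse the archived files through the har filesystem
    hadoop fs -ls har:///user/foo/archived/logs.har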
