Hadoop: compress file in HDFS?

逝去的感伤 · 2020-11-27 18:23

I recently set up LZO compression in Hadoop. What is the easiest way to compress a file in HDFS? I want to compress a file and then delete the original. Should I create a MapReduce job with an identity mapper, or is there an easier way?

7 Answers
  • 2020-11-27 18:31

    The streaming command from Jeff Wu, together with concatenating the compressed output files, will give you a single compressed file. When a non-Java mapper is passed to the streaming job and the input format is text, streaming outputs just the value and not the key.

    hadoop jar contrib/streaming/hadoop-streaming-1.0.3.jar \
                -Dmapred.reduce.tasks=0 \
                -Dmapred.output.compress=true \
                -Dmapred.compress.map.output=true \
                -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
                -input filename \
                -output /filename \
                -mapper /bin/cat \
                -inputformat org.apache.hadoop.mapred.TextInputFormat \
                -outputformat org.apache.hadoop.mapred.TextOutputFormat
    hadoop fs -cat /path/part* | hadoop fs -put - /path/compressed.gz
    
  • 2020-11-27 18:33

    This is what I've used:

    /*
     * Pig script to compress a directory
     * input:   hdfs input directory to compress
     *          hdfs output directory
     * 
     * 
     */
    
    set output.compression.enabled true;
    set output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;
    
    -- comma-separated list of HDFS directories to compress
    input0 = LOAD '$IN_DIR' USING PigStorage();
    
    --single output directory
    STORE input0 INTO '$OUT_DIR' USING PigStorage(); 
    

    This uses BZip2 rather than LZO, though, so it may be a bit slower.

  • 2020-11-27 18:40

    I suggest you write a MapReduce job that, as you say, just uses the identity mapper. While you are at it, consider writing the data out to sequence files to improve the performance of loading them later. Sequence files can be compressed at the record level or at the block level; you should see which works best for you, since each is optimized for different kinds of records. A minimal sketch of such a job is below.
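
    Not from the original answer, but here is a minimal sketch of that approach, assuming the newer org.apache.hadoop.mapreduce API; the class name is made up and GzipCodec is a placeholder (swap in your LZO codec class if that is what you configured):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class CompressToSequenceFile {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "compress-to-seqfile");
            job.setJarByClass(CompressToSequenceFile.class);

            // The base Mapper class is the identity mapper: each (offset, line) pair passes through unchanged.
            job.setMapperClass(Mapper.class);
            job.setNumReduceTasks(0);                  // map-only job
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            // Write block-compressed sequence files; use CompressionType.RECORD for record-level compression.
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input to compress
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }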

  • 2020-11-27 18:43

    I know this is an old thread, but if anyone is following it (like me), it may be useful to know that either of the following two commands leaves a tab (\t) character at the end of each line:

     hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
          -Dmapred.output.compress=true \
          -Dmapred.compress.map.output=true \
          -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
          -Dmapred.reduce.tasks=0 \
          -input <input-path> \
          -output $OUTPUT \
          -mapper "cut -f 2"
    
    
    hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
            -Dmapred.reduce.tasks=1 \
            -Dmapred.output.compress=true \
            -Dmapred.compress.map.output=true \
            -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
            -input /input/raw_file \
            -output /archives/ \
            -mapper /bin/cat \
            -reducer /bin/cat \
            -inputformat org.apache.hadoop.mapred.TextInputFormat \
            -outputformat org.apache.hadoop.mapred.TextOutputFormat
    

    Since hadoop-streaming.jar adds x'09' (a tab) at the end of each line, the fix is to set the following two parameters to the delimiter you actually use (in my case it was a comma):

     -Dstream.map.output.field.separator=, \
     -Dmapred.textoutputformat.separator=, \
    

    The full command to execute:

    hadoop jar <HADOOP_HOME>/jars/hadoop-streaming-2.6.0-cdh5.4.11.jar \
            -Dmapred.reduce.tasks=1 \
            -Dmapred.output.compress=true \
            -Dmapred.compress.map.output=true \
            -Dstream.map.output.field.separator=, \
            -Dmapred.textoutputformat.separator=, \
            -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec \
            -input file:////home/admin.kopparapu/accenture/File1_PII_Phone_part3.csv \
            -output file:///home/admin.kopparapu/accenture/part3 \
            -mapper /bin/cat \
            -reducer /bin/cat \
            -inputformat org.apache.hadoop.mapred.TextInputFormat \
            -outputformat org.apache.hadoop.mapred.TextOutputFormat
    
  • 2020-11-27 18:47

    @Chitra I cannot comment due to reputation issues.

    Here is everything in one command: instead of using the second command, you can reduce directly into a single compressed file:

    hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
            -Dmapred.reduce.tasks=1 \
            -Dmapred.output.compress=true \
            -Dmapred.compress.map.output=true \
            -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
            -input /input/raw_file \
            -output /archives/ \
            -mapper /bin/cat \
            -reducer /bin/cat \
            -inputformat org.apache.hadoop.mapred.TextInputFormat \
            -outputformat org.apache.hadoop.mapred.TextOutputFormat
    

    Thus, you save a lot of space by ending up with only one compressed file.

    For example, let's say I have 4 files of 10 MB each (plain text, JSON formatted).

    The map-only job gives me 4 files of about 650 KB each; if I map and reduce, I get 1 file of 1.05 MB (versus roughly 2.6 MB total for the four separate files).

  • 2020-11-27 18:52

    Well, if you compress a single file, you may save some space, but you can't really use Hadoop's power to process that file since the decompression has to be done by a single Map task sequentially. If you have lots of files, there's Hadoop Archive, but I'm not sure it includes any kind of compression. The main use case for compression I can think of is compressing the output of Maps to be sent to Reduces (save on network I/O).
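
    As a small illustration of that last use case (my own sketch, not from the answer), compressing map output is just configuration; the legacy property names below are the same ones used in the streaming commands elsewhere in this thread, and the codec is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MapOutputCompressionExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Compress intermediate map output before it is shuffled to the reducers.
            conf.setBoolean("mapred.compress.map.output", true);
            conf.set("mapred.map.output.compression.codec",
                     "org.apache.hadoop.io.compress.GzipCodec");
            Job job = Job.getInstance(conf, "with-compressed-map-output");
            // ... set mapper, reducer, input and output paths as usual ...
        }
    }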

    Oh, to answer your question more completely, you'd probably need to implement your own RecordReader and/or InputFormat to make sure the entire file is read by a single Map task, and that the correct decompression codec is used. A rough sketch of the InputFormat part follows.
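
    This is my own sketch, not the answer's code: the hypothetical WholeFileTextInputFormat below simply disables splitting, so each input file is handled by exactly one map task, and the underlying LineRecordReader picks the decompression codec from the file extension (e.g. .gz):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // Never split: the whole file goes to a single map task.
            return false;
        }
    }

    Wire it in with job.setInputFormatClass(WholeFileTextInputFormat.class).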
