Merging multiple files into one within Hadoop

遇见更好的自我 2020-12-01 02:18

I get multiple small files arriving in my input directory that I want to merge into a single file, without going through the local file system or writing MapReduce jobs. Is there a way I could do this?

8 Answers
  • 2020-12-01 02:37

    Okay... I figured out a way using hadoop fs commands:

    hadoop fs -cat [dir]/* | hadoop fs -put - [destination file]
    

    It worked when I tested it...any pitfalls one can think of?

    Thanks!
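
    A concrete form of this, with hypothetical paths. One pitfall worth noting (echoed by another answer below): the bytes still stream through the client machine running the command, even though no local temp file is written.

    # concatenate everything in the source dir and stream it straight back into HDFS;
    # "-put -" reads from stdin, so nothing is written to the local disk
    hadoop fs -cat /data/small_files/* | hadoop fs -put - /data/merged/all_files.txt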

  • 2020-12-01 02:46

    If you set up FUSE to mount your HDFS to a local directory, then your output path can be on the mounted filesystem.

    For example, I have our HDFS mounted to /mnt/hdfs locally. I run the following command and it works great:

    hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt
    

    Of course, there are other reasons to use fuse to mount HDFS to a local directory, but this was a nice side effect for us.
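
    For completeness, a sketch of how such a mount can be set up. The exact mount command depends on your distribution and which FUSE package is installed (fuse_dfs in Apache Hadoop contrib, hadoop-fuse-dfs in CDH), so treat the invocation below as an assumption to verify; host, port and paths are hypothetical.

    # mount HDFS at /mnt/hdfs using the CDH-style hadoop-fuse-dfs wrapper
    mkdir -p /mnt/hdfs
    hadoop-fuse-dfs dfs://namenode.example.com:8020 /mnt/hdfs

    # once mounted, getmerge can write its "local" output directly into the mounted HDFS path
    hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt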

  • 2020-12-01 02:49

    To keep everything on the grid, use Hadoop Streaming with a single reducer and cat as both the mapper and the reducer (basically a no-op), and add compression via MR flags.

    hadoop jar \
        $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
        -Dmapred.reduce.tasks=1 \
        -Dmapred.job.queue.name=$QUEUE \
        -input "$INPUT" \
        -output "$OUTPUT" \
        -mapper cat \
        -reducer cat
    

    If you want compression, add:

        -Dmapred.output.compress=true \
        -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
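
    A side note: on Hadoop 2.x and later the mapred.* properties above still work but are deprecated in favour of mapreduce.* names. A sketch of the same job using the newer names, assuming the same streaming jar path; verify the property names against your Hadoop version.

    hadoop jar \
        $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
        -Dmapreduce.job.reduces=1 \
        -Dmapreduce.job.queuename=$QUEUE \
        -Dmapreduce.output.fileoutputformat.compress=true \
        -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
        -input "$INPUT" \
        -output "$OUTPUT" \
        -mapper cat \
        -reducer cat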

  • 2020-12-01 02:49
    hadoop fs -getmerge <dir_of_input_files> <mergedsinglefile>
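
    One caveat this answer does not mention: getmerge writes the merged result to the local filesystem, which is exactly what the question wanted to avoid. If the merged file needs to end up back in HDFS, a sketch with hypothetical paths:

    # merge the HDFS directory into a local temp file, upload it back to HDFS, then clean up
    hadoop fs -getmerge /input/dir /tmp/merged.txt
    hadoop fs -put /tmp/merged.txt /output/merged.txt
    rm /tmp/merged.txt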
    
  • 2020-12-01 02:50

    All the solutions are equivalent to doing a

    hadoop fs -cat [dir]/* > tmp_local_file
    hadoop fs -copyFromLocal tmp_local_file [destination file]
    

    It only means that the local machine's I/O is on the critical path of the data transfer.

  • 2020-12-01 02:52

    If you are working on a Hortonworks cluster and want to merge multiple files from an HDFS location into a single file, you can run the 'hadoop-streaming-2.7.1.2.3.2.0-2950.jar' jar with a single reducer, which writes the merged file to the HDFS output location.

    $ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
        -Dmapred.reduce.tasks=1 \
        -input "/hdfs/input/dir" \
        -output "/hdfs/output/dir" \
        -mapper cat \
        -reducer cat
    

    You can download this jar from the 'Get hadoop streaming jar' page.

    If you are writing Spark jobs and want a merged file, to avoid creating multiple RDDs and the resulting performance bottlenecks, use this piece of code before transforming your RDD:

    sc.textFile("hdfs://...../part*").coalesce(1).saveAsTextFile("hdfs://...../filename")

    This will merge all the part files into one and save it back to an HDFS location.
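
    Note that saveAsTextFile writes a directory of part files; after coalesce(1) it contains a single part-00000. A small shell sketch, with hypothetical paths, for moving that single part file to a plain file name:

    # the "filename" output above is a directory holding one part file plus _SUCCESS
    hdfs dfs -ls /data/merged_output
    hdfs dfs -mv /data/merged_output/part-00000 /data/merged.txt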
