Merging multiple files into one within Hadoop

遇见更好的自我 2020-12-01 02:18

I get multiple small files arriving in my input directory that I want to merge into a single file, without going through the local file system or writing MapReduce jobs. Is there a way I could do this?

8 Answers
  • 2020-12-01 02:37

    Okay... I figured out a way using hadoop fs commands:

    hadoop fs -cat [dir]/* | hadoop fs -put - [destination file]
    

    It worked when I tested it...any pitfalls one can think of?

    Thanks!
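
    A concrete form of this, with hypothetical paths. One pitfall worth noting (echoed by another answer below): the bytes still stream through the client machine running the command, even though no local temp file is written.

    # concatenate everything in the source dir and stream it straight back into HDFS;
    # "-put -" reads from stdin, so nothing is written to the local disk
    hadoop fs -cat /data/small_files/* | hadoop fs -put - /data/merged/all_files.txt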

  • 2020-12-01 02:46

    If you set up FUSE to mount your HDFS to a local directory, then your output path can be on the mounted filesystem.

    For example, I have our HDFS mounted to /mnt/hdfs locally. I run the following command and it works great:

    hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt
    

    Of course, there are other reasons to use fuse to mount HDFS to a local directory, but this was a nice side effect for us.
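
    For completeness, a sketch of how such a mount can be set up. The exact mount command depends on your distribution and which FUSE package is installed (fuse_dfs in Apache Hadoop contrib, hadoop-fuse-dfs in CDH), so treat the invocation below as an assumption to verify; host, port and paths are hypothetical.

    # mount HDFS at /mnt/hdfs using the CDH-style hadoop-fuse-dfs wrapper
    mkdir -p /mnt/hdfs
    hadoop-fuse-dfs dfs://namenode.example.com:8020 /mnt/hdfs

    # once mounted, getmerge can write its "local" output directly into the mounted HDFS path
    hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt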

  • 2020-12-01 02:49

    To keep everything on the grid, use Hadoop Streaming with a single reducer and cat as both the mapper and the reducer (basically a no-op), and add compression via MR flags.

    hadoop jar \
        $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
        -Dmapred.reduce.tasks=1 \
        -Dmapred.job.queue.name=$QUEUE \
        -input "$INPUT" \
        -output "$OUTPUT" \
        -mapper cat \
        -reducer cat
    

    If you want compression, add:

        -Dmapred.output.compress=true \
        -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
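
    A side note: on Hadoop 2.x and later the mapred.* properties above still work but are deprecated in favour of mapreduce.* names. A sketch of the same job using the newer names, assuming the same streaming jar path; verify the property names against your Hadoop version.

    hadoop jar \
        $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
        -Dmapreduce.job.reduces=1 \
        -Dmapreduce.job.queuename=$QUEUE \
        -Dmapreduce.output.fileoutputformat.compress=true \
        -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
        -input "$INPUT" \
        -output "$OUTPUT" \
        -mapper cat \
        -reducer cat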

  • 2020-12-01 02:49
    hadoop fs -getmerge <dir_of_input_files> <mergedsinglefile>
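
    One caveat this answer does not mention: getmerge writes the merged result to the local filesystem, which is exactly what the question wanted to avoid. If the merged file needs to end up back in HDFS, a sketch with hypothetical paths:

    # merge the HDFS directory into a local temp file, upload it back to HDFS, then clean up
    hadoop fs -getmerge /input/dir /tmp/merged.txt
    hadoop fs -put /tmp/merged.txt /output/merged.txt
    rm /tmp/merged.txt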
    
  • 2020-12-01 02:50

    All the solutions are equivalent to doing a

    hadoop fs -cat [dir]/* > tmp_local_file
    hadoop fs -copyFromLocal tmp_local_file [destination file]
    

    It only means that the local machine's I/O is on the critical path of the data transfer.

  • 2020-12-01 02:52

    If you are working on a Hortonworks cluster and want to merge multiple files from an HDFS location into a single file, you can run the 'hadoop-streaming-2.7.1.2.3.2.0-2950.jar' jar with a single reducer, which writes the merged file to the HDFS output location.

    $ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
        -Dmapred.reduce.tasks=1 \
        -input "/hdfs/input/dir" \
        -output "/hdfs/output/dir" \
        -mapper cat \
        -reducer cat
    

    You can download this jar from the 'Get hadoop streaming jar' page.

    If you are writing Spark jobs and want a merged file, to avoid creating multiple RDDs and the resulting performance bottlenecks, use this piece of code before transforming your RDD:

    sc.textFile("hdfs://...../part*").coalesce(1).saveAsTextFile("hdfs://...../filename")

    This will merge all the part files into one and save it back to an HDFS location.
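
    Note that saveAsTextFile writes a directory of part files; after coalesce(1) it contains a single part-00000. A small shell sketch, with hypothetical paths, for moving that single part file to a plain file name:

    # the "filename" output above is a directory holding one part file plus _SUCCESS
    hdfs dfs -ls /data/merged_output
    hdfs dfs -mv /data/merged_output/part-00000 /data/merged.txt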
