Processing a large set of small files with Hadoop

失恋的感觉 2021-01-01 00:17

I am using the Hadoop example program WordCount to process a large set of small files/web pages (approx. 2-3 kB). Since this is far from the optimal file size for Hadoop files, the

5 Answers
  • 2021-01-01 00:36

    From my still limited understanding of Hadoop, I believe the right solution would be to create SequenceFile(s) containing your HTML files as values and possibly the URLs as the keys. If you do a M/R job over the SequenceFile(s), each mapper will process many files (depending on the split size). Each file will be presented to the map function as a single input. You may want to use SequenceFileAsTextInputFormat as the InputFormat to read these files.

    Also see: Providing several non-textual files to a single map in Hadoop MapReduce
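
    Here is a minimal sketch of the packing step, assuming the pages sit in a local directory and using the file name as a stand-in for the URL key (the class name, paths and layout are illustrative, not from the original question):

        import java.io.File;
        import java.nio.file.Files;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        // Hypothetical helper: packs a local directory of small HTML files into one
        // SequenceFile, key = file name (stand-in for the URL), value = page contents.
        public class PackPagesIntoSequenceFile {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                Path out = new Path(args[1]);   // e.g. a pages.seq file on HDFS
                SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                        SequenceFile.Writer.file(out),
                        SequenceFile.Writer.keyClass(Text.class),
                        SequenceFile.Writer.valueClass(Text.class));
                try {
                    for (File page : new File(args[0]).listFiles()) {   // local input dir
                        String html = new String(Files.readAllBytes(page.toPath()), "UTF-8");
                        writer.append(new Text(page.getName()), new Text(html));
                    }
                } finally {
                    IOUtils.closeStream(writer);
                }
            }
        }

    The WordCount job would then read the packed file with job.setInputFormatClass(SequenceFileAsTextInputFormat.class), so each map() call receives one whole page as its value.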

  • 2021-01-01 00:39

    Can you concatenate files before submitting them to Hadoop?

  • Using HDFS won't change the fact that you are making Hadoop handle a large quantity of small files. The best option in this case is probably to cat the files into a single file (or a few large files). This will reduce the number of mappers you have, which will reduce the number of things that need to be processed (see the sketch at the end of this answer).

    Using HDFS can improve performance if you are operating on a distributed system. If you are only running pseudo-distributed (one machine), then HDFS isn't going to improve performance; the limitation is the machine.

    When you operate on a large number of small files, you need a large number of mappers and reducers. The task setup/teardown can be comparable to the processing time of the file itself, causing large overhead. Concatenating the files should reduce the number of mappers Hadoop runs for the job, which should improve performance.

    The benefit you could see from using HDFS to store the files would be in distributed mode, with multiple machines. The files would be stored in blocks (64 MB by default) across machines, and each machine would be capable of processing a block of data that resides on it. This reduces network bandwidth use so it doesn't become a bottleneck in processing.

    Archiving the files won't help if Hadoop is just going to unarchive them: you still end up with a large number of small files.

    Hope this helps your understanding.
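
    As a rough illustration of the "cat into one large file" idea, here is a sketch that folds a local directory of pages into a single line-oriented file on HDFS, one page per line so the result still splits cleanly for WordCount (the class name, paths and the tab-separated layout are my own assumptions, not part of the original answer):

        import java.io.File;
        import java.io.IOException;
        import java.nio.file.Files;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // Hypothetical helper: concatenates many small local pages into one HDFS file,
        // writing "fileName<TAB>contents" per line so line-based splits stay valid.
        public class CatPagesToHdfs {
            public static void main(String[] args) throws IOException {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);
                try (FSDataOutputStream out = fs.create(new Path(args[1]))) {  // HDFS output file
                    for (File page : new File(args[0]).listFiles()) {          // local input dir
                        String text = new String(Files.readAllBytes(page.toPath()), "UTF-8")
                                .replaceAll("\\s+", " ");   // keep each page on one line
                        out.write((page.getName() + "\t" + text + "\n").getBytes("UTF-8"));
                    }
                }
            }
        }

    The resulting single file is then split on HDFS block boundaries, so a handful of mappers replaces thousands of per-file ones.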

  • 2021-01-01 00:50

    CombineFileInputFormat can be used in this case; it works well for a large number of small files. It packs many such files into a single split, so each mapper has more to process (1 split = 1 map task). The overall MapReduce processing time will also fall, since fewer mappers are running. Since there is no archive-aware InputFormat, using CombineFileInputFormat will improve performance. A driver sketch is below.
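
    Here is what such a driver could look like, assuming the concrete CombineTextInputFormat subclass that ships with newer Hadoop releases, the stock org.apache.hadoop.examples.WordCount mapper/reducer on the classpath, and a 64 MB cap chosen purely for illustration:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.examples.WordCount;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        // WordCount driver that packs many small input files into each split.
        public class CombinedWordCount {
            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "combined word count");
                job.setJarByClass(CombinedWordCount.class);

                // Reuse the mapper/combiner/reducer from the stock WordCount example.
                job.setMapperClass(WordCount.TokenizerMapper.class);
                job.setCombinerClass(WordCount.IntSumReducer.class);
                job.setReducerClass(WordCount.IntSumReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);

                // Many small files go into one split, capped here at 64 MB per split.
                job.setInputFormatClass(CombineTextInputFormat.class);
                CombineTextInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }

    With CombineTextInputFormat one map task processes a whole bundle of small files instead of exactly one, which is where the reduction in task setup overhead comes from.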

  • 2021-01-01 00:53

    I bookmarked this article recently to read later and found the same question here :) The entry is a bit old, so I'm not exactly sure how relevant it still is; changes to Hadoop are happening at a very rapid pace.

    http://www.cloudera.com/blog/2009/02/the-small-files-problem/

    The blog entry is by Tom White, who is also the author of "Hadoop: The Definitive Guide, Second Edition", a recommended read for those who are getting started with Hadoop.

    http://oreilly.com/catalog/0636920010388
