Question
What factors determine the number of mappers and reducers to use for a given set of data in order to achieve optimal performance? I am asking about the Apache Hadoop MapReduce platform.
Answer 1:
According to the Cloudera blog:
Have you set the optimal number of mappers and reducers?
The number of mappers is by default set to one per HDFS block. This is usually a good default, but see tip 2.
The number of reducers is best set to be the number of reduce slots in the cluster (minus a few to allow for failures). This allows the reducers to complete in a single wave.
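This advice maps directly onto the Job API. A minimal driver sketch (not from the Cloudera post; the slot count and class names are hypothetical placeholders for your own job):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ReducerWaveDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "reducer-wave-example");
            job.setJarByClass(ReducerWaveDriver.class);
            // job.setMapperClass(...); job.setReducerClass(...);  // your own classes

            // Hypothetical figure: a cluster with 20 reduce slots. Leaving a couple
            // of slots free lets a failed reducer be retried without pushing the
            // reduce phase into a second wave.
            int clusterReduceSlots = 20;
            job.setNumReduceTasks(clusterReduceSlots - 2);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }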
Answer 2:
Mainly, the number of mappers depends on the number of InputSplits generated by the InputFormat#getSplits method. In particular, FileInputFormat splits the files in the input directory along block boundaries. Gzipped files are not splittable, so a whole gzipped file is passed to a single mapper.
Two files:
  f1 [block1, block2]
  f2 [block3, block4]
become 4 mappers:
  f1 (offset of block1)
  f1 (offset of block2)
  f2 (offset of block3)
  f2 (offset of block4)
Other InputFormats have their own methods for splitting input (for example, HBase splits input on region boundaries).
The number of mappers can't be directly controlled, except by using CombineFileInputFormat, which packs multiple blocks or small files into a single split. In any case, most mappers should run on a host where their data resides (data locality).
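A minimal sketch of the CombineFileInputFormat route, assuming its text-oriented subclass CombineTextInputFormat and assuming the maximum combined split size is honored via the standard split.maxsize setting; the 256 MB cap is an arbitrary choice:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

    public class CombineSmallFiles {
        // Pack many small files/blocks into fewer, larger splits so the job
        // launches far fewer mappers than one per file or block.
        static void useCombinedSplits(Job job) {
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Each combined split holds at most ~256 MB of input data.
            CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        }
    }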
The number of reducers is, in most cases, specified by the user. It mostly depends on the amount of work that needs to be done in the reducers. The number should not be too large, though, because of the partitioning algorithm mappers use to distribute data among reducers. Some frameworks, such as Hive, can calculate the number of reducers from an empirical rule of about 1 GB of output per reducer.
General rule of thumb: aim for about 1 GB per reducer, but do not run more reducers than roughly 0.8-1.2 times your cluster's reduce capacity.
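As a sketch only, this rule of thumb can be written as a small helper; the shuffle-size estimate, the 1 GB target, and the 0.95 capacity factor are all assumptions to tune for your own cluster:

    public class ReducerCount {
        // Aim for ~1 GB of reducer input per reducer, but never schedule more
        // reducers than roughly one wave of the cluster's reduce slots.
        static int suggestReducers(long estimatedShuffleBytes, int clusterReduceSlots) {
            long bytesPerReducer = 1L * 1024 * 1024 * 1024;       // ~1 GB, as in the Hive heuristic
            long byData = (estimatedShuffleBytes + bytesPerReducer - 1) / bytesPerReducer;
            long bySlots = Math.round(clusterReduceSlots * 0.95); // stay within one wave
            return (int) Math.max(1, Math.min(byData, bySlots));
        }
    }

For example, roughly 50 GB of estimated map output on a cluster with 20 reduce slots gives min(50, 19) = 19 reducers.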
Source: https://stackoverflow.com/questions/12932044/what-determines-the-number-of-mappers-reducers-to-use-given-a-specified-set-of-d