What determines the number of mappers/reducers to use given a specified set of data [closed]

情到浓时终转凉″ submitted on 2021-02-07 10:35:30

Question


What factors decide the number of mappers and reducers to use for a given set of data in order to achieve optimal performance? I am talking in terms of the Apache Hadoop MapReduce platform.


Answer 1:


According to the Cloudera blog

Have you set the optimal number of mappers and reducers?
The number of mappers is by default set to one per HDFS block. This is usually a good default, but see tip 2.
The number of reducers is best set to be the number of reduce slots in the cluster (minus a few to allow for failures). This allows the reducers to complete in a single wave.
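
For reference, here is a minimal driver sketch using the org.apache.hadoop.mapreduce API; the class name, paths, and reducer count are illustrative placeholders rather than values from the blog post. It shows that the reducer count is the knob the user sets directly, while the mapper count comes from the input splits.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // The mapper count is derived from the input splits (one mapper per split),
        // so it is not set here. The reducer count is under user control: pick it so
        // all reducers finish in a single wave on the cluster's reduce capacity.
        job.setNumReduceTasks(20); // illustrative value

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}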




Answer 2:


Mainly, the number of mappers depends on the number of InputSplits generated by the InputFormat#getSplits method. In particular, FileInputFormat splits the input directory with respect to blocks and files. Gzipped files are not splittable, so each whole gzip file is passed to a single mapper.

Two files:

f1 [block1, block2]
f2 [block3, block4]

become 4 mappers:

f1 (offset of block1)
f1 (offset of block2)
f2 (offset of block3)
f2 (offset of block4)
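
To make the arithmetic explicit, here is a small illustrative sketch; the file sizes and block size are made-up numbers chosen to match the two-file example above. It shows that each splittable file contributes roughly ceil(fileSize / blockSize) mappers.

public class MapperCountEstimate {
    // With default FileInputFormat behaviour, a splittable file contributes
    // roughly ceil(fileSize / blockSize) input splits, one mapper per split.
    static long splitsFor(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // 128 MB HDFS block size (a common default)
        long f1 = 200L * 1024 * 1024;        // f1 ~ 200 MB -> 2 blocks -> 2 splits
        long f2 = 250L * 1024 * 1024;        // f2 ~ 250 MB -> 2 blocks -> 2 splits
        long mappers = splitsFor(f1, blockSize) + splitsFor(f2, blockSize);
        System.out.println("Expected mappers: " + mappers); // prints 4
    }
}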

Other InputFormats have their own methods of splitting files (for example, HBase splits input on region boundaries).

The number of mappers can't be controlled directly, except by using CombineFileInputFormat, which packs multiple files or blocks into a single split. In any case, most mappers should be executed on the host where their data resides.
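
A sketch of that approach using CombineTextInputFormat (the text-based subclass of CombineFileInputFormat), assuming the rest of the job is configured elsewhere; the 256 MB cap is an arbitrary illustrative value, not a recommendation.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class CombineInputConfig {
    // Pack many small files/blocks into fewer, larger splits so fewer mappers run.
    static void configure(Job job) {
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Upper bound on the size of a combined split; 256 MB is an example value.
        CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    }
}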

The number of reducers is in most cases specified by the user. It mostly depends on the amount of work that needs to be done in the reducers, but their number should not be very large, because of the partitioning algorithm the mappers use to distribute data among reducers. Some frameworks, like Hive, can calculate the number of reducers using an empirical rule of roughly 1 GB of data per reducer.

General rule of thumb: aim for about 1 GB per reducer, but no more than 0.8-1.2 times your cluster's reduce capacity.
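
A back-of-the-envelope sketch of that rule of thumb; all figures below (reduce-side data volume, slot count, capacity factor) are hypothetical.

public class ReducerCountEstimate {
    public static void main(String[] args) {
        long reduceInputBytes = 180L * 1024 * 1024 * 1024; // ~180 GB reaching the reduce phase (made up)
        long bytesPerReducer  = 1L * 1024 * 1024 * 1024;   // ~1 GB per reducer rule of thumb
        int  reduceSlots      = 100;                        // cluster reduce capacity (made up)

        long byData     = (reduceInputBytes + bytesPerReducer - 1) / bytesPerReducer; // 180
        long byCapacity = Math.round(reduceSlots * 0.95);   // stay just under one wave, leaving room for failures
        long reducers   = Math.min(byData, byCapacity);
        System.out.println("Suggested reducers: " + reducers); // 95 in this example
    }
}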



Source: https://stackoverflow.com/questions/12932044/what-determines-the-number-of-mappers-reducers-to-use-given-a-specified-set-of-d
