Hadoop MapReduce: Clarification on number of reducers

Posted by ⅰ亾dé卋堺 on 2019-12-02 22:22:34
Judge Mental

"one reducer is used for each key generated by the mapper"

This comment is not correct. The reduce() method is called once for each group of keys, as determined by the grouping comparator. A reducer (task) is a process that handles zero or more calls to reduce(). The property you refer to sets the number of reducer tasks.
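To make the distinction concrete, here is a minimal Python sketch of the reduce phase. It is a simulation, not Hadoop code: partition(), run_reduce_phase() and reduce_fn() are hypothetical names, and zlib.crc32 is only a stand-in for the hash that a hash-based partitioner would apply to the key. It shows reduce being called once per key, while each reducer task handles however many keys land in its partition, possibly none.

```python
from collections import defaultdict
import zlib


def partition(key, num_reduce_tasks):
    # Stand-in for a hash partitioner: stable hash of the key, modulo the task count.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def reduce_fn(key, values):
    # Hypothetical reduce(): sums the values of one key.
    return sum(values)


def run_reduce_phase(mapper_output, num_reduce_tasks):
    # Each reducer task owns one partition; values are grouped by key inside it.
    tasks = [defaultdict(list) for _ in range(num_reduce_tasks)]
    for key, value in mapper_output:
        tasks[partition(key, num_reduce_tasks)][key].append(value)

    results = {}
    reduce_calls = []  # how many reduce() calls each task handled
    for grouped in tasks:
        calls = 0
        for key in sorted(grouped):
            results[key] = reduce_fn(key, grouped[key])  # one call per key
            calls += 1
        reduce_calls.append(calls)
    return results, reduce_calls


mapper_output = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1), ("a", 1)]
results, reduce_calls = run_reduce_phase(mapper_output, num_reduce_tasks=5)
print(results)       # per-key sums
print(reduce_calls)  # reduce() calls per task; some tasks get zero
```

With 3 distinct keys and 5 reducer tasks, there are exactly 3 calls to reduce_fn in total, so at least two of the tasks make no calls at all.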

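To simplify @Judge Mental's (very accurate) answer a little bit: A reducer task can work through many keys, one after another, and the mapred.reduce.tasks=# parameter (named mapreduce.job.reduces in Hadoop 2 and later) declares how many reducer tasks a specific job will have. How many of those tasks actually run at the same time depends on the reduce capacity available in the cluster.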

An example with mapred.reduce.tasks=10:
You have 2,000 keys, each key with 5 values (10,000 k:v pairs in total, evenly distributed). Each reducer should be handling roughly 200 keys (1,000 k:v pairs).

An example with mapred.reduce.tasks=20:
You have 2,000 keys, each key with 5 values (10,000 k:v pairs in total, evenly distributed). Each reducer should be handling roughly 100 keys (500 k:v pairs).

In the examples above, the fewer keys each reducer has to work with, the faster the overall job will be, so long as the cluster has enough reducer resources available to run the extra tasks, of course.
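The arithmetic behind those two examples can be checked with a short Python sketch. It assumes a perfectly even spread of 2,000 keys and 10,000 k:v pairs across reducer tasks. The function names and the available_slots figure are made up for illustration, and the "waves" calculation is only a rough model of the caveat about cluster resources: if fewer tasks can run concurrently than the job asks for, the tasks run in several rounds.

```python
import math


def per_reducer_load(total_keys, total_pairs, num_reduce_tasks):
    # Assumes keys and pairs are spread perfectly evenly across reducer tasks.
    return total_keys / num_reduce_tasks, total_pairs / num_reduce_tasks


def reduce_waves(num_reduce_tasks, available_slots):
    # Rough model: tasks that cannot all run at once are executed in rounds.
    return math.ceil(num_reduce_tasks / available_slots)


available_slots = 10  # hypothetical concurrent reduce capacity of the cluster

for n in (10, 20):
    keys, pairs = per_reducer_load(2_000, 10_000, n)
    waves = reduce_waves(n, available_slots)
    print(f"{n} reducer tasks: ~{keys:.0f} keys, ~{pairs:.0f} k:v pairs each, {waves} wave(s)")
```

Under these assumptions, doubling the task count halves the load per task, but with only 10 slots the 20 tasks need two rounds, so the gain mostly disappears.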
