Hadoop: difference between 0 reducers and the identity reducer?

故里飘歌

I am just trying to confirm my understanding of the difference between 0 reducers and the identity reducer.

0 reducer means reduce step will be skipped and mapper outp

4 Answers

执念已碎

2020-12-01 06:04

Your understanding is correct. I would define it as follows: if you do not need the map results sorted, set the number of reducers to 0, and the job is called map-only.
If you need the map results sorted but do not need any aggregation, choose the identity reducer.
And to complete the picture there is a third case: if you do need aggregation, you need a real reducer.
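
To make the first two cases concrete, here is a minimal sketch in plain Java. It is only a simulation of the behaviour described above, not Hadoop code: the class `ReducerModes` and its methods are hypothetical names, and a single reducer is assumed. In a real job the map-only case is usually selected with `Job.setNumReduceTasks(0)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical simulation class; none of this is Hadoop API.
class ReducerModes {
    // Map-only job (0 reducers): records stay in the order the mapper emitted them.
    static List<Map.Entry<String, Integer>> mapOnly(List<Map.Entry<String, Integer>> mapOutput) {
        return new ArrayList<>(mapOutput);
    }

    // Identity reducer (one reducer assumed): the framework sorts by key,
    // then the reducer passes every record through unchanged.
    static List<Map.Entry<String, Integer>> identityReduce(List<Map.Entry<String, Integer>> mapOutput) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(mapOutput);
        sorted.sort(Map.Entry.comparingByKey());
        return sorted;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("pear", 1), Map.entry("apple", 1), Map.entry("fig", 1));
        System.out.println(mapOnly(mapOutput));        // [pear=1, apple=1, fig=1]
        System.out.println(identityReduce(mapOutput)); // [apple=1, fig=1, pear=1]
    }
}
```

The records are identical in both cases; only the ordering guarantee differs.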

感动是毒

2020-12-01 06:05

The main difference between "no reducer" (mapred.reduce.tasks=0) and the "standard reducer", which is IdentityReducer (mapred.reduce.tasks=1 and so on), is that with no reducer there is no partitioning or shuffling after the map stage. In that case you get the 'pure' output of your mappers without any further processing. This helps with development and debugging, but that is not its only use.
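
As a rough illustration of the partitioning step that is skipped when there are no reducers, the sketch below uses the same arithmetic as Hadoop's default HashPartitioner (clear the sign bit of the hash, then take the remainder). It is plain Java, not Hadoop code: `PartitionSketch` is a hypothetical name, and `String.hashCode()` stands in for the hash of a Hadoop key type, so the actual bucket numbers in a real job would differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation class; none of this is Hadoop API.
class PartitionSketch {
    // Maps a key to a reduce task index in the range [0, numReduceTasks).
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Groups keys into one bucket per reduce task, as the shuffle would.
    static List<List<String>> shuffle(List<String> keys, int numReduceTasks) {
        if (numReduceTasks < 1) {
            // With 0 reducers this step does not exist at all.
            throw new IllegalArgumentException("partitioning needs at least one reduce task");
        }
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(partitionFor(key, numReduceTasks)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Equal keys always land in the same bucket.
        System.out.println(shuffle(List.of("a", "b", "c", "a"), 2));
    }
}
```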

陌清茗

2020-12-01 06:13

Another use case for the identity reducer is to combine all the results into <# of reducers> output files. This can be handy if you are using Amazon Web Services and writing to S3 directly, especially if the mapper output is small (e.g. a grep/search for a record) and you have a lot of mappers (e.g. thousands).
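
The effect on the number of output files can be sketched like this. It is a plain-Java illustration, not Hadoop code: `OutputFiles` is a hypothetical name, and the `part-m-`/`part-r-` file names assume the naming used by the newer `mapreduce` API (the older `mapred` API names its files differently).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation class; none of this is Hadoop API.
class OutputFiles {
    // Lists the part files a job would leave in its output directory:
    // one per map task for a map-only job, otherwise one per reduce task.
    static List<String> partFiles(int numMapTasks, int numReduceTasks) {
        boolean mapOnly = numReduceTasks == 0;
        int count = mapOnly ? numMapTasks : numReduceTasks;
        String prefix = mapOnly ? "part-m-" : "part-r-";
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            names.add(String.format("%s%05d", prefix, i));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(partFiles(1000, 0).size()); // 1000
        System.out.println(partFiles(1000, 2));        // [part-r-00000, part-r-00001]
    }
}
```

With 1000 mappers and small outputs, two identity reducers turn 1000 tiny files into 2.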

旧巷少年郎

2020-12-01 06:19

It depends on your business requirements. If you are doing a word count, you need to reduce the map output to get the totals. If you just want to change the words to upper case, you don't need a reducer.
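
A small plain-Java sketch of why the two tasks differ; the class and method names are hypothetical and this is not Hadoop code. Upper-casing works on each record independently, while counting has to bring all occurrences of a word together, which is what the shuffle and reducer provide.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical simulation class; none of this is Hadoop API.
class UpperCaseVsWordCount {
    // Map-only style: each record is transformed on its own, no grouping needed.
    static List<String> upperCase(List<String> words) {
        return words.stream()
                .map(w -> w.toUpperCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    // Reduce style: occurrences of the same word are grouped and summed.
    static Map<String, Integer> wordCount(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = List.of("a", "b", "a");
        System.out.println(upperCase(words)); // [A, B, A]
        System.out.println(wordCount(words)); // {a=2, b=1}
    }
}
```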
