Where do combiners combine mapper outputs: in the map phase or the reduce phase of a MapReduce job?

前端未结

I was under the impression that combiners are just like reducers that act on the output of the local map task. That is, a combiner aggregates the results of an individual map task in order to reduce

2 Answers

无人共我

2021-01-13 05:08

The main function of a combiner is optimization. In most cases it acts like a mini-reducer. From page 206 of the same book, in the chapter "How MapReduce Works" (section "The Map Side"):

Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.

And the quote from your question:

If a combiner is specified it will be run during the merge to reduce the amount of data written to disk.

Both quotes indicate that a combiner is run primarily to make the map output more compact. Needing less network bandwidth to transfer that output to the reducers is a benefit of this optimization.

Also, from the same book:

Recall that combiners may be run repeatedly over the input without affecting the final result. If there are only one or two spills, then the potential reduction in map output size is not worth the overhead in invoking the combiner, so it is not run again for this map output.

This means that Hadoop doesn't guarantee how many times a combiner is run (it could even be zero times).
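Because the number of combiner runs is not guaranteed, the combiner function has to produce the same final result whether it is applied zero, one, or several times. The following is a minimal plain-Java simulation of that property for a word-count style sum. It does not use the Hadoop API; the class `CombinerSketch`, its methods, and the sample records are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation only; none of these names are Hadoop APIs.
class CombinerSketch {

    // "Reducer": sums all values that share a key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> r : records) {
            totals.merge(r.getKey(), r.getValue(), Integer::sum);
        }
        return totals;
    }

    // "Combiner": the same summing logic, but it emits key/value records
    // again so that its output can be fed to another combine or reduce step.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> records) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : reduce(records).entrySet()) {
            out.add(Map.entry(e.getKey(), e.getValue()));
        }
        return out;
    }

    // Hypothetical output of a single map task: five (word, 1) records.
    static List<Map.Entry<String, Integer>> mapOutput() {
        return List.of(
            Map.entry("a", 1), Map.entry("b", 1), Map.entry("a", 1),
            Map.entry("c", 1), Map.entry("a", 1));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> raw = mapOutput();

        Map<String, Integer> zeroRuns = reduce(raw);                   // combiner skipped
        Map<String, Integer> oneRun = reduce(combine(raw));            // combiner run once
        Map<String, Integer> twoRuns = reduce(combine(combine(raw)));  // combiner run twice

        System.out.println(zeroRuns);                                           // {a=3, b=1, c=1}
        System.out.println(zeroRuns.equals(oneRun) && oneRun.equals(twoRuns));  // true
        System.out.println(raw.size() + " -> " + combine(raw).size());          // 5 -> 3
    }
}
```

The final result is identical in all three cases; only the number of intermediate records changes (5 before combining, 3 after). A function such as a plain average would not pass this check, which is why it cannot be used directly as a combiner.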

A combiner is never run for map-only jobs. This makes sense because a combiner changes the map output, and since the number of times it is called is not guaranteed, the map output would not be guaranteed to be the same either.

半阙折子戏

2021-01-13 05:23
1. A combiner will not run if the job is map-only.
2. When spill files are merged, the combiner runs only if at least 3 spill files were written to disk (3 is the default value of the mapreduce.map.combine.minspills property).
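The two conditions above can be sketched as a small predicate. This is a plain-Java illustration, not Hadoop code: the class `SpillMergeSketch` and its method are hypothetical, and the sketch assumes the behavior described in the quoted book, where the combiner is invoked during the merge of spill files once there are at least three of them by default (one or two spills are not considered worth the overhead).

```java
// Plain-Java sketch; the class and method names are hypothetical, not Hadoop APIs.
class SpillMergeSketch {

    // Assumed default threshold: the combiner is used during the merge
    // once at least this many spill files exist.
    static final int DEFAULT_MIN_SPILLS_FOR_COMBINE = 3;

    // Decides whether the combiner would be invoked while spill files are merged.
    static boolean combinerRunsDuringMerge(boolean combinerConfigured,
                                           int numReduceTasks,
                                           int numSpills,
                                           int minSpillsForCombine) {
        if (!combinerConfigured) {
            return false; // no combiner was set on the job
        }
        if (numReduceTasks == 0) {
            return false; // map-only job: map output is final, so it is never combined
        }
        return numSpills >= minSpillsForCombine;
    }

    public static void main(String[] args) {
        int min = DEFAULT_MIN_SPILLS_FOR_COMBINE;
        System.out.println(combinerRunsDuringMerge(true, 1, 2, min));  // false
        System.out.println(combinerRunsDuringMerge(true, 1, 3, min));  // true
        System.out.println(combinerRunsDuringMerge(true, 0, 10, min)); // false
    }
}
```

With the assumed default, two spills skip the combiner, three spills trigger it, and a map-only job never runs it regardless of how much data the map task produces.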