What runs first: the partitioner or the combiner?

星月不相逢

Which runs first, the partitioner or the combiner?

My understanding was that the partitioner runs first, then the combiner, and then the keys are redirecte

8 answers

故里飘歌

2020-12-29 14:00

A combiner does not change the key/value types of the map task's output. It combines records that share a key and emits the same kind of key/value-list pair.

The partitioner takes its input from the map, or from the combiner if one exists, then segments the data, and in the process can emit new key/value-list pairs.

So: Map -> Combine -> Partition -> Reduce.
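
To make "combines records that share a key" concrete, here is a minimal plain-Java sketch of what a word-count combiner does to one batch of map output. It is a stand-in, not the Hadoop Reducer API: the class name CombinerSketch and the method combine are hypothetical, and it says nothing about where in the pipeline the combiner runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for a word-count combiner (hypothetical names, no Hadoop dependency).
class CombinerSketch {
    // Sums the values of records that share a key. Keys are left unchanged
    // and come out in sorted order because a TreeMap is used.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> record : mapOutput) {
            combined.merge(record.getKey(), record.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // The mapper emits (word, 1) for every word it sees.
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String word : "hello hello hello world world hello world".split(" ")) {
            mapOutput.add(Map.entry(word, 1));
        }
        System.out.println(combine(mapOutput)); // prints {hello=4, world=3}
    }
}
```

Seven map output records shrink to two, which is the reduction in data that the combiner is meant to provide.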

伪装坚强ぢ

2020-12-29 14:06

A combiner is a map-side reducer: it can do the same work the reducer does, but on the map side. Its main use is to optimize performance. After the combiner has reduced the map output, the partitioner separates it so that you get multiple outputs. The combiner is optional but highly recommended for large files.

The partitioner divides the data according to the number of reducers and, depending on the requirements, splits the output. For instance, records keyed by male and female can be separated into two outputs with a partitioner.
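
As a rough illustration of "divides the data according to the number of reducers", the sketch below shows two partition functions in plain Java. The hash formula mirrors the one used by Hadoop's default HashPartitioner, but it is applied here to String.hashCode(), which is an assumption for illustration (Hadoop's Text type hashes differently). The class PartitionSketch and the method genderPartition are hypothetical names.

```java
// Plain-Java sketch of partition functions (hypothetical names, no Hadoop dependency).
class PartitionSketch {
    // Hash partitioning: mask the sign bit so the result is never negative,
    // then take the remainder by the number of reducers.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical custom rule for the male/female example: two reducers,
    // one output per value.
    static int genderPartition(String gender) {
        return "male".equals(gender) ? 0 : 1;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"hello", "world", "fox"}) {
            System.out.println(key + " -> reducer " + getPartition(key, 3));
        }
        System.out.println("male -> reducer " + genderPartition("male"));
        System.out.println("female -> reducer " + genderPartition("female"));
    }
}
```

Either way, every record with the same key lands in the same partition, which is what lets a single reducer see all values for that key.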

The combiner comes first and then the partitioner; both run on the map side only, not on the reducer side.

北恋

2020-12-29 14:11

Hadoop: The Definitive Guide, 3rd edition, page 209, says the following:

Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort. Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.

Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record, there could be several spill files. Before the task is finished, the spill files are merged into a single partitioned and sorted output file. The configuration property io.sort.factor controls the maximum number of streams to merge at once; the default is 10.

If there are at least three spill files (set by the min.num.spills.for.combine property), the combiner is run again before the output file is written. Recall that combiners may be run repeatedly over the input without affecting the final result. If there are only one or two spills, the potential reduction in map output size is not worth the overhead in invoking the combiner, so it is not run again for this map output.

So the combiner is also run while the spill files are merged.

So it seems the answer is:

Map -> Partitioner -> Sort -> Combiner -> Spill -> Combiner(if spills>=3) -> Merge.
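
The merge behaviour quoted from the book can be sketched as a toy model of a single partition. This is plain Java under stated assumptions, not Hadoop code: SpillMergeSketch and its methods are hypothetical names, the threshold of 3 is taken from the quoted passage, and the sort and combine of each spill are collapsed into one TreeMap operation for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of spill and merge for one partition (hypothetical names, no Hadoop dependency).
class SpillMergeSketch {
    static final int MIN_SPILLS_FOR_COMBINE = 3; // threshold quoted from the book

    // One spill: records sorted by key, equal keys combined (summed).
    static TreeMap<String, Integer> spill(List<String> words) {
        TreeMap<String, Integer> sortedAndCombined = new TreeMap<>();
        for (String word : words) {
            sortedAndCombined.merge(word, 1, Integer::sum);
        }
        return sortedAndCombined;
    }

    // Merge the spills into one key-sorted output. The combiner runs again
    // only when there are enough spills; otherwise duplicate keys remain.
    static List<Map.Entry<String, Integer>> merge(List<TreeMap<String, Integer>> spills) {
        List<Map.Entry<String, Integer>> merged = new ArrayList<>();
        if (spills.size() >= MIN_SPILLS_FOR_COMBINE) {
            TreeMap<String, Integer> combined = new TreeMap<>();
            for (TreeMap<String, Integer> s : spills) {
                s.forEach((k, v) -> combined.merge(k, v, Integer::sum));
            }
            combined.forEach((k, v) -> merged.add(Map.entry(k, v)));
        } else {
            for (TreeMap<String, Integer> s : spills) {
                s.forEach((k, v) -> merged.add(Map.entry(k, v)));
            }
            merged.sort(Map.Entry.comparingByKey());
        }
        return merged;
    }

    // Reducer: sums whatever records arrive for each key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> r : records) {
            result.merge(r.getKey(), r.getValue(), Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<TreeMap<String, Integer>> spills = new ArrayList<>();
        spills.add(spill(List.of("hello", "world", "hello")));
        spills.add(spill(List.of("hello", "world")));
        System.out.println("2 spills, merged: " + merge(spills));
        spills.add(spill(List.of("world")));
        System.out.println("3 spills, merged: " + merge(spills));
        System.out.println("reduced: " + reduce(merge(spills)));
    }
}
```

With two spills the merged output still holds one record per key per spill; with three it is combined down to one record per key. The reducer's result is the same in both cases, which is the "run repeatedly without affecting the final result" property the book mentions.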

However, the Apache tutorial says:

The Mapper outputs are sorted and then partitioned per Reducer.

This differs from The Definitive Guide. Here the answer seems to be:

Map -> Sort -> Combiner -> Partitioner -> Spill -> Combiner(if spills>=3) -> Merge.

Which one is correct? I lean toward the latter, from the Apache tutorial, but I am not quite sure.

粉色の甜心

2020-12-29 14:15

Partitioner runs before Combiner: MapReduce Comprehensive Diagram.

You can have custom partition logic, and after mapper results are partitioned, the partitions are sorted and Combiner is applied to the sorted partitions.

See Hadoop MapReduce Comprehensive Description.

I checked it by running a word-count program with a custom Combiner and Partitioner that log timestamps:

Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682580 : hello : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682582 : hello : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682583 : hello : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682583 : world : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682584 : world : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682585 : hello : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682585 : world : 1
18/04/23 14:41:22 INFO mapred.LocalJobRunner: 
18/04/23 14:41:22 INFO mapred.MapTask: Starting flush of map output
18/04/23 14:41:22 INFO mapred.MapTask: Spilling map output
18/04/23 14:41:22 INFO mapred.MapTask: bufstart = 0; bufend = 107; bufvoid = 104857600
18/04/23 14:41:22 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214368(104857472); length = 29/6553600
Apr 23, 2018 2:41:22 PM mapreduce.WordCountCombiner reduce
INFO: Combiner: 1524483682614 : hello 
Apr 23, 2018 2:41:22 PM mapreduce.WordCountCombiner reduce
INFO: Combiner: 1524483682615 : world
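
The call order in the log above can be reproduced with a small plain-Java model in which the partition is computed once per record as it is collected, and the combiner only runs when the buffer is flushed. This is a sketch consistent with the log, not the MapTask implementation: OrderTraceSketch, collect and flush are hypothetical names, and a single reducer is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model that records the order of partitioner and combiner calls
// (hypothetical names, no Hadoop dependency).
class OrderTraceSketch {
    final List<String> trace = new ArrayList<>();
    // partition -> key -> buffered values; the TreeMaps stand in for the sort.
    private final Map<Integer, TreeMap<String, List<Integer>>> buffer = new TreeMap<>();
    private final int numReducers;

    OrderTraceSketch(int numReducers) {
        this.numReducers = numReducers;
    }

    int getPartition(String key) {
        trace.add("Partitioner: " + key);
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    int combine(String key, List<Integer> values) {
        trace.add("Combiner: " + key);
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Called once per map output record: the partition is decided here.
    void collect(String key, int value) {
        int partition = getPartition(key);
        buffer.computeIfAbsent(partition, p -> new TreeMap<>())
              .computeIfAbsent(key, k -> new ArrayList<>())
              .add(value);
    }

    // Called when the buffer is spilled: each partition is walked in key
    // order and the combiner is applied once per key.
    Map<String, Integer> flush() {
        Map<String, Integer> spilled = new TreeMap<>();
        for (TreeMap<String, List<Integer>> partition : buffer.values()) {
            for (Map.Entry<String, List<Integer>> e : partition.entrySet()) {
                spilled.put(e.getKey(), combine(e.getKey(), e.getValue()));
            }
        }
        buffer.clear();
        return spilled;
    }

    public static void main(String[] args) {
        OrderTraceSketch task = new OrderTraceSketch(1);
        for (String word : "hello hello hello world world hello world".split(" ")) {
            task.collect(word, 1);
        }
        task.flush();
        task.trace.forEach(System.out::println);
    }
}
```

Run on the same seven words, the trace has seven Partitioner entries followed by one Combiner entry each for hello and world, matching the shape of the log.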


天涯浪人

2020-12-29 14:16

Partition comes first.

According to "Hadoop: The Definitive Guide", the mapper's output is first written to a memory buffer, then spilled to a local directory when the buffer is about to overflow. The spilled data is partitioned according to the Partitioner, and within each partition the result is sorted and then combined if a Combiner is given.

You can verify this by modifying the WordCount MapReduce program. My result, for the input "the quick brown fox jumped over a lazy dog":

Word, Step, Time
fox, Mapper, **********754
fox, Partitioner, **********754
fox, Combiner, **********850
fox, Reducer, **********904

Obviously, Combiner runs after Partitioner.

深忆病人

2020-12-29 14:19

The combiner runs before the partitioner.

The combiner runs after the map to reduce the number of records in the map output, which decreases the network load. The reduce runs after the partitioner.
