I was wondering: between the partitioner and the combiner, which runs first?
My understanding was that the partitioner runs first, then the combiner, and then the keys are redirected
The combiner does not change the key/value types of the map task's output. It groups the values that share a key and emits the same kind of key/value-list pair.
The partitioner takes its input from the map (or from the combiner, if one exists), segments the data, and in the process can emit a new key/value-list pair.
So: Map --> Combine --> Partition --> Reduce.
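To make the "combines based on the same key" part concrete, here is a minimal sketch in plain Python. It is not Hadoop code: the `combine` function and the sample records are assumptions made up for illustration, and it only shows that the output keeps the same key and value types as the input.

```python
from itertools import groupby
from operator import itemgetter


def combine(sorted_records):
    """Sum the counts for each key in an already-sorted list of (key, count) pairs."""
    for key, group in groupby(sorted_records, key=itemgetter(0)):
        yield key, sum(count for _, count in group)


# Hypothetical output of one word-count map task.
map_output = [("hello", 1), ("world", 1), ("hello", 1), ("hello", 1)]

# groupby only merges adjacent equal keys, so the records are sorted first.
combined = list(combine(sorted(map_output)))
print(combined)
```

The records going in and coming out are both (word, count) pairs; only their number changes.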
A combiner is a map-side reducer: it performs the same work as the reducer, but on the map side. Its main use is to optimize performance. After the combiner has reduced the map output, the partitioner splits it up and helps produce multiple outputs. The combiner is optional, but highly recommended for large files.
The partitioner divides the data according to the number of reducers and, depending on the requirements, splits the output. For instance, if the keys are male and female, a partitioner can separate them into two outputs.
The combiner comes first and then the partitioner. Both run on the map side only, not on the reducer side.
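The male/female split can be sketched as follows. This is plain Python rather than the Hadoop `Partitioner` API; `hash_partition`, `gender_partition` and the sample records are hypothetical names chosen for illustration. Hadoop's default `HashPartitioner` takes the key's hash code modulo the number of reduce tasks; `zlib.crc32` stands in here only because it is stable across runs.

```python
import zlib


def hash_partition(key: str, num_reducers: int) -> int:
    """Default-style rule: a stable hash of the key modulo the number of reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def gender_partition(key: str, num_reducers: int) -> int:
    """Hypothetical custom rule: 'male' goes to reducer 0, 'female' to reducer 1."""
    fixed = {"male": 0, "female": 1}
    if key in fixed and num_reducers >= 2:
        return fixed[key]
    # Any other key falls back to the hash rule.
    return hash_partition(key, num_reducers)


records = [("male", "Bob"), ("female", "Alice"), ("male", "Carl")]
partitions = {0: [], 1: []}
for key, value in records:
    partitions[gender_partition(key, 2)].append((key, value))

print(partitions)
```

Each partition number corresponds to one reducer, so with two reducers the job ends up with two separate outputs.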
In Hadoop: The Definitive Guide, 3rd edition, page 209, we have the following:
Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort. Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.
Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record, there could be several spill files. Before the task is finished, the spill files are merged into a single partitioned and sorted output file. The configuration property io.sort.factor controls the maximum number of streams to merge at once; the default is 10.
If there are at least three spill files (set by the min.num.spills.for.combine property), the combiner is run again before the output file is written. Recall that combiners may be run repeatedly over the input without affecting the final result. If there are only one or two spills, the potential reduction in map output size is not worth the overhead in invoking the combiner, so it is not run again for this map output.
So the combiner is also run while the spill files are merged.
So it seems the answer is:
Map -> Partitioner -> Sort -> Combiner -> Spill -> Combiner(if spills>=3) -> Merge.
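The sequence described in the book can be simulated in a few lines of plain Python. This is only a sketch under stated assumptions, not Hadoop's implementation: the buffer here spills after a fixed number of records (the real one is sized in bytes), the `partition` rule is a toy, and all function names are made up for illustration.

```python
from collections import defaultdict
from heapq import merge
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2
SPILL_THRESHOLD = 4         # hypothetical: records per spill
MIN_SPILLS_FOR_COMBINE = 3  # mirrors min.num.spills.for.combine


def partition(key):
    """Toy partitioner: key length modulo the number of reducers."""
    return len(key) % NUM_REDUCERS


def combine(sorted_records):
    """Sum the values of adjacent records that share a key."""
    return [(k, sum(v for _, v in grp))
            for k, grp in groupby(sorted_records, key=itemgetter(0))]


def spill(buffer):
    """Partition the buffer, then sort and combine inside each partition."""
    parts = defaultdict(list)
    for key, value in buffer:
        parts[partition(key)].append((key, value))
    return {p: combine(sorted(recs)) for p, recs in parts.items()}


def run_map_side(records):
    spills, buffer = [], []
    for rec in records:
        buffer.append(rec)
        if len(buffer) >= SPILL_THRESHOLD:
            spills.append(spill(buffer))
            buffer = []
    if buffer:  # final flush when the map task ends
        spills.append(spill(buffer))
    output = {}
    for p in range(NUM_REDUCERS):
        # Merge the sorted runs of this partition from every spill.
        merged = list(merge(*(s.get(p, []) for s in spills)))
        if len(spills) >= MIN_SPILLS_FOR_COMBINE:
            merged = combine(merged)  # combiner runs again during the merge
        output[p] = merged
    return len(spills), output


words = "hello hadoop hello hello hadoop hadoop hello hadoop hello".split()
print(run_map_side([(w, 1) for w in words]))
```

With nine records the buffer spills three times, so the combiner runs once per spill and once more during the merge. With only two spills the merged output keeps one record per key per spill, and the reducer finishes the aggregation.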
However, the Apache tutorial says:
The Mapper outputs are sorted and then partitioned per Reducer.
This differs from The Definitive Guide. The answer here seems to be:
Map -> Sort -> Combiner -> Partitioner -> Spill -> Combiner(if spills>=3) -> Merge.
Which one is correct? I lean toward the latter, from the Apache tutorial, but I am not quite sure.
Partitioner runs before Combiner: MapReduce Comprehensive Diagram.
You can have custom partition logic, and after mapper results are partitioned, the partitions are sorted and Combiner is applied to the sorted partitions.
See Hadoop MapReduce Comprehensive Description.
I checked it by running a word-count program with a custom Combiner and Partitioner, both logging timestamps:
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682580 : hello : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682582 : hello : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682583 : hello : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682583 : world : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682584 : world : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682585 : hello : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682585 : world : 1
18/04/23 14:41:22 INFO mapred.LocalJobRunner:
18/04/23 14:41:22 INFO mapred.MapTask: Starting flush of map output
18/04/23 14:41:22 INFO mapred.MapTask: Spilling map output
18/04/23 14:41:22 INFO mapred.MapTask: bufstart = 0; bufend = 107; bufvoid = 104857600
18/04/23 14:41:22 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214368(104857472); length = 29/6553600
Apr 23, 2018 2:41:22 PM mapreduce.WordCountCombiner reduce
INFO: Combiner: 1524483682614 : hello
Apr 23, 2018 2:41:22 PM mapreduce.WordCountCombiner reduce
INFO: Combiner: 1524483682615 : world
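The call order in this log (seven getPartition calls, then the flush, then one combiner call per key) can be replayed with a small plain-Python trace. This is a sketch, not the program that produced the log: `get_partition`, `combiner_reduce` and the single-reducer setup are assumptions for illustration.

```python
events = []


def get_partition(key, value, num_reducers=1):
    """Called once per map output record, as the record enters the buffer."""
    events.append(("Partitioner", key))
    return 0  # assumption: a single reducer, so every record lands in partition 0


def combiner_reduce(key, values):
    """Called once per distinct key when the buffer is flushed."""
    events.append(("Combiner", key))
    return key, sum(values)


# Words in the order they appear in the log above.
words = ["hello", "hello", "hello", "world", "world", "hello", "world"]

buffer = []
for w in words:
    buffer.append((get_partition(w, 1), w, 1))

# Flush: sort by (partition, key), group by key, then combine.
buffer.sort()
grouped = {}
for _, key, value in buffer:
    grouped.setdefault(key, []).append(value)
result = [combiner_reduce(k, vs) for k, vs in grouped.items()]

for step, key in events:
    print(step, key)
```

The trace shows every Partitioner event before the first Combiner event, which matches the timestamps in the log.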
The partitioner comes first.
According to "Hadoop: The Definitive Guide", the output of the Mapper is first written to a memory buffer, then spilled to a local directory when the buffer is about to overflow. The spilled data is partitioned according to the Partitioner, and within each partition the result is sorted, then combined if a Combiner is given.
You can simply modify the word-count MR program to verify it. My result, for the input "the quick brown fox jumped over a lazy dog", is:
Word, Step, Time
fox, Mapper, **********754
fox, Partitioner, **********754
fox, Combiner, **********850
fox, Reducer, **********904
Obviously, the Combiner runs after the Partitioner.
The combiner runs before the partitioner.
The combiner runs after the map to reduce the number of records in the map output, which decreases the network load. Reduce runs after the partitioner.