Question
A Combiner runs after the Mapper and before the Reducer; it receives as input all the data emitted by the Mapper instances on a given node and then emits its output to the Reducers.
I have also read that if a reduce function is both commutative and associative, then it can be used as a Combiner.
My question is: what does the phrase "commutative and associative" mean in this situation?
Answer 1:
Assume you have a list of numbers, 1 2 3 4 5 6.
Associative here means you can apply your operation to arbitrary subgroups, then apply it again to those partial results, and get the same answer:
(1) + (2 + 3) + (4 + 5 + 6)
==
(1 + 2) + (3 + 4) + (5) + (6)
==
...
Think of each set of parentheses here as one execution of a combiner.
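To make that concrete, here is a tiny self-contained sketch (my own illustration, not part of the answer) showing that summing the subgroups first and then summing the partial results gives the same total:

```java
// Associativity: each call to sum(...) over a subgroup plays the role of one
// combiner invocation; summing the partial sums gives the same overall total.
public class AssociativityDemo {
    static int sum(int... xs) {
        int total = 0;
        for (int x : xs) total += x;
        return total;
    }

    public static void main(String[] args) {
        int grouping1 = sum(sum(1), sum(2, 3), sum(4, 5, 6));      // (1) + (2 + 3) + (4 + 5 + 6)
        int grouping2 = sum(sum(1, 2), sum(3, 4), sum(5), sum(6)); // (1 + 2) + (3 + 4) + (5) + (6)
        System.out.println(grouping1 + " == " + grouping2);        // prints "21 == 21"
    }
}
```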
Commutative means that the order doesn't matter, so:
1 + 2 + 3 + 4 + 5 + 6
==
2 + 4 + 6 + 1 + 3 + 5
==
...
For example, addition fits both properties, as seen above. "Maximum" fits them as well, because the max of maxes is the max, and max(a, b) == max(b, a).
Median is an example that doesn't work: the median of medians is not the true median.
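A quick numeric illustration (my own example, not from the answer): split a data set across two hypothetical nodes and compare the max-of-maxes and the median-of-medians against the values computed over the full set.

```java
import java.util.Arrays;

// Max of per-node maxima equals the true max, but the median of per-node
// medians generally differs from the true median.
public class MaxVsMedianDemo {
    static double median(double... xs) {
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        double[] nodeA = {1, 2, 3};          // data seen by one mapper/combiner
        double[] nodeB = {4, 5, 6, 7, 8, 9}; // data seen by another

        // Max survives pre-aggregation: max(max(A), max(B)) == max over everything.
        double maxOfMaxes = Math.max(Arrays.stream(nodeA).max().getAsDouble(),
                                     Arrays.stream(nodeB).max().getAsDouble());  // 9.0

        // Median does not: median(median(A), median(B)) != median over everything.
        double trueMedian = median(1, 2, 3, 4, 5, 6, 7, 8, 9);         // 5.0
        double medianOfMedians = median(median(nodeA), median(nodeB)); // median(2.0, 6.5) = 4.25

        System.out.println(maxOfMaxes + ", " + trueMedian + " vs " + medianOfMedians);
    }
}
```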
Don't forget another important property of a combiner: its input key/value types and its output key/value types need to be the same. For example, you can't take in a string:int pair and return a string:float pair.
Oftentimes the reducer outputs some sort of string instead of a numerical value, which may prevent you from simply plugging in your reducer as the combiner.
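As an illustration of that type constraint, here is a minimal sketch of the standard word-count style sum reducer (class and variable names are my own, not from the answer): its input and output key/value types are both Text/IntWritable, and summing is commutative and associative, so the same class can be registered as both the Reducer and the Combiner of a job.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input types (Text, IntWritable) match output types (Text, IntWritable),
// and sum() is commutative and associative, so this class can serve as
// both the Combiner and the Reducer.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();   // accumulate partial or full counts for this key
        }
        result.set(sum);
        context.write(key, result);
    }
}
```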
Answer 2:
For commutativity, let's say your reducer can be represented by a function (in the mathematical sense) called f(). Then your reducer is commutative if f(a, b) = f(b, a). For instance:
- sum(A, B) is the same as sum(B, A)
- xor(A, B) is the same as xor(B, A)
- concat(A, B) is not the same as concat(B, A)
For associativity, the property is that f(f(a, b), c) = f(a, f(b, c)). For example:
- (A + B) + C is the same as A + (B + C)
- (A - B) - C is not the same as A - (B - C)
So in the context of Map/Reduce, your reducer has to respect these two properties. For example, if your reducer is doing just a sum() or a max(), it respects both properties, but something like mean() or median() does not, and thus you cannot use it as a combiner.
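For example, here is a small self-contained check (my own numbers, not from the answer) showing why mean() breaks when partitions have different sizes, which is exactly what happens when a combiner pre-aggregates per node:

```java
// The mean of per-partition means is not the overall mean unless every
// partition happens to contain the same number of values.
public class MeanDemo {
    static double mean(double... xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    public static void main(String[] args) {
        double overall = mean(1, 2, 3, 4, 10);                  // 20 / 5 = 4.0
        double meanOfMeans = mean(mean(1, 2, 3), mean(4, 10));  // mean(2.0, 7.0) = 4.5
        System.out.println(overall + " vs " + meanOfMeans);     // 4.0 vs 4.5
    }
}
```

A common workaround (a general pattern, not something stated in this answer) is to have the combiner emit (sum, count) pairs, which are commutative and associative to merge, and only divide in the final reducer.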
I personally see combiners as mini-reducers that run in memory after the map phase as an optimization to reduce network traffic, and the commutativity/associativity requirement actually makes sense if you look at Map/Reduce this way.
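As a rough sketch of where that mini-reducer plugs in, a Hadoop driver registers the combiner alongside the mapper and reducer. The class names below are the usual word-count placeholders (TokenizerMapper is assumed, not defined here), so treat this as an illustration rather than code from either answer:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: the combiner runs on each map node's output before
// anything is shuffled across the network to the reducers.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);   // assumed mapper class
        job.setCombinerClass(IntSumReducer.class);   // reducer reused as the combiner
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```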
Source: https://stackoverflow.com/questions/10220371/combiner-class-in-a-mapreduce-job