I am trying to determine whether the Hadoop API (Hadoop 2.0.0, MRv1) offers any hooks for handling data skew in a reducer. Scenario: I have a custom composite key and a partitioner in place to route data to reducers. To deal with the unusual but still quite likely case of a million keys with large values ending up on the same reducer, I need some sort of heuristic so that this data can be partitioned further to spawn off new reducers. I am thinking of a two-step process:
- Set mapred.max.reduce.failures.percent to, say, 10% and let the job complete.
- Rerun the job on the failed data set, passing a configuration setting through the driver that causes my partitioner to randomly partition the skewed data. The partitioner will implement the Configurable interface.
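To make the second step concrete, here is a minimal plain-Java sketch of the partitioning logic, not a real Hadoop class. The class name `SkewAwarePartitioner`, the `saltSkewedKeys` flag and the `saltBuckets` parameter are all hypothetical; in an actual job the flag would presumably be read from the job configuration in `setConf()` and the logic would sit in the partitioner's `getPartition()`. Note that random partitioning scatters the values of one key over several reducers, so each reducer only sees part of a group.

```java
import java.util.Random;

/** Plain-Java model of a partitioner with a driver-controlled salting switch (hypothetical names). */
final class SkewAwarePartitioner {
    private final boolean saltSkewedKeys; // assumed to come from a driver-supplied setting
    private final int saltBuckets;        // how many reducers one hot key may be spread over
    private final Random random;

    SkewAwarePartitioner(boolean saltSkewedKeys, int saltBuckets, long seed) {
        if (saltBuckets < 1) {
            throw new IllegalArgumentException("saltBuckets must be >= 1");
        }
        this.saltSkewedKeys = saltSkewedKeys;
        this.saltBuckets = saltBuckets;
        this.random = new Random(seed);
    }

    /** Returns a partition in [0, numPartitions). */
    int getPartition(String naturalKey, int numPartitions) {
        // Mask the sign bit so the modulo result is never negative.
        int base = (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
        if (!saltSkewedKeys) {
            return base; // first run: plain hash partitioning
        }
        // Rerun: shift the key by a random offset so its records land on
        // up to saltBuckets neighbouring partitions instead of one.
        int salt = random.nextInt(saltBuckets);
        return (base + salt) % numPartitions;
    }
}
```

With the flag off, a key always maps to the same partition; with it on, the same key is spread over at most `saltBuckets` partitions, so the reduce output for that key would need to be merged afterwards.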
Is there a better way, or another way?
A possible counter-solution would be to write out the mappers' output and spin off another map job that does the work of the reducer, but I do not want to put pressure on the NameNode.
This idea comes to mind, though I am not sure how good it is.
Let's say you are currently running the job with 10 reducers, and it is failing because of the data skew. The idea is to set the number of reducers to 15 and also define the maximum number of (key, value) pairs that each mapper may send to any one reducer. You keep that count in a hash map in your custom partitioner class. Once a particular reducer reaches the limit, you start sending the next (key, value) pairs to one of the 5 extra reducers that were set aside for handling the skew.
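The overflow idea above could be sketched roughly as follows. This is a plain-Java model rather than a real Hadoop partitioner, and `OverflowPartitioner`, `primaryReducers` and `maxPairsPerReducer` are made-up names. Since each map task would have its own partitioner instance, the counts here are per mapper, and a key that overflows ends up split across reducers.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java model of a partitioner that spills to spare reducers (hypothetical names). */
final class OverflowPartitioner {
    private final int primaryReducers;      // e.g. 10 regular reducers
    private final long maxPairsPerReducer;  // limit per primary reducer, per mapper
    private final Map<Integer, Long> sent = new HashMap<>();
    private int nextSpare = 0;              // round-robin cursor over the spare reducers

    OverflowPartitioner(int primaryReducers, long maxPairsPerReducer) {
        this.primaryReducers = primaryReducers;
        this.maxPairsPerReducer = maxPairsPerReducer;
    }

    /** numPartitions is the total reducer count, e.g. 15 = 10 primary + 5 spare. */
    int getPartition(String key, int numPartitions) {
        int primary = (key.hashCode() & Integer.MAX_VALUE) % primaryReducers;
        long count = sent.merge(primary, 1L, Long::sum);
        int spares = numPartitions - primaryReducers;
        if (count <= maxPairsPerReducer || spares <= 0) {
            return primary; // still under the limit, or no spare reducers configured
        }
        // Over the limit: hand the pair to the next spare reducer in turn.
        int spare = primaryReducers + nextSpare;
        nextSpare = (nextSpare + 1) % spares;
        return spare;
    }
}
```

For example, with 10 primary reducers, 15 partitions in total and a limit of 3, the first three pairs for a hot key go to its primary reducer and the following pairs go to reducers 10, 11, 12 and so on.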
If your process allows it, using a Combiner (a reduce-type function) could help. If you pre-aggregate the data on the mapper side, then even if all your data ends up on the same reducer, the amount of data could be manageable.
An alternative would be to reimplement the partitioner to avoid the skew case.
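As a rough illustration of what map-side pre-aggregation buys you, here is a plain-Java sketch that assumes the reduce function is a sum (a combiner is only safe when the operation is associative and commutative). `CombinerSketch` and `combine` are hypothetical names; in a real job the equivalent logic would live in a reducer class registered with `setCombinerClass`.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of a combiner that sums values per key on the map side (hypothetical names). */
final class CombinerSketch {
    /** Collapses all (key, value) pairs emitted by one map task into one partial sum per key. */
    static Map<String, Long> combine(List<Map.Entry<String, Long>> mapOutput) {
        Map<String, Long> partialSums = new TreeMap<>();
        for (Map.Entry<String, Long> pair : mapOutput) {
            partialSums.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return partialSums;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> mapOutput = new ArrayList<>();
        // One hot key emitted many times by a single map task.
        for (int i = 0; i < 1000; i++) {
            mapOutput.add(new AbstractMap.SimpleEntry<>("hot", 1L));
        }
        mapOutput.add(new AbstractMap.SimpleEntry<>("cold", 7L));
        // 1001 pairs shrink to two before anything is sent to a reducer.
        System.out.println(combine(mapOutput));
    }
}
```

The reducer still receives every key, but only one partial value per key from each map task instead of every raw record.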
Source: https://stackoverflow.com/questions/32627836/hadoop-handling-data-skew-in-reducer