Hadoop handling data skew in reducer

Submitted by 自闭症网瘾萝莉.ら on 2019-11-28 02:18:45

Question


I am trying to determine whether the Hadoop API (Hadoop 2.0.0, MRv1) offers any hooks for handling data skew in a reducer. Scenario: I have a custom composite key and a partitioner in place to route data to reducers. To deal with the unusual but quite likely case of a million keys with large values ending up on the same reducer, I need some sort of heuristic so that this data can be partitioned further and handed off to additional reducers. I am thinking of a two-step process:

  1. Set mapred.max.reduce.failures.percent to, say, 10% and let the job complete.
  2. Rerun the job on the failed data set, passing a configuration through the driver that causes my partitioner to randomly partition the skewed data. The partitioner will implement the Configurable interface.
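
The second step could be sketched roughly as follows. This is plain Java with no Hadoop dependency so that it runs on its own; in a real job the same logic would presumably live in a Partitioner subclass that also implements Configurable and reads the flag in setConf. The property name `skew.rerun.random`, the class name `RerunAwarePartitioner`, the seed parameter, and the use of a `Map<String, String>` in place of Hadoop's Configuration are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.Random;

// Stand-in for a Hadoop Partitioner that implements Configurable.
// A Map<String, String> plays the role of Hadoop's Configuration here.
class RerunAwarePartitioner {
    // Hypothetical property name; not a built-in Hadoop setting.
    static final String RANDOM_FLAG = "skew.rerun.random";

    private final boolean randomize;
    private final Random random;

    RerunAwarePartitioner(Map<String, String> conf, long seed) {
        // Mirrors what setConf(...) would do in a Configurable partitioner.
        this.randomize = Boolean.parseBoolean(conf.getOrDefault(RANDOM_FLAG, "false"));
        this.random = new Random(seed);
    }

    int getPartition(String key, int numPartitions) {
        if (randomize) {
            // Rerun on the failed data: ignore the key and spread records at random.
            return random.nextInt(numPartitions);
        }
        // First run: hash partitioning, so one key always maps to one reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Once the flag is on, records for a single key land on several reducers, so each reducer sees only part of a key's values and the output would typically need a follow-up merge.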

Is there a better way, or another way?

A possible counter-solution is to write out the mapper output and spin off another map job that does the reducer's work, but I do not want to put pressure on the NameNode.


Answer 1:


One idea comes to mind, though I am not sure how good it is.

Let's say you are currently running the job with 10 mappers, and it is failing because of data skew. The idea is to set the number of reducers to 15 and to define the maximum number of (key, value) pairs that each mapper may send to any one reducer. You keep those counts in a hash map in your custom partitioner class. Once a particular reducer reaches its limit, you start sending the next (key, value) pairs to one of the 5 extra reducers kept aside for handling the skew.
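
A minimal sketch of that overflow idea, written as dependency-free Java so it can be run directly. It assumes 10 primary reducers plus 5 spare ones, one partitioner instance per mapper, and a round-robin choice among the spares; the class name `OverflowPartitioner` and its constructor parameters are hypothetical. In Hadoop the method would instead override getPartition in a custom Partitioner.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java stand-in for a custom Hadoop Partitioner; one instance per mapper.
class OverflowPartitioner {
    private final int primaryReducers;   // reducers chosen by hashing the key
    private final int overflowReducers;  // spare reducers reserved for skew
    private final long maxPerReducer;    // per-mapper cap for each primary reducer
    // Number of records this mapper has hashed to each primary reducer so far.
    private final Map<Integer, Long> sent = new HashMap<>();
    private long overflowCursor = 0;     // round-robin position over the spares

    OverflowPartitioner(int primaryReducers, int overflowReducers, long maxPerReducer) {
        this.primaryReducers = primaryReducers;
        this.overflowReducers = overflowReducers;
        this.maxPerReducer = maxPerReducer;
    }

    int getPartition(String key) {
        int primary = (key.hashCode() & Integer.MAX_VALUE) % primaryReducers;
        long count = sent.merge(primary, 1L, Long::sum);
        if (count <= maxPerReducer) {
            return primary;
        }
        // Cap reached: rotate through the spare reducers, numbered after the primaries.
        int spare = (int) (overflowCursor++ % overflowReducers);
        return primaryReducers + spare;
    }
}
```

With `new OverflowPartitioner(10, 5, 3)`, the first three records of a hot key go to its hashed reducer and later ones rotate through partitions 10 to 14. The trade-off is that a key's values are no longer guaranteed to meet in one reducer, so the reduce logic has to tolerate partial groups or be followed by a merge step.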




Answer 2:


If your processing allows it, using a Combiner (a reduce-type function) could help. If you pre-aggregate the data on the mapper side, then even if all your data ends up in the same reducer, the amount of data could be manageable.

An alternative could be to reimplement the partitioner to avoid the skew case.
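
To illustrate the effect of map-side pre-aggregation, here is a small standalone sketch for a summing job. It does not use Hadoop classes; `MapSideCombine` and `combine` are hypothetical names, and the sketch assumes the reduce function is a sum, which is associative and commutative and therefore safe to apply early. In an actual job the combiner would be a Reducer implementation registered on the job, for example through setCombinerClass.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MapSideCombine {
    // Collapses one mapper's (key, count) pairs into a single pair per key,
    // which is what a summing combiner does before the shuffle.
    static Map<String, Long> combine(List<Map.Entry<String, Long>> mapperOutput) {
        Map<String, Long> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Long> pair : mapperOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return combined;
    }
}
```

If a mapper emits a million `("hot", 1)` pairs, the reducer then receives one `("hot", 1000000)` pair from that mapper instead of a million records, which is why skew on a single key becomes much cheaper to absorb.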



Source: https://stackoverflow.com/questions/32627836/hadoop-handling-data-skew-in-reducer
