问题
I have a hadoop MapReduce program that distributes keys unevenly. Some reducers end up with two keys, some with one key, and some with none. how do I force hadoop to distribute each partition with a certain key to a separate reducer. I have nine unique keys of the form:
0,0
0,1
0,2
1,0
1,1
1,2
2,0
2,1
2,2
and I set the job.setNumReduceTasks(9); but the hashpartitioner seems to hash two keys to the same hashcode causing overlapped keys being sent to the same reducer and leaving some reducers idle.
Does a random partitioner resolve this? will it send each unique key to a random reducer guaranteeing each reducer receives a single key. How do I enable it and replace the default?
EDIT:
can someone please explain why my output looks like
-rw-r--r-- 1 user supergroup 0 2018-04-19 18:58 outbin9/_SUCCESS
drwxr-xr-x - user supergroup 0 2018-04-19 18:57 outbin9/_logs
-rw-r--r-- 1 user supergroup 869 2018-04-19 18:57 outbin9/part-r-00000
-rw-r--r-- 1 user supergroup 1562 2018-04-19 18:57 outbin9/part-r-00001
-rw-r--r-- 1 user supergroup 913 2018-04-19 18:58 outbin9/part-r-00002
-rw-r--r-- 1 user supergroup 1771 2018-04-19 18:58 outbin9/part-r-00003
-rw-r--r-- 1 user supergroup 979 2018-04-19 18:58 outbin9/part-r-00004
-rw-r--r-- 1 user supergroup 880 2018-04-19 18:58 outbin9/part-r-00005
-rw-r--r-- 1 user supergroup 0 2018-04-19 18:58 outbin9/part-r-00006
-rw-r--r-- 1 user supergroup 0 2018-04-19 18:58 outbin9/part-r-00007
-rw-r--r-- 1 user supergroup 726 2018-04-19 18:58 outbin9/part-r-00008
The larger groups part-r-00001 and part-r-00003 have received keys 1,0 and 2,2 / 0,0 and 1,2 respectively. And notice that part-r-00006 and part-r-00007 are empty.
回答1:
HashPartitioner
is the default partitioner in Hadoop, which creates one Reduce task for each unique “key”. All the values with the same key goes to the same instance of your reducer, in a single call to the reduce function.
If user is interested to store a particular group of results in different reducers, then the user can write his own partitioner implementation. It can be general purpose or custom made to the specific data types or values that you expect to use in user application.
Custom Partitioner
is a process that allows you to store the results in different reducers, based on the user condition. By setting a partitioner to partition by the key, we can guarantee that, records for the same key will go to the same reducer. A partitioner ensures that only one reducer receives all the records for that particular key.
sample example link
来源:https://stackoverflow.com/questions/49933268/alternative-to-the-default-hashpartioner-provided-with-hadoop