How exactly does partitioning in MapReduce work?

萌比男神i 2021-02-01 06:09

I think I have a fair understanding of the MapReduce programming model in general, but even after reading the original paper and some other sources many details are unclear to me.

2 Answers
  • 2021-02-01 06:57
    1. You can start the reducer tasks while the map tasks are still running (using a feature known as slowstart), but the reducers can only run the copy phase (acquiring the intermediate output from map tasks that have already completed). A reducer needs to wait for all the mappers to complete before it can actually perform the final sort and reduce.
    2. A reduce task actually processes zero, one, or more keys (rather than a discrete task for each key). Each reducer needs to acquire the map output that relates to its partition from every map task; these intermediate outputs are then sorted and reduced one key group at a time.
    3. Back to the note in 2: a reduce task (one for each partition) runs on zero, one, or more keys, rather than a single task for each discrete key.
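    The slowstart behavior mentioned in point 1 is governed by a job property that says what fraction of map tasks must finish before reducers may launch and begin copying. A minimal configuration sketch (the value 0.05 is the usual default; tune it for your cluster):

    ```xml
    <!-- mapred-site.xml: reducers may start their copy phase
         once 5% of the map tasks have completed -->
    <property>
      <name>mapreduce.job.reduce.slowstart.completedmaps</name>
      <value>0.05</value>
    </property>
    ```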

    It's also important to understand the spread and variation of your intermediate keys, since each key is hashed and taken modulo the number of reducers (when using the default HashPartitioner) to determine which reduce partition should process it. Say you had an even number of reducer tasks (10) and output keys that always hashed to an even number: the modulo of these hash values by 10 would always be even, meaning that the odd-numbered reducers would never process any data.
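    The skew scenario above can be sketched in a few lines. This is an illustrative stand-alone class, not Hadoop's actual `HashPartitioner`, but the arithmetic (`hash & Integer.MAX_VALUE`, then modulo the reducer count) mirrors what the default partitioner does:

    ```java
    // Demonstrates partition skew when all key hashes are even
    // and the reducer count is even (here, 10).
    public class PartitionSkewDemo {
        // Same arithmetic as Hadoop's default hash partitioning:
        // mask off the sign bit, then take modulo the reducer count.
        static int partitionFor(int keyHash, int numReduceTasks) {
            return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
        }

        public static void main(String[] args) {
            int numReducers = 10;
            // Integer keys 2, 4, 6, ... hash to themselves (all even),
            // so every computed partition index is also even.
            for (int key = 2; key <= 20; key += 2) {
                int p = partitionFor(Integer.hashCode(key), numReducers);
                System.out.println("key " + key + " -> partition " + p);
            }
            // Reducers 1, 3, 5, 7, 9 never appear: they receive no data.
        }
    }
    ```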

  • 2021-02-01 07:00

    Addendum to what Chris said,

    Basically, a partitioner class in Hadoop (e.g. the default HashPartitioner) has to implement this function:

    int getPartition(K key, V value, int numReduceTasks)

    This function returns the partition number for a given key. The numReduceTasks parameter carries the number of reducers you fixed when starting the job, as can be seen in HashPartitioner.

    Based on the integer this function returns, Hadoop determines which reduce task should process a particular key.
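    For reference, HashPartitioner's implementation of that function boils down to the key's hashCode modulo the reducer count. A stand-alone sketch (generics elided for brevity; the class name here is illustrative, not the Hadoop class):

    ```java
    // Sketch of what the default HashPartitioner does in getPartition().
    public class SimpleHashPartitioner {
        public static int getPartition(Object key, Object value, int numReduceTasks) {
            // "& Integer.MAX_VALUE" clears the sign bit so the result is
            // never negative, even when hashCode() returns a negative value.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }

        public static void main(String[] args) {
            // Always yields a partition index in [0, numReduceTasks).
            System.out.println(getPartition("foo", null, 5));
        }
    }
    ```

    Note the masking step: without it, a negative hashCode would produce a negative modulo result in Java, which would be an invalid partition index.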

    Hope this helps.
