Partitioning: how does Hadoop do it? Does it use a hash function? What is the default function?

南笙 2021-01-01 00:54

Partitioning is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine for all of its output (key, val

1 answer
  • 2021-01-01 01:27

    The default partitioner in Hadoop is HashPartitioner, which has a single method, getPartition. It computes key.hashCode() & Integer.MAX_VALUE, which clears the sign bit so the value is never negative, and then takes that value modulo the number of reduce tasks.

    For example, if there are 10 reduce tasks, getPartition will return values 0 through 9 for all keys.

    Here is the code:

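    import org.apache.hadoop.mapreduce.Partitioner;

    public class HashPartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            // Clear the sign bit so the hash is non-negative, then map it into [0, numReduceTasks).
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }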
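
To see why the mask is there, here is a minimal sketch of the same arithmetic in plain Java, with no Hadoop types involved. The class name HashPartitionDemo and the helper partitionFor are made up for this illustration; only the expression inside partitionFor comes from the code above.

```java
public class HashPartitionDemo {
    // Same arithmetic as getPartition above, applied to any Java object.
    // (Hypothetical helper for illustration; not part of Hadoop.)
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 10;

        // Integer.hashCode() returns the int value itself, so this key has a negative hash.
        Integer negativeKey = -7;

        // A plain modulus keeps the sign of the hash and is not a valid partition index.
        System.out.println(negativeKey.hashCode() % numReduceTasks);   // -7

        // With the sign bit cleared, the result falls in 0..9.
        System.out.println(partitionFor(negativeKey, numReduceTasks)); // 1

        // Math.abs is not a safe substitute: Math.abs(Integer.MIN_VALUE) is still negative.
        Integer minKey = Integer.MIN_VALUE;
        System.out.println(Math.abs(minKey.hashCode()) % numReduceTasks); // -8
        System.out.println(partitionFor(minKey, numReduceTasks));         // 0
    }
}
```

Equal keys always produce the same hash, so every record with a given key ends up at the same reducer, whichever mapper emitted it.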
    

    To create a custom partitioner, extend Partitioner, override getPartition, and register the class in the driver code with job.setPartitionerClass(CustomPartitioner.class). This is particularly helpful for secondary sort, for example, where records that share a natural key must reach the same reducer even though their full keys differ.
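
As a rough sketch of the kind of rule a custom partitioner might apply for secondary sort, the plain-Java example below partitions on only the first part of a composite key. Everything here is an assumption made for illustration: the class name NaturalKeyPartitionDemo, the helper partitionFor, and the "naturalKey#sortField" key layout are all hypothetical. In a real job this logic would sit inside getPartition of a class that extends Partitioner, working on the job's own key type.

```java
public class NaturalKeyPartitionDemo {
    // Hypothetical composite key layout: "<naturalKey>#<sortField>".
    // Only the natural key is hashed, so the sort field cannot change the partition.
    static int partitionFor(String compositeKey, int numReduceTasks) {
        int sep = compositeKey.indexOf('#');
        String naturalKey = sep >= 0 ? compositeKey.substring(0, sep) : compositeKey;
        // Same sign-bit mask and modulus as the default HashPartitioner.
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 10;
        int first = partitionFor("user42#2021-01-01", numReduceTasks);
        int second = partitionFor("user42#2021-06-30", numReduceTasks);

        // Both records share the natural key "user42", so they land in the same partition.
        System.out.println(first == second); // true
    }
}
```

Hashing the whole composite key instead would scatter records for one natural key across reducers, which is what a custom partitioner is there to prevent.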
