Question
I've read somewhere that for operations that act on a single RDD, such as reduceByKey(), running on a pre-partitioned RDD causes all the values for each key to be computed locally on a single machine, so that only the final, locally reduced value has to be sent from each worker node back to the master. This suggests that I have to declare a partitioner, like so:
import org.apache.spark.{SparkContext, HashPartitioner}

val sc = new SparkContext(...)
val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...")
  .partitionBy(new HashPartitioner(100)) // Create 100 partitions
  .persist()
in order for reduceByKey to work as I described above.

My question is: if I want to use reduceByKey optimally, do I need to declare a partitioner every time, or is it not necessary?
Answer 1:
Partitioning an RDD just to avoid network traffic when you execute reduceByKey is hardly an optimal solution. Even if the reduceByKey itself then requires no shuffle, the full dataset has to be shuffled to perform the partitioning in the first place. Since this is usually much more expensive, it doesn't make sense to pre-partition unless your goal is to reduce the latency of the reduceByKey phase at the cost of increasing overall latency, or unless you can leverage the partitioning for other tasks.
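To make the trade-off concrete, here is a minimal sketch (local master, made-up data and names of my own) of the two situations: reducing directly, and pre-partitioning an RDD that is then reused by several key-based operations:

import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

object PrePartitionTradeoff {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pre-partition").setMaster("local[*]"))

    val events = sc.parallelize(Seq(("user1", 1), ("user2", 3), ("user1", 2)))

    // Case 1: no pre-partitioning. reduceByKey combines locally per partition,
    // then shuffles only the partial results -- a single shuffle in total.
    val totals = events.reduceByKey(_ + _)

    // Case 2: pre-partitioning. partitionBy itself shuffles the full dataset,
    // so it only pays off when the partitioned RDD is reused several times.
    val partitioned = events.partitionBy(new HashPartitioner(4)).persist()
    val totalsAgain = partitioned.reduceByKey(_ + _) // no additional shuffle
    val grouped     = partitioned.groupByKey()       // reuses the same partitioning

    totals.collect().foreach(println)
    sc.stop()
  }
}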
Answer 2:
Not really. reduceByKey uses data locality. From the RDD API:
/**
 * Merge the values for each key using an associative reduce function. This will also perform
 * the merging locally on each mapper before sending results to a reducer, similarly to a
 * "combiner" in MapReduce.
 */
This means that when you have a key-value RDD, in the first stage the values for each key are reduced within each partition using the provided function; only then are the partial results shuffled and reduced globally, using the same function, over the already aggregated values. There is no need to provide a partitioner. It just works.
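As a quick illustration (a minimal local sketch with made-up data; the names are mine), reduceByKey needs no explicit partitioner to get the local combining described above:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountNoPartitioner {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word-count").setMaster("local[*]"))

    val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(word => (word, 1))

    // Per-partition ("map-side") combining happens automatically;
    // the shuffle then only moves the partial sums for each key.
    val counts = pairs.reduceByKey(_ + _)

    counts.collect().foreach(println) // e.g. (a,3), (b,2), (c,1)
    sc.stop()
  }
}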
Answer 3:
Actually, the two qualities you are talking about are somewhat unrelated.

For reduceByKey(), the first quality is that elements with the same key are aggregated locally on each executor first, using the provided associative reduce function, and then aggregated again across executors. This behaviour is encapsulated in a boolean parameter called mapSideCombine which, if set to true, does the above. If set to false, as it is with groupByKey(), each record will be shuffled and sent to the correct executor.
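A small sketch of that contrast (assuming an existing SparkContext sc; the data is made up):

// reduceByKey combines values per partition before the shuffle;
// groupByKey ships every (key, value) record across the network first.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

val viaReduce = pairs.reduceByKey(_ + _)            // map-side combine, then shuffle of partial sums
val viaGroup  = pairs.groupByKey().mapValues(_.sum) // full shuffle of every record, then aggregation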
The second quality concerns partitioning and how it is used. Each RDD, by virtue of its definition, contains a list of splits and (optionally) a partitioner. The method reduceByKey() is overloaded and actually has a few definitions. For example:
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
This definition of the method uses the default partitioner: if the parent RDD already has a partitioner it is reused; otherwise a HashPartitioner is created, with the default parallelism level as the number of partitions.
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
This definition of the method will use a HashPartitioner to route data to the corresponding executors, and the number of partitions will be numPartitions.

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
Finally, this definition of the method is the most general one (the other two ultimately delegate to it): it takes a generic (perhaps custom) partitioner, and the number of output partitions is determined by how that partitioner partitions the keys.
The point is that you can actually encode your desired partitioning logic within the reduceByKey() call itself, as the sketch below shows. And if your intention was to avoid shuffling overhead by pre-partitioning, it doesn't really make sense either, since you will still shuffle during the pre-partitioning step.
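For illustration, a minimal sketch of the three overloads (assuming an existing SparkContext sc; the data and value names are mine):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val r1 = pairs.reduceByKey(_ + _)                         // default partitioner / parallelism
val r2 = pairs.reduceByKey(_ + _, 8)                      // HashPartitioner with 8 partitions
val r3 = pairs.reduceByKey(new HashPartitioner(8), _ + _) // explicit (possibly custom) partitioner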
Source: https://stackoverflow.com/questions/33875623/reducebykey-function-in-spark