How to properly apply HashPartitioner before a join in Spark?
Question

To reduce shuffling when joining two RDDs, I decided to partition them with HashPartitioner first. Here is how I do it. Am I doing it correctly, or is there a better way to do this?

    import org.apache.spark.HashPartitioner

    val rddA = ...
    val rddB = ...

    // Use the same partition count for both RDDs so they end up co-partitioned
    val numOfPartitions = rddA.getNumPartitions

    val rddApartitioned = rddA.partitionBy(new HashPartitioner(numOfPartitions))
    val rddBpartitioned = rddB.partitionBy(new HashPartitioner(numOfPartitions))

    val rddAB = rddApartitioned.join(rddBpartitioned)

Answer 1:

To reduce shuffling