I have two RDDs. One has between 5-10 million entries and the other has between 500-750 million entries. At some point, I have to join these two RDDs using a c
You can partition both RDDs with the same partitioner; partitions holding the same keys will then be collocated on the same executors, so join operations avoid a shuffle.
The shuffle happens only once, when you apply the partitioner. If you cache the partitioned RDDs, all joins after that should be local to the executors.
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

class A
class B

val rddA: RDD[(String, A)] = ???
val rddB: RDD[(String, B)] = ???

// Hash both RDDs into the same 1000 partitions. Note that partitionBy
// returns a new RDD, so the result must be captured, not discarded.
val partitioner = new HashPartitioner(1000)
val partitionedA = rddA.partitionBy(partitioner).cache()
val partitionedB = rddB.partitionBy(partitioner).cache()
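With both sides hashed by the same partitioner and cached, the join itself runs without a further shuffle. A minimal sketch (variable names here are illustrative, not from the original snippet):

```scala
// Re-partition both sides with the shared partitioner and cache the results.
val partitionedA = rddA.partitionBy(partitioner).cache()
val partitionedB = rddB.partitionBy(partitioner).cache()

// Because both inputs share the partitioner, Spark can compute the join
// partition-by-partition on each executor, without moving data across
// the network.
val joined: RDD[(String, (A, B))] = partitionedA.join(partitionedB)
```

Only the first action that materializes the partitioned RDDs pays the shuffle cost; repeated joins against the cached, co-partitioned data stay local.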
Also, you can try raising the broadcast threshold; rddA might be small enough to be broadcast (note this setting applies to Spark SQL / DataFrame joins, not raw RDD joins):
--conf spark.sql.autoBroadcastJoinThreshold=300000000 # ~300 MB
We use 400 MB for broadcast joins, and it works well.
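For raw RDDs, where the SQL threshold has no effect, you can get the same map-side join by hand: collect the small RDD as a map, broadcast it, and look keys up on the large side. A sketch under the assumption that the 5-10 million entries of rddA fit in driver and executor memory (names are hypothetical):

```scala
// Collect the small side to the driver and broadcast it to every
// executor exactly once.
val smallMap: Map[String, A] = rddA.collectAsMap().toMap
val bcast = sc.broadcast(smallMap)

// Map-side join: the 500M+ entry RDD is never shuffled at all; each
// record is joined locally against the broadcast map.
val joined: RDD[(String, (A, B))] = rddB.flatMap { case (k, b) =>
  bcast.value.get(k).map(a => (k, (a, b)))
}
```

This trades shuffle cost for memory: every executor holds a full copy of the small side, so it only pays off while that side genuinely fits in memory.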