How can I efficiently join a large RDD to a very large RDD in Spark?

旧巷少年郎 2021-02-08 10:22

I have two RDDs. One has between 5 and 10 million entries, and the other has between 500 and 750 million entries. At some point, I have to join these two RDDs using a c

1 answer
  • 2021-02-08 10:58

    You can partition both RDDs with the same partitioner. Records with the same key then land in the same partition index in both RDDs, so Spark can join them partition by partition.

    In that case the join itself does not trigger another shuffle.

    The shuffle happens only once, when you apply the partitioner. If you also cache the partitioned RDDs, all later joins between them should be local to the executors.

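    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    class A
    class B

    val rddA: RDD[(String, A)] = ???
    val rddB: RDD[(String, B)] = ???

    // Use the same partitioner (same type, same partition count) for both RDDs.
    val partitioner = new HashPartitioner(1000)

    // partitionBy returns a new RDD, so keep the result. Caching it means the
    // shuffle is paid only once.
    val partitionedA = rddA.partitionBy(partitioner).cache()
    val partitionedB = rddB.partitionBy(partitioner).cache()

    // Both sides share the partitioner, so this join needs no further shuffle.
    val joined: RDD[(String, (A, B))] = partitionedA.join(partitionedB)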
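
    To check the co-partitioning idea on a small scale, here is a minimal sketch that runs in local mode. The object name, the toy data, the `local[2]` master and the 4-partition count are all assumptions made for illustration; they are not from the original answer. It shows that a join of two RDDs partitioned with equal `HashPartitioner`s keeps that partitioner, and prints the lineage so you can inspect the stages yourself.

    ```scala
    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical example object; names and data are illustrative only.
    object CoPartitionedJoinCheck {

      // Partition both toy RDDs with the same partitioner, cache them, and join.
      def joinCoPartitioned(sc: SparkContext): RDD[(String, (Int, String))] = {
        val partitioner = new HashPartitioner(4)

        val small = sc.parallelize(Seq("a" -> 1, "b" -> 2))
          .partitionBy(partitioner)
          .cache()
        val large = sc.parallelize(Seq("a" -> "x", "b" -> "y", "c" -> "z"))
          .partitionBy(partitioner)
          .cache()

        // Inner join: the key "c" has no match on the small side and is dropped.
        small.join(large)
      }

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("co-partition-check")
        val sc = new SparkContext(conf)
        try {
          val joined = joinCoPartitioned(sc)
          // The join result reuses the shared partitioner.
          assert(joined.partitioner.contains(new HashPartitioner(4)))
          // Print the lineage to inspect where the shuffles are.
          println(joined.toDebugString)
          joined.collect().foreach(println)
        } finally {
          sc.stop()
        }
      }
    }
    ```

    On real data the partition count would be chosen from the data size and cluster resources; 1000 in the answer and 4 here are only examples.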
    

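    You can also try raising the broadcast threshold; rddA may be small enough to be broadcast. Note that this setting applies to Spark SQL (DataFrame/Dataset) joins, not to joins on plain RDDs:

    --conf spark.sql.autoBroadcastJoinThreshold=300000000 # ~300 MB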
    

    We use 400 MB for broadcast joins, and it works well.
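
    If you stay on the RDD API, a broadcast join can be written by hand as a map-side join. The following is a minimal sketch, not a definitive implementation: the object and method names and the toy data are hypothetical, and it assumes that the smaller RDD has unique keys and fits in memory on both the driver and every executor. Whether that holds for 5 to 10 million entries depends on the size of each record.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical example object; names and data are illustrative only.
    object BroadcastMapSideJoin {

      // Inner join done on the map side: the small RDD is collected to the
      // driver, broadcast to the executors, and looked up per record.
      def broadcastJoin(
          sc: SparkContext,
          small: RDD[(String, Int)],
          large: RDD[(String, String)]): RDD[(String, (Int, String))] = {
        // collectAsMap keeps one value per key, so duplicate keys on the
        // small side would be lost (assumption: keys are unique).
        val smallMap = sc.broadcast(small.collectAsMap())

        // No shuffle of the large RDD: each record is matched locally.
        large.flatMap { case (key, largeValue) =>
          smallMap.value.get(key).map(smallValue => (key, (smallValue, largeValue)))
        }
      }

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("broadcast-join-sketch")
        val sc = new SparkContext(conf)
        try {
          val small = sc.parallelize(Seq("a" -> 1, "b" -> 2))
          val large = sc.parallelize(Seq("a" -> "x", "b" -> "y", "c" -> "z"))
          broadcastJoin(sc, small, large).collect().foreach(println)
        } finally {
          sc.stop()
        }
      }
    }
    ```

    The trade-off compared with co-partitioning is that nothing is shuffled at all, but the whole small side is copied to every executor.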
