I have two RDDs. One has between 5-10 million entries and the other has between 500-750 million entries. At some point, I have to join these two RDDs using a c
You can partition both RDDs with the same partitioner; partitions holding the same keys will then be collocated on the same executors, so join operations avoid a shuffle.
The shuffle happens only once, when you apply the partitioner. If you cache the partitioned RDDs, all joins after that should be local to the executors.
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

class A
class B

val rddA: RDD[(String, A)] = ???
val rddB: RDD[(String, B)] = ???

// Hash both RDDs into the same 1000 partitions. Note that partitionBy
// returns a new RDD, so the result must be captured, not discarded.
val partitioner = new HashPartitioner(1000)
val partitionedA = rddA.partitionBy(partitioner).cache()
val partitionedB = rddB.partitionBy(partitioner).cache()
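With both sides hashed by the same partitioner and cached, the join itself runs without a further shuffle. A minimal sketch (variable names here are illustrative, not from the original snippet):

```scala
// Re-partition both sides with the shared partitioner and cache the results.
val partitionedA = rddA.partitionBy(partitioner).cache()
val partitionedB = rddB.partitionBy(partitioner).cache()

// Because both inputs share the partitioner, Spark can compute the join
// partition-by-partition on each executor, without moving data across
// the network.
val joined: RDD[(String, (A, B))] = partitionedA.join(partitionedB)
```

Only the first action that materializes the partitioned RDDs pays the shuffle cost; repeated joins against the cached, co-partitioned data stay local.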
Also, you can try raising the broadcast threshold; rddA might be small enough to be broadcast (note this setting applies to Spark SQL / DataFrame joins, not raw RDD joins):
--conf spark.sql.autoBroadcastJoinThreshold=300000000 # ~300 MB
We use 400 MB for broadcast joins, and it works well.
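For raw RDDs, where the SQL threshold has no effect, you can get the same map-side join by hand: collect the small RDD as a map, broadcast it, and look keys up on the large side. A sketch under the assumption that the 5-10 million entries of rddA fit in driver and executor memory (names are hypothetical):

```scala
// Collect the small side to the driver and broadcast it to every
// executor exactly once.
val smallMap: Map[String, A] = rddA.collectAsMap().toMap
val bcast = sc.broadcast(smallMap)

// Map-side join: the 500M+ entry RDD is never shuffled at all; each
// record is joined locally against the broadcast map.
val joined: RDD[(String, (A, B))] = rddB.flatMap { case (k, b) =>
  bcast.value.get(k).map(a => (k, (a, b)))
}
```

This trades shuffle cost for memory: every executor holds a full copy of the small side, so it only pays off while that side genuinely fits in memory.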