Spark SQL broadcast hash join

后端未结

关注

 3  1051

I\'m trying to perform a broadcast hash join on dataframes using SparkSQL as documented here: https://docs.cloud.databricks.com/docs/latest/databricks_guide/06%20Spark%20SQL

相关标签:

3条回答

感情败类

2020-12-01 07:23
```
jon_rdd = sqlContext.sql( "select * from people_in_india  p
                            join states s
                            on p.state = s.name")


jon_rdd.toDebugString() / join_rdd.explain() : 
```
shuffledHashJoin :
all the data for the India will be shuffled into only 29 keys for each of the states. Problems: uneven sharding. Limited parallelism with 29 output partitions.

broadcaseHashJoin:

broadcast the small RDD to all worker nodes. parallelism of the large rdd is still maintained and shuffle is not even required.

PS: Image may ugly but informative.
0 讨论(0)
发布评论:

提交评论
- 加载中...

礼貌的吻别

2020-12-01 07:30

You can explicitly mark the DataFrame as small enough for broadcasting using broadcast function:

Python:

from pyspark.sql.functions import broadcast

small_df = ...
large_df = ...

large_df.join(broadcast(small_df), ["foo"])

or broadcast hint (Spark >= 2.2):

large_df.join(small_df.hint("broadcast"), ["foo"])

Scala:

import org.apache.spark.sql.functions.broadcast

val smallDF: DataFrame = ???
val largeDF: DataFrame = ???

largeDF.join(broadcast(smallDF), Seq("foo"))

or broadcast hint (Spark >= 2.2):

largeDF.join(smallDF.hint("broadcast"), Seq("foo"))

SQL

You can use hints (Spark >= 2.2):

SELECT /*+ MAPJOIN(small) */ * 
FROM large JOIN small
ON large.foo = small.foo

SELECT /*+  BROADCASTJOIN(small) */ * 
FROM large JOIN small
ON large.foo = small.foo

SELECT /*+ BROADCAST(small) */ * 
FROM large JOIN small
ON larger.foo = small.foo

R (SparkR):

With hint (Spark >= 2.2):

join(large, hint(small, "broadcast"), large$foo == small$foo)

With broadcast (Spark >= 2.3)

join(large, broadcast(small), large$foo == small$foo)

Note:

Broadcast join is useful if one of structures is relatively small. Otherwise it can be significantly more expensive than a full shuffle.

0 讨论(0)

暖寄归人

2020-12-01 07:38

With a broadcast join one side of the join equation is being materialized and send to all mappers. It is therefore considered as a map-side join.

As the data set is getting materialized and send over the network it does only bring significant performance improvement, if it considerable small.

So if you are trying to perform smallDF.join(largeDF)

Wait..!!! another constraint is that it also needs to fit completely into the memory of each executor.It also needs to fit into the memory of the Driver!

Broadcast variables are shared among executors using the Torrent protocol i.e.Peer-to-Peer protocol and the advantage of the Torrent protocol is that peers share blocks of a file among each other not relying on a central entity holding all the blocks.

Above mentioned example is sufficient enough to start playing with broadcast join.

Note: Cannot modify value after creation. If you try, change will only be on one&node

0 讨论(0)
发布评论:

提交评论
- 加载中...