Spark SQL broadcast hash join

后端 未结 3 1051
情书的邮戳
情书的邮戳 2020-12-01 06:50

I\'m trying to perform a broadcast hash join on dataframes using SparkSQL as documented here: https://docs.cloud.databricks.com/docs/latest/databricks_guide/06%20Spark%20SQL

相关标签:
3条回答
  • 2020-12-01 07:23
    jon_rdd = sqlContext.sql( "select * from people_in_india  p
                                join states s
                                on p.state = s.name")
    
    
    jon_rdd.toDebugString() / join_rdd.explain() : 
    

    shuffledHashJoin :
    all the data for the India will be shuffled into only 29 keys for each of the states. Problems: uneven sharding. Limited parallelism with 29 output partitions.

    broadcaseHashJoin:

    broadcast the small RDD to all worker nodes. parallelism of the large rdd is still maintained and shuffle is not even required.

    PS: Image may ugly but informative.

    0 讨论(0)
  • 2020-12-01 07:30

    You can explicitly mark the DataFrame as small enough for broadcasting using broadcast function:

    Python:

    from pyspark.sql.functions import broadcast
    
    small_df = ...
    large_df = ...
    
    large_df.join(broadcast(small_df), ["foo"])
    

    or broadcast hint (Spark >= 2.2):

    large_df.join(small_df.hint("broadcast"), ["foo"])
    

    Scala:

    import org.apache.spark.sql.functions.broadcast
    
    val smallDF: DataFrame = ???
    val largeDF: DataFrame = ???
    
    largeDF.join(broadcast(smallDF), Seq("foo"))
    

    or broadcast hint (Spark >= 2.2):

    largeDF.join(smallDF.hint("broadcast"), Seq("foo"))
    

    SQL

    You can use hints (Spark >= 2.2):

    SELECT /*+ MAPJOIN(small) */ * 
    FROM large JOIN small
    ON large.foo = small.foo
    

    or

    SELECT /*+  BROADCASTJOIN(small) */ * 
    FROM large JOIN small
    ON large.foo = small.foo
    

    or

    SELECT /*+ BROADCAST(small) */ * 
    FROM large JOIN small
    ON larger.foo = small.foo
    

    R (SparkR):

    With hint (Spark >= 2.2):

    join(large, hint(small, "broadcast"), large$foo == small$foo)
    

    With broadcast (Spark >= 2.3)

    join(large, broadcast(small), large$foo == small$foo)
    

    Note:

    Broadcast join is useful if one of structures is relatively small. Otherwise it can be significantly more expensive than a full shuffle.

    0 讨论(0)
  • 2020-12-01 07:38

    With a broadcast join one side of the join equation is being materialized and send to all mappers. It is therefore considered as a map-side join.

    As the data set is getting materialized and send over the network it does only bring significant performance improvement, if it considerable small.

    So if you are trying to perform smallDF.join(largeDF)

    Wait..!!! another constraint is that it also needs to fit completely into the memory of each executor.It also needs to fit into the memory of the Driver!

    Broadcast variables are shared among executors using the Torrent protocol i.e.Peer-to-Peer protocol and the advantage of the Torrent protocol is that peers share blocks of a file among each other not relying on a central entity holding all the blocks.

    Above mentioned example is sufficient enough to start playing with broadcast join.

    Note: Cannot modify value after creation. If you try, change will only be on one&node

    0 讨论(0)
提交回复
热议问题