Broadcast hash join - Iterative

Frontend · unresolved · 3 replies · 1761 views
你的背包
你的背包 2020-12-31 14:39

We use a broadcast hash join in Spark when one dataframe is small enough to fit into memory. When the size of the small dataframe is below spark.sql.autoBroadcastJoinThreshold, Spark broadcasts it automatically, so why create the broadcast variable by hand?
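For readers unfamiliar with the mechanics, here is a minimal sketch (plain Python, not Spark) of what a broadcast hash join does conceptually: the small side is shipped to every worker and built into a hash table once, then the large side is streamed through it without shuffling. The data and function name are illustrative only.

    # Sketch of a broadcast hash join in plain Python.
    def broadcast_hash_join(large, small, key):
        # "Broadcast" step: the small side becomes an in-memory hash table.
        hash_table = {}
        for row in small:
            hash_table.setdefault(row[key], []).append(row)

        # Probe step: stream the large side; no shuffle of `large` is needed.
        joined = []
        for row in large:
            for match in hash_table.get(row[key], []):
                joined.append({**row, **match})
        return joined

    orders = [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}]
    users  = [{"id": 1, "name": "ann"}]
    print(broadcast_hash_join(orders, users, "id"))
    # [{'id': 1, 'amount': 10, 'name': 'ann'}]

This is why the pattern only pays off when the small side fits in each executor's memory: every worker holds the full hash table.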

3 replies
  •  小鲜肉 (OP)
     2020-12-31 15:13

    The idea here is to create the broadcast variable before the join so you can control it explicitly. Without it, you can't control these variables: Spark handles them for you.

    Example:

    from pyspark.sql.functions import broadcast

    # Mark sdf2 with a broadcast hint so Spark plans a broadcast hash join
    # for it, regardless of the autoBroadcastJoinThreshold setting.
    sdf2_bd = broadcast(sdf2)
    sdf1.join(sdf2_bd, sdf1.id == sdf2_bd.id)
    

    The following rules apply to all broadcast variables, whether created automatically in joins or by hand:

    1. The broadcast data is sent only to the nodes that contain an executor that needs it.
    2. The broadcast data is stored in memory. If not enough memory is available, the disk is used.
    3. When you are done with a broadcast variable, you should destroy it to release memory.
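    Rule 3 can be sketched with pyspark's low-level Broadcast API. This is an illustrative example, assuming a local SparkContext; the app name and lookup data are made up:

        # Explicit broadcast-variable lifecycle: create, use, destroy.
        from pyspark import SparkContext

        sc = SparkContext("local[1]", "broadcast-lifecycle")

        lookup = {"a": 1, "b": 2}      # small lookup table shipped to executors
        bc = sc.broadcast(lookup)      # sent once per executor and cached there

        # Tasks read it through .value instead of re-shipping it per task.
        result = sc.parallelize(["a", "b", "a"]).map(lambda k: bc.value[k]).collect()
        print(result)                  # [1, 2, 1]

        bc.destroy()                   # release memory/disk on driver and executors
        sc.stop()

    Calling destroy() makes the variable unusable afterwards, so only do it once no running or future job still references it.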
