Spark: Prevent shuffle/exchange when joining two identically partitioned dataframes

臣服心动 2021-02-03 13:13

I have two dataframes df1 and df2 and I want to join these tables many times on a high cardinality field called visitor_id. I would like

1 Answer
  • 2021-02-03 13:16

    You can use the bucketBy method of the DataFrameWriter (other documentation).

    In the following example, the values of the VisitorID column are hashed into 500 buckets. Normally, Spark would perform an exchange (shuffle) phase based on the hash of VisitorID before the join. In this case, however, the data is already pre-partitioned by that hash, so the exchange can be skipped.

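    from pyspark.sql.functions import col
    from pyspark.sql.types import IntegerType, StructField, StructType

    # Assumes an existing SparkSession `spark` and SparkContext `sc` (e.g. in a notebook).
    inputRdd = sc.parallelize(list((i, i%200) for i in range(0,1000000)))

    schema = StructType([StructField("VisitorID", IntegerType(), True),
                        StructField("visitor_partition", IntegerType(), True)])

    inputdf = inputRdd.toDF(schema)

    # Hash VisitorID into 500 buckets and persist the result as a table.
    inputdf.write.bucketBy(500, "VisitorID").saveAsTable("bucketed_table")

    # Both sides read the same bucketed table, so their bucketing matches.
    inputDf1 = spark.sql("select * from bucketed_table")
    inputDf2 = spark.sql("select * from bucketed_table")
    inputDf3 = inputDf1.alias("df1").join(inputDf2.alias("df2"), col("df1.VisitorID") == col("df2.VisitorID"))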
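
To see why pre-bucketing makes the exchange unnecessary, here is a minimal pure-Python sketch of the idea (no Spark involved). It assumes a stand-in hash (`zlib.crc32`), whereas Spark uses its own hash function internally; the helper names `bucket_of`, `bucketize` and `bucket_local_join` are hypothetical. The point is only that when both sides use the same hash and the same bucket count, matching keys always land in the same bucket index, so each bucket can be joined on its own.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 500  # same bucket count as the bucketBy call in the answer


def bucket_of(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    # Stand-in hash (assumption): Spark uses its own hash function internally.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows, num_buckets: int = NUM_BUCKETS):
    """Group (key, value) rows by bucket index, like a pre-bucketed table."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets


def bucket_local_join(left_buckets, right_buckets):
    """Inner join that only compares rows sharing the same bucket index."""
    joined = []
    for idx, left_rows in left_buckets.items():
        # Index the matching right-hand bucket by key.
        right_by_key = defaultdict(list)
        for key, value in right_buckets.get(idx, []):
            right_by_key[key].append(value)
        for key, left_value in left_rows:
            for right_value in right_by_key.get(key, []):
                joined.append((key, left_value, right_value))
    return joined


# Two toy datasets keyed by a visitor id.
left = [(i, i % 200) for i in range(1000)]
right = [(i, i * 2) for i in range(0, 1000, 3)]

result = bucket_local_join(bucketize(left), bucketize(right))
```

No row ever has to move to another bucket to find its match, which is the property Spark relies on when it skips the exchange for two tables bucketed the same way.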
    

    Sometimes the Spark query optimizer still chooses a broadcast exchange, so for this example let's disable automatic broadcast joins:

    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
    

    The physical plan would look as follows. Note that neither side of the SortMergeJoin contains an Exchange operator:

    == Physical Plan ==
    *(3) SortMergeJoin [VisitorID#351], [VisitorID#357], Inner
    :- *(1) Sort [VisitorID#351 ASC NULLS FIRST], false, 0
    :  +- *(1) Project [VisitorID#351, visitor_partition#352]
    :     +- *(1) Filter isnotnull(VisitorID#351)
    :        +- *(1) FileScan parquet default.bucketed_6[VisitorID#351,visitor_partition#352] Batched: true, DataFilters: [isnotnull(VisitorID#351)], Format: Parquet, Location: InMemoryFileIndex[dbfs:/user/hive/warehouse/bucketed_6], PartitionFilters: [], PushedFilters: [IsNotNull(VisitorID)], ReadSchema: struct<VisitorID:int,visitor_partition:int>, SelectedBucketsCount: 500 out of 500
    +- *(2) Sort [VisitorID#357 ASC NULLS FIRST], false, 0
       +- *(2) Project [VisitorID#357, visitor_partition#358]
          +- *(2) Filter isnotnull(VisitorID#357)
             +- *(2) FileScan parquet default.bucketed_6[VisitorID#357,visitor_partition#358] Batched: true, DataFilters: [isnotnull(VisitorID#357)], Format: Parquet, Location: InMemoryFileIndex[dbfs:/user/hive/warehouse/bucketed_6], PartitionFilters: [], PushedFilters: [IsNotNull(VisitorID)], ReadSchema: struct<VisitorID:int,visitor_partition:int>, SelectedBucketsCount: 500 out of 500
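
For larger plans it can be tedious to look for exchanges by eye. The following is a small sketch of a text-based check, assuming the plan is available as a string in the format printed above; `exchange_operators` is a hypothetical helper, not a Spark API. The first sample is abbreviated from the plan above, and the second is a hand-written illustration of what a shuffled join might look like, not real output.

```python
import re

# Skips the tree-drawing prefix (":", "+-", ":-", spaces) and an optional
# whole-stage-codegen marker such as "*(3)", then captures the operator name.
_OPERATOR = re.compile(r"^[\s:|+\-]*(?:\*\(\d+\)\s+)?(\w+)")


def exchange_operators(plan_text: str) -> list:
    """Return the names of all plan operators whose name contains 'Exchange'."""
    found = []
    for line in plan_text.splitlines():
        match = _OPERATOR.match(line)
        if match and "Exchange" in match.group(1):
            found.append(match.group(1))
    return found


# Abbreviated from the bucketed plan shown in the answer.
bucketed_plan = """\
== Physical Plan ==
*(3) SortMergeJoin [VisitorID#351], [VisitorID#357], Inner
:- *(1) Sort [VisitorID#351 ASC NULLS FIRST], false, 0
:  +- *(1) Filter isnotnull(VisitorID#351)
+- *(2) Sort [VisitorID#357 ASC NULLS FIRST], false, 0
   +- *(2) Filter isnotnull(VisitorID#357)
"""

# Hypothetical plan for non-bucketed inputs (illustrative only, not real output).
shuffled_plan = """\
== Physical Plan ==
*(5) SortMergeJoin [VisitorID#1], [VisitorID#7], Inner
:- *(2) Sort [VisitorID#1 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(VisitorID#1, 200)
+- *(4) Sort [VisitorID#7 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(VisitorID#7, 200)
"""

no_shuffle = exchange_operators(bucketed_plan)
with_shuffle = exchange_operators(shuffled_plan)
```

In PySpark, the text to feed into such a check is what `inputDf3.explain()` prints. An empty result for the bucketed join indicates that no shuffle or broadcast exchange was planned.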
    

    Doing something like this:

    inputdf.write.partitionBy("visitor_partition").saveAsTable("partitionBy_2")
    

    does create a directory structure with one folder per partition value. However, it does not help here, because the Spark join is based on the hash of the join key and cannot leverage your custom layout.

    Edit: I misunderstood your example. I believe you were talking about something like partitionBy, not repartition as mentioned in the previous version.
