Question
Suppose I have two partitioned dataframes:
df1 = spark.createDataFrame(
    [(x, x, x) for x in range(5)], ['key1', 'key2', 'time']
).repartition(3, 'key1', 'key2')

df2 = spark.createDataFrame(
    [(x, x, x) for x in range(7)], ['key1', 'key2', 'time']
).repartition(3, 'key1', 'key2')
(scenario 1) If I join them on [key1, key2], the join is performed within each partition without a shuffle (the number of partitions in the resulting dataframe is the same):
x = df1.join(df2, on=['key1', 'key2'], how='left')
assert x.rdd.getNumPartitions() == 3
(scenario 2) But if I join them on [key1, key2, time], a shuffle takes place (the resulting dataframe has 200 partitions, which is driven by the spark.sql.shuffle.partitions option):
x = df1.join(df2, on=['key1', 'key2', 'time'], how='left')
assert x.rdd.getNumPartitions() == 200
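A quick way to see where the extra shuffle comes from is to compare the physical plans of the two joins with explain(). This is only a diagnostic sketch against the dataframes defined above (it assumes a pyspark shell where spark is available); exact operator names can vary by Spark version:
# Scenario 1: the join keys match the repartitioning keys (key1, key2).
df1.join(df2, on=['key1', 'key2'], how='left').explain()
# Per the observation above, no additional Exchange is added for the join;
# the existing hashpartitioning(key1, key2, 3) is reused.

# Scenario 2: the join keys also include 'time'.
df1.join(df2, on=['key1', 'key2', 'time'], how='left').explain()
# Here an Exchange hashpartitioning(key1, key2, time, 200) shows up,
# matching the 200 partitions observed above.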
At the same time, groupBy and window operations on [key1, key2, time] preserve the number of partitions and are done without a shuffle:
from pyspark.sql import functions as F
x = df1.groupBy('key1', 'key2', 'time').agg(F.count('*'))
assert x.rdd.getNumPartitions() == 3
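The snippet above only covers groupBy; for completeness, here is a minimal sketch of the window case mentioned, reusing the F import above. Per the observation in the question, it also keeps the 3 partitions:
from pyspark.sql import Window

# Window partitioned on the same three columns as the problematic join.
w = Window.partitionBy('key1', 'key2', 'time')
x = df1.withColumn('cnt', F.count('*').over(w))
# The existing hashpartitioning(key1, key2, 3) is reused, no shuffle.
assert x.rdd.getNumPartitions() == 3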
I can't understand whether this is a bug or whether there are reasons for performing a shuffle in the second scenario. And how can I avoid the shuffle, if that's possible?
Answer 1:
I think I was able to figure out the reason for the different results in Python and Scala.
The reason is broadcast optimisation. If spark-shell is started with broadcast joins disabled, Python and Scala behave identically.
./spark-shell --conf spark.sql.autoBroadcastJoinThreshold=-1
val df1 = Seq(
  (1, 1, 1)
).toDF("key1", "key2", "time").repartition(3, col("key1"), col("key2"))

val df2 = Seq(
  (1, 1, 1),
  (2, 2, 2)
).toDF("key1", "key2", "time").repartition(3, col("key1"), col("key2"))
val x = df1.join(df2, usingColumns = Seq("key1", "key2", "time"))
x.rdd.getNumPartitions == 200
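For reference, the same broadcast-disabling knob can be set from pyspark at runtime instead of via the spark-shell flag; a minimal sketch using the dataframes from the question:
# Same setting as --conf spark.sql.autoBroadcastJoinThreshold=-1 above,
# applied at runtime from pyspark.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

x = df1.join(df2, on=['key1', 'key2', 'time'], how='left')
# With broadcast joins off, both pyspark and Scala fall back to a sort-merge
# join that shuffles to spark.sql.shuffle.partitions (200 by default).
assert x.rdd.getNumPartitions() == 200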
So it looks like Spark 2.4.0 isn't able to optimise the described case out of the box, and a Catalyst optimizer extension is needed, as suggested by @user10938362.
BTW, here is some info about writing Catalyst optimizer extensions: https://developer.ibm.com/code/2017/11/30/learn-extension-points-apache-spark-extend-spark-catalyst-optimizer/
Answer 2:
The behaviour of the Catalyst Optimizer differs between pyspark and Scala (using Spark 2.4, at least).
I ran both and got two different plans.
Indeed, you get 200 partitions in pyspark, unless you set this explicitly for pyspark:
spark.conf.set("spark.sql.shuffle.partitions", 3)
Then 3 partitions are processed, and thus 3 are retained under pyspark.
I'm a little surprised, as I would have thought the behaviour would be common under the hood. So people keep telling me; it just goes to show.
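As a minimal sketch of that flow in pyspark, reusing df1 and df2 from the question (note that this setting only resizes the shuffle to 3 partitions; the Exchange steps are still present in the plan below):
# Make the join's shuffle produce 3 partitions instead of the default 200.
spark.conf.set("spark.sql.shuffle.partitions", 3)

x = df1.join(df2, on=['key1', 'key2', 'time'], how='left')
assert x.rdd.getNumPartitions() == 3

# Should print a plan like the one below, still containing
# Exchange hashpartitioning(key1, key2, time, 3) on both sides of the join.
x.explain()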
Physical Plan for pyspark with param set via conf:
== Physical Plan ==
*(5) Project [key1#344L, key2#345L, time#346L]
+- SortMergeJoin [key1#344L, key2#345L, time#346L], [key1#350L, key2#351L, time#352L], LeftOuter
   :- *(2) Sort [key1#344L ASC NULLS FIRST, key2#345L ASC NULLS FIRST, time#346L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(key1#344L, key2#345L, time#346L, 3)
   :     +- *(1) Scan ExistingRDD[key1#344L,key2#345L,time#346L]
   +- *(4) Sort [key1#350L ASC NULLS FIRST, key2#351L ASC NULLS FIRST, time#352L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(key1#350L, key2#351L, time#352L, 3)
         +- *(3) Filter ((isnotnull(key1#350L) && isnotnull(key2#351L)) && isnotnull(time#352L))
            +- *(3) Scan ExistingRDD[key1#350L,key2#351L,time#352L]
Source: https://stackoverflow.com/questions/55229290/question-about-joining-dataframes-in-spark