How to use DataFrame Window expressions and withColumn without changing the partitioning?

故里飘歌 2021-01-25 07:08

For some reason I have to convert an RDD to a DataFrame, then do something with the DataFrame.

My interface is RDD, so I have
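
For illustration only, here is a minimal sketch of the scenario described above (the column name "value" and the running-sum expression are assumptions, not taken from the question): convert the RDD to a DataFrame, add a column with a window expression via withColumn, then hand an RDD back out.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.sum

    val spark = SparkSession.builder.appName("window-example").getOrCreate()
    import spark.implicits._

    // the incoming interface is an RDD
    val rdd = spark.sparkContext.parallelize(1 to 8, 4)

    // convert to a DataFrame and add a cumulative sum with a window expression
    val df  = rdd.toDF("value")
    val out = df.withColumn("csum", sum($"value").over(Window.orderBy($"value")))

    // convert back to an RDD for the rest of the pipeline
    val resultRdd = out.rdd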

2 Answers
  •  后悔当初
    2021-01-25 07:44

    Let's make this as simple as possible: we will generate the same data spread across 4 partitions:

    scala> val df = spark.range(1,9,1,4).toDF
    df: org.apache.spark.sql.DataFrame = [id: bigint]
    
    scala> df.show
    +---+
    | id|
    +---+
    |  1|
    |  2|
    |  3|
    |  4|
    |  5|
    |  6|
    |  7|
    |  8|
    +---+
    
    scala> df.rdd.getNumPartitions
    res13: Int = 4
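
    As an aside (not from the original answer): since the question's interface is an RDD, the same 4-partition DataFrame can just as well be built from an RDD; toDF preserves the RDD's partition count because no shuffle is involved.

    // same data, starting from an RDD that already has 4 partitions
    val rddDf = spark.sparkContext.parallelize(1L to 8L, 4).toDF("id")
    rddDf.rdd.getNumPartitions   // 4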
    

    We don't need 3 window functions to prove this, so let's do it with one:

    scala> import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.expressions.Window
    
    scala> val df2 = df.withColumn("csum", sum($"id").over(Window.orderBy($"id")))
    df2: org.apache.spark.sql.DataFrame = [id: bigint, csum: bigint]
    

    So what's happening here is that we didn't just add a column; we computed a cumulative sum over the data with a window. Since you haven't provided a partition column, the window function moves all the data to a single partition, and you even get a warning from Spark:

    scala> df2.rdd.getNumPartitions
    17/06/06 10:05:53 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
    res14: Int = 1
    
    scala> df2.show
    17/06/06 10:05:56 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
    +---+----+
    | id|csum|
    +---+----+
    |  1|   1|
    |  2|   3|
    |  3|   6|
    |  4|  10|
    |  5|  15|
    |  6|  21|
    |  7|  28|
    |  8|  36|
    +---+----+
    

    So let's now add a column to partition on. We will create a new DataFrame just for the sake of demonstration:

    scala> val df3 = df.withColumn("x", when($"id"<5,lit("a")).otherwise("b"))
    df3: org.apache.spark.sql.DataFrame = [id: bigint, x: string]
    

    It indeed has the same number of partitions that we defined explicitly on df:

    scala> df3.rdd.getNumPartitions
    res18: Int = 4
    

    Let's perform our window operation, partitioning by the column x:

    scala> val df4 = df3.withColumn("csum", sum($"id").over(Window.orderBy($"id").partitionBy($"x")))
    df4: org.apache.spark.sql.DataFrame = [id: bigint, x: string ... 1 more field]
    
    scala> df4.show
    +---+---+----+                                                                  
    | id|  x|csum|
    +---+---+----+
    |  5|  b|   5|
    |  6|  b|  11|
    |  7|  b|  18|
    |  8|  b|  26|
    |  1|  a|   1|
    |  2|  a|   3|
    |  3|  a|   6|
    |  4|  a|  10|
    +---+---+----+
    

    The window function repartitions our data using the default number of shuffle partitions set in the Spark configuration (spark.sql.shuffle.partitions, 200 by default):

    scala> df4.rdd.getNumPartitions
    res20: Int = 200
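
    Not part of the original answer, but since the question asks how to keep the partitioning from changing: you can either lower spark.sql.shuffle.partitions before running the window, or repartition explicitly afterwards. A minimal sketch:

    // option 1: make shuffles (including the window's exchange) produce 4 partitions
    spark.conf.set("spark.sql.shuffle.partitions", "4")

    // option 2: bring the result back to 4 partitions after the window operation
    val df5 = df4.repartition(4)
    df5.rdd.getNumPartitions   // 4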
    
