How to use DataFrame Window expressions and withColumn without changing the partitioning?

故里飘歌 2021-01-25 07:08

For some reason I have to convert an RDD to a DataFrame, then do something with the DataFrame.

My interface is RDD, so I have
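
For illustration only, here is a minimal sketch of the scenario described above (the column name "value" and the running-sum expression are assumptions, not taken from the question): convert the RDD to a DataFrame, add a column with a window expression via withColumn, then hand an RDD back out.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.sum

    val spark = SparkSession.builder.appName("window-example").getOrCreate()
    import spark.implicits._

    // the incoming interface is an RDD
    val rdd = spark.sparkContext.parallelize(1 to 8, 4)

    // convert to a DataFrame and add a cumulative sum with a window expression
    val df  = rdd.toDF("value")
    val out = df.withColumn("csum", sum($"value").over(Window.orderBy($"value")))

    // convert back to an RDD for the rest of the pipeline
    val resultRdd = out.rdd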

2 Answers
  •  后悔当初
    2021-01-25 07:44

    Let's make this as simple as possible: we will generate the same data spread across 4 partitions:

    scala> val df = spark.range(1,9,1,4).toDF
    df: org.apache.spark.sql.DataFrame = [id: bigint]
    
    scala> df.show
    +---+
    | id|
    +---+
    |  1|
    |  2|
    |  3|
    |  4|
    |  5|
    |  6|
    |  7|
    |  8|
    +---+
    
    scala> df.rdd.getNumPartitions
    res13: Int = 4
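
    As an aside (not from the original answer): since the question's interface is an RDD, the same 4-partition DataFrame can just as well be built from an RDD; toDF preserves the RDD's partition count because no shuffle is involved.

    // same data, starting from an RDD that already has 4 partitions
    val rddDf = spark.sparkContext.parallelize(1L to 8L, 4).toDF("id")
    rddDf.rdd.getNumPartitions   // 4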
    

    We don't need 3 window functions to prove this, so let's do it with one:

    scala> import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.expressions.Window
    
    scala> val df2 = df.withColumn("csum", sum($"id").over(Window.orderBy($"id")))
    df2: org.apache.spark.sql.DataFrame = [id: bigint, csum: bigint]
    

    So what's happening here is that we didn't just add a column; we computed a cumulative sum over the data with a window. Since you haven't provided a partition column, the window function moves all the data to a single partition, and you even get a warning from Spark:

    scala> df2.rdd.getNumPartitions
    17/06/06 10:05:53 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
    res14: Int = 1
    
    scala> df2.show
    17/06/06 10:05:56 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
    +---+----+
    | id|csum|
    +---+----+
    |  1|   1|
    |  2|   3|
    |  3|   6|
    |  4|  10|
    |  5|  15|
    |  6|  21|
    |  7|  28|
    |  8|  36|
    +---+----+
    

    So let's now add a column to partition on. We will create a new DataFrame just for the sake of demonstration:

    scala> val df3 = df.withColumn("x", when($"id"<5,lit("a")).otherwise("b"))
    df3: org.apache.spark.sql.DataFrame = [id: bigint, x: string]
    

    It indeed has the same number of partitions that we defined explicitly on df:

    scala> df3.rdd.getNumPartitions
    res18: Int = 4
    

    Let's perform our window operation, partitioning by the column x:

    scala> val df4 = df3.withColumn("csum", sum($"id").over(Window.orderBy($"id").partitionBy($"x")))
    df4: org.apache.spark.sql.DataFrame = [id: bigint, x: string ... 1 more field]
    
    scala> df4.show
    +---+---+----+                                                                  
    | id|  x|csum|
    +---+---+----+
    |  5|  b|   5|
    |  6|  b|  11|
    |  7|  b|  18|
    |  8|  b|  26|
    |  1|  a|   1|
    |  2|  a|   3|
    |  3|  a|   6|
    |  4|  a|  10|
    +---+---+----+
    

    The window function repartitions our data using the default number of shuffle partitions set in the Spark configuration (spark.sql.shuffle.partitions, 200 by default):

    scala> df4.rdd.getNumPartitions
    res20: Int = 200
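
    Not part of the original answer, but since the question asks how to keep the partitioning from changing: you can either lower spark.sql.shuffle.partitions before running the window, or repartition explicitly afterwards. A minimal sketch:

    // option 1: make shuffles (including the window's exchange) produce 4 partitions
    spark.conf.set("spark.sql.shuffle.partitions", "4")

    // option 2: bring the result back to 4 partitions after the window operation
    val df5 = df4.repartition(4)
    df5.rdd.getNumPartitions   // 4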
    
