How to use collect_set and collect_list functions in windowed aggregation in Spark 1.6?

庸人自扰 2020-12-24 15:21

In Spark 1.6.0 / Scala, is there a way to compute collect_list("colC") or collect_set("colC") over Window.partitionBy("colA").orderBy("colB")?

1 answer
  •  有刺的猬
    2020-12-24 16:06

    Given a DataFrame such as

    +----+----+----+
    |colA|colB|colC|
    +----+----+----+
    |1   |1   |23  |
    |1   |2   |63  |
    |1   |3   |31  |
    |2   |1   |32  |
    |2   |2   |56  |
    +----+----+----+
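
    For reference, a minimal sketch of how a DataFrame like this could be built in Spark 1.6. The application name and local master are placeholders, not part of the original question. As far as I know, in 1.6 both window functions and collect_list / collect_set rely on Hive support, so the sketch assumes a HiveContext (with the spark-hive module on the classpath) rather than a plain SQLContext.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // In spark-shell, `sc` already exists; create one only in a standalone app.
    // The app name and master below are illustrative placeholders.
    val sc = new SparkContext(
      new SparkConf().setAppName("collect-window-example").setMaster("local[*]"))

    // Assumption: Spark 1.6 needs a HiveContext for window functions
    // and for collect_list / collect_set.
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    // Same rows as in the table above.
    val df = Seq(
      (1, 1, 23),
      (1, 2, 63),
      (1, 3, 31),
      (2, 1, 32),
      (2, 2, 56)
    ).toDF("colA", "colB", "colC")

    df.show(false)
    ```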
    

    You can apply window functions as follows:

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.expressions._
    df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)
    

    Result:

    +----+----+----+------------+
    |colA|colB|colC|colD        |
    +----+----+----+------------+
    |1   |1   |23  |[23]        |
    |1   |2   |63  |[23, 63]    |
    |1   |3   |31  |[23, 63, 31]|
    |2   |1   |32  |[32]        |
    |2   |2   |56  |[32, 56]    |
    +----+----+----+------------+
    

    collect_set gives a similar result, except that duplicates are dropped and the elements of each set are not guaranteed to follow the colB order, as they do with collect_list:

    df.withColumn("colD", collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)

    Result:

    +----+----+----+------------+
    |colA|colB|colC|colD        |
    +----+----+----+------------+
    |1   |1   |23  |[23]        |
    |1   |2   |63  |[63, 23]    |
    |1   |3   |31  |[63, 31, 23]|
    |2   |1   |32  |[32]        |
    |2   |2   |56  |[56, 32]    |
    +----+----+----+------------+
    

    If you remove orderBy, as below:

    df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA"))).show(false)
    

    every row gets the full list for its partition:

    +----+----+----+------------+
    |colA|colB|colC|colD        |
    +----+----+----+------------+
    |1   |1   |23  |[23, 63, 31]|
    |1   |2   |63  |[23, 63, 31]|
    |1   |3   |31  |[23, 63, 31]|
    |2   |1   |32  |[32, 56]    |
    |2   |2   |56  |[32, 56]    |
    +----+----+----+------------+
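
    The difference between the two outputs comes from the window frame. With orderBy, the default frame runs from the start of the partition up to the current row, which gives the running lists. Without orderBy, the frame is the whole partition. If you want the full list on every row while still ordering by colB, one option is to set the frame explicitly. This is a sketch, not something from the original question: the helper name withFullList is made up for illustration, and Long.MinValue / Long.MaxValue are used as the unbounded bounds because the named Window.unboundedPreceding constants only appeared in later Spark versions.

    ```scala
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.collect_list

    // Hypothetical helper: adds colD holding all colC values of the partition.
    def withFullList(df: DataFrame): DataFrame = {
      // Sort each colA partition by colB, but let the frame cover every row:
      // Long.MinValue = unbounded preceding, Long.MaxValue = unbounded following.
      val fullPartition = Window
        .partitionBy("colA")
        .orderBy("colB")
        .rowsBetween(Long.MinValue, Long.MaxValue)

      df.withColumn("colD", collect_list("colC").over(fullPartition))
    }

    // Usage, with `df` being the DataFrame shown at the top of this answer:
    // withFullList(df).show(false)
    ```

    With this frame, each row should carry the complete list for its colA group, collected in colB order, rather than a list that grows row by row.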
    

    I hope the answer is helpful
