How to use window functions with DataFrames in PySpark?


Trying to figure out how to use window functions in PySpark. Here's an example of what I'd like to be able to do: simply count the number of times a user has an "event".

1 Answer

    It throws an exception because you pass a list of columns. The signature of DataFrame.select looks as follows:

    df.select(self, *cols)
    

    and an expression using a window function is a column like any other, so what you need here is something like this:

    from pyspark.sql import Window
    from pyspark.sql.functions import count

    w = Window.partitionBy("id").orderBy("dt")  # orderBy makes this a running count
    df.select("id", "dt", count("dt").over(w).alias("count")).show()
    
    ## +---+---+-----+
    ## | id| dt|count|
    ## +---+---+-----+
    ## |234|  0|    1|
    ## |456|  0|    1|
    ## |456|  1|    2|
    ## |456|  2|    3|
    ## |123|  0|    1|
    ## |123|  1|    2|
    ## +---+---+-----+
    
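    For reference, here is a minimal sketch of input data that would reproduce the output above (the literal values and the SparkSession setup are assumptions, not taken from the original question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical events data: one row per event, keyed by user id,
    # with dt serving as the ordering column.
    df = spark.createDataFrame(
        [(123, 0), (123, 1), (234, 0), (456, 0), (456, 1), (456, 2)],
        ["id", "dt"])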

    Generally speaking, Spark SQL window functions behave the same way as in any modern RDBMS.
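
    For comparison, a sketch of the same query expressed in SQL and run through spark.sql (the temporary view name events is illustrative, and this assumes Spark 2.x+ where createOrReplaceTempView is available):

    df.createOrReplaceTempView("events")  # "events" is an illustrative name
    spark.sql("""
        SELECT id, dt,
               count(dt) OVER (PARTITION BY id ORDER BY dt) AS count
        FROM events
    """).show()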
