Trying to figure out how to use window functions in PySpark. Here's an example of what I'd like to be able to do: simply count the number of times a user has an "event".
It throws an exception because you're passing a list of columns. The signature of DataFrame.select looks as follows:

df.select(self, *cols)

An expression using a window function is a column like any other, so what you need here is something like this:
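For context, here is a minimal, self-contained setup sketch. The sample data is inferred from the output shown below; creating a SparkSession this way (Spark 2.x+) is my assumption, not part of the original question:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import count

spark = SparkSession.builder.getOrCreate()

# Sample data inferred from the output below
df = spark.createDataFrame(
    [(234, 0), (456, 0), (456, 1), (456, 2), (123, 0), (123, 1)],
    ["id", "dt"],
)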
w = Window.partitionBy("id").orderBy("dt")  # per-user window ordered by dt; count over it yields a running count
df.select("id", "dt", count("dt").over(w).alias("count")).show()
## +---+---+-----+
## | id| dt|count|
## +---+---+-----+
## |234| 0| 1|
## |456| 0| 1|
## |456| 1| 2|
## |456| 2| 3|
## |123| 0| 1|
## |123| 1| 2|
## +---+---+-----+
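If you'd rather keep all existing columns and simply append the running count, withColumn over the same window is an equivalent spelling of the select above:

# Equivalent: append the windowed count as a new column
df.withColumn("count", count("dt").over(w)).show()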
Generally speaking, Spark SQL window functions behave the same way as window functions in any modern RDBMS.
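For example, the same running count can be written in plain SQL with an OVER clause. The view name here is just an illustration, and spark refers to the SparkSession from the setup sketch above:

df.createOrReplaceTempView("events")
spark.sql("""
    SELECT id, dt,
           COUNT(dt) OVER (PARTITION BY id ORDER BY dt) AS count
    FROM events
""").show()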