Using PySpark, I'd like to be able to group a Spark DataFrame, sort within each group, and then assign a row number starting from 0. For example, given:

Group Date
A     2000
A     2002
Use a window function:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date")))
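If you instead want the numbering to follow descending dates within each group, a minimal variant (using the same column names as the question's example) would be:

from pyspark.sql.window import Window
from pyspark.sql.functions import desc, row_number

# Number rows with the most recent Date first in each Group
df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy(desc("Date"))))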
The accepted solution almost has it right. Here is a version that matches the output requested in the question:
df = spark.createDataFrame([("A", 2000), ("A", 2002), ("A", 2007), ("B", 1999), ("B", 2015)], ["Group", "Date"])
df.show()
+-----+----+
|Group|Date|
+-----+----+
| A|2000|
| A|2002|
| A|2007|
| B|1999|
| B|2015|
+-----+----+
# accepted solution above
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date"))).show()
# output of the accepted solution
+-----+----+-------+
|Group|Date|row_num|
+-----+----+-------+
| B|1999| 1|
| B|2015| 2|
| A|2000| 1|
| A|2002| 2|
| A|2007| 3|
+-----+----+-------+
As you can see, row_number starts from 1, not 0, while the question asked for row_num to start from 0. A simple change fixes that:
df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date"))-1).show()
Output:
+-----+----+-------+
|Group|Date|row_num|
+-----+----+-------+
| B|1999| 0|
| B|2015| 1|
| A|2000| 0|
| A|2002| 1|
| A|2007| 2|
+-----+----+-------+
Then you can sort the result on the "Group" column in whatever order you want, as shown below. The above solution almost has it, but it is important to remember that row_number begins with 1, not 0.
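For example, a minimal sketch (assuming the same df as above) that also orders the final result by Group and Date:

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# 0-based row number within each Group, then a deterministic output order
w = Window.partitionBy("Group").orderBy("Date")
df.withColumn("row_num", row_number().over(w) - 1).orderBy("Group", "Date").show()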