PySpark - get row number for each row in a group

借酒劲吻你 2020-11-29 09:25

Using PySpark, I'd like to group a Spark DataFrame, sort each group, and then assign a row number within each group. So

Group    Date
  A      2000
  A      2002
          


        
2 Answers
  • 2020-11-29 09:59

    Use window function:

    from pyspark.sql.window import Window
    from pyspark.sql.functions import row_number
    
    df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date")))
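
To make the semantics concrete, here is a plain-Python sketch of what `row_number()` over `Window.partitionBy("Group").orderBy("Date")` computes. It is only an analogue for illustration, not how Spark executes it; the helper name `number_rows_within_groups` and the sample rows are assumptions made for this example.

```python
from itertools import groupby
from operator import itemgetter

# Sample (Group, Date) rows in arbitrary order; illustrative data only.
rows = [("A", 2002), ("B", 2015), ("A", 2000), ("B", 1999), ("A", 2007)]


def number_rows_within_groups(rows, start=1):
    """Hypothetical helper: number rows per group, ordered by date."""
    numbered = []
    # Sorting by (group, date) places each group's rows together in date order.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _, members in groupby(ordered, key=itemgetter(0)):
        # The counter restarts for every group, like a window partition.
        for n, (group, date) in enumerate(members, start=start):
            numbered.append((group, date, n))
    return numbered


print(number_rows_within_groups(rows))
```

Numbering restarts for each group and begins at 1 by default, which matches the behaviour of `row_number`.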
    
  • 2020-11-29 10:15

    The accepted solution almost has it right. Here is the solution based on the output requested in the question:

    df = spark.createDataFrame([("A", 2000), ("A", 2002), ("A", 2007), ("B", 1999), ("B", 2015)], ["Group", "Date"])
    
    +-----+----+
    |Group|Date|
    +-----+----+
    |    A|2000|
    |    A|2002|
    |    A|2007|
    |    B|1999|
    |    B|2015|
    +-----+----+
    
    # accepted solution above
    from pyspark.sql.window import Window
    from pyspark.sql.functions import row_number

    df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date"))).show()

    # output of the accepted solution above

    +-----+----+-------+
    |Group|Date|row_num|
    +-----+----+-------+
    |    B|1999|      1|
    |    B|2015|      2|
    |    A|2000|      1|
    |    A|2002|      2|
    |    A|2007|      3|
    +-----+----+-------+
    

    As you can see, row_number starts at 1, not 0, whereas the question asked for row_num to start at 0. Subtracting 1 from the result fixes that:

    df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date"))-1).show()
    

    Output:

    +-----+----+-------+
    |Group|Date|row_num|
    +-----+----+-------+
    |    B|1999|      0|
    |    B|2015|      1|
    |    A|2000|      0|
    |    A|2002|      1|
    |    A|2007|      2|
    +-----+----+-------+
    

    You can then sort by the "Group" column in whatever order you want. The accepted solution is nearly right; the key point is that row_number starts at 1, not 0.
