Python Spark Cumulative Sum by Group Using DataFrame

遥遥无期 2020-12-02 22:54

How do I compute a cumulative sum per group in PySpark, specifically using the DataFrame abstraction?

With an example DataFrame such as the one sketched below.
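For illustration, a DataFrame matching the output shown in the answers can be built like this (a minimal sketch; the columns and rows are inferred from the answers below, since the original sample was cut off):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Columns and rows inferred from the cum_sum output in the answers
    df = spark.createDataFrame(
        [(1, 3, 'b'), (2, 3, 'b'), (1, 2, 'a'), (2, 2, 'a'), (3, 2, 'a')],
        ['time', 'value', 'class'])
    df.show()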

2 Answers
  • 2020-12-02 23:21

    I tried it this way and it worked for me.

    from pyspark.sql import Window
    from pyspark.sql import functions as f
    import sys

    # Running sum of 'value' within each 'class', ordered by 'time'.
    # rowsBetween(-sys.maxsize, 0) frames the window from the start of
    # the partition up to the current row.
    cum_sum = DF.withColumn(
        'cumsum',
        f.sum('value').over(Window.partitionBy('class')
                            .orderBy('time')
                            .rowsBetween(-sys.maxsize, 0)))
    cum_sum.show()
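
    On Spark 2.1 and later, the same frame can be expressed with the built-in boundary constants instead of -sys.maxsize; a sketch assuming the same DF, f, and Window as above:

    cum_sum = DF.withColumn(
        'cumsum',
        f.sum('value').over(Window.partitionBy('class')
                            .orderBy('time')
                            .rowsBetween(Window.unboundedPreceding,
                                         Window.currentRow)))
    cum_sum.show()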
    
  • 2020-12-02 23:47

    This can be done using a combination of a window function and the Window.unboundedPreceding value in the window's range as follows:

    from pyspark.sql import Window
    from pyspark.sql import functions as F
    
    windowval = (Window.partitionBy('class').orderBy('time')
                 .rangeBetween(Window.unboundedPreceding, 0))
    df_w_cumsum = df.withColumn('cum_sum', F.sum('value').over(windowval))
    df_w_cumsum.show()
    
    +----+-----+-----+-------+
    |time|value|class|cum_sum|
    +----+-----+-----+-------+
    |   1|    3|    b|      3|
    |   2|    3|    b|      6|
    |   1|    2|    a|      2|
    |   2|    2|    a|      4|
    |   3|    2|    a|      6|
    +----+-----+-----+-------+
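
    A side note on the frame choice (an observation, not part of the original answer): when orderBy is specified and no frame is given, Spark already defaults to rangeBetween(Window.unboundedPreceding, Window.currentRow), so the explicit call mainly makes the frame obvious. With rangeBetween, rows that tie on time within a class receive the same cumulative sum; for a strictly row-by-row running total, rowsBetween can be used instead, as in this sketch that assumes the same df, F, and Window as above:

    # rowsBetween counts physical rows, so tied 'time' values within a class
    # still accumulate one row at a time rather than as a group of peers.
    windowval_rows = (Window.partitionBy('class').orderBy('time')
                      .rowsBetween(Window.unboundedPreceding, Window.currentRow))
    df.withColumn('cum_sum', F.sum('value').over(windowval_rows)).show()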
    