How to calculate rolling sum with varying window sizes in PySpark

隐瞒了意图╮ 2021-02-10 04:21

I have a Spark dataframe that contains sales prediction data for some products in some stores over a time period. How do I calculate the rolling sum of Prediction over a window whose size varies per row, given by the value in column N?
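
For reference, here is a minimal Scala sketch of the sample data implied by the answer below (column names and values are taken from its output tables; it assumes a running SparkSession named "spark"):

    import spark.implicits._

    // Sample data matching the output tables in the answer below.
    // N is the per-row window size: the number of rows (days) to sum,
    // including the current row.
    val df = Seq(
      (1, 100, "2019-07-01", 0.92, 2),
      (1, 100, "2019-07-02", 0.62, 2),
      (1, 100, "2019-07-03", 0.89, 2),
      (1, 100, "2019-07-04", 0.57, 2),
      (2, 200, "2019-07-01", 1.39, 3),
      (2, 200, "2019-07-02", 1.22, 3),
      (2, 200, "2019-07-03", 1.33, 3),
      (2, 200, "2019-07-04", 1.61, 3)
    ).toDF("ProductId", "StoreId", "Date", "Prediction", "N")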

2 Answers
  •  暗喜 (OP)
     2021-02-10 04:39

    It might not be the best approach, but you can collect the distinct "N" column values and loop over them, as below.

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.expressions.Window

    // Collect the distinct window sizes in column N.
    val arr = df.select("N").distinct.collect

    // For each N: sum the current row plus the next N-1 rows, in Date order.
    for (n <- arr) df.filter(col("N") === n.get(0))
      .withColumn("RollingSum", sum(col("Prediction"))
        .over(Window.partitionBy("N").orderBy("Date")
          .rowsBetween(Window.currentRow, n.get(0).toString.toLong - 1)))
      .show


    This gives output like the following:

    +---------+-------+----------+----------+---+------------------+
    |ProductId|StoreId|      Date|Prediction|  N|        RollingSum|
    +---------+-------+----------+----------+---+------------------+
    |        2|    200|2019-07-01|      1.39|  3|              3.94|
    |        2|    200|2019-07-02|      1.22|  3|              4.16|
    |        2|    200|2019-07-03|      1.33|  3|2.9400000000000004|
    |        2|    200|2019-07-04|      1.61|  3|              1.61|
    +---------+-------+----------+----------+---+------------------+
    
    +---------+-------+----------+----------+---+----------+
    |ProductId|StoreId|      Date|Prediction|  N|RollingSum|
    +---------+-------+----------+----------+---+----------+
    |        1|    100|2019-07-01|      0.92|  2|      1.54|
    |        1|    100|2019-07-02|      0.62|  2|      1.51|
    |        1|    100|2019-07-03|      0.89|  2|      1.46|
    |        1|    100|2019-07-04|      0.57|  2|      0.57|
    +---------+-------+----------+----------+---+----------+
    

    Then you can union all of the per-N dataframes produced inside the loop into a single result, as in the sketch below.
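
    If you want one dataframe rather than a separate show per iteration, a minimal sketch of that union step could look like this (the "results" and "combined" names are illustrative, not from the original answer):

    // Build each per-N dataframe, then union them into one result.
    val results = arr.map { n =>
      df.filter(col("N") === n.get(0))
        .withColumn("RollingSum", sum(col("Prediction"))
          .over(Window.partitionBy("N").orderBy("Date")
            .rowsBetween(Window.currentRow, n.get(0).toString.toLong - 1)))
    }
    val combined = results.reduce(_ union _)
    combined.show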
