pyspark: rolling average using timeseries data

Asked by 时光取名叫无心 on 2020-12-02 11:38 · 4 answers · 967 views

I have a dataset consisting of a timestamp column and a dollars column. I would like to find the average number of dollars per week ending at the timestamp of each row.

4 Answers
  • 2020-12-02 11:55

    I figured out the correct way to calculate a moving/rolling average using this Stack Overflow question:

    Spark Window Functions - rangeBetween dates

    The basic idea is to convert your timestamp column to seconds, and then you can use the rangeBetween function in the pyspark.sql.Window class to include the correct rows in your window.

    Here's the solved example:

    %pyspark
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window
    
    
    #function to calculate number of seconds from number of days
    days = lambda i: i * 86400
    
    df = spark.createDataFrame([(17, "2017-03-10T15:27:18+00:00"),
                            (13, "2017-03-15T12:27:18+00:00"),
                            (25, "2017-03-18T11:27:18+00:00")],
                            ["dollars", "timestampGMT"])
    df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
    
    #create window by casting timestamp to long (number of seconds)
    w = (Window.orderBy(F.col("timestampGMT").cast('long')).rangeBetween(-days(7), 0))
    
    df = df.withColumn('rolling_average', F.avg("dollars").over(w))
    

    This results in the exact column of rolling averages that I was looking for:

    dollars   timestampGMT            rolling_average
    17        2017-03-10 15:27:18.0   17.0
    13        2017-03-15 12:27:18.0   15.0
    25        2017-03-18 11:27:18.0   19.0
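
    As a side note (my addition, not part of the original answer): the cast to long can also be written with unix_timestamp, which makes the "seconds" unit explicit. A minimal sketch, assuming the df, days, F and Window defined above:

    #equivalent window using unix_timestamp instead of casting to long
    w2 = (Window.orderBy(F.unix_timestamp("timestampGMT")).rangeBetween(-days(7), 0))

    df.withColumn('rolling_average', F.avg("dollars").over(w2)).show(truncate=False)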
    
  • 2020-12-02 12:04

    I will add a variation which I personally found very useful; I hope someone else will find it useful as well:

    If you want to group by a key and then calculate the moving average within each group:

    Example dataframe:

    from pyspark.sql.window import Window
    from pyspark.sql import functions as F
    
    
    df = spark.createDataFrame([("tshilidzi", 17.00, "2018-03-10T15:27:18+00:00"), 
      ("tshilidzi", 13.00, "2018-03-11T12:27:18+00:00"),   
      ("tshilidzi", 25.00, "2018-03-12T11:27:18+00:00"), 
      ("thabo", 20.00, "2018-03-13T15:27:18+00:00"), 
      ("thabo", 56.00, "2018-03-14T12:27:18+00:00"), 
      ("thabo", 99.00, "2018-03-15T11:27:18+00:00"), 
      ("tshilidzi", 156.00, "2019-03-22T11:27:18+00:00"), 
      ("thabo", 122.00, "2018-03-31T11:27:18+00:00"), 
      ("tshilidzi", 7000.00, "2019-04-15T11:27:18+00:00"),
      ("ash", 9999.00, "2018-04-16T11:27:18+00:00") 
      ],
      ["name", "dollars", "timestampGMT"])
    
    # cast timestampGMT from string to timestamp; the Window below converts it to seconds
    df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
    
    df.show(10000, False)
    

    Output:

    +---------+-------+---------------------+
    |name     |dollars|timestampGMT         |
    +---------+-------+---------------------+
    |tshilidzi|17.0   |2018-03-10 17:27:18.0|
    |tshilidzi|13.0   |2018-03-11 14:27:18.0|
    |tshilidzi|25.0   |2018-03-12 13:27:18.0|
    |thabo    |20.0   |2018-03-13 17:27:18.0|
    |thabo    |56.0   |2018-03-14 14:27:18.0|
    |thabo    |99.0   |2018-03-15 13:27:18.0|
    |tshilidzi|156.0  |2019-03-22 13:27:18.0|
    |thabo    |122.0  |2018-03-31 13:27:18.0|
    |tshilidzi|7000.0 |2019-04-15 13:27:18.0|
    |ash      |9999.0 |2018-04-16 13:27:18.0|
    +---------+-------+---------------------+
    

    To calculate the moving average partitioned by name while still keeping all rows:

    #function to calculate number of seconds from number of days
    days = lambda i: i * 86400

    #create window partitioned by name, ordered by timestamp cast to long (number of seconds)
    w = (Window
         .partitionBy(F.col("name"))
         .orderBy(F.col("timestampGMT").cast('long'))
         .rangeBetween(-days(7), 0))

    df2 = df.withColumn('rolling_average', F.avg("dollars").over(w))
    
    df2.show(100, False)
    

    Output:

    +---------+-------+---------------------+------------------+
    |name     |dollars|timestampGMT         |rolling_average   |
    +---------+-------+---------------------+------------------+
    |ash      |9999.0 |2018-04-16 13:27:18.0|9999.0            |
    |tshilidzi|17.0   |2018-03-10 17:27:18.0|17.0              |
    |tshilidzi|13.0   |2018-03-11 14:27:18.0|15.0              |
    |tshilidzi|25.0   |2018-03-12 13:27:18.0|18.333333333333332|
    |tshilidzi|156.0  |2019-03-22 13:27:18.0|156.0             |
    |tshilidzi|7000.0 |2019-04-15 13:27:18.0|7000.0            |
    |thabo    |20.0   |2018-03-13 17:27:18.0|20.0              |
    |thabo    |56.0   |2018-03-14 14:27:18.0|38.0              |
    |thabo    |99.0   |2018-03-15 13:27:18.0|58.333333333333336|
    |thabo    |122.0  |2018-03-31 13:27:18.0|122.0             |
    +---------+-------+---------------------+------------------+
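
    The same per-name window also works for other aggregates. A small sketch (my addition, reusing the w, df and F from the snippets above) that adds a rolling sum and row count next to the average:

    #additional rolling aggregates over the same partitioned window
    df3 = (df
           .withColumn('rolling_average', F.avg("dollars").over(w))
           .withColumn('rolling_sum', F.sum("dollars").over(w))
           .withColumn('rolling_count', F.count("dollars").over(w)))

    df3.show(100, False)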
    
  • 2020-12-02 12:09

    It's worth noting that if you don't care about the exact dates, but just want the average over a fixed number of most recent rows, you can use the rowsBetween function as follows:

    #row-based window: the current row plus the 7 preceding rows, ordered by date
    w = Window.orderBy('timestampGMT').rowsBetween(-7, 0)

    df = eurPrices.withColumn('rolling_average', F.avg('dollars').over(w))
    

    Since you order by the dates, the window takes the 7 preceding rows plus the current row, and you save all the casting.
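
    A self-contained sketch of this row-based approach (the imports and the df here are my assumptions; note that rowsBetween(-7, 0) counts rows, not days, regardless of how far apart their dates are):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    #df is assumed to have 'timestampGMT' and 'dollars' columns, as in the answers above
    w_rows = Window.orderBy('timestampGMT').rowsBetween(-7, 0)

    df.withColumn('rolling_average', F.avg('dollars').over(w_rows)).show(truncate=False)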

  • 2020-12-02 12:13

    Do you mean this:

    from pyspark.sql import functions as f
    from pyspark.sql.window import Window

    df = spark.createDataFrame([(17, "2017-03-11T15:27:18+00:00"),
                                (13, "2017-03-11T12:27:18+00:00"),
                                (21, "2017-03-17T11:27:18+00:00")],
                               ["dollars", "timestampGMT"])
    df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
    #average within fixed 7-day windows produced by f.window()
    w = Window.partitionBy(f.window("timestampGMT", "7 days"))
    df = df.withColumn('rolling_average', f.avg("dollars").over(w))
    

    Output:

    +-------+-------------------+---------------+                                   
    |dollars|timestampGMT       |rolling_average|
    +-------+-------------------+---------------+
    |21     |2017-03-17 19:27:18|21.0           |
    |17     |2017-03-11 23:27:18|15.0           |
    |13     |2017-03-11 20:27:18|15.0           |
    +-------+-------------------+---------------+
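
    Note that f.window("timestampGMT", "7 days") groups rows into fixed 7-day buckets rather than a trailing window ending at each row, which is why both 2017-03-11 rows share the same average. A small sketch (my addition) to inspect those buckets, assuming the df and imports above:

    #each row's bucket is a struct with the window's start and end timestamps
    df.select("dollars", "timestampGMT",
              f.window("timestampGMT", "7 days").alias("bucket")).show(truncate=False)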
    