Apache Spark - Dealing with Sliding Windows on Temporal RDDs

夕颜 2021-02-01 07:15

I've been working quite a lot with Apache Spark over the last few months, but I have now received a pretty difficult task: computing average/minimum/maximum etc. on a sliding wi

1 Answer
  • 2021-02-01 08:10

    If you convert to a DataFrame, this all gets a lot simpler: you can self-join the data and compute the average over the matching rows. Say I have a series of data like this:

    tsDF.show
    date       amount
    1970-01-01 10.0
    1970-01-01 5.0
    1970-01-01 7.0
    1970-01-02 14.0
    1970-01-02 13.9
    1970-01-03 1.0
    1970-01-03 5.0
    1970-01-03 9.0
    1970-01-04 9.0
    1970-01-04 5.8
    1970-01-04 2.8
    1970-01-04 8.9
    1970-01-05 8.1
    1970-01-05 2.1
    1970-01-05 2.78
    1970-01-05 20.78
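
    To follow along, a frame like the one above can be built from a local collection. This is only a sketch: it assumes Spark 2.0 or later (for SparkSession), a local master, and the app name rolling-avg-sketch, none of which come from the original setup.

    ```scala
    import java.sql.Date
    import org.apache.spark.sql.SparkSession

    // Assumption: Spark 2.0+ running locally; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rolling-avg-sketch")
      .getOrCreate()
    import spark.implicits._

    // The same 16 records as in the listing above.
    val tsDF = Seq(
      ("1970-01-01", 10.0), ("1970-01-01", 5.0), ("1970-01-01", 7.0),
      ("1970-01-02", 14.0), ("1970-01-02", 13.9),
      ("1970-01-03", 1.0), ("1970-01-03", 5.0), ("1970-01-03", 9.0),
      ("1970-01-04", 9.0), ("1970-01-04", 5.8), ("1970-01-04", 2.8), ("1970-01-04", 8.9),
      ("1970-01-05", 8.1), ("1970-01-05", 2.1), ("1970-01-05", 2.78), ("1970-01-05", 20.78)
    ).map { case (day, amount) => (Date.valueOf(day), amount) }
      .toDF("date", "amount")
    ```

    Parsing the strings with Date.valueOf gives the date column a real date type, which the date comparisons in the join further down rely on.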
    

    Which rolls up as:

    tsDF.groupBy($"date").agg($"date", sum($"amount"), count($"date")).show
    date       SUM(amount) COUNT(date)
    1970-01-01 22.0        3
    1970-01-02 27.9        2
    1970-01-03 15.0        3
    1970-01-04 26.5        4
    1970-01-05 33.76       4
    

    I then need a UDF that shifts the date for the join condition (note that offset = -2 gives a 2-day window: the current date and the day before it):

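    import java.util.Calendar
    import org.apache.spark.sql.functions.udf

    // Shift a date back by two days; the result is the exclusive lower bound of the window.
    def dateShift(myDate: java.sql.Date): java.sql.Date = {
      val offset = -2
      val cal = Calendar.getInstance
      cal.setTime(myDate)
      cal.add(Calendar.DATE, offset)
      new java.sql.Date(cal.getTime.getTime)
    }
    // Wrap the function as a UDF so it can be applied to a Column.
    val udfDateShift = udf[java.sql.Date, java.sql.Date](dateShift)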
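
    The UDF above hard-codes the offset. As a sketch of a variant with a configurable window, assuming Java 8 or later for java.time, the shift can be written without Calendar. The name shiftDays and its days parameter are made up here for illustration.

    ```scala
    import java.sql.Date

    // Hypothetical helper: move a date by `days` (negative values go back in time).
    def shiftDays(date: Date, days: Int): Date =
      Date.valueOf(date.toLocalDate.plusDays(days.toLong))

    // Shifting by -2 matches the offset used in dateShift above.
    val shifted = shiftDays(Date.valueOf("1970-01-03"), -2)
    println(shifted) // prints 1970-01-01
    ```

    It can be wrapped the same way, for example udf((d: java.sql.Date) => shiftDays(d, -2)). On Spark 1.5 or later, the built-in date_sub function in org.apache.spark.sql.functions should make the UDF unnecessary.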
    

    And then I could easily find a 2-day rolling average like this:

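    // Note: this was written against an early DataFrame API in which agg() did not
    // retain the grouping column. From Spark 1.4 on, grouping columns are kept
    // automatically, so the extra $"date" passed to agg can be dropped.
    val windowDF = tsDF.select($"date")
      .groupBy($"date")
      .agg($"date")             // one row per distinct date
      .join(
        tsDF.select($"date" as "r_date", $"amount" as "r_amount"),
        // keep records in the half-open interval (date - 2 days, date]
        $"r_date" > udfDateShift($"date") and $"r_date" <= $"date"
      )
      .groupBy($"date")
      .agg($"date", avg($"r_amount") as "2 day avg amount / record")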
    
    windowDF.show
    date       2 day avg amount / record
    1970-01-01 7.333333333333333
    1970-01-02 9.98
    1970-01-03 8.58
    1970-01-04 5.928571428571429
    1970-01-05 7.5325
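
    The figures above can be cross-checked without Spark. The sketch below uses plain Scala collections rather than the DataFrame method; it applies the same rule as the join condition (records with date - 2 days < r_date <= date). The names records, windowDays and rolling are introduced here for illustration.

    ```scala
    import java.time.LocalDate

    // Same 16 records as tsDF, held in a local collection.
    val records: Seq[(LocalDate, Double)] = Seq(
      ("1970-01-01", 10.0), ("1970-01-01", 5.0), ("1970-01-01", 7.0),
      ("1970-01-02", 14.0), ("1970-01-02", 13.9),
      ("1970-01-03", 1.0), ("1970-01-03", 5.0), ("1970-01-03", 9.0),
      ("1970-01-04", 9.0), ("1970-01-04", 5.8), ("1970-01-04", 2.8), ("1970-01-04", 8.9),
      ("1970-01-05", 8.1), ("1970-01-05", 2.1), ("1970-01-05", 2.78), ("1970-01-05", 20.78)
    ).map { case (day, amount) => (LocalDate.parse(day), amount) }

    val windowDays = 2

    // For each distinct date, average every record in (date - windowDays, date].
    val rolling: Seq[(LocalDate, Double)] =
      records.map(_._1).distinct.sortBy(_.toEpochDay).map { date =>
        val lowerBound = date.minusDays(windowDays.toLong)
        val inWindow = records.collect {
          case (day, amount) if day.isAfter(lowerBound) && !day.isAfter(date) => amount
        }
        (date, inWindow.sum / inWindow.size)
      }

    rolling.foreach { case (date, mean) => println(s"$date $mean") }
    ```

    The printed means should agree with the table above up to floating-point rounding. For example, 1970-01-02 averages the five records from 01-01 and 01-02, which is 49.9 / 5.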
    

    While this isn't exactly what you were trying to do, you see how you can use a DataFrame self-join to extract running averages from a data set. Hope you found this helpful.
