I've been working quite a lot with Apache Spark over the last few months, but now I have received a pretty difficult task: to compute average/minimum/maximum etc. over a sliding window of timestamped data.
If you convert to a DataFrame, this all gets a lot simpler -- you can just self-join the data back on itself and find the average. Say I have a series of data like this:
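(For reference, here is one minimal way such a DataFrame could be built -- the toDF/cast approach below is just an assumption on my part, and spark.implicits._ presumes a SparkSession named spark:)

import spark.implicits._

// Build the sample data as (date string, amount) pairs, then cast to a real date.
val tsDF = Seq(
  ("1970-01-01", 10.0), ("1970-01-01", 5.0), ("1970-01-01", 7.0),
  ("1970-01-02", 14.0), ("1970-01-02", 13.9),
  ("1970-01-03", 1.0), ("1970-01-03", 5.0), ("1970-01-03", 9.0),
  ("1970-01-04", 9.0), ("1970-01-04", 5.8), ("1970-01-04", 2.8), ("1970-01-04", 8.9),
  ("1970-01-05", 8.1), ("1970-01-05", 2.1), ("1970-01-05", 2.78), ("1970-01-05", 20.78)
).toDF("date_str", "amount")
 .select($"date_str".cast("date") as "date", $"amount")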
tsDF.show
date amount
1970-01-01 10.0
1970-01-01 5.0
1970-01-01 7.0
1970-01-02 14.0
1970-01-02 13.9
1970-01-03 1.0
1970-01-03 5.0
1970-01-03 9.0
1970-01-04 9.0
1970-01-04 5.8
1970-01-04 2.8
1970-01-04 8.9
1970-01-05 8.1
1970-01-05 2.1
1970-01-05 2.78
1970-01-05 20.78
Which rolls up as:
tsDF.groupBy($"date").agg($"date", sum($"amount"), count($"date")).show
date SUM(amount) COUNT(date)
1970-01-01 22.0 3
1970-01-02 27.9 2
1970-01-03 15.0 3
1970-01-04 26.5 4
1970-01-05 33.76 4
I would then need to create a UDF to shift the date for the join condition (note that I'm only using a two-day window, via offset = -2):
import java.util.Calendar
import org.apache.spark.sql.functions.udf

// Shift a date back by two days; this becomes the lower bound of the window.
def dateShift(myDate: java.sql.Date): java.sql.Date = {
  val offset = -2
  val cal = Calendar.getInstance
  cal.setTime(myDate)
  cal.add(Calendar.DATE, offset)
  new java.sql.Date(cal.getTime.getTime)
}
val udfDateShift = udf[java.sql.Date, java.sql.Date](dateShift)
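(As an aside -- if you happen to be on Spark 1.5 or later, the built-in date_sub function can replace this UDF entirely. Just an alternative; the rest of this answer sticks with the UDF:)

import org.apache.spark.sql.functions.date_sub

// Equivalent join condition with no UDF:
// $"r_date" > date_sub($"date", 2) and $"r_date" <= $"date"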
And then I could easily find a 2-day rolling average like this:
val windowDF = tsDF
  .select($"date")
  .distinct  // one row per date on the left side of the self-join
  .join(
    tsDF.select($"date" as "r_date", $"amount" as "r_amount"),
    // each date picks up every record from itself and the previous day
    $"r_date" > udfDateShift($"date") and $"r_date" <= $"date"
  )
  .groupBy($"date")
  .agg(avg($"r_amount") as "2 day avg amount / record")
windowDF.show
date 2 day avg amount / record
1970-01-01 7.333333333333333
1970-01-02 9.98
1970-01-03 8.58
1970-01-04 5.928571428571429
1970-01-05 7.5325
While this isn't exactly what you were trying to do, it shows how you can use a DataFrame self-join to extract running averages from a data set. Hope you found this helpful.
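(One last footnote: on newer Spark releases -- window functions arrived in 1.4 and datediff in 1.5 -- you can express the same rolling computation without a self-join. A rough sketch under those version assumptions; the column names day/total/n are mine, not anything standard:)

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Daily totals and record counts, as in the roll-up above.
val daily = tsDF.groupBy($"date")
  .agg(sum($"amount") as "total", count($"amount") as "n")

// Range over the current day and the previous one; dates are turned into
// day numbers so rangeBetween can work on a numeric ordering.
val w = Window.orderBy($"day").rangeBetween(-1, 0)

val rolled = daily
  .withColumn("day", datediff($"date", lit("1970-01-01")))
  .withColumn("2 day avg amount / record",
    sum($"total").over(w) / sum($"n").over(w))

rolled.show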