Mapping timeseries data to previous datapoints and averages

臣服心动 2021-01-16 03:30

If I have an RDD with volumes per minute, e.g.

((\"12:00\" -> 124), (\"12:01\" -> 543), (\"12:02\" -> 102), ... )

And I want to map each datapoint to a tuple of (volume, previous minute's volume, average volume over the last five minutes).

1 Answer
  • 2021-01-16 04:05

    Converting to a DataFrame and using window functions can cover the lag, the average, and possible gaps in the data:

    import com.github.nscala_time.time.Imports._
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{lag, avg, when}
    import org.apache.spark.sql.expressions.Window
    // Needed for $"..." and toDF; assumes a SparkSession named spark (as in spark-shell)
    import spark.implicits._
    
    val fmt = DateTimeFormat.forPattern("HH:mm:ss")
    
    val rdd = sc.parallelize(Seq(
      ("12:00:00" -> 124), ("12:01:00" -> 543), ("12:02:00" -> 102),
      ("12:30:00" -> 100), ("12:31:00" -> 101)
    ).map{case (ds, vol) => (fmt.parseDateTime(ds), vol)})
    
    val df = rdd
      // Convert to millis for window range
      .map{case (dt, vol) => (dt.getMillis, vol)} 
      .toDF("ts", "volume")
    
    val w = Window.orderBy($"ts")
    
    val transformed = df.select(
      $"ts", $"volume",
      when(
        // Check whether the previous row is exactly one minute (60000 ms) earlier
        (lag($"ts", 1).over(w) - $"ts").equalTo(-60000),
        // If so, take the lagged volume, otherwise default to 0
        lag($"volume", 1).over(w)).otherwise(0).alias("previous_volume"),
      // Average over the trailing five minutes (300000 ms)
      avg($"volume").over(w.rangeBetween(-300000, 0)).alias("average"))
    
    // Optionally go back to an RDD
    transformed.rdd.map{
      case Row(ts: Long, volume: Int, previousVolume: Int, average: Double) =>
        (new DateTime(ts) -> (volume, previousVolume, average))
    }
    
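    For the sample input above, and with the JVM time zone set to UTC (so "12:00:00" parses to 43200000 ms; only the ts values depend on the zone), transformed.show() should print something along these lines:

    +--------+------+---------------+-----------------+
    |      ts|volume|previous_volume|          average|
    +--------+------+---------------+-----------------+
    |43200000|   124|              0|            124.0|
    |43260000|   543|            124|            333.5|
    |43320000|   102|            543|256.3333333333333|
    |45000000|   100|              0|            100.0|
    |45060000|   101|            100|            100.5|
    +--------+------+---------------+-----------------+

    Note how the gap between 12:02 and 12:30 is handled: previous_volume falls back to 0, and the average only covers the rows inside the five-minute range.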

    Just be aware that window functions without a PARTITION BY clause, like the one above, move all the data into a single partition, which is quite inefficient on large datasets.
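    If the data spans several days, one way around that is to partition the window by calendar day. This is only a sketch (the date column is a hypothetical helper, not part of the answer above), and it accepts that lag returns null at day boundaries, which the when/otherwise above already turns into 0:

    import org.apache.spark.sql.functions.{from_unixtime, to_date}

    // Derive a day column from the millisecond timestamp (hypothetical helper)
    val withDate = df.withColumn("date",
      to_date(from_unixtime(($"ts" / 1000).cast("long"))))

    // Partitioned window: each day is processed independently instead of
    // shuffling every row into a single partition
    val wDay = Window.partitionBy($"date").orderBy($"ts")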
