How to get Last 1 hour data, every 5 minutes, without grouping?

Submitted by 妖精的绣舞 on 2020-12-30 02:59:06

Question


How can I trigger every 5 minutes and get data for the last 1 hour? I came up with the following, but it does not seem to give me all the rows from the last hour. My reasoning is:

  1. Read the stream,

  2. filter data from the last 1 hour based on the timestamp column,

  3. write/print it using foreachBatch, and

  4. watermark it so that it does not hold on to all the past data.

     spark.readStream.format("delta").table("xxx")
       .withWatermark("ts", "60 minutes")
       .filter($"ts" > current_timestamp() - expr("INTERVAL 60 minutes"))
       .writeStream
       .format("console")
       .trigger(Trigger.ProcessingTime("5 minutes"))
       .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
         batchDF.collect().foreach(println)
       }
       .start()
    

Or do I have to use a window? But I can't seem to get rid of the groupBy if I use a window, and I don't want to group.

spark.readStream.format("delta").table("xxx")
  .withWatermark("ts", "1 hour")
  .groupBy(window($"ts", "1 hour"))
  .count()
  .writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("5 minutes"))
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    print("...entering foreachBatch...\n")
    batchDF.collect().foreach(println)
  }
  .start()

Answer 1:


Instead of using Spark Streaming to execute your Spark code every 5 minutes, you should use either an external scheduler (cron, etc.) or the java.util.Timer API if you want to schedule the processing from within your code.

Why you shouldn't use Spark Streaming to schedule Spark code execution

If you use Spark Streaming to schedule code, you will run into two issues.

First issue: Spark Streaming processes each input record only once. So every 5 minutes, only the new records are loaded. You could think of bypassing this by using a window function and retrieving an aggregated list of rows with collect_list or a user-defined aggregate function (see the sketch below), but then you will hit the second issue.
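For illustration only, a minimal sketch of that collect_list workaround, assuming the same Delta table xxx and timestamp column ts as in the question; the sliding window, output mode, and aggregated column are choices made for this sketch, and, as explained next, the job still produces output only when new records arrive:

import org.apache.spark.sql.functions.{collect_list, window}
import org.apache.spark.sql.streaming.Trigger
import spark.implicits._

// Sliding 1-hour window, advanced every 5 minutes; rows are gathered into a
// list instead of being counted, so the aggregation itself drops nothing.
spark.readStream.format("delta").table("xxx")
  .withWatermark("ts", "1 hour")
  .groupBy(window($"ts", "1 hour", "5 minutes"))
  .agg(collect_list($"ts").as("timestamps")) // or collect_list over a struct of all columns
  .writeStream
  .format("console")
  .outputMode("update")
  .trigger(Trigger.ProcessingTime("5 minutes"))
  .start()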

Second issue: although your trigger fires every 5 minutes, the function inside foreachBatch is executed only if there are new records to process. If no new records arrive during the 5-minute interval between two executions, nothing happens.

In conclusion, Spark Streaming is not designed to schedule Spark code to be executed at a specific time interval.

Solution with java.util.Timer

So instead of using Spark Streaming, you should use a scheduler, either an external one such as cron, Oozie, Airflow, etc., or one inside your code.
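If you go the external-scheduler route, the Spark code becomes an ordinary batch job that the scheduler launches (for example via spark-submit) every 5 minutes. A minimal sketch of such a job, assuming the same table and column names as in the question (the object name LastHourJob and the application name are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{current_timestamp, expr}

// Hypothetical batch job: cron / Oozie / Airflow runs it via spark-submit
// every 5 minutes, and each run reads the last hour of data from scratch.
object LastHourJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("last-hour-job").getOrCreate()
    import spark.implicits._

    spark.read.format("delta").table("xxx")
      .filter($"ts" > (current_timestamp() - expr("INTERVAL 60 minutes")))
      .show(truncate = false)

    spark.stop()
  }
}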

If you need to do it in your code, you can use java.util.Timer as below:

import org.apache.spark.sql.functions.{current_timestamp, expr}
import spark.implicits._

val t = new java.util.Timer()
val task = new java.util.TimerTask {
  def run(): Unit = {
    // Plain batch read: each run re-reads the table and keeps only the
    // rows whose timestamp falls within the last hour.
    spark.read.format("delta").table("xxx")
      .filter($"ts" > (current_timestamp() - expr("INTERVAL 60 minutes")))
      .collect()
      .foreach(println)
  }
}
t.schedule(task, 5*60*1000L, 5*60*1000L) // first run after 5 minutes, then every 5 minutes
task.run()                               // also run once immediately
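Note that java.util.Timer runs the task on a single background thread, so this only works while the driver process stays alive (for example in a notebook or a long-running application); call t.cancel() when you want to stop the schedule.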


Source: https://stackoverflow.com/questions/63640604/how-to-get-last-1-hour-data-every-5-minutes-without-grouping
