How to get Last 1 hour data, every 5 minutes, without grouping?

Submitted by 妖精的绣舞 on 2020-12-30 02:59:06

Question


How can I trigger every 5 minutes and get data for the last 1 hour? I came up with the following, but it does not seem to give me all the rows from the last hour. My reasoning is:

  1. Read the stream,

  2. filter data from the last 1 hour based on the timestamp column,

  3. write/print it using foreachBatch, and

  4. watermark it so that it does not hold on to all the past data.

     spark.readStream.format("delta").table("xxx")
       .withWatermark("ts", "60 minutes")
       .filter($"ts" > current_timestamp() - expr("INTERVAL 60 minutes"))
       .writeStream
       .format("console")
       .trigger(Trigger.ProcessingTime("5 minutes"))
       .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
         batchDF.collect().foreach(println)
       }
       .start()
    

Or do I have to use a window? But I can't seem to get rid of the groupBy if I use a window, and I don't want to group.

spark.readStream.format("delta").table("xxx")
  .withWatermark("ts", "1 hour")
  .groupBy(window($"ts", "1 hour"))
  .count()
  .writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("5 minutes"))
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    print("...entering foreachBatch...\n")
    batchDF.collect().foreach(println)
  }
  .start()

Answer 1:


Instead of using Spark Streaming to execute your Spark code every 5 minutes, you should use either an external scheduler (cron, etc.) or the java.util.Timer API if you want to schedule the processing from within your code.

Why you shouldn't use Spark Streaming to schedule Spark code execution

If you use Spark Streaming to schedule code, you will run into two issues.

First issue: Spark Streaming processes each input record only once. So every 5 minutes, only the new records are loaded. You could think of bypassing this by using a window function and retrieving an aggregated list of rows with collect_list or a user-defined aggregate function (see the sketch below), but then you will hit the second issue.
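For illustration only, a minimal sketch of that collect_list workaround, assuming the same Delta table xxx and timestamp column ts as in the question; the sliding window, output mode, and aggregated column are choices made for this sketch, and, as explained next, the job still produces output only when new records arrive:

import org.apache.spark.sql.functions.{collect_list, window}
import org.apache.spark.sql.streaming.Trigger
import spark.implicits._

// Sliding 1-hour window, advanced every 5 minutes; rows are gathered into a
// list instead of being counted, so the aggregation itself drops nothing.
spark.readStream.format("delta").table("xxx")
  .withWatermark("ts", "1 hour")
  .groupBy(window($"ts", "1 hour", "5 minutes"))
  .agg(collect_list($"ts").as("timestamps")) // or collect_list over a struct of all columns
  .writeStream
  .format("console")
  .outputMode("update")
  .trigger(Trigger.ProcessingTime("5 minutes"))
  .start()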

Second issue: although your trigger fires every 5 minutes, the function inside foreachBatch is executed only if there are new records to process. If no new records arrive during the 5-minute interval between two executions, nothing happens.

In conclusion, Spark Streaming is not designed to schedule Spark code to be executed at a specific time interval.

Solution with java.util.Timer

So instead of using Spark Streaming, you should use a scheduler, either an external one such as cron, Oozie, Airflow, etc., or one inside your code.
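If you go the external-scheduler route, the Spark code becomes an ordinary batch job that the scheduler launches (for example via spark-submit) every 5 minutes. A minimal sketch of such a job, assuming the same table and column names as in the question (the object name LastHourJob and the application name are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{current_timestamp, expr}

// Hypothetical batch job: cron / Oozie / Airflow runs it via spark-submit
// every 5 minutes, and each run reads the last hour of data from scratch.
object LastHourJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("last-hour-job").getOrCreate()
    import spark.implicits._

    spark.read.format("delta").table("xxx")
      .filter($"ts" > (current_timestamp() - expr("INTERVAL 60 minutes")))
      .show(truncate = false)

    spark.stop()
  }
}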

If you need to do it in your code, you can use java.util.Timer as below:

import org.apache.spark.sql.functions.{current_timestamp, expr}
import spark.implicits._

val t = new java.util.Timer()
val task = new java.util.TimerTask {
  def run(): Unit = {
    // Plain batch read: each run re-reads the table and keeps only the
    // rows whose timestamp falls within the last hour.
    spark.read.format("delta").table("xxx")
      .filter($"ts" > (current_timestamp() - expr("INTERVAL 60 minutes")))
      .collect()
      .foreach(println)
  }
}
t.schedule(task, 5*60*1000L, 5*60*1000L) // first run after 5 minutes, then every 5 minutes
task.run()                               // also run once immediately
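Note that java.util.Timer runs the task on a single background thread, so this only works while the driver process stays alive (for example in a notebook or a long-running application); call t.cancel() when you want to stop the schedule.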


Source: https://stackoverflow.com/questions/63640604/how-to-get-last-1-hour-data-every-5-minutes-without-grouping
