Is there a way to dynamically stop Spark Structured Streaming?

丶灬走出姿态 submitted on 2020-08-19 04:37:22

Question


In my scenario I have several datasets that arrive every now and then and that I need to ingest into our platform. The ingestion process involves several transformation steps, one of them being Spark; in particular, I use Spark Structured Streaming so far. The infrastructure also involves Kafka, from which Spark Structured Streaming reads the data.

I wonder if there is a way to detect that there has been nothing to consume from a topic for a while, in order to decide to stop the job. That is, I want to run the job for the time it takes to consume that specific dataset and then stop it. For specific reasons we decided not to use the batch version of Spark.

Hence, is there any timeout or similar mechanism that can be used to detect that no more data is coming in and that everything has been processed?

Thank you


Answer 1:


Structured Streaming Monitoring Options

You can use query.lastProgress to get the timestamp of the most recent progress update and build logic around that. Don't forget to save your checkpoint to a durable, persistent, available store.
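
A minimal sketch of that idea, assuming an idle-timeout policy: poll lastProgress and stop the query once no micro-batch has reported new rows for a configurable window. stopWhenIdle, idleTimeoutMs, and pollMs are illustrative names invented for this example; lastProgress, numInputRows, isActive, stop, and awaitTermination are part of the StreamingQuery / StreamingQueryProgress API.

import org.apache.spark.sql.streaming.StreamingQuery

// Stop a query once no micro-batch has reported new rows for idleTimeoutMs.
def stopWhenIdle(query: StreamingQuery, idleTimeoutMs: Long, pollMs: Long = 10000L): Unit = {
  var lastDataSeen = System.currentTimeMillis()
  while (query.isActive) {
    val progress = query.lastProgress // null until the first progress update
    if (progress != null && progress.numInputRows > 0) {
      lastDataSeen = System.currentTimeMillis()
    }
    if (System.currentTimeMillis() - lastDataSeen > idleTimeoutMs) {
      query.stop() // no new rows for the whole window: assume the dataset is consumed
    } else {
      query.awaitTermination(pollMs) // returns early if the query ends on its own
    }
  }
}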




Answer 2:


Putting together a couple of pieces of advice:

  1. As @Michael West pointed out, there are listeners to track progress
  2. From what I gather, Structured Streaming doesn't yet support graceful shutdown

So one option is to periodically check for query activity, dynamically shutting down depending on a configurable state (when you determine no further progress can/should be made):

// where you configure your spark job...
spark.streams.addListener(shutdownListener(spark))

// your job code starts here by calling "start()" on the stream...

// periodically await termination, checking for your shutdown state
while (!spark.sparkContext.isStopped) {
  if (shutdown) {
    println("Shutting down since first batch has completed...")
    spark.streams.active.foreach(_.stop())
    spark.stop()
  } else {
    // wait up to 10 seconds (or until a query terminates) before checking again
    spark.streams.awaitAnyTermination(10000)
  }
}

Your listener can dynamically shut down in a variety of ways. For instance, if you're only waiting on a single batch, then just shut down after the first progress update:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

var shutdown = false

def shutdownListener(spark: SparkSession) = new StreamingQueryListener() {
  override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = println("Query started: " + queryStarted.id)
  override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = println("Query terminated: " + queryTerminated.id)
  // the first progress update means the first batch has completed, so flag shutdown
  override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = shutdown = true
}

Or, if you need to shut down after more complicated state changes, you could parse the JSON body of queryProgress.progress (via its json method) to determine whether or not to shut down when a given onQueryProgress event fires.
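
As a rough illustration of that approach, the single-batch listener above could instead set the flag only when a micro-batch reports zero new input rows; treating a zero-row batch as "the topic is drained" is an assumption for this sketch, not a guarantee that no more data will arrive:

// Variant of onQueryProgress: only request shutdown once a micro-batch
// reports zero new rows. The zero-row check is a heuristic, not a guarantee.
override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
  if (queryProgress.progress.numInputRows == 0) {
    shutdown = true
  }
}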



Source: https://stackoverflow.com/questions/52498622/is-there-a-way-to-dynamically-stop-spark-structured-streaming
