How to get progress of streaming query after awaitTermination?

Submitted by 非 Y 不嫁゛ on 2020-12-31 06:01:09

Question


I am new to Spark and have been reading about monitoring Spark applications. Basically, I want to know how many records were processed by my Spark application in a given trigger interval, and the overall progress of the query. I know 'lastProgress' exposes all those metrics, but when I use awaitTermination together with 'lastProgress' it always returns null.

val q4s = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", brokers)
  .option("subscribe", topic)
  .option("startingOffsets", "earliest")
  .load()
  .writeStream
  .outputMode("append")
  .option("checkpointLocation", checkpoint_loc)
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .format("console")
  .start()

println("Query Id: " + q4s.id.toString())
println("QUERY PROGRESS.........")
println(q4s.lastProgress)
q4s.awaitTermination()

Output:

Query Id: efd6bc15-f10c-4938-a1aa-c81fdb2b33e3
QUERY PROGRESS.........
null

How can I get the progress of my query while using awaitTermination, or how can I keep my query running continuously without using awaitTermination?

Thanks in advance.


Answer 1:


You have to start a separate thread with a reference to the streaming query you want to monitor (say q4s) and pull its progress regularly.

The thread that starts the query (the main thread of your Spark Structured Streaming application) usually blocks on awaitTermination so that the daemon threads of the streaming queries it started can keep running.
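The polling pattern described above can be sketched independently of Spark. This is a minimal, hypothetical sketch: `ProgressPoller` and its `readProgress` function are stand-ins of my own invention for a call such as `q4s.lastProgress`; any source of progress snapshots can be polled the same way from a background thread.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical generic poller: `readProgress` stands in for a call such as
// q4s.lastProgress on a real StreamingQuery.
final class ProgressPoller(readProgress: () => String, intervalMs: Long) extends Runnable {
  @volatile var lastSeen: String = _ // latest snapshot seen by the poller

  override def run(): Unit = {
    while (!Thread.currentThread().isInterrupted) {
      lastSeen = readProgress()      // pull the latest snapshot
      try Thread.sleep(intervalMs)   // wait before polling again
      catch {
        case _: InterruptedException =>
          Thread.currentThread().interrupt() // restore the flag and exit the loop
      }
    }
  }
}
```

In a real application the thread running such a poller would typically be marked as a daemon (`setDaemon(true)`) so it does not keep the JVM alive after the query terminates.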




Answer 2:


Using a dedicated runnable thread

You can create a dedicated thread that continuously prints the last progress of your streaming query.

First, define a runnable monitoring class that prints out the last progress every 10 seconds (10000 ms):

import java.util.Calendar
import org.apache.spark.sql.streaming.StreamingQuery

class StreamingMonitor(q: StreamingQuery) extends Runnable {
  override def run(): Unit = {
    while (true) {
      println("Time: " + Calendar.getInstance().getTime())
      println(q.lastProgress) // null until the first batch has completed
      Thread.sleep(10000)     // poll every 10 seconds
    }
  }
}

Second, use it in your application code as follows:

val q4s: StreamingQuery = df.writeStream
  [...]
  .start()

new Thread(new StreamingMonitor(q4s)).start()

q4s.awaitTermination()

Looping over query status

You could also loop on the query's status from the main thread:

val q4s: StreamingQuery = df.writeStream
  [...]
  .start()

while(q4s.isActive) {
  println(q4s.lastProgress)
  Thread.sleep(10000)
}

q4s.awaitTermination()

Alternative Solution using StreamingQueryListener

An alternative solution to monitor your streaming query is to use a StreamingQueryListener. Again, first define a class extending StreamingQueryListener:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.QueryProgressEvent


class MonitorListener extends StreamingQueryListener {

  override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = { }

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    println(s"""numInputRows: ${event.progress.numInputRows}""")
    println(s"""processedRowsPerSecond: ${event.progress.processedRowsPerSecond}""")
  }

  override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = { }
}

Then register it with your SparkSession:

spark.streams.addListener(new MonitorListener)
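Putting it together, a minimal wiring sketch (assuming the `spark` session, `df`, and `checkpoint_loc` from the question) might look like this. Registering the listener before calling `start()` ensures the first batch's progress event is not missed:

```scala
// Sketch only: assumes `spark`, `df`, and `checkpoint_loc` from the question.
spark.streams.addListener(new MonitorListener) // register before start()

val q4s = df.writeStream
  .outputMode("append")
  .option("checkpointLocation", checkpoint_loc)
  .format("console")
  .start()

q4s.awaitTermination() // listener callbacks fire on Spark's own event thread
```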


Source: https://stackoverflow.com/questions/54436822/how-to-get-progress-of-streaming-query-after-awaittermination
