Question
I am new to Spark and have been reading about monitoring Spark applications. Basically, I want to know how many records were processed by the Spark application in a given trigger interval, and the progress of the query. I know lastProgress gives all those metrics, but when I use awaitTermination together with lastProgress it always returns null.
val q4s = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", brokers)
  .option("subscribe", topic)
  .option("startingOffsets", "earliest")
  .load()
  .writeStream
  .outputMode("append")
  .option("checkpointLocation", checkpoint_loc)
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .format("console")
  .start()
println("Query Id: "+ q4s.id.toString())
println("QUERY PROGRESS.........")
println(q4s.lastProgress);
q4s.awaitTermination();
Output:
Query Id: efd6bc15-f10c-4938-a1aa-c81fdb2b33e3
QUERY PROGRESS.........
null
How can I get the progress of my query while using awaitTermination, or how can I keep my query running continuously without using awaitTermination?
Thanks in advance.
Answer 1:
You have to start a separate thread with a reference to the streaming query to monitor (say q4s) and pull its progress regularly.
The thread that started the query (the main thread of your Spark Structured Streaming application) usually calls awaitTermination so that the daemon threads of the streaming queries it started can keep running.
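A minimal sketch of that idea (assuming q4s is the query started in the question; the 10-second polling interval is arbitrary):

```scala
// Sketch only: poll the query's progress from a background thread while
// the main thread blocks in awaitTermination.
val monitor = new Thread(() => {
  while (q4s.isActive) {
    // lastProgress is null until the first micro-batch completes
    Option(q4s.lastProgress).foreach(p => println(p.prettyJson))
    Thread.sleep(10000)
  }
})
monitor.setDaemon(true) // do not keep the JVM alive for the monitor
monitor.start()
q4s.awaitTermination()
```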
Answer 2:
Using dedicated runnable thread
You can create a dedicated Thread continuously printing the last progress of your streaming query.
First, define a runnable StreamingMonitor class which prints out the last progress every 10 seconds (10000 ms):
import java.util.Calendar
import org.apache.spark.sql.streaming.StreamingQuery

class StreamingMonitor(q: StreamingQuery) extends Runnable {
  override def run(): Unit = {
    while (true) {
      println("Time: " + Calendar.getInstance().getTime())
      println(q.lastProgress)
      Thread.sleep(10000)
    }
  }
}
Second, use it in your application code as below:
val q4s: StreamingQuery = df.writeStream
  [...]
  .start()

new Thread(new StreamingMonitor(q4s)).start()
q4s.awaitTermination()
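As a small variation (my assumption, not part of the original answer), the monitor thread can be marked as a daemon so it cannot keep the JVM alive after the streaming query terminates:

```scala
// Variation: run StreamingMonitor (defined above) as a daemon thread.
val monitorThread = new Thread(new StreamingMonitor(q4s))
monitorThread.setDaemon(true)
monitorThread.start()
q4s.awaitTermination()
```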
Looping over query status
You could also have a while loop on the status of the query:
val q4s: StreamingQuery = df.writeStream
  [...]
  .start()

while (q4s.isActive) {
  println(q4s.lastProgress)
  Thread.sleep(10000)
}
q4s.awaitTermination()
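Note that lastProgress stays null until the first micro-batch completes, which is why the question's println printed null immediately after start(). A sketch of the loop guarding against that (q4s.status additionally reports what the query is doing right now):

```scala
// Sketch: print the current status and, once available, the last progress.
while (q4s.isActive) {
  println(q4s.status) // e.g. whether the query is waiting for the next trigger
  Option(q4s.lastProgress).foreach(p => println(p.prettyJson))
  Thread.sleep(10000)
}
```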
Alternative Solution using StreamingQueryListener
An alternative way to monitor your streaming query is to use a StreamingQueryListener. Again, first define a class extending StreamingQueryListener:
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.QueryProgressEvent

class MonitorListener extends StreamingQueryListener {
  override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = { }
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    println(s"numInputRows: ${event.progress.numInputRows}")
    println(s"processedRowsPerSecond: ${event.progress.processedRowsPerSecond}")
  }
  override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = { }
}
Then register it with your SparkSession:
spark.streams.addListener(new MonitorListener)
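If you keep a reference to the listener, it can also be removed again via the StreamingQueryManager when monitoring is no longer needed; a sketch:

```scala
// Sketch: register the listener, run the query, then unregister it.
val listener = new MonitorListener
spark.streams.addListener(listener)
// ... start and await the streaming query ...
spark.streams.removeListener(listener)
```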
Source: https://stackoverflow.com/questions/54436822/how-to-get-progress-of-streaming-query-after-awaittermination