I use Spark 1.6.0 with Cloudera 5.8.3.
I have a DStream object and plenty of transformations defined on top of it,
val stream = KafkaUtils.c
Using streaming listeners should solve the problem for you:
(Sorry, it's a Java example.)
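import org.apache.spark.streaming.scheduler.StreamingListener;
import org.apache.spark.streaming.scheduler.StreamingListenerBatchCompleted;

// Register the listener on the StreamingContext.
ssc.addStreamingListener(new JobListener());
// ...
class JobListener implements StreamingListener {
    @Override
    public void onBatchCompleted(StreamingListenerBatchCompleted batchCompleted) {
        // totalDelay() is a Scala Option; get() throws if the delay is not defined.
        System.out.println("Batch completed, total delay: " + batchCompleted.batchInfo().totalDelay().get().toString() + " ms");
    }
    /*
    snipped other methods
    */
}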
https://gist.github.com/akhld/b10dc491aad1a2007183
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-streaming/spark-streaming-streaminglisteners.html
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener
Start a stream with the name myStreamName and wait for it to start up:
deltaStreamingQuery = (streamingDF
    .writeStream
    .format("delta")
    .queryName(myStreamName)
    .start(writePath)
)
untilStreamIsReady(myStreamName)
PySpark version of waiting for the stream to start up:
import time

def getActiveStreams():
    try:
        return spark.streams.active
    except Exception:
        print("Unable to iterate over all active streams - using an empty set instead.")
        return []

def untilStreamIsReady(name, progressions=3):
    # Poll until the named query exists and has reported enough progress updates.
    queries = list(filter(lambda query: query.name == name, getActiveStreams()))
    while (len(queries) == 0 or len(queries[0].recentProgress) < progressions):
        time.sleep(5)  # Give it a few seconds
        queries = list(filter(lambda query: query.name == name, getActiveStreams()))
    print("The stream {} is active and ready.".format(name))
Spark Scala version of waiting for the stream to start up:
def getActiveStreams(): Seq[org.apache.spark.sql.streaming.StreamingQuery] = {
  try {
    spark.streams.active
  } catch {
    case e: Throwable => {
      // In extreme cases, this function may throw an ignorable error.
      println("Unable to iterate over all active streams - using an empty set instead.")
      Seq[org.apache.spark.sql.streaming.StreamingQuery]()
    }
  }
}

def untilStreamIsReady(name: String, progressions: Int = 3): Unit = {
  // Poll until the named query exists and has reported enough progress updates.
  var queries = getActiveStreams().filter(_.name == name)
  while (queries.length == 0 || queries(0).recentProgress.length < progressions) {
    Thread.sleep(5 * 1000) // Give it a few seconds
    queries = getActiveStreams().filter(_.name == name)
  }
  println("The stream %s is active and ready.".format(name))
}
As for the original question: add a second function that is the inverse of this one. First wait for the stream to start up, then wait again, with the loop condition negated, until it finishes. The complete version would look something like this:
untilStreamIsReady(myStreamName)
untilStreamIsDone(myStreamName) // inverse of untilStreamIsReady: waits until myStreamName is no longer in the list of active streams
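The answer does not give a body for untilStreamIsDone, so here is a minimal PySpark-flavoured sketch of what it could look like. The name until_stream_is_done, the get_active_streams parameter, and the poll_seconds and timeout_seconds arguments are all assumptions for illustration, not part of the original answer. The source of active streams is passed in as a callable so the function can be tried without a running Spark session.

```python
import time


def until_stream_is_done(name, get_active_streams, poll_seconds=5, timeout_seconds=None):
    """Block until no active query is called `name` (hypothetical helper).

    get_active_streams is assumed to be a zero-argument callable returning the
    currently active queries, for example `lambda: spark.streams.active`.
    Returns True once the stream is gone, or False if the timeout expires first.
    """
    deadline = None if timeout_seconds is None else time.monotonic() + timeout_seconds
    # Negated form of the start-up check: loop while the stream is still listed.
    while any(query.name == name for query in get_active_streams()):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True
```

In a notebook this might be called as until_stream_is_done(myStreamName, lambda: spark.streams.active). A query only leaves the active list once it has been stopped or has terminated, so for a continuously running stream this call blocks until then. If you still hold the query handle, deltaStreamingQuery.awaitTermination() is the built-in way to wait for the same thing.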