Question
My code is something like:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext()
ssc = StreamingContext(sc, 30)
initRDD = sc.parallelize('path_to_data')
lines = ssc.socketTextStream('localhost', 9999)
res = lines.transform(lambda x: x.join(initRDD))
res.pprint()
My question is that initRDD needs to be updated every day at midnight. I tried it this way:
sc = SparkContext()
ssc = StreamingContext(sc, 30)
lines = ssc.socketTextStream('localhost', 9999)
def func(rdd):
    initRDD = rdd.context.parallelize('path_to_data')
    return rdd.join(initRDD)
res = lines.transform(func)
res.pprint()
But it seems that initRDD gets updated every 30 seconds, which is the same as the batchDuration. Is there a better way to do this?
Answer 1:
One option would be to check for a deadline before the transform. The check is a simple comparison and hence cheap to do at each batch interval:
import java.time.{LocalDate, ZoneOffset}

def nextDeadline(): Long = {
  // assumes midnight in the UTC timezone
  LocalDate.now.atStartOfDay().plusDays(1).toInstant(ZoneOffset.UTC).toEpochMilli()
}

// Note this is a mutable variable!
var initRDD = sparkSession.read.parquet("/tmp/learningsparkstreaming/sensor-records.parquet")
// Note this is a mutable variable!
var _nextDeadline = nextDeadline()

val lines = ssc.socketTextStream("localhost", 9999)

// We use foreachRDD as a scheduling trigger.
// We don't use the data, only the execution hook.
lines.foreachRDD { _ =>
  if (System.currentTimeMillis > _nextDeadline) {
    initRDD = sparkSession.read.parquet("/tmp/learningsparkstreaming/sensor-records.parquet")
    _nextDeadline = nextDeadline()
  }
}

// if the RDD was updated, it will be picked up in this stage
val res = lines.transform(rdd => rdd.join(initRDD))
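For the PySpark code in the question, the same pattern might look roughly like the following. This is an untested sketch: next_deadline, reload_if_due and the state dict are illustrative names I introduce here, sc.textFile stands in for the parallelize('path_to_data') placeholder, and, as in the question, the join assumes both RDDs contain (key, value) pairs.

import time
from datetime import datetime, timedelta, timezone

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def next_deadline():
    # next midnight in UTC, as a Unix timestamp in seconds
    tomorrow = datetime.now(timezone.utc).date() + timedelta(days=1)
    return datetime.combine(tomorrow, datetime.min.time(), tzinfo=timezone.utc).timestamp()

sc = SparkContext()
ssc = StreamingContext(sc, 30)

# mutable driver-side state shared by the two DStream operations
state = {'initRDD': sc.textFile('path_to_data'),
         'deadline': next_deadline()}

lines = ssc.socketTextStream('localhost', 9999)

def reload_if_due(rdd):
    # foreachRDD is used only as a per-batch scheduling hook on the driver
    if time.time() > state['deadline']:
        state['initRDD'] = rdd.context.textFile('path_to_data')
        state['deadline'] = next_deadline()

lines.foreachRDD(reload_if_due)

# the transform closure is re-evaluated at every batch,
# so it picks up the refreshed RDD after a reload
res = lines.transform(lambda rdd: rdd.join(state['initRDD']))
res.pprint()

ssc.start()
ssc.awaitTermination()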
Source: https://stackoverflow.com/questions/45031215/how-to-update-rdd-periodically-in-spark-streaming