Question
I currently have the following setup:
The application writes data to Kafka -> Spark Streaming reads the stored data (always reading from the earliest entry) and applies conversions to the stream -> the application needs an RDD of this result to train an MLlib model.
I basically want to achieve something similar to https://github.com/keiraqz/anomaly-detection - but my data does not come from a file but from Kafka, and it needs some processing in Spark to extract the training data from the input.
Reading the data and processing it in the stream is no problem. However, providing the result to the main thread for further processing does not work at all.
Is there a simple way for the stream to consume data for a certain amount of time, write everything it reads during that time into some kind of data structure, and afterwards use that data structure for further processing?
What I have tried so far is to define an RDD outside of the stream and then use:
spanDurationVectorStream.foreachRDD { rdd =>
  if (rdd.count() == 0) {
    flag = 1
  }
  bufferRdd.union(rdd)
}

Logger.getRootLogger.setLevel(rootLoggerLevel)
ssc.start()

while (flag == 0) {
  Thread.sleep(1)
}
Thread.sleep(1)
However, nothing is ever added to bufferRdd - it keeps only the single entry I used to initialize it.
I am running all the required Spark libraries on version 2.1.1 with Scala 2.11.
If you need any further information I will do my best to provide you with everything you need.
Any help would be greatly appreciated.
EDIT: A quick summary of the amazing hints from @maasg - as soon as he gives me the possibility to accept them as an answer, I will happily do that:
First: to fix the issue with the RDD, the code can be changed to the following:
spanDurationVectorStream.foreachRDD { rdd =>
  if (rdd.count() == 0) {
    flag = 1
  }
  bufferRdd = bufferRdd.union(rdd)
}

Logger.getRootLogger.setLevel(rootLoggerLevel)
ssc.start()

while (flag == 0) {
}
Since RDDs are immutable, each rdd.union returns a new RDD that has to be saved back into bufferRdd (see How history RDDs are preserved for further use in the given code). The Thread.sleep(1) calls are simply unnecessary. With this setup I am able to use the RDD to train the model.
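For completeness, this is roughly how the buffered RDD can then be fed into MLlib - a minimal sketch on my side, assuming bufferRdd holds mllib Vectors and a KMeans model similar to the linked anomaly-detection example (k and the iteration count are just placeholders):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// The buffer was filled inside foreachRDD; stop the streaming context
// but keep the SparkContext alive so the RDD can still be used.
ssc.stop(stopSparkContext = false, stopGracefully = true)

// Assumption: bufferRdd is an RDD[Vector], as required by mllib's KMeans.
val trainingData: RDD[Vector] = bufferRdd.cache()

// Placeholder hyperparameters - to be tuned for the real data.
val model = KMeans.train(trainingData, 2, 20)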
However, @maasg added that for the training scenario he would recommend not using Spark Streaming at all but plain batch Spark, as described in Read Kafka topic in a Spark batch job.
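As far as I understand that suggestion, such a batch read would look roughly like the sketch below, using the spark-streaming-kafka-0-10 package that matches my Spark 2.1.1 setup; the topic name my-topic, the broker localhost:9092 and the hard-coded offsets are only placeholders:

import java.{util => ju}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}

// Placeholder connection settings - replace with the real ones.
val kafkaParams: ju.Map[String, Object] = new ju.HashMap[String, Object]()
kafkaParams.put("bootstrap.servers", "localhost:9092")
kafkaParams.put("key.deserializer", classOf[StringDeserializer])
kafkaParams.put("value.deserializer", classOf[StringDeserializer])
kafkaParams.put("group.id", "training-batch")

// One OffsetRange per partition: (topic, partition, fromOffset, untilOffset).
// The real earliest/latest offsets would come from an offset lookup (see below).
val offsetRanges = Array(OffsetRange("my-topic", 0, 0L, 100L))

// sc is a plain SparkContext - no StreamingContext is needed for the batch approach.
val kafkaRdd = KafkaUtils.createRDD[String, String](
  sc, kafkaParams, offsetRanges, LocationStrategies.PreferConsistent)

val rawValues = kafkaRdd.map((record: ConsumerRecord[String, String]) => record.value())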
The only piece that remains unclear to me right now is how to efficiently get the earliest and latest offsets so that I can read the full content that is stored in Kafka at the moment of execution.
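My current idea is to ask a plain KafkaConsumer for these offsets - a rough sketch, assuming the kafka-clients dependency is at least 0.10.1 (beginningOffsets/endOffsets were only added in that version) and reusing the placeholder topic/broker names from above:

import java.{util => ju}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.OffsetRange

val props = new ju.Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.deserializer", classOf[StringDeserializer].getName)
props.put("value.deserializer", classOf[StringDeserializer].getName)
props.put("group.id", "offset-lookup")

val consumer = new KafkaConsumer[String, String](props)

// Look up all partitions of the topic, then ask for their first and last offsets.
val partitions: ju.List[TopicPartition] =
  consumer.partitionsFor("my-topic").asScala
    .map(info => new TopicPartition(info.topic(), info.partition()))
    .asJava

val earliest = consumer.beginningOffsets(partitions).asScala
val latest = consumer.endOffsets(partitions).asScala
consumer.close()

// One OffsetRange per partition, covering everything stored in Kafka right now.
// This array would replace the hard-coded offsetRanges in the batch sketch above.
val fullOffsetRanges = partitions.asScala.map { tp =>
  OffsetRange(tp.topic(), tp.partition(), earliest(tp), latest(tp))
}.toArray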
Source: https://stackoverflow.com/questions/44625956/sparkstreaming-read-kafka-stream-and-provide-it-as-rdd-for-further-processing