Question
My context is that I have a Spark custom receiver that receives a data stream from an HTTP endpoint. The HTTP endpoint is updated every 30 seconds with new data, so it does not make sense for my Spark Streaming application to aggregate all of the data in the 30-second time frame: that obviously leads to duplicate data (when I save the DStream as a file, each part file that represents an RDD is exactly the same).
To avoid these duplicates, I want a 5-second slice of this window. I want to use the slice function in the DStream API. There are two ways to call this function: 1. slice(fromTime: Time, toTime: Time) 2. slice(interval: Interval)
Although the second option is a public method, the Interval class is private. I've raised a ticket on Spark's JIRA, but that is a separate issue (https://issues.apache.org/jira/browse/SPARK-27206).
My question is specific to the first option. I do the following
import org.apache.spark.streaming.{Duration, Durations, StreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream
val sparkSession = getSparkSession(APP_NAME)
val batchInterval:Duration = Durations.seconds(30)
val windowDuration:Duration = Durations.seconds(60)
val slideDuration:Duration = Durations.seconds(30)
val ssc = new StreamingContext(sparkSession.sparkContext, batchInterval)
ssc.checkpoint("some path")
val memTotal:ReceiverInputDStream[String] = ssc.receiverStream(new MyReceiver("http endpoint", true))
val dstreamMemTotal = memTotal.window(windowDuration, slideDuration)
All is well up to this point. However, when I add the slice call, such as the following
val a = dstreamMemTotal.slice(currentTime, currentTime.+(Durations.seconds(5)))
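For reference, the Time/Duration arithmetic used in that call can be sketched in isolation (a minimal, self-contained example that only exercises the Time and Duration API, not the receiver; TimeSliceSketch is a made-up name):

```scala
import org.apache.spark.streaming.{Durations, Time}

object TimeSliceSketch {
  def main(args: Array[String]): Unit = {
    // Time wraps epoch milliseconds; Duration arithmetic is millisecond-based.
    val currentTime = Time(60000L)                    // t = 60 s
    val sliceEnd   = currentTime + Durations.seconds(5)
    println(sliceEnd.milliseconds)                    // 65000
    // Note: slice() itself additionally depends on the DStream's zeroTime,
    // which is only set once the StreamingContext has been started, so the
    // arithmetic alone is not enough to make the call succeed.
  }
}
```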
I get the following error.
Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.dstream.WindowedDStream@62315f22 has not been initialized
at org.apache.spark.streaming.dstream.DStream$$anonfun$slice$2.apply(DStream.scala:880)
at org.apache.spark.streaming.dstream.DStream$$anonfun$slice$2.apply(DStream.scala:878)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:699)
at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:265)
at org.apache.spark.streaming.dstream.DStream.slice(DStream.scala:878)
Any pointers please?
Source: https://stackoverflow.com/questions/55257884/creating-a-slice-of-dstream-window