Not able to persist the DStream for use in next batch


It's conceptually not possible to "remember" a DStream. DStreams are time-bound: on each clock tick (the "batch interval"), a DStream represents the data observed in the stream during that period of time.

Hence, we cannot have an "old" DStream saved to join with a "new" DStream. All DStreams live in the "now".

The underlying data structure of a DStream is the RDD: at each batch interval, the DStream holds one RDD with the data for that interval. RDDs represent a distributed collection of data; they are immutable and remain available for as long as we hold a reference to them.
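
If it helps to see that relationship concretely, here is a minimal sketch (assuming a local socket source on port 9999 and a 5-second batch interval, both hypothetical) showing how foreachRDD exposes the single RDD that backs each interval:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-rdd-sketch")
val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batch interval

val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source

lines.foreachRDD { (rdd, time) =>
  // rdd is the RDD of the data observed during this interval
  println(s"batch at $time contains ${rdd.count()} records")
}

ssc.start()
ssc.awaitTermination()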

We can combine RDDs and DStreams to create the "history roll over" that's required here.

It looks quite similar to the approach in the question, but it keeps the history only as a plain RDD.

Here's a high-level view of the suggested changes:

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

var history: RDD[(String, List[String])] = sc.emptyRDD()

val dstream1 = ...
val dstream2 = ...

// extend each batch of dstream1 with the accumulated history
val historyDStream = dstream1.transform(rdd => rdd.union(history))
val joined = historyDStream.join(dstream2)

... do stuff with joined as above, obtain dstreamFiltered ...

dstreamFiltered.foreachRDD { rdd =>
  val formatted = rdd.map { case (k, (v1, v2)) => (k, v1) } // get rid of the join info
  history.unpersist(false) // unpersist the 'old' history RDD
  history = formatted // assign the new history
  history.persist(StorageLevel.MEMORY_AND_DISK) // cache the computation
  history.count() // action to materialize this transformation
}

This is only a starting point. There are additional considerations regarding checkpointing: without it, the lineage of the history RDD grows unbounded until a StackOverflowError eventually occurs. This blog post covers the technique in detail: http://www.spark.tc/stateful-spark-streaming-using-transform/
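
As a rough illustration of the checkpointing piece, the foreachRDD block above could be extended along these lines (the checkpoint directory and the checkpointEvery cadence are hypothetical; adjust both to your environment):

sc.setCheckpointDir("hdfs:///tmp/history-checkpoints") // hypothetical path

var batchCounter = 0L
val checkpointEvery = 10 // checkpoint every 10 batches (hypothetical cadence)

dstreamFiltered.foreachRDD { rdd =>
  val formatted = rdd.map { case (k, (v1, v2)) => (k, v1) }
  history.unpersist(false)
  history = formatted
  history.persist(StorageLevel.MEMORY_AND_DISK)
  batchCounter += 1
  if (batchCounter % checkpointEvery == 0) {
    history.checkpoint() // truncates the lineage; marked before the action below
  }
  history.count() // materializes the persist (and the checkpoint when requested)
}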

I also recommend using Scala instead of Java; the Java syntax is too verbose for Spark Streaming.
