Caching DStream in Spark Streaming

Question


I have a Spark Streaming process which reads data from Kafka into a DStream.

In my pipeline I do the following twice (one after another):

DStream.foreachRDD( transformations on RDD and inserting into destination).

(Each time I do different processing and insert the data into a different destination.)
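Roughly, the structure looks like this (a simplified sketch; parseForA, keepForB, writeToDestinationA and writeToDestinationB are placeholders for my actual logic):

import org.apache.spark.streaming.dstream.DStream

// DStream read from Kafka (e.g. via KafkaUtils.createDirectStream)
val kafkaDStream: DStream[String] = ???

// First pass: one set of transformations, writing to the first destination
kafkaDStream.foreachRDD { rdd =>
  rdd.map(parseForA).foreachPartition(writeToDestinationA)
}

// Second pass: different processing, writing to the second destination
kafkaDStream.foreachRDD { rdd =>
  rdd.filter(keepForB).foreachPartition(writeToDestinationB)
}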

I was wondering: how would DStream.cache work if called right after reading the data from Kafka? Is it possible to do that?

Is the process currently reading the data from Kafka twice?

Please keep in mind that it is not possible to merge the two foreachRDD calls into one (the two paths are quite different, and there are stateful transformations that need to be applied at the DStream level...).

Thanks for your help


Answer 1:


There are two options:

  • Use DStream.cache() to mark the underlying RDDs as cached. Spark Streaming will take care of unpersisting the RDDs after a timeout, controlled by the spark.cleaner.ttl configuration (see the sketch at the end of this answer).

  • Use an additional foreachRDD to apply the cache() and unpersist(false) side-effecting operations to the RDDs in the DStream:

e.g.:

val kafkaDStream = ???              // DStream read from Kafka
val targetDStream = kafkaDStream
  .transformation(...)              // your (stateful) transformations
  .transformation(...)
  ...
// Right before the lineage fork, mark the underlying RDD as cacheable:
targetDStream.foreachRDD { rdd => rdd.cache() }
targetDStream.foreachRDD { rdd => /* do stuff 1 */ }
targetDStream.foreachRDD { rdd => /* do stuff 2 */ }
targetDStream.foreachRDD { rdd => rdd.unpersist(false) }

Note that you could incorporate the cache as the first statement of do stuff 1 if that's an option.
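For instance, folding the cache into the first branch would look like this (a sketch that keeps the hypothetical "do stuff" placeholders from above):

targetDStream.foreachRDD { rdd =>
  rdd.cache()   // cache as the first statement of the first branch
  // do stuff 1
}
targetDStream.foreachRDD { rdd => /* do stuff 2 */ }
targetDStream.foreachRDD { rdd => rdd.unpersist(false) }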

I prefer the second option because it gives me fine-grained control over the cache lifecycle and lets me clean up cached data as soon as it is no longer needed, instead of depending on a TTL.
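For comparison, here is a minimal sketch of the first option, assuming an older Spark version where spark.cleaner.ttl is still available (the application name, batch interval and TTL value below are arbitrary):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

// spark.cleaner.ttl (in seconds) bounds how long old RDDs and metadata are kept
val conf = new SparkConf()
  .setAppName("kafka-two-sinks")
  .set("spark.cleaner.ttl", "3600")
val ssc = new StreamingContext(conf, Seconds(10))

// DStream read from Kafka, cached right after the read
val kafkaDStream: DStream[String] = ???
val cached = kafkaDStream.cache()

cached.foreachRDD { rdd => /* do stuff 1 */ }
cached.foreachRDD { rdd => /* do stuff 2 */ }

ssc.start()
ssc.awaitTermination()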



Source: https://stackoverflow.com/questions/37684506/caching-dstream-in-spark-streaming
