Question
I have a Spark Streaming process which reads data from Kafka into a DStream.
In my pipeline I do the following twice (one after another):
DStream.foreachRDD (transformations on the RDD and inserting into a destination).
(Each time I do different processing and insert the data into a different destination.)
I was wondering how DStream.cache() would work if called right after I read the data from Kafka. Is it possible to do it?
Is the process currently reading the data from Kafka twice?
Please keep in mind that it is not possible to merge the two foreachRDD calls into one (the two paths are quite different, and there are stateful transformations there which need to be applied on the DStream...).
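Roughly, the pipeline looks like this (a minimal sketch only; the Kafka parameters and the actual transformations are elided placeholders, not real code):
import org.apache.spark.streaming.dstream.DStream

// Sketch only: Kafka parameters and the real transformations are elided.
val kafkaDStream: DStream[(String, String)] = ???   // e.g. created via KafkaUtils

// Path 1: its own (possibly stateful) processing, inserted into destination 1.
kafkaDStream.foreachRDD { rdd =>
  // transform rdd and insert into destination 1
}

// Path 2: different processing, inserted into destination 2.
kafkaDStream.foreachRDD { rdd =>
  // transform rdd and insert into destination 2
}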
Thanks for your help
Answer 1:
There are two options:
1. Use DStream.cache() to mark the underlying RDDs as cached. Spark Streaming will take care of unpersisting the RDDs after a timeout, controlled by the spark.cleaner.ttl configuration.
2. Use additional foreachRDD calls to apply cache() and unpersist(false) as side-effecting operations to the RDDs in the DStream:
For example:
// The Kafka input stream (creation elided).
val kafkaDStream = ???

// Shared transformations applied to the DStream before the lineage forks.
val targetDStream = kafkaDStream
  .transformation(...)
  .transformation(...)
  ...

// Right before the lineage fork, mark each generated RDD as cached:
targetDStream.foreachRDD { rdd => rdd.cache() }

targetDStream.foreachRDD { rdd => /* do stuff 1 */ }
targetDStream.foreachRDD { rdd => /* do stuff 2 */ }

// After both branches have run, release the cached data:
targetDStream.foreachRDD { rdd => rdd.unpersist(false) }
Note that you could incorporate the cache as the first statement of do stuff 1
if that's an option.
I prefer this option because it gives me fine-grained control over the cache lifecycle and lets me clean things up as soon as needed instead of depending on a TTL.
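For comparison, a minimal sketch of the first option, reusing the kafkaDStream from the example above; here the cache is placed on the DStream itself right after the Kafka read, as asked in the question:
// Option 1: cache the DStream itself, right after reading from Kafka.
// Each batch RDD generated from kafkaDStream is then persisted once and
// reused by both foreachRDD branches; Spark Streaming unpersists it later.
kafkaDStream.cache()

kafkaDStream.foreachRDD { rdd =>
  // transform rdd and insert into destination 1
}
kafkaDStream.foreachRDD { rdd =>
  // transform rdd and insert into destination 2
}
With the cache in place, each batch should be read from Kafka only once and then reused by both branches instead of being recomputed.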
Source: https://stackoverflow.com/questions/37684506/caching-dstream-in-spark-streaming