In Apache Spark, can I incrementally cache an RDD partition?
Question

I was under the impression that both RDD execution and caching are lazy: namely, if an RDD is cached and only part of it is used, then the caching mechanism caches only that part, and the rest is computed on demand. Unfortunately, the following experiment seems to indicate otherwise:

    val acc = new LongAccumulator()
    TestSC.register(acc)

    // the accumulator counts how many elements are actually computed
    val rdd = TestSC.parallelize(1 to 100, 16).map { v =>
      acc add 1
      v
    }

    rdd.persist()

    val sliced = rdd
      .mapPartitions { itr =>
        itr.slice(0, 2)
      }