Question
I was under the impression that both RDD execution and caching are lazy: namely, if an RDD is cached and only part of it is used, then the caching mechanism caches only that part, and the rest is computed on demand.
Unfortunately, the following experiment seems to indicate otherwise:
val acc = new LongAccumulator()
TestSC.register(acc)
val rdd = TestSC.parallelize(1 to 100, 16).map { v =>
  acc add 1
  v
}
rdd.persist()
val sliced = rdd
  .mapPartitions { itr =>
    itr.slice(0, 2)
  }
sliced.count()
assert(acc.value == 32)
Running it yields the following assertion failure:
100 did not equal 32
ScalaTestFailureLocation:
Expected :32
Actual :100
It turns out the entire RDD was computed instead of only the first 2 items in each partition. This is very inefficient in some cases (e.g. when you need to determine quickly whether the RDD is empty). Ideally, the caching manager should allow the caching buffer to be written incrementally and accessed randomly. Does this feature exist? If not, what should I do to make it happen? (Preferably using the existing memory & disk caching mechanism.)
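For the emptiness check in particular, one workaround (not incremental caching, just a sketch reusing the question's TestSC and a hypothetical accumulator name) is to probe before persisting, since isEmpty()/take() only pull what they need from an unpersisted RDD:

import org.apache.spark.util.LongAccumulator

// Sketch: probe emptiness before caching, so only about one element of the
// first non-empty partition is ever computed. isEmpty() is backed by take(1).
val probeAcc = new LongAccumulator()
TestSC.register(probeAcc)
val lazyRdd = TestSC.parallelize(1 to 100, 16).map { v =>
  probeAcc.add(1)
  v
}
val empty = lazyRdd.isEmpty()   // pulls at most a handful of elements
if (!empty) lazyRdd.persist()   // cache only once the full dataset is actually needed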
Thanks a lot for your opinion
UPDATE 1: It appears that Spark already has two classes:
- ExternalAppendOnlyMap
- ExternalAppendOnlyUnsafeRowArray
that support more granular caching of many values. Even better, they don't rely on StorageLevel, instead making their own decisions about which storage device to use. I'm surprised, however, that they are available only for co-group/join/streamOps or accumulators, rather than being options for RDD/Dataset caching directly.
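For contrast, the existing RDD/Dataset caching path is driven entirely by a StorageLevel chosen up front; a minimal sketch of that mechanism (nothing incremental about it), assuming the rdd from the snippet above:

import org.apache.spark.storage.StorageLevel

// Existing mechanism: the caller fixes the storage level in advance, and the
// first action materializes every partition at that level.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()   // caches all partitions, regardless of how little is read later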
Answer 1:
Interesting question in hindsight; here is my take:
You cannot cache incrementally. So the answer to your question is No.
persist applies to all partitions of that RDD; it is meant for multiple Actions, or for a single Action with several computations branching off the same common RDD. The RDD optimizer does not look at how the cached data will actually be consumed and trim it the way you suggest: you issued the persist call, so it is executed as given.
But if you do not persist, lazy evaluation and the fusing of code within a Stage seem to tie the slice cardinality and the accumulator together. That is clear. Is it logical? Yes, since there is no further reference to the RDD from another Action. Others may see it as odd or erroneous, but in my opinion it does not imply incremental persistence / caching.
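To illustrate the no-persist path described above, here is a minimal sketch (reusing the question's TestSC and a fresh accumulator; the names are assumptions carried over from the original snippet): dropping persist() lets the slice pull only two elements per partition through the fused map, so the accumulator lands on 32 instead of 100.

import org.apache.spark.util.LongAccumulator

// Same pipeline as the question, minus persist(): map and slice are fused
// into one stage, so only 2 elements per partition are ever computed.
val acc2 = new LongAccumulator()
TestSC.register(acc2)
val rdd2 = TestSC.parallelize(1 to 100, 16).map { v =>
  acc2.add(1)
  v
}
rdd2.mapPartitions(_.slice(0, 2)).count()
assert(acc2.value == 32)  // with rdd2.persist() before the count, this would be 100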
So, imho, an interesting observation I would not have come up with myself, and I am not convinced it proves anything about partial caching.
Source: https://stackoverflow.com/questions/61259136/in-apache-spark-can-i-incrementally-cache-an-rdd-partition