Question
Watching this very good video on Spark internals, the presenter says that unless one performs an action on one's RDD after caching it, caching will not really happen.
I never see count() called in other circumstances, so I'm guessing he is only calling count() after cache() to force persistence in the simple example he is giving, and that it is not necessary to do this every time one calls cache() or persist() in one's code. Is this right?
Answer 1:
unless one performs an action on one's RDD after caching it, caching will not really happen.
This is 100% true. The methods cache/persist will just mark the RDD for caching. The items inside the RDD are cached whenever an action is called on the RDD.
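To make this mark-then-materialize behavior concrete, here is a minimal pure-Python sketch (a toy model, not Spark's actual machinery; the class `LazyDataset` and all of its internals are invented for illustration): cache() only sets a flag, and the data is actually stored the first time an action runs.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are lazy, cache() only marks."""

    def __init__(self, source):
        self._source = source        # zero-arg function producing an iterator
        self._cache_marked = False   # set by cache(); nothing is stored yet
        self._cached_items = None    # filled in by the first action

    def cache(self):
        self._cache_marked = True    # just a mark -- no computation happens here
        return self

    def _iterate(self):
        if self._cached_items is not None:
            return iter(self._cached_items)   # serve from the cache
        items = self._source()
        if self._cache_marked:
            self._cached_items = list(items)  # materialize on the first action
            return iter(self._cached_items)
        return iter(items)

    def map(self, f):
        # Lazy transformation: returns a new dataset, computes nothing yet
        return LazyDataset(lambda: (f(x) for x in self._iterate()))

    def count(self):
        # An action: this is what actually triggers computation (and caching)
        return sum(1 for _ in self._iterate())


ds = LazyDataset(lambda: iter(range(10))).cache()
assert ds._cached_items is None      # marked for caching, but nothing stored yet
ds.count()                           # the action runs -> the cache is populated
assert ds._cached_items is not None
```

Without the count(), the marked dataset would recompute its source on every later use, just as an un-materialized RDD does.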
...only calling count() after cache() to force persistence in the simple example he is giving. It is not necessary to do this every time one calls cache() or persist() in one's code. Is this right?
You are 100% right again. But I'll elaborate on this a bit.
For easy understanding, consider the example below.
rdd.cache()
rdd.map(...).flatMap(...) //and so on
rdd.count() //or any other action
Assume you have 10 documents in your RDD. When the above snippet is run, each document goes through these tasks:
- cached
- map function
- flatMap function
On the other hand,
rdd.cache().count()
rdd.map(...).flatMap(...) //and so on
rdd.count() //or any other action
When the above snippet is run, all 10 documents are cached first (the whole RDD). Then the map and flatMap functions are applied.
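The two orderings can be traced with a small pure-Python generator sketch (again a toy, not Spark; the names `cache_gen` and `map_gen` are made up for illustration): with a single action at the end, caching and mapping interleave per document, whereas an immediate count() materializes the full cache before any map runs.

```python
events = []

def source():
    for doc in ["a", "b", "c"]:
        yield doc

def cache_gen(it, store):
    # Caches each document as it flows past (pipelined, per item)
    for doc in it:
        events.append(f"cache {doc}")
        store.append(doc)
        yield doc

def map_gen(it):
    for doc in it:
        events.append(f"map {doc}")
        yield doc.upper()

# CASE A: rdd.cache(); rdd.map(...).count()
# One action drives the pipeline: each doc is cached, then mapped, one at a time.
store_a = []
list(map_gen(cache_gen(source(), store_a)))
case_a = events[:]   # cache/map events interleave per document

# CASE B: rdd.cache().count(); rdd.map(...).count()
# The first count() materializes the whole cache before any map runs.
events.clear()
store_b = []
cached = list(cache_gen(source(), store_b))   # count() forces full caching
list(map_gen(iter(cached)))
case_b = events[:]   # all "cache" events precede all "map" events
```

The end result is the same in both cases; only the order in which documents hit the cache differs.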
Both are right and are used as per the requirements. Hope this makes things clearer.
Answer 2:
Both .cache() and .persist() are transformations (not actions), so when you call them you add them to the DAG. As you can see in the following image, a cached/persisted rdd/dataframe is marked with a green dot.
When you have an action (.count(), .save(), .show(), etc.) after a lot of transformations, it doesn't matter if you also have another action immediately after.
Based on the example from @code:
// CASE 1: cache/persist the initial rdd
rdd.cache()
rdd.count() // forces the cache, but isn't needed here because of the 2nd count below
rdd.map(...).flatMap(...) // transformations
rdd.count() // or any other action

// CASE 2: cache/persist the transformed rdd
rdd.map(...).flatMap(...) // transformations
rdd.cache()
rdd.count() // or any other action
My opinion: don't force caching/persistence with an action whose result you don't need, because you compute something useless.
Source: https://stackoverflow.com/questions/43728505/in-spark-streaming-must-i-call-count-after-cache-or-persist-to-force-cachi