Question
Watching this very good video on Spark internals, the presenter says that unless one performs an action on one's RDD after caching it, caching will not really happen.
I never see count() called in other circumstances, so I'm guessing he is only calling count() after cache() to force persistence in the simple example he is giving, and that it is not necessary to do this every time one calls cache() or persist() in one's code. Is this right?
Answer 1:
unless one performs an action on one's RDD after caching it, caching will not really happen.
This is 100% true. The methods cache/persist will just mark the RDD for caching. The items inside the RDD are cached whenever an action is called on the RDD.
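To make this mark-then-materialize behavior concrete, here is a minimal pure-Python sketch (a toy model, not Spark's actual machinery; the class `LazyDataset` and all of its internals are invented for illustration): cache() only sets a flag, and the data is actually stored the first time an action runs.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are lazy, cache() only marks."""

    def __init__(self, source):
        self._source = source        # zero-arg function producing an iterator
        self._cache_marked = False   # set by cache(); nothing is stored yet
        self._cached_items = None    # filled in by the first action

    def cache(self):
        self._cache_marked = True    # just a mark -- no computation happens here
        return self

    def _iterate(self):
        if self._cached_items is not None:
            return iter(self._cached_items)   # serve from the cache
        items = self._source()
        if self._cache_marked:
            self._cached_items = list(items)  # materialize on the first action
            return iter(self._cached_items)
        return iter(items)

    def map(self, f):
        # Lazy transformation: returns a new dataset, computes nothing yet
        return LazyDataset(lambda: (f(x) for x in self._iterate()))

    def count(self):
        # An action: this is what actually triggers computation (and caching)
        return sum(1 for _ in self._iterate())


ds = LazyDataset(lambda: iter(range(10))).cache()
assert ds._cached_items is None      # marked for caching, but nothing stored yet
ds.count()                           # the action runs -> the cache is populated
assert ds._cached_items is not None
```

Without the count(), the marked dataset would recompute its source on every later use, just as an un-materialized RDD does.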
...only calling count() after cache() to force persistence in the simple example he is giving. It is not necessary to do this every time one calls cache() or persist() in one's code. Is this right?
You are 100% right again. But I'll elaborate on this a bit.
For easy understanding, consider the example below.
rdd.cache()
rdd.map(...).flatMap(...) //and so on
rdd.count() //or any other action
Assume you have 10 documents in your RDD. When the above snippet is run, each document goes through these tasks:
- cached
- map function
- flatMap function
On the other hand,
rdd.cache().count()
rdd.map(...).flatMap(...) //and so on
rdd.count() //or any other action
When the above snippet is run, all 10 documents are cached first (the whole RDD). Then the map and flatMap functions are applied.
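The two orderings can be traced with a small pure-Python generator sketch (again a toy, not Spark; the names `cache_gen` and `map_gen` are made up for illustration): with a single action at the end, caching and mapping interleave per document, whereas an immediate count() materializes the full cache before any map runs.

```python
events = []

def source():
    for doc in ["a", "b", "c"]:
        yield doc

def cache_gen(it, store):
    # Caches each document as it flows past (pipelined, per item)
    for doc in it:
        events.append(f"cache {doc}")
        store.append(doc)
        yield doc

def map_gen(it):
    for doc in it:
        events.append(f"map {doc}")
        yield doc.upper()

# CASE A: rdd.cache(); rdd.map(...).count()
# One action drives the pipeline: each doc is cached, then mapped, one at a time.
store_a = []
list(map_gen(cache_gen(source(), store_a)))
case_a = events[:]   # cache/map events interleave per document

# CASE B: rdd.cache().count(); rdd.map(...).count()
# The first count() materializes the whole cache before any map runs.
events.clear()
store_b = []
cached = list(cache_gen(source(), store_b))   # count() forces full caching
list(map_gen(iter(cached)))
case_b = events[:]   # all "cache" events precede all "map" events
```

The end result is the same in both cases; only the order in which documents hit the cache differs.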
Both are right and are used as per the requirements. Hope this makes things clearer.
Answer 2:
Both .cache() and .persist() are transformations (not actions), so when you call them you add them to the DAG. As you can see in the following image, a cached/persisted rdd/dataframe is marked with a green dot.
When you have an action (.count(), .save(), .show(), etc.) after a lot of transformations, it doesn't matter if you also have another action immediately after.
Based on the example from @code:
// CASE 1: cache/persist the initial rdd
rdd.cache()
rdd.count() // forces the cache, but isn't needed here because of the 2nd count below
rdd.map(...).flatMap(...) // transformations
rdd.count() // or any other action

// CASE 2: cache/persist the transformed rdd
rdd.map(...).flatMap(...) // transformations
rdd.cache()
rdd.count() // or any other action
My opinion: don't force caching/persistence with an action whose result you don't need, because you compute something useless.
Source: https://stackoverflow.com/questions/43728505/in-spark-streaming-must-i-call-count-after-cache-or-persist-to-force-cachi