I'm building a generic function which receives an RDD and does some calculations on it. Since I run more than one calculation on the input RDD, I would like to cache it. What happens if the input RDD is already cached and I call cache on it again?
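A minimal sketch of the kind of generic function I mean, assuming the pyspark shell (so sc is already defined); the function name, the path, and the two calculations are just placeholders:

### hypothetical generic function: caches its input so the two actions below
### do not both recompute the RDD from its source
def analyze(rdd):
    rdd.cache()
    total = rdd.count()       # first calculation
    sample = rdd.take(5)      # second calculation reuses the cached partitions
    return total, sample

### the caller may already have cached the RDD before passing it in
data = sc.textFile('hdfs:///some/path').cache()
print(analyze(data))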
I just tested this on my cluster; Zohar is right, nothing happens, the RDD is only cached once. The reason, I think, is that internally every RDD has an id, and Spark uses that id to mark whether an RDD has already been cached, so caching the same RDD multiple times does nothing.
Below is my code and a screenshot (updated to add the code as requested):
### cache and count; the storage info will then show up in the Web UI
raw_file = sc.wholeTextFiles('hdfs://10.21.208.21:8020/user/mercury/names', minPartitions=40)\
.setName("raw_file")\
.cache()
raw_file.count()
### try to cache and count again, then take a look at the Web UI: nothing changes
raw_file.cache()
raw_file.count()
### try to change the RDD's name, then cache and count again, to see whether it caches a new RDD under the new name.
### Still nothing changes, so I think it is using the RDD id as the mark; to learn more we would need a detailed read
### of the documentation or even the source code
raw_file.setName("raw_file_2")
raw_file.cache().count()
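A quick way to confirm this without the Web UI is to check the RDD's id and storage level programmatically; a small sketch, assuming the same raw_file RDD from above:

### the id and the storage level stay the same after a second cache() call,
### so no new cache entry is created
print(raw_file.id())                 # the internal RDD id Spark tracks
print(raw_file.getStorageLevel())    # the level assigned by the first cache()
raw_file.cache()
print(raw_file.id())                 # same id
print(raw_file.getStorageLevel())    # same storage level, no change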
Nothing happens. If you call cache on an already cached RDD, nothing changes; the RDD will be cached (once). Caching, like many other transformations, is lazy:

- when you call cache, the RDD's storageLevel is set to MEMORY_ONLY
- when you call cache again, it is set to the same value (no change)
- upon evaluation, when the underlying data is materialized, Spark checks the RDD's storageLevel and, if it requires caching, caches it

So you're safe.
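To make those steps concrete, here is a minimal sketch assuming the pyspark shell (sc predefined); the RDD and its name are invented for the example:

rdd = sc.parallelize(range(1000)).setName('numbers')

rdd.cache()                      # lazy: only sets the storageLevel, nothing is computed yet
print(rdd.getStorageLevel())     # the level cache() assigned (MEMORY_ONLY)

rdd.cache()                      # calling cache again: same storageLevel, no change
print(rdd.getStorageLevel())     # unchanged

rdd.count()                      # first action materializes the RDD and stores its partitions
rdd.count()                      # second action reads the cached partitions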