What happens if I cache the same RDD twice in Spark?

Asked 2021-01-05 16:40

I'm building a generic function which receives an RDD and does some calculations on it. Since I run more than one calculation on the input RDD, I would like to cache it. For …
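For context, a minimal sketch of the kind of generic function the question describes, in PySpark since both answers use it; the function name and the calculations are hypothetical:

    # hypothetical generic function: receives an RDD, runs more than one
    # action on it, and caches the input first
    def analyze(rdd):
        rdd.cache()                        # safe even if the caller already cached it
        total = rdd.count()                # first action materializes (and caches) the RDD
        distinct = rdd.distinct().count()  # second action reuses the cached data
        return total, distinct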

2 Answers
  • 2021-01-05 17:09

    I just tested this on my cluster; Zohar is right: nothing happens, it will just cache the RDD once. The reason, I think, is that every RDD has an internal id, and Spark uses that id to mark whether an RDD has been cached or not, so caching one RDD multiple times does nothing.

    Below is my code:

    Updated: added the code as requested.


    ### cache and count; the storage info then shows up on the Web UI

    raw_file = sc.wholeTextFiles('hdfs://10.21.208.21:8020/user/mercury/names', minPartitions=40)\
                     .setName("raw_file")\
                     .cache()
    raw_file.count()

    ### cache and count again, then take a look at the Web UI: nothing changes

    raw_file.cache()
    raw_file.count()

    ### change the RDD's name, then cache and count again, to see whether it caches a
    ### new RDD under the new name. Still nothing changes, so I think it is using the
    ### RDD id as the mark; to know more we would need a detailed read of the
    ### documentation or even the source code

    raw_file.setName("raw_file_2")
    raw_file.cache().count()
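
    If the Web UI is not handy, the same behavior can be checked from the driver; a minimal sketch, assuming a running SparkContext named sc (the data here is made up):

    rdd = sc.parallelize(range(1000)).setName("demo")

    print(rdd.is_cached)    # False: not marked for caching yet
    rdd.cache()
    print(rdd.is_cached)    # True
    rdd.cache()             # calling cache again changes nothing
    print(rdd.is_cached)    # still True
    print(rdd.id())         # the internal id Spark uses to track this RDD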
    
  • 2021-01-05 17:21

    Nothing. If you call cache on an already cached RDD, nothing happens; the RDD is simply cached once. Caching, like many other operations in Spark, is lazy:

    • When you call cache, the RDD's storageLevel is set to MEMORY_ONLY.
    • When you call cache again, it is set to the same value, so nothing changes.
    • Upon evaluation, when the underlying RDD is materialized, Spark checks its storageLevel and, if it requires caching, caches it.

    So you're safe. The sketch below illustrates these three points.
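
    A minimal PySpark sketch, assuming a running SparkContext named sc; the comments show the output of print on each storage level:

    rdd = sc.parallelize(range(100))

    print(rdd.getStorageLevel())  # Serialized 1x Replicated: no caching requested yet

    rdd.cache()                   # sets the storage level to MEMORY_ONLY
    print(rdd.getStorageLevel())  # Memory Serialized 1x Replicated

    rdd.cache()                   # second call sets the same level again: a no-op
    rdd.count()                   # first action materializes and caches the RDD
    rdd.count()                   # second action reads from the cache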
