Un-persisting all dataframes in (py)spark

猫巷女王i 2020-12-29 20:33

I have a Spark application with several points where I would like to persist the current state. This is usually after a large step, or when caching a state that I would like to use multiple times.

3 Answers
  • 2020-12-29 20:54

    When you call cache() on a DataFrame, it is evaluated lazily: the data is only materialized when you run an action on it, such as count() or show().

    In your case, the first cache() is followed by show(), which is why that DataFrame ends up cached in memory. You then transform it to add an additional column, cache the new DataFrame, and call show() again, so the second DataFrame is cached in memory as well. If there is only enough storage memory to hold one of them, caching the second DataFrame will evict blocks of the first, because there is not enough space to keep both.

    Thing to keep in mind: do not cache a DataFrame unless you reuse it across multiple actions; otherwise caching is pure overhead, since it is itself a fairly costly operation. See the sketch below.
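
    A minimal sketch of that laziness, assuming a plain SparkSession (the DataFrames below are illustrative, not taken from the question):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.range(10).cache()   # cache() is lazy: nothing is materialized yet
    df.show()                      # first action: df is computed and stored in memory

    df2 = df.withColumn("doubled", F.col("id") * 2).cache()
    df2.show()                     # second action: df2 is cached as well

    df.unpersist()                 # explicitly release df's cached blocks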

  • 2020-12-29 21:06

    Spark 2.x

    You can use Catalog.clearCache:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    ...
    spark.catalog.clearCache()
    
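    If you want to confirm the cache is empty afterwards, Catalog.isCached can be checked against a cached temp view; the view name below is only an example, not something from the answer:

    spark.range(10).createOrReplaceTempView("example_view")
    spark.catalog.cacheTable("example_view")

    spark.catalog.isCached("example_view")   # True: the view is registered in the cache
    spark.catalog.clearCache()
    spark.catalog.isCached("example_view")   # False: all cached data has been dropped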

    Spark 1.x

    You can use the SQLContext.clearCache method, which

    Removes all cached tables from the in-memory cache.

    from pyspark.sql import SQLContext
    from pyspark import SparkContext
    
    sqlContext = SQLContext.getOrCreate(SparkContext.getOrCreate())
    ...
    sqlContext.clearCache()
    
  • 2020-12-29 21:09

    We use this quite often:

    for (id, rdd) in sc._jsc.getPersistentRDDs().items():
        rdd.unpersist()
        print("Unpersisted {} rdd".format(id))
    

    where sc is a SparkContext variable.
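
    For context, a self-contained sketch of how this loop is typically wired up, assuming a SparkSession is already available (the DataFrame below is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext          # the "sc" referred to above

    df = spark.range(1000).cache()
    df.count()                       # an action forces the cached blocks to materialize

    # getPersistentRDDs() is only exposed on the JVM context (_jsc), so the loop
    # goes through py4j; each entry maps an RDD id to its Java RDD handle.
    for (rdd_id, rdd) in sc._jsc.getPersistentRDDs().items():
        rdd.unpersist()
        print("Unpersisted {} rdd".format(rdd_id))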
