Drop Spark DataFrame from cache


Question


I am using Spark 1.3.0 with the Python API. While transforming huge DataFrames, I cache many of them for faster execution:

df1.cache()
df2.cache()

Once a DataFrame is no longer needed, how can I drop it from memory (i.e., un-cache it)?

For example, df1 is used throughout the code, while df2 is used for only a few transformations and is never needed afterwards. I want to forcefully drop df2 to release more memory.


Answer 1:


Just do the following:

df1.unpersist()
df2.unpersist()

Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
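For illustration, a minimal PySpark sketch of the pattern (the SparkSession builder and the toy data are assumptions for this example; on Spark 1.3 you would create DataFrames through a SQLContext instead):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("uncache-demo").getOrCreate()  # 2.x+ API, assumed

df1 = spark.range(1000000)               # toy stand-in for a huge DataFrame
df2 = df1.filter(df1["id"] % 2 == 0)     # derived DataFrame needed only briefly

df1.cache()
df2.cache()
df2.count()        # an action materializes df2 in the cache

df2.unpersist()    # releases df2's cached partitions; df1 stays cached
# unpersist(blocking=True) waits until the blocks are actually freed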




Answer 2:


If the DataFrame is registered as a table for SQL operations, e.g.

df.createGlobalTempView(tableName)  # or another method, depending on the Spark version

then the cache can be dropped with the following commands. (Of course, Spark also does this automatically when memory runs low.)

Spark >= 2.x

Here spark is a SparkSession instance.

  • Drop a specific table/df from cache

    spark.catalog.uncacheTable(tableName)
    
  • Drop all tables/dfs from cache

    spark.catalog.clearCache()
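Put together, a hedged Spark 2.x sketch (the view name "people" is made up; createOrReplaceTempView is used so the plain name works with the catalog calls, whereas a global temp view would be addressed as "global_temp.people"):

spark.range(10).createOrReplaceTempView("people")  # hypothetical view name
spark.catalog.cacheTable("people")                 # cache the view's data
spark.catalog.uncacheTable("people")               # drop just this table from the cache
spark.catalog.clearCache()                         # or drop everything cached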
    

Spark <= 1.6.x

  • Drop a specific table/df from cache

    sqlContext.uncacheTable(tableName)
    
  • Drop all tables/dfs from cache

    sqlContext.clearCache()
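And the 1.x equivalent, a sketch assuming an existing SparkContext sc (the table name "people" is again made up; registerTempTable was the 1.x API for registering a temp table):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([(1,), (2,)], ["id"])
df.registerTempTable("people")     # 1.x way to register a temp table
sqlContext.cacheTable("people")
sqlContext.uncacheTable("people")
sqlContext.clearCache()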
    


Source: https://stackoverflow.com/questions/32218769/drop-spark-dataframe-from-cache
