Question
I am using Spark 1.3.0 with the Python API. While transforming huge dataframes, I cache many DFs for faster execution:
df1.cache()
df2.cache()
Once a certain dataframe is no longer needed, how can I drop it from memory (or un-cache it)?
For example, df1 is used throughout the code, while df2 is only used for a few transformations and is never needed after that. I want to forcefully drop df2 to release more memory.
Answer 1:
Just do the following:
df1.unpersist()
df2.unpersist()
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
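As a minimal sketch of the pattern from the question (the SparkSession setup, sizes, and column names here are illustrative assumptions; on Spark 1.x you would build the DataFrames through a SQLContext instead):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("uncache-demo").getOrCreate()

df1 = spark.range(100000).withColumnRenamed("id", "key")    # used throughout the job
df2 = spark.range(100000).withColumnRenamed("id", "value")  # only needed briefly

df1.cache()
df2.cache()

# ... a few transformations that need df2 ...
joined = df1.join(df2, df1.key == df2.value)
joined.count()  # an action, so both caches are actually materialized

# df2 is no longer needed: release its cached partitions explicitly.
# blocking=True waits until the blocks are removed before continuing.
df2.unpersist(blocking=True)

# df1 stays cached for the rest of the job.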
Answer 2:
If the dataframe is registered as a table for SQL operations, for example
df.createGlobalTempView(tableName)  # or some other way depending on the Spark version
then the cache can be dropped with the following commands (of course, Spark also does this automatically):
Spark >= 2.x
Here, spark is a SparkSession object.
Drop a specific table/df from cache
spark.catalog.uncacheTable(tableName)
Drop all tables/dfs from cache
spark.catalog.clearCache()
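A minimal sketch of the 2.x path, assuming a SparkSession named spark and a throwaway view name my_table (both illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalog-uncache-demo").getOrCreate()

df = spark.range(1000)
df.createOrReplaceTempView("my_table")     # register the DataFrame as a temp view

spark.catalog.cacheTable("my_table")       # cache it through the catalog
print(spark.catalog.isCached("my_table"))  # True

spark.catalog.uncacheTable("my_table")     # drop this one table from the cache
spark.catalog.clearCache()                 # or drop everything that is cached
Note that if the view was created with createGlobalTempView, it lives in the global_temp database, so the name passed to uncacheTable would be "global_temp." + tableName.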
Spark <= 1.6.x
Drop a specific table/df from cache
sqlContext.uncacheTable(tableName)
Drop all tables/dfs from cache
sqlContext.clearCache()
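For completeness, a similar sketch for the 1.x API (assuming an existing SparkContext and the same illustrative table name):
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="sqlcontext-uncache-demo")
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.registerTempTable("my_table")      # 1.x way to register a temp table

sqlContext.cacheTable("my_table")     # cache through the SQLContext
sqlContext.uncacheTable("my_table")   # drop this one table from the cache
sqlContext.clearCache()               # or drop everything that is cached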
Source: https://stackoverflow.com/questions/32218769/drop-spark-dataframe-from-cache