Spark - Scope, Data Frame, and memory management

Question

I am curious about how scope works with DataFrames in Spark. In the example below, I have a list of files. Each file is loaded independently into a DataFrame, some operation is performed on it, and then dfOutput is written to disk.

val files = getListOfFiles("outputs/emailsSplit")

for (file <- files) {

  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("delimiter", "\t")          // Delimiter is tab
    .option("parserLib", "UNIVOCITY")   // Parser, which deals better with the email formatting
    .schema(customSchema)               // Schema of the table
    .load(file.toString)                // Input file

  val dfOutput = df.[stuff happens]

  dfOutput.write
    .format("com.databricks.spark.csv")
    .mode("overwrite")
    .option("header", "true")
    .save("outputs/sentSplit/sentiment" + file.toString + ".csv")

}

Is each DataFrame created inside the for loop discarded when its iteration finishes, or do they all stay in memory?
If they are not discarded, what is a better way to manage memory here?

Answer 1:

DataFrame objects are tiny. However, they can reference cached data on the Spark executors, and they can reference shuffle files on the Spark executors. When a DataFrame is garbage collected, that also causes its cached data and shuffle files to be deleted on the executors.

In your code there are no references to the DataFrames past the loop, so they are eligible for garbage collection. Garbage collection typically happens in response to memory pressure. If you worry about shuffle files filling up the disk, it may make sense to trigger an explicit GC to make sure shuffle files are deleted for DataFrames that are no longer referenced.
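A minimal sketch of the explicit GC suggested above, assuming the loop body from the question is moved into a function. The names `GcAfterEachFile`, `processAll`, and `process` are hypothetical and not part of Spark; `System.gc()` is only a request to the driver JVM, so nothing here guarantees when cleanup actually runs.

```scala
import java.io.File

object GcAfterEachFile {
  // `process` is a hypothetical stand-in for the read / transform / write
  // steps inside the question's for loop.
  def processAll(files: Seq[File], process: File => Unit): Unit = {
    for (file <- files) {
      // Any DataFrame created inside `process` is unreachable once it returns.
      process(file)
      // Ask the driver JVM to collect now. This is a hint, not a guarantee.
      System.gc()
    }
  }
}
```

Keeping the DataFrames local to `process` is what makes them collectable; a reference stored outside the loop (for example in a collection) would keep them alive regardless of the GC call.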

Depending on what you do with the DataFrame ([stuff happens]), it may be that no data is ever stored in memory. This is the default mode of operation in Spark. If you just want to read some data, transform it, and write it back out, it will all happen line by line, never storing any of it in memory. (Caching only happens when you explicitly ask for it.)
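For contrast, here is a sketch of what explicitly asking for caching might look like, assuming a DataFrame that is reused by more than one action. `ExplicitCaching` and `countTwice` are hypothetical names chosen for illustration; `cache()`, `count()`, `distinct()`, and `unpersist()` are standard DataFrame methods.

```scala
import org.apache.spark.sql.DataFrame

object ExplicitCaching {
  // Hypothetical helper: runs two actions on the same DataFrame, so the data
  // is cached first to avoid recomputing it for the second action.
  def countTwice(df: DataFrame): (Long, Long) = {
    df.cache()                               // mark for caching; nothing is stored yet
    try {
      val total = df.count()                 // first action materializes the cache
      val unique = df.distinct().count()     // second action reads from the cache
      (total, unique)
    } finally {
      df.unpersist()                         // release the cached data on the executors
    }
  }
}
```

The loop in the question never calls `cache()` or `persist()`, so none of this applies to it unless [stuff happens] does so.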

With all that, I suggest not worrying about memory management until you have problems.

Source: https://stackoverflow.com/questions/38023349/spark-scope-data-frame-and-memory-management

Tags

scala

apache-spark

spark-dataframe