Apache Spark does not delete temporary directories

庸人自扰 2020-11-27 15:48

After a Spark program completes, three temporary directories remain in the temp directory. Their names look like this: spark-2e389487-40cc-4a82-a5c7-353c0feefbb

6 Answers
  • 2020-11-27 16:33

    I don't know how to make Spark clean up those temporary directories, but I was able to prevent the creation of the snappy-XXX files. This can be done in two ways:

    1. Disable compression. Properties: spark.broadcast.compress, spark.shuffle.compress, spark.shuffle.spill.compress. See http://spark.apache.org/docs/1.3.1/configuration.html#compression-and-serialization
    2. Use LZF as the compression codec. Spark uses native libraries for Snappy and lz4, and because of the way JNI works, it has to unpack these libraries to a temporary location before loading them. LZF appears to be implemented in pure Java, so nothing needs to be unpacked.

    I do this during development; for production it is probably better to keep compression enabled and run a script that cleans up the temp directories.
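As a rough illustration, the two options above can be written out as property settings and rendered as `spark-submit --conf` arguments. This is only a sketch: the class and method names are made up for the example, and `spark.io.compression.codec` with the value `lzf` is an assumption, since the answer does not name the property used to select LZF.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: collects the settings discussed above and prints them
// in the form accepted by spark-submit (--conf key=value).
class CompressionFlags {

    /** Option 1: switch compression off for broadcast, shuffle and spill data. */
    static Map<String, String> disableCompression() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("spark.broadcast.compress", "false");
        props.put("spark.shuffle.compress", "false");
        props.put("spark.shuffle.spill.compress", "false");
        return props;
    }

    /** Option 2: keep compression but select LZF (property name is an assumption). */
    static Map<String, String> useLzf() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("spark.io.compression.codec", "lzf");
        return props;
    }

    /** Renders the properties as a space-separated list of --conf arguments. */
    static String toSubmitArgs(Map<String, String> props) {
        StringJoiner joiner = new StringJoiner(" ");
        props.forEach((key, value) -> joiner.add("--conf " + key + "=" + value));
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSubmitArgs(disableCompression()));
        System.out.println(toSubmitArgs(useLzf()));
    }
}
```

The same key/value pairs could equally be passed to `SparkConf.set(...)` in application code.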

  • 2020-11-27 16:41

    I assume you are using "local" mode only for testing purposes. I solved this issue by creating a custom temp folder before running a test and deleting it afterwards (in my case I use local mode in JUnit, so the temp folder is deleted automatically).

    You can change the path of Spark's temp folder with the spark.local.dir property.

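    import org.apache.spark.SparkConf;

    // Point Spark's scratch space at a directory that the test controls.
    SparkConf conf = new SparkConf().setMaster("local")
                                    .setAppName("test")
                                    .set("spark.local.dir", "/tmp/spark-temp");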
    

    After the test completes, I delete the /tmp/spark-temp folder manually.
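The create-then-delete step could look roughly like the sketch below, which uses only the JDK. The class name `SparkTempDir` and the `spark-temp-` prefix are hypothetical; in a real test the returned path would be passed as the value of spark.local.dir, and the cleanup would be called from the test's teardown.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical helper for tests: one scratch directory per test run.
class SparkTempDir {

    /** Creates a fresh directory intended to be used as spark.local.dir. */
    static Path create() throws IOException {
        return Files.createTempDirectory("spark-temp-");
    }

    /** Deletes the directory and everything below it, children before parents. */
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse order puts nested entries ahead of the directories that contain them.
            paths.sorted(Comparator.reverseOrder()).forEach(path -> {
                try {
                    Files.delete(path);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = create();
        // Stand-in for the files Spark would write during the test.
        Files.createDirectories(dir.resolve("blockmgr-example"));
        deleteRecursively(dir);
        System.out.println("exists after cleanup: " + Files.exists(dir));
    }
}
```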

  • 2020-11-27 16:42

    I do not think cleanup is supported in all scenarios. I would suggest writing a simple Windows scheduled task that cleans up nightly.
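A nightly job of that kind might be sketched as follows, using only the JDK. Everything here is an assumption made for illustration: the class name, the "spark-*" name pattern (taken from the directory names in the question), and the one-day age threshold, which is meant to avoid touching directories that a running application may still be using.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical cleanup job, intended to be started by a scheduler.
class StaleSparkDirCleaner {

    /** Lists "spark-*" directories directly under tmpRoot last modified more than maxAge before now. */
    static List<Path> findStale(Path tmpRoot, Duration maxAge, Instant now) throws IOException {
        List<Path> stale = new ArrayList<>();
        Instant cutoff = now.minus(maxAge);
        try (DirectoryStream<Path> candidates = Files.newDirectoryStream(tmpRoot, "spark-*")) {
            for (Path candidate : candidates) {
                if (!Files.isDirectory(candidate, LinkOption.NOFOLLOW_LINKS)) {
                    continue; // skip plain files and symbolic links
                }
                if (Files.getLastModifiedTime(candidate).toInstant().isBefore(cutoff)) {
                    stale.add(candidate);
                }
            }
        }
        return stale;
    }

    /** Removes a directory tree; files first, then each directory once it is empty. */
    static void delete(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException error) throws IOException {
                if (error != null) {
                    throw error;
                }
                Files.delete(dir);
                return FileVisitResult.CONTINUE;
            }
        });
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Paths.get(System.getProperty("java.io.tmpdir"));
        for (Path dir : findStale(tmp, Duration.ofDays(1), Instant.now())) {
            delete(dir);
            System.out.println("deleted " + dir);
        }
    }
}
```

Running `main` deletes matching directories under java.io.tmpdir, so the threshold should be chosen with the longest-running job in mind.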

  • 2020-11-27 16:45

    You need to call close() on the Spark context you created once the program has finished (on a Scala SparkContext the corresponding call is stop()).

  • 2020-11-27 16:51

    Three SPARK_WORKER_OPTS properties support cleanup of the worker application folders. They are copied here from the Spark documentation for reference:

    • spark.worker.cleanup.enabled (default: false). Enables periodic cleanup of worker/application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.

    • spark.worker.cleanup.interval (default: 1800, i.e. 30 minutes). Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.

    • spark.worker.cleanup.appDataTtl (default: 7*24*3600, i.e. 7 days). The number of seconds to retain application work directories on each worker. This is a time to live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
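To make the units concrete, the sketch below converts the default durations into seconds and assembles them into a SPARK_WORKER_OPTS value of the kind that would go into conf/spark-env.sh on a standalone worker. The class name is hypothetical, and enabling cleanup with the default interval and TTL is just an example choice.

```java
import java.time.Duration;

// Hypothetical helper: builds the -D flags for the three cleanup properties.
class WorkerCleanupOpts {

    /** Both durations are written out in whole seconds, as the properties expect. */
    static String build(boolean enabled, Duration interval, Duration appDataTtl) {
        return "-Dspark.worker.cleanup.enabled=" + enabled
                + " -Dspark.worker.cleanup.interval=" + interval.getSeconds()
                + " -Dspark.worker.cleanup.appDataTtl=" + appDataTtl.getSeconds();
    }

    public static void main(String[] args) {
        // 30 minutes = 1800 s, 7 days = 7 * 24 * 3600 = 604800 s
        String opts = build(true, Duration.ofMinutes(30), Duration.ofDays(7));
        System.out.println("export SPARK_WORKER_OPTS=\"" + opts + "\"");
    }
}
```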

  • 2020-11-27 16:54

    Setting spark.local.dir only moves Spark's own temp files; the snappy-xxx file will still exist in the /tmp dir. I didn't find a way to make Spark clear it automatically, but you can set a Java option:

    JVM_EXTRA_OPTS=" -Dorg.xerial.snappy.tempdir=$HOME/some-other-tmp-dir"
    

    to move it to another dir, since most systems have a small /tmp.
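The same property can also be set from code, as long as that happens before snappy-java first loads its native library. The sketch below is an assumption-laden illustration: the class name is made up, and the home directory is resolved through the user.home system property because the JVM does not expand a shell-style "~" by itself.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: prepares a directory and points snappy-java at it.
class SnappyTempDir {

    static final String PROPERTY = "org.xerial.snappy.tempdir";

    /** Replaces a leading "~" with the value of the user.home system property. */
    static Path expandHome(String path) {
        String home = System.getProperty("user.home");
        if (path.equals("~")) {
            return Paths.get(home);
        }
        if (path.startsWith("~/")) {
            return Paths.get(home, path.substring(2));
        }
        return Paths.get(path);
    }

    /** Creates the directory if needed and records it in the system property. */
    static Path configure(String dir) throws IOException {
        Path target = expandHome(dir).toAbsolutePath();
        Files.createDirectories(target);
        System.setProperty(PROPERTY, target.toString());
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Uses the first argument if present, otherwise a directory under java.io.tmpdir.
        String dir = args.length > 0
                ? args[0]
                : Paths.get(System.getProperty("java.io.tmpdir"), "some-other-tmp-dir").toString();
        System.out.println(PROPERTY + "=" + configure(dir));
    }
}
```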
