When performing a shuffle my Spark job fails and says "no space left on device", but when I run df -h it says I have free space left! Why does this happen, and how can I fix it?
Some other workarounds:
Explicitly removing the intermediate shuffle files. If you don't want to keep the RDD for later computation, you can call .unpersist(), which will flag the intermediate shuffle files for removal (you can also re-assign the RDD variable to None); see the sketch below.
Use more workers; adding more workers reduces, on average, the number of intermediate shuffle files needed per worker.
More about the "No space left on device" error on this databricks thread: https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html
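For instance, a minimal PySpark sketch of the unpersist() workaround (the RDD pipeline and app name here are invented for illustration):

    from pyspark import SparkContext

    sc = SparkContext(appName="unpersist-example")  # hypothetical app name
    pairs = sc.parallelize(range(1000)).map(lambda x: (x % 10, x))
    rdd = pairs.reduceByKey(lambda a, b: a + b)  # reduceByKey triggers a shuffle
    rdd.cache()
    result = rdd.collect()  # shuffle files are written when this action runs

    # Once the RDD is no longer needed, flag its shuffle files for removal:
    rdd.unpersist()
    rdd = None  # dropping the last reference also lets Spark clean up the shuffle files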
You need to also monitor df -i, which shows how many inodes are in use. On each machine, Spark creates M * R temporary files for shuffle, where M = number of map tasks and R = number of reduce tasks.
https://spark-project.atlassian.net/browse/SPARK-751
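A back-of-the-envelope calculation shows why inodes can run out long before bytes do (the task counts below are invented):

    # Each machine creates M * R temporary shuffle files.
    M = 2000  # number of map tasks (example value)
    R = 1000  # number of reduce tasks (example value)
    print(M * R)  # 2000000 files, and every file costs an inode,
                  # so df -h can show free bytes while df -i shows 100% use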
If you do indeed see that disks are running out of inodes, to fix the problem you can:
Decrease the number of partitions (see coalesce with shuffle = false).
Set spark.shuffle.consolidateFiles to true; see https://spark-project.atlassian.net/secure/attachment/10600/Consolidating%20Shuffle%20Files%20in%20Spark.pdf.
EDIT: Consolidating shuffle files was removed from Spark in version 1.6: https://issues.apache.org/jira/browse/SPARK-9808
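A minimal PySpark sketch of the coalesce fix (the partition counts are invented; spark.shuffle.consolidateFiles is omitted because, per the EDIT above, it no longer exists from Spark 1.6 on):

    from pyspark import SparkContext

    sc = SparkContext(appName="coalesce-example")  # hypothetical app name
    rdd = sc.parallelize(range(10**6), numSlices=1000)  # deliberately over-partitioned

    # Fewer partitions means fewer map/reduce tasks, hence fewer M * R shuffle files.
    # shuffle=False merges partitions without performing another full shuffle.
    smaller = rdd.coalesce(100, shuffle=False)
    print(smaller.getNumPartitions())  # 100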
On each worker machine, set the environment variable SPARK_LOCAL_DIRS to a location where you have free space. Setting the configuration variable spark.local.dir doesn't work from Spark 1.0 onward.
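For example, in SPARK_HOME/conf/spark-env.sh on each worker (the path is a placeholder for whatever mount has free space):

    export SPARK_LOCAL_DIRS=/mnt/bigdisk/spark-tmp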
By default Spark uses the /tmp directory to store intermediate data. If you actually do have space left on some device, you can alter this by creating the file SPARK_HOME/conf/spark-defaults.conf and adding the line below. Here SPARK_HOME is wherever the root directory of your Spark install is.

spark.local.dir SOME/DIR/WHERE/YOU/HAVE/SPACE
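To confirm which directory actually took effect after restarting, you can read the setting back from a running context; a quick PySpark check (not part of the original answer):

    from pyspark import SparkContext

    sc = SparkContext(appName="check-local-dir")  # hypothetical app name
    print(sc.getConf().get("spark.local.dir", "/tmp"))  # prints the default /tmp if unset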
I encountered a similar problem. By default, Spark uses /tmp to save intermediate files. While the job is running, you can run df -h to watch the used space of the filesystem mounted at / grow. When the device runs out of space, this exception is thrown. To solve the problem, I set SPARK_LOCAL_DIRS in SPARK_HOME/conf/spark-defaults.conf to a path on a filesystem with enough space.
What space is this?
Spark actually writes temporary output files from "map" tasks and RDDs to external storage called "scratch space", and by default, scratch space is in the local machine's /tmp directory.
/tmp is usually the operating system’s (OS) temporary output directory, accessed by OS users, and /tmp is typically small and on a single disk. So when Spark runs lots of jobs, long jobs, or complex jobs, /tmp can fill up quickly, forcing Spark to throw “No space left on device” exceptions.
Because Spark constantly writes to and reads from its scratch space, disk IO can be heavy and can slow down your workload. The best way to resolve this issue and to boost performance is to give as many disks as possible to handle scratch space disk IO. To achieve both, explicitly define the parameter spark.local.dir in the spark-defaults.conf configuration file, as follows:

spark.local.dir /data1/tmp,/data2/tmp,/data3/tmp,/data4/tmp,/data5/tmp,/data6/tmp,/data7/tmp,/data8/tmp
The above comma-delimited setting spreads Spark scratch space across 8 disks (make sure each /data* directory is mounted on a separate physical disk), under the /data*/tmp directories. You can use any subdirectory name instead of 'tmp'.
Source: https://developer.ibm.com/hadoop/2016/07/18/troubleshooting-and-tuning-spark-for-heavy-workloads/
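Since the exception only appears once a scratch disk actually fills, a small Python sketch can help confirm which configured directory is the culprit (the directory list mirrors the example above and is an assumption about your layout):

    import shutil

    # The eight scratch directories from the spark.local.dir example above.
    scratch_dirs = [f"/data{i}/tmp" for i in range(1, 9)]

    for d in scratch_dirs:
        total, used, free = shutil.disk_usage(d)  # sizes in bytes
        print(f"{d}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")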