When performing a shuffle, my Spark job fails with "No space left on device", but when I run df -h it says I have free space left. Why does this happen, and how can I fix it?
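One quick check (a sketch, assuming the default spark.local.dir of /tmp; adjust the path if you have overridden it) is to look at the filesystem that actually backs Spark's shuffle scratch space rather than only /, and at its inodes, since "No space left on device" can also be caused by inode exhaustion:

df -h /tmp    # space on the filesystem backing spark.local.dir (default /tmp)
df -i /tmp    # inode usage; exhausted inodes also raise "No space left on device"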
Another scenario for this error:
Problem:
My job was throwing the error "No space left on device". The job involves a lot of shuffling, so to counter the problem I initially used 20 nodes and then increased to 40 nodes, but the error kept occurring. I also tried everything else I could: changing spark.local.dir, repartitioning, custom partitioners, and parameter tuning (compression, spilling, memory, memory fraction, etc.). In addition, I used the r3.2xlarge instance type, which has 1 x 160 GB SSD, but the problem still happened.
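For reference, here is a minimal sketch of that kind of tuning via spark-submit. The property names are standard Spark configuration keys, but the path and the application jar are illustrative assumptions, not the exact settings used:

# Scratch directory for shuffle spill files (illustrative path), plus shuffle
# compression settings; all three are standard Spark configuration keys.
spark-submit \
  --conf spark.local.dir=/mnt/spark-local \
  --conf spark.shuffle.compress=true \
  --conf spark.shuffle.spill.compress=true \
  your-application.jar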
Solution:
I logged into one of the nodes, and executed df -h /
I found that the node had only one mounted EBS volume (8 GB) and the 160 GB SSD was not mounted. Then I looked at ls /dev/ and saw that the SSD was attached. The problem was not happening on every node in the cluster; the "No space left on device" error occurred only on the nodes where the SSD was not mounted, since those nodes were working with just the 8 GB EBS volume, of which only ~4 GB was free.
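A quick way to reproduce this check on a node (the device name is an illustrative assumption; the instance-store SSD typically shows up as something like /dev/sdb or /dev/xvdb):

df -h /        # only the ~8 GB root EBS volume is mounted
ls /dev/       # the SSD device (e.g. /dev/sdb) is attached but not mounted
lsblk          # if available, lists attached block devices and their mount points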
I created another bash script that launches the Spark cluster using the spark-ec2 script and then formats and mounts the disk:
# Launch the cluster with the spark-ec2 script (referred to below as <ec2-script>),
# then get the master's hostname
MASTER_HOST=$(<ec2-script> get-master $CLUSTER_NAME)
# Format the instance-store SSD on every slave and mount it under /mnt
ssh -o StrictHostKeyChecking=no root@$MASTER_HOST "cd /root/spark/sbin/ && ./slaves.sh mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/sdb && ./slaves.sh mount -o defaults,noatime,nodiratime /dev/sdb /mnt"
Also change the SPARK_HOME directory, as we have to point Spark at the directory that has more space available so the job runs smoothly.
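For example, here is a sketch of pointing Spark's scratch space at the newly mounted SSD. It assumes the /mnt mount point created by the script above; /mnt/spark is an illustrative path, and SPARK_LOCAL_DIRS is the standard environment variable for Spark's shuffle/scratch directories, which is what usually fills up:

mkdir -p /mnt/spark                   # create a scratch directory on the mounted SSD (illustrative path)
# then, in conf/spark-env.sh on each node:
export SPARK_LOCAL_DIRS=/mnt/spark    # Spark's shuffle/scratch files now land on the large volume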