Spark on Google's Dataproc failed due to java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/

问题

I've been using Spark/Hadoop on Dataproc for months both via Zeppelin and Dataproc console but just recently I got the following error.

Caused by: java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1530998908050_0001/blockmgr-9d6a2308-0d52-40f5-8ef3-0abce2083a9c/21/temp_shuffle_3f65e1ca-ba48-4cb0-a2ae-7a81dcdcf466 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

First, I got this type of error on Zeppelin notebook and thought it was Zeppelin issue. This error however, seems to occur randomly. I suspect It has something to do with one of the Spark workers not being able to write in that path. So, I googled and was suggested to delete files under /hadoop/yarn/nm-local-dir/usercache/ on each Spark worker and check if there are available disk space on each worker. After doing so, I still sometimes had this error. I also ran a Spark job on Dataproc, this similar error also occurred. I'm on Dataproc image version 1.2.

thanks

Peeranat F.

回答1:

Ok. We faced the same issue on GCP and the reason for this is resource preemption.

In GCP, resource preemption can be done by following two strategies,

Node preemption - removing nodes in cluster and replacing them
Container preemption - removing yarn containers.

This setting is done in GCP by your admin/ dev ops person to optimize cost and resource utilization of cluster, specially if it is being shared.

What you're stack trace tells me is that its node preemption. This error occurs randomly because some times the node that get preempted is your driver node that causes the app to fail all together.

You can see which nodes are preemptable in your GCP console.

回答2:

The following could be other possible causes:

The cluster uses preemptive workers (they can be deleted at any time), so their work is not completed and could cause inconsistent behaviors.
There exist resizing in the nodes during the spark job execution that causes to restart tasks/containers/executors.
Memory issues. The shuffle operations are usually done in-memory, but if the memory resources are exceeded, will spill over to disk.
Disk space in the workers is full due to a big amount of shuffle operations, or any other process that uses disk at the workers, for example logs.
Yarn kill tasks to make room for failed attempts.

So, I summarize the following actions as possible workarounds:

1.- Increase memory of the workers and master, this will discard if you face memory problems.

2.- Change image version of Dataproc.

3.- Change cluster properties to tune your cluster especially for mapreduce and spark.

来源：https://stackoverflow.com/questions/51229580/spark-on-googles-dataproc-failed-due-to-java-io-filenotfoundexception-hadoop

标签

apache-spark

Hadoop

google-cloud-storage

google-cloud-dataproc