What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

独厮守ぢ 2021-02-04 00:08

I'm deploying a Spark data processing job on an EC2 cluster. The job is small for the cluster (16 cores with 120G RAM in total); the largest RDD has only 76k+ rows. But heavily

2 Answers
  • 2021-02-04 00:50

    Check your logs for an error similar to this:

    ERROR 2015-05-12 17:29:16,984 Logging.scala:75 - Lost executor 13 on node-xzy: remote Akka client disassociated
    

    Every time you get this error, it is because you lost an executor. As for why you lost an executor, that is another story; again, check your logs for clues.
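
    As a rough way to look for those clues, the sketch below scans log lines for the "Lost executor" message quoted above. The regular expression is an assumption based on that single sample line (other Spark versions may word it differently), and the second sample line is made up for illustration.

```python
import re

# Pattern modeled on the single "Lost executor" line quoted above (an assumption;
# other Spark versions may format this message differently).
LOST_EXECUTOR = re.compile(r"Lost executor (\S+) on (\S+): (.*)")


def find_lost_executors(log_lines):
    """Return (executor_id, host, reason) for every lost-executor line."""
    hits = []
    for line in log_lines:
        match = LOST_EXECUTOR.search(line)
        if match:
            hits.append(match.groups())
    return hits


sample = [
    "ERROR 2015-05-12 17:29:16,984 Logging.scala:75 - Lost executor 13 on node-xzy: remote Akka client disassociated",
    # Hypothetical unrelated line, included only to show that it is skipped.
    "INFO 2015-05-12 17:29:17,001 Logging.scala:59 - some unrelated message",
]
lost = find_lost_executors(sample)
print(lost)
```

    The executor id and host it reports tell you which node's logs to read next.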

    One possibility is that YARN killed your job because it thinks you are using "too much memory".

    Check for something like this:

    org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl  - Container [<edited>] is running beyond physical memory limits. Current usage: 18.0 GB of 18 GB physical memory used; 19.4 GB of 37.8 GB virtual memory used. Killing container.
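
    To spot this kind of line automatically, here is a minimal sketch that pulls the physical memory usage and limit out of the message. The pattern assumes the exact wording and GB units of the sample above; real NodeManager logs may use other units, so treat it as a starting point.

```python
import re

# Pattern modeled on the NodeManager message quoted above (assumes GB units).
LIMIT_LINE = re.compile(
    r"running beyond physical memory limits\. "
    r"Current usage: ([\d.]+) GB of ([\d.]+) GB physical memory used"
)


def parse_memory_kill(line):
    """Return (used_gb, limit_gb) if the line reports a memory-limit kill, else None."""
    match = LIMIT_LINE.search(line)
    if match is None:
        return None
    used_gb, limit_gb = (float(group) for group in match.groups())
    return used_gb, limit_gb


sample = (
    "ContainersMonitorImpl  - Container [<edited>] is running beyond physical "
    "memory limits. Current usage: 18.0 GB of 18 GB physical memory used; "
    "19.4 GB of 37.8 GB virtual memory used. Killing container."
)
result = parse_memory_kill(sample)
print(result)
```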
    

    Also see: http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html

    The current state of the art is to increase spark.yarn.executor.memoryOverhead until the job stops failing. We do have plans to try to automatically scale this based on the amount of memory requested, but it will still just be a heuristic.
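
    The "increase until it stops failing" approach can be sketched as below. The starting overhead (1024 MB), the growth factor, the cap, and the "16g" executor memory are all hypothetical values chosen for illustration; only the `--executor-memory` and `--conf` flags and the `spark.yarn.executor.memoryOverhead` property name come from Spark. Newer Spark releases document this setting as `spark.executor.memoryOverhead`, so check the name for your version.

```python
def next_overhead_mb(current_mb, factor=1.5, cap_mb=8192):
    """Hypothetical retry policy: grow the overhead by `factor`, up to `cap_mb`."""
    return min(int(current_mb * factor), cap_mb)


def submit_args(executor_memory, overhead_mb):
    """Build the spark-submit arguments that set executor memory and overhead."""
    return [
        "--executor-memory", executor_memory,
        "--conf", f"spark.yarn.executor.memoryOverhead={overhead_mb}",
    ]


# Illustrative values only: start at 1024 MB and list the first three attempts.
overhead = 1024
attempts = []
for _ in range(3):
    attempts.append(submit_args("16g", overhead))
    overhead = next_overhead_mb(overhead)

for args in attempts:
    print(" ".join(args))
```

    In practice you would rerun the job with each set of arguments and stop at the first one that no longer loses executors.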

  • 2021-02-04 00:53

    I was also getting this error:

    org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
    

    and looking further in the log I found:

    Container killed on request. Exit code is 143
    

    After searching for the exit code, I realized that it is mainly related to memory allocation. So I checked the amount of memory I had configured for the executors and found that, by mistake, I had configured 7g for the driver and only 1g for each executor. After increasing the executor memory, my Spark job ran successfully.
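
    For reference, exit codes above 128 conventionally mean "128 + signal number", so 143 corresponds to signal 15 (SIGTERM): the container was asked to terminate, which is what happens when the resource manager kills it. The sketch below decodes the exit code and applies a simple check for the misconfiguration described above. The helper names are hypothetical, and "driver larger than executor" is only a heuristic matching this case, not a general rule.

```python
import signal

# Only the "m" and "g" suffixes are handled in this sketch.
UNITS_MB = {"m": 1, "g": 1024}


def to_mb(setting):
    """Convert a memory setting such as '7g' or '512m' to MiB."""
    return int(setting[:-1]) * UNITS_MB[setting[-1].lower()]


def signal_for_exit_code(code):
    """Map an exit code above 128 to its signal name (128 + signal number)."""
    return signal.Signals(code - 128).name if code > 128 else None


def looks_swapped(driver_memory, executor_memory):
    """Heuristic: flag configs where the driver gets more memory than an executor."""
    return to_mb(driver_memory) > to_mb(executor_memory)


print(signal_for_exit_code(143))
print(looks_swapped("7g", "1g"))
```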
