Spark ExecutorLostFailure

悲&欢浪女 2021-02-04 18:02

I'm trying to run Spark 1.5 on Mesos in cluster mode. I'm able to launch the dispatcher and to run spark-submit, but when I do, the Spark driver fails with an ExecutorLostFailure.

6 Answers
  •  猫巷女王i
    2021-02-04 18:32

    I was getting similar issues and used trial and error to narrow down the cause and a solution. I may not be able to give the 'real' reason, but working through it the way below can help you resolve it.

    Try launching spark-shell with memory and core parameters:

    # Notes on the --conf settings:
    #   spark.storage.memoryFraction=1           important
    #   spark.akka.frameSize=200                 keep it sufficiently high; above 100 is a good thing
    #   spark.yarn.executor.memoryOverhead=2048  in MB; not really used by the shell, but good for spark-submit
    #   spark.yarn.driver.memoryOverhead=400     in MB, minimum 384; not really used by the shell, but good for spark-submit
    spark-shell \
      --driver-memory 2g \
      --executor-memory 7g \
      --num-executors 8 \
      --executor-cores 4 \
      --conf "spark.storage.memoryFraction=1" \
      --conf "spark.akka.frameSize=200" \
      --conf "spark.default.parallelism=100" \
      --conf "spark.core.connection.ack.wait.timeout=600" \
      --conf "spark.yarn.executor.memoryOverhead=2048" \
      --conf "spark.yarn.driver.memoryOverhead=400"
    

    Now, if the total memory (driver memory + number of executors * executor memory) exceeds what is available on the cluster, it's going to throw an error. I believe that's not the case for you.

    Keep executor cores small, say, 2 or 4.

    executor memory = (total memory - driver memory) / number of executors, actually a little less than that, to leave some headroom.
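
    As a worked example with made-up numbers (the 64 GB figure is an assumption for illustration, not from the question), this is how the flags in the command above could be derived:

    # hypothetical: 64 GB usable in the cluster, 2 GB reserved for the driver
    # (64g - 2g) / 8 executors = 7.75g per executor; round down for headroom
    --driver-memory 2g --executor-memory 7g --num-executors 8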

    • Try increasing the number of executors while reducing the executor memory, to keep total memory under control.
    • Once spark-shell starts, go to the job in the job monitor and check the 'Executors' tab: you may see that even if you asked for, say, 20 executors, only 10 were actually created. That's an indication of how far you can go.
    • Reduce the number of executors to a suitable number below that maximum, and change the 'executor memory' parameter accordingly.
    • Once the number of executors you ask for in spark-shell matches the number you actually get, you're 'almost' good.

    Next, run your code at the spark-shell prompt and check how much memory is being utilized in the Executors tab (a sketch of this follows the list below).

    • If you find that the last few 'collection' steps are taking a lot of time, the executor memory needs to be increased.
    • If increasing the executor memory pushes you past the limit we calculated earlier, then decrease the number of executors and assign more memory to each.
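
    A minimal way to exercise executor memory is to cache a dataset and force it to materialize while you watch the Executors tab. This is only a sketch: the RDD size and partition count are arbitrary, picked to be large enough to show up in the UI:

    spark-shell --driver-memory 2g --executor-memory 7g <<'EOF'
    // cache a moderately large RDD and materialize it, then check
    // storage memory used per executor in the web UI
    val data = sc.parallelize(1L to 100000000L, 100)
    data.cache()
    println(data.count())
    EOF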

    What I understood (empirically, though) is that the following types of problems can occur:

    • a reduce/shuffle operation running for a long time and hitting a time-out
    • a long-running thread creating non-responsive actors
    • not enough Akka frame size to watch over too many threads (tasks)
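
    These loosely map to settings already shown in the command above. The property names are from the Spark 1.x era (the spark.akka.* settings were removed in Spark 2.0), and spark.akka.timeout is my addition rather than something from the original flags, so verify all of these against your version's docs:

    # slow reduce/shuffle timing out:
    --conf "spark.core.connection.ack.wait.timeout=600"
    # non-responsive actors (raise the node communication timeout):
    --conf "spark.akka.timeout=600"
    # too many tasks for the frame size:
    --conf "spark.akka.frameSize=200"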

    I hope this helps you get to the right configuration. Once that's settled, you can pass the same configuration when submitting a job with spark-submit.
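
    Carrying the same flags over to spark-submit might look like the sketch below. The dispatcher host, application class, and jar path are placeholders, and note that --num-executors applies to YARN; on Mesos (as in the question) you would cap capacity with spark.cores.max instead:

    # placeholders: dispatcher host, app class, and jar path
    # (in Mesos cluster mode the jar must be reachable from the cluster)
    spark-submit \
      --master mesos://<dispatcher-host>:7077 \
      --deploy-mode cluster \
      --class com.example.MyApp \
      --driver-memory 2g \
      --executor-memory 7g \
      --conf "spark.cores.max=32" \
      --conf "spark.default.parallelism=100" \
      --conf "spark.akka.frameSize=200" \
      --conf "spark.core.connection.ack.wait.timeout=600" \
      /path/to/my-app.jar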

    Note: my cluster had a lot of resource constraints, with multiple users working on it in ad-hoc ways, which made the available resources uncertain, so the calculations had to stay on the 'safer' side. This resulted in a lot of iterative experiments.
