I'm trying to run Spark 1.5 on Mesos in cluster mode. I'm able to launch the dispatcher and to run spark-submit. But when I do so, the Spark driver fails with the following:
I was getting similar issues and used some trial and error to find the cause and solution. I may not be able to give the 'real' reason, but trying it out the way below may help you resolve it.
Try launching spark-shell with explicit memory and core parameters:

spark-shell \
  --driver-memory=2g \
  --executor-memory=7g \
  --num-executors=8 \
  --executor-cores=4 \
  --conf "spark.storage.memoryFraction=1" \
  --conf "spark.akka.frameSize=200" \
  --conf "spark.default.parallelism=100" \
  --conf "spark.core.connection.ack.wait.timeout=600" \
  --conf "spark.yarn.executor.memoryOverhead=2048" \
  --conf "spark.yarn.driver.memoryOverhead=400"

A few notes on these settings: spark.storage.memoryFraction=1 is the important one. spark.akka.frameSize should be kept sufficiently high; going above 100 is probably a good thing. spark.yarn.executor.memoryOverhead (in MB) and spark.yarn.driver.memoryOverhead (in MB, minimum 384) are not really relevant for the shell, but they are good things to set for spark-submit.
Now, if the total memory (driver memory + number of executors × executor memory) goes beyond the available memory, it's going to throw an error. I believe that's not the case for you.
Keep executor cores small, say 2 or 4.
Executor memory ≈ (total memory - driver memory) / number of executors, actually a little less than that.
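To make that concrete with the values above: 2g for the driver plus 8 × 7g for the executors comes to 58g in total, which has to fit within what the cluster can actually allocate; and, the other way around, with roughly 58g usable and a 2g driver, (58g - 2g) / 8 = 7g per executor, so rounding down a little leaves some headroom.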
Next, run your code at the spark-shell prompt and check how much memory is actually being used in the Executors tab of the Spark UI.
What I understood (empirically, at least) is that this type of problem tends to occur when these memory and parallelism settings are off.
I hope this helps you find the right configuration. Once that's settled, you can use the same configuration when submitting a spark-submit job.
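As a rough sketch of what that carry-over could look like for a cluster-mode submit to the Mesos dispatcher (the dispatcher URL, class name and jar path below are placeholders; note also that on Mesos the executor count is usually governed by --total-executor-cores / spark.cores.max rather than --num-executors):

spark-submit \
  --master mesos://<dispatcher-host>:7077 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --driver-memory 2g \
  --executor-memory 7g \
  --total-executor-cores 32 \
  --conf "spark.default.parallelism=100" \
  --conf "spark.core.connection.ack.wait.timeout=600" \
  hdfs:///path/to/my-app.jar

In cluster mode the application jar has to sit somewhere the Mesos agents can reach (for example HDFS or an HTTP server), which is why an hdfs:// path is shown here.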
Note: I was working on a cluster with a lot of resource constraints and multiple users using it in ad-hoc ways, which made the available resources uncertain, so the calculations had to stay on the 'safer' side. This resulted in a lot of iterative experiments.
Your executors might be getting lost for many different reasons, but the information you're getting (and showing) is not enough to understand why.
Even though I have no experience with Mesos in cluster mode, it seems to me that what you show as the executor logs is somehow incomplete: if you could get their complete logs, you would see they are very helpful for determining the cause of such failures. I took a look at:
http://mesos.apache.org/documentation/latest/configuration/
and you should get the logs you're looking for from their stderr (maybe you're just showing their stdout?). You could also try to use the --log_dir=VALUE parameter to dump their logs and understand the situation better.
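For illustration only (the paths and port below are the common defaults, not something taken from your setup): --log_dir controls where the Mesos daemons write their own logs, while each executor's stdout/stderr ends up in the agent's sandbox under its work_dir.

mesos-slave --master=<mesos-master>:5050 \
  --log_dir=/var/log/mesos \
  --work_dir=/var/lib/mesos

# Executor sandboxes (the stdout/stderr you want) live under the work_dir, roughly:
# /var/lib/mesos/slaves/<agent-id>/frameworks/<framework-id>/executors/<executor-id>/runs/latest/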
Setting the parallelism helps. Try increasing the parallelism in your cluster using the following parameter:
--conf "spark.default.parallelism=100"
ExecutorLostFailure (executor 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 lost) is thrown when a task fails because the executor it was running on was lost. This may happen because the task crashed the JVM.
In your event logs or in the Spark UI, check for large GC times. If you have a persist, removing it can free up more memory for your executors (at the expense of running some stages more than once). If you are using a broadcast variable, see if you can reduce its footprint. Or just add more memory.
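One way to actually see those GC times in the executor logs (a sketch; the class name and jar are placeholders) is to turn on GC logging through spark.executor.extraJavaOptions when you submit:

spark-submit \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --class <your-main-class> \
  <your-app.jar>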
Almost always when I had 'executor lost' failures in Spark, adding more memory solved them. Try increasing the values of the --executor-memory and/or --driver-memory options that you pass to spark-submit.
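For example (the numbers are arbitrary, just raise them relative to whatever you use now, and the class/jar are placeholders):

spark-submit \
  --driver-memory 4g \
  --executor-memory 8g \
  --class <your-main-class> \
  <your-app.jar>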