Spark ExecutorLostFailure

悲&欢浪女 2021-02-04 18:02

I'm trying to run Spark 1.5 on Mesos in cluster mode. I'm able to launch the dispatcher and to run spark-submit. But when I do so, the Spark driver fails with the following error.

6 Answers
  • 2021-02-04 18:32

    I was getting similar issues and used some trial and error to find the cause and a solution. I may not be able to give the 'real' reason, but trying things out the following way may help you resolve it.

    Try launching spark-shell with memory and core parameters:

    spark-shell \
      --driver-memory=2g \
      --executor-memory=7g \
      --num-executors=8 \
      --executor-cores=4 \
      --conf "spark.storage.memoryFraction=1" \
      --conf "spark.akka.frameSize=200" \
      --conf "spark.default.parallelism=100" \
      --conf "spark.core.connection.ack.wait.timeout=600" \
      --conf "spark.yarn.executor.memoryOverhead=2048" \
      --conf "spark.yarn.driver.memoryOverhead=400"

    Notes on the parameters above:

    • spark.storage.memoryFraction is important.
    • spark.akka.frameSize: keep it sufficiently high; higher than 100 is usually a good thing.
    • spark.yarn.executor.memoryOverhead (in MB): not really relevant for the shell, but a good thing for spark-submit.
    • spark.yarn.driver.memoryOverhead (in MB): not really relevant for the shell, but a good thing for spark-submit; the minimum is 384.
    

    Now, if the total memory (driver memory + num executors * executor memory) goes beyond the available memory, it's going to throw an error. I believe that's not the case for you.

    Keep the executor cores small, say 2 or 4.

    executor memory = (total memory - driver memory) / number of executors ... actually a little less.
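
    For instance, a rough back-of-the-envelope sketch (the 64g total is a made-up assumption, not a figure from the question):

    # hypothetical: 64g of memory available to the application
    # executor memory ~ (64g - 2g driver) / 8 executors ~ 7.75g -> use 7g to stay a little below
    spark-shell --driver-memory=2g --num-executors=8 --executor-cores=4 --executor-memory=7g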

    • Try increasing the number of executors while reducing the executor memory, to keep the total memory under control.
    • Once spark-shell starts, go to the job monitor and check the 'Executors' tab: you can see that even if you ask for, say, 20 executors, only 10 are getting created. That's an indication of how far you can go.
    • Reduce the number of executors to a suitable number below that maximum and change the executor memory parameter accordingly.
    • Once the number of executors you request in spark-shell matches the number that actually gets created, you're 'almost' good.

    Next is to run the code in the spark-shell prompt and check how much memory is getting utilized in the Executors tab.

    • If you find that the last few 'collect' steps are taking a lot of time, the executor memory needs to be increased.
    • If increasing the executor memory goes beyond the limit we calculated earlier, then decrease the number of executors and assign more memory to each.

    What I understood (empirically, though) is that the following types of problems can occur:

    • a reduce/shuffle operation running for a long time and hitting a time-out
    • a long-running thread creating non-responsive actors
    • not enough Akka frame size to handle too many threads (tasks)

    I hope this helps you get to the right configuration. Once that's set, you can use the same configuration when submitting a spark-submit job.
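
    Once you've settled on numbers in the shell, the spark-submit equivalent might look like the sketch below; the class name and jar path are placeholders, and the memoryOverhead values mirror the YARN-oriented ones suggested above:

    # placeholders: replace the class, jar, and sizes with your own
    spark-submit \
      --class com.example.YourApp \
      --driver-memory 2g \
      --executor-memory 7g \
      --num-executors 8 \
      --executor-cores 4 \
      --conf "spark.default.parallelism=100" \
      --conf "spark.akka.frameSize=200" \
      --conf "spark.core.connection.ack.wait.timeout=600" \
      --conf "spark.yarn.executor.memoryOverhead=2048" \
      --conf "spark.yarn.driver.memoryOverhead=400" \
      your-app.jar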

    Note: I have a cluster with a lot of resource constraints and multiple users using it in ad-hoc ways, which makes the available resources uncertain, so calculations have to stay on the 'safer' side. This resulted in a lot of iterative experiments.

  • 2021-02-04 18:34

    Your executors might be getting lost for many different reasons, but the information you're getting (and showing) is not enough to understand why.

    Even though I have no experience with Mesos in cluster mode, it seems to me that what you show as the executor logs is somehow incomplete: if you can get the complete logs, you will see they are very helpful for determining the cause of such failures. I took a look at:

    http://mesos.apache.org/documentation/latest/configuration/

    and you should be able to get the logs you're looking for from the executors' stderr (maybe you're only showing their stdout?). You could also try the --log_dir=VALUE parameter to dump their logs and better understand the situation.
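
    A minimal sketch of what that might look like when starting the Mesos agent (the ZooKeeper URL and directories are placeholders, and the sandbox path is an assumption about the default Mesos work-directory layout):

    # placeholders: adjust the ZooKeeper URL and directories to your installation
    mesos-slave --master=zk://zk-host:2181/mesos \
                --work_dir=/var/lib/mesos \
                --log_dir=/var/log/mesos

    # executor stdout/stderr should then end up in the per-run sandbox under work_dir, e.g.
    # /var/lib/mesos/slaves/<agent-id>/frameworks/<framework-id>/executors/<executor-id>/runs/latest/stderr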

  • 2021-02-04 18:40

    Setting the parallelism helps. Try increasing the parallelism in your cluster using the following parameter:

    --conf "spark.default.parallelism=100" 
    
  • 2021-02-04 18:41

    ExecutorLostFailure (executor 2bbe0c3b-433b-45e0-938b-f4d4532df129-S31 lost) is thrown when a task fails because the executor it was running on was lost. This may happen because the task crashed the executor's JVM.

  • 2021-02-04 18:54

    In your event logs or the UI, check for large GC times. If you have a persist, removing it can free up more memory for your executors (at the expense of running stages more than once). If you are using a broadcast, see if you can reduce its footprint. Or just add more memory.

  • 2021-02-04 18:57

    Almost always when I had 'executor lost' failures in Spark, adding more memory solved the problem. Try increasing the values of the --executor-memory and/or --driver-memory options that you pass to spark-submit.
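
    As a hedged sketch (the class, jar, and sizes are placeholders to adapt to your job and cluster):

    # placeholders: pick sizes that fit within your cluster's available memory
    spark-submit \
      --class com.example.YourApp \
      --driver-memory 4g \
      --executor-memory 8g \
      your-app.jar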
