I'm running a Spark cluster in standalone mode and submitting an application using spark-submit. In the Spark UI stage section I found an executing stage with a large execution time ( > 10h, when usual
Likely the interesting part of the log is this:
16/11/25 10:06:13 INFO Worker: Executor app-20161109161724-0045/1 finished with state KILLED exitStatus 137
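Exit status 137 strongly suggests a resource issue, either memory or CPU cores. It means the executor process was terminated with SIGKILL (128 + 9), which is the signal the system OOM killer sends.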
Given that you can fix the issue by rerunning the stage, it could be that all cores are somehow already allocated (maybe you also have a Spark shell running?).
This is a common issue with standalone Spark setups (everything on one host).
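As a quick sanity check on what that exit status encodes, here is a minimal shell sketch. It assumes the usual convention that a status above 128 means "terminated by signal (status - 128)"; the variable names are only for illustration.

```shell
# Exit status reported in the worker log above.
status=137

# By shell convention, a status above 128 means the process was
# terminated by signal number (status - 128).
if [ "$status" -gt 128 ]; then
  sig=$((status - 128))
  echo "terminated by signal $sig (SIG$(kill -l "$sig"))"
else
  echo "exited on its own with status $status"
fi
```

If the kernel OOM killer was responsible, the kernel log on the worker host (for example via dmesg) usually records which process it killed.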
Either way I would try the following things in order:
1. Raise spark.storage.memoryFraction to pre-allocate more memory for storage and prevent the system OOM killer from randomly giving you that 137 on a big stage.
2. Lower spark.deploy.defaultCores: set it to 3 or even 2 (on an Intel quad-core, assuming 8 vcores).
3. spark.executor.memory needs to go up.
4. Adding export SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.kryoserializer.buffer.mb=10 -Dspark.cleaner.ttl=43200" to the end of your spark-env.sh might do the trick by forcing the metadata cleanup to run more frequently.
One of these should do the trick in my opinion.
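For the memory and core settings, a rough sketch of passing them per application on the spark-submit command line could look like this. All values, the application class, and the jar name are placeholders I made up, not recommendations. Note also that spark.deploy.defaultCores is read by the standalone master, so this sketch uses spark.cores.max, the per-application cap, as a stand-in; that substitution is my assumption, not part of the advice above. The master URL is assumed to be configured elsewhere. The script only prints the command (a dry run) so you can inspect it first.

```shell
# Placeholder values, not recommendations; tune them for your own host.
STORAGE_FRACTION=0.7   # spark.storage.memoryFraction (legacy memory setting)
MAX_CORES=2            # per-application core cap (spark.cores.max)
EXECUTOR_MEMORY=4g     # spark.executor.memory

# Hypothetical application class and jar.
APP_CLASS=com.example.MyApp
APP_JAR=my-app.jar

# Assemble the command and print it instead of running it (dry run).
cmd="spark-submit \
--conf spark.storage.memoryFraction=$STORAGE_FRACTION \
--conf spark.cores.max=$MAX_CORES \
--conf spark.executor.memory=$EXECUTOR_MEMORY \
--class $APP_CLASS $APP_JAR"
echo "$cmd"
```

Changing one setting at a time, in the order listed, makes it easier to tell which one actually resolved the kill.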