Spark application kills executor


I'm running a Spark cluster in standalone mode and an application using spark-submit. In the Spark UI stages section I found a stage with a very large execution time (> 10h, when usual …

2 Answers
  • 2021-02-14 20:50

    Likely the interesting part of the log is this:

    16/11/25 10:06:13 INFO Worker: Executor app-20161109161724-0045/1 finished with state KILLED exitStatus 137
    

    Exit code 137 strongly suggests a resource issue, either memory or CPU cores. Given that you can fix the issue by rerunning the stage, it could be that all cores are somehow already allocated (maybe you also have a Spark shell running?). This is a common issue with standalone Spark setups (everything on one host).
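
    If you suspect something is already holding the cores, a quick check is to look at which JVMs are running and what the standalone master has handed out. This is only a sketch; the master URL/port below is the default, adjust to your setup.

      # See whether another driver/shell is already holding cores.
      # jps lists local JVMs; look for SparkSubmit, Master, Worker and
      # CoarseGrainedExecutorBackend entries.
      jps -l
      # The standalone master web UI (http://<master-host>:8080 by default)
      # also lists each running application with the cores/memory it claimed.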

    Either way I would try the following things in order:

    1. Raise the storage memory fraction spark.storage.memoryFraction to pre-allocate more memory for storage and prevent the system OOM killer from randomly giving you that 137 on a big stage.
    2. Set a lower number of cores for your application to rule out something pre-allocating those cores before your stage is run. You can do this via spark.deploy.defaultCores; set it to 3 or even 2 (on an Intel quad-core, assuming 8 vcores).
    3. Outright allocate more RAM to Spark: spark.executor.memory needs to go up.
    4. Maybe you are running into an issue with metadata cleanup here, which is also not unheard of in local deployments. In that case, adding
       export SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 -Dspark.cleaner.ttl=43200"
       to the end of your spark-env.sh might do the trick by forcing the metadata cleanup to run more frequently.

    One of these should do the trick, in my opinion; a sketch of passing these settings to spark-submit is shown below.
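
    A minimal sketch of how items 1-3 could be passed on the command line. The master URL, class name, jar and the concrete values are illustrative assumptions, not recommendations for your cluster.

      # Item 1: spark.storage.memoryFraction applies to the legacy (pre-1.6)
      #         memory manager; the unified manager in 1.6+ uses
      #         spark.memory.fraction / spark.memory.storageFraction instead.
      # Item 2: spark.deploy.defaultCores is read by the standalone master, so it
      #         is normally set on the master; per application you can cap cores
      #         with --total-executor-cores (i.e. spark.cores.max).
      # Item 3: --executor-memory sets spark.executor.memory.
      spark-submit \
        --master spark://master-host:7077 \
        --class com.example.MyApp \
        --executor-memory 4g \
        --total-executor-cores 3 \
        --conf spark.storage.memoryFraction=0.7 \
        myapp.jar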

  • 2021-02-14 21:07

    Armin's answer is very good. I just wanted to point out what worked for me.

    The same problem went away when I increased the parameter spark.default.parallelism from 28 (which was the number of executors I had) to 84 (which is the number of available cores).

    NOTE: this is not a rule for setting this parameter; it is only what worked for me.
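
    For reference, a sketch of how that parameter can be set at submit time; the value 84 only reflects the cluster described above.

      # Raise the default parallelism used for shuffles/reduces; it can also be
      # persisted in conf/spark-defaults.conf as:
      #   spark.default.parallelism   84
      spark-submit --conf spark.default.parallelism=84 ...   # plus the usual master/class/jar arguments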

    UPDATE: This approach is also backed by Spark's documentation:

    Sometimes, you will get an OutOfMemoryError not because your RDDs don’t fit in memory, but because the working set of one of your tasks, such as one of the reduce tasks in groupByKey, was too large. Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc) build a hash table within each task to perform the grouping, which can often be large. The simplest fix here is to increase the level of parallelism, so that each task’s input set is smaller. Spark can efficiently support tasks as short as 200 ms, because it reuses one executor JVM across many tasks and it has a low task launching cost, so you can safely increase the level of parallelism to more than the number of cores in your clusters.
