Question
We have a fairly complex application that runs on Spark Standalone. In some cases, tasks from one of the workers randomly block in the RUNNING state for an indefinite amount of time.
Extra info:
- There aren't any errors in the logs.
- I ran with the logger in debug mode and didn't see any relevant messages (I can see when a task starts, but then there is no further activity for it).
- The jobs work fine if I have only one worker.
- The same job may execute a second time without any issues, in a reasonable amount of time.
- I don't have any really big partitions that could cause delays for some of the tasks.
- In Spark 2.0 I moved from RDDs to Datasets and I have the same issue.
- In Spark 1.4 I was able to work around the issue by turning on speculation (see the sketch after this list), but in Spark 2.0 the blocking tasks come from different workers (whereas in 1.4 the blocking tasks were on only one worker), so speculation doesn't fix my issue.
- I have the issue in multiple environments, so I don't think it's hardware related.
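For reference, a minimal sketch of how speculation can be turned on via the Spark 2.x Scala API; the app name and master URL below are placeholders, not values from my setup:

    import org.apache.spark.sql.SparkSession

    // Placeholder app name and standalone master URL; substitute your own.
    val spark = SparkSession.builder()
      .appName("speculation-example")
      .master("spark://master-host:7077")
      // Re-launch slow-running tasks speculatively on other executors.
      .config("spark.speculation", "true")
      .getOrCreate()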
Has anyone experienced something similar? Any suggestions on how I could identify the issue?
Thanks a lot!
Later Edit: I think I'm facing the same issue described here: Spark Indefinite Waiting with "Asked to send map output locations for shuffle" and here: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-stalling-during-shuffle-maybe-a-memory-issue-td6067.html, but neither has a working solution.
The last thing in the log, repeated infinitely, is:

    [dispatcher-event-loop-18] DEBUG org.apache.spark.scheduler.TaskSchedulerImpl - parentName: , name: TaskSet_2, runningTasks: 6
Answer 1:
The issue was fixed for me by allocating just one core per executor. If I have executors with more than one core, the issue appears again. I haven't yet understood why this happens, but those facing a similar issue can try this.
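For anyone who wants to try the workaround, here is a minimal sketch using the Spark 2.x Scala API; the app name and master URL are placeholders. Note that in standalone mode an executor takes all available cores on a worker by default, so spark.executor.cores must be set explicitly:

    import org.apache.spark.sql.SparkSession

    // Placeholder app name and standalone master URL; substitute your own.
    val spark = SparkSession.builder()
      .appName("one-core-per-executor")
      .master("spark://master-host:7077")
      // The workaround: limit each executor to a single core.
      .config("spark.executor.cores", "1")
      .getOrCreate()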
Source: https://stackoverflow.com/questions/39465093/spark-tasks-blockes-randomly-on-standalone-cluster