How to deal with tasks that run too long (compared to others in the job) in yarn-client mode?

醉酒成梦

We run Spark on a cluster in yarn-client mode for several business computations, but sometimes a single task runs for far too long.

We don't set a timeout but I th

2 Answers

生来不讨喜

2021-02-04 08:58
As far as I know, Spark has no built-in way to kill a task just because it is taking too long.

However, I found a way to handle this using speculation.

With speculation enabled, if one or more tasks in a stage are running slowly, Spark re-launches copies of them and uses whichever copy finishes first.
```
spark.speculation                  true
spark.speculation.multiplier       2
spark.speculation.quantile         0
```
Note: spark.speculation.quantile is the fraction of a stage's tasks that must complete before speculation is enabled for that stage, so setting it to 0 means speculation can kick in right from the first task. Use it with caution. I use it because some of my jobs slow down over time due to GC. You should know when it fits your workload; it's not a silver bullet.
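To make the interaction between the two settings concrete, here is a simplified model of the rule they describe. This is a sketch of the documented semantics, not Spark's scheduler code: the class and method names are hypothetical, and the details (requiring at least one finished task, using the upper median) are assumptions made for illustration.

```java
import java.util.Arrays;

public class SpeculationSketch {

    /**
     * Simplified model: a running task counts as a straggler once enough tasks
     * have finished (quantile) and it has run longer than multiplier x the
     * median duration of the finished tasks.
     */
    static boolean isStraggler(long[] finishedMs, int totalTasks, long runningMs,
                               double quantile, double multiplier) {
        // Assumption: at least one task must have finished so that a median exists.
        int minFinished = Math.max((int) Math.floor(quantile * totalTasks), 1);
        if (finishedMs.length < minFinished) {
            return false; // not enough finished tasks yet, speculation not enabled
        }
        long[] sorted = finishedMs.clone();
        Arrays.sort(sorted);
        long median = sorted[sorted.length / 2]; // upper median, for simplicity
        return runningMs > multiplier * median;
    }

    public static void main(String[] args) {
        long[] finished = {1000, 1200, 1100}; // durations of finished tasks, in ms
        // quantile 0: three finished tasks are enough, and 3000 > 2 * 1100
        System.out.println(isStraggler(finished, 10, 3000, 0.0, 2.0));  // true
        // quantile 0.75: 7 of 10 tasks must finish first, only 3 have
        System.out.println(isStraggler(finished, 10, 3000, 0.75, 2.0)); // false
    }
}
```

With a quantile of 0 the check becomes active almost immediately, which is why a few unusually fast early tasks can trigger re-launches for tasks that are merely normal.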

Some relevant links: http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-always-wait-for-stragglers-to-finish-running-td14298.html and http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAPmMX=rOVQf7JtDu0uwnp1xNYNyz4xPgXYayKex42AZ_9Pvjug@mail.gmail.com%3E

Update

I found a fix for my issue (it might not work for everyone). Each task ran a bunch of simulations, so I added a timeout around each run. If a simulation takes too long (due to data skew in that specific run), it times out.
```
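// Assumes java.util.concurrent.* is imported; simulator, SimResult and SPARKLOG
// come from the surrounding job code.
ExecutorService executor = Executors.newCachedThreadPool();
Callable<SimResult> task = () -> simulator.run();

Future<SimResult> future = executor.submit(task);
SimResult result = null;  // stays null if the run times out
try {
    // get() also throws InterruptedException and ExecutionException;
    // declare or catch them in the enclosing method.
    result = future.get(1, TimeUnit.MINUTES);
} catch (TimeoutException ex) {
    future.cancel(true);  // interrupts the thread running the simulation
    SPARKLOG.info("Task timed out");
}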
```
Make sure the simulator's main loop checks for the interrupt, otherwise cancelling the future will not actually stop the run:
```
// Check on every iteration so future.cancel(true) can stop the run promptly.
if (Thread.currentThread().isInterrupted()) {
    throw new InterruptedException();
}
```
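Putting the two snippets together, a minimal self-contained sketch of the pattern might look like the following. The names `TimeoutSketch`, `runSimulation`, and `runWithTimeout` are hypothetical stand-ins, the sleep only simulates work, and returning -1 on timeout is an arbitrary choice for the example.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutSketch {

    // Hypothetical stand-in for the simulator: loops until done or interrupted.
    static int runSimulation(int steps) throws InterruptedException {
        int completed = 0;
        for (int i = 0; i < steps; i++) {
            if (Thread.currentThread().isInterrupted()) {
                throw new InterruptedException();
            }
            Thread.sleep(10); // simulated unit of work; also throws if interrupted
            completed++;
        }
        return completed;
    }

    // Runs the simulation with a time limit; returns -1 if it timed out.
    static int runWithTimeout(int steps, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Integer> future = executor.submit(() -> runSimulation(steps));
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException ex) {
            future.cancel(true); // interrupt the worker thread
            return -1;
        } catch (InterruptedException | ExecutionException ex) {
            throw new RuntimeException(ex);
        } finally {
            executor.shutdownNow(); // do not leak threads inside the Spark task
        }
    }

    public static void main(String[] args) {
        System.out.println(runWithTimeout(3, 2000));   // finishes well within the limit
        System.out.println(runWithTimeout(1000, 100)); // would need ~10 s, so it times out
    }
}
```

Shutting the executor down in `finally` matters in a long-lived Spark executor JVM: a pool that is created per run and never shut down can accumulate idle threads.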
心在旅途

2021-02-04 09:15

The trick here is to log in directly to the worker node and kill the process. Usually you can find the offending process with a combination of top, ps, and grep. Then just run kill with its PID.
