How to deal with tasks running too long (compared to others in the job) in yarn-client?

2021-02-04 08:12

We use a Spark cluster in yarn-client mode to run several business calculations, but sometimes a task runs for a very long time.

We don't set a timeout, but I th…

2 Answers
  • 2021-02-04 08:58

    There is no way for Spark to kill its tasks if they are taking too long.

    But I figured out a way to handle this using speculation.

    This means if one or more tasks are running slowly in a stage, they will be re-launched.

    spark.speculation                  true
    spark.speculation.multiplier       2
    spark.speculation.quantile         0
    

    Note: setting spark.speculation.quantile to 0 means speculation will kick in from your first task, so use it with caution. I am using it because some of my jobs get slowed down by GC over time. You should know when to use this; it's not a silver bullet.
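
    These properties can also live in spark-defaults.conf or be passed to spark-submit with --conf. As a rough sketch (not from the original answer), this is how you might set them programmatically before creating the context; the app name is just a placeholder:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    SparkConf conf = new SparkConf()
            .setAppName("SimulationJob")               // placeholder name
            .set("spark.speculation", "true")          // re-launch tasks that look slow
            .set("spark.speculation.multiplier", "2")  // "slow" = 2x the median task duration
            .set("spark.speculation.quantile", "0");   // 0 = speculation can kick in from the first task
    JavaSparkContext sc = new JavaSparkContext(conf);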

    Some relevant links: http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-always-wait-for-stragglers-to-finish-running-td14298.html and http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAPmMX=rOVQf7JtDu0uwnp1xNYNyz4xPgXYayKex42AZ_9Pvjug@mail.gmail.com%3E

    Update

    I found a fix for my issue (it might not work for everyone). I had a bunch of simulations running per task, so I added a timeout around each run. If a simulation takes too long (due to data skew for that specific run), it will time out.

    // Uses java.util.concurrent: ExecutorService, Executors, Callable, Future, TimeUnit
    ExecutorService executor = Executors.newCachedThreadPool();
    Callable<SimResult> task = () -> simulator.run();

    Future<SimResult> future = executor.submit(task);
    SimResult result = null;
    try {
        result = future.get(1, TimeUnit.MINUTES);
    } catch (TimeoutException ex) {
        // The simulation ran past the time budget: cancel it and interrupt its thread
        future.cancel(true);
        SPARKLOG.info("Task timed out");
    } catch (InterruptedException | ExecutionException ex) {
        // Any other failure coming out of the simulation
        throw new RuntimeException(ex);
    }
    

    Make sure you handle an interrupt inside the simulator's main loop like:

    if(Thread.currentThread().isInterrupted()){
        throw new InterruptedException();
    } 
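
    If it helps, this is roughly where that check would live; the loop below is a hypothetical sketch (SimResult, isDone() and step() are illustrative names, not from the original answer):

    // Hypothetical simulator run loop that cooperates with future.cancel(true)
    SimResult run() throws InterruptedException {
        SimResult result = new SimResult();
        while (!result.isDone()) {
            // Bail out promptly when the timeout wrapper interrupts this thread
            if (Thread.currentThread().isInterrupted()) {
                throw new InterruptedException();
            }
            result = step(result);  // one unit of simulation work
        }
        return result;
    }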
    
  • 2021-02-04 09:15

    The trick here is to log in directly to the worker node and kill the process. Usually you can find the offending process with a combination of top, ps, and grep. Then just do a kill <pid>.
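
    As an illustration (the grep pattern and the PID are placeholders; on YARN the executor usually shows up as a CoarseGrainedExecutorBackend process):

    # Find the executor process that is running the stuck task
    ps aux | grep -i "[C]oarseGrainedExecutorBackend"
    # Or watch CPU usage interactively with top, then kill the process by its PID
    kill <pid>        # escalate to kill -9 <pid> only if it ignores SIGTERM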
