Condor Timeout for idle jobs

Submitted by 随声附和 on 2019-12-22 09:23:40

Question


I'm running jobs on a Condor cluster, but some get stuck in the idle state and never start, let alone finish. Short of manually running condor_wait -wait n logfile and then condor_rm, is there a more graceful (automatic, built-in) way of terminating a hung job?

Conversely, since these jobs are part of a DAGMan workflow, is there a way to time out a job in DAGMan so that the later jobs can run?


Answer 1:


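Here are two ways to have a job removed automatically after it has been idle for too long (24 hours in this example). In both, JobStatus == 1 means the job is idle, and EnteredCurrentStatus is the time at which it entered that state.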

  1. Put the following in the submit file for the job:

    periodic_remove = JobStatus == 1 && CurrentTime-EnteredCurrentStatus > 3600*24

  2. Or put the following in the Condor configuration file on the submit machine, which applies the policy to every job in that machine's queue:

    SYSTEM_PERIODIC_REMOVE = JobStatus == 1 && CurrentTime-EnteredCurrentStatus > 3600*24
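
To make the condition in both expressions concrete, here is a minimal plain-Python sketch of the same logic. It is not HTCondor code and does not use any HTCondor API; the function name should_remove and its parameters are hypothetical, chosen only for illustration. It assumes, as in the expressions above, that status code 1 means idle and that timestamps are in seconds.

```python
import time

IDLE = 1                      # HTCondor JobStatus code for an idle job
MAX_IDLE_SECONDS = 3600 * 24  # the same 24-hour limit used in the expressions above


def should_remove(job_status, entered_current_status, now=None,
                  max_idle=MAX_IDLE_SECONDS):
    """Hypothetical mirror of:
    JobStatus == 1 && CurrentTime-EnteredCurrentStatus > 3600*24
    """
    if now is None:
        now = int(time.time())  # stands in for CurrentTime
    # Remove only if the job is idle AND has been in that state too long.
    return job_status == IDLE and (now - entered_current_status) > max_idle


if __name__ == "__main__":
    now = int(time.time())
    print(should_remove(1, now - 25 * 3600, now))  # idle for 25 hours
    print(should_remove(1, now - 2 * 3600, now))   # idle for 2 hours
    print(should_remove(2, now - 25 * 3600, now))  # running (status 2) for 25 hours
```

Because the comparison uses the time the job entered its current state, the clock restarts whenever the job changes state, so a job that ran and later went back to idle is measured from its most recent transition.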

Of course, it would be better to understand why the jobs are remaining in the idle state. To do that, you may find condor_q -analyze jobid helpful.



Source: https://stackoverflow.com/questions/10763866/condor-timeout-for-idle-jobs
