Condor Timeout for idle jobs

Question

I'm running jobs on a Condor cluster, but some get stuck in the idle state and never start, let alone finish. Short of manually running condor_wait -wait n logfile and then condor_rm, is there a more graceful (and automatic, built-in) way to terminate a hung job?

Conversely, since these jobs are part of a DAGMan workflow, is there a way to time out a job in DAGMan so that the later jobs can run?

Answer 1:

Here are two ways to cause a job to be automatically removed after being idle for too long (24 hours in this example).

Put the following in the submit file for the job:

periodic_remove = JobStatus == 1 && CurrentTime-EnteredCurrentStatus > 3600*24

Or put the following in the condor configuration file on the submit machine:

SYSTEM_PERIODIC_REMOVE = JobStatus == 1 && CurrentTime-EnteredCurrentStatus > 3600*24
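
To make the logic of those expressions concrete, here is a minimal Python sketch that mirrors the same test outside of Condor. It is only an illustration, not something Condor runs: the function name should_remove and its parameters are made up for this example. It relies on JobStatus 1 meaning "idle" and on EnteredCurrentStatus being the Unix timestamp at which the job entered its current state.

```python
import time

IDLE = 1                      # JobStatus value for an idle job
MAX_IDLE_SECONDS = 3600 * 24  # 24 hours, as in the expressions above


def should_remove(job_status, entered_current_status, now=None,
                  max_idle=MAX_IDLE_SECONDS):
    """Mirror of: JobStatus == 1 && CurrentTime - EnteredCurrentStatus > max_idle.

    should_remove is a hypothetical helper for illustration only.
    """
    if now is None:
        now = int(time.time())
    return job_status == IDLE and (now - entered_current_status) > max_idle


if __name__ == "__main__":
    now = 1_700_000_000  # arbitrary fixed timestamp so the example is repeatable
    print(should_remove(1, now - 90_000, now))  # idle for 25 hours
    print(should_remove(1, now - 3_600, now))   # idle for 1 hour
    print(should_remove(2, now - 90_000, now))  # not idle, so never matched
```

Note that the comparison is strict, so a job that has been idle for exactly 24 hours is not yet removed, and a job in any other state is left alone regardless of how long it has been there.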

Of course, it would be better to understand why the jobs are remaining in the idle state. To do that, you may find condor_q -analyze jobid helpful.

Source: https://stackoverflow.com/questions/10763866/condor-timeout-for-idle-jobs

Tags

condor
