Question
I'm running jobs on a Condor cluster, but some get stuck in the idle state and never seem to start, let alone finish. Short of manually running condor_wait -wait n logfile and then condor_rm, is there a more graceful (automatic, built-in) way of terminating a hung job?
Alternatively, since these jobs are part of a DAGMan workflow, is there a way to time out a job in DAGMan so that the later jobs can run?
Answer 1:
Here are two ways to cause a job to be automatically removed after being idle for too long (24 hours in this example).
Put the following in the submit file for the job:
periodic_remove = JobStatus == 1 && CurrentTime-EnteredCurrentStatus > 3600*24
Or, to apply the same rule to every job submitted from a machine, put the following in the Condor configuration file on the submit machine:
SYSTEM_PERIODIC_REMOVE = JobStatus == 1 && CurrentTime-EnteredCurrentStatus > 3600*24
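As a rough sketch of the first option, a submit file with the periodic_remove line in context might look like the following. The executable and the log, output, and error file names are placeholders chosen for illustration, not taken from the question. JobStatus == 1 is the idle state, and 3600*24 is 24 hours expressed in seconds.

```
# Hypothetical submit description file; all file names below are placeholders.
universe   = vanilla
executable = my_job.sh
log        = my_job.log
output     = my_job.out
error      = my_job.err

# JobStatus == 1 means the job is idle.
# Remove the job once it has spent more than 24 hours in that state.
periodic_remove = (JobStatus == 1) && (CurrentTime - EnteredCurrentStatus > 3600 * 24)

queue
```

The added parentheses only make the grouping explicit; the expression is the same as the one-line form shown above.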
Of course, it would be better to understand why the jobs are remaining in the idle state. To do that, you may find condor_q -analyze jobid helpful.
Source: https://stackoverflow.com/questions/10763866/condor-timeout-for-idle-jobs