How fast can one submit consecutive and independent jobs with qsub?

*爱你&永不变心* posted on 2019-12-03 08:36:21

I see two issues in the tracejob output from the failed job.

The first is Exit_status=135. This is not a Torque error code but the exit status returned by the script, x_analyse.py. Python has no convention for the values passed to sys.exit(), so the 135 may originate in one of the modules the script uses.
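One further possibility, offered as an assumption rather than something the tracejob output confirms: POSIX shells report a child killed by signal N as exit status 128 + N, so a status above 128 may be worth decoding before searching the modules for a sys.exit(135) call. A minimal sketch follows; describe_exit_status is a hypothetical helper name, and because signal numbering is platform-dependent it should be run on the same kind of machine as the compute nodes.

```python
import signal


def describe_exit_status(status):
    """Guess how a shell-reported exit status came about (heuristic only)."""
    if status > 128:
        signum = status - 128
        try:
            # Signal numbers differ between platforms, so look the name up locally.
            name = signal.Signals(signum).name
        except ValueError:
            name = "an unknown signal"
        return f"{status}: possibly killed by {name} (128 + {signum})"
    return f"{status}: ordinary exit code set by the program"


if __name__ == "__main__":
    print(describe_exit_status(135))
```

If the decoded name points to a signal, the cause is more likely in the node's environment (memory, filesystem) than in an explicit exit call, but this is only a heuristic.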

The second issue is the failure of post-job file processing. This might indicate a misconfigured node.

From here on I am guessing. Since a successful job takes about 00:00:16, with a delay of 50 seconds all your jobs probably land on the first available node. With a smaller delay more nodes get involved, and eventually you either hit a misconfigured node or end up with two scripts executing concurrently on a single node. I would modify the submit script by adding the line

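  'echo $PBS_JOBID :: $(hostname) >> debug.log',

to the Python script that generates the .sub file. This appends the name of each execution host to debug.log, which would reside on a common filesystem if I understood your setup correctly. Note that $PBS_O_HOST would not help here: Torque sets it to the host on which qsub was run, not to the node that executes the job.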
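To make the suggestion concrete, here is a minimal sketch of a generator for the .sub file with the debug line included. The original generator script is not shown, so build_sub_lines, write_sub_file, and the #PBS -N header are all assumptions for illustration. The sketch logs $(hostname), which is evaluated on the node running the job.

```python
def build_sub_lines(job_name, command):
    """Return the lines of a Torque submit (.sub) file as a list of strings."""
    return [
        "#!/bin/bash",
        f"#PBS -N {job_name}",
        # Debug line: record which node ran which job (expanded by the shell at run time).
        'echo $PBS_JOBID :: $(hostname) >> debug.log',
        command,
    ]


def write_sub_file(path, job_name, command):
    """Write the submit file to disk, one line per list element."""
    with open(path, "w") as handle:
        handle.write("\n".join(build_sub_lines(job_name, command)) + "\n")


if __name__ == "__main__":
    # Hypothetical job name and command, for illustration only.
    write_sub_file("example.sub", "x_analyse", "python x_analyse.py")
```

Since debug.log is a relative path, it ends up in whatever directory the job starts in, so it may be safer to use an absolute path on the shared filesystem.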

Then you (or the Torque admin) might want to look for the unprocessed output files in the MOM spool directory on the failing node to get some info for further diagnosis.
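For that last step, a small sketch of how one might list leftover files for a given job. The spool location depends on the Torque installation, so it is passed in as a parameter rather than hard-coded. The sketch assumes that undelivered files are named after the job ID; find_leftover_outputs is a hypothetical helper.

```python
import os


def find_leftover_outputs(spool_dir, job_id):
    """List files in spool_dir whose names start with job_id (sorted)."""
    return sorted(
        name for name in os.listdir(spool_dir)
        if name.startswith(job_id)
    )
```

Reading the MOM spool directory usually requires administrator rights on the node, which is why the Torque admin may need to run this.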
