This question is related to pbs job no output when busy, i.e., some of the jobs I submit produce no output when PBS/Torque is 'busy'. I imagine that it is busier when many jobs
I see two issues in the tracejob output from the failed job.
The first is Exit_status=135. This exit status is not a Torque error code, but an exit status returned by the script, which is x_analyse.py. Python has no fixed convention for the values passed to sys.exit(), so the 135 might come from one of the modules used in the script.
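One thing that may be worth ruling out before searching the modules: shells conventionally report 128 + N when a child process was killed by signal N, so a status above 128 can also mean the script did not exit on its own. The sketch below is only a helper for reading such a status under that assumption. The function name describe_exit_status is hypothetical, and the signal name it prints depends on the platform where it runs.

```python
import signal


def describe_exit_status(status):
    """Interpret an exit status using the shell's 128 + N convention.

    This is a heuristic: a program is free to call sys.exit(135) itself,
    so a status above 128 only suggests a signal, it does not prove one.
    """
    if status > 128:
        signum = status - 128
        try:
            # Signal numbering is platform specific, so look the name up
            # on the machine where this runs (ideally the execution node).
            name = signal.Signals(signum).name
        except ValueError:
            name = 'unknown signal'
        return 'possibly killed by signal %d (%s)' % (signum, name)
    return 'plain exit code %d' % status


if __name__ == '__main__':
    print(describe_exit_status(135))
```

If the reported signal looks plausible for the node in question, that points away from a deliberate sys.exit() call and toward a problem on the execution host.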
The second issue is the failure of post job file processing. This might indicate a misconfigured node.
From now on I am guessing. Since a successful job takes about 00:00:16, a delay of 50 seconds probably means that all your jobs land on the first available node. With a smaller delay more nodes get involved, and you eventually either hit a misconfigured node or end up with two scripts executing concurrently on a single node. I would modify the submit script by adding the line
'echo $PBS_JOBID :: $(hostname) >> debug.log',
to the Python script that generates the .sub file. This would append the name of each execution host to debug.log, which would reside on a common filesystem if I understood your setup correctly. (Note that $PBS_O_HOST would not serve here: it holds the host on which qsub was run, not the execution host.)
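As a rough illustration of where such a line could go, here is a minimal sketch of a generator for a .sub file. The asker's actual generator is not shown, so the function names build_sub_lines and write_sub_file, the job name, and the script layout are all assumptions. The sketch logs $(hostname), which is evaluated on the node that runs the job, whereas $PBS_O_HOST holds the host where qsub was invoked.

```python
import os
import tempfile

# Evaluated by the shell on the execution node when the job starts.
DEBUG_LINE = 'echo $PBS_JOBID :: $(hostname) >> debug.log'


def build_sub_lines(job_name, command):
    """Return the lines of a minimal submit script (hypothetical layout)."""
    return [
        '#!/bin/bash',
        '#PBS -N ' + job_name,
        DEBUG_LINE,  # record the node before the real work starts
        command,
    ]


def write_sub_file(path, lines):
    """Write the submit script, one entry per line."""
    with open(path, 'w') as handle:
        handle.write('\n'.join(lines) + '\n')


if __name__ == '__main__':
    # Job name and output location are placeholders for illustration.
    lines = build_sub_lines('x_analyse_0', 'python x_analyse.py')
    path = os.path.join(tempfile.mkdtemp(), 'x_analyse_0.sub')
    write_sub_file(path, lines)
    with open(path) as handle:
        print(handle.read())
```

Placing the echo before the command means the host is recorded even when the script itself fails.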
Then you (or the Torque admin) might want to look for the unprocessed output files in the MOM spool directory on the failing node to get some info for further diagnosis.
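The search for leftover files could be sketched as follows. The location of the MOM spool directory depends on how Torque was installed, so it is passed in as an argument instead of being hard-coded, and the assumption that leftover files are named after the job ID should be checked against what is actually present on the node. The function name find_spool_files is hypothetical.

```python
import os
import sys


def find_spool_files(spool_dir, job_id):
    """Return the sorted names of files in spool_dir that start with job_id.

    Assumes leftover output files are named '<job_id>.<suffix>'; verify this
    against the files actually present on the node.
    """
    prefix = job_id + '.'
    return sorted(
        name for name in os.listdir(spool_dir)
        if name.startswith(prefix)
    )


if __name__ == '__main__':
    # Usage: python find_spool.py <spool_dir> <job_id>
    spool_dir, job_id = sys.argv[1], sys.argv[2]
    for name in find_spool_files(spool_dir, job_id):
        print(os.path.join(spool_dir, name))
```

Reading the spool directory usually requires elevated permissions on the node, which is one more reason to involve the Torque admin.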