How fast can one submit consecutive and independent jobs with qsub?


This question is related to pbs job no output when busy, i.e. some of the jobs I submit produce no output when PBS/Torque is 'busy'. I imagine that it is busier when many jobs are submitted in quick succession.

1 Answer

    I see two issues in the tracejob output from the failed job.

    First, it is Exit_status=135. This exit status is not a Torque error code but the exit status returned by the script itself, x_analyse.py. Python has no convention for the values passed to sys.exit(), so the source of the 135 code might be in one of the modules used by the script.
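
    As an illustration only (the module and function names below are hypothetical, not taken from the question), any sys.exit() reached in the script or in one of its imports becomes the exit status of the whole Python process, and Torque records that value as Exit_status:

      # helper.py -- hypothetical module imported by x_analyse.py
      import sys

      def load_data(path):
          try:
              with open(path) as fh:
                  return fh.read()
          except OSError:
              # A helper or third-party module may call sys.exit() with an
              # arbitrary code on error; the interpreter terminates with that
              # code and Torque reports it as Exit_status (here, 135).
              sys.exit(135)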

    The second issue is the failure of post job file processing. This might indicate a misconfigured node.

    From here on I am guessing. Since a successful job finishes in about 00:00:16, with a 50-second delay between submissions each job completes before the next one starts, so all of your jobs probably land on the first available node. With a smaller delay more nodes get involved, and eventually you either hit a misconfigured node or have two scripts executing concurrently on a single node. I would modify the submission script by adding the line

      'echo $PBS_JOBID :: $(hostname) >> debug.log',
    

    to the Python script that generates the .sub file. This would append the name of the execution host for each job to debug.log, which should reside on a common filesystem if I understood your setup correctly; a sketch of the generator follows.
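
    A minimal sketch of what that generating script might look like with the logging line included (the file names, PBS directives, and the x_analyse.py invocation are assumptions, not details from the question):

      # make_sub.py -- hypothetical generator for the .sub file
      sub_lines = [
          '#!/bin/bash',
          '#PBS -l nodes=1:ppn=1',
          'cd $PBS_O_WORKDIR',
          # Log which node ran this job so a failing host can be identified.
          'echo $PBS_JOBID :: $(hostname) >> debug.log',
          './x_analyse.py',
      ]

      with open('job.sub', 'w') as sub:
          sub.write('\n'.join(sub_lines) + '\n')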

    Then you (or the Torque admin) might want to look for the unprocessed output files in the MOM spool directory on the failing node to get some info for further diagnosis.
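
    If shell access to the failing node is available, a quick way to spot stranded output files is to list anything in the MOM spool that matches the failed job's id; the spool path below is a common default and the job id is hypothetical, so both depend on the actual installation:

      # list_spool.py -- hypothetical helper, run on the failing node
      import glob

      SPOOL_DIR = '/var/spool/torque/spool'  # assumed default location
      JOB_NUMBER = '12345'                   # hypothetical numeric job id

      # Undelivered stdout/stderr files are normally named after the job id.
      for path in sorted(glob.glob(f'{SPOOL_DIR}/*{JOB_NUMBER}*')):
          print(path)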
