Question:
So, right now I'm submitting jobs on a cluster with qsub, but they always seem to run on a single node. I currently run them with:
#PBS -l walltime=10
#PBS -l nodes=4:gpus=2
#PBS -r n
#PBS -N test
range_0_total=$(seq 0 $(expr $total - 1))
for i in $range_0_total
do
$PATH_TO_JOB_EXEC/job_executable &
done
wait
I would be incredibly grateful if you could tell me if I'm doing something wrong, or if it's just that my test tasks are too small.
Answer 1:
With your approach, you need your for loop to go through all of the entries in the file pointed to by $PBS_NODEFILE, and then inside your loop you would need "ssh $i $PATH_TO_JOB_EXEC/job_executable &".
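A minimal sketch of that loop follows. Since it can't actually ssh outside a real PBS job, a fabricated node file and an echo stand in for the real environment; in an actual job script you would drop the fake-file setup and uncomment the ssh line (node01/node02 and the counter are illustrative assumptions, not part of PBS):

```shell
# Illustration only: fake a $PBS_NODEFILE so this runs outside PBS.
# Inside a real job, PBS sets PBS_NODEFILE for you (one line per core).
PBS_NODEFILE=${PBS_NODEFILE:-$(mktemp)}
[ -s "$PBS_NODEFILE" ] || printf 'node01\nnode02\n' > "$PBS_NODEFILE"

launched=0
while read -r node; do
  # In a real job script this line would be:
  #   ssh "$node" "$PATH_TO_JOB_EXEC/job_executable" &
  echo "would launch on $node" &
  launched=$((launched + 1))
done < "$PBS_NODEFILE"
wait   # block until every backgrounded process has finished
```

Note that $PBS_NODEFILE typically lists one line per allocated core, so this launches one process per core, not per node; filter the file (e.g. with sort -u) if you want one process per node.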
The other, easier way to do this would be to replace the for loop and wait with:
pbsdsh $PATH_TO_JOB_EXEC/job_executable
This runs a copy of your program on each core assigned to your job. If you need to modify this behavior, take a look at the options available in the pbsdsh man page.
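Putting that together with the original directives, the whole submission script reduces to something like the fragment below (a sketch, assuming Torque's pbsdsh; it only works inside a job started by qsub):

```
#!/bin/bash
#PBS -l walltime=10
#PBS -l nodes=4:gpus=2
#PBS -r n
#PBS -N test

# pbsdsh runs one copy of the command on every core assigned to the job;
# Torque's pbsdsh also has a -u option to run one copy per unique node
# instead -- check your local man page, as options vary between versions.
pbsdsh $PATH_TO_JOB_EXEC/job_executable
```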
Source: https://stackoverflow.com/questions/30881147/pbs-torque-how-do-i-submit-a-parallel-job-on-multiple-nodes