I am using gnu parallel to launch code on a high performance (HPC) computing cluster that has 2 CPUs per node. The cluster uses TORQUE portable batch system (PBS). My question is to clarify how the --jobs option for GNU parallel works in this scenario.
When I run a PBS script calling GNU parallel without the --jobs option, like this:
#PBS -lnodes=2:ppn=2
...
parallel --env $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
matlab -nodiplay -r "\"cd $PBS_O_WORKDIR,primes1({})\"" ::: 10 20 30 40
it looks like it only uses one CPU per core, and also provides the following error stream:
bash: parallel: command not found
parallel: Warning: Could not figure out number of cpus on galles087 (). Using 1.
bash: parallel: command not found
parallel: Warning: Could not figure out number of cpus on galles108 (). Using 1.
This looks like one error for each node. I don't understand the first part (bash: parallel: command not found
), but the second part tells me it's using one node.
When I add the option -j2
to the parallel call, the errors go away, and I think that it's using two CPUs per node. I am still a newbie to HPC, so my way of checking this is to output date-time stamps from my code (the dummy matlab code takes 10's of seconds to complete). My questions are:
- Am I using the
--jobs
option correctly? Is it correct to specify-j2
because I have 2 CPUs per node? Or should I be using-jN
where N is the total number of CPUs (number of nodes multiplied by number of CPUs per node)? - It appears that GNU parallel attempts to determine the number of CPUs per node on it's own. Is there a way that I can make this work properly?
- Is there any meaning to the
bash: parallel: command not found
message?
- Yes: -j is the number of jobs per node.
- Yes: Install 'parallel' in your $PATH on the remote hosts.
- Yes: It is a consequence from
parallel
missing from the $PATH.
GNU Parallel logs into the remote machine; tries to determine the number of cores (using parallel --number-of-cores
) which fails and then defaults to 1 CPU core per host. By giving -j2
GNU Parallel will not try to determine the number of cores.
Did you know that you can also give the number of cores in the --sshlogin as: 4/myserver ? This is useful if you have a mix of machines with different number of cores.
This is not an answer to the 3 primary questions, but I'd like to point out some other problems with the parallel statement in the first code block.
parallel --env $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
matlab -nodiplay -r "\"cd $PBS_O_WORKDIR,primes1({})\"" ::: 10 20 30 40
The shell expands the $PBS_O_WORKDIR prior to executing parallel. This means two things happen (1) the --env sees a filename rather than an environment variable name and essentially does nothing and (2) expands as part command string eliminating the need to pass $PBS_O_WORKDIR which is why there wasn't an error.
The latest version of parallel 20151022 has a workdir option (although the tutorial lists it as alpha testing) which is probably the easiest solution. The parallel command line would look something like:
parallel --workdir $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
matlab -nodisplay -r "primes1({})" :::: 10 20 30 40
Final note, PBS_NODEFILE may contain hosts listed multiple times if more than one processor is requested by qsub. This many have implications for number of jobs run, etc.
来源:https://stackoverflow.com/questions/22236337/gnu-parallel-jobs-option-using-multiple-nodes-on-cluster-with-multiple-cpus-pe