I am using gnu parallel to launch code on a high performance (HPC) computing cluster that has 2 CPUs per node. The cluster uses TORQUE portable batch system (PBS). My questi
This is not an answer to the 3 primary questions, but I'd like to point out some other problems with the parallel statement in the first code block.
parallel --env $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
matlab -nodiplay -r "\"cd $PBS_O_WORKDIR,primes1({})\"" ::: 10 20 30 40
The shell expands the $PBS_O_WORKDIR prior to executing parallel. This means two things happen (1) the --env sees a filename rather than an environment variable name and essentially does nothing and (2) expands as part command string eliminating the need to pass $PBS_O_WORKDIR which is why there wasn't an error.
The latest version of parallel 20151022 has a workdir option (although the tutorial lists it as alpha testing) which is probably the easiest solution. The parallel command line would look something like:
parallel --workdir $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
matlab -nodisplay -r "primes1({})" :::: 10 20 30 40
Final note, PBS_NODEFILE may contain hosts listed multiple times if more than one processor is requested by qsub. This many have implications for number of jobs run, etc.
parallel
missing from the $PATH.GNU Parallel logs into the remote machine; tries to determine the number of cores (using parallel --number-of-cores
) which fails and then defaults to 1 CPU core per host. By giving -j2
GNU Parallel will not try to determine the number of cores.
Did you know that you can also give the number of cores in the --sshlogin as: 4/myserver ? This is useful if you have a mix of machines with different number of cores.