GNU parallel --jobs option using multiple nodes on cluster with multiple cpus per node

吃可爱长大的小学妹 提交于 2019-11-30 09:25:21
  1. Yes: -j is the number of jobs per node.
  2. Yes: Install 'parallel' in your $PATH on the remote hosts.
  3. Yes: It is a consequence from parallel missing from the $PATH.

GNU Parallel logs into the remote machine; tries to determine the number of cores (using parallel --number-of-cores) which fails and then defaults to 1 CPU core per host. By giving -j2 GNU Parallel will not try to determine the number of cores.

Did you know that you can also give the number of cores in the --sshlogin as: 4/myserver ? This is useful if you have a mix of machines with different number of cores.

This is not an answer to the 3 primary questions, but I'd like to point out some other problems with the parallel statement in the first code block.

parallel --env $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
  matlab -nodiplay -r "\"cd $PBS_O_WORKDIR,primes1({})\"" ::: 10 20 30 40

The shell expands the $PBS_O_WORKDIR prior to executing parallel. This means two things happen (1) the --env sees a filename rather than an environment variable name and essentially does nothing and (2) expands as part command string eliminating the need to pass $PBS_O_WORKDIR which is why there wasn't an error.

The latest version of parallel 20151022 has a workdir option (although the tutorial lists it as alpha testing) which is probably the easiest solution. The parallel command line would look something like:

parallel --workdir $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
  matlab -nodisplay -r "primes1({})" :::: 10 20 30 40

Final note, PBS_NODEFILE may contain hosts listed multiple times if more than one processor is requested by qsub. This many have implications for number of jobs run, etc.

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!