Why does mpirun behave as it does when used with Slurm?


Question


I am using Intel MPI and have encountered some confusing behavior when using mpirun in conjunction with Slurm.

If I run (on a login node)

mpirun -n 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"

then I get the expected output: 0 and 1 are printed.

If, however, I run salloc --time=30 --nodes=1 and then run the same mpirun command from the interactive compute node, I get two 0s printed instead of the expected 0 and 1.
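For concreteness, a minimal sketch of the sequence I mean (assuming salloc gives you a shell with the allocation active; on some clusters you additionally need something like srun --pty bash or ssh to reach the compute node):

salloc --time=30 --nodes=1
mpirun -n 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"

which in my case prints 0 twice rather than 0 and 1.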

Then, if I change -n 2 to -n 3 (still on the compute node), I get a long error from Slurm saying srun: error: PMK_KVS_Barrier task count inconsistent (2 != 1) (plus a load of other stuff), but I am not sure how to explain this either...

Now, based on this Open MPI page, it seems these kinds of operations should be supported, at least for Open MPI:

Specifically, you can launch Open MPI's mpirun in an interactive SLURM allocation (via the salloc command) or you can submit a script to SLURM (via the sbatch command), or you can "directly" launch MPI executables via srun.
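(For comparison, a "direct" srun launch as mentioned there would look roughly like the line below. This is only a sketch: whether it works depends on how the cluster's Slurm PMI support is configured, the --mpi=pmi2 option may or may not be needed or available, and for Intel MPI one typically also has to point I_MPI_PMI_LIBRARY at Slurm's PMI library, whose path is cluster-specific.)

srun -n 2 --mpi=pmi2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"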

Maybe the Intel MPI implementation I was using just doesn't have the same support and is not designed to be used directly in a Slurm environment (?), but I am still wondering: what is the underlying nature of mpirun and Slurm (salloc) that produces this behavior? Why does it print two 0s in the first case, and what are the inconsistent task counts it complains about in the second?
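As a diagnostic sketch (not something I have fully verified), having each process also report its communicator size and host should show whether the two processes actually joined one MPI job or each formed its own singleton MPI_COMM_WORLD of size 1, which is what the repeated 0s suggest:

mpirun -n 2 python -c "from mpi4py import MPI; c = MPI.COMM_WORLD; print(c.Get_rank(), c.Get_size(), MPI.Get_processor_name())"

Seeing 0 1 printed twice would mean two independent size-1 jobs; seeing 0 2 and 1 2 would mean a single two-rank job.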

Source: https://stackoverflow.com/questions/51299949/why-does-mpirun-behave-as-it-does-when-used-with-slurm
