Question
I am using Intel MPI and have encountered some confusing behavior when using mpirun in conjunction with Slurm.
If I run (on a login node)
mpirun -n 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"
then I get the expected 0 and 1 printed out.
If, however, I salloc --time=30 --nodes=1 and run the same mpirun command from the interactive compute node, I get two 0s printed out instead of the expected 0 and 1.
Then, if I change -n 2 to -n 3 (still on the compute node), I get a long error from Slurm saying
srun: error: PMK_KVS_Barrier task count inconsistent (2 != 1)
(plus a load of other output), but I am not sure how to explain this either.
Now, based on this Open MPI page, it seems this kind of operation should be supported, at least for Open MPI:
Specifically, you can launch Open MPI's mpirun in an interactive SLURM allocation (via the salloc command) or you can submit a script to SLURM (via the sbatch command), or you can "directly" launch MPI executables via srun.
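For completeness, the "direct" srun launch mentioned there would presumably look something like this from inside the allocation (a sketch; whether it works as-is depends on the site's Slurm/PMI setup):
srun -n 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"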
Maybe the Intel MPI implementation I was using just doesn't have the same support and is not designed to be used directly in a Slurm environment(?), but I am still wondering: what is it about the way mpirun and Slurm (salloc) interact that produces this behavior? Why would it print two 0s in the first case, and what are the inconsistent task counts it complains about in the second case?
Source: https://stackoverflow.com/questions/51299949/why-does-mpirun-behave-as-it-does-when-used-with-slurm