This whole issue can be summarized as: I'm trying to replicate the behaviour of a call to system() (or fork()), but in an MPI environment.
Each MPI job in Open MPI starts with some set of slots distributed over one or more hosts. Those slots are consumed by both the initial MPI processes and by any process spawned as part of a child MPI job. In your case, the hosts could be provided in a hostfile similar to this:
host1 slots=2 max_slots=2
host2 slots=2 max_slots=2
host3 slots=2 max_slots=2
...
slots=2 max_slots=2 restricts Open MPI to running only two processes per host.
The initial job launch should specify one process per host; otherwise, Open MPI will fill up all slots with processes from the parent job. --map-by ppr:1:node does the trick:
mpiexec --hostfile hosts --map-by ppr:1:node ./parent
Now, the problem is that Open MPI will continue filling the slots on a first-come, first-served basis as new child jobs are spawned, so there is no guarantee that a child process will be started on the same host as its parent. To enforce co-location, set, as advised by Gilles Gouaillardet, the host key of the info argument to the hostname as returned by MPI_Get_processor_name:
character(len=MPI_MAX_PROCESSOR_NAME) :: procn
integer :: procl
integer :: info
integer :: ierr

call MPI_Get_processor_name(procn, procl, ierr)
call MPI_Info_create(info, ierr)
call MPI_Info_set(info, 'host', trim(procn), ierr)
call MPI_Comm_spawn('./child', MPI_ARGV_NULL, 1, info, 0, &
...
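Putting the pieces together, a minimal self-contained sketch of the parent program could look like the following. The ./child executable name and the use of MPI_COMM_SELF (so that each parent rank independently spawns its own child on its own host) are assumptions on my part, not something mandated by the question:

program parent
   use mpi
   implicit none

   character(len=MPI_MAX_PROCESSOR_NAME) :: procn
   integer :: procl, info, intercomm, ierr
   integer :: errcodes(1)

   call MPI_Init(ierr)

   ! Name of the host this parent rank is running on
   call MPI_Get_processor_name(procn, procl, ierr)

   ! Ask Open MPI to place the child on the same host
   call MPI_Info_create(info, ierr)
   call MPI_Info_set(info, 'host', trim(procn), ierr)

   ! Spawn one child per parent rank; intercomm connects the
   ! parent with the newly created child job
   call MPI_Comm_spawn('./child', MPI_ARGV_NULL, 1, info, 0, &
                       MPI_COMM_SELF, intercomm, errcodes, ierr)

   call MPI_Info_free(info, ierr)

   ! ... communicate with the child over intercomm ...

   call MPI_Finalize(ierr)
end program parent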
It is possible that your MPI jobs abort with the following message:
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
It basically means that the requested host is either full (all its slots already filled) or not in the original host list, in which case no slots were allocated on it. The former is obviously not the case here, since the hostfile lists two slots per host and the parent job only uses one. The hostname provided in the host key-value pair must match exactly an entry in the initial list of hosts. It is often the case that the hostfile contains only unqualified host names, as in the sample hostfile in the first paragraph, while MPI_Get_processor_name returns the FQDN if the domain part is set, e.g., host1.example.local, host2.example.local, etc. The solution is to use FQDNs in the hostfile:
host1.example.local slots=2 max_slots=2
host2.example.local slots=2 max_slots=2
host3.example.local slots=2 max_slots=2
...
If the allocation is instead provided by a resource manager such as SLURM, the solution is to transform the result from MPI_Get_processor_name so that it matches what the resource manager provides.
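For example, if the resource manager knows the nodes by their short names while MPI_Get_processor_name returns FQDNs, one rough sketch (assuming the short name is simply everything before the first dot) is to blank out the domain part before setting the host key:

character(len=MPI_MAX_PROCESSOR_NAME) :: procn
integer :: procl, dotpos, ierr

call MPI_Get_processor_name(procn, procl, ierr)

! Keep only the part before the first dot, e.g.,
! 'host1.example.local' becomes 'host1'
dotpos = index(procn, '.')
if (dotpos > 0) procn(dotpos:) = ' '

call MPI_Info_set(info, 'host', trim(procn), ierr)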
Note that the man page for MPI_Comm_spawn lists the add-host key, which is supposed to add the hostname in the value to the list of hosts for the job:
add-host      char *    Add the specified host to the list of
                        hosts known to this job and use it for
                        the associated process. This will be
                        used similarly to the -host option.
In my experience, this has never worked (tested with Open MPI up to 1.10.4).