Controlling node mapping of MPI_COMM_SPAWN

被撕碎了的回忆 2021-01-26 00:57

The context:

This whole issue can be summarized as: I'm trying to replicate the behaviour of a call to system (or fork), but in an MPI environment.

2 Answers
  • 2021-01-26 01:11

    Each MPI job in Open MPI starts with a set of slots distributed over one or more hosts. Those slots are consumed both by the initial MPI processes and by any processes spawned as part of a child MPI job. In your case, the hosts could be provided in a hostfile similar to this:

    host1 slots=2 max_slots=2
    host2 slots=2 max_slots=2
    host3 slots=2 max_slots=2
    ...
    

    slots=2 max_slots=2 restricts Open MPI to running only two processes per host.

    The initial job launch should place exactly one process per host, otherwise Open MPI will fill all slots with processes from the parent job and leave none for the children. --map-by ppr:1:node does the trick:

    mpiexec --hostfile hosts --map-by ppr:1:node ./parent
    

    Now, the problem is that Open MPI continues to fill the slots on a first-come, first-served basis as new child jobs are spawned, so there is no guarantee that a child process will start on the same host as its parent. To enforce this, set the host key of the info argument to the hostname returned by MPI_Get_processor_name, as advised by Gilles Gouaillardet:

    character(len=MPI_MAX_PROCESSOR_NAME) :: procn
    integer :: procl
    integer :: info
    integer :: ierr
    
    ! Name of the host this parent process runs on
    call MPI_Get_processor_name(procn, procl, ierr)
    
    ! Ask Open MPI to spawn the child on that same host
    call MPI_Info_create(info, ierr)
    call MPI_Info_set(info, 'host', trim(procn), ierr)
    
    call MPI_Comm_spawn('./child', MPI_ARGV_NULL, 1, info, 0, &
    ...
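
    For reference, the remaining arguments of the spawn call might look like the sketch below. The use of MPI_COMM_SELF (so that each parent rank spawns its own child independently) and the names intercomm and errcodes are assumptions on my part, not part of the original snippet:

    integer :: intercomm
    integer :: errcodes(1)
    
    ! Each parent rank spawns one child on its own host; MPI_COMM_SELF
    ! keeps the spawn calls of the different parents independent.
    call MPI_Comm_spawn('./child', MPI_ARGV_NULL, 1, info, 0, &
                        MPI_COMM_SELF, intercomm, errcodes, ierr)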
    

    It is possible that your MPI jobs abort with the following message:

    --------------------------------------------------------------------------
    All nodes which are allocated for this job are already filled.
    --------------------------------------------------------------------------
    

    It basically means that the requested host is either full (all slots already filled) or not in the original host list, so no slots were allocated on it. The former is clearly not the case here, since the hostfile lists two slots per host and the parent job uses only one. The hostname provided in the host key-value pair must exactly match the entry in the initial list of hosts. Hostfiles often contain only unqualified host names, like the sample hostfile in the first paragraph, while MPI_Get_processor_name returns the FQDN if the domain part is set, e.g., host1.example.local, host2.example.local, etc. The solution is to use FQDNs in the hostfile:

    host1.example.local slots=2 max_slots=2
    host2.example.local slots=2 max_slots=2
    host3.example.local slots=2 max_slots=2
    ...
    

    If the allocation is instead provided by a resource manager such as SLURM, the solution is to transform the result from MPI_Get_processor_name to match what the RM provides.
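
    For example, if the resource manager lists only short host names while MPI_Get_processor_name returns an FQDN, the domain part can be stripped before setting the host key. This is a minimal sketch; the variable names shortn and dotpos are mine, not from the original answer:

    character(len=MPI_MAX_PROCESSOR_NAME) :: procn, shortn
    integer :: procl, dotpos, info, ierr
    
    call MPI_Get_processor_name(procn, procl, ierr)
    
    ! Keep only the part before the first dot,
    ! e.g. 'host1.example.local' -> 'host1'
    dotpos = index(procn, '.')
    if (dotpos > 0) then
       shortn = procn(1:dotpos-1)
    else
       shortn = procn
    end if
    
    call MPI_Info_create(info, ierr)
    call MPI_Info_set(info, 'host', trim(shortn), ierr)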

    Note that the man page for MPI_Comm_spawn lists the add-host key, which is supposed to add the hostname in the value to the list of hosts for the job:

    add-host               char *   Add the specified host to the list of
                                    hosts known to this job and use it for
                                    the associated process. This will be
                                    used similarly to the -host option.
    

    In my experience, this has never worked (tested with Open MPI up to 1.10.4).

  • 2021-01-26 01:21

    From the Open MPI MPI_Comm_spawn man page:

       The following keys for info are recognized in Open MPI. (The reserved values mentioned in Section 5.3.4 of the MPI-2 standard are not implemented.)
    
       Key                    Type     Description
       ---                    ----     -----------
    
       host                   char *   Host on which the process should be
                                       spawned.  See the orte_host man
                                       page for an explanation of how this
                                       will be used.
    

    You can use MPI_Get_processor_name() to get the hostname an MPI task is running on and pass it as the value of the host key.
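
    As a minimal end-to-end sketch (assuming one child per parent rank and a child executable named ./child; MPI_COMM_SELF and the variable names are my choices, not from the man page):

    program parent
       use mpi
       implicit none
    
       character(len=MPI_MAX_PROCESSOR_NAME) :: procn
       integer :: procl, info, intercomm, ierr
       integer :: errcodes(1)
    
       call MPI_Init(ierr)
    
       ! Host this parent rank is running on
       call MPI_Get_processor_name(procn, procl, ierr)
    
       ! Request that the child be spawned on the same host
       call MPI_Info_create(info, ierr)
       call MPI_Info_set(info, 'host', trim(procn), ierr)
    
       ! Each parent rank spawns its own child independently
       call MPI_Comm_spawn('./child', MPI_ARGV_NULL, 1, info, 0, &
                           MPI_COMM_SELF, intercomm, errcodes, ierr)
    
       call MPI_Info_free(info, ierr)
       call MPI_Finalize(ierr)
    end program parent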
