How is barrier implemented in message passing systems?

难免孤独 2021-02-08 19:41

My understanding is that one master process sends a message to all other processes, and all the other processes send a message back to the master process. Would this be enough [...]

2 Answers
  •  一整个雨季
    2021-02-08 20:01

    Let's have a look at Open MPI's implementation of the barrier. While other implementations may differ slightly, the general communication pattern should be the same.

    The first thing to note is that MPI's barrier has no setup costs: a process reaching an MPI_Barrier call blocks until all other members of the group have also called MPI_Barrier. MPI does not require them to reach the same call in the source code, just any call to MPI_Barrier. Since each process already knows the total number of processes in the group, no additional state needs to be distributed to initialize the call.

    Now, let's look at some code:

    /*
     * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
     *                         University Research and Technology
     *                         Corporation.  All rights reserved.
     * Copyright (c) 2004-2005 The University of Tennessee and The University
     *                         of Tennessee Research Foundation.  All rights
     *                         reserved.
     * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
     *                         University of Stuttgart.  All rights reserved.
     * Copyright (c) 2004-2005 The Regents of the University of California.
     *                         All rights reserved.
     * Copyright (c) 2012      Oak Ridge National Labs.  All rights reserved.
     * [...]
     */
    
    [...]
    
    /*
     *  barrier_intra_lin
     *
     *  Function:   - barrier using O(N) algorithm
     *  Accepts:    - same as MPI_Barrier()
     *  Returns:    - MPI_SUCCESS or error code
     */
    int
    mca_coll_basic_barrier_intra_lin(struct ompi_communicator_t *comm,
                                     mca_coll_base_module_t *module)
    {
        int i;
        int err;
        int size = ompi_comm_size(comm);
        int rank = ompi_comm_rank(comm);
    

    First, all nodes except the one with rank 0 (the root node) send the root a notification that they have reached the barrier:

        /* All non-root send & receive zero-length message. */
    
        if (rank > 0) {
            err =
                MCA_PML_CALL(send
                             (NULL, 0, MPI_BYTE, 0, MCA_COLL_BASE_TAG_BARRIER,
                              MCA_PML_BASE_SEND_STANDARD, comm));
            if (MPI_SUCCESS != err) {
                return err;
            }
    

    After that, they block, waiting for a notification from the root:

            err =
                MCA_PML_CALL(recv
                             (NULL, 0, MPI_BYTE, 0, MCA_COLL_BASE_TAG_BARRIER,
                              comm, MPI_STATUS_IGNORE));
            if (MPI_SUCCESS != err) {
                return err;
            }
        }
    

    The root node implements the other side of the communication. First, it blocks until it has received n-1 notifications (one from every node in the group except itself, since it is already inside the barrier call):

        else {
            for (i = 1; i < size; ++i) {
                err = MCA_PML_CALL(recv(NULL, 0, MPI_BYTE, MPI_ANY_SOURCE,
                                        MCA_COLL_BASE_TAG_BARRIER,
                                        comm, MPI_STATUS_IGNORE));
                if (MPI_SUCCESS != err) {
                    return err;
                }
            }
    

    Once all notifications have arrived, it sends out the messages that every node is waiting for, signalling that everyone has reached the barrier, after which it leaves the barrier call itself:

            for (i = 1; i < size; ++i) {
                err =
                    MCA_PML_CALL(send
                                 (NULL, 0, MPI_BYTE, i,
                                  MCA_COLL_BASE_TAG_BARRIER,
                                  MCA_PML_BASE_SEND_STANDARD, comm));
                if (MPI_SUCCESS != err) {
                    return err;
                }
            }
        }
    
        /* All done */
    
        return MPI_SUCCESS;
    }
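
    The same two-phase pattern can be tried out without an MPI installation. The following is a minimal sketch, not Open MPI code: it assumes a POSIX system and replaces MPI messages with one-byte writes on pipes, with forked child processes standing in for the non-root ranks. The function name linear_barrier_demo and the limit of 64 processes are made up for this illustration.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_PROCS 64  /* arbitrary limit chosen for this sketch */

/*
 * Linear O(N) barrier over pipes (hypothetical demo, not Open MPI code).
 * The caller plays rank 0; size-1 forked workers play the other ranks.
 * Returns the number of workers that left the barrier cleanly, -1 on error.
 * Error paths are kept short and do not clean up descriptors.
 */
int linear_barrier_demo(int size)
{
    assert(size >= 1 && size <= MAX_PROCS);

    int arrive[2];               /* n:1 channel, written by every worker */
    int release[MAX_PROCS][2];   /* 1:n channels, one pipe per worker    */
    pid_t pids[MAX_PROCS];
    char token = 0;

    if (pipe(arrive) != 0) return -1;

    for (int rank = 1; rank < size; ++rank) {
        if (pipe(release[rank]) != 0) return -1;
        pids[rank] = fork();
        if (pids[rank] < 0) return -1;
        if (pids[rank] == 0) {
            /* Worker: notify the root, then block until it answers. */
            close(arrive[0]);
            close(release[rank][1]);
            if (write(arrive[1], &token, 1) != 1) _exit(1);
            if (read(release[rank][0], &token, 1) != 1) _exit(1);
            _exit(0);
        }
        close(release[rank][0]);  /* root keeps only the write end */
    }
    close(arrive[1]);

    /* Root, phase 1: collect one notification per worker, in any order. */
    for (int i = 1; i < size; ++i) {
        if (read(arrive[0], &token, 1) != 1) return -1;
    }
    /* Root, phase 2: everyone has arrived, so release each worker. */
    for (int rank = 1; rank < size; ++rank) {
        if (write(release[rank][1], &token, 1) != 1) return -1;
        close(release[rank][1]);
    }
    close(arrive[0]);

    /* Count the workers that got through the barrier and exited with 0. */
    int passed = 0;
    for (int rank = 1; rank < size; ++rank) {
        int status = 0;
        if (waitpid(pids[rank], &status, 0) == pids[rank] &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            ++passed;
        }
    }
    return passed;
}
```

    Calling linear_barrier_demo(4) from a main function forks three workers. Each worker can only exit after the root has read all three notifications, which is the ordering guarantee a barrier provides. The shared arrival pipe plays the role of the MPI_ANY_SOURCE receive in the code above.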
    

    So the communication pattern is first an n:1 from all nodes to the root and then a 1:n from the root back to all nodes. To avoid overloading the root node with requests, Open MPI allows use of a tree-based communication pattern, but the basic idea is the same: all nodes notify the root when entering the barrier, while the root aggregates the notifications and informs everyone once they are all ready to continue.
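
    To make the tree idea concrete, here is a small sketch of how ranks could be arranged in a tree. It assumes a plain binary tree rooted at rank 0, which is an illustrative choice and not a transcription of any particular Open MPI algorithm; all function names are hypothetical. Each node would wait for its children, notify its parent, wait for the parent's release, and then release its children.

```c
#include <assert.h>

/* Parent of `rank` in a binary tree rooted at rank 0; -1 for the root. */
int tree_parent(int rank)
{
    return rank == 0 ? -1 : (rank - 1) / 2;
}

/* Stores the children of `rank` in out[] and returns how many exist. */
int tree_children(int rank, int size, int out[2])
{
    int n = 0;
    for (int k = 1; k <= 2; ++k) {
        int child = 2 * rank + k;
        if (child < size) {
            out[n++] = child;
        }
    }
    return n;
}

/* Hops on the longest leaf-to-root path, i.e. the depth of the tree. */
int tree_depth(int size)
{
    assert(size >= 1);
    int depth = 0;
    for (int last = size - 1; last > 0; last = (last - 1) / 2) {
        ++depth;
    }
    return depth;
}

/* Messages the root itself receives plus sends in the linear scheme. */
int root_messages_linear(int size)
{
    return 2 * (size - 1);
}

/* Same count for the binary tree: the root only talks to its children. */
int root_messages_tree(int size)
{
    int children[2];
    return 2 * tree_children(0, size, children);
}
```

    With 8 processes, the root handles 14 messages in the linear scheme but only 4 in this tree, at the price of notifications travelling up to 3 hops in each direction instead of 1.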
