As I understand it, one master process sends a message to all other processes, and all the other processes send a message back to the master process. Would this be enough?
Let's have a look at OpenMPI's implementation of barrier. While other implementations may differ slightly, the general communication pattern should be identical.
The first thing to note is that MPI's barrier has no setup costs: a process reaching an MPI_Barrier call will block until all other members of the group have also called MPI_Barrier. Note that MPI does not require them to reach the same call, just any call to MPI_Barrier. Hence, since the total number of nodes in the group is already known to each process, no additional state needs to be distributed for initializing the call.
Now, let's look at some code:
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2005 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2012 Oak Ridge National Labs. All rights reserved.
* [...]
*/
[...]
/*
* barrier_intra_lin
*
* Function: - barrier using O(N) algorithm
* Accepts: - same as MPI_Barrier()
* Returns: - MPI_SUCCESS or error code
*/
int
mca_coll_basic_barrier_intra_lin(struct ompi_communicator_t *comm,
                                 mca_coll_base_module_t *module)
{
    int i;
    int err;
    int size = ompi_comm_size(comm);
    int rank = ompi_comm_rank(comm);
First, all nodes except the root (the node with rank 0) send the root a notification that they have reached the barrier:
    /* All non-root send & receive zero-length message. */
    if (rank > 0) {
        err =
            MCA_PML_CALL(send
                         (NULL, 0, MPI_BYTE, 0, MCA_COLL_BASE_TAG_BARRIER,
                          MCA_PML_BASE_SEND_STANDARD, comm));
        if (MPI_SUCCESS != err) {
            return err;
        }
After that they block awaiting notification from the root:
        err =
            MCA_PML_CALL(recv
                         (NULL, 0, MPI_BYTE, 0, MCA_COLL_BASE_TAG_BARRIER,
                          comm, MPI_STATUS_IGNORE));
        if (MPI_SUCCESS != err) {
            return err;
        }
    }
The root node implements the other side of the communication. First it blocks until it has received n-1 notifications (one from every node in the group except itself, since it is already inside the barrier call):
    else {
        for (i = 1; i < size; ++i) {
            err = MCA_PML_CALL(recv(NULL, 0, MPI_BYTE, MPI_ANY_SOURCE,
                                    MCA_COLL_BASE_TAG_BARRIER,
                                    comm, MPI_STATUS_IGNORE));
            if (MPI_SUCCESS != err) {
                return err;
            }
        }
Once all notifications have arrived, it sends out the messages that every node is waiting for, signalling that everyone has reached the barrier, after which it leaves the barrier call itself:
        for (i = 1; i < size; ++i) {
            err =
                MCA_PML_CALL(send
                             (NULL, 0, MPI_BYTE, i,
                              MCA_COLL_BASE_TAG_BARRIER,
                              MCA_PML_BASE_SEND_STANDARD, comm));
            if (MPI_SUCCESS != err) {
                return err;
            }
        }
    }
    /* All done */
    return MPI_SUCCESS;
}
So the communication pattern is first an n:1 from all nodes to the root and then a 1:n from the root back to all nodes. To avoid overloading the root node with requests, OpenMPI allows use of a tree-based communication pattern, but the basic idea is the same: all nodes notify the root when entering the barrier, while the root aggregates the notifications and informs everyone once they are ready to continue.
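To get a feel for why a tree-based pattern relieves the root, the following sketch counts messages for both patterns. It is only a back-of-the-envelope model, not OpenMPI code: the binary tree layout (rank r has children 2r+1 and 2r+2) and all function names are assumptions made for illustration, and the tree OpenMPI actually builds may look different.

```c
#include <assert.h>

/* Linear barrier: n-1 arrival messages to the root plus n-1 release
 * messages from the root. */
int linear_total_messages(int n)
{
    assert(n >= 1);
    return 2 * (n - 1);
}

/* In the linear barrier the root takes part in every single message. */
int linear_root_messages(int n)
{
    assert(n >= 1);
    return 2 * (n - 1);
}

/* Hypothetical binary tree: number of children of `rank` among n ranks,
 * where rank r has children 2r+1 and 2r+2. */
int tree_children(int rank, int n)
{
    int children = 0;
    if (2 * rank + 1 < n) children++;
    if (2 * rank + 2 < n) children++;
    return children;
}

/* Each parent/child link carries one arrival and one release message. */
int tree_total_messages(int n)
{
    int total = 0;
    assert(n >= 1);
    for (int rank = 0; rank < n; ++rank) {
        total += 2 * tree_children(rank, n);
    }
    return total;
}

/* The root only talks to its direct children. */
int tree_root_messages(int n)
{
    assert(n >= 1);
    return 2 * tree_children(0, n);
}

/* Number of hops from the deepest rank (n-1) up to the root. */
int tree_depth(int n)
{
    int depth = 0;
    assert(n >= 1);
    for (int rank = n - 1; rank > 0; rank = (rank - 1) / 2) {
        depth++;
    }
    return depth;
}
```

For 8 processes, both patterns exchange 14 messages in total, but the root takes part in all 14 in the linear pattern and in only 4 in this binary tree, at the price of 3 hops between the root and the deepest rank.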
No, that's not enough. Once the master process has sent a message to all other processes informing them that it has reached the barrier, and all other processes have responded to say that they too have reached the barrier, only the master process knows that all processes have reached the barrier. In this scenario another message from the master to the other processes would be necessary.
I make no claim about the actual implementation of MPI barriers in any library. In particular, I am not suggesting that the sequence of messages outlined is used in practice, just that it is deficient in theory.
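The theoretical argument can be checked with a small model of who knows what. This is a sketch under assumptions of my own, not code from any MPI library: each process keeps a bitmask of the ranks it knows have reached the barrier, a process only sends after it has reached the barrier itself, and a message hands the sender's current knowledge to the receiver. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PROCS 32

typedef struct {
    int n;                      /* number of processes, rank 0 is the master */
    unsigned known[MAX_PROCS];  /* known[p]: ranks p knows to have arrived */
} barrier_model;

/* Initially every process only knows about its own arrival. */
void model_init(barrier_model *m, int n)
{
    assert(n >= 1 && n <= MAX_PROCS);
    m->n = n;
    for (int p = 0; p < n; ++p) {
        m->known[p] = 1u << p;
    }
}

/* A message transfers everything the sender knows to the receiver. */
void deliver(barrier_model *m, int from, int to)
{
    m->known[to] |= m->known[from];
}

/* Round type 1: the master sends one message to every other process. */
void master_to_all(barrier_model *m)
{
    for (int p = 1; p < m->n; ++p) {
        deliver(m, 0, p);
    }
}

/* Round type 2: every other process sends one message to the master. */
void all_to_master(barrier_model *m)
{
    for (int p = 1; p < m->n; ++p) {
        deliver(m, p, 0);
    }
}

/* True if process p knows that all n processes have arrived. */
bool knows_all(const barrier_model *m, int p)
{
    unsigned full = 0;
    for (int q = 0; q < m->n; ++q) {
        full |= 1u << q;
    }
    return m->known[p] == full;
}

/* Number of processes that may safely leave the barrier. */
int count_informed(const barrier_model *m)
{
    int count = 0;
    for (int p = 0; p < m->n; ++p) {
        if (knows_all(m, p)) {
            count++;
        }
    }
    return count;
}
```

With four processes, calling master_to_all and then all_to_master leaves count_informed at 1: only the master knows that everyone has arrived, while each other process knows only about itself and the master. A further master_to_all raises it to 4. Running all_to_master first and master_to_all second also reaches 4, with only two rounds.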