Check if adjacent slave process is ended in MPI

Question

In my MPI program, I want each process to exchange information with its adjacent processes. But if a process ends and stops sending, its neighbors will wait forever. How can I resolve this issue? Here is what I am trying to do:

if (rank == 0) {
    // don't do anything until all slaves are done
} else {
    while (condition) {
        // send info to rank-1 and rank+1
        // if can receive info from rank-1, receive it, store received info locally
        // if cannot receive info from rank-1, use locally stored info
        // do the same for process rank+1
        // MPI_Barrier(slaves); (wait for other slaves to finish this iteration)
    }
}

I am going to check the boundaries of course. I won't check rank-1 when process number is 1 and I won't check rank+1 when process is the last one. But how can I achieve this? Should I wrap it with another while? I am confused.

Answer 1:

I'd start by saying that MPI wasn't originally designed with your use case in mind. In general, MPI applications all start together and all end together. Not all applications fit into this model though, so don't lose hope!

There are two relatively easy ways of doing this and probably thousands of hard ones:

Use RMA to set flags on neighbors.

As has been pointed out in the comments, you can set up a tiny RMA window that exposes a single value to each neighbor. When a process is done working, it can do an MPI_Put on each neighbor to indicate that it's done and then MPI_Finalize. Before sending/receiving data to/from the neighbors, check to see if the flag is set.

Use a special tag when detecting shutdowns.

The tag value often gets ignored when sending and receiving messages, but this is a great time to use it. You can have two tags in your application. The first (we'll call it DATA) just indicates that this message contains data and you can process it as normal. The second (DONE) indicates that the process is done and is leaving the application. When receiving messages, you'll have to change the value for tag from whatever you're using to MPI_ANY_TAG. Then, when the message is received, check which tag it carried (the MPI_TAG field of the returned MPI_Status). If it's DONE, then stop communicating with that process.

There's another problem with the pseudo-code that you posted however. If you expect to perform an MPI_Barrier at the end of every iteration, you can't have processes leaving early. When that happens, the MPI_Barrier will hang. There's not much you can do to avoid this unfortunately. However, given the code you posted, I'm not sure that the barrier is really necessary. It seems to me that the only inter-loop dependency is between neighboring processes. If that's the case, then the sends and receives will accomplish all of the necessary synchronization.

If you still need a way to track when all of the ranks are done, you can have each process alert a single rank (say rank 0) when it leaves. When rank 0 detects that everyone is done, it can just exit. Or, if you want to leave after some other number of processes is done, you can have rank 0 send out a message to all other ranks with a special tag like above (but add MPI_ANY_SOURCE so you can receive from rank 0).

Source: https://stackoverflow.com/questions/34706087/check-if-adjacent-slave-process-is-ended-in-mpi

Tags

c++

parallel-processing

synchronization

mpi