fault tolerance in MPICH/OpenMPI

后端 未结 1 394
隐瞒了意图╮
隐瞒了意图╮ 2020-12-17 17:48

I have two questions-

Q1. Is there a more efficient way to handle the error situation in MPI, other than check-point/rollback? I see that if a node

相关标签:
1条回答
  • 2020-12-17 18:29

    MPI - all implementations - have had the ability to continue after an error for a while. The default is to die - that is, the default error handler is MPI_ERRORS_ARE_FATAL - but that can be set (eg, see the discussion here). But the standard doesn't currently much beyond that; that is, it's hard to recover and continue after such an error. If your program is sufficiently simple - some sort of master-worker type of setup - it may be possible to continue this way.

    The MPI forum is currently working on what will become MPI-3, and error handling and fault tolerance will be an important component of the new standard (there's a working group dedicated to the topic). Until that work is complete, however, the only way to get stronger fault tolerance out of MPI is to use earlier, nonstandard, extensions. FT-MPI was a project that developed a very robust MPI, but unfortuantely it's based on MPI1.2; a very early version of the standard. The claim here is that they're now working with OpenMPI, but I don't know what's become of that. There's MPICH-V, based on MPI2, but that's more checkpoint-restart based than what I think you're looking for.

    Updated to add: The fault tolerance didn't make it into MPI-3, but the working group continues its work and the expectation is that something will result from that before too long.

    0 讨论(0)
提交回复
热议问题