I am trying to understand how three-phase commit avoids blocking
Consider the following two failure scenarios:
Scenario 1: In phase 2 the coordinator sends preCo
In scenario 1 :
During recovery: All cohorts ,except A, will be in PRECOMMIT state. This tells the recovery node that all cohorts had voted for the commit and moved forward . So A and Coordinator should be in PRECOMMIT state. Since this is a non-final state the transaction is aborted .
In scenario 2 :
During recovery: All cohorts ,except A, will be in PRECOMMIT state. This tells the recovery node that all cohorts had voted for the commit and moved ahead .But since A received the doCommit message it is in COMMITTED state.Had a crash not happened the recovery would ask all cohorts to commit since at least one cohort has committed. Since a crash (A crashed) happened the recovery node sees no live cohort with the committed state and hence it deduces that no cohort got the doCommit message. Hence the transaction will be aborted and all cohorts will be asked to release resources.
when A returns from the crash and begins recovery it will find that all the other cohorts aborted the transaction and it too will abort the transaction .
(source: swturner at regal.csep.umflint.edu)
In the two-phase commit the coordinator sends a prepare message to all participants (nodes) and waits for their answers. The coordinator then sends their answers to all other sites. Every participant waits for these answers from the coordinator before committing to or aborting the transaction.
The two-phase commit protocol also has limitations in that it is a blocking protocol
. For example, participants will block resource processes while waiting for a message from the coordinator. If for any reason this fails, the participant will continue to wait and may never resolve its transaction. Therefore the resource could be blocked indefinitely. On the other hand, a coordinator will also block resources while waiting for replies from participants. In this case, a coordinator can also block in
definitely if no acknowledgement is received from the participant.
However, the three-phase protocol introduces a third phase called the pre-commit
. The aim of this is to 'remove the uncertainty period for participants that have committed and are waiting for the global abort or commit message from the coordinator.
When receiving a pre-commit message, participants know that all others have voted to commit.
If a pre-commit message has not been received the participant will abort and release any blocked resources.
The thing that helped me understand the non-blocking property was to realize that after the first round of messages both protocols are in essentially the same state; all participants have agreed that they can commit and are waiting for confirmation to do so.
Now, consider what a participant knows after it replied "ok to commit" in the first round.
Moving on, after the second round of messages and confirmations in 3PC we are guaranteed that all participants know that the group decision is to commit.
This means that there is never a time in 3PC when a participant does a commit action that another participant is not anticipating.
Three-phase commit isn't magic; it's just more resilient than two-phase commit. In particular, 3PC is resilient against single-point failure, but not all kinds of multiple-point failure. Both scenarios in the question posit two-point failures. In other words, the premise of the question is misguided; it's asking more of 3PC than it's capable of.
For further reading, here's a sentence from the abstract of a paper on the subject, Analysis and Verification of Two-Phase Commit & Three-Phase Commit Protocols, by Muhammad Atif, to whet your appetite:
We also apply our method to its “amended” variant, the Three-Phase Commit Protocol (3PC) and prove it to be erroneous for simultaneous site failures
I found this paper to provide a foothold into the literature. There's no small amount of it on this subject, if you want to delve in.