How to classify a failure detector?

Question

I understand that failure detectors in asynchronous systems are classified as (eventually) perfect or (eventually) strong, and I know how those classes are defined, but I struggle to get the intuition behind them.

Suppose I have a concrete implementation of a failure detector, which periodically listens for heartbeat messages from each process. If a process hasn't sent its heartbeat message for a while, the process will be added to a list of suspects until a message is received from the process.
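For concreteness, the detector described above could look roughly like the following. This is only a minimal sketch: the class name `HeartbeatFailureDetector`, the `timeout` parameter, and passing the clock in explicitly as `now` are assumptions made for illustration, not part of any particular library.

```python
class HeartbeatFailureDetector:
    """Suspects a process when no heartbeat arrived within `timeout` time units."""

    def __init__(self, processes, timeout, now=0.0):
        self.timeout = timeout
        # Assumption: the start time counts as an initial heartbeat from everyone.
        self.last_heartbeat = {p: now for p in processes}

    def on_heartbeat(self, process, now):
        # A fresh heartbeat implicitly takes the process off the suspect list.
        self.last_heartbeat[process] = now

    def suspects(self, now):
        # Everyone whose last heartbeat is older than the timeout is suspected.
        return {p for p, t in self.last_heartbeat.items() if now - t > self.timeout}


fd = HeartbeatFailureDetector(["p1", "p2"], timeout=3.0)
fd.on_heartbeat("p1", now=2.0)
fd.on_heartbeat("p2", now=2.0)
fd.on_heartbeat("p1", now=4.0)
suspected_at_6 = fd.suspects(now=6.0)  # p2 has been silent for 4.0 > 3.0
fd.on_heartbeat("p2", now=6.5)         # p2 was only slow, not crashed
suspected_at_7 = fd.suspects(now=7.0)  # nobody is suspected any more
```

In this run p2 is suspected at time 6 and cleared again once its late heartbeat arrives, which is exactly the kind of mistake the accuracy properties are about.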

Now, how do I know which class this FD implementation belongs to? Would that require a formal proof of the FD's completeness and accuracy properties? If a perfect FD can be implemented, why bother studying other (weaker) ones? Or are the classes only "assumed" when designing fault-tolerant distributed algorithms?

What puzzles me is how to actually classify a given concrete FD. I would appreciate any answers.

Answer 1:

You first need to model the synchrony of the processes and of the links between them; for example: "all processes can eventually communicate in a timely manner, messages are transmitted within a known time bound, and processes execute steps within a known time bound". Once you have defined such a model, you can analyze a specific algorithm, determine its class, and prove it.
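As an illustration of how the model decides the class: if message delays are bounded but the bound is unknown, a fixed timeout may be too short forever, whereas a detector that lengthens its timeout after every mistaken suspicion eventually stops suspecting correct processes. This is the usual textbook construction of an eventually perfect detector under partial synchrony. The sketch below is a hedged illustration only; `AdaptiveHeartbeatDetector`, `initial_timeout`, `increment`, and the arrival times are all assumed for the example.

```python
class AdaptiveHeartbeatDetector:
    """Heartbeat detector that grows a process's timeout after each false suspicion."""

    def __init__(self, processes, initial_timeout, increment, now=0.0):
        self.increment = increment
        self.timeout = {p: initial_timeout for p in processes}
        self.last_heartbeat = {p: now for p in processes}
        self.suspected = set()

    def on_heartbeat(self, process, now):
        if process in self.suspected:
            # The suspicion was a mistake: retract it and be more patient next time.
            self.suspected.discard(process)
            self.timeout[process] += self.increment
        self.last_heartbeat[process] = now

    def check(self, now):
        # Suspect every process whose last heartbeat is older than its own timeout.
        for p, t in self.last_heartbeat.items():
            if now - t > self.timeout[p]:
                self.suspected.add(p)
        return set(self.suspected)


# Assumed scenario: heartbeats from q arrive 2.5 time units apart, then q crashes.
afd = AdaptiveHeartbeatDetector(["q"], initial_timeout=1.0, increment=1.0)
mistakes = 0
for arrival in [2.5, 5.0, 7.5, 10.0, 12.5]:
    if "q" in afd.check(now=arrival):  # poll just before the heartbeat is delivered
        mistakes += 1
    afd.on_heartbeat("q", now=arrival)

after_crash = afd.check(now=20.0)      # no more heartbeats: q stays suspected
```

Here the detector wrongly suspects q twice, its timeout grows from 1.0 to 3.0, and after that it makes no further mistakes while q is alive; once q crashes it is suspected permanently. Whether that behaviour holds in general is what the proof against the chosen model has to establish.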

The different classes of failure detectors are useful for encapsulating and abstracting away such underlying assumptions when designing higher-level algorithms. They can also be used to determine which problems (consensus, broadcast, weak leader election, etc.) are harder or easier to solve depending on the required failure detector class.

In contrast to what is stated in your question, a perfect FD cannot be implemented in every system model: in a purely asynchronous system, for example, a crashed process cannot be distinguished from a very slow one. Actually, one active area of research is finding the minimal synchrony requirements under which, e.g., an omega failure detector can be implemented (see the "Omega meets Paxos" paper).

You can imagine diverse scenarios where synchrony is only partial, e.g., some links are too unreliable, some processes are behind firewalls (outgoing messages allowed, but no incoming messages), etc. When you model the synchrony of a concrete deployment and then answer the question of which FD can be built on that model, you are at the same time answering which problems can be solved in that model (and consequently in that deployment).

Source: https://stackoverflow.com/questions/29065697/how-to-classify-a-failure-detector

Tags

distributed-computing

distributed-system

distributed-algorithm