Intuitive explanation of the FLP impossibility result
FLP impossibility result states that there does not exist a fault-tolerant, safe, and live consensus protocol for an asynchronous distributed system.
- A fault-tolerant protocol is one that can handle failure of some nodes in the system.
- A safe protocol is one in which all the nodes reach the same consensus value.
- A live protocol is one that terminates in a finite number of steps.
- An asynchronous distributed system is one in which messages can take arbitrarily long to be delivered (but are eventually delivered and delivered in-order). Note that this subsumes the model wherein there is no bounds on how much time a node can take to do some processing because a slow node is indistinguishable from a fast node with slow network to other nodes.
The FLP paper has supposedly elegant proofs, but I found it rather tough to follow the proofs and even tougher to build an intuition. It is probably just a me thing, but nevertheless, in this post I will try to explain the FLP theorem in a "simpler" manner (or at least in a manner more intuitive to me).
High level intuition
The high level intuition is as follows. Consider a node that is the decider for consensus. If there are many such nodes, consider any one of them. Say this node is the tie-breaker between agreeing upon 0
or 1
. Since the protocol is fault tolerant, it should be able to handle the failure of this node. If the node fails, how can the protocol proceed? Either it can wait indefinitely for the node to come back up (potentially doing some other work), in which case it is not live. Or it can decide on a value, but at a later point this node can come back up and decide a different value (after all this node was the decider), which makes the protocol not safe. So we can't have a fault-tolerant, safe, and live consensus protocol.
Two step proof from the paper
Now let's look at the two step proof described in the paper. The first step of the proof is showing that there is at least one initial state from which both 0
and 1
are possible agreed upon values.
The paper proves this step by contradiction. Assume that each initial state could lead to only one value, either 0
or 1
. All the initial states can be arranged in a way that neighbors differ only in a single nodes' initial state. So there are two initial states that lead to different final values while only differing in a single nodes' initial state. Because the protocol is fault tolerant, it should be able to reach a decision even without the differing node being up. So there must exist some sequence of steps that we can take from one of the initial state and reach consensus. However, we can apply this exact same sequence to the other initial state as well. And because the initial state and sequence of steps for the participating nodes is the same for both scenarios, both should reach the same consensus. But we started with assuming that the two initial states reach different consensus values, which is a contradiction.
Here is another intuitive way to reason about this step. If each initial state leads to exactly one decision, then a reasonable protocol would be to have all the nodes learn the initial state of other nodes. Next consider a node that is the decider for the initial state to result mapping. If there are multiple such nodes, consider any single one. Because the protocol is fault tolerant, it must be able to reach a conclusion without knowing this node's initial state. This is a contradiction because this node was the decider. The protocol could either wait (violating liveness) or decide (and risk a different decision from this node, violating safety).
The second step in the proof is to show that any protocol that starts from an undecided state can continue to be in an undecided state. The paper again shows this by contradiction, which we won't go into here. The basic idea is that the protocol can continue to spin its wheels indefinitely jumping from one undecided state to another. This is because any step that moves the system from an undecided to a decided state can be indefinitely delayed in our asynchronous distributed system model. Hence the protocol would not terminate (violating liveness). And there is always some other step (even if that is a no-op step) that the system can take that moves it from one undecided state to another undecided state.