Paxos

2025-05-01

In the previous post, we discussed how replicated deterministic state machines can be used for availability and fault tolerance. However, it requires a way to ensure that all the replicas agree on the sequence of inputs. This problem can be broken into sub-problems of agreeing on one input value (say the first input). The problem of getting all replicas to agree on a value is called the consensus problem. Paxos is a solution to the consensus problem.

Basic idea

We know that a value is chosen or agreed upon when a majority of the replicas (a write quorum) accept that value. Let's start with the simple case in which there is only one replica that wants to propose a value to be accepted (this is called a proposer replica). It must get agreement from a majority of replicas for its proposal. So the proposer sends out a message asking all the replicas if they are willing to accept its proposal. Replicas that receive this message are called the acceptor replicas. Note that the proposer replica could itself be an acceptor. If the proposer hears back from a quorum of replicas promising that they are ready to accept the proposal, the proposer sends out a second message asking to replicas to actually accept its proposal. Assuming that a quorum of acceptors honor their promise, they accept the proposal by writing the the proposed value in their durable storage. If a quorum of replicas write down the value, it is chosen.

The above simplistic description hand-waves over a few issues: (i) under what conditions do the replicas promise to accept a proposal and/or honor their promise?; (ii) how does a proposer actually choose a value; (iii) how does Paxos ensure that a chosen value remains chosen; and (iv) how do the replicas get to know that a value has actually been chosen? Let's tackle them one by one.

When do acceptors promise to and/or accept a proposal?

To answer the first question (the conditions under which replicas promise to and/or accept a proposal), we need to consider the scenario wherein there are more than one proposers. If there is a single proposer, each replica can always promise/accept its proposal. But if there are multiple proposers, replicas need a way to determine which proposal to promise/accept. Paxos requires each proposer to include a proposal number with each of its proposal and for each acceptor to promise to and eventually accept the highest numbered proposal it receives. This mechanism provides a way to choose between multiple proposals (either from different proposers or by the same proposer).

Note that it is possible that a proposer receives a promise from a quorum (because it had the highest numbered proposal) but its value is not actually chosen because the replicas receive a higher numbered proposal after the initial promise but before they accept the initial proposal's value.

How do proposers decide on the value to propose and maintain a chosen value?

Now let's consider the second and third questions of determining the value and ensuring that a chosen value remains chosen. First note that the replicas don't particularly care about what value is chosen but only that a value is chosen. To get an intuitive understanding of this, consider a replicated KV store wherein each replica has received a client command and all the replicas need to agree on what client command to process next. The exact command that is chosen does not matter (it could be a command received by either of the replicas) so long as exactly one of the commands is agreed upon by all the replicas.

Paxos restricts the value that a proposer can propose to ensure that a chosen value (if it exists) remains chosen. Paxos assumes no restrictions on when proposers can start a proposal. This is essential when modeling an asynchronous distributed system wherein each replica can operate at a different speed and hence their operations cannot be time-boxed. So to ensure a chosen value (if it exists) remains chosen, Paxos constraints the value that any proposer can propose. Now let's look at how this constraint is determined.

Consider the scenario wherein a proposer can start a new proposal after a value has already been chosen. Further assume that the proposer uses a proposal number higher than the ones seen so far (this is trivial because the proposer can keep on increasing the proposal number because there are no restrictions on the number of proposal a proposer can make either). The acceptors are bound to promise and accept this proposal because it is the highest numbered proposal. To ensure that a previously chosen value remains chosen, Paxos requires acceptors to inform the new proposer about the current highest numbered proposal and the corresponding value that they have accepted. The proposer is in turn required to choose a value corresponding to the highest numbered proposal that it sees in its responses.

Because any quorum of replicas is bound to include one of the replicas from the quorum that accepted the chosen value, this guarantees that the new proposer would see this chosen value in one of its responses. Furthermore, we can show that this value would correspond to the highest proposal number in the responses by the way of contradiction. Assume that there is a replica that accepted a higher numbered proposal with a different value. This would have required a quorum of replicas to promise to accept this value (different from the chosen value). However, any such quorum would have at least one replica from the quorum that accepted the chosen value, and that replica would have informed the proposer of the chosen value and required the proposer to use that value itself. Hence the two values cannot be different and the highest numbered proposal would always correspond to the chosen value if there is such a chosen value.

The above constraints are also followed when there is no chosen value, but some replicas in a quorum have accepted some (potentially different) values. This could happen if a proposer had gotten a promise to accept a value but the value was not accepted by a quorum (maybe because some of the nodes in the quorum failed). Even in such a scenario, requiring a new proposer to choose the highest numbered proposal's value is safe because no value has been chosen thus far, making it safe for the proposer to propose any value.

To sum, the constraint of requiring the proposer to propose the value corresponding to the highest numbered proposal from its responses ensures that a chosen value remains chosen. Proposers are required to follow this constraint even when no value has been chosen yet because they don't know whether this is the case. This brings us to the more general question below.

How do replicas learn that a value has been chosen?

Now let's consider the fourth and final question of how the replicas get to know that a value has been chosen. There are two ways to go about this. First, the replicas could act as clients and try to read the value from a quorum of users. The second option is for the replicas to start a new proposal and learn about the chosen value following the above procedure. Note that the first option (reading from a quorum) is actually the first step in the second option (proposing a new value). The first step of the proposal would require the proposer to hear from a quorum of replicas. If the quorum responds with the same value, then the proposer knows that the value has already been chosen. If not (either because a value hasn't been chosen of because the proposer hears back from a different quorum than the choosing quorum), the proposer would either fail to get a value chosen (in which case it can continue repeat the process to learn the chosen value), or get a value chosen (either the previously chosen value or a new value, either of which is fine).

Some other subtleties

There are a couple of other interesting aspects about how this all comes together.

Live-locks: It is possible that the system becomes and remains live-locked if multiple replicas keep sending out proposals with increasingly higher proposal numbers. This lack of a guarantee of progress is actually a fundamental property of asynchronous distributed systems (i.e., systems wherein processes operate at different speeds, can fail, and messages can be delayed arbitrarily) as described in the FLP paper. The only way to avoid live-locks is to choose a leader replica which acts as the sole proposer while it remains the leader. However, choosing a leader requires all the replicas to agree on the leader which in itself is a consensus problem.

If a majority of the replicas remain available for a certain duration, leader election is guaranteed to converge. This is not a contradiction of the FLP theorem because replicas being available for a certain duration (determined by the message delay) is a stricter requirement than the a general asynchronous distributed system. One way to choose a replica in such a setting is for all replicas to send out proposals with themselves as the leader. All replicas share the understanding that the replica with the smallest (or largest) IP is supposed to be the leader. Any replica that isn't supposed to be the leader stops sending out its proposal after it receives a proposal from the supposed leader. If a majority of replicas remain available for a certain duration, only the supposed leader would send out the proposals and eventually be elected as the leader.

Membership: Paxos depends on knowing the set of replicas that are part of the system. However, nodes can come in and leave a distributed system, which necessitate a way to determine the membership set. However, all nodes agreeing on the memberships set is again a consensus problem. The proposal in Paxos for solving this is that the membership set should be chosen before choosing the value. In practice, Paxos is applied repeatedly (to determine a sequence of inputs to the replicated state machines) and the proposal was to include the membership set for the next Paxos iteration in the current Paxos' iteration. This does leave the system vulnerable to failures if the membership set for the next iteration does not hold true by the time of the next iteration.

Some meta thoughts about understanding Paxos

The original Part Time Parliament and Paxos Made Simple papers provide a mathematical formulation of the problem and solution. One of the challenges I had with the papers is that while I could reasonably follow the theorems, it was hard to relate the theoretical concepts to real systems and build an intuitive understanding of why Paxos works. Mahesh Balakrishnan's blog posts describe Paxos using the abstraction it provides, which helped build some intuition. Another exercise that helped get get an intuitive understanding was to think of various scenarios (number of machines, failures, message delays, etc) and follow the Paxos algorithm to reach consensus. The challenge with this approach is that the state space of possible configurations (different machine and message states) grows exponentially with each step in the algorithm. This is also why a verbose description of Paxos, like I have attempted above, cannot be comprehensive and often leaves room for misinterpretation. However, seeing how the Paxos' invariants addressed certain scenarios helped build an understanding of why those invariants were required and made it easier to follow the original papers.

I am still looking for a single framework (mathematical formulation, abstraction, simulating scenarios, something else ...) that can help learn and explain Paxos.

#Paxos #computer-science #distributed-systems #state-machine-replication