State machine replication for fault tolerance
State machine replication is a general technique for replicating a stateful system to achieve availability and fault tolerance.
Consider a key-value store (KV store) as an example. A single node running the KV store is neither available (the node can go offline or reboot) nor fault tolerant (disk corruption on the node could lead to data loss). If the same KV store were replicated on multiple nodes, we could improve both its availability and fault tolerance. With two replicas of the KV store, both would have to fail for the store to become unavailable or lose data. We will come to the exact number of replicas needed to survive a node failure in a bit, but the general idea is to use replication for availability and fault tolerance.
However, replicating the KV store introduces a new problem: how do we keep the replicas in sync? We can assume that the replicas start from the same initial state. But if the two instances perform different operations (or perform the same operations in a different order), they will start diverging. As a simple example, say the KV store client issues two requests -- set x = 10 and set x = 20. Depending on the order in which these two are executed, the final state of the KV store would be different.
The way to address this issue is to ensure that the replicas process instructions in the same order. Another requirement is that the processing is deterministic (e.g., it does not depend on the IP address of the machine or the current time). If replicas start from the same state, and perform the same deterministic operations in the same order, they are guaranteed to be in sync at all points. This is an example of state machine replication.
More generally, stateful systems can achieve availability and fault tolerance by implementing replicated deterministic state machines. As long as the order of inputs to the state machines is the same, the state machine replicas stay in sync.
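To make this concrete, here is a minimal sketch of a deterministic KV store state machine driven by an ordered log. The class and method names (KVStateMachine, apply) are made up for illustration and don't come from any particular system; the point is only that replicas applying the same operations in the same order end up in the same state, while a different order makes them diverge.

```python
class KVStateMachine:
    """A deterministic key-value state machine: the next state depends only
    on the current state and the operation applied (no time, randomness,
    or node-specific inputs)."""

    def __init__(self):
        self.state = {}

    def apply(self, op):
        # op is a tuple like ("set", key, value)
        kind, key, value = op
        if kind == "set":
            self.state[key] = value
        return self.state.get(key)


# Two replicas starting from the same (empty) state.
log = [("set", "x", 10), ("set", "x", 20)]
r0, r1 = KVStateMachine(), KVStateMachine()

# Same ordered log on both replicas -> identical state.
for op in log:
    r0.apply(op)
    r1.apply(op)
assert r0.state == r1.state == {"x": 20}

# Applying the same operations in a different order makes a replica diverge.
r2 = KVStateMachine()
for op in reversed(log):
    r2.apply(op)
assert r2.state == {"x": 10} != r0.state
```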
We will discuss how to ensure the same order of inputs to each of the replicas in a future post. For now, consider that we want to design a system that can tolerate F failures. How many replicas do we need?
Let's consider the concrete case of F = 1. Clearly, a single replica cannot tolerate one failure. How about 2 replicas? We can write updates to both replicas and read from either replica. Let's walk through what happens when one of the replica nodes fails. Say n0 and n1 are the replica nodes and n1 fails. We can either continue updating/reading from n0 or not. If we can't continue, then two replicas are not enough because a single node failure brings the system to a halt. So let's say we continue updating and reading from the remaining replica n0. At a later point, n1 comes back up with a lagging state due to the updates it missed while it was down. Further, consider that immediately after n1 came back up, n0 failed. We are now left with a single node with outdated state, which cannot be used for reads. Hence, two replicas don't suffice to handle a single failure (a toy sketch of this sequence follows).
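The failure sequence above can be traced in a few lines. Plain dicts stand in for replicas here and write_all is an illustrative helper, not part of any real system.

```python
# Two replicas: write to both, read from either (the scheme above).
n0, n1 = {}, {}

def write_all(replicas, key, value):
    for replica in replicas:
        replica[key] = value

write_all([n0, n1], "x", 10)   # both replicas healthy
# n1 goes down; we keep writing to the remaining replica n0.
write_all([n0], "x", 20)
# n1 comes back (still stale), then n0 fails immediately after.
# The only replica left to read from is n1:
print(n1.get("x"))             # 10 -- a stale value, not the latest 20
```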
What about three replicas (n0, n1, and n2)? Say we perform updates on all the replicas and read from any one of them. If one of the replicas (say n1) fails, we continue writing to the remaining two replicas (n0 and n2) and reading from either of these. Now n1 comes back up and n0 goes down immediately after that. We can continue writing to both n1 and n2. But we can't read from n1 until it catches up with the updates it missed. So our original idea of reading from any one replica does not work. Instead, we need to read from a majority of the replicas (2 in this case) and accept the read value only if both replicas agree. In this case, when we try to read from both n1 and n2, they will differ on the updates that n1 missed, and we would know that we cannot trust the value coming from either n1 or n2 (because we, as the client, don't know which of n1 or n2 had failed and missed the updates). Given some time, n1 would catch up on the missed updates (reading them from n2) and we can start reading from it again. This need to read from two replicas also requires that we ensure we write to at least two replicas.
We have just shown that we can tolerate 1 failure with 3 replicas by requiring that reads and writes go to at least 2 of the replicas. More generally, to handle F failures, we need 2F + 1 replicas and we need to read from and write to a majority of those replicas (F + 1) to guarantee fault tolerance. Quorum is the term used to describe a majority of replicas. The intuitive reasoning behind quorums providing fault tolerance is that any two quorums are bound to have at least one replica in common. So if one quorum has updated its state, any other quorum would have at least one node that ensures the system does not return incorrect responses based on an outdated state.
Note that the above discussion assumes that none of the nodes are malicious (i.e., a node does not report incorrect data that differs from its state). Malicious nodes are said to have Byzantine faults and require a Byzantine failure model, in which 2F + 1 nodes are not enough to handle F failures. We will discuss Byzantine failures in a different post.