Question 11: Fast failure recovery

2024-07-09

Jay is building a distributed storage system. He has 10 disks: D0, D1, ..., D9. For durability, he decides to replicate each piece of data on two disks. He creates disk groups in the following manner:

G0: D0, D1
G1: D2, D3
G2: D4, D5
G3: D6, D7
G4: D8, D9

For any given piece of data, Jay first chooses a group at random and then writes the data to both disks in that group.

Jay's boss, Gloria, is concerned that this design is not good for recovering data after a disk failure. She suggests instead choosing two disks at random for each piece of data and writing the data to both of them. That is, there are no fixed groups.
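
To make the two placement schemes concrete, here is a minimal sketch of both. The names `place_grouped` and `place_random_pair` are made up for illustration, and disks are represented by their index (0 for D0, and so on).

```python
import random

# Disks D0..D9 are represented by their index.
DISKS = list(range(10))

# Jay's fixed groups G0..G4.
GROUPS = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]


def place_grouped(rng):
    """Jay's scheme: pick one fixed group at random, write to both of its disks."""
    return rng.choice(GROUPS)


def place_random_pair(rng):
    """Gloria's scheme: pick any two distinct disks at random."""
    return tuple(sorted(rng.sample(DISKS, 2)))


if __name__ == "__main__":
    rng = random.Random()
    print("grouped:    ", place_grouped(rng))
    print("random pair:", place_random_pair(rng))
```

Under Jay's scheme there are only 5 possible replica pairs, while Gloria's scheme can produce any of the 45 pairs of distinct disks.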

Which approach do you think is better for recovering data for a failed disk?

Solution coming up in the next post.

Solution for uncontended contention:

Mitch should use a form of optimistic concurrency control. Each process reads its desired data and performs its computations without any locking. When it tries to write the data, it confirms that the current state of the data still matches what it read. If there is a mismatch, the process discards its computation and starts again.

In the common case where there is no actual contention, processes proceed without any locking. In the rare case of contention (two processes worked on the same piece of data), the first writer wins and the other discards its work and retries.
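
The read, compute, check, and retry loop described above can be sketched as follows. This is only an illustration under assumptions: `VersionedStore`, `write_if_unchanged`, and `update` are hypothetical names, the store is in memory, and a version counter stands in for "the current state matches what was read". The small internal lock only makes the check-and-write step atomic, which a real data store would provide as a conditional write; the computation itself runs without holding any lock.

```python
import threading


class VersionedStore:
    """Hypothetical in-memory store mapping key -> (version, value)."""

    def __init__(self):
        self._data = {}
        # Guards only the short check-and-write step, not the computation.
        self._guard = threading.Lock()

    def read(self, key):
        """Return (version, value); a missing key reads as (0, None)."""
        with self._guard:
            return self._data.get(key, (0, None))

    def write_if_unchanged(self, key, expected_version, new_value):
        """Write only if nobody has written since `expected_version` was read."""
        with self._guard:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                return False  # someone else won; caller must retry
            self._data[key] = (current_version + 1, new_value)
            return True


def update(store, key, compute, max_attempts=10):
    """Optimistically apply `compute` to the value at `key`, retrying on conflict."""
    for _ in range(max_attempts):
        version, value = store.read(key)
        new_value = compute(value)  # no lock held while computing
        if store.write_if_unchanged(key, version, new_value):
            return new_value
        # Mismatch: discard the computation and start again.
    raise RuntimeError("too many conflicting writers")
```

For example, `update(store, "counter", lambda v: (v or 0) + 1)` increments a counter, and a writer holding a stale version gets `False` back from `write_if_unchanged` instead of overwriting newer data.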

This is a canonical reference for optimistic concurrency control methods. This is an example of implementing optimistic concurrency in AWS DynamoDB using conditional writes (i.e., writes that check for a certain condition before succeeding).

#qna #storage-systems