Question 11: Fast failure recovery
Jay is building a distributed storage system. He has 10 disks: D0, D1, ..., D9. For durability, he decides to replicate each piece of data on two disks. He creates disk groups in the following manner:
G0: D0, D1
G1: D2, D3
G2: D4, D5
G3: D6, D7
G4: D8, D9
For any given piece of data, Jay first chooses a group at random and then writes it to both the disks in the group.
Jay's boss, Gloria, is concerned that this design is not good for recovering data in case of a disk failure. She instead suggests choosing two disks at random for any given piece of data and writing data to both those disks. That is, there are no fixed groups.
Which approach do you think is better for recovering data for a failed disk?
Solution
Gloria's approach of choosing two disks at random for each piece of data is better for recovering from a disk failure.
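To make the two placement policies concrete, here is a minimal Python sketch; the function names place_jay and place_gloria are made up for illustration and are not part of the original problem.

```python
import random

DISKS = list(range(10))                            # D0 .. D9
GROUPS = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]  # Jay's fixed mirror pairs

def place_jay():
    """Jay: pick one fixed group at random; write to both of its disks."""
    return random.choice(GROUPS)

def place_gloria():
    """Gloria: pick any two distinct disks at random; no fixed groups."""
    return tuple(random.sample(DISKS, 2))
```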
When a disk fails, a new disk is introduced to take the place of the failed one. This new disk needs to be populated with all the data that was on the failed disk. Moreover, this recovery should ideally happen without stopping other, non-recovery reads.
In Jay's solution, following a disk failure, the replacement disk would need to be populated using data from the other disk in its group. This places a heavy read-bandwidth demand on the single surviving disk of that group. Not only does it slow down recovery, it can also hurt the performance of non-recovery reads for data belonging to that group, because every read is bottlenecked by the read bandwidth of the one surviving disk.
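As a rough illustration (the capacity and throughput figures here are assumptions, not part of the problem): if each disk stores 1 TB and can sustain about 100 MB/s of reads, rebuilding the replacement disk from the single surviving group member takes on the order of 10^12 bytes / 10^8 bytes/s = 10^4 seconds, close to three hours, during which that one disk must also serve every foreground read for the group's data.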
In contrast, Gloria's solution spreads the replicas across all the disks. To reconstruct the data of a failed disk, some part of the data is read from each of the remaining functioning disks rather than from a single disk.
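A small simulation sketch (again illustrative; recovery_sources and the placement functions are hypothetical names, and the item count is arbitrary) makes the difference visible: under Jay's scheme every recovery read for a failed D0 lands on D1, while under Gloria's scheme the reads are spread roughly evenly over D1 through D9.

```python
import random
from collections import Counter

DISKS = list(range(10))
GROUPS = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]

def place_jay():
    return random.choice(GROUPS)

def place_gloria():
    return tuple(random.sample(DISKS, 2))

def recovery_sources(place, failed=0, n_items=100_000):
    """For every item that had a copy on `failed`, record which
    surviving disk the recovery read would come from."""
    sources = Counter()
    for _ in range(n_items):
        a, b = place()
        if failed in (a, b):
            sources[b if a == failed else a] += 1
    return dict(sources)

print("Jay:   ", recovery_sources(place_jay))     # all reads come from D1
print("Gloria:", recovery_sources(place_gloria))  # reads spread over D1..D9
```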
This idea is similar to spreading out parity across disks in RAID-5 to improve write performance, as discussed in this blog post.