Question 5: Fault tolerant locks

2024-06-28

John has built a distributed system that consists of multiple processes that serve client requests. However, all the processes need to update a global shared state which is stored in a database. The database is hosted on a separate server. John decided to use locks. Whenever a process wants to write to the database, it first requests the database server for a lock. The server granted lock to only one process at a time. After a process was done writing to the database, it would release the lock. The server would then grant the lock to the next process that requests it.

This worked well for a while, but then John noticed that sometimes a process would die (due to a bug, or a server failure) while holding the lock. This would stall progress for all the processes.

Can you help John design a more fault-tolerant locking mechanism?

Solution coming up in the next post!

Solution for fault tolerant writes:

Andrew is correct in reasoning that writing the metadata before the upload completes is not fault tolerant. Consider a scenario in which the video upload to S3 fails (for whatever reason, e.g., network connection error). Their app would have an inconsistent state wherein it will incorrectly show the video as uploaded.

This is as example of a more generalization rule for fault-tolerant or crash-consistent updates: write data before the metadata. Soft updates describes the rules to follow for crash-consistent writes in a file system. This particular rule (write data before the metadata) is relevant more generally as well.

#concurrency #distributed-systems #fault-tolerance #qna