Question 5: Fault tolerant locks
John has built a distributed system that consists of multiple processes that serve client requests. However, all the processes need to update a global shared state which is stored in a database. The database is hosted on a separate server. John decided to use locks. Whenever a process wants to write to the database, it first requests the database server for a lock. The server granted lock to only one process at a time. After a process was done writing to the database, it would release the lock. The server would then grant the lock to the next process that requests it.
This worked well for a while, but then John noticed that sometimes a process would die (due to a bug, or a server failure) while holding the lock. This would stall progress for all the processes.
Can you help John design a more fault-tolerant locking mechanism?
Solution
John should use leases, which are expiring locks. Processes acquire locks with a time limit and keep refreshing the lock (if they are still alive and using the database). If the database server does not receive a refresh request from a process and the lease has expired, it can grant the lease to another process. If the process that had originally held the lease comes back and tries to access the database, the database server can deny that request. (This is typically implemented by having the process present its lease for every access, so that the database can verify the lease holder).
This is the canonical reference paper for leases. Although similar ideas were explored in other concurrent research as well. This blog post from AWS on how to implement leases is a more recent (and maybe easier to digest) explanation of how leases work.