After Hours Academic

Question 4: Fault tolerant writes

Andrew is building a video storage app. Users can upload their videos to his app and then browse them later. During the upload process, Andrew first stores the video in S3 and then updates a database containing metadata about the video (e.g., title, upload date). This database is used to show all of a user's uploaded videos.

The upload process takes a while to complete because storing the video in S3 is slow. Andrew's boss, Ram, suggests that they update the database first and upload the video to S3 in a background thread. This would reduce the perceived duration of the upload because the user would immediately see the video's information (e.g., its title) in the app. Andrew disagrees with Ram, citing fault-tolerance concerns.
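To make the two orderings concrete, here is a minimal sketch in Python. The helpers `put_video_to_s3` and `insert_video_metadata` are hypothetical stand-ins for the real S3 upload and metadata-database write, not anything from Andrew's codebase.

```python
# Minimal sketch of the two upload orderings under debate.
# put_video_to_s3 and insert_video_metadata are hypothetical stand-ins
# for the real S3 upload and metadata-database write.

import threading
from typing import Optional


def put_video_to_s3(video_bytes: bytes) -> str:
    """Stand-in for the S3 upload; slow in practice. Returns the object key."""
    return "s3://videos/example-key"


def insert_video_metadata(title: str, s3_key: Optional[str]) -> None:
    """Stand-in for the metadata-database write; fast in practice."""
    pass


def upload_andrew(title: str, video_bytes: bytes) -> None:
    # Andrew's ordering: the slow S3 write finishes first; only then does
    # the metadata row (which the app lists to the user) appear.
    s3_key = put_video_to_s3(video_bytes)
    insert_video_metadata(title, s3_key)


def upload_ram(title: str, video_bytes: bytes) -> None:
    # Ram's ordering: the metadata row appears immediately, and the slow
    # S3 write happens in a background thread after this function returns.
    insert_video_metadata(title, s3_key=None)
    threading.Thread(target=put_video_to_s3, args=(video_bytes,)).start()
```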

Do you agree with Andrew or Ram?

Solution coming up in the next post!


Solution to high tail latency for reads from an SSD:

The reason Jenny is seeing high tail latency for reads is that reads (which take tens of microseconds) can get queued behind writes (hundreds of microseconds) or erasures (a few milliseconds) in SSDs.

SSDs consist of multiple concurrently operating planes. Within a plane, however, operations (reads, writes, erasures) are serialized. If a read gets stuck behind a write or an erasure on a plane, the end-to-end latency of that read is much higher than that of a typical read. This manifests as high read tail latency.
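As a back-of-envelope sketch (in Python, with rough illustrative latencies rather than numbers measured on Jenny's device), a read that lands behind a write or an erasure on the same plane inherits that operation's latency:

```python
# Rough illustration of why reads queued behind writes or erasures show up
# in the tail. Latencies are illustrative microsecond values, not measurements.

READ_US = 50     # typical flash page read
WRITE_US = 500   # typical flash page program (write)
ERASE_US = 3000  # typical flash block erase

def read_latency(time_left_on_plane_us: int = 0) -> int:
    """End-to-end read latency when the plane is busy for the given time."""
    return time_left_on_plane_us + READ_US

print(read_latency())          # 50 us   -> the common case: idle plane
print(read_latency(WRITE_US))  # 550 us  -> read stuck behind a write
print(read_latency(ERASE_US))  # 3050 us -> read stuck behind an erasure: the tail
```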

Tiny-Tail Flash is a good read on this topic. Interruptible writes and erasures are another solution to this problem. Last I checked, there were no papers on interruptible erasures, only patents, but a recent search turned up this paper (I haven't read it yet).

#concurrency #distributed-systems #fault-tolerance #qna #storage-systems