Question 16: Too many videos
Daphne has built a video sharing app. A user can create and upload their videos on the app. Users can also share videos with other users. Her app allows each user to view all the videos accessible to them, the ones that the user created as well as the ones that someone else has shared with the user. She uses a key-value store in the backend to implement this library. Whenever a new video shows up for a user (either self created or shared), the app stores the video in the key-value store using the key {userID}-{monotonically-increasing-per-user-video-count}
.
She noticed that the storage used by the key-value store is bloating up. She suspects that it is because of storing duplicate videos for each user for shared videos. Can you suggest a mechanism to avoid the storage bloat?
Solution
Daphne should to deduplicate her storage, i.e., instead of storing duplicate copies of the shared videos, she should share a single copy and have multiple references to it.
One way to incorporate deduplication would be to use a two level key-value design. The first level key-value store uses the same key but points to another key in the second level key-value store. The second level key-value store has a key derived from the video (e.g., a SHA-256 hash of the video file) and points to the actual video. This way duplicated videos will be stored only once while allowing for per-user video listings.
Venti is one of the seminal papers that described deduplicated storage.