Question 9: Slow scanning

2024-07-05

Gloria is building a single-node database management system (DBMS). Her DBMS maintains some metadata for each page at the top of the page. Periodically, a background thread in the DBMS goes through all the pages and reads this metadata and takes some action based on it. She would like to reduce the amount of time it takes for her DBMS to go through all the pages. Can you suggest some mechanisms to help with that?

Solution coming up in the next post!

Solution for gaming duration:

Phil should use a columnar database. Columnar databases arrange data on disk in a column-first format. That is, all entries from a column are stored contiguously, followed by all entries from another column, and so on. This is in contrast to row-based databases that store data in a row-first format (all entries from one row after the another).

Column-first storage makes reads for all entries in a column faster (because they are contiguous on disk). This is the access pattern that Phil would have for summarizing the stats (e.g., average gaming duration).

AWS Redshift documentation has a good primer on columnar storage. If you want a more thorough and academic treatment, this paper is a good starting point.

#qna