Background
Very early in my career, I was working on a product that enabled hospital staff to ask "a relational database" questions such as "How many patients older than 60 were admitted in the past month?", "How long did they stay?", "Which surgeries were common?", and "Which machines were underutilized?".
I had never ventured beyond "SELECT x, y, z from T" until then, and this product compelled me to understand stored procedures, triggers, the basics of data normalization, and other such topics in a non-superficial way.
With characteristic youthful enthusiasm, I wanted to "dig deeper", so I purchased a highly recommended book on SQL performance tuning. I read the first few pages, fell asleep midway, and promptly forgot what I had read. The book, I am sure, was very well written; I was just not prepared for it.
Conjecture
Years later I realized that, like most things in software, databases cannot be understood as a "set of features" but should instead be seen through the lens of "What is the problem to be solved here?" and "How does the technology (the database) tackle that specific problem?". Let me illustrate this with a problem...
Problem
Let's consider the need for storing the physiological data of patients in a database.
Intensive Care Units in hospitals have Patient Monitors that continuously capture physiological data such as blood pressure, heart rate, blood oxygen saturation, etc., at a rapid rate.
The physiological data has the following traits:
Every data point is associated with a timestamp which indicates when the measurement was taken.
Measurements are never updated once captured and stored.
The measurements can come in very rapidly.
Patient Monitors have different capabilities. So, one model of Patient Monitor might capture blood pressure, heart rate and SpO2 and another might capture blood pressure, heart rate, SpO2, respiration rate and say, body temperature.
Possible solutions for storing physiological data
I can think of the following ways of storing the data in a PostgreSQL database...
Option 1: Each measurement type as a column
| timestamp           | bp     | heart-rate |
|---------------------|--------|------------|
| 2022-10-30T05:29:44 | 120/80 | 60         |
| 2022-10-30T05:29:46 | 112/79 | 70         |
This is a very simple data model to understand. However, if a certain model of Patient Monitor does not capture all parameters, then some of the columns would always contain nulls.
Any new sensor attached to the Patient Monitor with an additional measurement type would require a schema change.
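Here is a minimal sketch of what this wide-table layout might look like in PostgreSQL (the table and column names are just illustrative assumptions):

```sql
-- Option 1 sketch: one column per measurement type (names are illustrative).
CREATE TABLE vitals (
    ts         timestamptz NOT NULL,  -- when the measurement was taken
    bp         text,                  -- e.g. '120/80'; stays NULL if the monitor has no BP sensor
    heart_rate integer,
    spo2       integer,
    resp_rate  integer                -- a new sensor means ALTER TABLE ... ADD COLUMN
);
```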
Option 2: Measurements as opaque blobs
| timestamp           | measurement                        |
|---------------------|------------------------------------|
| 2022-10-30T05:29:44 | byte representation of measurement |
| 2022-10-30T05:29:46 | byte representation of measurement |
This is again a very simple data model, since the measurement column is essentially being used as an object store.
Whether a sensor captures two parameters or multiple parameters does not matter, since the byte column can accommodate anything.
There is no data type validation since we are just dumping blobs into a column.
Also, we have no way to query for data such as the average blood pressure value in the past month or all heart rates above 100. Every query would involve reading the bytes into memory, decoding them, and then running aggregations, filters and whatnot in the application.
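A minimal sketch of the blob approach, assuming the monitor's payload is stored untouched in a bytea column (names are illustrative); the database can filter by timestamp, but anything inside the blob has to be decoded by the application:

```sql
-- Option 2 sketch: the payload is opaque to the database (names are illustrative).
CREATE TABLE measurements_blob (
    ts      timestamptz NOT NULL,
    payload bytea NOT NULL  -- encoded measurement; only the application knows how to interpret it
);

-- "Average blood pressure in the past month" cannot be expressed in SQL here;
-- the best the database can do is hand back the raw bytes for the time range.
SELECT payload
FROM measurements_blob
WHERE ts >= now() - interval '1 month';
```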
Option 3: Dedicated table for each measurement type
Blood Pressure Table
| timestamp           | value  |
|---------------------|--------|
| 2022-10-30T05:29:44 | 120/80 |
| 2022-10-30T05:29:46 | 112/79 |
Heart Rate Table
| timestamp           | value |
|---------------------|-------|
| 2022-10-30T05:29:44 | 60    |
| 2022-10-30T05:29:46 | 70    |
Having a dedicated table for each measurement type gives us a clean, easy-to-understand schema.
However, this also results in an explosion in the number of tables needed to model physiological parameters. Pretty much everything that can be measured about a human body would end up being a table. (Imagine parameters such as cardiac output, blood glucose, carbon dioxide, etc.)
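A minimal sketch of the table-per-measurement layout (names are illustrative); every new physiological parameter means another CREATE TABLE:

```sql
-- Option 3 sketch: one table per measurement type (names are illustrative).
CREATE TABLE blood_pressure (
    ts    timestamptz NOT NULL,
    value text NOT NULL        -- e.g. '120/80'
);

CREATE TABLE heart_rate (
    ts    timestamptz NOT NULL,
    value integer NOT NULL
);

-- ...and so on for SpO2, respiration rate, body temperature, cardiac output, ...
```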
Option 4: Store data in JSON columns
| timestamp           | value                   |
|---------------------|-------------------------|
| 2022-10-30T05:29:44 | {"bp":"120/80","hr":60} |
| 2022-10-30T05:29:46 | {"bp":"112/79","hr":70} |
A database such as PostgreSQL has rich querying capabilities for JSONB columns, so queries such as the average blood pressure value in the past month or all heart rates greater than 100 are straightforward. The flip side is that the database only checks that the value is valid JSON; it does not enforce any particular structure, so that validation shifts to the application.
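To make the "straightforward" claim concrete, here is a minimal sketch (the table name is an assumption; the JSON keys follow the example rows above):

```sql
-- Option 4 sketch: one JSONB document per sample (names are illustrative).
CREATE TABLE vitals_json (
    ts    timestamptz NOT NULL,
    value jsonb NOT NULL
);

-- All heart rates above 100 in the past month.
SELECT ts, (value->>'hr')::integer AS heart_rate
FROM vitals_json
WHERE ts >= now() - interval '1 month'
  AND (value->>'hr')::integer > 100;

-- Average heart rate over the past month.
SELECT avg((value->>'hr')::integer) AS avg_hr
FROM vitals_json
WHERE ts >= now() - interval '1 month';
```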
Option 5: EAV Model
| timestamp           | measurement-type | data-type | value |
|---------------------|------------------|-----------|-------|
| 2022-10-30T05:29:44 | systolic         | integer   | 120   |
| 2022-10-30T05:29:44 | diastolic        | integer   | 80    |
| 2022-10-30T05:29:44 | hr               | integer   | 60    |
Though the model is extremely flexible, it results in an explosion of rows. Essentially this turns an RDBMS into a key-value store.
All aggregations and filters now have to consider the data type before querying. (Lots of datatype casts.)
Inserts and queries get complicated.
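A minimal sketch of the EAV layout, showing why the casts are unavoidable (names are illustrative):

```sql
-- Option 5 sketch: one row per (timestamp, measurement type) pair (names are illustrative).
CREATE TABLE observations (
    ts               timestamptz NOT NULL,
    measurement_type text NOT NULL,  -- 'systolic', 'diastolic', 'hr', ...
    data_type        text NOT NULL,  -- 'integer', 'numeric', 'text', ...
    value            text NOT NULL   -- everything is stored as text
);

-- Average systolic pressure in the past month: note the cast needed for any numeric work.
SELECT avg(value::integer) AS avg_systolic
FROM observations
WHERE measurement_type = 'systolic'
  AND data_type = 'integer'
  AND ts >= now() - interval '1 month';
```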
Five different options, each with its own pros and cons, just for storing the data! And I am sure there are other ways that I have not considered.
Pretty soon you will realize that there are other issues to be considered, such as...
Old data is not very useful compared to the most recent data. So do we delete it or archive it?
Who should enforce data retention & purging policies? The database or the application? (A minimal sketch of an application-driven purge follows this list.)
If the physiological parameters are being captured every few seconds, then is there a possibility of ending up with a lot of data too quickly? If so, does the database compress the data?
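For instance, if the application owns the retention policy, the purge could be a periodic job running something like the following (the table name and the 90-day window are assumptions for illustration):

```sql
-- Application-driven retention sketch: purge measurements older than 90 days (illustrative).
DELETE FROM vitals
WHERE ts < now() - interval '90 days';
```

In practice, dropping whole time-based partitions is far cheaper than row-by-row deletes, which is one reason time-series oriented databases offer built-in retention.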
Database Vortex
Databases are a very interesting abstraction. They are designed by brilliant engineers and provide a rock-solid foundation. Mortals like me can accomplish a lot armed with just a basic understanding of SQL and some creative googling. However, once you cross the event horizon, the innards of the database suck you in. Pretty soon you will start devouring information about how the data is laid out on disk, the types of indexes used to retrieve it, whether the database needs to be sharded or replicated, what repeatable reads are, and so on...
How should you go about studying databases then?
The first company I worked for did not allow our machines to be connected to the Internet, so we had to rely on the MSDN documentation on the intranet to solve any technical issues. I would accidentally come across articles like "An In-Depth Look into the Win32 Portable Executable File Format". The writing was exemplary: thought-provoking, deep, comprehensive, and something that could be grokked only after multiple readings. And in the end, I would feel happy. (A feeling that is hard to come by nowadays, when everything is just a Google search away or summarized by Large Language Models.)
For quite some time I have been looking for the database equivalent of Matt Pietrek's articles. For those of us whose day job is not database research, the following could be a map of how to understand databases better.
Just to be clear, I cannot write page-long queries in SQL, nor have I worked on projects where I had to squeeze the last drop of performance from a database. So please keep in mind that the "mental map" below is for a database dilettante.
Mental Map
What is the data model offered by the database? -> Relational, Document Oriented, Key-Value, Graph etc.
How does one interact with the data model? -> SQL abstraction, database-specific querying etc.
What are the durability guarantees for data?
Does the database offer a consistent view of the data, or is it eventually consistent?
What is the on-disk layout? -> B-Tree or LSM tree? Row-based or columnar data layout?
In a clustered setup can applications write/read from any data node? Or should writes always go to a master?
Is the database optimized for analytic queries or run-of-the-mill CRUD use cases?
Resources
I know life is too short to focus on esoteric database knowledge: there is always a pile of Netflix shows to catch up on, distant lands to be visited and conquered as Instagram photos, etc. However, if you are willing to spare one hour, definitely watch the History of Databases talk. I wish I had a cool professor like Andy Pavlo.
If you are inspired after watching the above, go through his Carnegie Mellon database lectures. They are pure gold. The bonus: guest lectures on modern databases such as Redshift and DuckDB, delivered by their original designers.
I have watched only a few lectures from Bruce Momjian. They are very comprehensive given his involvement in the PostgreSQL community. The advantage of studying an open-source database such as PostgreSQL is that one can always go straight to the source and understand what he is talking about. For example, this picture from Bruce's talk can be mapped to the source code without much difficulty.
Not every database is a relational database, and understanding something completely different often helps. For instance, Prometheus, a non-relational database used for monitoring, has to solve problems similar to those of TimescaleDB (which is built on top of PostgreSQL), because both are designed for storing time-series data. The problem of not losing data that is still "in memory" when the database crashes is equally applicable to Redis and Kafka.
Red Book - A collection of papers that describe the ideas & trends in database systems.
A fantastic book that brings all the key ideas in distributed systems together - Designing Data-Intensive Applications.
A very well-written article covering the "on disk" format of SQLite.
I will keep updating the above section as I come across helpful material.