How Databases Store and Retrieve Millions of Records Instantly
Databases use B-tree indexes, buffer pools, and query planners to retrieve records in microseconds. Learn how relational and NoSQL engines actually store and find data.
Finding One Record in a Billion
Amazon's DynamoDB handles over 89 million requests per second at peak, returning results in single-digit milliseconds. PostgreSQL running on commodity hardware can perform a lookup across 100 million rows in under a millisecond. These aren't acts of raw computational power — they're the result of decades of research into how data is physically organized on disk and in memory, and how query engines navigate that organization. The mechanisms behind these numbers reshape what software can do at scale.
How Data Lives on Disk
Storage in relational databases is organized around pages — fixed-size blocks of data, typically 8KB (PostgreSQL) or 16KB (MySQL/InnoDB). All I/O between disk and memory happens at the page granularity, never byte-by-byte. A table's rows are packed into heap pages; as rows are inserted they fill pages sequentially. A page holds a page header, line pointer array (offsets to each row's start), and the actual row data growing from the other end.
This paging architecture mirrors how operating systems manage memory and disk I/O. Fetching one row requires loading its entire page into memory — which is why locating 100 rows scattered across 100 different pages (a full-table scan over sparse results) is far more expensive than fetching 100 contiguous rows from a few pages.
The B-Tree Index: The Core Data Structure
Without indexes, every query requires scanning every row in a table. For a million-row table, that means reading every page on every query — unacceptable for any responsive application. Indexes solve this by creating auxiliary data structures that map column values to row locations.
The dominant index structure in relational databases is the B-tree (Balanced Tree), specifically the B+ tree variant. A B+ tree has these properties:
- All data (row pointers) resides in leaf nodes; internal nodes contain only keys for navigation.
- Leaf nodes are linked, allowing efficient range scans by following the linked list rather than traversing the tree repeatedly.
- The tree is always balanced — all leaf nodes are at the same depth, guaranteeing O(log n) lookup time regardless of data distribution.
- For a table of 1 billion rows with a B-tree of order 500, the tree has roughly 3 levels (log₅₀₀(1,000,000,000) ≈ 3). Three page reads locate any row.
Creating an index on a column inserts every value into this structure along with a pointer (either physical page address or primary key, depending on the engine) to the corresponding row. Each INSERT, UPDATE, or DELETE to an indexed column must also update the B-tree — the performance cost of indexes during writes.
Buffer Pool: Keeping Hot Data in Memory
Disk access is measured in milliseconds; RAM access in nanoseconds — roughly 100,000× faster. Database engines maintain a buffer pool (InnoDB) or shared buffer cache (PostgreSQL) — a region of RAM that caches recently or frequently accessed pages. When a query needs a page, the engine checks the buffer pool first; a cache hit avoids disk I/O entirely.
Buffer pools use eviction policies (typically LRU variants, sometimes with adaptive adjustments) to decide which pages to retain when memory fills. PostgreSQL's default shared_buffers is 128MB, though production configurations typically allocate 25% of total RAM. A well-configured database serving a typical web application achieves buffer pool hit rates of 95–99%, meaning most queries never touch disk.
Query Planning and Optimization
When you submit a SQL query, the database engine doesn't execute it directly. It parses the query into an abstract syntax tree, analyzes it, and runs it through the query planner/optimizer — a component that evaluates multiple possible execution plans and picks the cheapest one based on cost estimates.
| Execution Strategy | When Used | Cost Profile |
|---|---|---|
| Sequential scan | Retrieving large fraction of table | High I/O, predictable |
| Index scan | Fetching few rows by indexed column | Low I/O for sparse results |
| Index-only scan | All needed columns in index | No heap access required |
| Bitmap heap scan | Moderate row count from index | Batch heap access, efficient |
| Hash join | Equi-joins on larger tables | Memory-intensive, fast |
| Merge join | Sorted input, equi or range joins | CPU-efficient, requires sorted data |
The planner uses table statistics — row counts, value distribution histograms, null fractions — stored in system catalogs. ANALYZE (PostgreSQL) or ANALYZE TABLE (MySQL) refreshes these statistics. Outdated statistics cause the planner to choose suboptimal plans, which is a common source of query performance regressions after bulk data loads.
Transactions and ACID Properties
Databases guarantee correctness through ACID properties — Atomicity, Consistency, Isolation, Durability. The mechanisms enforcing these guarantees have direct performance implications:
- Write-Ahead Log (WAL): Every change is first written to a sequential log file before modifying data pages. WAL enables durability (log survives crash) and atomicity (uncommitted changes can be rolled back). Sequential log writes are far faster than random page writes.
- MVCC (Multi-Version Concurrency Control): Rather than locking rows for reads, PostgreSQL and InnoDB maintain multiple versions of rows. Readers see a consistent snapshot without blocking writers. Each row stores transaction IDs indicating visibility; old versions are periodically cleaned up by VACUUM or purge threads.
NoSQL Approaches to the Same Problem
NoSQL databases trade SQL's flexibility for targeted performance gains.
| NoSQL Type | Example | Storage Structure | Strength |
|---|---|---|---|
| Key-Value | Redis, DynamoDB | Hash table / LSM tree | Sub-millisecond single-key lookup |
| Document | MongoDB | B-tree with BSON documents | Flexible schema, nested data |
| Column-family | Cassandra, HBase | LSM tree, SSTable files | Write-heavy workloads, wide rows |
| Graph | Neo4j | Adjacency list with native pointers | Traversing relationships |
LSM trees (Log-Structured Merge Trees), used by Cassandra, RocksDB, and LevelDB, replace B-trees for write-heavy workloads. Writes go first to an in-memory structure (memtable), then flush to immutable sorted files (SSTables). Background compaction merges SSTables and removes deleted records. This converts random writes to sequential writes — dramatically faster on both HDDs and SSDs — at the cost of slower reads that may need to consult multiple SSTables.
The choice between storage engines, index types, and database paradigms ultimately depends on workload shape: read-heavy vs. write-heavy, point lookups vs. range scans, strict consistency vs. eventual consistency. No single architecture excels at everything — which is why most large-scale applications use multiple database systems simultaneously.
Related Articles
software
APIs Explained: How Software Systems Talk to Each Other
Learn what APIs are, how REST, GraphQL, and gRPC work, key concepts like authentication, rate limiting, and versioning, and why APIs are the internet's building blocks.
9 min read
software
How Chess Engines Outthink Human Grandmasters at Every Level
Stockfish evaluates millions of positions per second using minimax and alpha-beta pruning. AlphaZero learned from scratch with neural networks. Here's how engines surpass human play.
9 min read
software
How Fiber Optic Cables Transmit Data at Light Speed
Fiber optic cables carry 99% of international data through hair-thin glass strands using total internal reflection. Explore single-mode vs multi-mode, submarine networks, and WDM technology.
9 min read
software
How Memory Chips Store and Retrieve Information
DRAM uses capacitor cells; NAND flash uses floating gates. Learn how SSDs differ from HDDs, why Moore's Law is slowing, and how 3D NAND stacking keeps storage density growing.
9 min read