Part 3: Scaling & Operations
"A database that performs well at 1,000 users is not the same database at 1,000,000 users. The code is the same. The operations are entirely different."
Overview
Four chapters that cover the full operational lifecycle of a production database — from ensuring high availability through replication, to scaling writes with sharding, to squeezing every millisecond out of your query plans, to recovering from the worst-case scenario:
| # | Chapter | Core Concept | Difficulty | Time |
|---|---|---|---|---|
| 09 | Replication & High Availability | Streaming replication, failover automation, multi-region | Intermediate | 45 min |
| 10 | Sharding & Partitioning | Consistent hashing, shard keys, Vitess/Citus | Advanced | 50 min |
| 11 | Query Optimization & Performance | EXPLAIN ANALYZE, join strategies, connection pooling | Intermediate | 55 min |
| 12 | Backup, Migration & Disaster Recovery | PITR, CDC, zero-downtime migrations, RPO/RTO | Advanced | 50 min |
Key Themes
- Availability is engineered, not assumed — replication and failover must be tested before the outage, not during it
- Sharding is a last resort — partition tables first, shard only when vertical scaling and partitioning are exhausted
- The query plan is the source of truth — EXPLAIN ANALYZE tells you what the database actually does, not what you think it does
- Recovery is a feature — an untested backup is not a backup; disaster recovery drills are mandatory
What You'll Be Able to Do
After completing Part 3 you will be able to:
- Set up PostgreSQL streaming replication with Patroni for automatic failover under 30 seconds
- Choose a shard key that avoids hotspots and enables efficient scatter-gather queries
- Read and interpret a full EXPLAIN ANALYZE output for a multi-table JOIN query
- Implement a zero-downtime schema migration using the expand-contract pattern
- Design a backup strategy that meets specific RPO/RTO SLAs
- Set up a Debezium CDC pipeline from PostgreSQL to Kafka
Prerequisites
This part assumes you have completed:
- Part 1 — Foundations: storage engines, WAL mechanics, transaction isolation
- Part 2 — Engine Deep Dives: PostgreSQL and MySQL internals
Continue to Part 4
After finishing these four chapters, continue to Part 4 — Real-World Design to see how Instagram, Discord, and Uber applied these patterns at scale.

Comments powered by Giscus. Enable GitHub Discussions on the repo to activate.