
Chapter 16: Database Selection Framework

"There is no best database. There is only the right database for your access pattern, consistency requirement, and operational capacity."

Mind Map

Prerequisites

This chapter synthesizes concepts from the entire guide. You should be familiar with the full database taxonomy from Ch01 and consistency models from Ch04. The real-world case studies (Ch13–Ch15) illustrate the consequences of selection decisions made under this framework.

The Decision Framework

Database selection is an architectural decision that is expensive to reverse. A startup that chooses MongoDB in year 1 and needs to migrate to PostgreSQL in year 3 faces weeks of engineering time, significant risk, and possible downtime — exactly the pattern Discord experienced in Chapter 14.

The framework below applies five sequential filters. Stop when you have a clear answer.

Decision Factors: Priority Order

These five factors determine database selection, in priority order. Do not skip a factor that has a clear answer just to reach a preferred conclusion.

Factor 1: Consistency Requirement

| Consistency Need | Implication | Example |
|---|---|---|
| Strong (ACID) | All writes must be immediately visible; no lost updates | Bank balance, inventory count, order state |
| Read-your-writes | A user sees their own writes immediately | Social media profile update |
| Eventual | Writes propagate eventually; brief inconsistency acceptable | Like counts, view counts, follower counts |
| Causal | Operations with causal dependency are ordered correctly | Comment threads, version history |

Consistency is Non-Negotiable for Financial Data

Never choose eventual consistency for data where inconsistency has monetary consequences. A user's account balance must use ACID transactions. "We'll reconcile later" has ended companies. Start with the consistency requirement and let it constrain your options — do not start with a preferred database and rationalize the consistency story.
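As a minimal sketch of why the ACID requirement matters for balances, the transfer below leans on SQLite's transaction semantics: either both updates apply, or neither does. The `accounts` table and the amounts are illustrative, not from the chapter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src: str, dst: str, amount: int) -> bool:
    """Move funds atomically; an insufficient balance rolls everything back."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            cur = conn.execute(
                "UPDATE accounts SET balance = balance - ? "
                "WHERE id = ? AND balance >= ?",
                (amount, src, amount),
            )
            if cur.rowcount == 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

transfer(conn, "alice", "bob", 60)  # succeeds: alice 40, bob 60
transfer(conn, "alice", "bob", 60)  # fails: only 40 left; nothing changes
```

The same guard expressed with eventual consistency could briefly show money in both accounts — exactly the failure mode the box above warns against.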

Factor 2: Read/Write Ratio

| Pattern | Typical Ratio | Recommended Strategy |
|---|---|---|
| Read-heavy (content) | 100:1 reads | Primary + many read replicas; aggressive caching |
| Write-heavy (logging) | 1:100 reads | LSM store (Cassandra, ScyllaDB); avoid B-tree |
| Balanced (OLTP) | 10:1 reads | PostgreSQL or MySQL; PgBouncer pooling |
| Mixed OLTP + OLAP | Complex | Separate OLTP and OLAP stores; CDC pipeline |

Instagram's feed (100:1 read ratio) drove their aggressive denormalization and replica strategy, covered in Chapter 13. Discord's messages (heavily write-biased, append-only) drove the move to an LSM-based store in Chapter 14.

Factor 3: Data Model

| Data Model | Best Fit | Do Not Use When |
|---|---|---|
| Relational | Structured, normalized, JOIN-heavy | Data has no natural foreign-key relationships |
| Document | Nested, variable schema, entity-centric | You need to query across entity boundaries |
| Wide-column | Append-heavy, high-cardinality partition key | You need random updates or row-level locking |
| Key-Value | O(1) lookup by known key, no range queries | You need to query by value or range scan |
| Time-Series | Monotonically increasing timestamp, same schema | Schema varies across measurements |
| Graph | Relationship traversal 3+ hops deep | Most queries access by node ID, not edge traversal |
| Vector | Semantic similarity, embedding-based retrieval | You need exact match, not approximate |

Factor 4: Scale Projection

Scale projections should be 12-month forecasts, not 10-year fantasies. Over-engineering for scale that never arrives wastes engineering effort and adds operational complexity that slows development.

The most common mistake: choosing Tier 3 or Tier 4 infrastructure for a Tier 1 workload. A startup processing 1,000 writes/day does not need Cassandra. A single PostgreSQL instance on a $100/month server handles this trivially and offers better ACID guarantees.

Factor 5: Operational Burden

| Factor | Managed Cloud | Self-Hosted |
|---|---|---|
| Setup time | Minutes | Days–weeks |
| Backup management | Automatic | Manual or scripted |
| Failover | Automatic | Manual + runbooks |
| Version upgrades | Managed | Manual, disruptive |
| Monthly cost | Higher (2–4×) | Lower |
| Control | Limited | Full |
| Team expertise needed | Low | High |

For teams with fewer than 5 engineers, managed databases (RDS, Cloud SQL, PlanetScale, Neon) almost always win on total engineering cost, even though the monthly bill is higher than self-hosted.

Cost Comparison

Monthly cost estimates for a production workload: 1TB of data, 1,000 writes/sec, 10,000 reads/sec, single region.

| Option | Monthly Cost | Hidden Costs | TCO Notes |
|---|---|---|---|
| RDS PostgreSQL (db.r6g.2xlarge) | ~$700/mo | Read replica +$700, backups +$100, data transfer | Predictable; managed failover included |
| Aurora PostgreSQL (Serverless v2) | ~$400–1,200/mo | Scales with ACU usage; I/O costs add up | Good for variable workloads; Aurora Global ~2× |
| DynamoDB (on-demand) | ~$500–1,500/mo | Read/write capacity units add up with variable load; GSI doubles cost | Ops-free; cost unpredictable at scale |
| CockroachDB Cloud (Dedicated) | ~$1,500/mo | Egress, cross-region replication | Distributed SQL; 3-node minimum |
| Cassandra self-hosted (3×c5.2xlarge) | ~$600/mo (EC2 only) | Ops time: 8–16h/month; expertise; monitoring | Cheapest at scale; high operational tax |
| PlanetScale (Business) | ~$400/mo | Branches free; connection limits on lower tiers | MySQL + Vitess; zero-downtime migrations |
| Neon (Pro) | ~$200/mo | Compute scales with usage; only PostgreSQL-compatible | Serverless PG; great for dev/test, growing for prod |

The Hidden Cost of Self-Hosted

Self-hosted Cassandra at 3 nodes costs ~$600/month in EC2. But a senior engineer costs ~$200/hour. If Cassandra consumes 10 hours/month of SRE time (upgrades, tuning, incidents), the real cost is $600 + $2,000 = $2,600/month — more than CockroachDB Cloud's managed offering. Calculate Total Cost of Ownership, not just infrastructure cost.
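The arithmetic above generalizes into a one-line TCO helper. The $200/hour rate and the ops-hour figures are the chapter's own assumptions:

```python
def monthly_tco(infra_cost: float, ops_hours: float, hourly_rate: float = 200.0) -> float:
    """Total cost of ownership: infrastructure plus engineering time."""
    return infra_cost + ops_hours * hourly_rate

# The chapter's example: self-hosted Cassandra vs. a managed offering.
# (Ops hours for the managed option are an assumed near-zero figure.)
self_hosted = monthly_tco(600, ops_hours=10)   # 600 + 2000 = 2600
managed = monthly_tco(1500, ops_hours=0.5)     # 1500 + 100 = 1600
```

Running the comparison makes the conclusion concrete: the managed option's higher sticker price is cheaper once engineering time is priced in.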

Polyglot Persistence Patterns

When to Use Multiple Databases

Polyglot persistence — using different databases for different workloads — is the right architecture when:

  1. A workload has fundamentally different access patterns (OLTP vs. OLAP)
  2. A workload requires capabilities absent from the primary database (full-text search, vector search)
  3. A workload has different consistency requirements (ACID for transactions, eventual for caches)

Common Production Combinations

| Use Case | Primary DB | Add-On DB | Reason |
|---|---|---|---|
| E-commerce | PostgreSQL | Redis + Elasticsearch | Cache product pages; search catalog |
| Messaging | Cassandra or ScyllaDB | Redis + PostgreSQL | LSM for messages; PG for user accounts |
| Ride-sharing | MySQL/Docstore | Redis + ClickHouse | Location cache; trip analytics |
| SaaS | PostgreSQL | Redis + pgvector | Cache sessions; AI semantic search |
| Gaming | Redis | PostgreSQL | Leaderboards in Redis; player data in PG |

Anti-Pattern: Different DB Per Microservice

One dangerous pattern is assigning a different database technology to each microservice, justified by "each service should own its data store."

This pattern creates operational overhead proportional to the number of database technologies. Each requires separate expertise, monitoring, backup procedures, and incident runbooks. The result: a 6-person engineering team operating 6 different database systems, spending more time on database operations than on product features.

Rule of thumb: Until you have dedicated DBAs or SREs, use at most 2–3 database technologies: one OLTP store (PostgreSQL), one cache (Redis), and optionally one search engine (Elasticsearch or Typesense) or analytics store (ClickHouse).

Migration Strategy Matrix

When to Migrate

| Symptom | Likely Cause | Migration Target |
|---|---|---|
| P99 read latency > 500ms on simple queries | Index not in RAM; connection saturation | Add read replicas; add PgBouncer |
| Write throughput ceiling (~10K writes/sec) | Single primary bottleneck | Sharding or LSM-based store |
| Schema changes take hours | Large table ALTER TABLE | PlanetScale; pt-online-schema-change |
| ACID not required; writes dominate | Wrong data model | Cassandra / ScyllaDB |
| Need full-text search on large text | PostgreSQL tsvector not sufficient | Elasticsearch sidecar |
| Joins across millions of rows are slow | Denormalization needed or wrong DB | Denormalize schema; consider document store |

Never Migrate for the Wrong Reason

Common wrong reasons: "NoSQL is more scalable" (not always true; PostgreSQL scales to 100M+ rows), "MongoDB is more flexible" (JSONB in PostgreSQL offers equal flexibility), "everyone is using Cassandra" (without matching write volume). Always diagnose the specific bottleneck before choosing a migration target.

Migration Patterns

Three patterns cover most database migrations:

| Pattern | Best For | Risk | Downtime |
|---|---|---|---|
| CDC | Live databases; any size | Replication lag during cut-over | Near-zero |
| Dual-Write | Small-to-medium; when CDC unavailable | Write amplification; consistency window | Near-zero |
| Blue-Green | Large databases; full-fidelity copy | Long preparation time | Near-zero |
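The dual-write pattern can be sketched as a thin wrapper over two stores: every write goes to both, and reads prefer the new store, falling back to the old one until the backfill completes. The `get`/`put` interface below is illustrative, not a specific driver API.

```python
class DictStore:
    """In-memory stand-in for a real database client (hypothetical)."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


class DualWriteStore:
    """Dual-write migration sketch: writes fan out to both stores,
    reads try the new store first and fall back to the old one for
    rows the backfill has not copied yet."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def put(self, key, value):
        self.old.put(key, value)  # old store stays authoritative until cut-over
        self.new.put(key, value)  # new store receives every write too

    def get(self, key):
        value = self.new.get(key)
        if value is None:         # not yet backfilled: fall back to the old store
            value = self.old.get(key)
        return value
```

Cut-over then amounts to dropping the fallback and the write to the old store once validation passes.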

From → To Decision Table

| From | To | Why | Pattern |
|---|---|---|---|
| MongoDB | PostgreSQL | ACID needed; joins needed | CDC (Debezium) |
| MongoDB | Cassandra | Write volume exceeds MongoDB capacity | Dual-write + backfill |
| PostgreSQL (single) | PostgreSQL (sharded) | Write ceiling hit | Dual-write; shard by domain key |
| MySQL | PostgreSQL | JSON, JSONB, arrays; PostGIS needed | CDC (AWS DMS or pglogical) |
| Cassandra 3.x | ScyllaDB | JVM GC pauses; reduce node count | CQL-compatible; dual-write |
| Self-hosted | Cloud managed | Team too small for ops burden | Backup restore into managed service |
| Separate OLTP + OLAP | PostgreSQL + ClickHouse | Analytics degrading OLTP performance | CDC from OLTP → ClickHouse |

Future Trends

These four trends have concrete engineering implications for database selection decisions you are making today. Each trend includes specific adoption thresholds so you can assess whether it applies to your workload.

Trend 1: PostgreSQL Extensions Replace 80% of Specialized DBs

By 2025, PostgreSQL with extensions handles use cases that previously required separate databases:

| Extension | Replaces | Limitation |
|---|---|---|
| pgvector + HNSW index | Pinecone, Weaviate (< 50M vectors) | ANN recall degrades above ~5–10M vectors (as of pgvector 0.7, 2024); purpose-built indexes (Pinecone, Qdrant) maintain recall at 50M+ — see ANN Benchmarks |
| TimescaleDB | InfluxDB (< 50GB/day ingest) | InfluxDB 3.0 (Apache Arrow engine, 2024) benchmarks higher at sustained >1M points/sec ingest |
| PostGIS | Dedicated geo DBs | Uber-scale real-time location still needs H3 + custom |
| pg_search (ParadeDB) | Elasticsearch (< 100M documents) | Elasticsearch still wins for complex relevance scoring at scale (100M+ docs, as of 2024); ParadeDB is pre-1.0 |
| Citus | Custom sharding middleware | CockroachDB has better distributed transaction story |

Benchmarks Last Verified: Early 2025

The thresholds above (pgvector recall at >5M vectors, pg_search at >100M documents, TimescaleDB ingest rates) are approximate crossover points observed in benchmarks as of early 2025. These numbers shift with every release — pgvector 0.8 added parallel index builds, ParadeDB is rapidly improving. Always benchmark your specific workload before making a database selection based on published thresholds.

The implication: start with PostgreSQL + extensions. Only migrate to a specialized system when you hit a measurable limit.

Trend 2: Serverless Databases

Serverless databases scale to zero when idle and scale instantly under load. Relevant for development environments, variable-traffic applications, and startups:

| Product | Based On | Key Feature |
|---|---|---|
| Neon | PostgreSQL | Branching (copy-on-write database branches for dev/test) |
| PlanetScale | MySQL + Vitess | Schema branching; zero-downtime migrations |
| Aurora Serverless v2 | MySQL/PostgreSQL | 0.5 ACU minimum; per-second billing |
| Turso | libSQL (SQLite fork) | Edge-distributed; 10ms reads from 35+ edge locations |
| Cloudflare D1 | SQLite | Edge SQLite in Cloudflare Workers; zero ops |

Serverless is Not for All Workloads

Serverless databases that scale to zero have cold-start latency (100–500ms for the first query after idle). For production APIs with consistent traffic, a reserved-instance database is cheaper and faster. Serverless shines for: staging environments, low-traffic microservices, development databases, and variable-traffic multi-tenant SaaS.

Trend 3: AI-Native Vector Storage

Every major database is adding vector support in 2024–2025:

| Database | Vector Support | Index Type | Notes |
|---|---|---|---|
| PostgreSQL + pgvector | Native extension | HNSW, IVFFlat | Best for < 5–10M vectors with ACID (as of 0.7/0.8, 2024) |
| MongoDB Atlas | Atlas Vector Search | HNSW | Combines document + vector in one system |
| Cassandra | DataStax Astra | SAI (Storage-Attached Index) | Cassandra-scale vector search |
| Redis | Redis Vector Similarity | HNSW, FLAT | In-memory; low latency |
| DynamoDB | Not native | N/A | Requires separate vector layer |

The trend: vector search is becoming a primitive in every database, not a specialized system. For new applications, start with PostgreSQL + pgvector unless you project > 5–10 million embeddings with < 100ms p99 ANN latency.

Trend 4: Edge Databases

Applications running at the network edge (CDN PoPs, Cloudflare Workers, Fastly Compute) need databases that can respond locally.

Edge databases accept eventual consistency (reads from local replica may lag primary by 100–500ms) in exchange for sub-10ms read latency globally.

Case Study: Database Selection from Startup to Scale

This case study applies every framework principle to a single application evolving from MVP to 10M users.

Application: "EventFlow" — an event ticketing platform where users discover events, purchase tickets, and receive real-time venue updates.

MVP Phase (0–10K users)

Requirement: Manage events, tickets, users, and payments. Consistency is critical (ticket oversell = product failure). Team: 3 engineers.

Decision: PostgreSQL on RDS (db.t4g.medium, $50/month).

```sql
-- EventFlow MVP schema (single PostgreSQL database)
CREATE TABLE events (
    id           BIGSERIAL PRIMARY KEY,
    name         TEXT NOT NULL,
    venue_id     BIGINT REFERENCES venues(id),
    start_time   TIMESTAMPTZ NOT NULL,
    capacity     INT NOT NULL,
    tickets_sold INT NOT NULL DEFAULT 0
);

-- Prevent oversell with optimistic locking
UPDATE events
SET tickets_sold = tickets_sold + 1
WHERE id = ? AND tickets_sold < capacity
RETURNING id;
-- If 0 rows updated: sold out
```

No Redis, no Elasticsearch, no message queue. YAGNI.
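The oversell guard can be exercised end to end. The sketch below uses SQLite in place of PostgreSQL (checking `cursor.rowcount` instead of `RETURNING`), with an illustrative capacity of 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "id INTEGER PRIMARY KEY, capacity INTEGER, tickets_sold INTEGER DEFAULT 0)"
)
conn.execute("INSERT INTO events (id, capacity) VALUES (1, 2)")

def buy_ticket(conn, event_id: int) -> bool:
    # Same guard as the UPDATE above: succeed only while seats remain.
    cur = conn.execute(
        "UPDATE events SET tickets_sold = tickets_sold + 1 "
        "WHERE id = ? AND tickets_sold < capacity",
        (event_id,),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated means sold out

results = [buy_ticket(conn, 1) for _ in range(3)]  # third attempt is sold out
```

Because the capacity check and the increment happen in one statement, concurrent buyers can never push `tickets_sold` past `capacity`.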

Growth Phase (10K–500K users)

Symptom: Event pages load slowly during high-traffic launches. Database CPU spikes to 90% on event release day.

Decision: Add Redis for hot event data and a read replica for search.

```python
# Cache hot event data with a 60-second TTL
def get_event(event_id: int) -> Event:
    cached = redis.get(f"event:{event_id}")
    if cached:
        return Event.from_json(cached)

    # Cache miss: read from PostgreSQL, then populate the cache
    event = db.query("SELECT * FROM events WHERE id = %s", [event_id])
    redis.setex(f"event:{event_id}", 60, event.to_json())
    return event
```

Also add Elasticsearch for event discovery (search by name, location, category). Sync via Debezium CDC from PostgreSQL.
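A sketch of that CDC consumer's core transform: Debezium wraps each change in an envelope carrying an `op` code and `before`/`after` row images, which a consumer translates into search-index operations. The simplified event shape and the `to_search_action` helper below are illustrative, not a real connector payload.

```python
def to_search_action(change: dict) -> dict:
    """Translate a (simplified) Debezium change event into a search-index
    action. Real events carry more metadata (source, ts_ms, schema)."""
    op = change["op"]             # "c"=create, "u"=update, "d"=delete
    if op in ("c", "u", "r"):     # "r" = snapshot read during initial sync
        doc = change["after"]
        return {"action": "index", "id": doc["id"], "doc": doc}
    if op == "d":
        return {"action": "delete", "id": change["before"]["id"]}
    raise ValueError(f"unknown op: {op}")

event = {"op": "c", "after": {"id": 7, "name": "Jazz Night", "city": "Austin"}}
action = to_search_action(event)  # an "index" action for document 7
```

Because the search index trails the CDC stream, event discovery is eventually consistent — acceptable here, unlike ticket counts.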

Scale Phase (500K–10M users)

Symptom: Ticket purchase writes during major events (10K tickets/second during Taylor Swift release) overwhelm the single PostgreSQL primary.

Decision: Shard tickets table by event_id. Events table remains on unsharded PostgreSQL (it is read-mostly and fits on one machine).
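Routing by event_id can use a stable hash so the mapping does not depend on Python's per-process `hash()` randomization. The shard count below is illustrative.

```python
import hashlib

NUM_SHARDS = 8  # illustrative shard count

def shard_for_event(event_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Stable shard routing: hash the key with SHA-256 so every process
    maps the same event to the same shard."""
    digest = hashlib.sha256(str(event_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# All tickets for one event land on one shard, so the oversell guard
# still runs against a single primary for that event.
```

Keeping the hot path (one event's purchases) on a single shard preserves the single-statement oversell check from the MVP phase.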

Final architecture at 10M users:

| Component | Database | Purpose |
|---|---|---|
| User accounts, events | PostgreSQL (single) | ACID, normalized |
| Ticket purchases | PostgreSQL (sharded × 8 by event_id) | Write throughput |
| Session cache | Redis | Sub-ms session lookup |
| Event discovery | Elasticsearch | Full-text + geo search |
| Analytics | ClickHouse | Business intelligence |

Key insight: The architecture grew incrementally to match actual load. Starting with the Scale Phase architecture at MVP would have taken 3× longer to build and introduced operational complexity that slowed the team down.

Summary Checklist

Before finalizing any database selection decision:

  • [ ] What is the exact access pattern? (key lookup, range scan, full-text, geospatial, graph traversal)
  • [ ] What consistency level is required? (ACID, read-your-writes, eventual)
  • [ ] What is the 12-month write volume projection?
  • [ ] What is the read/write ratio?
  • [ ] Does the team have operational expertise for this database?
  • [ ] Is a managed service available that reduces operational burden?
  • [ ] Have you tried to solve the problem with PostgreSQL + extensions first?
  • [ ] If choosing a specialized database, what is the migration path back if requirements change?

Common Mistakes

| Mistake | Why It Happens | Impact | Fix |
|---|---|---|---|
| Choosing technology before defining access patterns | Engineers have database preferences; requirements feel vague early on | Schema, indexes, and sharding strategy all become wrong for the actual query shape | Write out the top 5 queries and their expected QPS before evaluating any database |
| Adding databases to the stack without consolidation analysis | Specialized DBs feel exciting; each team picks their own | Polyglot sprawl — operational burden grows faster than the benefit, cross-DB consistency impossible | Apply the consolidation checklist: can PostgreSQL + an extension solve this? Only add a DB if the answer is clearly no |
| Treating the framework as a one-time decision | Selection happens at project start and is never revisited | Database is kept past its useful life even after requirements shift | Re-evaluate the top decision factor every 6 months or after a major scale change |
| Underestimating operational burden of self-hosted distributed databases | Benchmarks and demos show only the happy path | Cassandra or CockroachDB self-hosted requires dedicated expertise; teams without it face silent data loss and unavailability | Use managed services unless the team has already operated the database in production |
Related Chapters

| Chapter | Relevance |
|---|---|
| Ch01 — Database Landscape | The full taxonomy of database categories this framework navigates |
| Ch04 — Transactions & Concurrency | Consistency models that drive the first decision factor |
| Ch09 — Replication & High Availability | Replication strategies for the growth phase |
| Ch10 — Sharding & Partitioning | Sharding options for the scale phase |
| Ch13 — Instagram: PostgreSQL at Scale | PostgreSQL sharding in production |
| Ch14 — Discord: Data Layer Evolution | Case study of a migration driven by wrong initial choice |
| Ch15 — Uber: Geospatial Design | Custom database engineering for geospatial requirements |

Quick Reference Card

A one-page summary of database selection criteria across all categories covered in this guide.

Database Categories at a Glance

| Category | Examples | Consistency | Max Write QPS (single node) | Query Flexibility | Operational Complexity | Best For |
|---|---|---|---|---|---|---|
| RDBMS | PostgreSQL, MySQL | Strong (ACID) | 10K–100K† | Any SQL query, JOINs | Medium | General-purpose, OLTP |
| Document | MongoDB, CouchDB | Configurable | 20K–50K | Flexible JSON queries | Low–Medium | Variable schema, rapid dev |
| Key-Value | Redis, DynamoDB | Eventual/Strong | 100K+ (Redis in-memory) | GET/SET by key only | Low | Caching, sessions, counters |
| Wide-Column | Cassandra, ScyllaDB | Tunable | 50K–500K | Partition key required | High | Write-heavy, time-series |
| Graph | Neo4j, TigerGraph | Strong | 5K–20K | Traversal queries | Medium | Social graphs, fraud, recommendations |
| Time-Series | ClickHouse, TimescaleDB | Strong/Eventual | 1M–4M (ClickHouse) | SQL + time functions | Medium–High | Metrics, IoT, analytics |
| Search | Elasticsearch, Meilisearch | Near real-time | 10K–50K (indexing) | Full-text, faceted | High (Elasticsearch) | Full-text search, logs |
| NewSQL | CockroachDB, TiDB, Spanner | Distributed ACID | 10K–50K per node | Full SQL | High | Global ACID at scale |
| Vector | pgvector, Pinecone, Qdrant | Eventual | Varies‡ | ANN similarity | Low (managed) | Semantic search, RAG, ML |
†Highly workload-dependent. See Ch01 for nuance. ‡Vector DB QPS depends on index type, dimensions, and recall target. pgvector HNSW at 768 dimensions: ~500–2,000 QPS per core (as of 0.7, 2024).

Decision Cheat Sheet

| Question | If Yes → | If No → |
|---|---|---|
| Need ACID transactions? | RDBMS or NewSQL | Consider NoSQL |
| Need JOINs across entities? | RDBMS | Document or Key-Value |
| Write-heavy (>50K writes/sec)? | Wide-Column or Time-Series | RDBMS handles it |
| Need global distribution? | NewSQL (CockroachDB, Spanner) | Single-region RDBMS |
| Schema changes frequently? | Document DB or RDBMS + JSONB | Strict RDBMS schema |
| Need full-text search? | Elasticsearch (large scale) or PostgreSQL GIN (moderate) | Skip search engine |
| Need vector similarity? | pgvector (<5M) or dedicated vector DB (>5M) — thresholds shift with releases | Not applicable |
| Team < 5 engineers? | Managed service (RDS, Atlas, DynamoDB) | Self-hosted is viable |
| Budget-constrained? | PostgreSQL (free, extensible) | Evaluate managed options |

Cost Quick Reference (1TB, 1K writes/sec, single region)

| Option | Monthly Cost | Ops Burden |
|---|---|---|
| RDS PostgreSQL (db.r6g.xlarge) | ~$300–500 | Low |
| Aurora PostgreSQL Serverless v2 | ~$400–1,200 | Very Low |
| DynamoDB on-demand | ~$500–1,500 | None |
| CockroachDB Cloud | ~$800–1,500 | Very Low |
| Self-hosted Cassandra (3 nodes) | ~$600 + ops time | High |

Practice Questions

Beginner

  1. Framework Application: A startup is building a recipe sharing app. Users can search recipes by ingredient, cuisine, and dietary restriction. Recipes have a flexible set of tags. The team has 2 engineers and expects 50,000 active users in year 1. Walk through the decision framework and choose a database (or combination). Justify each step.

    Model Answer: Step 1 (Data model): Recipes are structured but have flexible tags — JSONB in PostgreSQL handles this. Step 2 (Consistency): Saving a recipe must be consistent, but like counts can be eventual. Step 3 (Scale): 50K users is Tier 1 — a single PostgreSQL instance easily handles this. Step 4 (Ops): 2 engineers should use managed RDS. Result: PostgreSQL on RDS, with a tsvector index or pg_search for ingredient search. No separate Elasticsearch — add it only if search performance becomes a measurable problem.
  2. Cost Trade-off: A 3-engineer startup uses self-hosted Cassandra (3 nodes on EC2) at $600/month. The team spends 12 hours/month on Cassandra operations (upgrades, incidents, tuning). Their senior engineers bill at $200/hour. What is the true monthly TCO? How does this compare to CockroachDB Cloud at $1,500/month?

    Model Answer: TCO = $600 (infrastructure) + 12 × $200 (engineering time) = $600 + $2,400 = $3,000/month. CockroachDB Cloud at $1,500/month is 50% cheaper in true TCO and frees 12 engineering hours for product development. The infrastructure line item is misleading; always include operations time.

Intermediate

  1. Polyglot Decision: An e-commerce team runs PostgreSQL for orders and user data. They want to add: (a) product search with faceting and typo-tolerance, (b) session management for 1M concurrent users, (c) real-time inventory counts visible across all users within 1 second. Design a polyglot architecture. For each addition, choose a database and justify. What consistency model does each use?

    Model Answer: (a) Elasticsearch or Typesense for product search — synced via CDC from PostgreSQL. Eventual consistency (search index lags writes by 100–500ms). (b) Redis for sessions — in-memory key-value at sub-millisecond latency. Eventual (Redis replication is asynchronous). (c) Inventory is trickier — it requires strong consistency to prevent oversell. Keep inventory in PostgreSQL with row-level locking or optimistic concurrency (`WHERE inventory > 0`). Cache the display count in Redis with a short TTL (5–10 seconds) — users see "available" while the authoritative count lives in PostgreSQL.
  2. Migration Planning: A team runs a MongoDB-backed application with 500M documents in a messages collection. They need to migrate to Cassandra for write throughput. Design the dual-write migration: what does the application code change look like during migration? How do you handle documents that exist in MongoDB but haven't been migrated to Cassandra yet? How do you validate completeness?

    Model Answer: Phase 1 (dual-write): Application writes to both MongoDB (old) and Cassandra (new). Read from Cassandra first; fall back to MongoDB if missing. Phase 2 (backfill): A background job reads all MongoDB documents in chunks and writes them to Cassandra, tracking progress with a cursor on `_id`. Phase 3 (validate): Sample 1% of documents and compare checksums in both stores. Alert if mismatch > 0.01%. Phase 4 (cut-over): Stop writing to MongoDB. Phase 5 (decommission): After 2 weeks of successful operation, delete the MongoDB collection. Edge case: updates that arrive during the backfill are safe — dual-write keeps both stores current, so the backfill can overwrite Cassandra with MongoDB's copy without losing the latest version.

Advanced

  1. Comprehensive Architecture: Design the full database architecture for a global ride-sharing startup planning to launch in 5 cities, targeting 500K rides in year 1 and 10M rides in year 3. Include: user accounts, driver accounts, trip records, real-time driver locations, analytics, and event notifications. For each data type: choose a database, justify the choice, identify the partition/shard key, and describe the consistency model. Show how data flows between the databases.

    Model Answer: Year 1 (MVP): Single PostgreSQL for user/driver/trip data (ACID required). Redis for driver locations (key = driver_id, TTL = 8s). No analytics DB yet — query PostgreSQL with a read replica. Year 3 (Scale): Add H3-based location lookup in Redis (partition by H3 cell). Shard the trips table by city_id (5 shards). Add ClickHouse for analytics (CDC from PostgreSQL shards via Debezium + Kafka). Add PostgreSQL read replicas per region for driver/user profile reads. Data flow: App → PostgreSQL (OLTP) → Kafka (CDC) → ClickHouse (OLAP) + Elasticsearch (search). Driver pings → Kafka (location.updates) → Redis (H3 cells, TTL 8s). Key insight: the year 1 and year 3 architectures differ significantly — don't build the year 3 architecture in year 1. The transition points are measurable (specific latency thresholds, QPS ceilings), not arbitrary timelines.
