
Chapter 5: PostgreSQL in Production

"PostgreSQL is not a database. It is a platform that happens to ship with a world-class relational engine."


Prerequisites

This chapter assumes you understand WAL mechanics (Ch01) and transaction isolation (Ch04). Review those first if needed.


Architecture Internals

Process Model

PostgreSQL uses a process-per-connection model rather than threads. When a client connects, the postmaster forks a new backend process dedicated to that connection. This architecture provides strong isolation — a crashing backend cannot corrupt another backend's memory — but creates overhead at high connection counts.

Shared Memory Layout

At startup, PostgreSQL allocates a single large shared memory segment that all backend processes read and write:

| Region | Purpose | Key Parameter |
| --- | --- | --- |
| Shared Buffers | Page cache for data files | `shared_buffers` (25% of RAM typical) |
| WAL Buffers | In-memory WAL before fsync | `wal_buffers` (auto: 1/32 of shared_buffers, capped at 16MB) |
| Lock Tables | Tracks all held and awaited locks | `max_locks_per_transaction` |
| Proc Array | Running transactions, XIDs, snapshots | `max_connections` |
| CLOG | Transaction commit/abort status | Managed automatically |
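The auto-sizing rule for `wal_buffers` in the table above can be made concrete. This is a back-of-envelope sketch, not PostgreSQL's actual C code; it encodes the documented default of 1/32 of `shared_buffers`, clamped between 64kB and one 16MB WAL segment:

```python
def auto_wal_buffers_mb(shared_buffers_mb: float) -> float:
    """Approximate the default wal_buffers (-1) sizing rule:
    1/32 of shared_buffers, clamped to [64kB, 16MB]."""
    raw = shared_buffers_mb / 32
    return min(max(raw, 0.0625), 16.0)

# A 16GB shared_buffers setting hits the 16MB cap:
print(auto_wal_buffers_mb(16 * 1024))   # 16.0
# A small 128MB instance gets 4MB:
print(auto_wal_buffers_mb(128))         # 4.0
```

In practice this means any server with `shared_buffers` of 512MB or more runs at the 16MB ceiling unless `wal_buffers` is raised explicitly.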

Shared Buffers and the OS Page Cache

PostgreSQL operates a double-buffering system: data pages live in shared_buffers (PostgreSQL's own cache), and the same data also lives in the OS page cache. effective_cache_size tells the planner how much OS cache is available for index scans — it does not allocate memory, it only influences plan choices.
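The sizing convention implied here — `shared_buffers` at roughly 25% of RAM, `effective_cache_size` at roughly 75% — can be expressed as a tiny helper. This is a rule-of-thumb sketch, not an official formula:

```python
def memory_split_gb(ram_gb: int) -> dict:
    """Rule-of-thumb split: 25% to shared_buffers (actually allocated),
    75% reported as effective_cache_size (planner hint only)."""
    return {
        "shared_buffers_gb": ram_gb // 4,
        "effective_cache_size_gb": ram_gb * 3 // 4,
    }

print(memory_split_gb(64))  # {'shared_buffers_gb': 16, 'effective_cache_size_gb': 48}
```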

Version Note

PostgreSQL examples verified against PostgreSQL 16/17. Autovacuum defaults and some pg_stat_* views changed in PostgreSQL 17 — check the release notes for your version.


Query Execution Pipeline

Every SQL statement passes through four stages before returning results: the parser (raw SQL to parse tree), the rewriter (applies rules and expands views), the planner/optimizer (chooses the cheapest execution plan), and the executor (runs the plan and produces rows).

The Planner's Cost Model

The planner estimates the cost of every candidate execution plan using row count estimates from statistics (pg_statistic) and cost constants (seq_page_cost, random_page_cost, cpu_tuple_cost). Stale statistics are the most common cause of bad query plans.

```sql
-- Force statistics update on a high-churn table
ANALYZE orders;

-- Check when statistics were last collected
-- (pg_stat_user_tables exposes relname, not tablename)
SELECT schemaname, relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Inspect the plan the planner chose
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders
WHERE user_id = 42 AND created_at > NOW() - INTERVAL '30 days';
```

Stale Statistics Kill Plans

After a bulk load of millions of rows, run ANALYZE explicitly. Autovacuum's analyze threshold (autovacuum_analyze_scale_factor = 0.1 by default, plus a 50-row base threshold) means autoanalyze fires only after roughly 10% of the table changes — and even then only on the next autovacuum worker cycle. Queries issued immediately after a bulk load can therefore plan against badly stale statistics.
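The trigger condition can be sketched directly. PostgreSQL fires autoanalyze when rows changed since the last analyze exceed `autovacuum_analyze_threshold + autovacuum_analyze_scale_factor × reltuples`; the helper below assumes the stock defaults (50 and 0.1):

```python
def autoanalyze_trigger_rows(reltuples: int,
                             scale_factor: float = 0.1,
                             base_threshold: int = 50) -> int:
    """Rows that must change since the last analyze before
    autoanalyze fires, using default GUC values."""
    return int(base_threshold + scale_factor * reltuples)

# For a 6M-row table, ~600K rows must change before the next autoanalyze:
print(autoanalyze_trigger_rows(6_000_000))  # 600050
```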


WAL Deep Dive

Segment Lifecycle

WAL is written as a stream of variable-length records into 16MB segment files under pg_wal/. Each record describes a single page-level change.

WAL Levels

| Level | What Gets Logged | Use Case |
| --- | --- | --- |
| minimal | Minimum for crash recovery | Standalone server, no replicas |
| replica | + streaming replication data | Primary with hot standbys |
| logical | + full row images for decoding | CDC, Debezium, logical replication |

WAL Level and Performance

wal_level = logical produces WAL roughly 20–40% larger than replica on write-heavy workloads because it additionally logs old-row information for updates and deletes (how much depends on the table's REPLICA IDENTITY setting). Only set wal_level = logical if you need CDC or logical replication.

Checkpoint Tuning

A checkpoint flushes all dirty pages from shared_buffers to data files, creating a recovery point. Frequent checkpoints cause I/O spikes; infrequent checkpoints increase crash recovery time.

```ini
# postgresql.conf — checkpoint tuning
max_wal_size = 4GB                   # WAL accumulates up to this before a forced checkpoint
min_wal_size = 1GB                   # Keep at least this much WAL pre-allocated
checkpoint_completion_target = 0.9   # Spread checkpoint I/O over 90% of the interval
checkpoint_timeout = 15min           # Maximum time between checkpoints
```
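To reason about whether max_wal_size or checkpoint_timeout triggers the next checkpoint, a rough model helps. This is a simplification — PostgreSQL actually starts checkpointing early enough to finish within checkpoint_completion_target — but it captures which limit binds first:

```python
def checkpoint_driver(wal_mb_per_s: float,
                      max_wal_size_mb: int,
                      checkpoint_timeout_s: int) -> tuple:
    """Which limit forces the next checkpoint, and after how many seconds."""
    fill_s = max_wal_size_mb / wal_mb_per_s
    if fill_s < checkpoint_timeout_s:
        return ("max_wal_size", fill_s)
    return ("checkpoint_timeout", float(checkpoint_timeout_s))

# 10MB/s of WAL against max_wal_size=4GB and checkpoint_timeout=15min:
print(checkpoint_driver(10, 4096, 900))  # ('max_wal_size', 409.6)
```

Here checkpoints fire every ~7 minutes on WAL volume alone — a sign that max_wal_size should be raised if I/O spikes are a problem.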

VACUUM and Autovacuum

Why VACUUM Exists

PostgreSQL's MVCC implementation never deletes rows in-place. An UPDATE inserts a new row version and marks the old one as dead. A DELETE only marks the row as dead. VACUUM physically reclaims space from dead tuples.

Autovacuum Tuning

Default autovacuum is conservative. High-traffic tables need per-table overrides:

```sql
-- Per-table autovacuum tuning for a high-churn orders table
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.01,   -- Trigger at 1% dead tuples (default 20%)
  autovacuum_analyze_scale_factor = 0.005, -- Analyze at 0.5% changes (default 10%)
  autovacuum_vacuum_cost_delay = 1,        -- Less throttling (default 2ms since PG 12)
  autovacuum_vacuum_cost_limit = 800       -- More I/O budget (default 200)
);
```

| Parameter | Default | Impact |
| --- | --- | --- |
| autovacuum_vacuum_scale_factor | 0.20 | Lower = more frequent vacuum |
| autovacuum_max_workers | 3 | More workers = more parallel vacuum |
| autovacuum_vacuum_cost_delay | 2ms (20ms before PG 12) | Lower = less throttling, more I/O |
| autovacuum_vacuum_cost_limit | 200 | Higher = larger I/O bursts |

XID Wraparound: The Emergency

PostgreSQL's transaction ID (XID) is a 32-bit integer. After ~2.1 billion transactions, it wraps around — old rows become invisible to new transactions. PostgreSQL prevents this with anti-wraparound vacuum.

```sql
-- Check XID age for all databases — alert if age > 1.5 billion
SELECT datname,
       age(datfrozenxid) AS xid_age,
       pg_size_pretty(pg_database_size(datname)) AS db_size
FROM pg_database
ORDER BY xid_age DESC;

-- Check tables most at risk (pg_class has no schemaname column;
-- cast relnamespace to regnamespace instead)
SELECT relnamespace::regnamespace AS schemaname, relname,
       age(relfrozenxid) AS table_xid_age,
       pg_size_pretty(pg_total_relation_size(oid)) AS table_size
FROM pg_class
WHERE relkind = 'r'
ORDER BY table_xid_age DESC
LIMIT 20;
```

XID Wraparound is a Hard Shutdown

As XID age approaches roughly 2 billion, PostgreSQL refuses all write transactions with the error `database is not accepting commands to avoid wraparound data loss`. The fix is an emergency `VACUUM FREEZE`. Alert well before the hard stop — for example when age(datfrozenxid) exceeds 1.5B — and monitor XID age continuously.
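A quick runway calculation makes the alert threshold concrete. This is a simplification — the hard stop actually engages a few million XIDs before 2^31, and successful freezing resets the age — but it shows why 1.5B is an alarm, not a comfortable margin:

```python
XID_LIMIT = 2**31  # ~2.147B: wraparound horizon for a 32-bit XID

def xid_runway_hours(current_age: int, xids_per_second: float) -> float:
    """Hours until datfrozenxid age reaches the wraparound limit
    if no freezing happens (worst-case, simplified)."""
    return (XID_LIMIT - current_age) / xids_per_second / 3600

# At age 1.5B and 5,000 transactions/second, about 36 hours remain:
print(round(xid_runway_hours(1_500_000_000, 5_000), 1))  # 36.0
```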

TOAST (The Oversized-Attribute Storage Technique)

TOAST activates when a row exceeds roughly 2KB: large column values are compressed and, if still too big, moved out of line into a shadow table. Each user table with TOAST-able columns gets a corresponding pg_toast.pg_toast_NNNN table.

```sql
-- Find tables with large TOAST tables
SELECT c.relname AS main_table,
       t.relname AS toast_table,
       pg_size_pretty(pg_total_relation_size(t.oid)) AS toast_size
FROM pg_class c
JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relkind = 'r'
ORDER BY pg_total_relation_size(t.oid) DESC
LIMIT 10;
```

Connection Management

The Process-Per-Connection Problem

Each PostgreSQL backend uses 5–10MB of RAM for local state. At 500 connections, that is 2.5–5GB of RAM just for process overhead — before any query work.
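The arithmetic behind that claim, as a sketch — the 5–10MB per-backend figure is a common field estimate, not a fixed constant:

```python
def backend_overhead_gb(connections: int,
                        low_mb: int = 5, high_mb: int = 10) -> tuple:
    """Estimated RAM range consumed by backend processes alone."""
    return (connections * low_mb / 1024, connections * high_mb / 1024)

low, high = backend_overhead_gb(500)
print(f"{low:.1f}-{high:.1f} GB")  # 2.4-4.9 GB
```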

PgBouncer Modes

| Mode | Description | Supported Features | Best For |
| --- | --- | --- | --- |
| Session | One server conn per client session | All PostgreSQL features | Long-lived connections, prepared statements |
| Transaction | Server conn only during a transaction | Most features; no SET persistence, no advisory locks across txns | OLTP APIs, short transactions |
| Statement | Server conn only per statement | Very limited; no multi-statement transactions | Read-only analytics queries |

```ini
# pgbouncer.ini — production configuration
[databases]
myapp = host=127.0.0.1 port=5432 dbname=myapp

[pgbouncer]
pool_mode = transaction
max_client_conn = 10000
default_pool_size = 100
min_pool_size = 10
reserve_pool_size = 20
reserve_pool_timeout = 5
server_idle_timeout = 600
client_idle_timeout = 0
log_connections = 0
log_disconnections = 0
```

Pool Sizing Formula

pool_size = (max_connections_postgresql × 0.8) / number_of_pgbouncer_instances

For max_connections = 200 with 2 PgBouncer instances:

pool_size = (200 × 0.8) / 2 = 80 connections per PgBouncer

Reserve 20% of max_connections for superuser, monitoring tools, and emergency DBA access.
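The formula above, directly as code — a sketch of this chapter's rule of thumb, not a PgBouncer feature:

```python
def pool_size(pg_max_connections: int,
              pgbouncer_instances: int,
              reserve_fraction: float = 0.2) -> int:
    """Per-instance pool size, reserving a fraction of max_connections
    for superuser, monitoring, and emergency DBA access."""
    usable = round(pg_max_connections * (1 - reserve_fraction))
    return usable // pgbouncer_instances

print(pool_size(200, 2))  # 80
```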

Transaction Mode Breaks These Features

PgBouncer transaction mode is incompatible with: LISTEN/NOTIFY, WITH HOLD cursors, advisory locks held across transactions, SET statements that should persist for the session, and prepared statements (unless you run PgBouncer 1.21+ with max_prepared_statements enabled, which tracks protocol-level prepared statements across pooled connections). Use session mode for these patterns.


Essential Extensions

| Extension | Function | Use Case | Overhead |
| --- | --- | --- | --- |
| pg_stat_statements | Track query execution stats | Performance monitoring, slow query identification | ~5% query overhead |
| pg_partman | Automated partition management | Time-based partition creation/dropping | Negligible |
| pgvector | Vector similarity search, HNSW/IVFFlat | Semantic search, RAG, embeddings | ~10% for HNSW index maintenance |
| PostGIS | Geospatial types and operators | Location queries, geographic aggregations | +200MB binary, significant for spatial queries |
| pg_cron | Database-side cron jobs | Partition maintenance, cleanup, reporting | Negligible |
| Citus | Horizontal sharding | Multi-tenant SaaS, distributed aggregations | +15–30% coordinator overhead |
| pglogical | Logical replication | CDC, zero-downtime migrations | WAL size increase ~20–40% |

```sql
-- Enable essential monitoring extensions
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
CREATE EXTENSION IF NOT EXISTS pg_partman;
CREATE EXTENSION IF NOT EXISTS vector;

-- pgvector: table with embedding column
CREATE TABLE documents (
  id        BIGSERIAL PRIMARY KEY,
  content   TEXT,
  embedding vector(1536)  -- OpenAI ada-002 dimensions
);

-- HNSW index: better recall than IVFFlat, no training step required
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Nearest-neighbor search
SELECT id, content,
       1 - (embedding <=> '[0.1,0.2,...]'::vector) AS similarity
FROM documents
ORDER BY embedding <=> '[0.1,0.2,...]'::vector
LIMIT 10;
```

Production Configuration Template

Annotated postgresql.conf for a 64GB RAM / 16-core server running OLTP:

```ini
# ============================================================
# MEMORY
# ============================================================
shared_buffers = 16GB                 # 25% of RAM — PostgreSQL's own page cache
effective_cache_size = 48GB           # 75% of RAM — hint to planner, does not allocate
work_mem = 64MB                       # Per sort/hash node, per query. 200 conns × 64MB
                                      # (× multiple nodes per plan) can exceed RAM — monitor for OOM
maintenance_work_mem = 2GB            # For VACUUM, CREATE INDEX, ALTER TABLE
huge_pages = try                      # 2MB huge pages if OS supports (reduces TLB misses)

# ============================================================
# WAL / CHECKPOINT
# ============================================================
wal_level = replica                   # Enable streaming replication
max_wal_size = 8GB                    # Allow 8GB WAL before forced checkpoint
min_wal_size = 2GB                    # Pre-allocate to avoid segment allocation latency
checkpoint_completion_target = 0.9    # Spread I/O over 90% of checkpoint interval
checkpoint_timeout = 15min            # Max 15 minutes between checkpoints
wal_compression = on                  # ~30% WAL size reduction on compressible data
wal_buffers = 64MB                    # Pre-allocated WAL buffer

# ============================================================
# CONNECTIONS
# ============================================================
max_connections = 200                 # Use PgBouncer in front; keep this low
superuser_reserved_connections = 5    # Emergency DBA access

# ============================================================
# PLANNER
# ============================================================
random_page_cost = 1.1                # SSDs: near seq_page_cost (1.0). HDDs: 4.0
effective_io_concurrency = 200        # SSD: high parallelism. HDD: 2
default_statistics_target = 500       # More histogram buckets = better estimates (default 100)

# ============================================================
# PARALLEL QUERY
# ============================================================
max_worker_processes = 16             # Match CPU count
max_parallel_workers_per_gather = 8   # Up to 8 workers per query
max_parallel_workers = 16             # Total parallel workers across all queries

# ============================================================
# AUTOVACUUM
# ============================================================
autovacuum_max_workers = 6            # More workers for a 16-core server
autovacuum_vacuum_scale_factor = 0.05 # Trigger at 5% dead tuples (default 20%)
autovacuum_analyze_scale_factor = 0.025
autovacuum_vacuum_cost_delay = 1ms    # Less throttling on NVMe SSD (default 2ms)
autovacuum_vacuum_cost_limit = 1000   # Higher I/O budget per worker

# ============================================================
# LOGGING
# ============================================================
log_min_duration_statement = 1000     # Log queries slower than 1 second
log_checkpoints = on
log_lock_waits = on
log_temp_files = 64MB                 # Log when a sort/hash spills to disk
log_autovacuum_min_duration = 250ms   # Log slow autovacuums
track_io_timing = on                  # I/O stats in pg_stat_statements
```
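The work_mem caveat in this template deserves emphasis: the limit applies per sort/hash node, not per connection. A worst-case sketch — the nodes-per-plan count is an illustrative assumption, not a measured value:

```python
def work_mem_worst_case_gb(connections: int,
                           nodes_per_plan: int,
                           work_mem_mb: int) -> float:
    """Upper-bound memory if every connection runs a plan in which
    every sort/hash node allocates its full work_mem."""
    return connections * nodes_per_plan * work_mem_mb / 1024

# 200 connections, 2 sort/hash nodes per plan, work_mem = 64MB:
print(work_mem_worst_case_gb(200, 2, 64))  # 25.0
```

25GB of potential sort memory on a 64GB box that has already dedicated 16GB to shared_buffers is why work_mem is kept modest and raised per-session for known-heavy queries.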

Monitoring with pg_stat_statements

Finding Slow Queries

```sql
-- Top 10 queries by total execution time
SELECT
  LEFT(query, 80) AS query_snippet,
  calls,
  ROUND(total_exec_time::numeric, 2) AS total_ms,
  ROUND(mean_exec_time::numeric, 2) AS avg_ms,
  ROUND(stddev_exec_time::numeric, 2) AS stddev_ms,
  ROUND((100 * total_exec_time /
    SUM(total_exec_time) OVER ())::numeric, 2) AS pct_total
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Queries with high I/O (shared blocks read from disk)
SELECT
  LEFT(query, 80) AS query_snippet,
  calls,
  ROUND(mean_exec_time::numeric, 2) AS avg_ms,
  shared_blks_read / NULLIF(calls, 0) AS blks_read_per_call
FROM pg_stat_statements
WHERE calls > 100
ORDER BY shared_blks_read DESC
LIMIT 20;
```

Cache Hit Ratio

```sql
-- Buffer hit ratio — should be > 99% for OLTP
SELECT
  SUM(heap_blks_hit) AS heap_hits,
  SUM(heap_blks_read) AS heap_reads,
  ROUND(100.0 * SUM(heap_blks_hit) /
    NULLIF(SUM(heap_blks_hit) + SUM(heap_blks_read), 0), 2) AS hit_ratio_pct
FROM pg_statio_user_tables;

-- Index vs sequential scan ratio per table
SELECT
  relname,
  idx_scan,
  seq_scan,
  ROUND(100.0 * idx_scan / NULLIF(idx_scan + seq_scan, 0), 2) AS idx_scan_pct,
  n_live_tup
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_scan DESC
LIMIT 20;
```
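The same hit-ratio arithmetic, extracted for use in an external monitoring script — a plain sketch of the formula, not a client-library call:

```python
def buffer_hit_ratio_pct(hits: int, reads: int) -> float:
    """Percentage of heap block requests served from shared_buffers."""
    total = hits + reads
    return round(100.0 * hits / total, 2) if total else 0.0

print(buffer_hit_ratio_pct(995_000, 5_000))  # 99.5
```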

Alert Thresholds

| Metric | Warning | Critical | Source |
| --- | --- | --- | --- |
| Buffer hit ratio | < 99% | < 95% | pg_statio_user_tables |
| Longest running query | > 5 min | > 30 min | pg_stat_activity |
| Idle in transaction | > 2 min | > 10 min | pg_stat_activity |
| XID age | > 1B | > 1.5B | pg_database.datfrozenxid |
| Table bloat ratio | > 30% | > 60% | pgstattuple extension |
| Autovacuum lag | > 1 hr | > 6 hr | pg_stat_user_tables |
| Replication lag | > 100MB | > 1GB | pg_stat_replication |

```sql
-- Replication lag on primary
SELECT
  application_name,
  state,
  pg_size_pretty(pg_wal_lsn_diff(
    pg_current_wal_lsn(), sent_lsn)) AS send_lag,
  pg_size_pretty(pg_wal_lsn_diff(
    sent_lsn, flush_lsn)) AS flush_lag,
  pg_size_pretty(pg_wal_lsn_diff(
    flush_lsn, replay_lsn)) AS replay_lag
FROM pg_stat_replication;
```

Case Study: Instagram's PostgreSQL Journey

Instagram scaled PostgreSQL from a single EC2 instance (2010) to thousands of shards serving 2B+ users — without ever migrating away from PostgreSQL. Their journey through read replicas, functional sharding, and horizontal sharding by user_id demonstrates every concept in this chapter: WAL-based streaming replication, PgBouncer connection pooling, per-table autovacuum tuning, and pg_stat_statements-driven query optimization.

For the full deep dive — sharding architecture, Snowflake-inspired ID generation, feed fan-out design, and Django ORM at scale — see Ch13 — Instagram: PostgreSQL at Scale.


Related Chapters

| Chapter | Relevance |
| --- | --- |
| Ch01 — Database Landscape | WAL mechanics and storage engine foundations |
| Ch03 — Indexing Strategies | B-tree, GIN, and partial index types used in production |
| Ch04 — Transactions & Concurrency | MVCC driving the need for VACUUM |
| Ch06 — MySQL & Distributed SQL | PostgreSQL vs MySQL production trade-offs |
| System Design Ch09 | SQL database selection in system design interviews |

Common Mistakes

| Mistake | Why It Happens | Impact | Fix |
| --- | --- | --- | --- |
| Setting shared_buffers to 50%+ of RAM | "More cache = better" | OS page cache starved; kernel swapping at high connection counts | Keep shared_buffers at 25% of RAM; effective_cache_size at 75% (hint only, no allocation) |
| Not monitoring XID age | "Autovacuum handles it" | Database refuses all writes when age exceeds limit | Alert when age(datfrozenxid) > 1B; investigate long-running transactions that block autovacuum |
| Using SQL-level prepared statements with PgBouncer transaction mode | Misread documentation | Prepared statements fail unpredictably across pool connections | Use protocol-level parameterized queries instead of SQL PREPARE/EXECUTE with transaction pooling |
| Keeping autovacuum_vacuum_scale_factor at 20% on large tables | Default config left unchanged | Table with 100M rows only vacuumed after 20M dead tuples — bloat accumulates | Set per-table autovacuum_vacuum_scale_factor = 0.01 for high-churn tables |
| Not running ANALYZE after large bulk loads | "Autovacuum will handle it" | Planner uses stale statistics; bad query plans until autoanalyze triggers | Run ANALYZE table_name explicitly after loading millions of rows |

Practice Questions

Beginner

  1. Shared Buffers Sizing: A new 32GB RAM server will run PostgreSQL for a web application. What value would you set shared_buffers to, and why? What value would you set effective_cache_size to, and what does that parameter actually control?

    Model Answer `shared_buffers = 8GB` (25% of 32GB). `effective_cache_size = 24GB` (75% of 32GB). The critical distinction: `shared_buffers` allocates real memory at startup. `effective_cache_size` is only a hint to the planner about total available cache (PostgreSQL + OS) — it does not allocate any memory.
  2. VACUUM vs ANALYZE: What is the difference between VACUUM and ANALYZE in PostgreSQL? Why would you run one without the other? Give a scenario where each is appropriate.

    Model Answer `VACUUM` reclaims space from dead tuples and updates the free space map. `ANALYZE` collects statistics about column distributions for the planner. Run `VACUUM` alone after a large DELETE/UPDATE to reclaim bloat. Run `ANALYZE` alone after a bulk INSERT — there are no dead tuples but the data distribution has changed, so planner statistics need refreshing.

Intermediate

  1. PgBouncer Mode Selection: Your application uses PostgreSQL advisory locks to prevent concurrent job execution, and LISTEN/NOTIFY for real-time push notifications. A colleague suggests deploying PgBouncer in transaction mode to reduce connection overhead. What problems would this cause?

    Model Answer Advisory locks held across transactions break in transaction mode — the lock is released when the connection returns to the pool between transactions. `LISTEN/NOTIFY` breaks entirely because notifications are tied to server sessions, not transactions. Recommendation: session mode for these workloads, or separate PgBouncer pools (transaction mode for regular OLTP, session mode for workers using advisory locks/NOTIFY).
  2. XID Wraparound Prevention: Your DBA reports age(datfrozenxid) is at 1.2 billion for your main database. Autovacuum is enabled. Why is this still happening, and what steps would you take immediately?

    Model Answer Most common cause: a long-running transaction blocking autovacuum from freezing tuples. Check `pg_stat_activity` for old transactions. Immediate actions: (1) Kill long-running idle transactions. (2) Run `VACUUM FREEZE ANALYZE` manually on the oldest tables from `pg_class`. (3) Check if autovacuum is being blocked by lock contention (`pg_locks` + `pg_stat_activity`). (4) Consider increasing `autovacuum_freeze_max_age` to give more runway.

Advanced

  1. Production Tuning: A 64GB RAM / NVMe SSD server runs PostgreSQL for order processing (20K writes/second, 5K reads/second). Autovacuum consumes 60% of disk I/O blocking queries. Checkpoints fire every 2 minutes. work_mem = 256MB causing OOM errors. Design a tuning plan addressing all three problems.

    Model Answer Autovacuum I/O: raise `autovacuum_vacuum_cost_delay` to 10ms (more throttling than the 2ms default) and keep `autovacuum_vacuum_cost_limit` moderate (e.g. 400) to cap burst I/O; alternatively, apply per-table settings to the high-churn tables only. Frequent checkpoints: at 20K writes/sec × ~500 bytes each = 10MB/s WAL, 2 minutes accumulates only 1.2GB — increase `max_wal_size = 8GB` and `checkpoint_timeout = 15min` so checkpoints become time-driven rather than size-driven. OOM from work_mem: reduce to 16–32MB; work_mem applies per sort/hash node, so 200 connections × 256MB is already ~50GB, and plans with several such nodes can double or triple that — far beyond what 64GB of RAM can absorb after shared_buffers. Accept occasional disk spills for rare complex queries rather than OOM risk.
