
Chapter 10: Sharding & Partitioning

"Sharding is not a scaling strategy. It is a last resort after every other strategy has been exhausted."


Prerequisites

This chapter assumes understanding of WAL mechanics (Ch01) and replication fundamentals (Ch09). Review those first if needed.

Overview

Partitioning and sharding are related but distinct concepts. Partitioning divides a single large table into smaller physical segments within the same database server. Sharding distributes data across multiple independent database servers (shards), each owning a subset of the data. You should almost always try partitioning before sharding — it provides many of the same benefits (query pruning, partition-level vacuuming, parallel scans) without the operational complexity of a distributed system.

This chapter covers both, starting with PostgreSQL table partitioning and advancing to full cluster sharding with consistent hashing, shard key design, and the tools that make horizontal scaling possible.


Table Partitioning in PostgreSQL

PostgreSQL has supported declarative table partitioning since version 10. A partitioned table is virtual — the data lives in child tables (partitions) that the planner routes to automatically.

RANGE Partitioning (Time-Based)

The most common partitioning pattern for time-series and append-heavy data:

sql
-- Create the partitioned parent table
CREATE TABLE events (
    id          BIGSERIAL,
    user_id     BIGINT NOT NULL,
    event_type  TEXT NOT NULL,
    payload     JSONB,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
) PARTITION BY RANGE (created_at);

-- Create monthly partitions
CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE events_2024_02 PARTITION OF events
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Queries automatically route to the correct partition
EXPLAIN SELECT * FROM events WHERE created_at >= '2024-01-15';
-- Output: Seq Scan on events_2024_01 (partition pruning in action)

-- Old partitions can be detached and archived instead of DELETE
ALTER TABLE events DETACH PARTITION events_2024_01;
-- Now events_2024_01 is an independent table — attach to cold storage

Version Note

PostgreSQL examples verified against PostgreSQL 16/17. Autovacuum defaults and some pg_stat_* views changed in PostgreSQL 17 — check the release notes for your version.

LIST Partitioning (Category/Region)

sql
-- Partition by region for multi-tenant or geographic workloads
CREATE TABLE orders (
    id      BIGSERIAL,
    region  TEXT NOT NULL,  -- 'us', 'eu', 'apac'
    amount  NUMERIC(12, 2),
    created_at TIMESTAMPTZ DEFAULT now()
) PARTITION BY LIST (region);

CREATE TABLE orders_us   PARTITION OF orders FOR VALUES IN ('us');
CREATE TABLE orders_eu   PARTITION OF orders FOR VALUES IN ('eu');
CREATE TABLE orders_apac PARTITION OF orders FOR VALUES IN ('apac');
CREATE TABLE orders_rest PARTITION OF orders DEFAULT;  -- catch-all

HASH Partitioning (Uniform Distribution)

When no natural range or list partition key exists, hash partitioning distributes rows uniformly:

sql
-- Hash partition by user_id into 8 buckets
CREATE TABLE user_activity (
    id      BIGSERIAL,
    user_id BIGINT NOT NULL,
    action  TEXT,
    ts      TIMESTAMPTZ DEFAULT now()
) PARTITION BY HASH (user_id);

CREATE TABLE user_activity_0 PARTITION OF user_activity
    FOR VALUES WITH (MODULUS 8, REMAINDER 0);
CREATE TABLE user_activity_1 PARTITION OF user_activity
    FOR VALUES WITH (MODULUS 8, REMAINDER 1);
-- ... repeat for 2-7

Partition Pruning

Partition pruning is the key performance benefit: the planner scans only relevant partitions. For this to work, the WHERE clause must reference the partition key.

sql
-- Pruning works: planner only scans 2024-01 partition
SELECT count(*) FROM events WHERE created_at BETWEEN '2024-01-01' AND '2024-01-31';

-- Pruning does NOT work: full scan across all partitions
SELECT count(*) FROM events WHERE user_id = 42;
-- If you frequently query by user_id, partition by user_id instead (or add a regular index)

Automating Partitions with pg_partman

Manually creating monthly partitions is error-prone. pg_partman automates creation and retention:

sql
-- Install pg_partman (conventionally into its own schema)
CREATE SCHEMA IF NOT EXISTS partman;
CREATE EXTENSION pg_partman SCHEMA partman;

-- Set up automated monthly partitions
-- (interval syntax varies by pg_partman version: 5.x expects '1 month',
--  4.x used the keyword 'monthly')
SELECT partman.create_parent(
    p_parent_table   => 'public.events',
    p_control        => 'created_at',
    p_interval       => '1 month',
    p_premake        => 3           -- pre-create 3 future partitions
);

-- Schedule maintenance (run via pg_cron or external cron)
SELECT partman.run_maintenance();
-- Creates new future partitions, drops/detaches old ones per retention policy

-- Configure retention
UPDATE partman.part_config
SET retention = '12 months',
    retention_keep_table = false   -- drop old partitions automatically
WHERE parent_table = 'public.events';

Partition Before You Shard

Before moving to multi-server sharding, try: (1) RANGE partitioning to enable efficient time-based pruning and archival, (2) adding read replicas to offload analytics queries, (3) upgrading to a larger server. Partitioning costs nothing in query transparency — the same SQL keeps working, and partition pruning happens automatically.


Sharding: Distributing Data Across Servers

Sharding splits data across multiple independent database servers. Each server (shard) owns a disjoint subset of the data and handles only queries for its subset. The application (or middleware) must route queries to the correct shard.

Hash-Based Sharding

The most common sharding strategy. A hash function maps the shard key to a shard number:

shard = hash(user_id) % num_shards

Problem with naive modulo: Adding a new shard invalidates all existing assignments — hash(key) % 8 routes differently than hash(key) % 9. You must migrate all data.
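The cost of naive modulo can be measured directly. A short Python sketch (the `user:N` key format is made up; any stable hash works):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Stable hash of the key, reduced modulo the shard count."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_shards

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(shard_for(k, 8) != shard_for(k, 9) for k in keys)
print(f"{moved / len(keys):.0%} of keys change shards going 8 -> 9")
```

Roughly 8 out of 9 keys get reassigned: only keys whose hash is congruent to the same value modulo both 8 and 9 (i.e., modulo 72) stay put.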

Consistent Hashing

Consistent hashing solves the resharding problem. Keys and shards are placed on a circular ring. A key maps to the first shard clockwise on the ring. Adding or removing a shard only displaces a fraction of keys.

Virtual Nodes (vnodes)

With only 3 physical shards on the ring, adding a 4th takes keys from just one neighbor: on average about a quarter of all keys move, and every one of them comes from a single existing shard, leaving the ring unbalanced. Virtual nodes distribute each physical shard across many positions on the ring, making data movement more granular and balanced:

Physical Shard 0 → vnodes at positions: 5, 18, 31, 47, 62, ... (100 positions)
Physical Shard 1 → vnodes at positions: 2, 12, 25, 39, 55, ... (100 positions)

Adding a physical shard means taking a few vnodes from each existing shard — small, balanced movements across all shards.
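The ring-with-vnodes mechanism can be sketched in a few lines of Python — a toy implementation, not a production library, using MD5 only as a stable hash:

```python
import bisect
import hashlib

def _pos(s: str) -> int:
    """Map a string to a position on the hash ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    """Consistent hash ring with virtual nodes (sketch)."""
    def __init__(self, shards, vnodes=100):
        self.vnodes = vnodes
        self.ring = []                       # sorted (position, shard) pairs
        for s in shards:
            self.add(s)

    def add(self, shard):
        for v in range(self.vnodes):
            bisect.insort(self.ring, (_pos(f"{shard}#{v}"), shard))

    def lookup(self, key):
        i = bisect.bisect(self.ring, (_pos(key), ""))  # first vnode clockwise
        return self.ring[i % len(self.ring)][1]        # wrap around the ring

keys = [f"user:{i}" for i in range(10_000)]
ring = Ring(["shard-0", "shard-1", "shard-2"])
before = {k: ring.lookup(k) for k in keys}
ring.add("shard-3")
moved = sum(before[k] != ring.lookup(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved after adding a 4th shard")
```

Compare with the ~89% reassignment under naive modulo: here only about a quarter of the keys move, taken in small slices from all three existing shards.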

Range-Based Sharding

Shard 0: user_id 1 – 1,000,000
Shard 1: user_id 1,000,001 – 2,000,000
Shard 2: user_id 2,000,001 – 3,000,000

Pros: Simple routing logic, range queries on the shard key stay on one shard.

Cons: Prone to hotspots — if user_id is sequential and new users are always on the last shard, that shard receives all new writes (hot shard problem). Requires more careful shard key selection.
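Range routing itself is just a sorted-bounds lookup. A minimal Python sketch, assuming the three shard ranges above (upper bounds inclusive; IDs beyond the last bound would need a new shard):

```python
import bisect

# Inclusive upper bound of each shard's user_id range
BOUNDS = [1_000_000, 2_000_000, 3_000_000]

def range_shard(user_id: int) -> int:
    """Return the shard index whose range contains user_id."""
    return bisect.bisect_left(BOUNDS, user_id)

print(range_shard(42))         # shard 0
print(range_shard(1_500_000))  # shard 1
```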

Directory-Based Sharding

A lookup table maps each key (or key range) to a specific shard:

sql
-- Shard directory (lives in a highly available metadata service)
CREATE TABLE shard_directory (
    entity_type TEXT,
    entity_id   BIGINT,
    shard_id    INT,
    PRIMARY KEY (entity_type, entity_id)
);

-- Route: SELECT shard_id FROM shard_directory WHERE entity_type='user' AND entity_id=42;

Pros: Maximum flexibility — can move individual keys between shards without any formula change.

Cons: Every query requires a directory lookup (often cached in memory). The directory itself is a single point of failure and must be highly available.
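The standard mitigation for the per-query lookup is an in-process cache in front of the directory. A minimal Python sketch, with a hypothetical in-memory dict standing in for the real shard_directory table:

```python
from functools import lru_cache

# Stand-in for the shard_directory table / metadata service
DIRECTORY = {("user", 42): 3, ("user", 99): 1}

@lru_cache(maxsize=100_000)
def shard_for(entity_type: str, entity_id: int) -> int:
    # A cache hit avoids a round-trip to the directory service;
    # in production this branch would issue the SELECT shown above.
    return DIRECTORY[(entity_type, entity_id)]
```

The usual trade-off applies: cached entries can go stale while a key is being migrated, so real systems version directory entries or invalidate the cache on move.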


Choosing a Shard Key

The shard key decision determines the performance characteristics of your sharding cluster. A bad shard key creates hot shards, forces expensive cross-shard queries, and makes resharding difficult.

Shard Key Properties

| Property | Requirement | Failure Mode if Missing |
| --- | --- | --- |
| High cardinality | Many distinct values | Too few values → can't split evenly |
| Uniform distribution | Values spread evenly over the key space | Hot shard receives disproportionate traffic |
| Query alignment | Most queries filter by the shard key | Cross-shard scatter-gather queries |
| Immutability | Value doesn't change after insertion | Changing the shard key requires moving the row |
| Business meaning | Key maps to a logical data owner | Arbitrary keys break cross-entity queries |

Good vs Bad Shard Keys

| Key | Verdict | Why |
| --- | --- | --- |
| hash(user_id) | Good | High cardinality, uniform, immutable |
| tenant_id | Good with care | Aligns with multi-tenant queries; one huge tenant can still create a hot shard |
| created_at | Bad | Monotonically increasing — all new writes land on the newest shard |
| region / country | Bad | Low cardinality — too few values to split evenly |

Compound Shard Keys

When a single key produces hotspots, compound keys can help:

Shard key: (tenant_id, user_id)
→ All queries for a user stay on one shard
→ Queries across tenants span multiple shards (acceptable)
→ A single tenant with millions of users can still overwhelm one shard

For a large multi-tenant SaaS, large tenants may need their own dedicated shards (tenant isolation), while small tenants share shards.
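One way to combine shared hash routing with tenant isolation is an explicit override map checked before the hash. A Python sketch (tenant names, shard numbers, and the override table are all hypothetical):

```python
import hashlib

# Large tenants pinned to dedicated shards; everyone else shares the pool
DEDICATED = {"tenant-bigcorp": 7}
NUM_SHARED_SHARDS = 6

def shard_for_tenant(tenant_id: str) -> int:
    """Dedicated shard if pinned, otherwise hash into the shared pool."""
    if tenant_id in DEDICATED:
        return DEDICATED[tenant_id]
    h = int(hashlib.md5(tenant_id.encode()).hexdigest(), 16)
    return h % NUM_SHARED_SHARDS
```

The override map is effectively a tiny directory layered on top of hash sharding: it keeps routing O(1) for the common case while letting operators relocate a hot tenant without resharding everyone.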


Cross-Shard Queries and Joins

Cross-shard queries are the primary operational cost of sharding. JOINs between two sharded tables require a scatter-gather pattern: query all shards, collect results, merge in the application layer.

Scatter-Gather

Scatter-gather cost: latency is max(shard_latency) + merge time. With 100 shards, the tail latency of the slowest shard dominates. Scatter-gather queries do not scale well — minimize them.
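The tail-latency math is worth making concrete. Assuming each shard independently answers within its latency budget 99% of the time, the chance that all 100 shards do so on a single scatter-gather query is small:

```python
p_fast = 0.99   # one shard answers within budget 99% of the time
shards = 100
# P(every shard is fast on the same query) = p_fast ** shards
print(f"P(all {shards} shards fast) = {p_fast ** shards:.2f}")  # -> 0.37
```

So roughly two out of three scatter-gather queries hit at least one slow shard, and the query is as slow as its slowest participant.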

Avoiding Cross-Shard Joins: Denormalization

Instead of joining orders with users across shards, embed the user data you need into the orders table:

sql
-- Before denormalization (requires cross-shard JOIN)
CREATE TABLE orders (
    id       BIGSERIAL PRIMARY KEY,
    user_id  BIGINT NOT NULL,  -- references users table on another shard
    amount   NUMERIC(12, 2)
);

-- After denormalization (self-contained)
CREATE TABLE orders (
    id          BIGSERIAL PRIMARY KEY,
    user_id     BIGINT NOT NULL,
    user_email  TEXT NOT NULL,   -- denormalized from users
    user_name   TEXT NOT NULL,   -- denormalized from users
    amount      NUMERIC(12, 2)
);
-- Tradeoff: email/name updates must be propagated via CDC or background job

Reference Tables

For small, frequently-joined lookup tables (countries, categories, currencies), replicate the entire table to every shard:

sql
-- In Citus: reference tables are replicated to all workers
SELECT create_reference_table('countries');
SELECT create_reference_table('product_categories');

-- Now JOINs between sharded orders and reference countries work without scatter-gather
SELECT o.amount, c.name AS country
FROM orders o
JOIN countries c ON o.country_code = c.code
WHERE o.user_id = 42;  -- routed to single shard, countries available locally

Resharding: Splitting and Migrating Shards

You will eventually need more shards than you started with. Resharding moves data between shards while the application continues serving traffic.

Why Resharding Happens

  • Shard capacity exceeded (disk, CPU, or connection count)
  • Shard key chosen poorly — one shard is a hotspot
  • Business growth requiring geographic distribution
  • Shard failure requiring emergency data redistribution

Online Resharding with Vitess

Vitess implements online resharding via a multi-phase process that keeps the old shards live while building new ones.

The key technique is VReplication: Vitess creates the new shards alongside the old ones, copies data in the background, and streams binlog events to keep the new shards in sync until cutover — the same pattern gh-ost uses (with a "ghost" table) for online schema changes.


Sharding Tools Comparison

| Tool | Target DB | Sharding Model | Cross-Shard Queries | Managed Version |
| --- | --- | --- | --- | --- |
| Vitess | MySQL | Hash / range | Scatter-gather + merge | PlanetScale |
| Citus | PostgreSQL | Hash / range / reference tables | Distributed SQL | Azure Cosmos DB for PostgreSQL |
| ProxySQL | MySQL | Application-defined routing rules | None (routing only) | None |
| Crunchy Postgres | PostgreSQL | Per-tenant isolation | Federated queries | Crunchy Bridge |

Vitess Architecture

Citus Architecture

sql
-- Citus setup: distribute orders by user_id
SELECT create_distributed_table('orders', 'user_id');
SELECT create_distributed_table('order_items', 'user_id', colocate_with => 'orders');
-- Citus collocates order_items with orders on the same shard
-- → JOINs on user_id are local, no scatter-gather

-- Distributed query — Citus handles routing transparently
SELECT user_id, count(*), sum(amount)
FROM orders
GROUP BY user_id
ORDER BY sum(amount) DESC
LIMIT 10;

Case Study: Pinterest's Sharding Architecture

Pinterest scaled to billions of pins and hundreds of billions of relationships, making sharding unavoidable. Their architecture is a textbook example of practical sharding design.

Problem: By 2012, Pinterest had outgrown a single MySQL server. Simple primary-replica replication helped with reads, but writes were hitting a ceiling.

Approach: Manual range-based sharding across 8 MySQL servers, later expanded to 1,024 virtual shards (logical shards mapped to physical servers).

Shard ID embedded in object IDs:

Pinterest encodes the shard ID directly in every object's ID using a 64-bit integer format:

64-bit Pinterest ID:
┌──────────┬───────────┬───────────┬─────────────┐
│ (unused) │ Shard ID  │   Type    │  Local ID   │
│  2 bits  │  16 bits  │  10 bits  │  36 bits    │
└──────────┴───────────┴───────────┴─────────────┘

Decoding is pure bit arithmetic:
Shard = (ID >> 46) & 0xFFFF     -- which shard owns the object
Type  = (ID >> 36) & 0x3FF      -- object type (pin, board, user, ...)
Local = ID & ((1 << 36) - 1)    -- ID local to that shard

This design means every object carries its shard location — no directory lookup required. The routing function is a pure bit manipulation on the ID, O(1) with no database access.
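That routing function can be sketched directly. The bit widths follow the 16/10/36 layout from Pinterest's published design; the helper names are illustrative:

```python
SHARD_BITS, TYPE_BITS, LOCAL_BITS = 16, 10, 36

def make_id(shard: int, type_id: int, local_id: int) -> int:
    """Pack shard, type, and local ID into one 64-bit object ID."""
    return (shard << (TYPE_BITS + LOCAL_BITS)) | (type_id << LOCAL_BITS) | local_id

def decode_id(obj_id: int):
    """Unpack an object ID; no directory lookup needed."""
    shard   = (obj_id >> (TYPE_BITS + LOCAL_BITS)) & ((1 << SHARD_BITS) - 1)
    type_id = (obj_id >> LOCAL_BITS) & ((1 << TYPE_BITS) - 1)
    local   = obj_id & ((1 << LOCAL_BITS) - 1)
    return shard, type_id, local

# Round-trip: encode then decode recovers (42, 1, 123456)
assert decode_id(make_id(42, 1, 123456)) == (42, 1, 123456)
```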

Cross-shard social graph:

Pin follows (user A → pin B):
  - user A is on shard 12
  - pin B is on shard 87
  → Store the relationship record on shard 12 (user A's shard)
  → Query: "all pins user A follows" → single-shard query on shard 12
  → Query: "all users following pin B" → stored on each user's shard (fan-out)
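The edge-storage scheme above can be sketched as a toy in-memory double-write (the router and list-backed "shards" are stand-ins for real shard connections):

```python
from collections import defaultdict

NUM_SHARDS = 2
shards = [defaultdict(list) for _ in range(NUM_SHARDS)]

def shard_of(entity_id: str) -> int:
    # Deterministic toy router (stand-in for consistent hashing)
    return sum(entity_id.encode()) % NUM_SHARDS

def follow(user_id: str, pin_id: str) -> None:
    """Write the edge to both owners so each read path is single-shard."""
    shards[shard_of(user_id)][("following", user_id)].append(pin_id)
    shards[shard_of(pin_id)][("followers", pin_id)].append(user_id)

def pins_followed_by(user_id: str):   # single-shard read on the user's shard
    return shards[shard_of(user_id)][("following", user_id)]

def followers_of(pin_id: str):        # single-shard read on the pin's shard
    return shards[shard_of(pin_id)][("followers", pin_id)]

follow("user:A", "pin:B")
```

The write cost doubles (two shards per edge), which is the price of keeping both read paths local; a real system would also need the two writes to be atomic or reconciled.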

Key decisions:

  • 1,024 virtual shards mapped to 8 physical servers — adding a server means moving 128 virtual shards, not resharding at the key level
  • Shard ID in object ID eliminates directory lookups entirely
  • Social graph edges stored on the entity's own shard for read locality
  • MySQL chosen for familiar ACID semantics; sharding added as a layer above

Scale achieved: 200+ billion pins, 40+ petabytes of data, and hundreds of millions of monthly active users — growth carried by the sharded MySQL architecture Pinterest first described in its 2013 engineering blog post.


Related Chapters

| Chapter | Relevance |
| --- | --- |
| Ch09 — Replication & HA | Each shard needs its own replication setup |
| Ch11 — Query Optimization | Cross-shard query performance and EXPLAIN |
| Ch02 — Data Modeling for Scale | Schema design that avoids cross-shard joins |
| System Design Ch10 — NoSQL Databases | Consistent hashing and sharding in distributed systems |

Common Mistakes

| Mistake | Why It Happens | Impact | Fix |
| --- | --- | --- | --- |
| Choosing a shard key without analyzing query patterns | "user_id is always the right key" | Most queries require scatter-gather across all shards | List every query first; choose the key that keeps the top 80% of queries on a single shard |
| Sharding too early | "We might need it someday" | Enormous operational complexity for a problem that doesn't exist yet | Try read replicas, vertical scaling, and table partitioning first; shard only when those are exhausted |
| Cross-shard JOINs in application code | "We'll just query both shards and merge" | O(shards × rows) work; latency dominated by the slowest shard | Denormalize data needed for common JOINs into the primary shard table at write time |
| Using created_at as the shard key | Range sharding is simple | All new writes land on the most recent shard; other shards sit idle | Use a consistent hash of user_id or tenant_id for uniform distribution |

Practice Questions

Beginner

  1. Partitioning Basics: An e-commerce database has an orders table with 500 million rows. 90% of queries filter by created_at within the last 30 days. The DBA wants to archive and delete data older than 2 years. How should they partition this table, and what SQL would they use to delete all orders from 2021?

    Model Answer: RANGE partition by created_at with monthly or yearly partitions. To "delete" 2021 data without a slow DELETE: `ALTER TABLE orders DETACH PARTITION orders_2021;` — this is instantaneous because it only removes the partition from the parent table's catalog, not the data. Then `DROP TABLE orders_2021;` or move it to cold storage.
  2. Hot Shard Problem: A sharded user database uses created_at as the shard key. After launch, nearly all new user registrations go to shard 3 (the most recent shard). The DBA observes shard 3 at 95% CPU while shards 0–2 are idle. Explain why this happened and what shard key would have prevented it.

    Model Answer: created_at is monotonically increasing — all new users land on the most recent shard range. `user_id` (with consistent hashing) would distribute uniformly because the hash function maps sequential IDs to different positions on the ring. Alternatively, hash(user_id) % num_shards distributes new users evenly regardless of their sequential ID.

Intermediate

  1. Shard Key Design: A multi-tenant SaaS application stores users, projects, and tasks tables. The application's most common queries are:

    • All projects for a user
    • All tasks for a project
    • All tasks assigned to a user

    Design a sharding strategy that keeps the most common queries on a single shard. What trade-offs does your choice make?

    Model Answer: Shard key `user_id` for users and tasks, `owner_user_id` for projects (the user who created the project). Collocate `projects` and `tasks` with their owning user on the same shard. This makes "all projects for a user" and "all tasks for a user" single-shard queries. The trade-off: "all tasks for a project" requires knowing the project's owner_user_id first (or doing a cross-shard lookup for tasks not owned by the current user). For Citus, use `colocate_with` to ensure related tables land on the same shard.
  2. Online Resharding: A Pinterest-style application is running 8 MySQL shards and needs to expand to 16 shards without downtime. Describe the Vitess online resharding process step by step. What is the maximum acceptable downtime during the cutover phase?

    Model Answer: Vitess steps: (1) Create 8 new shards alongside the existing 8. (2) Copy all data from old shards to new shards in the background (no traffic impact, takes hours for large datasets). (3) Start replicating new writes from old to new shards via binlog streaming. (4) Wait until new shards are < 1 second behind. (5) VTGate atomically switches reads then writes to the new shards (~1 second of write blocking). (6) Verify and clean up the old shards. Maximum downtime: < 5 seconds in practice for the write cutover. The data copy phase has zero downtime.

Advanced

  1. Cross-Shard Social Graph: Design a sharding strategy for a Twitter-like follow graph with 500 million users and 50 billion follow relationships. The two critical read paths are: (a) "who does user X follow" and (b) "who follows user X" (followers). How do you avoid scatter-gather for both paths? What is the write cost of this design?

    Model Answer: Use the "double write" or "fan-out" pattern: store each follow relationship twice — once on the follower's shard (for "who does X follow") and once on the followee's shard (for "who follows X"). This doubles storage and makes writes more expensive (two shards must be written per follow), but both reads are single-shard. For users with 50M+ followers (celebrities), fan-out-on-write to all followers' shards on each tweet is impractical — switch to fan-out-on-read for these accounts (computed at read time). This is exactly Twitter's architecture: regular users get a push-model feed, celebrities a pull-model feed.

