Chapter 13: Instagram: PostgreSQL at Scale
"We were surprised how far PostgreSQL took us. We kept expecting to hit the wall. We kept not hitting it." — Instagram Engineering, 2012
Prerequisites
This chapter assumes familiarity with PostgreSQL production concepts (Ch05) and sharding strategies (Ch10). Review those first if needed.
The Starting Point (2010–2012)
Instagram launched in October 2010 with a remarkably simple stack: a single PostgreSQL 9.0 instance running on a dedicated server, with Django as the application framework and a custom Python backend. The founding engineers — Kevin Systrom and Mike Krieger — were not database specialists. They chose PostgreSQL because it was the best general-purpose relational database available and because they knew SQL.
The initial schema was conventional: users, photos, follows, likes, comments — each a normalized table with foreign key constraints. Django's ORM generated clean SQL queries. Everything fit on a single machine.
-- Simplified original Instagram schema (2010)
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
username VARCHAR(30) UNIQUE NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
bio TEXT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE photos (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT REFERENCES users(id),
image_url TEXT NOT NULL,
caption TEXT,
filter_type VARCHAR(30),
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE follows (
follower_id BIGINT REFERENCES users(id),
followee_id BIGINT REFERENCES users(id),
created_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (follower_id, followee_id)
);
CREATE TABLE likes (
photo_id BIGINT REFERENCES photos(id),
user_id BIGINT REFERENCES users(id),
created_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (photo_id, user_id)
);

This schema handled 1 million users without modification. By January 2011, Instagram had 2 million users. By June 2011: 5 million. The single PostgreSQL instance was still handling the load — but the engineering team could see the ceiling.
What Worked at <10M Users
- A single PostgreSQL primary with one read replica
- BIGSERIAL auto-increment IDs (simple, no coordination needed)
- Standard B-tree indexes on user_id and created_at
- Django ORM for all queries (100–500ms acceptable at this scale)
- PgBouncer in transaction pooling mode to handle connection overhead
Read Replica Phase (2012–2014)
By early 2012, Instagram had 10 million users and had been acquired by Facebook for $1 billion. The engineering team grew from 3 to 13 engineers. The single primary approach was failing:
- Read/write ratio: Instagram is extremely read-heavy — browsing the feed, loading a profile, viewing likes are all reads. Writes (posting a photo, adding a like) are infrequent. The primary was spending 80% of its capacity serving reads that did not need to come from the primary.
- VACUUM pressure: PostgreSQL's MVCC generates dead tuples that VACUUM must clean up. At high write rates, VACUUM could not keep up, causing table bloat and query plan degradation.
- Connection exhaustion: Each Django worker needed a database connection. At 100 workers × multiple servers, connection count exceeded PostgreSQL's default max_connections = 100.
The solution was standard: streaming replication with multiple hot standbys.
Replication lag became the first truly hard problem. When a user posted a photo, subsequent requests to load their profile came back from a replica that had not yet replicated the new photo (the WAL had been written but not yet applied to the standby). Instagram solved this with a combination of:
- Read-your-own-writes routing: POST requests to the primary; GET requests that immediately follow a POST also route to the primary for 2 seconds
- Lag monitoring: pg_stat_replication tracking; if replica lag exceeded 100ms, route those queries to primary
- Asynchronous replication: acceptable for most reads; only notifications and messaging required synchronous replication
Replication Lag is a Product Bug
A user who likes a photo and immediately reloads will see the like count unchanged if served from a lagging replica. At Instagram's scale, even 50ms of replication lag affects thousands of concurrent users. Plan your routing logic before you add replicas, not after users report the bug.
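The routing rules above can be sketched in Python. The 2-second sticky window and 100ms lag threshold come from the list above; `record_write`, `choose_backend`, and the in-process dict (standing in for a shared cache such as memcached) are illustrative names, not Instagram's actual code:

```python
import time

STICKY_WINDOW_S = 2.0  # route reads to primary for 2s after a write
LAG_LIMIT_MS = 100     # beyond this lag, skip the replica entirely

# user_id -> monotonic timestamp of the user's last write
# (in production this would live in a shared cache, not a process dict)
last_write_at = {}

def record_write(user_id):
    """Call after any write performed on behalf of user_id."""
    last_write_at[user_id] = time.monotonic()

def choose_backend(user_id, replica_lag_ms):
    """Pick 'primary' or 'replica' for a read request."""
    wrote_recently = (
        time.monotonic() - last_write_at.get(user_id, float("-inf"))
        < STICKY_WINDOW_S
    )
    if wrote_recently or replica_lag_ms > LAG_LIMIT_MS:
        return "primary"
    return "replica"
```

The lag check would be fed from `pg_stat_replication` by a monitoring loop; here it is simply passed in as a number.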
Sharding Strategy (2014–2016)
By 2013, Instagram had 100 million users and 55 million photos per day. Vertical scaling had been exhausted — the PostgreSQL primary was on the largest available machine. The team made the decision to shard.
Why User-ID as the Partition Key
The core question in any sharding design is: what is the primary partition key?
Instagram chose user_id for all core tables. The reasoning:
- Locality: All photos, likes, comments, and follows for a user live on the same shard. A profile page query hits one shard.
- Simplicity: The application router is a simple user_id % num_shards computation (with a lookup table for re-balancing flexibility).
- Predictability: User creation is the only operation that assigns a new shard; everything else follows the user.
The cost was cross-shard queries. "Show me all likes from users I follow on photos from users I follow" requires joining data across many shards. Instagram accepted this cost by denormalizing aggressively (more on this in the schema section).
Logical Shard Mapping
Instagram used a two-level mapping: logical shards → physical shards. Rather than mapping user_id directly to a server, they mapped to one of 2,000 logical shards, and logical shards to physical servers. This made re-balancing easier: moving a logical shard to a new physical server only required updating the mapping table, not rehashing all user IDs.
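A minimal sketch of the two-level mapping, assuming modulo placement and illustrative server names (`db0`…`db3`) rather than Instagram's real topology:

```python
NUM_LOGICAL_SHARDS = 2000

# logical shard -> physical server; in production this lives in a small
# config table or service, so moving a shard means updating one row
logical_to_physical = {ls: f"db{ls % 4}" for ls in range(NUM_LOGICAL_SHARDS)}

def logical_shard(user_id):
    return user_id % NUM_LOGICAL_SHARDS

def server_for_user(user_id):
    return logical_to_physical[logical_shard(user_id)]

# Rebalancing: move logical shard 17 to a new machine.
# One mapping update -- no user IDs are rehashed.
logical_to_physical[17] = "db9"
```

The point of the indirection is visible in the last line: capacity changes touch the mapping, never the hash function.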
ID Generation: The Snowflake-Inspired Scheme
BIGSERIAL auto-increment IDs do not work in a sharded system — each shard would generate its own sequence starting at 1, causing collisions. Instagram needed globally unique IDs that also encoded the shard they came from.
They built a PostgreSQL-based ID generator using a stored function. Each generated ID encodes:
Bit layout (64 bits total):
[41 bits: millisecond timestamp] [13 bits: logical shard ID] [10 bits: sequence]

- 41 bits of timestamp: milliseconds since a custom epoch in 2011 (the constant used in the function below). Gives ~69 years of IDs before overflow.
- 13 bits of shard ID: encodes which logical shard the row lives on. Allows 2^13 = 8,192 logical shards.
- 10 bits of sequence: allows 2^10 = 1,024 IDs per millisecond per shard before rollover.
-- Instagram's ID generation function (PostgreSQL)
CREATE OR REPLACE FUNCTION next_id(OUT result BIGINT) AS $$
DECLARE
our_epoch BIGINT := 1314220021721; -- Custom epoch (ms since Unix epoch)
seq_id BIGINT;
now_ms BIGINT;
shard_id INT := 5; -- Set per shard at install time
BEGIN
SELECT nextval('global_id_sequence') % 1024 INTO seq_id;
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_ms;
result := (now_ms - our_epoch) << 23; -- 41 bits
result := result | (shard_id << 10); -- 13 bits
result := result | (seq_id); -- 10 bits
END;
$$ LANGUAGE plpgsql;

This scheme has elegant properties:
- Sortable by time: IDs generated later are larger. ORDER BY id DESC is equivalent to ORDER BY created_at DESC — no timestamp column needed for ordering.
- Shard-encoded: Given any entity ID, you can extract the logical shard without a lookup: (id >> 10) & 0x1FFF.
- No coordination: Each shard generates IDs independently using its own sequence.
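For illustration, the same bit layout can be packed and unpacked in Python; `make_id` and `decode_id` are hypothetical helpers mirroring the SQL function above:

```python
EPOCH_MS = 1314220021721  # same custom epoch as the SQL function

def make_id(ms_since_epoch, shard_id, seq):
    # [41 bits timestamp][13 bits shard][10 bits sequence]
    return (ms_since_epoch << 23) | (shard_id << 10) | seq

def decode_id(id_):
    seq = id_ & 0x3FF                 # low 10 bits
    shard_id = (id_ >> 10) & 0x1FFF   # next 13 bits
    ts_ms = (id_ >> 23) + EPOCH_MS    # remaining 41 bits, plus epoch
    return ts_ms, shard_id, seq
```

This is how the application router can go from any photo or comment ID straight to the owning logical shard without a database lookup.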
ID Generation vs. Snowflake
Twitter's Snowflake system (2010) uses a similar 64-bit layout (41 timestamp bits, 10 worker ID bits, 12 sequence bits), with a worker ID in place of a shard ID. Instagram's adaptation uses 13 shard bits and 10 sequence bits, tuned for their 2,000-shard architecture. Both approaches produce 64-bit integers that fit in a PostgreSQL BIGINT column.
Schema Design Deep Dive
Feed Generation: Fan-Out Trade-offs
The Instagram feed — "show me the recent photos from everyone I follow" — is the most read-heavy query in the system. At scale, two strategies exist:
Fan-out on write (push model): When Alice posts a photo, immediately write that photo ID to the feed of every follower. A "feed" is a pre-materialized list of photo IDs per user.
Fan-out on read (pull model): When Bob loads his feed, query all users Bob follows, fetch their recent photos, merge and rank.
Instagram used fan-out on write for users with <10,000 followers and fan-out on read for celebrities (Kylie Jenner with 400M followers). Writing Kylie's photo to 400 million feed rows synchronously would take hours and consume enormous storage. Instead, celebrity posts are fetched on-demand and merged with the pre-materialized feed of followed non-celebrities at read time.
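The hybrid read path can be sketched as a merge of two id-descending streams: the pre-materialized feed (fan-out-on-write) and on-demand celebrity posts (fan-out-on-read). Because Instagram-style IDs sort by time, larger ID means newer post. `load_feed` and its inputs are illustrative names:

```python
import heapq

def load_feed(materialized_ids, celebrity_recent_ids, limit=20):
    """Merge pre-materialized and on-demand post IDs, newest first."""
    merged = heapq.merge(
        sorted(materialized_ids, reverse=True),     # fan-out-on-write feed
        sorted(celebrity_recent_ids, reverse=True), # fetched at read time
        reverse=True,
    )
    out = []
    for post_id in merged:
        out.append(post_id)
        if len(out) == limit:
            break
    return out
```

In a real system the two inputs would come from the feed store and from per-celebrity recent-post queries; the merge logic is the same.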
Likes: Denormalized Count
The naive schema stores one row per like and computes COUNT(*) for the like count:
-- Naive: expensive COUNT(*) at read time
SELECT COUNT(*) FROM likes WHERE photo_id = ?;

At billions of likes, this query is slow even with an index. Instagram maintains a denormalized count directly on the photo row:
-- Instagram-style schema: denormalized count
ALTER TABLE photos ADD COLUMN like_count BIGINT DEFAULT 0;
-- On like insert:
UPDATE photos SET like_count = like_count + 1 WHERE id = ?;
INSERT INTO likes (photo_id, user_id, created_at) VALUES (?, ?, NOW());
-- On like read: instant, no aggregation
SELECT like_count FROM photos WHERE id = ?;

The cost: like_count can drift from the true COUNT(likes) if the UPDATE and INSERT are not atomic. Instagram wraps both in a transaction and runs periodic reconciliation jobs to correct drift.
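The transaction-plus-reconciliation pattern can be sketched with Python's stdlib sqlite3 standing in for PostgreSQL (schema simplified; `add_like` and `reconcile` are illustrative names):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE photos (id INTEGER PRIMARY KEY, like_count INTEGER DEFAULT 0);
    CREATE TABLE likes (photo_id INTEGER, user_id INTEGER,
                        PRIMARY KEY (photo_id, user_id));
    INSERT INTO photos (id) VALUES (1);
""")

def add_like(photo_id, user_id):
    # One transaction: the counter and the like row move together.
    with con:
        con.execute(
            "UPDATE photos SET like_count = like_count + 1 WHERE id = ?",
            (photo_id,))
        con.execute(
            "INSERT INTO likes (photo_id, user_id) VALUES (?, ?)",
            (photo_id, user_id))

def reconcile(photo_id):
    # Periodic job: reset the counter to the true count, correcting drift.
    with con:
        con.execute(
            """UPDATE photos
               SET like_count = (SELECT COUNT(*) FROM likes
                                 WHERE photo_id = ?)
               WHERE id = ?""",
            (photo_id, photo_id))
```

The `with con:` block commits both statements atomically or rolls both back, which is exactly the property that keeps the counter honest.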
Stories: Ephemeral Data with TTL
Stories disappear after 24 hours. The naive approach — a deleted_at column with a background job — accumulates rows and requires VACUUM to reclaim space. Instagram uses a more aggressive approach:
- Stories live in a separate table partitioned by created_at (daily partitions)
- A daily cron drops the previous day's partition: DROP TABLE stories_2024_01_15
- Dropping a partition is instantaneous — no row-by-row deletion, no VACUUM needed
-- Partitioned stories table
CREATE TABLE stories (
id BIGINT NOT NULL,
user_id BIGINT NOT NULL,
media_url TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (created_at);
-- Daily partitions created automatically via pg_partman
CREATE TABLE stories_2024_01_16
PARTITION OF stories
FOR VALUES FROM ('2024-01-16') TO ('2024-01-17');
-- Expiry: drop yesterday's partition
DROP TABLE stories_2024_01_15; -- Instant, reclaims all disk space

Django ORM at Scale
Django's ORM was Instagram's primary query interface throughout its growth. At scale, several patterns emerged:
Custom Database Router
Django's database router allows directing queries to specific databases:
# Instagram-style Django DB router (sketch)
class InstagramDBRouter:
    def db_for_read(self, model, **hints):
        user_id = hints.get('user_id')
        if user_id:
            shard = user_id % 2000
            return f'shard_{shard_to_server[shard]}'  # logical -> physical map
        return 'replica_default'

    def db_for_write(self, model, **hints):
        user_id = hints.get('user_id')
        if user_id:
            shard = user_id % 2000
            return f'shard_{shard_to_server[shard]}'
        return 'primary'

    def allow_relation(self, obj1, obj2, **hints):
        # Relations only allowed within the same shard (same database alias)
        return obj1._state.db == obj2._state.db

ORM Limitations Discovered at Scale
- N+1 queries: Django's ORM generates N+1 SQL queries when iterating over related objects. select_related() and prefetch_related() are mandatory for any query touching multiple tables.
- No shard awareness: The ORM does not know about sharding — the router must be injected at every query site.
- Long transactions: Django's default transaction model wraps each request in a transaction. On heavily contended rows (like counters), this held locks too long. Instagram moved hot counters to Redis.
Django ORM and Connection Pooling
Django opens a new database connection per request by default. At 500 application servers × 20 threads = 10,000 connections, this overwhelmed PostgreSQL's max_connections. PgBouncer in transaction pooling mode — which multiplexes many client connections onto a small pool of server connections — is non-negotiable at Instagram's scale. However, transaction pooling is incompatible with prepared statements; Instagram had to disable them.
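A minimal pgbouncer.ini sketch of this setup; the host, database name, and pool sizes are illustrative, not Instagram's values:

```ini
[databases]
instagram = host=10.0.0.5 port=5432 dbname=instagram

[pgbouncer]
pool_mode = transaction    ; multiplex clients per-transaction
max_client_conn = 10000    ; many Django workers connect here
default_pool_size = 50     ; few real PostgreSQL connections
```

The ratio between `max_client_conn` and `default_pool_size` is the whole point: thousands of application connections share a few dozen server connections.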
Current Architecture (2024)
Instagram's 2024 database architecture has evolved far beyond the original sharded PostgreSQL setup:
Key architectural facts from Instagram engineering blog posts (2019–2023):
- Thousands of PostgreSQL shards serve user profile data, media metadata, and comments
- TAO (Facebook's social graph storage system) handles the follow graph — a workload that outgrew PostgreSQL's JOIN performance
- Memcached provides L1 caching per data center; cache hit rates above 99% for profile reads
- Redis stores hot feed lists, counters, and session data
- PostgreSQL's role has narrowed to: structured user data that requires ACID guarantees
Key Lessons
| Lesson | Detail |
|---|---|
| PostgreSQL scales further than expected | Instagram ran a single PG instance to 10M users, then replicas to 100M users — longer than any prediction |
| Shard early by user_id | The partition key decision is the hardest one to change later; user_id gives excellent locality for social data |
| ID scheme should encode shard | Instagram's Snowflake-inspired IDs allow shard routing without a lookup table |
| Denormalize for reads | like_count on the photo row avoids expensive COUNT(*) aggregations |
| Partitioned tables for ephemeral data | Dropping a partition is 1,000× faster than deleting rows for TTL data |
| Connection pooling is non-negotiable | PgBouncer in transaction mode is required at any scale above ~100 application servers |
| Fan-out strategy must handle celebrity accounts | Pure fan-out-on-write fails when some users have 100M+ followers |
Common Mistakes
| Mistake | Why It Happens | Impact | Fix |
|---|---|---|---|
| Copying Instagram's sharding strategy without their scale | "They shard by user_id, so should we" | Sharding at 10K users adds massive complexity with no benefit | Only shard when you have exhausted read replicas, vertical scaling, and partitioning — Instagram didn't shard until 100M+ users |
| Fan-out-on-write without capping follower count | "Push model is simpler for feed reads" | Posting for a celebrity with 100M followers fans out 100M writes | Use hybrid model: fan-out for users < 1M followers, pull for high-follower accounts at read time |
| Storing like counts only in the aggregate column | "Counting is expensive, cache it" | Concurrent likes on viral posts cause lost-update races under extreme load | Buffer hot counters in Redis with INCR; flush to PostgreSQL periodically |
| Not adapting Django ORM queries to shard-aware routing | ORM generates standard SQL | Cross-shard queries silently scatter; performance degrades | Implement a custom Django database router that enforces shard-key-aware queries |
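The Redis-buffered counter fix from the table above can be sketched as follows; a plain dict stands in for Redis INCR, and `apply_update` is a stub for the periodic PostgreSQL UPDATE:

```python
from collections import defaultdict

# photo_id -> buffered like delta (Redis: INCR like_buffer:<photo_id>)
pending = defaultdict(int)

def incr_like(photo_id):
    """Hot path: bump the in-memory/Redis counter, no DB write."""
    pending[photo_id] += 1

def flush(apply_update):
    """Periodic job: one UPDATE per photo instead of one per like.

    apply_update(photo_id, delta) stands in for:
      UPDATE photos SET like_count = like_count + delta WHERE id = photo_id
    """
    for photo_id, delta in list(pending.items()):
        apply_update(photo_id, delta)
        del pending[photo_id]
```

A viral post receiving 10,000 likes between flushes turns into a single row update instead of 10,000 contended ones.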
Related Chapters
| Chapter | Relevance |
|---|---|
| Ch02 — Data Modeling for Scale | Access-pattern schema design used throughout Instagram's evolution |
| Ch03 — Indexing Strategies | Index choices for user_id, created_at, and composite keys |
| Ch04 — Transactions & Concurrency | MVCC and locking behavior relevant to like counter contention |
| Ch05 — PostgreSQL in Production | WAL, VACUUM, and connection pooling internals |
| Ch10 — Sharding & Partitioning | Horizontal sharding patterns applied here |
| Ch16 — Database Selection Framework | How to apply Instagram's lessons to your own selection decision |
Practice Questions
Beginner
ID Generation: Instagram's ID scheme uses 41 bits for timestamp, 13 for shard ID, and 10 for sequence. How many IDs per millisecond can a single shard generate? What is the maximum number of logical shards the scheme supports? When will IDs overflow the 64-bit integer?
Model Answer
10 sequence bits = 2^10 = 1,024 IDs per millisecond per shard. 13 shard ID bits = 2^13 = 8,192 logical shards. 41 timestamp bits in milliseconds = 2^41 ms ≈ 69 years from the custom epoch (set in 2011, so overflow around 2080).

Read Replicas: A user posts a photo and immediately reloads their profile. Their new photo does not appear. What is the likely cause, and how would you fix it in the application routing layer?
Model Answer
The profile read was routed to a replica that had not yet replicated the new photo row. The fix: read-your-own-writes routing — for a short window after a write (or permanently for the posting user), route profile reads to the primary.
Intermediate
Fan-Out Design: A new social platform has 95% of users with fewer than 1,000 followers, and 5% of users ("influencers") with more than 1 million followers. Design a hybrid feed system that uses fan-out-on-write for regular users and fan-out-on-read for influencers. What threshold would you use, and how does the application know which strategy to apply at read time?
Model Answer
Store follower_count on the user row, updated on every follow/unfollow. At feed generation time, check if the followee is "celebrity" (follower_count > threshold, e.g., 50,000). Fetch celebrity posts inline at read time and merge with the pre-materialized feed. The threshold is a tuning parameter: lower reduces storage for pre-materialized feeds, higher increases read-time fan-out cost.

Sharding Key Analysis: Instagram chose user_id as the sharding key. What queries become efficient? What queries become expensive or impossible without cross-shard joins? Give three examples of each.

Model Answer
Efficient: fetch all photos by user, fetch user profile + followers, load user's feed (if feed is stored by user_id). Expensive: "find all photos tagged #sunset" (cross-shard full scan), "find all users who liked photo X" (likes distributed by photo's user shard, not liker's shard), "recommended follows based on mutual friends" (graph traversal across shards).
Advanced
Migration Planning: Instagram had 100M rows in the photos table on a single PostgreSQL instance and needed to migrate to a sharded system with 2,000 logical shards. Design the migration process: how do you move data without downtime? How do you handle writes that come in during the migration? How do you validate that all rows were migrated correctly?

Model Answer
Phase 1: Dual-write — all new writes go to both old DB and new shard. Phase 2: Backfill — copy historical data shard by shard, using a cursor on photo_id to avoid re-processing. Phase 3: Validation — compare row counts and checksums per logical shard. Phase 4: Read cut-over — gradually shift read traffic to shards, starting with lowest traffic shards. Phase 5: Stop writing to old DB. The trickiest part is handling updates to rows still being backfilled — use a replication lag check or a "migration_complete" flag per logical shard.
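A sketch of the dual-write and per-shard cut-over phases from the answer above; every name here (the writer callbacks, the modulo shard map, the completion set) is illustrative:

```python
# Logical shards whose backfill and validation have finished.
migration_complete = set()

def write_photo(photo, old_db_write, shard_write):
    """Phase 1 (dual-write): every new write lands in both systems.

    Writes are assumed idempotent (keyed by photo id), so a retry
    after a partial failure is safe.
    """
    old_db_write(photo)   # legacy single instance stays canonical
    shard_write(photo)    # new sharded cluster receives the same row

def read_photo(photo_id, old_db_read, shard_read, shard_of):
    """Phase 4 (read cut-over): shift reads shard by shard."""
    if shard_of(photo_id) in migration_complete:
        return shard_read(photo_id)
    return old_db_read(photo_id)
```

Marking a logical shard in `migration_complete` is the per-shard cut-over switch; until then, reads keep coming from the old database even though it is being dual-written.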
References
- Sharding & IDs at Instagram — Instagram Engineering Blog (2012)
- Storing Hundreds of Millions of Simple Key-Value Pairs in Redis — Instagram Engineering Blog (2012)
- What Powers Instagram: Hundreds of Instances, Dozens of Technologies — Instagram Engineering Blog (2011)
- TAO: Facebook's Distributed Data Store for the Social Graph — Bronson et al., USENIX ATC (2013)
- Snowflake: Generating Unique IDs in High Scale Systems — Twitter Engineering Blog (2010)
- PgBouncer Documentation — Connection pooling modes and trade-offs
- "Designing Data-Intensive Applications" — Kleppmann, Chapter 6 (Partitioning)
