
Chapter 4: Back-of-Envelope Estimation


Mind Map

Overview

Back-of-envelope estimation is the art of quickly approximating the scale of a system using simple math and a handful of memorized reference numbers. Before investing hours designing a distributed database, you should spend five minutes confirming that your design is even necessary, or whether a single Postgres instance will handle the load just fine.

This chapter is a reference chapter. Return to it every time you start a new system design problem. The numbers here feed directly into every case study in Part 4.

DAU/MAU Numbers Change Frequently

The user counts in worked examples below reflect approximate figures as of 2024. Exact DAU/MAU numbers change quarterly; what matters for estimation exercises is the technique and order-of-magnitude reasoning, not the precise input values. In interviews, ask your interviewer for the DAU assumption or state your own clearly.

As described in Chapter 3 (Core Trade-offs), estimation is Step 2 of the interview framework: you clarify requirements, then you estimate scale before drawing a single box.

Why interviewers care: Estimation reveals whether you think at system scale or algorithm scale. An engineer who says "we'll need about 150 TB/day of storage for media" is thinking like a systems engineer. One who says "it depends" is not.


Why Estimation Matters

1. Validates Design Feasibility

A 30-second calculation can prevent 30 minutes of wasted design. If your estimated QPS is 50, you do not need sharding. If it is 500,000, you do.

2. Guides Architecture Decisions

| Estimated QPS | Implication |
| --- | --- |
| < 1,000 | Single server, vertical scaling |
| 1,000 – 10,000 | Load balancer + a few app servers |
| 10,000 – 100,000 | Caching layer mandatory, read replicas |
| 100,000+ | Horizontal sharding, CDN, async processing |

3. Prevents Over/Under-Engineering

  • Under-engineering: Building a single-server app for a system that needs to handle 50,000 QPS; it crashes on day one.
  • Over-engineering: Deploying a 20-node Kafka cluster for a system with 100 users/day; wasted cost and complexity.

Powers of 2 – Reference Table

Everything in computing is binary. These are the numbers you must know without thinking.

| Power | Exact Value | Approximate | Storage Name |
| --- | --- | --- | --- |
| 2^10 | 1,024 | ~1 Thousand | 1 KB |
| 2^20 | 1,048,576 | ~1 Million | 1 MB |
| 2^30 | 1,073,741,824 | ~1 Billion | 1 GB |
| 2^40 | 1,099,511,627,776 | ~1 Trillion | 1 TB |
| 2^50 | 1,125,899,906,842,624 | ~1 Quadrillion | 1 PB |

Practical shortcuts:

  • 1 KB = 1,000 bytes (close enough for estimates)
  • 1 MB = 1,000 KB = 1 million bytes
  • 1 GB = 1,000 MB = 1 billion bytes
  • 1 TB = 1,000 GB = 1 trillion bytes
  • 1 PB = 1,000 TB; for a sense of scale, the YouTube example later in this chapter estimates roughly half a PB of new video ingested per day

Memory aid: Each step multiplies by roughly 1,000. Going KB → MB → GB → TB → PB is ×1,000 each time.
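
The "roughly ×1,000 per step" shortcut can be sanity-checked in a few lines. This loop is only an illustration of the table above, not part of any estimation workflow:

```python
# Compare each power of 2 against its decimal approximation from the table.
for power, name in [(10, "KB"), (20, "MB"), (30, "GB"), (40, "TB"), (50, "PB")]:
    exact = 2 ** power
    approx = 10 ** (3 * power // 10)   # 1 thousand, 1 million, ...
    error_pct = (exact - approx) / approx * 100
    print(f"2^{power} = {exact:,} (1 {name}), ~{error_pct:.1f}% above 10^{3 * power // 10}")
```

The error grows from about 2.4% at KB to about 12.6% at PB, which is why "close enough for estimates" holds.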


Latency Numbers Every Programmer Should Know (2008 Baseline)

Originally published by Jeff Dean (Google, ~2008). These numbers are approximate but stable enough for estimation exercises. Modern hardware has improved many of these values; memorize the order of magnitude, not the exact value.

| Operation | Latency | Notes |
| --- | --- | --- |
| L1 cache reference | 0.5 ns | Fastest memory access |
| Branch mispredict | 5 ns | CPU pipeline flush |
| L2 cache reference | 7 ns | 14× slower than L1 |
| Mutex lock/unlock | 100 ns | Contended lock cost |
| Main memory reference | 100 ns | DRAM access |
| Compress 1 KB (Snappy) | 3 µs | 3,000 ns |
| Send 1 KB over 1 Gbps network | 10 µs | Local network |
| Read 4 KB randomly from SSD | 150 µs | Random I/O is expensive |
| Read 1 MB sequentially from memory | 250 µs | Sequential is fast |
| Round trip within same datacenter | 500 µs | Intra-DC latency |
| Read 1 MB sequentially from SSD | 1 ms | 1,000 µs |
| HDD seek | 10 ms | Mechanical seek time |
| Read 1 MB sequentially from HDD | 20 ms | Sequential but slow disk |
| Send packet CA → Netherlands → CA | 150 ms | Cross-continent RTT |

Latency Scale Visualization

Key Takeaways from the Latency Table

  1. Memory is ~1,500× faster than SSD for random reads (100 ns vs 150 µs)
  2. SSD is roughly 65× faster than HDD for random access (150 µs vs ~10 ms seek)
  3. Avoid network round trips inside hot code paths; even intra-DC costs 500 µs
  4. Sequential access beats random access by 10–100× on both SSD and HDD
  5. Cross-continent latency is irreducible; physics sets a floor of ~100 ms

Historical Note

These are the classic latency numbers originally compiled by Jeff Dean (~2008) and widely used in system design interviews. Modern hardware has improved several of these values significantly (e.g., mutex lock/unlock is now ~17-25ns, network serialization is faster). The original values remain the standard reference for estimation exercises and interviews. See the community-maintained gist for updated figures.


QPS (Queries Per Second) Estimation

Formula

Average QPS = DAU × Average Queries Per User Per Day ÷ 86,400
Peak QPS    = Average QPS × 2 to 3

Where:

  • DAU = Daily Active Users
  • 86,400 = seconds per day (60 × 60 × 24)
  • Peak multiplier = 2–3× is a common default for consumer apps, but varies significantly by domain: news/media sites may spike 50–100× during breaking events, while banking apps typically peak at only 1.2–1.5× average. Always research domain-specific patterns.
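
The formula translates directly into code. A minimal sketch (the function name and default peak multiplier are my choices, not the chapter's):

```python
SECONDS_PER_DAY = 86_400

def estimate_qps(dau: float, queries_per_user_per_day: float,
                 peak_multiplier: float = 3.0) -> tuple[float, float]:
    """Return (average_qps, peak_qps) using the formula above."""
    average_qps = dau * queries_per_user_per_day / SECONDS_PER_DAY
    return average_qps, average_qps * peak_multiplier

# Twitter read-load figures from the worked example: 500M DAU, 10 timeline loads/day.
avg, peak = estimate_qps(500e6, 10)
print(f"average ≈ {avg:,.0f} QPS, peak ≈ {peak:,.0f} QPS")
# average ≈ 57,870 QPS, peak ≈ 173,611 QPS
```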

Worked Example: Twitter Read QPS

Assumptions:

  • ~500 million DAU (approximate as of 2023; exact figures vary by source)
  • Each user reads their timeline ~10 times per day
  • Average of 20 tweets shown per timeline load

Calculation:

Timeline loads per day = 500M × 10 = 5 billion
Reads per second (avg) = 5,000,000,000 ÷ 86,400 ≈ 57,870 QPS
Peak QPS               = 57,870 × 3 ≈ 174,000 QPS

What this tells us: Twitter/X needs to serve ~174,000 read QPS at peak. This immediately implies caching is mandatory: no conventional database can comfortably serve 100K+ QPS of live queries without a cache layer in front of it.

Quick QPS Conversions

| Requests Per Day | Approx QPS |
| --- | --- |
| 1 million/day | ~12 QPS |
| 10 million/day | ~116 QPS |
| 100 million/day | ~1,160 QPS |
| 1 billion/day | ~11,600 QPS |
| 10 billion/day | ~115,700 QPS |

Memory shortcut: 1 million requests/day ≈ 12 QPS. Scale linearly from there.
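
The table can be regenerated with one loop, which is also a quick way to rebuild it from memory (the table above rounds some entries slightly upward):

```python
# requests/day ÷ 86,400 seconds ≈ QPS
for per_day in (1e6, 1e7, 1e8, 1e9, 1e10):
    print(f"{per_day:>16,.0f}/day ≈ {per_day / 86_400:>9,.0f} QPS")
```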


Storage Estimation

Formula

Daily Storage    = DAU × Data Generated Per User Per Day
Total Storage    = Daily Storage × Retention Period (days)
With Replication = Total Storage × Replication Factor (3×)
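
As a sketch, the three formula lines collapse into one function. The function name and the example inputs below are illustrative, not taken from the chapter:

```python
def estimate_storage_tb(dau: float, bytes_per_user_per_day: float,
                        retention_days: int, replication: int = 3) -> float:
    """Replicated storage footprint in TB (decimal units: 1 TB = 1e12 bytes)."""
    daily_bytes = dau * bytes_per_user_per_day
    return daily_bytes * retention_days * replication / 1e12

# Illustrative inputs: 100M DAU generating 1 MB each per day, retained 1 year.
print(estimate_storage_tb(100e6, 1e6, 365))  # ~109,500 TB (~110 PB)
```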

Data Size Reference

| Data Type | Typical Size |
| --- | --- |
| Tweet / short text post | 280 chars ≈ 300 bytes |
| User metadata record | ~1 KB |
| Profile photo (thumbnail) | ~10 KB |
| Photo (compressed JPEG) | ~200 KB – 2 MB |
| Short video (1 min, 720p) | ~50 MB |
| Video (1 hour, 1080p) | ~2 GB |

Worked Example: Instagram Photo Storage

Assumptions:

  • ~1.3 billion DAU (approximate as of 2024)
  • 10% of users post one photo per day = 130 million photos/day
  • Average photo size after compression: 300 KB
  • Thumbnails generated: 3 sizes × 20 KB = 60 KB per photo
  • Metadata per photo: 1 KB

Calculation:

Photo data/day  = 130M × 300 KB = 39,000,000,000 KB = ~39 TB/day
Thumbnail data  = 130M × 60 KB  = 7,800,000,000 KB  = ~8 TB/day
Metadata/day    = 130M × 1 KB   = 130,000,000 KB    = ~130 GB/day

Total raw/day       ≈ 47 TB/day
With 3× replication = 141 TB/day
5-year total        = 47 TB × 365 × 5 × 3 = ~257 PB

What this tells us: Instagram-scale photo storage demands dedicated object storage (S3-equivalent), not block storage. At ~257 PB over 5 years, the cost alone justifies aggressive compression and tiered storage strategies.


Bandwidth Estimation

Formula

Outbound Bandwidth = Read QPS × Average Response Size
Inbound Bandwidth  = Write QPS × Average Request Size
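
A small helper makes the unit handling explicit. The function name is mine; it uses the decimal KB/MB convention from earlier in the chapter:

```python
def bandwidth_mb_per_s(qps: float, avg_size_kb: float) -> float:
    """QPS × average payload size (KB) → bandwidth in MB/s (decimal units)."""
    return qps * avg_size_kb / 1000

# Twitter read-path figures from this chapter: 57,870 QPS, ~6 KB per response.
print(f"{bandwidth_mb_per_s(57_870, 6):,.0f} MB/s")  # 347 MB/s
```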

Worked Example: Twitter Bandwidth

Assumptions (continuing from QPS example above):

  • Read QPS: 57,870 (average), 174,000 (peak)
  • Average timeline response: 20 tweets × 300 bytes = 6,000 bytes = ~6 KB

Calculation:

Average outbound = 57,870 QPS × 6 KB  = 347,220 KB/s ≈ 347 MB/s
Peak outbound    = 174,000 QPS × 6 KB = 1,044,000 KB/s ≈ 1 GB/s

What this tells us: At ~1 GB/s peak egress, Twitter/X's network infrastructure must handle ~8 Gbps of outbound traffic from timeline endpoints alone. CDN caching of popular content is essential to reduce origin server load.


Worked Example: Twitter Storage Estimation (Full Walkthrough)

This step-by-step walkthrough shows how to chain assumptions into a complete estimate.

Estimation Process

Step-by-Step Calculation

Step 1: State assumptions clearly

  • Daily Active Users (DAU): ~500 million
  • Tweets posted per day: ~800 million
  • Average tweet: 280 characters of text + 100 bytes metadata = ~300 bytes total
  • 10% of tweets contain one image (average 200 KB after compression)
  • 1% of tweets contain a video (average 2 MB for short video)
  • Data retained: 5 years

Step 2: Text storage

Text per day = 800M tweets × 300 bytes
             = 240,000,000,000 bytes
             = 240 GB/day

Step 3: Image storage

Tweets with images = 800M × 10%   = 80 million
Image storage/day  = 80M × 200 KB = 16,000,000,000 KB
                   = 16 TB/day

Step 4: Video storage

Tweets with video = 800M × 1% = 8 million
Video storage/day = 8M × 2 MB = 16,000,000 MB
                  = 16 TB/day

Step 5: Metadata (user data, indexes, etc.)

Metadata overhead ≈ 20% of total = ~6 TB/day (rough)

Step 6: Sum daily total

Text:      0.24 TB/day
Images:   16    TB/day
Video:    16    TB/day
Metadata:  6    TB/day
─────────────────────────
Total:   ~38.24 TB/day ≈ 40 TB/day

Step 7: Apply replication factor

Storage with 3× replication = 40 TB × 3 = 120 TB/day

Step 8: Project over 5 years

5-year storage = 120 TB/day × 365 days × 5 years
               = 120 × 1,825
               = 219,000 TB
               ≈ 219 PB

Conclusion: Twitter/X needs approximately 219 petabytes of storage over 5 years. This demands a distributed object storage system (like S3 or HDFS), not a relational database. Media delivery via CDN is mandatory: serving 32 TB/day of media from origin servers alone is not viable.
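
For practice, the eight steps above can be replayed as a short script. This sketch keeps full precision instead of rounding the daily total up to 40 TB, so it lands slightly under the chapter's 219 PB:

```python
# Replay Steps 1-8 with the same assumptions, in decimal units (1 TB = 1e12 bytes).
TWEETS_PER_DAY = 800e6

text_tb     = TWEETS_PER_DAY * 300 / 1e12            # ~300 bytes per tweet
image_tb    = TWEETS_PER_DAY * 0.10 * 200e3 / 1e12   # 10% carry a ~200 KB image
video_tb    = TWEETS_PER_DAY * 0.01 * 2e6 / 1e12     # 1% carry a ~2 MB video
metadata_tb = 0.20 * (text_tb + image_tb + video_tb) # ~20% overhead

daily_tb      = text_tb + image_tb + video_tb + metadata_tb
replicated_tb = daily_tb * 3                          # 3x replication
five_year_pb  = replicated_tb * 365 * 5 / 1000

print(f"daily raw ≈ {daily_tb:.1f} TB, replicated ≈ {replicated_tb:.0f} TB/day")
print(f"5-year total ≈ {five_year_pb:.0f} PB")
```

Without the intermediate round-up this gives ~212 PB instead of 219 PB; at estimation granularity the two answers are the same.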


Worked Example: YouTube Bandwidth Estimation

Assumptions

  • Monthly Active Users (MAU): ~2.7 billion (as of 2024)
  • DAU ≈ 30% of MAU = ~800 million
  • Average videos watched per DAU per day: 5 videos
  • Average video duration: 5 minutes
  • Video quality: blended average ~5 Mbps (mix of 720p, 1080p, and 4K streams)
  • Upload rate: 500 hours of video uploaded every minute

Step-by-Step Calculation

Step 1: Daily video watch hours

Video views/day   = 800M DAU × 5 videos = 4 billion views/day
Watch minutes/day = 4B × 5 min          = 20 billion minutes/day
Watch hours/day   = 20B ÷ 60            = ~333 million hours/day

Step 2: Outbound bandwidth (streaming)

Bandwidth per stream = 5 Mbps (blended average) = 5,000,000 bits/s
Concurrent viewers   = 333M hours/day ÷ 24 hours
                     = ~13.9M concurrent viewers (average)

Average outbound BW  = 13.9M × 5 Mbps
                     = 69.5 Tbps (terabits per second, average)

Peak outbound BW     = avg × 3 (peak-hour multiplier)
                     ≈ 208 Tbps

Step 3: Inbound bandwidth (uploads)

Upload rate = 500 hours of video/minute
            = 500 × 60
            = 30,000 minutes of video per minute

At blended 5 Mbps per stream:
Inbound BW = 30,000 min/min × 5 Mbps
           = 150,000 Mbps
           = 150 Gbps upload ingestion bandwidth

Step 4: Storage for new uploads per day

New video/day = 500 hrs/min × 60 min/hr × 24 hrs
              = 720,000 hours of video/day

At blended quality (~800 MB/hour compressed):
Storage/day   = 720,000 hrs × 800 MB
              = 576,000,000 MB
              ≈ 576 TB/day of new video

Conclusion: YouTube's bandwidth requirements (~200+ Tbps peak outbound) make it one of the largest consumers of internet bandwidth globally. At this scale, YouTube must operate its own CDN infrastructure (Google Global Cache), peering directly with ISPs. No third-party CDN can handle this volume cost-effectively.
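
The streaming side of this estimate can be re-derived in a few lines. All constants below are the chapter's assumptions, not official YouTube figures:

```python
# Re-derive the outbound bandwidth estimate without intermediate rounding.
DAU = 800e6                 # assumed daily active users
VIDEOS_PER_USER = 5         # videos watched per DAU per day
MINUTES_PER_VIDEO = 5
BITRATE_MBPS = 5            # blended average stream bitrate

watch_hours_per_day = DAU * VIDEOS_PER_USER * MINUTES_PER_VIDEO / 60
concurrent_viewers = watch_hours_per_day / 24          # average, not peak
avg_outbound_tbps = concurrent_viewers * BITRATE_MBPS / 1e6
peak_outbound_tbps = avg_outbound_tbps * 3             # peak-hour multiplier

print(f"~{concurrent_viewers / 1e6:.1f}M concurrent viewers")
print(f"avg {avg_outbound_tbps:.1f} Tbps, peak {peak_outbound_tbps:.0f} Tbps")
```

At full precision the average comes out to 69.4 Tbps rather than the 69.5 Tbps obtained from the rounded 13.9M figure; either is fine at this granularity.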


Common Estimation Mistakes

1. Forgetting Replication

Storage estimates are for raw data. In production, you replicate data 3× (minimum) for durability.

Raw storage: 10 TB
With 3× replication: 30 TB  ← always use this number for cost/capacity planning

2. Ignoring Metadata Overhead

Databases, file systems, and object stores all add metadata: indexes, checksums, tombstones, headers. Add 10–30% overhead to any storage estimate.

3. Confusing Peak vs Average

Average QPS is what you calculate. But you must provision for peak QPS (2–3× average). A system that handles average load but crashes at peak is a failed design.

4. Confusing Bits and Bytes

Network bandwidth is measured in bits. Storage is measured in bytes.

1 Gbps network  = 1 gigabit per second
                = 125 megabytes per second (MB/s)

Rule: Divide bits by 8 to get bytes. When someone says "we have a 1 Gbps pipe," they mean ~125 MB/s of actual data throughput.
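 
The rule is mechanical enough to encode. A tiny illustrative helper (the function name is mine):

```python
def gbps_to_mb_per_s(gbps: float) -> float:
    """Convert link speed in gigabits/s to throughput in megabytes/s (÷8, ×1,000)."""
    return gbps * 1000 / 8

print(gbps_to_mb_per_s(1))   # 125.0
print(gbps_to_mb_per_s(10))  # 1250.0
```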

5. Ignoring Growth Rate

A system handling 1,000 QPS today may need to handle 10,000 QPS in 18 months. Always ask: what is the expected growth rate? Commonly 2–3× per year for fast-growing products.

6. Treating All Operations as Equal

A "write" to a database is not the same cost as a "read." Writes typically require quorum confirmation across replicas, making them 5–10× more expensive. Separate your read QPS from write QPS in estimates.


Estimation Cheat Sheet

QPS Quick Reference

| Requests Per Day | QPS |
| --- | --- |
| 100K/day | ~1 QPS |
| 1M/day | ~12 QPS |
| 10M/day | ~115 QPS |
| 100M/day | ~1,160 QPS |
| 1B/day | ~11,574 QPS |
| 10B/day | ~115,740 QPS |

Bandwidth Quick Reference

| QPS × Response Size | Bandwidth |
| --- | --- |
| 1,000 QPS × 1 KB | 1 MB/s |
| 10,000 QPS × 1 KB | 10 MB/s |
| 100,000 QPS × 1 KB | 100 MB/s |
| 10,000 QPS × 100 KB | 1 GB/s |

Storage Quick Reference

| Daily | Monthly | Yearly | 5-Year |
| --- | --- | --- | --- |
| 1 GB/day | ~30 GB | ~365 GB | ~1.8 TB |
| 100 GB/day | ~3 TB | ~36 TB | ~182 TB |
| 1 TB/day | ~30 TB | ~365 TB | ~1.8 PB |
| 10 TB/day | ~300 TB | ~3.6 PB | ~18 PB |

Multiplication Shortcuts

  • ×1,000 = KB → MB → GB → TB → PB
  • ÷86,400 = requests/day → QPS (or use ÷100,000 for a fast rough estimate)
  • ×3 = raw storage → replicated storage
  • ×2–3 = average QPS → peak QPS
  • ÷8 = bits → bytes (for network bandwidth)
  • ×1.2–1.3 = add metadata overhead to storage
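
One sanity check worth doing once: how far off is the ÷100,000 shortcut from the exact ÷86,400? The numbers below are illustrative:

```python
# The ÷100,000 shortcut undershoots the exact ÷86,400 by roughly 14%,
# well within order-of-magnitude tolerance.
requests_per_day = 1e9
exact = requests_per_day / 86_400
rough = requests_per_day / 100_000
print(f"exact ≈ {exact:,.0f} QPS, shortcut ≈ {rough:,.0f} QPS")
```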

Key Takeaway: Back-of-envelope estimation is a practiced skill, not a talent. Memorize the reference tables, internalize the formulas, and practice on real systems. The goal is not precision; it is order-of-magnitude correctness that guides architectural decisions.


Estimation Process Diagrams

The following diagrams show the process and calculation trees for the four core estimation types. Use them as a repeatable mental model for every new problem.

How to Approach Any Estimation Problem

QPS Estimation Tree

Start from DAU and decompose into per-service query rates.

Storage Estimation Tree

Bandwidth Estimation Tree


Related Chapters

| Chapter | Relevance |
| --- | --- |
| Ch02 – Scalability | Estimation feeds directly into scalability planning |
| Ch25 – Interview Framework | Estimation is Step 2 of the 4-step interview framework |
| Ch18 – URL Shortener | Classic estimation walkthrough: QPS, storage, bandwidth |

Practice Questions

Attempt each estimate before reading the hint. Write your assumptions explicitly before calculating.

Beginner

  1. Instagram Storage Estimation: Estimate how much new storage Instagram requires per year, given ~1.3B DAU. State all assumptions (posting rate, photo/video mix, average file sizes, replication factor) before calculating. What is the monthly storage growth in petabytes?

    Hint Assume ~5% of DAU post daily; photos average 3 MB, videos average 50 MB; use a 3× replication factor and roughly 20/80 photo-to-video mix for posted content.

Intermediate

  1. Uber Peak QPS Estimation: Estimate Uber's peak QPS for ride-related API calls during rush hour (5 PM Friday) in a major metro. Uber completes ~15M rides globally per day. Account for location updates, matching calls, and payment events per ride, then apply a realistic peak-to-average multiplier.

    Hint A single ride generates ~100 API calls spread over 20 minutes; peak hour sees ~3× daily average; remember global vs. metro scope if the question narrows to one city.
  2. WhatsApp Message Throughput: Estimate the peak message throughput WhatsApp must handle globally. WhatsApp has ~2B MAU. State assumptions for DAU conversion rate, messages per active user per day, and delivery receipt overhead, then calculate peak QPS.

    Hint Each message generates at minimum 2 events (sent + delivered receipt); apply the standard 2–3× peak multiplier over the daily average QPS.
  3. Netflix Bandwidth Estimation: Estimate Netflix's total outbound bandwidth during peak evening hours (~8 PM local time). ~300M subscribers globally; assume 10% concurrently streaming. Use a blended bitrate of 3 Mbps across quality tiers. Express the answer in Tbps and compare to known internet backbone capacities.

    Hint 20M concurrent streams × 3 Mbps = X Tbps; to sanity-check, recall that Netflix has historically been cited as ~15% of global internet traffic during peak.

Advanced

  1. Google Search Index Size: Estimate the storage required for Google's web search index. The crawled web has roughly 5–10B pages. Account for compressed HTML storage, extracted inverted index structures (roughly 3× raw data), PageRank scores, and the number of historical versions and replicas Google maintains for durability and query serving.

    Hint Start with raw page size (~10 KB compressed), multiply by index amplification factor and replica count; the answer should land in the tens-of-exabytes range and can be cross-checked against Google's reported data center capacity.

References & Further Reading

  • "System Design Interview" by Alex Xu, Chapter 2 (Back-of-the-Envelope Estimation)
  • Jeff Dean's "Numbers Everyone Should Know"
  • "The Art of Capacity Planning" by John Allspaw
  • Latency Numbers Every Programmer Should Know

