Chapter 4: Back-of-Envelope Estimation

Overview
Back-of-envelope estimation is the art of quickly approximating the scale of a system using simple math and a handful of memorized reference numbers. Before investing hours designing a distributed database, you should spend five minutes confirming that your design is even necessary, or whether a single Postgres instance will handle the load just fine.
This chapter is a reference chapter. Return to it every time you start a new system design problem. The numbers here feed directly into every case study in Part 4.
DAU/MAU Numbers Change Frequently
The user counts in worked examples below reflect approximate figures as of 2024. Exact DAU/MAU numbers change quarterly; what matters for estimation exercises is the technique and order-of-magnitude reasoning, not the precise input values. In interviews, ask your interviewer for the DAU assumption or state your own clearly.
As described in Chapter 3 (Core Trade-offs), estimation is Step 2 of the interview framework: you clarify requirements, then you estimate scale before drawing a single box.
Why interviewers care: Estimation reveals whether you think at system scale or algorithm scale. An engineer who says "we'll need about 150 TB/day of storage for media" is thinking like a systems engineer. One who says "it depends" is not.
Why Estimation Matters
1. Validates Design Feasibility
A 30-second calculation can prevent 30 minutes of wasted design. If your estimated QPS is 50, you do not need sharding. If it is 500,000, you do.
2. Guides Architecture Decisions
| Estimated QPS | Implication |
|---|---|
| < 1,000 | Single server, vertical scaling |
| 1,000–10,000 | Load balancer + a few app servers |
| 10,000–100,000 | Caching layer mandatory, read replicas |
| 100,000+ | Horizontal sharding, CDN, async processing |
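The table above can be sketched as a small lookup helper; the thresholds are the rough rules of thumb stated here, not hard limits:

```python
# Map an estimated QPS to the architecture tier suggested by the table
# above. Thresholds are order-of-magnitude guides, not hard limits.
def architecture_tier(qps: float) -> str:
    if qps < 1_000:
        return "single server, vertical scaling"
    if qps < 10_000:
        return "load balancer + a few app servers"
    if qps < 100_000:
        return "caching layer + read replicas"
    return "horizontal sharding, CDN, async processing"

print(architecture_tier(50))       # a single server is plenty
print(architecture_tier(500_000))  # sharding territory
```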
3. Prevents Over/Under-Engineering
- Under-engineering: Building a single-server app for a system that needs to handle 50,000 QPS; the system crashes on day one.
- Over-engineering: Deploying a 20-node Kafka cluster for a system with 100 users/day; wasted cost and complexity.
Powers of 2: Reference Table
Everything in computing is binary. These are the numbers you must know without thinking.
| Power | Exact Value | Approximate | Storage Name |
|---|---|---|---|
| 2^10 | 1,024 | ~1 Thousand | 1 KB |
| 2^20 | 1,048,576 | ~1 Million | 1 MB |
| 2^30 | 1,073,741,824 | ~1 Billion | 1 GB |
| 2^40 | 1,099,511,627,776 | ~1 Trillion | 1 TB |
| 2^50 | 1,125,899,906,842,624 | ~1 Quadrillion | 1 PB |
Practical shortcuts:
- 1 KB = 1,000 bytes (close enough for estimates)
- 1 MB = 1,000 KB = 1 million bytes
- 1 GB = 1,000 MB = 1 billion bytes
- 1 TB = 1,000 GB = 1 trillion bytes
- 1 PB = 1,000 TB; think of the entire video catalog of a large streaming service
Memory aid: Each step multiplies by 1,000 (roughly). Going KB → MB → GB → TB → PB is ×1,000 each time.
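The shortcut's accuracy can be checked in a couple of lines of Python; the decimal approximation drifts about 2.4% further from the binary value at each step:

```python
# How far off is the "x1,000 per step" decimal shortcut from the exact
# binary values? The error compounds by ~2.4% per step.
for name, power in [("KB", 10), ("MB", 20), ("GB", 30), ("TB", 40), ("PB", 50)]:
    exact = 2 ** power
    approx = 1000 ** (power // 10)
    print(f"1 {name}: {exact:,} vs {approx:,} ({exact / approx - 1:.1%} off)")
```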
Latency Numbers Every Programmer Should Know (2008 Baseline)
Originally published by Jeff Dean (Google, ~2008). These numbers are approximate but stable enough for estimation exercises. Modern hardware has improved many of these values; memorize the order of magnitude, not the exact value.
| Operation | Latency | Notes |
|---|---|---|
| L1 cache reference | 0.5 ns | Fastest memory access |
| Branch mispredict | 5 ns | CPU pipeline flush |
| L2 cache reference | 7 ns | 14× slower than L1 |
| Mutex lock/unlock | 100 ns | Contended lock cost |
| Main memory reference | 100 ns | DRAM access |
| Compress 1 KB (Snappy) | 3 µs | 3,000 ns |
| Send 1 KB over 1 Gbps network | 10 µs | Local network |
| Read 4 KB randomly from SSD | 150 µs | Random I/O is expensive |
| Read 1 MB sequentially from memory | 250 µs | Sequential is fast |
| Round trip within same datacenter | 500 µs | Intra-DC latency |
| Read 1 MB sequentially from SSD | 1 ms | 1,000 µs |
| HDD seek | 10 ms | Mechanical seek time |
| Read 1 MB sequentially from HDD | 20 ms | Sequential but slow disk |
| Send packet CA → Netherlands → CA | 150 ms | Cross-continent RTT |
Latency Scale Visualization
Key Takeaways from the Latency Table
- Memory is ~1,000× faster than SSD for random reads (100 ns vs 150 µs)
- SSD random reads are ~70× faster than an HDD seek (150 µs vs 10 ms)
- Avoid network round trips inside hot code paths; even intra-DC costs 500 µs
- Sequential access beats random access by 10–100× on both SSD and HDD
- Cross-continent latency is irreducible; physics sets a floor around 100 ms
Historical Note
These are the classic latency numbers originally compiled by Jeff Dean (~2008) and widely used in system design interviews. Modern hardware has improved several of these values significantly (e.g., mutex lock/unlock is now ~17–25 ns, network serialization is faster). The original values remain the standard reference for estimation exercises and interviews. See the community-maintained gist for updated figures.
QPS (Queries Per Second) Estimation
Formula
Average QPS = DAU × Average Queries Per User Per Day ÷ 86,400
Peak QPS = Average QPS × 2 to 3
Where:
- DAU = Daily Active Users
- 86,400 = seconds per day (60 × 60 × 24)
- Peak multiplier = 2–3× is a common default for consumer apps, but varies significantly by domain: news/media sites may spike 50–100× during breaking events, while banking apps typically peak at only 1.2–1.5× average. Always research domain-specific patterns.
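The formula can be expressed as a small Python helper; a minimal sketch, using the 3× peak default from above:

```python
SECONDS_PER_DAY = 86_400  # 60 * 60 * 24

def estimate_qps(dau: int, queries_per_user_per_day: float,
                 peak_multiplier: float = 3.0) -> tuple:
    """Return (average QPS, peak QPS) from DAU and per-user query rate."""
    avg = dau * queries_per_user_per_day / SECONDS_PER_DAY
    return avg, avg * peak_multiplier

# Twitter read-path numbers from the worked example:
avg, peak = estimate_qps(dau=500_000_000, queries_per_user_per_day=10)
print(f"avg ≈ {avg:,.0f} QPS, peak ≈ {peak:,.0f} QPS")
```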
Worked Example: Twitter Read QPS
Assumptions:
- ~500 million DAU (approximate as of 2023; exact figures vary by source)
- Each user reads their timeline ~10 times per day
- Average of 20 tweets shown per timeline load
Calculation:
Timeline loads per day = 500M × 10 = 5 billion
Reads per second (avg) = 5,000,000,000 ÷ 86,400 ≈ 57,870 QPS
Peak QPS = 57,870 × 3 ≈ 174,000 QPS
What this tells us: Twitter/X needs to serve ~174,000 read QPS at peak. This immediately implies caching is mandatory; no database can handle 100K+ QPS on live queries without a cache layer in front of it.
Quick QPS Conversions
| Requests Per Day | Approx QPS |
|---|---|
| 1 million/day | ~12 QPS |
| 10 million/day | ~116 QPS |
| 100 million/day | ~1,160 QPS |
| 1 billion/day | ~11,600 QPS |
| 10 billion/day | ~115,700 QPS |
Memory shortcut: 1 million requests/day ≈ 12 QPS. Scale linearly from there.
Storage Estimation
Formula
Daily Storage = DAU ร Data Generated Per User Per Day
Total Storage = Daily Storage ร Retention Period (days)
With Replication = Total Storage × Replication Factor (3×)
Data Size Reference
| Data Type | Typical Size |
|---|---|
| Tweet / short text post | 280 chars ≈ 300 bytes |
| User metadata record | ~1 KB |
| Profile photo (thumbnail) | ~10 KB |
| Photo (compressed JPEG) | ~200 KB–2 MB |
| Short video (1 min, 720p) | ~50 MB |
| Video (1 hour, 1080p) | ~2 GB |
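The storage formula above (daily volume × retention × replication) can be sketched as a helper; the 3× replication default follows the chapter's convention:

```python
def estimate_storage_tb(dau: int, bytes_per_user_per_day: float,
                        retention_days: int, replication: int = 3) -> dict:
    """Apply the storage formula; all results in terabytes (decimal)."""
    TB = 1e12
    daily = dau * bytes_per_user_per_day
    return {
        "daily_raw_tb": daily / TB,
        "total_raw_tb": daily * retention_days / TB,
        "total_replicated_tb": daily * retention_days * replication / TB,
    }

# Example: 100M users generating ~10 KB/day each, kept for one year
print(estimate_storage_tb(100_000_000, 10_000, retention_days=365))
```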
Worked Example: Instagram Photo Storage
Assumptions:
- ~1.3 billion DAU (approximate as of 2024)
- 10% of users post one photo per day = 130 million photos/day
- Average photo size after compression: 300 KB
- Thumbnails generated: 3 sizes × 20 KB = 60 KB per photo
- Metadata per photo: 1 KB
Calculation:
Photo data/day = 130M × 300 KB = 39,000,000,000 KB = ~39 TB/day
Thumbnail data = 130M × 60 KB = 7,800,000,000 KB = ~8 TB/day
Metadata/day = 130M × 1 KB = 130,000,000 KB = ~130 GB/day
Total raw/day ≈ 47 TB/day
With 3× replication = 141 TB/day
5-year total = 47 TB × 365 × 5 × 3 = ~257 PB
What this tells us: Instagram-scale photo storage demands dedicated object storage (S3-equivalent), not block storage. At ~257 PB over 5 years, the cost alone justifies aggressive compression and tiered storage strategies.
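The arithmetic above can be checked in a few lines; every input is one of the stated assumptions:

```python
photos_per_day = 130_000_000
TB = 1e12  # bytes per terabyte (decimal)

photo_bytes = photos_per_day * 300_000  # 300 KB average photo
thumb_bytes = photos_per_day * 60_000   # 3 thumbnails x 20 KB each
meta_bytes  = photos_per_day * 1_000    # 1 KB metadata per photo

daily_tb = (photo_bytes + thumb_bytes + meta_bytes) / TB
print(f"raw/day ≈ {daily_tb:.0f} TB")                            # ~47 TB/day
print(f"with 3x replication ≈ {daily_tb * 3:.0f} TB/day")        # ~141 TB/day
print(f"5-year total ≈ {daily_tb * 365 * 5 * 3 / 1000:.0f} PB")  # ~257 PB
```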
Bandwidth Estimation
Formula
Outbound Bandwidth = Read QPS × Average Response Size
Inbound Bandwidth = Write QPS × Average Request Size
Worked Example: Twitter Bandwidth
Assumptions (continuing from QPS example above):
- Read QPS: 57,870 (average), 174,000 (peak)
- Average timeline response: 20 tweets × 300 bytes = 6,000 bytes = ~6 KB
Calculation:
Average outbound = 57,870 QPS × 6 KB = 347,220 KB/s ≈ 347 MB/s
Peak outbound = 174,000 QPS × 6 KB = 1,044,000 KB/s ≈ 1 GB/s
What this tells us: At ~1 GB/s peak egress, Twitter/X's network infrastructure must handle ~8 Gbps of outbound traffic from timeline endpoints alone. CDN caching of popular content is essential to reduce origin server load.
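The bandwidth formula reduces to one multiplication; a minimal helper using the numbers above:

```python
def outbound_bandwidth_mb_s(qps: float, response_bytes: float) -> float:
    """Outbound bandwidth in MB/s for a given QPS and response size."""
    return qps * response_bytes / 1e6

print(outbound_bandwidth_mb_s(57_870, 6_000))   # ~347 MB/s average
print(outbound_bandwidth_mb_s(174_000, 6_000))  # ~1,044 MB/s, i.e. ~1 GB/s peak
```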
Worked Example: Twitter Storage Estimation (Full Walkthrough)
This step-by-step walkthrough shows how to chain assumptions into a complete estimate.
Estimation Process
Step-by-Step Calculation
Step 1: State assumptions clearly
- Daily Active Users (DAU): ~500 million
- Tweets posted per day: ~800 million
- Average tweet: 280 characters of text plus metadata ≈ 300 bytes total
- 10% of tweets contain one image (average 200 KB after compression)
- 1% of tweets contain a video (average 2 MB for short video)
- Data retained: 5 years
Step 2: Text storage
Text per day = 800M tweets × 300 bytes
= 240,000,000,000 bytes
= 240 GB/day
Step 3: Image storage
Tweets with images = 800M × 10% = 80 million
Image storage/day = 80M × 200 KB = 16,000,000,000 KB
= 16 TB/day
Step 4: Video storage
Tweets with video = 800M × 1% = 8 million
Video storage/day = 8M × 2 MB = 16,000,000 MB
= 16 TB/day
Step 5: Metadata (user data, indexes, etc.)
Metadata overhead ≈ 20% of total = ~6 TB/day (rough)
Step 6: Sum daily total
Text: 0.24 TB/day
Images: 16 TB/day
Video: 16 TB/day
Metadata: 6 TB/day
─────────────────────
Total: ~38.24 TB/day ≈ 40 TB/day
Step 7: Apply replication factor
Storage with 3× replication = 40 TB × 3 = 120 TB/day
Step 8: Project over 5 years
5-year storage = 120 TB/day × 365 days × 5 years
= 120 × 1,825
= 219,000 TB
≈ 219 PB
Conclusion: Twitter/X needs approximately 219 petabytes of storage over 5 years. This demands a distributed object storage system (like S3 or HDFS), not a relational database. Media delivery via CDN is mandatory; serving 32 TB/day of media from origin servers alone is not viable.
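The eight steps can be chained into one short script; all numbers are the walkthrough's assumptions, including Step 6's round-up to 40 TB/day:

```python
TWEETS_PER_DAY = 800_000_000
TB = 1e12  # bytes per terabyte (decimal)

text_tb  = TWEETS_PER_DAY * 300 / TB               # 300 bytes per tweet
image_tb = TWEETS_PER_DAY * 0.10 * 200_000 / TB    # 10% carry a 200 KB image
video_tb = TWEETS_PER_DAY * 0.01 * 2_000_000 / TB  # 1% carry a 2 MB video
meta_tb  = 6                                       # Step 5's rough ~20% overhead

raw_tb = text_tb + image_tb + video_tb + meta_tb   # ≈ 38.24 TB/day
daily_tb = 40                                      # Step 6 rounds up to ~40
five_year_pb = daily_tb * 3 * 365 * 5 / 1000       # 3x replication, 5 years
print(f"raw ≈ {raw_tb:.2f} TB/day; 5-year ≈ {five_year_pb:.0f} PB")
```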
Worked Example: YouTube Bandwidth Estimation
Assumptions
- Monthly Active Users (MAU): ~2.7 billion (as of 2024)
- DAU ≈ 30% of MAU = ~800 million
- Average videos watched per DAU per day: 5 videos
- Average video duration: 5 minutes
- Video quality: blended average ~5 Mbps (mix of 720p, 1080p, and 4K streams)
- Upload rate: 500 hours of video uploaded every minute
Step-by-Step Calculation
Step 1: Daily video watch hours
Video views/day = 800M DAU × 5 videos = 4 billion views/day
Watch minutes/day = 4B × 5 min = 20 billion minutes/day
Watch hours/day = 20B ÷ 60 = ~333 million hours/day
Step 2: Outbound bandwidth (streaming)
Bandwidth per stream = 5 Mbps (blended average) = 5,000,000 bits/s
Concurrent viewers = 333M hours/day ÷ 24 hours
= ~13.9M concurrent viewers (average)
Average outbound BW = 13.9M × 5 Mbps
= 69.5 Tbps (terabits per second, average)
Peak outbound BW = avg × 3 (peak hour multiplier)
≈ 208 Tbps
Step 3: Inbound bandwidth (uploads)
Upload rate = 500 hours of video/minute
= 500 × 60 minutes of video/minute
= 30,000 minutes of video/minute
At blended 5 Mbps per stream:
Inbound BW = 30,000 min/min × 5 Mbps
= 150,000 Mbps
= 150 Gbps upload ingestion bandwidth
Step 4: Storage for new uploads per day
New video/day = 500 hrs/min × 60 min/hr × 24 hrs
= 720,000 hours of video/day
At blended quality (~800 MB/hour compressed):
Storage/day = 720,000 hrs × 800 MB
= 576,000,000 MB
≈ 576 TB/day of new video
Conclusion: YouTube's bandwidth requirements (200+ Tbps peak outbound) make it one of the largest consumers of internet bandwidth globally. At this scale, YouTube must operate its own CDN infrastructure (Google Global Cache), peering directly with ISPs. No third-party CDN can handle this volume cost-effectively.
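The streaming-side arithmetic above can be condensed into a few lines; all inputs are the stated assumptions:

```python
dau = 800_000_000
views_per_day = dau * 5               # 4 billion views/day
watch_hours = views_per_day * 5 / 60  # 5-minute average video
concurrent = watch_hours / 24         # average concurrent viewers
avg_tbps = concurrent * 5 / 1e6       # 5 Mbps blended bitrate -> Tbps
print(f"{concurrent / 1e6:.1f}M concurrent viewers, "
      f"avg {avg_tbps:.1f} Tbps, peak ≈ {avg_tbps * 3:.0f} Tbps")
```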
Common Estimation Mistakes
1. Forgetting Replication
Storage estimates are for raw data. In production, you replicate data 3ร (minimum) for durability.
Raw storage: 10 TB
With 3× replication: 30 TB; always use this number for cost/capacity planning
2. Ignoring Metadata Overhead
Databases, file systems, and object stores all add metadata: indexes, checksums, tombstones, headers. Add 10–30% overhead to any storage estimate.
3. Confusing Peak vs Average
Average QPS is what you calculate. But you must provision for peak QPS (2–3× average). A system that handles average load but crashes at peak is a failed design.
4. Confusing Bits and Bytes
Network bandwidth is measured in bits. Storage is measured in bytes.
1 Gbps network = 1 gigabit per second
= 125 megabytes per second (MB/s)
Rule: Divide bits by 8 to get bytes. When someone says "we have a 1 Gbps pipe," they mean ~125 MB/s of actual data throughput.
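The divide-by-8 rule as a one-line helper:

```python
def gbps_to_mb_per_s(gbps: float) -> float:
    """Convert a link speed in gigabits/s to throughput in megabytes/s."""
    return gbps * 1000 / 8

print(gbps_to_mb_per_s(1))   # 125.0 MB/s
print(gbps_to_mb_per_s(10))  # 1250.0 MB/s
```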
5. Ignoring Growth Rate
A system handling 1,000 QPS today may need to handle 10,000 QPS in 18 months. Always ask: what is the expected growth rate? Commonly 2–3× per year for fast-growing products.
6. Treating All Operations as Equal
A "write" to a database is not the same cost as a "read." Writes typically require quorum confirmation across replicas, making them 5–10× more expensive. Separate your read QPS from write QPS in estimates.
Estimation Cheat Sheet
QPS Quick Reference
| Requests Per Day | QPS |
|---|---|
| 100K/day | ~1 QPS |
| 1M/day | ~12 QPS |
| 10M/day | ~115 QPS |
| 100M/day | ~1,160 QPS |
| 1B/day | ~11,574 QPS |
| 10B/day | ~115,740 QPS |
Bandwidth Quick Reference
| QPS × Response Size | Bandwidth |
|---|---|
| 1,000 QPS × 1 KB | 1 MB/s |
| 10,000 QPS × 1 KB | 10 MB/s |
| 100,000 QPS × 1 KB | 100 MB/s |
| 10,000 QPS × 100 KB | 1 GB/s |
Storage Quick Reference
| Daily | Monthly | Yearly | 5-Year |
|---|---|---|---|
| 1 GB/day | ~30 GB | ~365 GB | ~1.8 TB |
| 100 GB/day | ~3 TB | ~36 TB | ~182 TB |
| 1 TB/day | ~30 TB | ~365 TB | ~1.8 PB |
| 10 TB/day | ~300 TB | ~3.6 PB | ~18 PB |
Multiplication Shortcuts
- ×1,000 = KB → MB → GB → TB → PB
- ÷86,400 = requests/day → QPS (or use ÷100,000 for a fast rough estimate)
- ×3 = raw storage → replicated storage
- ×2–3 = average QPS → peak QPS
- ÷8 = bits → bytes (for network bandwidth)
- ×1.2–1.3 = add metadata overhead to storage
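The shortcuts above as tiny helpers, a sketch for quick interactive use; they are deliberately rough, producing order-of-magnitude estimates only:

```python
# The multiplication shortcuts above as one-line helpers.
def per_day_to_qps(requests): return requests / 86_400
def peak_qps(avg):            return avg * 3          # upper end of 2-3x
def replicated(raw_tb):       return raw_tb * 3
def bits_to_bytes(bits):      return bits / 8
def with_metadata(raw_tb):    return raw_tb * 1.25    # midpoint of 1.2-1.3x

print(round(per_day_to_qps(1_000_000)))  # 12
```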
Key Takeaway: Back-of-envelope estimation is a practiced skill, not a talent. Memorize the reference tables, internalize the formulas, and practice on real systems. The goal is not precision; it is order-of-magnitude correctness that guides architectural decisions.
Estimation Process Diagrams
The following diagrams show the process and calculation trees for the four core estimation types. Use them as a repeatable mental model for every new problem.
How to Approach Any Estimation Problem
QPS Estimation Tree
Start from DAU and decompose into per-service query rates.
Storage Estimation Tree
Bandwidth Estimation Tree
Related Chapters
| Chapter | Relevance |
|---|---|
| Ch02 – Scalability | Estimation feeds directly into scalability planning |
| Ch25 – Interview Framework | Estimation is Step 2 of the 4-step interview framework |
| Ch18 – URL Shortener | Classic estimation walkthrough: QPS, storage, bandwidth |
Practice Questions
Attempt each estimate before reading the hint. Write your assumptions explicitly before calculating.
Beginner
Instagram Storage Estimation: Estimate how much new storage Instagram requires per year, given ~1.3B DAU. State all assumptions (posting rate, photo/video mix, average file sizes, replication factor) before calculating. What is the monthly storage growth in petabytes?
Hint
Assume ~5% of DAU post daily; photos average 3 MB, videos average 50 MB; use a 3× replication factor and roughly a 20/80 photo-to-video mix for posted content.
Intermediate
Uber Peak QPS Estimation: Estimate Uber's peak QPS for ride-related API calls during rush hour (5 PM Friday) in a major metro. Uber completes ~15M rides globally per day. Account for location updates, matching calls, and payment events per ride, then apply a realistic peak-to-average multiplier.
Hint
A single ride generates ~100 API calls spread over 20 minutes; peak hour sees ~3× daily average; remember global vs. metro scope if the question narrows to one city.
WhatsApp Message Throughput: Estimate the peak message throughput WhatsApp must handle globally. WhatsApp has ~2B MAU. State assumptions for DAU conversion rate, messages per active user per day, and delivery receipt overhead, then calculate peak QPS.
Hint
Each message generates at minimum 2 events (sent + delivered receipt); apply the standard 2โ3ร peak multiplier over the daily average QPS.Netflix Bandwidth Estimation: Estimate Netflix's total outbound bandwidth during peak evening hours (~8 PM local time). ~300M subscribers globally; assume 10% concurrently streaming. Use a blended bitrate of 3 Mbps across quality tiers. Express the answer in Tbps and compare to known internet backbone capacities.
Hint
20M concurrent streams × 3 Mbps = X Tbps; to sanity-check, recall that Netflix has historically been cited as ~15% of global internet traffic during peak.
Advanced
Google Search Index Size: Estimate the storage required for Google's web search index. The crawled web has roughly 5–10B pages. Account for compressed HTML storage, extracted inverted index structures (roughly 3× raw data), PageRank scores, and the number of historical versions and replicas Google maintains for durability and query serving.
Hint
Start with raw page size (~10 KB compressed), multiply by index amplification factor and replica count; the answer should land in the tens-of-exabytes range and can be cross-checked against Google's reported data center capacity.
References & Further Reading
- "System Design Interview" – Alex Xu, Chapter 2 (Back-of-the-Envelope Estimation)
- Jeff Dean's "Numbers Everyone Should Know"
- "The Art of Capacity Planning" – John Allspaw
- Latency Numbers Every Programmer Should Know
