Chapter 14: Discord: Data Layer Evolution

"The best migration is the one you never have to do. The second best is the one you do once and never have to undo." — Discord Engineering

Mind Map

Prerequisites

This chapter assumes familiarity with NoSQL data models (Ch07), particularly Cassandra tombstones and compaction. Review that first if needed.

Phase 1: MongoDB (2015–2017)

Discord launched in May 2015 and chose MongoDB for the same reason many startups do: flexible schema, fast iteration, and a document model that matched the early mental model of "a message is a document."

The initial message schema was simple:

javascript

// MongoDB message document (2015)
{
  "_id": ObjectId("507f1f77bcf86cd799439011"),
  "channel_id": "123456789",
  "author_id": "987654321",
  "content": "Hello, world!",
  "timestamp": ISODate("2015-05-13T10:00:00Z"),
  "mentions": ["111111111"],
  "embeds": [],
  "attachments": []
}

For the first year, this worked well. The document model matched the application's data — messages have variable structure (some have attachments, some have embeds, some have reactions). MongoDB's secondary indexes on channel_id and timestamp made "load the last 50 messages in a channel" fast.

Pain Points That Emerged at Scale

By 2017, Discord had 100 million messages per day and the MongoDB pain points were undeniable:

RAM-bound indexes: MongoDB requires its working set (active data + indexes) to fit in RAM for good performance. The messages collection's index on (channel_id, timestamp) was larger than available RAM on a single machine. Queries that fell outside the hot cache caused disk seeks.

Single-document transaction model: MongoDB 3.x (what Discord used) had no multi-document transactions. Updating a message and updating a channel's last_message_id in the same operation was not atomic.

No native TTL at scale: MongoDB's TTL index worked for simple cases but struggled with the rate at which Discord needed to expire and archive messages at scale.

Scaling model mismatch: MongoDB's sharding required careful shard key selection. Discord had already built too much application logic assuming a single MongoDB instance, making a retrofit expensive.

Why Not MongoDB at Scale

MongoDB's original storage engine (MMAPv1) was replaced by WiredTiger in 3.0, which improved significantly. However, by 2017, Discord's access pattern was fundamentally append-heavy — messages are rarely updated, never deleted by users — and time-ordered. This is exactly the pattern that wide-column stores with LSM engines are designed for. MongoDB's document model added flexibility Discord did not need, without the write throughput it did need.

Phase 2: Apache Cassandra (2017–2022)

Discord chose Cassandra for its write throughput, linear horizontal scalability, and time-ordered data model. Messages are append-heavy: Discord sends billions of messages that are rarely edited and almost never deleted. Cassandra's LSM storage engine is purpose-built for this pattern. See Ch01 — Storage Engines for the LSM fundamentals underlying this choice.

Data Model Design

The message table design was the most critical architectural decision:

sql

-- Discord's Cassandra messages table (simplified)
CREATE TABLE messages (
    channel_id   BIGINT,
    bucket       INT,              -- Time bucket: timestamp_ms / bucket_size_ms
    message_id   BIGINT,          -- Snowflake ID (encodes timestamp)
    author_id    BIGINT,
    content      TEXT,
    mentions     LIST<BIGINT>,
    reactions    MAP<TEXT, INT>,  -- emoji -> count (denormalized)
    edited_at    TIMESTAMP,
    deleted      BOOLEAN,
    PRIMARY KEY ((channel_id, bucket), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

Why (channel_id, bucket) as the partition key?

Cassandra stores all rows in a partition together on the same node. The natural partition key — just channel_id — creates unbounded partitions: a Discord server active for years would accumulate millions of messages in a single partition, eventually hitting Cassandra's practical partition size limit (~100MB or ~2 billion cells).

The time bucket solves this. Each bucket represents a fixed time window (e.g., 10 days of messages). When a bucket fills, it becomes read-only and compaction finishes quickly. The application calculates the bucket at query time:

python

BUCKET_SIZE_DAYS = 10
EPOCH_MS = 1420070400000  # Jan 1, 2015 in milliseconds

def get_bucket(message_id: int) -> int:
    # message_id is a Snowflake: top 42 bits encode millisecond timestamp
    timestamp_ms = (message_id >> 22) + EPOCH_MS
    return int(timestamp_ms / 1000 / 86400 / BUCKET_SIZE_DAYS)

def get_messages(channel_id: int, before_id: int, limit: int = 50):
    bucket = get_bucket(before_id)
    results = []

    while len(results) < limit:
        rows = session.execute(
            "SELECT * FROM messages "
            "WHERE channel_id = %s AND bucket = %s "
            "AND message_id < %s ORDER BY message_id DESC LIMIT %s",
            [channel_id, bucket, before_id, limit]
        )
        results.extend(rows)
        bucket -= 1    # Walk backwards through time buckets
        if bucket < 0:
            break

    return results[:limit]

Why Not User-ID as Partition Key

An alternative design partitions by author_id instead of channel_id. Discord rejected this because:

The primary read pattern is "load messages in channel X" — not "load all messages by user Y"
Partitioning by author would require scatter-gather across all authors in a channel for every message load
Channel-based partitioning means loading a channel's history requires exactly ceil(message_count / bucket_capacity) partition reads

Hot Partition Problem

Large Discord servers ("guilds") with millions of members generate thousands of messages per second in a single channel. All those writes land on the (channel_id, bucket) partition — potentially on a single Cassandra node.

Discord addressed hot partitions through:

Replication factor 3 — writes go to 3 nodes; no single node bears 100% of the write load
Cassandra token assignment — the consistent hashing ring distributes the 3 replicas across different physical nodes
Rate limiting — server-side rate limits prevent any single channel from exceeding burst thresholds

JVM GC Pauses: The Critical Failure Mode

Cassandra runs on the JVM. As of 2022, Discord was running Cassandra 3.x with Java heap sizes of 16–32GB. Stop-the-world GC pauses (even with G1GC or ZGC) caused:

P99 latency spikes from 40ms to 300–500ms during GC
Occasional full GC pauses causing 2–10 second outages visible to users
Compaction pressure triggering GC: background compaction (merging SSTables) generates garbage that triggered GC

Discord's SRE team spent significant engineering time tuning JVM flags, heap sizes, and GC algorithms. They could reduce pause frequency but not eliminate it at their write rates.

Phase 3: ScyllaDB (Migration Completed 2022–2023)

ScyllaDB is a C++ reimplementation of the Cassandra storage engine. It is API-compatible with Cassandra (same CQL, same data model, same replication protocol) but eliminates the JVM entirely. Discord began the migration in 2022 and completed it in 2023. The decision to migrate was driven by three ScyllaDB advantages:

1. No JVM — No GC Pauses

ScyllaDB is written in C++ using the Seastar framework. Memory management is explicit; there is no garbage collector. Compaction runs without causing stop-the-world pauses. Discord's GC-induced P99 spikes disappeared.

2. Shard-Per-Core Architecture

ScyllaDB assigns each CPU core its own shard of data and handles requests for that shard without cross-core communication. This eliminates lock contention that occurs in the JVM Cassandra implementation when multiple threads compete for shared data structures.

3. Dramatic Infrastructure Reduction

Discord's published results from the completed migration (2023):

Metric	Cassandra (before)	ScyllaDB (after)	Improvement
Node count	177 nodes	72 nodes	59% reduction
P99 read latency	40–125ms	~15ms	3–8× better
P99 write latency	5ms	5ms	Comparable
GC-induced spikes	Frequent (JVM stop-the-world)	None	Eliminated
Messages migrated	—	Trillions	~9 days total
Backfill rate	—	~3.2M messages/sec	Rust migrator

Migration Strategy

The MongoDB→Cassandra and Cassandra→ScyllaDB migrations followed similar patterns but with different tooling. Both migrations are now complete. The description below uses past tense to reflect the completed state.

Dual-Write Pattern

Every migration began with a dual-write phase: the application writes to both old and new stores simultaneously.

Key principles:

New DB is the primary write target; old DB is the fallback read source
Reads check new DB first; fall back to old DB if the message has not yet been migrated
Dual-write runs for weeks before backfill begins, ensuring all new messages land in the new store

Rust Bulk Migrator

For the Cassandra→ScyllaDB migration, Discord built a bulk migrator in Rust. The choice of Rust was deliberate:

No GC: migrating data at maximum throughput while avoiding GC pauses in the migrator itself
Controlled memory: Rust's ownership model ensures the migrator does not accumulate unbounded heap
Throughput achieved: 3.2 million messages per second sustained throughput

Rust Beyond the Migrator

Discord's use of Rust for the bulk migrator was not a one-off choice. Discord's data services layer — the application-level services that sit between the application tier and ScyllaDB — is also implemented in Rust (published in Discord's 2023 engineering post on the migration). The combination of Rust's memory safety, zero GC, and high throughput made it Discord's preferred language for performance-critical database infrastructure.

Tombstone Stall Lesson

The most painful incident during migration was the tombstone crisis. In Cassandra and ScyllaDB, deletes do not immediately remove data — they write a tombstone marker. The compaction process eventually removes tombstones and associated data. But during compaction, a read must scan past all tombstones to find live data.

Discord's messages table had a deleted column (soft deletes) and also used actual CQL DELETE statements for moderation use cases. During the migration backfill, old tombstones accumulated faster than compaction could clear them. Read latency on channels with heavy moderation activity spiked because each read scanned thousands of tombstones.

Resolution: Discord changed the data model to use a deleted BOOLEAN column and never issue CQL DELETE — instead updating the column to true. This produces a tombstone-free compaction path. The lesson: model deletes as updates, not as deletes, in LSM-based databases.

Sampling Validation

After backfill, Discord validated correctness by sampling:

Random 0.1% of channel_id values
Fetch last 1,000 messages from each in both old and new DB
Compare message_id sets and content checksums
Alert if mismatch rate exceeded 0.001%

This approach validated correctness without the cost of reading every row in both databases.

Schema Analysis: Why Channel-First

The (channel_id, bucket) partition key optimizes for the 99% case: "load messages in channel X." But it creates inefficiency for "find all messages by user Y." Discord made an explicit product decision: users cannot search their own message history across servers. This removed the requirement for an author-indexed table.

Access Pattern	Efficiency	Notes
Load last N messages in channel	Excellent	Single partition read
Load message by ID	Excellent	Derive partition from Snowflake timestamp
Search by keyword in channel	Good	Elasticsearch sidecar handles this
All messages by user across servers	Not supported	Product decision
Count unread messages per channel	Good	Read from latest bucket only

Reactions and Embeds as Denormalized Columns

Reactions (emoji responses to messages) and embeds (link previews) are stored as map columns within the message row rather than separate tables:

sql

-- Reactions stored as MAP<text, int> within the message row
-- Query: SELECT reactions FROM messages WHERE ...
-- Result:
reactions = {
    '👍': 42,
    '❤️': 15,
    '😂': 7
}

This denormalization avoids secondary index lookups for the common case (display a message with its reaction counts). The cost: if reactions need to be queried independently (e.g., "show all messages reacted to with ❤️"), no efficient query path exists. Discord accepted this limitation because that access pattern is not exposed in the product. This is the same denormalization principle seen in Instagram's like_count pattern.

Key Lessons

Lesson	Detail
Best migration is the one you avoid	MongoDB worked for 2 years; the Cassandra migration was significant engineering cost
Model deletes as updates in LSM stores	CQL DELETE creates tombstones that accumulate and stall reads; use a `deleted` boolean instead
Test with production access patterns	Benchmarks that don't replicate hot partitions and time-bucket traversals miss the real bottleneck
Rust for data migration pipelines	No GC and controlled memory achieved ~3.2M msg/sec throughput; Python or Java migrators would GC under this load
Monitor compaction health, not just query latency	Cassandra's compaction falling behind was a leading indicator 30 minutes before query latency degraded
Shard-per-core eliminates JVM as the bottleneck	ScyllaDB's architecture shift — not just a language change — reduced node count by 59%
Rust for data services (not just migrations)	Discord's data services layer (application-to-ScyllaDB tier) is also Rust — the same GC-free, high-throughput properties apply to the permanent service, not just the one-off migration tool

Common Mistakes

Mistake	Why It Happens	Impact	Fix
Choosing Cassandra for small datasets	"It scales to petabytes"	Complex operational model, tombstone management, and eventual consistency with no benefit below 100GB	Use PostgreSQL or MongoDB for datasets under 1TB; Cassandra's strengths emerge at massive write volume
Not testing with production tombstone patterns	Dev data has no deletes	Tombstone accumulation not discovered until production read latency spikes	Load test with realistic delete rates; set `tombstone_warn_threshold` monitoring before go-live
Skipping dual-write validation before full cutover	"Migration tool was tested"	Subtle data transformation bugs only appear on edge-case documents	Run shadow reads comparing old vs new for at least 48 hours across all access patterns before cutover
Underestimating JVM GC impact on tail latency	"Average latency is fine"	p99 latency spikes during GC pauses invisible in average metrics	Monitor p99/p999 latency, not average; GC pauses show up only in tail percentiles

Chapter	Relevance
Ch01 — Database Landscape	LSM vs B-tree storage engine tradeoffs underlying this evolution
Ch02 — Data Modeling for Scale	Access-pattern-driven schema design used in channel + bucket key
Ch09 — Replication & High Availability	Cassandra's tunable consistency used during dual-write migration
Ch10 — Sharding & Partitioning	Token ring and partition key selection relevant to bucket design
Ch13 — Instagram: PostgreSQL at Scale	Parallel case study: denormalization and sharding in a different DB family
Ch16 — Database Selection Framework	Applying Discord's migration lessons to future selection decisions

Practice Questions

Beginner

Tombstone Behavior: A Discord channel has 1 million messages of which 50,000 have been moderated (deleted) using CQL DELETE. How does this affect read performance when loading the most recent 50 messages? What alternative data model avoids this problem?
Model Answer
Cassandra and ScyllaDB mark deleted rows with tombstones. Reads must scan past tombstones to find live rows. With 50,000 tombstones scattered across the partition, reading 50 live messages requires scanning many more entries. Alternative: use a `deleted BOOLEAN` column and `UPDATE messages SET deleted = true WHERE ...` — this writes a regular value update, not a tombstone, so compaction handles it efficiently.
Time Bucketing: Discord uses 10-day time buckets for message partitions. What happens when you load the last 50 messages in a channel and the most recent bucket has only 20 messages? Describe the query behavior step by step.
Model Answer
The application calculates the current bucket from the most recent message_id, queries that bucket and gets 20 messages. Since fewer than 50 were returned, it decrements the bucket number and queries again — getting up to 30 more messages from the previous bucket. Each bucket query is a single Cassandra partition read. This continues until 50 total messages are found or the earliest bucket is reached.

Intermediate

Hot Partition Mitigation: A large Discord server has 2 million members and one main channel that receives 10,000 messages per minute during live events. How would you detect this hot partition, and what options exist to mitigate it without changing the data model?
Model Answer
Detection: monitor per-partition write rate using ScyllaDB's shard-level metrics or Cassandra's nodetool tablestats. Mitigation options: (1) Read replicas with LOCAL_QUORUM reads to distribute read load across 3 nodes. (2) Application-level rate limiting on writes per channel. (3) Token-aware driver so each write goes directly to the correct replica without coordinator overhead. (4) Increase replication factor so writes land on more nodes.
Dual-Write Migration Risk: During the Cassandra→ScyllaDB migration, a user edits a message. The edit lands in ScyllaDB but the async write to Cassandra fails due to a network partition. An hour later, the migration completes and Cassandra is decommissioned. Was any data lost? What if the failure was reversed (Cassandra succeeds, ScyllaDB fails)?
Model Answer
If ScyllaDB (new DB) gets the write and Cassandra (old DB) does not: no data is lost — ScyllaDB is the source of truth. If Cassandra gets the write and ScyllaDB does not: the edit is lost in ScyllaDB. The message shows the pre-edit version. This is why the migration framework must treat the new DB as the primary write target — new DB success is required; old DB failure is acceptable.

Advanced

ScyllaDB vs Cassandra Architecture: Discord reduced from 177 Cassandra nodes to 72 ScyllaDB nodes while handling the same workload. Explain the architectural reasons why ScyllaDB requires fewer nodes. What does the shard-per-core model eliminate that Cassandra's JVM thread pool cannot? Under what conditions might ScyllaDB's advantage shrink?
Model Answer
Three factors: (1) No JVM overhead — ScyllaDB's C++ uses less memory per operation, so more CPU budget goes to actual work vs. GC bookkeeping. (2) Shard-per-core eliminates cross-core synchronization — each core handles its partitions without locks. (3) Predictable compaction — no GC pauses mean compaction runs continuously at steady state rather than bursty. ScyllaDB's advantage shrinks with very low concurrency (a single-threaded workload doesn't benefit from shard-per-core) and with workloads that are network-bound rather than CPU-bound.

References

How Discord Stores Billions of Messages — Discord Engineering Blog (2017)
How Discord Stores Trillions of Messages — Discord Engineering Blog (2023)
How Discord Migrated Trillions of Messages from Cassandra to ScyllaDB — ScyllaDB Tech Talk (2023)
ScyllaDB Architecture: Shared-Nothing Design — ScyllaDB Documentation
Cassandra Tombstones: The Good, The Bad, and the Ugly — The Last Pickle (2016)
The Seastar Framework for High-Performance Server Applications — Avi Kivity et al.
"Designing Data-Intensive Applications" — Kleppmann, Chapter 3 (Storage and Retrieval)
Apache Cassandra Documentation: Compaction

Part 1: Foundations

Part 2: Engine Deep Dives

Part 3: Scaling & Operations

Part 4: Real-World Design

Chapter 14: Discord: Data Layer Evolution

Mind Map

Phase 1: MongoDB (2015–2017)

Pain Points That Emerged at Scale

Phase 2: Apache Cassandra (2017–2022)

Data Model Design

Why Not User-ID as Partition Key

Hot Partition Problem

JVM GC Pauses: The Critical Failure Mode

Phase 3: ScyllaDB (Migration Completed 2022–2023)

1. No JVM — No GC Pauses

2. Shard-Per-Core Architecture

3. Dramatic Infrastructure Reduction

Migration Strategy

Dual-Write Pattern

Rust Bulk Migrator

Tombstone Stall Lesson

Sampling Validation

Schema Analysis: Why Channel-First

Reactions and Embeds as Denormalized Columns

Key Lessons

Common Mistakes

Practice Questions

Beginner

Intermediate

Advanced

References

Chapter 14: Discord: Data Layer Evolution ​

Mind Map ​

Phase 1: MongoDB (2015–2017) ​

Pain Points That Emerged at Scale ​

Phase 2: Apache Cassandra (2017–2022) ​

Data Model Design ​

Why Not User-ID as Partition Key ​

Hot Partition Problem ​

JVM GC Pauses: The Critical Failure Mode ​

Phase 3: ScyllaDB (Migration Completed 2022–2023) ​

1. No JVM — No GC Pauses ​

2. Shard-Per-Core Architecture ​

3. Dramatic Infrastructure Reduction ​

Migration Strategy ​

Dual-Write Pattern ​

Rust Bulk Migrator ​

Tombstone Stall Lesson ​

Sampling Validation ​

Schema Analysis: Why Channel-First ​

Reactions and Embeds as Denormalized Columns ​

Key Lessons ​

Common Mistakes ​

Related Chapters ​

Practice Questions ​

Beginner ​

Intermediate ​

Advanced ​

References ​

Chapter 14: Discord: Data Layer Evolution

Mind Map

Phase 1: MongoDB (2015–2017)

Pain Points That Emerged at Scale

Phase 2: Apache Cassandra (2017–2022)

Data Model Design

Why Not User-ID as Partition Key

Hot Partition Problem

JVM GC Pauses: The Critical Failure Mode

Phase 3: ScyllaDB (Migration Completed 2022–2023)

1. No JVM — No GC Pauses

2. Shard-Per-Core Architecture

3. Dramatic Infrastructure Reduction

Migration Strategy

Dual-Write Pattern

Rust Bulk Migrator

Tombstone Stall Lesson

Sampling Validation

Schema Analysis: Why Channel-First

Reactions and Embeds as Denormalized Columns

Key Lessons

Common Mistakes

Related Chapters

Practice Questions

Beginner

Intermediate

Advanced

References