
Chapter 14: Discord: Data Layer Evolution

"The best migration is the one you never have to do. The second best is the one you do once and never have to undo." — Discord Engineering

Prerequisites

This chapter assumes familiarity with NoSQL data models (Ch07), particularly Cassandra tombstones and compaction. Review that first if needed.

Phase 1: MongoDB (2015–2017)

Discord launched in May 2015 and chose MongoDB for the same reason many startups do: flexible schema, fast iteration, and a document model that matched the early mental model of "a message is a document."

The initial message schema was simple:

```javascript
// MongoDB message document (2015)
{
  "_id": ObjectId("507f1f77bcf86cd799439011"),
  "channel_id": "123456789",
  "author_id": "987654321",
  "content": "Hello, world!",
  "timestamp": ISODate("2015-05-13T10:00:00Z"),
  "mentions": ["111111111"],
  "embeds": [],
  "attachments": []
}
```

For the first year, this worked well. The document model matched the application's data — messages have variable structure (some have attachments, some have embeds, some have reactions). MongoDB's secondary indexes on channel_id and timestamp made "load the last 50 messages in a channel" fast.
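
The shape of that hot path can be sketched as follows (an illustrative sketch, not Discord's actual code; `recent_messages_query` and the literal IDs are made up for the example):

```python
# Sketch of the MongoDB-era hot path: "load the last 50 messages in a channel".
# The compound index plus this filter/sort/limit shape is what made the query
# a single index scan.

def recent_messages_query(channel_id: str, limit: int = 50):
    """Build the filter, sort, and limit for the newest messages in a channel."""
    filter_doc = {"channel_id": channel_id}
    sort_spec = [("timestamp", -1)]  # newest first; served by the compound index
    return filter_doc, sort_spec, limit

# With pymongo this would run as:
#   db.messages.create_index([("channel_id", 1), ("timestamp", -1)])
#   db.messages.find(filter_doc).sort(sort_spec).limit(limit)

f, s, n = recent_messages_query("123456789")
```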

Pain Points That Emerged at Scale

By 2017, Discord had 100 million messages per day and the MongoDB pain points were undeniable:

RAM-bound indexes: MongoDB requires its working set (active data + indexes) to fit in RAM for good performance. The messages collection's index on (channel_id, timestamp) was larger than available RAM on a single machine. Queries that fell outside the hot cache caused disk seeks.

Single-document transaction model: MongoDB 3.x (what Discord used) had no multi-document transactions. Updating a message and updating a channel's last_message_id in the same operation was not atomic.

No native TTL at scale: MongoDB's TTL index worked for simple cases but struggled with the rate at which Discord needed to expire and archive messages at scale.

Scaling model mismatch: MongoDB's sharding required careful shard key selection. Discord had already built too much application logic assuming a single MongoDB instance, making a retrofit expensive.

Why Not MongoDB at Scale

MongoDB's original storage engine (MMAPv1) was superseded by WiredTiger (introduced in 3.0, the default from 3.2), which improved performance significantly. However, by 2017, Discord's access pattern was fundamentally append-heavy and time-ordered: messages are rarely updated and rarely deleted. This is exactly the pattern that wide-column stores with LSM engines are designed for. MongoDB's document model added flexibility Discord did not need, without the write throughput it did need.

Phase 2: Apache Cassandra (2017–2022)

Discord chose Cassandra for its write throughput, linear horizontal scalability, and time-ordered data model. Messages are append-heavy: Discord sends billions of messages that are rarely edited and almost never deleted. Cassandra's LSM storage engine is purpose-built for this pattern. See Ch01 — Storage Engines for the LSM fundamentals underlying this choice.

Data Model Design

The message table design was the most critical architectural decision:

```sql
-- Discord's Cassandra messages table (simplified)
CREATE TABLE messages (
    channel_id   BIGINT,
    bucket       INT,             -- time bucket: timestamp_ms / bucket_size_ms
    message_id   BIGINT,          -- Snowflake ID (encodes timestamp)
    author_id    BIGINT,
    content      TEXT,
    mentions     LIST<BIGINT>,
    reactions    MAP<TEXT, INT>,  -- emoji -> count (denormalized)
    edited_at    TIMESTAMP,
    deleted      BOOLEAN,
    PRIMARY KEY ((channel_id, bucket), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
```

Why (channel_id, bucket) as the partition key?

Cassandra stores all rows in a partition together on the same node. The natural partition key — just channel_id — creates unbounded partitions: a Discord server active for years would accumulate millions of messages in a single partition, eventually exceeding Cassandra's practical partition size guidance (keep partitions under ~100MB; the hard limit is ~2 billion cells).

The time bucket solves this. Each bucket represents a fixed time window (e.g., 10 days of messages). When a bucket fills, it becomes read-only and compaction finishes quickly. The application calculates the bucket at query time:

```python
BUCKET_SIZE_DAYS = 10
EPOCH_MS = 1420070400000  # Jan 1, 2015 in milliseconds (Discord's Snowflake epoch)

def get_bucket(message_id: int) -> int:
    # message_id is a Snowflake: top 42 bits encode a millisecond timestamp
    timestamp_ms = (message_id >> 22) + EPOCH_MS
    return timestamp_ms // (1000 * 86400 * BUCKET_SIZE_DAYS)

def get_messages(channel_id: int, before_id: int, limit: int = 50):
    bucket = get_bucket(before_id)
    results = []

    while len(results) < limit and bucket >= 0:
        rows = session.execute(
            "SELECT * FROM messages "
            "WHERE channel_id = %s AND bucket = %s "
            "AND message_id < %s ORDER BY message_id DESC LIMIT %s",
            [channel_id, bucket, before_id, limit - len(results)]  # fetch only what is still needed
        )
        results.extend(rows)
        bucket -= 1    # walk backwards through time buckets

    return results
```
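
To make the Snowflake-to-bucket arithmetic concrete, here is a worked example using the constants above (`make_snowflake` is a hypothetical helper that ignores the worker/process/sequence bits real IDs carry in their low 22 bits):

```python
BUCKET_SIZE_DAYS = 10
EPOCH_MS = 1420070400000  # Jan 1, 2015 UTC, in milliseconds

def make_snowflake(timestamp_ms: int) -> int:
    # Illustrative: pack a millisecond timestamp into the Snowflake layout,
    # leaving the low 22 bits (worker/process/sequence) at zero.
    return (timestamp_ms - EPOCH_MS) << 22

def get_bucket(message_id: int) -> int:
    # Recover the absolute timestamp, then floor-divide into 10-day windows.
    timestamp_ms = (message_id >> 22) + EPOCH_MS
    return timestamp_ms // (1000 * 86400 * BUCKET_SIZE_DAYS)

# A message from the epoch itself and one sent 25 days later land
# in buckets 2 apart (day 0 vs day 25, with 10-day windows).
first = make_snowflake(EPOCH_MS)
later = make_snowflake(EPOCH_MS + 25 * 86400 * 1000)
```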

Why Not User-ID as Partition Key

An alternative design partitions by author_id instead of channel_id. Discord rejected this because:

  • The primary read pattern is "load messages in channel X" — not "load all messages by user Y"
  • Partitioning by author would require scatter-gather across all authors in a channel for every message load
  • Channel-based partitioning means loading a channel's history touches one partition per time bucket: a small number of sequential partition reads rather than a cluster-wide fan-out

Hot Partition Problem

Large Discord servers ("guilds") with millions of members generate thousands of messages per second in a single channel. All of those writes land on the same (channel_id, bucket) partition, and therefore on the same small set of replica nodes.

Discord addressed hot partitions through:

  1. Replication factor 3 — each partition has 3 replicas, so reads can be spread across 3 nodes instead of concentrating on one (every replica still receives every write)
  2. Cassandra token assignment — the consistent hashing ring distributes the 3 replicas across different physical nodes
  3. Rate limiting — server-side rate limits prevent any single channel from exceeding burst thresholds
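
The rate-limiting mitigation can be sketched as a per-channel token bucket (a minimal sketch with made-up parameters; Discord's production limiter is server-side and considerably more elaborate):

```python
import time

class ChannelRateLimiter:
    """Per-channel token bucket: an illustrative sketch, not Discord's implementation."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec   # steady-state refill rate
        self.burst = burst         # maximum tokens a channel can accumulate
        self.buckets = {}          # channel_id -> (tokens, last_refill_timestamp)

    def allow(self, channel_id, now=None) -> bool:
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(channel_id, (self.burst, now))
        # Refill proportionally to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[channel_id] = (tokens - 1, now)
            return True
        self.buckets[channel_id] = (tokens, now)
        return False
```

Injecting `now` makes the limiter testable without real clock time; a production limiter would also need to evict idle channels from `buckets`.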

JVM GC Pauses: The Critical Failure Mode

Cassandra runs on the JVM. As of 2022, Discord was running Cassandra 3.x with Java heap sizes of 16–32GB. Stop-the-world GC pauses (even with G1GC or ZGC) caused:

  • P99 latency spikes from 40ms to 300–500ms during GC
  • Occasional full GC pauses causing 2–10 second outages visible to users
  • Compaction pressure: background compaction (merging SSTables) generated enough short-lived garbage to trigger additional GC cycles

Discord's SRE team spent significant engineering time tuning JVM flags, heap sizes, and GC algorithms. They could reduce pause frequency but not eliminate it at their write rates.

Phase 3: ScyllaDB (2022–Present)

ScyllaDB is a C++ reimplementation of the Cassandra storage engine. It is API-compatible with Cassandra (same CQL, same data model, same replication protocol) but eliminates the JVM entirely. Discord's decision to migrate was driven by three ScyllaDB advantages:

1. No JVM — No GC Pauses

ScyllaDB is written in C++ using the Seastar framework. Memory management is explicit; there is no garbage collector. Compaction runs without causing stop-the-world pauses. Discord's GC-induced P99 spikes disappeared.

2. Shard-Per-Core Architecture

ScyllaDB assigns each CPU core its own shard of data and handles requests for that shard without cross-core communication. This eliminates lock contention that occurs in the JVM Cassandra implementation when multiple threads compete for shared data structures.
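
The routing idea can be sketched as a deterministic mapping from partition key to owning core, so per-partition state is only ever touched by one core (illustrative only; ScyllaDB's real token-to-shard mapping is based on Murmur3 tokens and shard bits, not a simple modulo, and the hash below is a stand-in):

```python
# Sketch of shard-per-core routing: every partition key hashes to a token,
# and each token is owned by exactly one core, so per-partition state needs
# no cross-core locks.

NUM_CORES = 8

def hash_token(partition_key: bytes) -> int:
    # Stand-in for Murmur3 token hashing; 64-bit FNV-1a used here for brevity.
    h = 0xcbf29ce484222325
    for b in partition_key:
        h = ((h ^ b) * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

def owning_shard(partition_key: bytes, num_cores: int = NUM_CORES) -> int:
    return hash_token(partition_key) % num_cores

# The same key always routes to the same core:
s1 = owning_shard(b"channel:42:bucket:1643")
s2 = owning_shard(b"channel:42:bucket:1643")
```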

3. Dramatic Infrastructure Reduction

Discord's results after migrating message storage to ScyllaDB:

| Metric | Cassandra | ScyllaDB | Improvement |
| --- | --- | --- | --- |
| Node count | 177 nodes | 72 nodes | 59% reduction |
| P99 read latency | 40–125ms | 15ms | 3–8× better |
| P99 write latency | 5ms | 5ms | Comparable |
| GC-induced spikes | Frequent | None | Eliminated |
| Latency jitter | High | Low | Predictable |

Migration Strategy

The MongoDB→Cassandra and Cassandra→ScyllaDB migrations followed similar patterns but with different tooling.

Dual-Write Pattern

Every migration began with a dual-write phase: the application writes to both old and new stores simultaneously.

Key principles:

  • New DB is the primary write target; old DB is the fallback read source
  • Reads check new DB first; fall back to old DB if the message has not yet been migrated
  • Dual-write runs for weeks before backfill begins, ensuring all new messages land in the new store
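
The read/write paths described above can be sketched with in-memory dicts standing in for the two databases (`DualWriteStore` is a hypothetical name; error handling is reduced to the essentials):

```python
# Sketch of the dual-write / fallback-read pattern. The new DB is the
# source of truth: its write must succeed, while a failed write to the
# old DB is tolerated.

class DualWriteStore:
    def __init__(self, new_db: dict, old_db: dict):
        self.new_db = new_db
        self.old_db = old_db

    def write(self, key, value):
        self.new_db[key] = value       # primary write target; must succeed
        try:
            self.old_db[key] = value   # best-effort; failure here is acceptable
        except Exception:
            pass

    def read(self, key):
        # Prefer the new DB; fall back for rows not yet backfilled.
        if key in self.new_db:
            return self.new_db[key]
        return self.old_db.get(key)

store = DualWriteStore(new_db={}, old_db={"m1": "legacy row"})
store.write("m2", "fresh row")
```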

Rust Bulk Migrator

For the Cassandra→ScyllaDB migration, Discord built a bulk migrator in Rust. The choice of Rust was deliberate:

  • No GC: migrating data at maximum throughput while avoiding GC pauses in the migrator itself
  • Controlled memory: Rust's ownership model ensures the migrator does not accumulate unbounded heap
  • Throughput achieved: 3.2 million messages per second sustained throughput

Tombstone Stall Lesson

The most painful incident during migration was the tombstone crisis. In Cassandra and ScyllaDB, deletes do not immediately remove data — they write a tombstone marker. The compaction process eventually removes tombstones and associated data. But during compaction, a read must scan past all tombstones to find live data.

Discord's messages table had a deleted column (soft deletes) and also used actual CQL DELETE statements for moderation use cases. During the migration backfill, old tombstones accumulated faster than compaction could clear them. Read latency on channels with heavy moderation activity spiked because each read scanned thousands of tombstones.

Resolution: Discord changed the data model to use a deleted BOOLEAN column and never issue CQL DELETE — instead updating the column to true. This produces a tombstone-free compaction path. The lesson: model deletes as updates, not as deletes, in LSM-based databases.
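
A minimal sketch of the soft-delete model (the CQL string and row shapes are illustrative, not Discord's code):

```python
# Deletion becomes an UPDATE of a regular column, so no tombstone is
# written; the read path filters the flag instead.

SOFT_DELETE_CQL = (
    "UPDATE messages SET deleted = true "
    "WHERE channel_id = %s AND bucket = %s AND message_id = %s"
)
# vs. the tombstone-producing form this replaced:
#   DELETE FROM messages WHERE channel_id = ... AND message_id = ...

def visible(rows):
    """Filter soft-deleted rows on the read path."""
    return [r for r in rows if not r.get("deleted")]

rows = [
    {"message_id": 1, "content": "hi", "deleted": False},
    {"message_id": 2, "content": "[moderated]", "deleted": True},
]
```

The trade-off: soft-deleted rows still occupy storage until a separate cleanup process (if any) rewrites them, which Discord accepted in exchange for predictable read latency.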

Sampling Validation

After backfill, Discord validated correctness by sampling:

  1. Random 0.1% of channel_id values
  2. Fetch last 1,000 messages from each in both old and new DB
  3. Compare message_id sets and content checksums
  4. Alert if mismatch rate exceeded 0.001%

This approach validated correctness without the cost of reading every row in both databases.
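
A sketch of such a sampling validator (function names and the checksum scheme are assumptions, not Discord's tooling; the fetch callables stand in for queries against each database):

```python
import hashlib
import random

def checksum(messages) -> str:
    """Order-independent digest of a channel's (message_id, content) pairs."""
    h = hashlib.sha256()
    for m in sorted(messages, key=lambda m: m["message_id"]):
        h.update(str(m["message_id"]).encode())
        h.update(m["content"].encode())
    return h.hexdigest()

def validate_sample(channel_ids, fetch_old, fetch_new, sample_rate=0.001, seed=0):
    """Compare checksums on a random sample; return (sample size, mismatched channels)."""
    rng = random.Random(seed)
    sample = [c for c in channel_ids if rng.random() < sample_rate]
    mismatches = [c for c in sample
                  if checksum(fetch_old(c)) != checksum(fetch_new(c))]
    return len(sample), mismatches

# Usage with synthetic data and one deliberately corrupted channel:
old = {c: [{"message_id": i, "content": f"msg {i}"} for i in range(3)] for c in (1, 2, 3)}
new = {c: list(v) for c, v in old.items()}
new[2] = [{"message_id": 0, "content": "corrupted"}]
n, bad = validate_sample([1, 2, 3], old.__getitem__, new.__getitem__, sample_rate=1.0)
```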

Schema Analysis: Why Channel-First

The (channel_id, bucket) partition key optimizes for the 99% case: "load messages in channel X." But it creates inefficiency for "find all messages by user Y." Discord made an explicit product decision: users cannot search their own message history across servers. This removed the requirement for an author-indexed table.

| Access Pattern | Efficiency | Notes |
| --- | --- | --- |
| Load last N messages in channel | Excellent | Single partition read |
| Load message by ID | Excellent | Derive partition from Snowflake timestamp |
| Search by keyword in channel | Good | Elasticsearch sidecar handles this |
| All messages by user across servers | Not supported | Product decision |
| Count unread messages per channel | Good | Read from latest bucket only |

Reactions and Embeds as Denormalized Columns

Reactions (emoji responses to messages) and embeds (link previews) are stored as map columns within the message row rather than separate tables:

```sql
-- Reactions stored as MAP<TEXT, INT> within the message row
-- Query: SELECT reactions FROM messages WHERE ...
-- Result:
reactions = {
    '👍': 42,
    '❤️': 15,
    '😂': 7
}
```

This denormalization avoids secondary index lookups for the common case (display a message with its reaction counts). The cost: if reactions need to be queried independently (e.g., "show all messages reacted to with ❤️"), no efficient query path exists. Discord accepted this limitation because that access pattern is not exposed in the product. This is the same denormalization principle seen in Instagram's like_count pattern.

Key Lessons

| Lesson | Detail |
| --- | --- |
| Best migration is the one you avoid | MongoDB worked for 2 years; the Cassandra migration was significant engineering cost |
| Model deletes as updates in LSM stores | CQL DELETE creates tombstones that accumulate and stall reads; use a deleted boolean instead |
| Test with production access patterns | Benchmarks that don't replicate hot partitions and time-bucket traversals miss the real bottleneck |
| Rust for data migration pipelines | No GC and controlled memory achieved 3.2M msg/sec throughput; Python or Java migrators would GC under this load |
| Monitor compaction health, not just query latency | Cassandra's compaction falling behind was a leading indicator 30 minutes before query latency degraded |
| Shard-per-core eliminates JVM as the bottleneck | ScyllaDB's architecture shift — not just a language change — reduced node count by 59% |

Common Mistakes

| Mistake | Why It Happens | Impact | Fix |
| --- | --- | --- | --- |
| Choosing Cassandra for small datasets | "It scales to petabytes" | Complex operational model, tombstone management, and eventual consistency with no benefit below 100GB | Use PostgreSQL or MongoDB for datasets under 1TB; Cassandra's strengths emerge at massive write volume |
| Not testing with production tombstone patterns | Dev data has no deletes | Tombstone accumulation not discovered until production read latency spikes | Load test with realistic delete rates; set tombstone_warn_threshold monitoring before go-live |
| Skipping dual-write validation before full cutover | "Migration tool was tested" | Subtle data transformation bugs only appear on edge-case documents | Run shadow reads comparing old vs new for at least 48 hours across all access patterns before cutover |
| Underestimating JVM GC impact on tail latency | "Average latency is fine" | p99 latency spikes during GC pauses invisible in average metrics | Monitor p99/p999 latency, not average; GC pauses show up only in tail percentiles |
| Chapter | Relevance |
| --- | --- |
| Ch01 — Database Landscape | LSM vs B-tree storage engine tradeoffs underlying this evolution |
| Ch02 — Data Modeling for Scale | Access-pattern-driven schema design used in channel + bucket key |
| Ch09 — Replication & High Availability | Cassandra's tunable consistency used during dual-write migration |
| Ch10 — Sharding & Partitioning | Token ring and partition key selection relevant to bucket design |
| Ch13 — Instagram: PostgreSQL at Scale | Parallel case study: denormalization and sharding in a different DB family |
| Ch16 — Database Selection Framework | Applying Discord's migration lessons to future selection decisions |

Practice Questions

Beginner

  1. Tombstone Behavior: A Discord channel has 1 million messages of which 50,000 have been moderated (deleted) using CQL DELETE. How does this affect read performance when loading the most recent 50 messages? What alternative data model avoids this problem?

    Model Answer: Cassandra and ScyllaDB mark deleted rows with tombstones. Reads must scan past tombstones to find live rows. With 50,000 tombstones scattered across the partition, reading 50 live messages requires scanning many more entries. Alternative: use a `deleted BOOLEAN` column and `UPDATE messages SET deleted = true WHERE ...` — this writes a regular value update, not a tombstone, so compaction handles it efficiently.
  2. Time Bucketing: Discord uses 10-day time buckets for message partitions. What happens when you load the last 50 messages in a channel and the most recent bucket has only 20 messages? Describe the query behavior step by step.

    Model Answer: The application calculates the current bucket from the most recent message_id, queries that bucket, and gets 20 messages. Since fewer than 50 were returned, it decrements the bucket number and queries again — getting up to 30 more messages from the previous bucket. Each bucket query is a single Cassandra partition read. This continues until 50 total messages are found or the earliest bucket is reached.

Intermediate

  1. Hot Partition Mitigation: A large Discord server has 2 million members and one main channel that receives 10,000 messages per minute during live events. How would you detect this hot partition, and what options exist to mitigate it without changing the data model?

    Model Answer: Detection: monitor per-partition write rate using ScyllaDB's shard-level metrics or Cassandra's nodetool tablestats. Mitigation options: (1) Read replicas with LOCAL_QUORUM reads to distribute read load across 3 nodes. (2) Application-level rate limiting on writes per channel. (3) Token-aware driver so each write goes directly to the correct replica without coordinator overhead. (4) Increase replication factor so reads can be spread across more nodes.
  2. Dual-Write Migration Risk: During the Cassandra→ScyllaDB migration, a user edits a message. The edit lands in ScyllaDB but the async write to Cassandra fails due to a network partition. An hour later, the migration completes and Cassandra is decommissioned. Was any data lost? What if the failure was reversed (Cassandra succeeds, ScyllaDB fails)?

    Model Answer: If ScyllaDB (new DB) gets the write and Cassandra (old DB) does not: no data is lost — ScyllaDB is the source of truth. If Cassandra gets the write and ScyllaDB does not: the edit is lost in ScyllaDB. The message shows the pre-edit version. This is why the migration framework must treat the new DB as the primary write target — new DB success is required; old DB failure is acceptable.

Advanced

  1. ScyllaDB vs Cassandra Architecture: Discord reduced from 177 Cassandra nodes to 72 ScyllaDB nodes while handling the same workload. Explain the architectural reasons why ScyllaDB requires fewer nodes. What does the shard-per-core model eliminate that Cassandra's JVM thread pool cannot? Under what conditions might ScyllaDB's advantage shrink?

    Model Answer: Three factors: (1) No JVM overhead — ScyllaDB's C++ uses less memory per operation, so more CPU budget goes to actual work vs. GC bookkeeping. (2) Shard-per-core eliminates cross-core synchronization — each core handles its partitions without locks. (3) Predictable compaction — no GC pauses mean compaction runs continuously at steady state rather than in bursts. ScyllaDB's advantage shrinks with very low concurrency (a single-threaded workload doesn't benefit from shard-per-core) and with workloads that are network-bound rather than CPU-bound.
