Chapter 6: MySQL & Distributed SQL
"MySQL won the web era. Vitess is MySQL's answer to the distributed era. NewSQL is the question after that."
Mind Map
Prerequisites
This chapter assumes you understand storage engine fundamentals, B-tree vs LSM trade-offs (Ch01), and MVCC/isolation levels (Ch04). Review those first if needed.
InnoDB Internals
InnoDB has been MySQL's default storage engine since 5.5 (2010). Understanding its internals is a prerequisite for tuning and troubleshooting any production MySQL system.
Buffer Pool
The buffer pool is InnoDB's equivalent of PostgreSQL's shared_buffers: an in-memory page cache that holds data pages and index pages. The critical difference: InnoDB manages its own LRU list with a "midpoint insertion strategy" to prevent full table scans from evicting hot OLTP pages.
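The midpoint strategy can be sketched in a few lines. This is an illustrative model, not InnoDB's actual implementation: the list is split into a hot "new" sublist and a cold "old" sublist (InnoDB keeps roughly 3/8 of the list old by default), misses are inserted at the midpoint rather than the head, and a page is promoted to the hot sublist only on a second access.

```python
from collections import deque

class MidpointLRU:
    """Sketch of InnoDB's midpoint-insertion LRU (illustrative only).
    The list splits into a "new" (hot) and "old" (cold) sublist; by
    default InnoDB keeps roughly 3/8 of the list in the old sublist."""

    def __init__(self, capacity, old_ratio=3 / 8):
        self.capacity = capacity
        self.old_target = max(1, int(capacity * old_ratio))
        self.new = deque()  # hot sublist, head = most recently used
        self.old = deque()  # cold sublist, head = the "midpoint"

    def access(self, page):
        if page in self.new:            # hot hit: move to head of new
            self.new.remove(page)
            self.new.appendleft(page)
        elif page in self.old:          # second touch promotes to hot
            # (real InnoDB also requires innodb_old_blocks_time to elapse)
            self.old.remove(page)
            self.new.appendleft(page)
        else:                           # miss: insert at the midpoint,
            self.old.appendleft(page)   # NOT at the head of the list
        self._rebalance()

    def _rebalance(self):
        # Demote overflow from the hot tail to the midpoint.
        while len(self.new) > self.capacity - self.old_target:
            self.old.appendleft(self.new.pop())
        # Evict from the cold tail: a table scan's pages die here
        # without ever displacing the hot OLTP working set.
        while len(self.new) + len(self.old) > self.capacity:
            (self.old or self.new).pop()
```

A full scan touches each page once, so scanned pages enter and leave the old sublist without ever reaching the hot sublist — exactly the protection the midpoint insertion provides.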
# my.cnf: InnoDB buffer pool configuration
[mysqld]
innodb_buffer_pool_size = 48G # 75% of 64GB RAM
innodb_buffer_pool_instances = 8 # One per ~6GB; reduces mutex contention
innodb_buffer_pool_chunk_size = 128M # Pool size must be a multiple of chunk_size x instances
Clustered Index and Secondary Index Architecture
InnoDB organizes every table as a B-tree indexed by the primary key β this is the clustered index. The table data is the index; there is no separate heap file. This is fundamentally different from PostgreSQL, where heap files and indexes are separate structures.
The secondary index double-read problem: Every secondary index lookup requires two B-tree traversals: one to find the primary key in the secondary index, then a second on the clustered index to retrieve the full row. This is why SELECT * via a secondary index is expensive. Use covering indexes to avoid the second lookup:
-- Without covering index: two B-tree traversals
SELECT name, email FROM users WHERE email = 'a@x.com';
-- With covering index: one traversal, data in index leaf
CREATE INDEX idx_users_email_name ON users (email, name);
-- Now (email, name) are both in the index: no clustered index lookup needed
Undo Log and Redo Log
InnoDB uses two separate log structures:
| Log | Purpose | Location | Behavior |
|---|---|---|---|
| Redo Log | Crash recovery: replays committed changes | ib_logfile0, ib_logfile1 | Sequential writes, fixed size (ring buffer) |
| Undo Log | MVCC: provides old row versions for snapshots | System tablespace or undo tablespaces | Grows with long transactions |
# Redo log tuning: larger = fewer checkpoints, longer recovery time
innodb_log_file_size = 2G # Each file; total = 4GB (MySQL 8.0.30+ uses innodb_redo_log_capacity instead)
innodb_log_files_in_group = 2 # Standard is 2
innodb_log_buffer_size = 64M # In-memory buffer before fsync
# Undo tablespace tuning
innodb_undo_tablespaces = 4 # Separate undo files (MySQL 8.0+)
innodb_max_undo_log_size = 1G # Trigger undo truncation above this
Long Transactions Bloat the Undo Log
A transaction left open (even idle) forces InnoDB to retain all row versions modified since that transaction started. A 6-hour-old read transaction on a high-write table can cause the undo log to grow by gigabytes. Monitor INFORMATION_SCHEMA.INNODB_TRX and kill idle transactions.
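The reaper logic for idle transactions is simple enough to sketch. The column names below (trx_started, trx_mysql_thread_id) are the real INFORMATION_SCHEMA.INNODB_TRX columns; the dict-based rows and the 30-minute threshold are assumptions for illustration. A production reaper should also confirm via performance_schema that the session is genuinely idle before killing it.

```python
from datetime import datetime, timedelta

def kill_candidates(innodb_trx_rows, now, max_age=timedelta(minutes=30)):
    """Given rows from INFORMATION_SCHEMA.INNODB_TRX (modeled here as
    dicts keyed by the real column names), return KILL statements for
    transactions that have been open longer than max_age."""
    stmts = []
    for row in innodb_trx_rows:
        if now - row["trx_started"] > max_age:
            # KILL terminates the whole connection, rolling back the
            # transaction and releasing its undo history.
            stmts.append(f"KILL {row['trx_mysql_thread_id']};")
    return stmts
```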
MySQL Replication
Binary Log and GTID
MySQL replication is driven by the binary log (binlog), which records changes either as SQL statements (statement-based) or as row images (row-based, the default since 5.7). Unlike PostgreSQL's physical WAL shipping, MySQL replicas replay these logical statement or row events from the primary's binlog.
GTID (Global Transaction Identifier): Each transaction gets a globally unique ID (server_uuid:sequence_number). GTIDs make failover deterministic: a replica knows exactly which transactions it has applied, enabling automatic primary election without manual binlog position tracking.
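The set arithmetic behind GTID-based failover can be sketched directly. The server does this natively with GTID_SUBTRACT(); the parser below handles simple sets like `uuid:1-10` or `uuid:1-5:7` and is an illustration, not a full implementation of the GTID set grammar.

```python
def parse_gtid_set(s):
    """Parse a GTID set like 'uuid:1-5:7,uuid2:1-3' into
    {server_uuid: set_of_sequence_numbers}. Simplified sketch."""
    out = {}
    for part in s.replace("\n", "").split(","):
        uuid, *ranges = part.strip().split(":")
        seqs = out.setdefault(uuid, set())
        for r in ranges:
            lo, _, hi = r.partition("-")       # '7' or '1-5'
            seqs.update(range(int(lo), int(hi or lo) + 1))
    return out

def gtid_subtract(primary, replica):
    """Transactions executed on the primary but not yet applied on the
    replica -- the same question GTID_SUBTRACT() answers on the server."""
    p, r = parse_gtid_set(primary), parse_gtid_set(replica)
    return {u: sorted(p[u] - r.get(u, set()))
            for u in p if p[u] - r.get(u, set())}
```

If the result is empty, the replica has applied everything the primary executed and is safe to promote.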
-- Enable GTID replication (MySQL 5.6+)
-- my.cnf:
-- gtid_mode = ON
-- enforce_gtid_consistency = ON
-- Check GTID status
SHOW MASTER STATUS\G -- deprecated in 8.0.34; SHOW BINARY LOG STATUS in MySQL 8.2+
SHOW REPLICA STATUS\G
-- Find transactions on primary not yet on replica
SELECT GTID_SUBTRACT(
@@GLOBAL.gtid_executed, -- primary
'...' -- replica's gtid_executed
) AS missing_transactions;
MySQL Group Replication and InnoDB Cluster
Group Replication adds Paxos-based consensus (the XCom group communication engine) to MySQL replication, enabling multi-primary writes with automatic conflict detection. InnoDB Cluster packages Group Replication with MySQL Shell and MySQL Router for production deployments.
| Mode | Description | Write Throughput | Use Case |
|---|---|---|---|
| Single-Primary | One writer, automatic failover | Limited by single node | OLTP with HA requirement |
| Multi-Primary | All nodes accept writes, conflict detection | Higher, but with coordination overhead | Geographically distributed writes |
Group Replication Latency
Every write in Group Replication must be delivered to and agreed by a majority of members, then pass conflict certification, before it commits. This adds roughly 1–5ms of latency in a local cluster and 10–50ms in a geo-distributed one. For write-heavy workloads, this overhead matters; Vitess scales better for very high write rates.
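A back-of-the-envelope model for that latency: the writeset must reach a majority, so commit is gated by the slowest member of the fastest majority. This sketch ignores certification CPU cost and queueing, and the RTT numbers in the test are assumed values, not measurements.

```python
def group_commit_latency(rtts_ms):
    """Estimate the consensus-added latency for one Group Replication
    write. rtts_ms: round-trip times from the writer to every member
    (including itself, as 0). With n members, the write needs acks from
    a majority (n // 2 + 1), so latency is the (majority)-th smallest RTT.
    Simplified model: no queueing, no certification cost."""
    majority = len(rtts_ms) // 2 + 1
    return sorted(rtts_ms)[majority - 1]
```

With a 3-node local cluster the penalty is one fast LAN round trip; add two remote regions and the majority must include at least one WAN hop, which is where the 10–50ms figure comes from.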
Vitess: MySQL at YouTube Scale
Vitess was built at YouTube in 2010 to scale MySQL horizontally without changing the application. It now powers Shopify, GitHub, and Slack, and is the foundation of PlanetScale.
Architecture
VTGate is a stateless proxy layer that speaks the MySQL protocol. Applications connect to VTGate exactly as if connecting to MySQL. VTGate parses queries, consults the topology service for the shard map, and routes queries to the appropriate VTTablet(s).
VTTablet runs alongside each MySQL instance, managing connection pooling (removing the need for an external pooler such as ProxySQL), query rewriting, schema tracking, and replication health checks.
VSchema: Defining the Sharding Strategy
VSchema is a JSON schema that tells VTGate how data is distributed:
{
"sharded": true,
"vindexes": {
"user_id_hash": {
"type": "hash"
}
},
"tables": {
"orders": {
"column_vindexes": [
{
"column": "user_id",
"name": "user_id_hash"
}
]
}
}
}
Vitess automatically routes WHERE user_id = 42 to the correct shard. Cross-shard queries trigger a scatter-gather: VTGate sends the query to all shards and merges the results.
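The routing step can be sketched as: hash the shard key to a keyspace id, then find the shard whose keyspace-id range contains it. Note the hash below is a stand-in — Vitess's real `hash` vindex uses a 3DES block cipher, but any fixed uniform mapping illustrates the mechanism. Shard names like "-80" / "80-" follow Vitess's convention of naming shards by their keyspace-id range.

```python
import hashlib

def keyspace_id(user_id):
    """Stand-in for a hash vindex: map the shard key to an 8-byte
    keyspace id (Vitess's real 'hash' vindex uses 3DES, not SHA-256)."""
    return hashlib.sha256(str(user_id).encode()).digest()[:8]

def route(user_id, shards):
    """shards: list of (name, lower, upper) keyspace-id ranges, where an
    empty bytes value means an open end, e.g. ('-80', b'', b'\x80')."""
    kid = keyspace_id(user_id)
    for name, lo, hi in shards:
        if (not lo or kid >= lo) and (not hi or kid < hi):
            return name
    raise ValueError("no shard covers this keyspace id")
```

Because the mapping is deterministic, VTGate can always send `WHERE user_id = 42` to exactly one shard; a query without the shard key gives it no keyspace id to compute, forcing a scatter.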
Scatter-Gather and Cross-Shard Queries
-- Single-shard query (fast): routes to exactly one MySQL shard
SELECT * FROM orders WHERE user_id = 42;
-- Scatter-gather query (slow): hits all shards, merges in VTGate
SELECT COUNT(*) FROM orders WHERE status = 'pending';
-- Cross-shard JOIN (very expensive): scattered, then joined in VTGate memory
SELECT o.id, u.name
FROM orders o JOIN users u ON o.user_id = u.id
WHERE o.status = 'pending';
Scatter-Gather at Scale
A scatter-gather query against 100 shards takes 100× the resources of a single-shard query. Design your access patterns so that common queries always include the shard key. Analytics queries should be routed to a separate ClickHouse cluster, not scattered across MySQL shards.
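The merge step VTGate performs for a cross-shard ORDER BY ... LIMIT can be sketched as a k-way merge of per-shard sorted results. This is a conceptual model, not VTGate's code; the point it shows is that every shard does its full sort even when only `limit` rows survive the merge.

```python
import heapq
from itertools import islice

def scatter_gather(shard_results, limit):
    """Model of VTGate's cross-shard ORDER BY ... LIMIT handling:
    each shard returns its rows already sorted by the ORDER BY key;
    the proxy k-way merges them in memory and truncates to the limit.
    shard_results: list (one per shard) of sorted (sort_key, row) lists."""
    return list(islice(heapq.merge(*shard_results), limit))
```

Adding shards adds inputs to the merge, so latency and fan-out cost grow with shard count — which is why the admonition above recommends keeping the shard key in every hot query.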
Resharding: Online Horizontal Scaling
Vitess's killer feature is online resharding β splitting a shard without downtime:
- Copy phase: VTTablet streams all rows from the source shard to two new destination shards, applying ongoing writes to both
- Catchup phase: Destination shards consume the binlog to catch up
- Switch reads: Traffic switches to destination shards (reads first)
- Switch writes: Write traffic migrates, source becomes read-only
- Cleanup: Source shard data is dropped
The application sees zero downtime throughout: VTGate handles routing transparently.
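The copy phase's routing rule can be sketched as follows. Assuming a 2-way split of a shard at keyspace-id 0x80 (shard names per Vitess convention), each row's keyspace id decides its destination; ongoing binlog events are filtered by the same predicate, which is what keeps both destinations consistent through the catchup phase. Illustrative model, not VReplication's implementation.

```python
def split_shard(rows, split_point=0x80):
    """Partition a source shard's rows between the two destinations of a
    2-way split by the first byte of each row's 8-byte keyspace id.
    rows: list of (keyspace_id_bytes, row). The same predicate is applied
    to replayed binlog events during catchup, so neither destination ever
    receives a row outside its range."""
    dests = {"-80": [], "80-": []}
    for kid, row in rows:
        dests["-80" if kid[0] < split_point else "80-"].append((kid, row))
    return dests
```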
NewSQL: Distributed SQL
NewSQL databases offer SQL + ACID + horizontal scale by using distributed consensus (Raft or Paxos) and distributed MVCC. The trade-off: latency.
CockroachDB
CockroachDB stores data as key-value pairs in Pebble (its RocksDB-derived storage engine, the default since v20.2), replicates ranges of keys across nodes via Raft groups, and presents a PostgreSQL-compatible SQL interface.
TiDB
TiDB separates compute (TiDB nodes, MySQL-compatible SQL) from storage (TiKV, distributed key-value). TiFlash adds a columnar replica for HTAP analytics.
Google Spanner
Spanner uses TrueTime (GPS receivers + atomic clocks) to assign globally consistent timestamps, achieving external consistency by waiting out clock uncertainty ("commit wait") rather than routing every transaction through a central timestamp oracle. Available as Cloud Spanner on GCP.
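The commit-wait rule from the Spanner paper can be sketched with an interval clock: TT.now() returns a [earliest, latest] interval, the commit timestamp is latest, and the transaction holds its locks until earliest exceeds that timestamp. The EPSILON value here is an assumed uncertainty bound for illustration (the paper reports single-digit milliseconds).

```python
import time

EPSILON = 0.007  # assumed clock uncertainty bound, in seconds

def tt_now():
    """TrueTime-style interval clock: the true time is guaranteed to lie
    within [earliest, latest]."""
    t = time.monotonic()
    return t - EPSILON, t + EPSILON

def commit():
    """Commit-wait, sketched: assign s = TT.now().latest, then block
    until TT.after(s) -- i.e. until even the earliest possible current
    time exceeds s. This guarantees s is in the past on every node
    before the write becomes visible, at a cost of roughly 2 * EPSILON
    of added write latency."""
    _, s = tt_now()                 # commit timestamp
    while tt_now()[0] <= s:         # wait out the uncertainty window
        time.sleep(EPSILON / 4)
    return s
```

This is the source of the latency overhead in the comparison table below: the tighter the clock uncertainty, the shorter the wait, which is why Spanner invests in dedicated time hardware.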
NewSQL Comparison
| System | Protocol | Consensus | Latency Overhead | Geographic Distribution |
|---|---|---|---|---|
| CockroachDB | PostgreSQL | Raft | +5–20ms single-region | Yes, native multi-region |
| TiDB | MySQL | Raft | +10–20ms | Yes, with TiDB Operator |
| Spanner | proprietary / JDBC | Paxos + TrueTime | +20–50ms | Yes, global |
| Sharded MySQL + Vitess | MySQL | N/A (async replication) | <1ms (local shard) | Manual, cross-region replication |
When NewSQL Beats Sharded MySQL + Vitess
Use NewSQL when:
- You need distributed transactions across shards (e.g., financial transfers between accounts in different shards)
- You want SQL without building a sharding layer
- Your p99 latency budget allows 20–50ms additional overhead
Use Vitess when:
- You are already on MySQL and want to scale horizontally
- Latency is critical and cross-shard transactions are rare
- You need proven scale (YouTube/Shopify-scale deployments)
- You want to stay on standard MySQL tooling
MySQL vs PostgreSQL: 2025 Feature Comparison
| Feature | MySQL 8.x | PostgreSQL 16 |
|---|---|---|
| ACID transactions | Yes (InnoDB) | Yes |
| JSON support | JSON type, limited operators | JSONB with full indexing (GIN), operators |
| Full-text search | Basic FULLTEXT indexes | tsvector, GIN, advanced ranking |
| Partitioning | Range, List, Hash, Key | Range, List, Hash + pg_partman automation |
| Parallel query | Yes (MySQL 8.0+) | More mature, more query types |
| Window functions | Yes (MySQL 8.0+) | Yes, more complete |
| CTEs | Yes (MySQL 8.0+) | Yes, with RECURSIVE |
| Replication | Binlog (async/semi-sync), Group Replication | WAL streaming (sync/async), logical replication |
| Sharding | Via Vitess (external) | Via Citus (extension) |
| Vector search | Native VECTOR type (MySQL 9.0+, 2024) | pgvector (HNSW, IVFFlat, mature ecosystem) |
| Extensions | Limited | Rich ecosystem (300+ extensions) |
| License | GPL (dual-license) | PostgreSQL License (permissive) |
| Default storage | InnoDB (clustered index) | Heap files + B-tree indexes |
Case Study: Shopify's Vitess Migration
Shopify runs one of the world's largest MySQL deployments. By 2014, their monolithic MySQL database could no longer scale vertically to handle Black Friday traffic spikes.
The Problem: A single MySQL primary with read replicas. Write throughput was capped by single-node limits. Schema changes required long ALTER TABLE locks. Flash sales created hotspot contention on specific product rows.
The Migration Path:
2014–2015: Evaluated MySQL sharding options. Chose Vitess (originally built at YouTube, open-sourced 2012) over custom sharding because of its transparent MySQL protocol compatibility.
2016: Deployed Vitess for the products and orders keyspaces. Used Vitess's online schema change tooling (gh-ost integration via VTTablet) to eliminate ALTER TABLE downtime.
2017–2019: Migrated all core tables to Vitess. Implemented VSchema with shop_id as the shard key: every merchant's data lives on a single shard, eliminating cross-shard transactions for the common case.
2020–present: Running hundreds of MySQL shards behind Vitess. The 2020 BFCM (Black Friday / Cyber Monday) processed 44 orders per second at peak (a 76% increase over 2019) with zero database downtime.
Key Technical Decisions:
- shop_id as the universal shard key: all queries for a given shop route to one shard
- VTGate's connection pooling replaced MySQL's per-connection overhead, allowing 100× more application threads
- Online resharding (splitting overloaded shards) done during off-peak hours with zero application downtime
- Vitess's VReplication (internal binlog-based data movement) powers resharding and cross-keyspace moves
Lessons:
- Choosing the right shard key eliminates cross-shard transactions in the common case
- Schema changes are the biggest operational challenge on sharded MySQL; invest in tooling early
- Vitess's metadata management (topology service in etcd) needs high availability itself
Related Chapters
| Chapter | Relevance |
|---|---|
| Ch01 β Database Landscape | Storage engine fundamentals, B-tree vs LSM |
| Ch04 β Transactions & Concurrency | MVCC and isolation levels in InnoDB |
| Ch05 β PostgreSQL in Production | PostgreSQL vs MySQL internals comparison |
| Ch07 β NoSQL at Scale | When to move off SQL entirely |
| System Design Ch09 | SQL database selection framework |
Common Mistakes
| Mistake | Why It Happens | Impact | Fix |
|---|---|---|---|
| Using UUID as InnoDB primary key | UUIDs are globally unique and familiar | Random inserts cause B-tree page fragmentation; secondary index double-reads become expensive | Use auto-increment integer or ordered UUID v7 as clustered index primary key |
| Ignoring Group Replication flow control | "Multi-primary is better" | Flow control throttles writes unpredictably when a node falls behind | Monitor performance_schema.replication_group_member_stats; tune group_replication_flow_control_mode |
| Choosing NewSQL without measuring single-row latency overhead | "Distributed ACID solves everything" | CockroachDB/TiDB add 10–50ms per cross-shard transaction; unacceptable for OLTP | Benchmark p99 latency for your critical queries before migrating; single-region sharded MySQL may be faster |
| Not using Vitess for horizontal MySQL scaling | "We'll shard the application manually" | Manual sharding creates cross-shard JOIN complexity and schema change pain | Vitess provides transparent sharding with online resharding and schema change tooling |
Practice Questions
Beginner
InnoDB Clustered Index: A developer creates a MySQL table with no explicit primary key. InnoDB creates a hidden 6-byte rowid column as the primary key. What performance consequences does this have compared to a meaningful integer primary key? How does this affect secondary index lookups?
Model Answer
Without a meaningful primary key, secondary index lookups carry a 6-byte hidden rowid pointer, functionally the same overhead as any other primary key. The real problem: the hidden rowid is not exposed to the application, making it impossible to build covering indexes that include the PK for common access patterns. Also, UUID primary keys cause random B-tree insertions and fragmentation; use auto-increment integers or ordered UUIDs.
Replication Lag: A MySQL replica reports 30-second replication lag during peak traffic. The primary is writing 5000 rows/second. What are the three most likely causes, and how would you diagnose each?
Model Answer
(1) Single-threaded SQL thread: MySQL 5.6 replicas apply events serially. Fix: enable parallel replication (`slave_parallel_workers = 8`, `slave_parallel_type = LOGICAL_CLOCK`). (2) Large transactions: a single 1M-row UPDATE creates a large binlog event that blocks the SQL thread. (3) Network/disk I/O bottleneck on replica: check `SHOW REPLICA STATUS` for `Seconds_Behind_Source` vs IO thread vs SQL thread lag separately to isolate.
Intermediate
Vitess Scatter-Gather: Your team shards a messages table by user_id across 16 Vitess shards. A product manager requests a "global inbox" feature that shows all unread messages across all users, sorted by timestamp. What does Vitess do with this query, and what is your recommendation?
Model Answer
Vitess scatter-gathers this query to all 16 shards, merges 16 sorted result sets in VTGate memory, and returns the merged result. This is 16× the work of a single-shard query and does not scale with shard count. Recommendation: maintain a separate `global_inbox` denormalized table in a dedicated keyspace with `(timestamp, message_id)` as the access key, written via application-level fan-out on message creation. Or route analytics queries to a ClickHouse replica fed by binlog CDC.
NewSQL Latency Trade-off: Your fintech application processes cross-account transfers. Currently using sharded MySQL + Vitess where accounts can be on different shards. A transaction moving $100 from account A (shard 1) to account B (shard 3) requires a distributed XA transaction, which fails 0.1% of the time and requires manual reconciliation. A colleague proposes migrating to CockroachDB. What are the trade-offs?
Model Answer
CockroachDB pros: native distributed ACID transactions with no XA failures, simpler application code. CockroachDB cons: +10–20ms latency per transaction vs <1ms for single-shard Vitess queries; operational complexity of a new system; less mature ecosystem; cost (CockroachDB cloud vs self-managed MySQL). At a 0.1% failure rate, evaluate: is the reconciliation cost higher than the engineering cost of migrating plus the latency budget cost? For a fintech system where correctness is paramount, CockroachDB's native distributed transactions are worth the latency if your SLA allows it.
Advanced
Vitess Resharding Design: You run 4 Vitess shards for a social media platform sharded by user_id. Shard 0 (users 0–25%) is overloaded because it contains your top influencers (Pareto distribution: the top 1% of users generate 40% of writes). You cannot reshard by user_id alone. Design a solution.
Model Answer
The Pareto distribution problem: hash-based sharding distributes users evenly by count but not by write volume. Options: (1) Custom vindex in Vitess: map hot user IDs to dedicated shards (lookup vindex), keeping normal users on the hash vindex. (2) Separate keyspace for "power users" with finer-grained sharding. (3) Application-level write buffering for influencer writes (rate-limit writes to MySQL, buffer in Redis, flush in batches). (4) Move influencer post fanout to an async queue (Kafka, a separate write path). The real lesson: shard key choice assumes uniform distribution; for power-law distributions, you need tiered sharding or application-level isolation of hot entities.
References
- Vitess Architecture Overview
- MySQL InnoDB Storage Engine Documentation
- CockroachDB Architecture Design
- TiDB Architecture
- Google Spanner Paper β Corbett et al. (2012)
- Shopify Engineering β Resurrecting MySQL
- PlanetScale β How Vitess Works
- "Designing Data-Intensive Applications" β Kleppmann, Ch. 5 (Replication), Ch. 6 (Partitioning)
- MySQL Group Replication Documentation