Chapter 14: Event-Driven Architecture

In a request-driven world, services talk to each other. In an event-driven world, services talk about what happened — and any interested party can listen. That shift from commands to facts is the foundation of resilient, loosely coupled distributed systems.
What Is Event-Driven Architecture?
An event is an immutable record of something that happened: OrderPlaced, PaymentProcessed, UserSignedUp. Event-driven architecture (EDA) is a design paradigm where services communicate by publishing and consuming events rather than calling each other directly.
The fundamental model:
- Producer — a service that detects a state change and publishes an event
- Event bus / broker — the infrastructure that routes events (Kafka, SNS, EventBridge)
- Consumer — a service that subscribes to events and reacts to them
The defining property is loose coupling: the producer does not know who will consume its event, and the consumer does not know (or care) which service produced it.
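The producer/broker/consumer triangle can be sketched as a toy in-process bus. This is illustrative only: the `EventBus` class below is hypothetical and stands in for a real durable broker such as Kafka or SNS.

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """Toy in-process broker: producers publish facts, consumers subscribe by type."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # The producer never learns who (if anyone) is listening.
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = EventBus()
received = []
bus.subscribe("OrderPlaced", lambda e: received.append(("email", e["orderId"])))
bus.subscribe("OrderPlaced", lambda e: received.append(("analytics", e["orderId"])))
bus.publish("OrderPlaced", {"orderId": 42})
# Both consumers reacted; the publisher knows neither of them.
```

Adding a third subscriber requires no change to the publisher, which is exactly the loose-coupling property described above.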
Request-Driven vs Event-Driven
Comparison:
| Property | Request-Driven | Event-Driven |
|---|---|---|
| Coupling | Tight — caller knows all receivers | Loose — publisher unaware of consumers |
| Latency | Total = sum of all downstream calls | Near-zero for publisher; async for consumers |
| Resilience | One slow/down service blocks the chain | Consumers fail independently; events buffer |
| Discoverability | Explicit call graph | Implicit — new consumers subscribe without changing producers |
| Complexity | Simple to trace and debug | Harder to trace; requires distributed observability |
Event Types
Not all events serve the same purpose. Distinguishing them prevents design confusion.
| Type | Purpose | Scope | Example |
|---|---|---|---|
| Domain Event | Records a business fact within a bounded context | Internal | OrderConfirmed, ItemShipped |
| Integration Event | Crosses bounded context or service boundary | Cross-service | OrderPlacedEvent published to external consumers |
| Notification Event | Lightweight signal; receiver fetches full state separately | Cross-service | UserProfileUpdated { userId } — no payload, just a ping |
Domain events are owned by a single domain and carry full context. Integration events are the public contract between services — treat them like API contracts: version them carefully, evolve them with backward compatibility. Notification events are useful when the payload would be large or frequently changing; the consumer calls back to fetch current state.
Pub/Sub Pattern
Publish/Subscribe is the most common EDA transport pattern. Publishers write to a topic; any number of subscribers independently consume from that topic.
Key Pub/Sub Properties
Fan-out — a single published event is delivered to all active subscribers simultaneously. Adding a new consumer requires zero changes to the publisher.
Ordering guarantees — most brokers provide ordering within a partition/shard but not across them. Kafka guarantees order within a partition; SNS/SQS Standard does not guarantee order globally. If strict ordering matters (e.g., order lifecycle events), partition by the natural key (e.g., orderId).
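Partitioning by a natural key can be illustrated with a stand-in partitioner. Kafka's default partitioner is murmur2-based; CRC32 is used here only to keep the sketch dependency-free.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Hash the key deterministically, then take it modulo the partition count.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Every event keyed by the same orderId lands on the same partition,
# so that order's lifecycle events stay in sequence for the consumer.
p = partition_for("order-42", 12)
assert all(partition_for("order-42", 12) == p for _ in range(100))
```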
Durable subscriptions — subscribers can be offline; the broker retains events for a configurable retention window. Consumers replay missed events on restart. This is a core advantage over point-to-point HTTP calls.
Cross-reference: Kafka's topic/partition model, consumer groups, and offset management are covered in depth in Chapter 11 — Message Queues. This chapter focuses on the architectural patterns built on top of that infrastructure.
Event Sourcing
The Core Idea
Traditional systems store current state: the orders table holds the latest snapshot of each order. Event sourcing inverts this: store the sequence of events that led to the current state. Current state is derived by replaying the event log.
Traditional: DB row = { orderId: 42, status: "shipped", total: 99.00 }
Event Sourced: Event log for orderId 42:
1. OrderPlaced { total: 99.00, items: [...] }
2. PaymentReceived { amount: 99.00, method: "card" }
3. OrderPicked { warehouseId: "W-7" }
4. OrderShipped { trackingId: "UPS-123", carrier: "UPS" }
Write and Read Flow
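Deriving current state from the log is a left fold over the events. A minimal sketch, with event shapes matching the example log above and a hypothetical `apply` reducer:

```python
def apply(state: dict, event: dict) -> dict:
    """One reducer step: fold a single event into the order's current state."""
    etype, data = event["type"], event["data"]
    if etype == "OrderPlaced":
        return {"status": "placed", "total": data["total"]}
    if etype == "PaymentReceived":
        return {**state, "status": "paid"}
    if etype == "OrderPicked":
        return {**state, "status": "picked", "warehouseId": data["warehouseId"]}
    if etype == "OrderShipped":
        return {**state, "status": "shipped", "trackingId": data["trackingId"]}
    return state  # unknown event types are ignored (forward compatibility)

log = [
    {"type": "OrderPlaced", "data": {"total": 99.00}},
    {"type": "PaymentReceived", "data": {"amount": 99.00, "method": "card"}},
    {"type": "OrderPicked", "data": {"warehouseId": "W-7"}},
    {"type": "OrderShipped", "data": {"trackingId": "UPS-123", "carrier": "UPS"}},
]
state = {}
for event in log:
    state = apply(state, event)
# state["status"] == "shipped", state["total"] == 99.0
```

Replaying a prefix of the log (e.g., only the first two events) yields the state at that point in time, which is what makes temporal queries native to this model.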
Pros and Cons
| Aspect | CRUD (State Storage) | Event Sourcing |
|---|---|---|
| Storage model | Current state only | Full event history |
| Audit trail | Requires separate audit log | Built-in — every change is an event |
| Temporal queries | Hard — requires history table | Native — replay to any point in time |
| Debugging | Query current state | Replay events to reproduce exact past state |
| Complexity | Low | High — projections, snapshots, schema evolution |
| Read performance | Fast — direct query | Requires projection/read model (eventual consistency) |
| Write performance | Update in place | Append-only (fast writes, no contention) |
| Storage cost | Low — only current state | High — full history grows unbounded |
| Onboarding | Familiar to most engineers | Steep learning curve |
When event sourcing is worth the cost:
- You need a complete, tamper-proof audit trail (financial systems, compliance)
- You need temporal queries ("what was the state of this account at 2PM yesterday?")
- Multiple independent projections need to derive different read models from the same facts
- The domain is complex enough that replaying business rules over events adds value
Snapshot optimization — replaying thousands of events on every read is expensive. Periodically store a snapshot of current state at a known version; on read, load the latest snapshot then replay only newer events.
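The snapshot-plus-replay read path can be sketched as follows, using a trivially simple counter domain so the mechanics stay visible. All names here are illustrative.

```python
def load_state(snapshots, events, apply_fn):
    """Load the latest snapshot, then replay only events newer than it."""
    if snapshots:
        version, state = snapshots[-1]   # (last_applied_version, state)
    else:
        version, state = 0, {}
    for v, event in events:
        if v > version:                  # skip events already folded into the snapshot
            state = apply_fn(state, event)
            version = v
    return version, state

# Counter domain: each event increments the count by some amount.
apply_fn = lambda s, e: {"count": s.get("count", 0) + e["by"]}
events = [(v, {"by": 1}) for v in range(1, 251)]   # 250 events in the log
snapshots = [(200, {"count": 200})]                # snapshot taken at version 200
version, state = load_state(snapshots, events, apply_fn)
# Only events 201..250 were replayed: version == 250, state == {"count": 250}
```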
CQRS — Command Query Responsibility Segregation
The Pattern
Command Query Responsibility Segregation (CQRS) separates the write path (commands that mutate state) from the read path (queries that retrieve state). They use different models, different data stores if needed, and scale independently.
Write side — receives commands, validates them against domain logic (aggregates), and appends events to the event store. Optimized for consistency and correctness.
Read side — queries a pre-built materialized view that is updated asynchronously by projectors. Optimized for query performance. The read model can be shaped exactly to the UI's needs (denormalized, pre-joined).
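A projector's job can be shown in a few lines: consume events, maintain a denormalized row per aggregate shaped for the UI. The event and row shapes below are illustrative, not a prescribed schema.

```python
# Read model: orderId -> row shaped exactly for the UI (no joins needed at query time).
read_model = {}

def project(event: dict) -> None:
    """Applied asynchronously by the read side; the view is eventually consistent."""
    oid = event["orderId"]
    if event["type"] == "OrderPlaced":
        read_model[oid] = {"status": "placed", "total": event["total"]}
    elif event["type"] == "OrderShipped":
        read_model[oid]["status"] = "shipped"
        read_model[oid]["trackingId"] = event["trackingId"]

for e in [{"type": "OrderPlaced", "orderId": 42, "total": 99.0},
          {"type": "OrderShipped", "orderId": 42, "trackingId": "UPS-123"}]:
    project(e)

# A query is now a plain key lookup on read_model, with no domain logic involved.
```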
When to Use CQRS
CQRS adds significant complexity. Apply it selectively:
| Scenario | Use CQRS? | Reason |
|---|---|---|
| Simple CRUD with low traffic | No | Overhead exceeds benefit |
| Complex domain with many business rules | Yes | Write model enforces invariants cleanly |
| Read/write ratio heavily skewed (e.g., 100:1 reads) | Yes | Independent scaling of read replicas |
| Multiple UI views need differently shaped data | Yes | Separate projections per view |
| Audit trail / regulatory compliance required | Yes | Combine with event sourcing |
| Small team, single service | No | KISS — premature optimization |
Note: CQRS does not require event sourcing, and event sourcing does not require CQRS. They compose well together but are independent patterns.
Choreography vs Orchestration
When multiple services must collaborate to complete a business process, you have two coordination models.
Choreography — Decentralized Reaction
Each service listens for events and reacts independently. No central coordinator exists. The workflow emerges from the interactions.
Orchestration — Centralized Coordination
A dedicated orchestrator (saga coordinator) explicitly directs each step, waits for responses, and handles failures.
Comparison
| Property | Choreography | Orchestration |
|---|---|---|
| Coupling | Low — services only know the event schema | Medium — orchestrator knows all participants |
| Visibility | Low — workflow is implicit, hard to see end-to-end | High — orchestrator holds the full state machine |
| Complexity | Grows with number of services (emergent behavior) | Localized in orchestrator (explicit, testable) |
| Failure handling | Each service must publish compensating events independently | Orchestrator drives compensating transactions centrally |
| Scalability | Each service scales independently | Orchestrator can become a bottleneck |
| Debugging | Requires distributed tracing to reconstruct flow | Orchestrator state is queryable |
| Best for | Simple, stable workflows with few services | Complex workflows requiring rollback, retries, human steps |
Cross-reference: The saga pattern — the standard approach for distributed transactions using orchestration — is detailed in Chapter 13 — Microservices, including compensating transactions and the outbox pattern for reliable event publishing.
Event Schema Evolution
Events are the public API between services. Once published, consumers depend on them. Schema evolution is one of the hardest operational problems in EDA.
Backward compatibility — new consumers can read events written by older producers. Achieved by: only adding optional fields, never removing fields, never changing field semantics.
Forward compatibility — old consumers can read events written by newer producers. Achieved by: consumers ignore unknown fields (use schema-flexible formats like Avro/Protobuf with default values, or JSON with additive-only changes).
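Both properties come down to how the consumer deserializes. A tolerant-reader sketch, reusing the OrderPlaced field names from this chapter (the helper function itself is hypothetical):

```python
def read_order_placed(raw: dict) -> dict:
    """Tolerant reader: take the fields we know, default the ones an old
    producer might omit, and silently ignore anything a newer producer added."""
    return {
        "orderId": raw["orderId"],                # required in every version
        "total": raw.get("total", 0.0),           # defaulted if absent (backward compat)
        "discountCode": raw.get("discountCode"),  # newer optional field, may be None
    }

# Event from a newer producer carrying an unknown extra field:
event = {"orderId": 42, "total": 99.0, "giftWrap": True}
order = read_order_placed(event)   # "giftWrap" is ignored rather than being an error
```

Avro and Protobuf bake this behavior into the format itself (defaults, unknown-field skipping); with plain JSON it is a discipline the consumer code must enforce, as above.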
Schema Registry — a central service (e.g., Confluent Schema Registry, AWS Glue Schema Registry) that:
- Stores versioned schemas for every topic
- Validates that new schemas are compatible before publishing
- Provides schema IDs embedded in message headers so consumers deserialize correctly
- Enforces compatibility rules (BACKWARD, FORWARD, FULL) per topic
Versioning strategy:
- For minor additions: add optional fields, bump minor version
- For breaking changes: publish to a new topic (`order-events-v2`), run both topics in parallel during migration, retire the old topic once all consumers are migrated
Idempotency
Why Events Are Delivered More Than Once
In any reliable messaging system using at-least-once delivery (which includes Kafka, SQS, SNS, and most brokers in practice), a consumer may process the same event multiple times:
- Consumer crashes after processing but before committing the offset
- Network timeout causes the broker to assume the consumer failed and redeliver
- Manual reprocessing during incident recovery replays past events
How to Handle Duplicates
Idempotency key — every event carries a unique identifier (eventId, correlationId). Before processing, the consumer checks a deduplication store (Redis, or a DB table with a unique constraint) to see whether this event has already been processed.
```
Process(event):
    if dedup_store.exists(event.eventId):
        return  # Already processed — skip
    process_business_logic(event)
    dedup_store.set(event.eventId, TTL=7_days)
    commit_offset()
```
Idempotent operations — design business operations to produce the same result when applied multiple times. Setting a field to a value is idempotent; incrementing a counter is not. Where possible, use `SET status = 'shipped'` rather than `UPDATE shipped_count = shipped_count + 1`.
Natural idempotency — database upserts (INSERT ... ON CONFLICT DO UPDATE) make many operations naturally safe to retry without explicit deduplication tracking.
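The upsert approach can be demonstrated with Python's bundled SQLite (`ON CONFLICT ... DO UPDATE` needs SQLite 3.24+, shipped with modern Python builds; the table shape is illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT)")

def on_order_shipped(event: dict) -> None:
    """Upsert handler: safe to run on every (re)delivery of the same event."""
    db.execute(
        """INSERT INTO orders (order_id, status) VALUES (?, 'shipped')
           ON CONFLICT(order_id) DO UPDATE SET status = 'shipped'""",
        (event["orderId"],),
    )

on_order_shipped({"orderId": "42"})
on_order_shipped({"orderId": "42"})  # duplicate delivery: same final state, no error
rows = db.execute("SELECT order_id, status FROM orders").fetchall()
# rows == [("42", "shipped")] regardless of how many times the event arrived
```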
Cross-reference: Exactly-once semantics using Kafka's idempotent producers and transactional API are covered in Chapter 11 — Message Queues.
Real-World Cases
Uber — Event-Driven Dispatch
Uber's dispatch system coordinates drivers, riders, GPS pings, pricing, and payments across tens of millions of concurrent events. The event-driven approach emerged from necessity: synchronous coordination at that scale would require impossibly low latency across many dependent services.
- Each ride lifecycle step (`RideRequested`, `DriverMatched`, `RideStarted`, `RideCompleted`) is an event published to their internal streaming platform (originally Cherami, later migrated to Kafka-compatible infrastructure)
- Pricing service subscribes to location and demand events to recalculate surge pricing in near-real-time without blocking the ride request path
- Driver supply service independently consumes `RideCompleted` and `DriverGoesOnline` events to update availability — no direct coupling to the dispatch service
- Billing service processes `RideCompleted` events asynchronously; if it lags, it doesn't affect ride availability
- Key insight from Uber's engineering blog: decoupling the payment pipeline from the ride completion event eliminated a class of cascading failures where slow payment processing would delay driver availability
Netflix — Event Pipeline for Personalization
Netflix's recommendation engine, encoding pipeline, and A/B testing infrastructure are all event-driven.
- Watch event pipeline: Every play, pause, seek, and completion fires an event. These flow through a Kafka-based pipeline into multiple consumers: real-time recommendation updates, analytics aggregation, content popularity scoring, and billing.
- Encoding pipeline: When a new video is ingested, a `VideoUploaded` event triggers a choreography-based workflow — thumbnail generation, encoding at multiple bitrates, DRM packaging, CDN distribution — each as independent consumers. The ingestion service publishes once; dozens of workers react in parallel.
- Schema evolution in practice: Netflix runs thousands of event types. Their internal schema registry enforces that every change to a high-volume event schema is backward-compatible before deployment. Engineers cannot publish to a production topic without schema validation passing.
- Scale: Netflix processes 500 billion events per day through their pipeline (primarily on Apache Kafka), demonstrating that EDA at extreme scale requires disciplined schema governance, not just the right broker.
Key Takeaway
Event-driven architecture trades the simplicity of direct service calls for the resilience and scalability of asynchronous fact-sharing. The benefits — loose coupling, independent scaling, natural audit trails — are real, but so are the costs: eventual consistency, invisible workflows, and the operational burden of schema governance and idempotency. Apply EDA where the coupling costs of synchronous calls are genuinely painful, not as a default pattern for every interaction.
Distributed Transaction Protocols
Before the Saga pattern, engineers used database-level protocols to coordinate multi-node writes. Understanding their failure modes explains why Saga exists.
Two-Phase Commit (2PC)
2PC uses a coordinator to drive N participants through two phases, ensuring all-or-nothing commit semantics across distributed nodes.
The blocking problem: Between Phase 1 vote and Phase 2 decision, participants hold locks. If the coordinator crashes after collecting YES votes but before sending Phase 2, participants are blocked indefinitely — they cannot commit or abort without the coordinator's decision.
Three-Phase Commit (3PC)
3PC adds a pre-commit phase to eliminate the blocking window. Participants move to a pre-committed state, allowing recovery even if the coordinator fails.
Why 3PC still loses in practice: Pre-commit solves coordinator crash but is vulnerable to network partitions — a participant can auto-commit on timeout while another aborts, causing inconsistency. 3PC requires a synchronous network assumption that does not hold in real distributed systems.
2PC vs 3PC Comparison
| Protocol | Blocking on Coordinator Crash | Partition Tolerance | Message Rounds | Latency | Practical Use |
|---|---|---|---|---|---|
| 2PC | Yes — participants block indefinitely | No | 4 (P+A per phase) | Low | Database XA transactions (rarely cross-service) |
| 3PC | No — auto-commit via timeout | No (partitions cause split-brain) | 6 | Higher | Theoretical; almost never used in production |
| Saga | No — compensations handle failure | Yes | Async | Variable | Microservices standard |
Bottom line: 2PC works inside a single database cluster (e.g., PostgreSQL with XA). Across independent services with independent databases, 2PC is too fragile. Saga is the production-grade alternative.
Saga Pattern — Deep Dive
The Saga pattern (described at a high level in Chapter 13) is the standard approach for long-running distributed transactions. Each local transaction publishes events or sends commands; compensating transactions undo completed steps on failure.
Orchestration-Based Saga
A central saga orchestrator (often implemented as a state machine) drives every step and explicitly handles failure paths.
Choreography-Based Saga
No central coordinator. Each service emits events; downstream services react and either continue the flow or emit compensating events.
Compensation Logic
Compensating transactions must be idempotent — they may be retried if the compensating command itself fails. Design them as explicit reversal operations:
| Forward Step | Compensating Transaction | Idempotency Key |
|---|---|---|
| CreateOrder (PENDING) | CancelOrder → CANCELLED | orderId |
| ReserveInventory | ReleaseInventory | orderId + SKU |
| ChargePayment | RefundPayment | orderId + paymentId |
| SendShipment | (cannot compensate — issue return label instead) | orderId |
Note: some steps are non-compensatable (shipping a physical parcel). Design your saga to attempt non-compensatable steps last, maximizing the chance all prior steps can be rolled back cleanly.
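The orchestrator's failure path can be sketched with steps and compensations as plain callables. Step names and the simulated payment failure below are illustrative.

```python
def run_saga(steps, state):
    """steps: list of (name, action, compensation). On any failure, previously
    completed steps are compensated in reverse order."""
    completed = []
    for name, action, compensate in steps:
        try:
            action(state)
            completed.append((name, compensate))
        except Exception:
            for _, comp in reversed(completed):
                comp(state)  # compensations must themselves be idempotent
            return "rolled_back"
    return "committed"

log = []

def charge_card(state):
    raise RuntimeError("card declined")  # simulated failure at the payment step

steps = [
    ("create_order",  lambda s: log.append("order_created"),
                      lambda s: log.append("order_cancelled")),
    ("reserve_stock", lambda s: log.append("stock_reserved"),
                      lambda s: log.append("stock_released")),
    ("charge_card",   charge_card,
                      lambda s: log.append("payment_refunded")),
]
result = run_saga(steps, {})
# result == "rolled_back"; stock is released, then the order is cancelled.
```

Note that the failed step's own compensation never runs: only steps that completed are unwound, which is why the reverse order matters.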
Saga Pattern Comparison
| Dimension | Orchestration | Choreography |
|---|---|---|
| Coupling | Orchestrator knows all participants | Services know only event schemas |
| Visibility | Full state in orchestrator — queryable | Distributed state — requires distributed tracing |
| Debugging | Straightforward — one place to inspect state | Hard — events across many logs |
| Scalability | Orchestrator can be a bottleneck | Fully distributed |
| Error handling | Centralized compensation logic | Each service must handle its own compensations |
| Best For | Complex flows, regulatory audit, many compensations | Simple stable flows, few participants |
Change Data Capture (CDC)
CDC captures row-level changes from a database's write-ahead log (WAL) and streams them as events to downstream consumers — without modifying application code.
How It Works
Log-based CDC (Debezium): Reads the database's replication slot / binary log. Zero impact on write path — no application changes required. Every INSERT, UPDATE, DELETE becomes a structured event with before/after values.
CDC Approaches Comparison
| Approach | How It Works | Latency | DB Impact | Complexity | Best For |
|---|---|---|---|---|---|
| Log-based (Debezium, Maxwell) | Read WAL / binlog directly | Sub-second | Near zero | Medium (connector ops) | Production; low-latency, any throughput |
| Trigger-based | DB triggers write to outbox table | Low | High (triggers on every write) | Low | Simple setups; low write volume |
| Polling | App queries WHERE updated_at > last_poll | Seconds | Medium (full-table scan risk) | Low | Legacy systems without WAL access |
| Dual-write | App writes to DB and publishes event | Low | Zero | High (must be atomic) | Avoid — risks inconsistency if app crashes between writes |
The outbox pattern (avoiding dual-write): Write the event to an outbox table in the same transaction as the business write. A CDC connector reads the outbox and publishes to Kafka. Atomicity is guaranteed by the DB transaction; publishing is decoupled from the business write.
CDC Use Cases
- Cache invalidation: Invalidate Redis entries when the source record changes, without polling
- Search index sync: Keep Elasticsearch in sync with the primary database in near-real-time
- Analytics pipelines: Stream operational data to data warehouses (Snowflake, BigQuery) without Extract, Transform, Load (ETL) batch jobs
- Audit logs: Capture every data change with before/after values for compliance
- Microservice data sync: Replicate a bounded subset of data to a downstream service's own database without shared schema coupling
The Outbox Pattern
The outbox pattern solves the dual-write problem: ensuring a DB write and a message publish are atomic without distributed transactions.
Guarantees:
- The order row and outbox row are written atomically — if the app crashes between write and publish, the CDC connector will eventually pick up the outbox row and publish
- CDC provides at-least-once delivery to Kafka — consumers must be idempotent
- No two-phase commit required; consistency comes from single-DB transaction + eventual CDC processing
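The atomic write path can be sketched with Python's bundled SQLite standing in for the business database. The table shapes are simplified; a production setup would use the schema below on Postgres with a CDC connector such as Debezium reading the outbox.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (order_id TEXT PRIMARY KEY, total REAL);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         aggregate_id TEXT, event_type TEXT,
                         payload TEXT, processed_at TEXT);
""")

def place_order(order_id: str, total: float) -> None:
    with db:  # one local transaction: both rows commit, or neither does
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        db.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            (order_id, "OrderPlaced",
             json.dumps({"orderId": order_id, "total": total})),
        )

place_order("42", 99.0)
pending = db.execute(
    "SELECT event_type FROM outbox WHERE processed_at IS NULL").fetchall()
# The CDC connector would now pick up this pending row and publish it to Kafka.
```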
Outbox table schema (minimal):
```sql
CREATE TABLE outbox (
    id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    occurred_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    aggregate_type TEXT NOT NULL,      -- e.g. 'Order'
    aggregate_id   TEXT NOT NULL,      -- e.g. '42'
    event_type     TEXT NOT NULL,      -- e.g. 'OrderPlaced'
    payload        JSONB NOT NULL,
    processed_at   TIMESTAMPTZ         -- NULL = pending
);
```
Cross-reference: Chapter 11 — Message Queues covers Kafka internals (topics, partitions, consumer groups) that CDC pipelines rely on. Chapter 15 — Data Replication covers consensus protocols used by distributed transaction coordinators.
Notification Systems
Architecture Overview
Notification delivery is a natural fit for event-driven architecture. Events trigger notifications across multiple channels.
Channel Comparison
| Channel | Latency | Deliverability | Cost per msg | Best For |
|---|---|---|---|---|
| Push notification | < 1s | Medium (tokens expire) | ~Free | Real-time alerts |
| Email | 1-30s | High (with warm-up) | $0.0001 | Transactional, marketing |
| SMS | 1-5s | Very high | $0.01-0.05 | 2FA, critical alerts |
| In-app | < 100ms | Very high (if online) | Free | Activity feeds |
| Webhook | < 1s | Medium (retries needed) | Free | Developer integrations |
Delivery Guarantees
- Use message queues for reliable delivery
- Implement retries with exponential backoff
- Dead letter queue for permanently failed notifications
- Idempotency keys to prevent duplicate sends
- Rate limiting per user per channel (no notification spam)
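The retry guidance above can be made concrete with a backoff schedule generator. This sketch assumes the "full jitter" variant (each delay scaled by a random factor in [0, 1)); the parameter values are illustrative.

```python
import random

def backoff_schedule(base=0.5, cap=30.0, max_attempts=5, rng=random.random):
    """Delays (in seconds) before each retry: exponential growth with full
    jitter, capped so a long outage never produces extreme sleeps."""
    return [min(cap, base * 2 ** attempt) * rng() for attempt in range(max_attempts)]

delays = backoff_schedule(rng=lambda: 1.0)  # jitter disabled for a deterministic view
# delays == [0.5, 1.0, 2.0, 4.0, 8.0]
```

After `max_attempts` failures, the notification should be routed to the dead letter queue rather than retried forever.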
Bounce & Feedback Handling
- Email bounces: hard bounce (remove address) vs soft bounce (retry)
- Push token invalidation: FCM/APNS return error codes
- SMS delivery receipts: track failed deliveries
Related Chapters
| Chapter | Relevance |
|---|---|
| Ch11 — Message Queues | Kafka/SQS internals that event-driven patterns depend on |
| Ch13 — Microservices | Saga pattern for distributed transactions across services |
| Ch15 — Replication & Consistency | Consistency guarantees in event-sourced distributed systems |
| Ch09 — SQL Databases | Distributed transactions (2PC) vs Saga trade-offs |
Practice Questions
Beginner
Order Fulfillment Saga: Design an order fulfillment system using event-driven architecture. An order must trigger inventory reservation, payment charging, and shipping label creation — but if payment fails, inventory must be released. Model this with both choreography and orchestration, then choose one and justify why.
Hint
Choreography uses events between services (looser coupling, harder to trace); orchestration uses a central coordinator (easier to reason about compensations) — multi-step compensation workflows favor orchestration.
Intermediate
Event Sourcing Trade-offs: Your team proposes event sourcing for a high-traffic e-commerce catalog (10K product updates/day, 1M reads/day). Evaluate the costs (event log replay time, read complexity, schema migration) vs benefits (audit history, temporal queries). Would you use event sourcing here, and what CQRS read model strategy would you apply?
Hint
Event sourcing shines for financial ledgers and audit-heavy domains; for a product catalog where reads dominate and historical states are rarely needed, the operational overhead typically outweighs the benefits.
CQRS Feed Design: A social platform's feed query is slow because it joins users, follows, and posts tables in real time. Propose a CQRS design where the write side remains relational but the read side materializes a feed per user in Redis. What events trigger feed updates, and how do you handle the eventual consistency gap when a user posts and immediately refreshes?
Hint
Fan-out on write (push model) updates each follower's Redis feed when a post event is published; for read-your-own-writes, read the user's own posts from the write model and merge with the cached feed at query time.
Schema Evolution: You publish an `OrderPlaced` event with fields `{ orderId, userId, total }`. You need to add `{ discountCode, itemCount }` and rename `total` to `subtotal`. Which changes are backward-compatible? Which require a new schema version or topic? Describe the consumer migration strategy.
Hint
Adding optional fields is backward-compatible; renaming is a breaking change — publish both `total` and `subtotal` in a transitional version, then deprecate `total` after all consumers have migrated.
Advanced
Idempotency Mechanisms: Your notification service consumes `OrderShipped` events and sends an SMS. A Kafka partition rebalance causes the same event to be delivered twice within 30 seconds. Describe three distinct mechanisms to prevent duplicate SMS (deduplication key in DB, idempotency token with provider API, consumer-side state machine). Compare each on: implementation complexity, latency impact, and failure modes.
Hint
DB deduplication key is the most reliable (persistent) but adds a write per message; provider-level idempotency keys work if the SMS provider supports them; a state machine (processed/unprocessed enum) prevents duplicates but requires careful distributed locking.
References & Further Reading
- Building Event-Driven Microservices — Adam Bellemare (O'Reilly)
- Designing Data-Intensive Applications — Martin Kleppmann, Chapter 11
- Martin Fowler — Event-Driven Architecture
- Apache Kafka Documentation
- AWS EventBridge Documentation
