
Chapter 11: Message Queues & Async Processing


Synchronous systems are fragile. Asynchronous systems are resilient. A message queue is the contract between the two — it lets producers and consumers evolve, scale, and fail independently without ever blocking each other.




Why Async Processing

Synchronous request-response works fine for simple CRUD. It breaks down when:

  • A downstream service is slow or temporarily unavailable
  • A burst of traffic hits faster than consumers can handle
  • Multiple services need to react to the same event
  • Work is too expensive to complete within an HTTP request timeout

Synchronous vs Asynchronous Architecture

Key benefits:

| Benefit | Explanation |
|---|---|
| Decoupling | Producer doesn't know or care about consumers — they can be added or removed without touching the producer |
| Traffic spike absorption | Queue acts as a buffer; producers publish at peak rate, consumers process at sustainable rate |
| Reliability | If a consumer crashes, messages stay in the queue and are reprocessed when it recovers |
| Independent scaling | Scale consumers horizontally without changing producers |
| Temporal decoupling | Producer and consumer don't need to be online at the same time |

Message Queue Fundamentals

The core pattern: Producer → Queue → Consumer

Message anatomy:

  • Payload — the actual data (JSON, Avro, Protobuf)
  • Key — used for partitioning/routing (Kafka) or routing key (RabbitMQ)
  • Headers / Metadata — timestamp, message ID, content type, correlation ID
  • TTL — how long the message lives in the queue before expiry

Delivery Guarantees

How a system handles failures during message delivery determines its correctness guarantees.

The Three Levels

| Guarantee | How it works | Risk | Typical Use Case |
|---|---|---|---|
| At-most-once | Fire and forget. No retries. ACK before processing. | Messages can be lost | Metrics, logs — loss is acceptable |
| At-least-once | ACK after processing. Retry on failure. | Messages can be duplicated | Most business systems — duplicates are handled |
| Exactly-once | Idempotent consumer + transactional producer + at-least-once delivery | Complexity, lower throughput | Payments, inventory — correctness critical |

How Exactly-Once Is Achieved

Exactly-once is not a single mechanism — it is a combination:

  1. Idempotent producer (Kafka) — assigns sequence numbers so the broker deduplicates retried sends from the same producer session
  2. Transactional producer (Kafka Transactions) — writes to multiple partitions atomically; either all succeed or none are visible
  3. Idempotent consumer — the application assigns a deduplication key to each operation (e.g., order_id), checks whether it was already applied, and skips if so
Exactly-once = At-least-once delivery + Idempotent consumer

In practice, true end-to-end exactly-once requires both the broker and the downstream storage to participate in the transaction. Kafka Streams achieves this internally; external sinks require careful design.


Apache Kafka Deep Dive

Kafka is a distributed commit log designed for high-throughput, durable, replayable event streaming.

Architecture

Core Concepts

Broker — a single Kafka server. A cluster has multiple brokers for redundancy and parallelism.

Topic — a named, ordered, append-only log. Topics are split into partitions.

Partition — the unit of parallelism. Each partition is an ordered, immutable sequence of messages with monotonically increasing offsets. Partitions are replicated across brokers.

Consumer Group — a set of consumers sharing a group ID. Each partition is consumed by exactly one consumer within a group, enabling horizontal scaling. Multiple groups can read the same topic independently (pub/sub semantics).

Offset — the position of a message within a partition. Consumers commit offsets to track progress. Restarting a consumer replays from the last committed offset.

Offset Management

| Strategy | Commit Timing | Risk |
|---|---|---|
| Auto-commit (every 5s) | Periodic, automatic | Can lose uncommitted work on crash |
| Manual commit after process | After successful processing | At-least-once; safe for most cases |
| Manual commit before process | Before processing | At-most-once; fast but lossy |
| Transactional | Atomic with downstream write | Exactly-once; complex |

Log Compaction

Normal Kafka topics are retention-based — old messages are deleted after a time or size limit. Log compaction keeps only the latest message per key, making the log act like a changelog or snapshot:

  • Useful for materializing state (e.g., latest user profile per user_id)
  • Enables new consumers to rebuild full state by replaying the compacted log
  • Used extensively for Kafka Streams state stores and database change data capture (CDC)

Kafka Use Cases

  • Event streaming — real-time clickstream, IoT sensor data
  • Log aggregation — centralize application logs from hundreds of services
  • Change Data Capture (CDC) — Debezium captures database changes as Kafka events
  • Stream processing — Kafka Streams / ksqlDB for real-time transformations
  • Event sourcing — durable, replayable record of all state changes

RabbitMQ

RabbitMQ implements the AMQP 0-9-1 protocol. Unlike Kafka's pull-based log, RabbitMQ uses a push-based broker model: the broker routes messages to queues via exchanges, then pushes them to subscribed consumers.

Exchange Routing

Exchange types:

| Type | Routing Logic | Use Case |
|---|---|---|
| Direct | Exact match on routing key | Error routing, task dispatch |
| Topic | Wildcard pattern (`*` = one word, `#` = zero or more) | Categorized events |
| Fanout | Broadcast to all bound queues | Notifications, cache invalidation |
| Headers | Match on message header attributes | Complex routing rules |

RabbitMQ strengths:

  • Complex routing logic without consumer-side filtering
  • Per-message acknowledgement and TTL
  • Dead letter exchanges (DLX) for failed messages
  • Plugin ecosystem (delayed messages, priority queues)

Amazon SQS

SQS is AWS's fully managed message queue — no brokers to provision, patch, or scale.

Standard Queue:

  • Near-unlimited throughput
  • At-least-once delivery (a message may be delivered more than once)
  • Best-effort ordering (not guaranteed)
  • Visibility timeout — after delivery, message becomes invisible to other consumers for a configurable period; if not deleted within that window, it reappears

FIFO Queue:

  • Exactly-once processing (within the 5-minute deduplication window)
  • Strict ordering per MessageGroupId
  • Max 3,000 messages/second with batching

Dead Letter Queue (DLQ): After maxReceiveCount failed delivery attempts, the message is moved to a DLQ for inspection and replay without blocking the main queue.


Kafka vs RabbitMQ vs SQS

| Dimension | Apache Kafka | RabbitMQ | Amazon SQS |
|---|---|---|---|
| Throughput | Millions of msg/sec | 50k–100k+ msg/sec (streams: 1M+) | Near-unlimited (Standard) |
| Message Ordering | Per partition | Per queue | No guarantee (Standard) / per group (FIFO) |
| Message Retention | Days to forever (configurable) | Until ACKed (default) | 4 days default, up to 14 days |
| Message Replay | Yes — rewind offset | No — consumed = deleted | No |
| Protocol | Custom TCP binary | AMQP 0-9-1 | HTTPS / AWS SDK |
| Push vs Pull | Pull (consumer polls) | Push (broker delivers) | Pull (long polling) |
| Routing | Topic + partition key | Exchanges + binding keys | Queue URL (single destination) |
| Delivery Guarantee | At-least-once; exactly-once with transactions | At-least-once | At-least-once; exactly-once (FIFO) |
| Scaling Model | Add partitions + consumers | Add consumers or cluster nodes | Fully managed, auto-scales |
| Ops Overhead | High (ZooKeeper/KRaft, tuning) | Medium (cluster management) | None (managed service) |
| Latency | ~5–15 ms (batching) | Sub-millisecond | ~1–20 ms |
| Primary Use Case | Event streaming, CDC, log aggregation | Task dispatch, complex routing | Simple async decoupling on AWS |

Decision guide:

  • Need replay or event sourcing → Kafka
  • Need complex routing or per-message TTL → RabbitMQ
  • On AWS and want zero ops → SQS
  • Need high throughput with ordering → Kafka + partition by key

Task Queues

Task queues are higher-level abstractions on top of message queues, designed for job dispatch rather than raw event streaming.

Worker Pattern

Celery (Python) — uses Redis or RabbitMQ as broker; supports chaining, groups, chords, canvas for complex workflows.

Sidekiq (Ruby) — Redis-backed; uses threads for concurrency; built-in retry with exponential backoff.

Priority Queues

Tasks with higher priority jump ahead in the queue. Implemented by:

  • Multiple queues with polling priority (Sidekiq: poll high → default → low)
  • Broker-native priority (RabbitMQ priority queues, max 255 levels)

Retry Logic Best Practices

  • Exponential backoff with jitter — `delay = min(cap, base * 2^attempt) + random_jitter`
  • Max retries — set a limit; after exhaustion, move to Dead Letter Queue
  • Idempotency — retried jobs must produce the same result as the first run
  • Alerting on DLQ depth — monitor for jobs that never succeed

Back Pressure

Back pressure is the mechanism by which a consumer signals to the producer that it cannot keep up. Without it, queues grow unbounded until memory is exhausted or latency becomes unacceptable.

The Problem

Producer: 10,000 msg/sec  →  Queue fills up  →  Consumer: 1,000 msg/sec

Strategies

| Strategy | Mechanism | Trade-off |
|---|---|---|
| Drop | Discard new messages when queue is full | Fast; loses data — acceptable for metrics/logs |
| Buffer | Expand in-memory queue | Temporary relief; risks OOM on sustained overload |
| Throttle | Slow down the producer (rate limit) | Preserves data; requires producer cooperation |
| Block | Producer blocks until queue has capacity | Simple; can cause cascading slowdown upstream |
| Reject + retry | Return 429 to producer; let it retry later | Decouples pacing; requires producer to handle 429 |

Reactive Streams (RxJava, Akka Streams, Project Reactor) formalize back pressure in code: a consumer requests N items from the producer, preventing the producer from overwhelming it.

In Kafka, consumers naturally control pace because they poll the broker. The queue depth (consumer lag) is a key metric — when lag grows, scale out consumers.


Patterns

Fan-out: One Message to Many Consumers

One producer publishes a single event; multiple independent consumers each receive a copy.

Kafka: multiple consumer groups on the same topic, each maintaining their own offset. RabbitMQ: fanout exchange bound to N queues.

Fan-in: Many Producers to One Consumer

Multiple producers emit events into a single queue; one consumer (or consumer group) processes all of them.

Use case: centralised audit log, combined metrics ingestion.

Request-Reply over Queues

Asynchronous request-reply decouples caller from callee while still allowing the caller to get a result.

The caller creates a temporary reply queue (or uses a shared one with correlationId filtering). Used in RPC-over-MQ patterns.


Real-World Cases

LinkedIn — Kafka at 7 Trillion Messages per Day

LinkedIn built Kafka in 2010 specifically because no existing system could handle their activity feed requirements. Today:

  • 7 trillion messages/day across their Kafka clusters
  • Single largest Kafka deployment in the world
  • Used for activity tracking, metrics, log aggregation, and stream processing with Samza
  • Key insight: treating logs as first-class data streams unlocked entirely new product capabilities (who viewed your profile, trending posts)

Uber — Async Processing for Reliability

Uber's ride dispatch, surge pricing, and notification systems all rely heavily on async queues:

  • Cherami (Uber's internal queue) was built when SQS latency was too high and Kafka's model didn't fit task-queue semantics
  • Microtransactions — each step of the ride lifecycle (request, match, pickup, complete) publishes an event; downstream services (billing, driver payments, analytics) react asynchronously
  • Resilience pattern: if the notification service is down during a ride, the events buffer in the queue; once service recovers, all notifications are delivered in order

See also: Chapter 14 — Event-Driven Architecture for how these patterns compose into full system designs.

For async cache invalidation patterns (where message queues propagate cache eviction events), see Chapter 7 — Caching.


Key Takeaway

Message queues transform brittle synchronous chains into resilient, independently scalable systems. The choice of broker matters far less than the discipline of designing idempotent consumers, monitoring queue depth, and handling back pressure gracefully. A queue is only as reliable as the consumer that processes it.


Designing Idempotent Consumers

Why Idempotency Matters

  • At-least-once delivery means messages CAN be delivered multiple times
  • Without idempotent consumers, duplicate processing causes bugs (double charges, duplicate records)

Idempotency Strategies

| Strategy | How It Works | Best For |
|---|---|---|
| Idempotency key | Store processed message IDs in a set; skip duplicates | Payment processing |
| Natural idempotency | Operations that produce the same result regardless of repetition (SET vs INCREMENT) | State updates |
| Optimistic locking | Check version/timestamp before applying changes | Concurrent updates |
| Deduplication window | Track message IDs within a time window (e.g., 24h) | High-throughput streams |

Code Example: Idempotent Kafka Consumer (Go)

```go
// Sketch: assumes a Postgres processed_messages table and a Kafka client
// whose Message exposes Key and Value (e.g. segmentio/kafka-go).
func processMessage(ctx context.Context, db *sql.DB, msg kafka.Message) error {
    msgID := string(msg.Key)

    // Check if already processed (idempotency key).
    var exists bool
    if err := db.QueryRowContext(ctx,
        "SELECT EXISTS(SELECT 1 FROM processed_messages WHERE message_id = $1)",
        msgID).Scan(&exists); err != nil {
        return err
    }
    if exists {
        return nil // Skip duplicate.
    }

    // Apply the business effect and record the idempotency key in ONE
    // transaction, so a crash cannot mark the message processed without
    // actually applying it.
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback() // No-op once Commit succeeds.

    // Business logic here.
    if err := handleOrder(tx, msg.Value); err != nil {
        return err
    }

    // Mark as processed.
    if _, err := tx.ExecContext(ctx,
        "INSERT INTO processed_messages (message_id, processed_at) VALUES ($1, NOW())",
        msgID); err != nil {
        return err
    }
    return tx.Commit()
}
```

Schema Evolution in Event-Driven Systems

The Problem

  • Producers and consumers evolve independently
  • Changing event schemas can break downstream consumers
  • Need backward AND forward compatibility

Schema Compatibility Types

| Compatibility | Rule | Example |
|---|---|---|
| Backward | New schema can read old data | Adding optional field |
| Forward | Old schema can read new data | Removing optional field |
| Full | Both backward and forward | Only add/remove optional fields |
| None | No guarantees | Changing field type |

Schema Registry Pattern

Best Practices

  • Use Avro or Protobuf (not JSON) for schema enforcement
  • Deploy schema registry (Confluent, Apicurio)
  • Set compatibility mode to BACKWARD or FULL
  • Never remove required fields or change field types
  • Use schema evolution as a deployment gate

Go Implementation Example

The following implements an in-memory publish/subscribe message queue — the core pattern behind Kafka topics and RabbitMQ fanout exchanges.

```go
package main

import (
	"fmt"
	"sync"
)

// Message is the unit of data flowing through the queue.
type Message struct {
	Topic   string
	Payload string
}

// Broker manages topics and delivers messages to all subscribers.
type Broker struct {
	mu          sync.RWMutex
	subscribers map[string][]chan Message // topic → list of subscriber channels
}

func NewBroker() *Broker {
	return &Broker{subscribers: make(map[string][]chan Message)}
}

// Subscribe returns a channel that receives all messages published to topic.
// bufSize controls how many messages can queue before the subscriber blocks.
func (b *Broker) Subscribe(topic string, bufSize int) <-chan Message {
	ch := make(chan Message, bufSize)
	b.mu.Lock()
	b.subscribers[topic] = append(b.subscribers[topic], ch)
	b.mu.Unlock()
	return ch
}

// Publish delivers msg to every subscriber of msg.Topic (fan-out pattern).
func (b *Broker) Publish(msg Message) {
	b.mu.RLock()
	subs := b.subscribers[msg.Topic]
	b.mu.RUnlock()

	for _, ch := range subs {
		// Non-blocking send: drop message if subscriber's buffer is full.
		// In production, this is the "drop" back-pressure strategy.
		select {
		case ch <- msg:
		default:
			fmt.Printf("⚠ subscriber buffer full, dropping message on topic %q\n", msg.Topic)
		}
	}
}

func main() {
	broker := NewBroker()

	// Two independent consumer groups subscribe to "order.placed".
	inventory := broker.Subscribe("order.placed", 10)
	email := broker.Subscribe("order.placed", 10)

	var wg sync.WaitGroup
	consume := func(name string, ch <-chan Message, count int) {
		defer wg.Done()
		for i := 0; i < count; i++ {
			msg := <-ch
			fmt.Printf("[%s] received: %s\n", name, msg.Payload)
		}
	}

	wg.Add(2)
	go consume("inventory-service", inventory, 2)
	go consume("email-service", email, 2)

	broker.Publish(Message{Topic: "order.placed", Payload: `{"order_id":"101"}`})
	broker.Publish(Message{Topic: "order.placed", Payload: `{"order_id":"102"}`})

	wg.Wait()
}
```

Key patterns illustrated:

  • Each Subscribe call gets its own buffered channel — this is how Kafka consumer groups achieve independent offsets (each group sees every message)
  • sync.RWMutex allows concurrent reads (many publishers) with exclusive writes (new subscriptions)
  • The select { case ch <- msg: default: } non-blocking send is the "drop" back-pressure strategy — replace with a blocking send for "buffer" strategy

Related Chapters

| Chapter | Relevance |
|---|---|
| Ch14 — Event-Driven Architecture | Event sourcing and Saga patterns built on message queues |
| Ch12 — Communication Protocols | Async vs sync protocol trade-offs complement queue design |
| Ch13 — Microservices | Queues enable async inter-service communication |

Practice Questions

Beginner

  1. Async Order Processing: Design an order processing system where placing an order triggers inventory reservation, payment processing, and email notification — all asynchronously. Specify the queues or topics needed, the failure handling for each step, and how you ensure no step is silently skipped.

    Hint: Use a single `order.placed` topic with multiple consumer groups (one per downstream service); each consumer handles its own retries and dead-letter queue independently.

Intermediate

  1. Idempotent Payments: Your payment service charges a credit card in response to a Kafka message. The consumer crashes after charging but before committing the offset. The message is redelivered. How do you prevent double-charging using idempotency keys?

    Hint: Store a `payment_attempt_id` (derived from the Kafka offset or a UUID in the message) in the database with a UNIQUE constraint before charging; duplicate delivery hits the constraint and is safely ignored.
  2. Back Pressure: Your email service processes at 500 emails/second. Marketing launches a campaign that enqueues 50,000 messages in 10 seconds. Describe three back pressure strategies (queue buffering, rate limiting at producer, consumer scaling) and compare their trade-offs for this workload.

    Hint: Queue buffering absorbs the burst but risks memory exhaustion; producer rate limiting prevents the burst entirely but requires coordination; horizontal consumer scaling adds throughput but has a warm-up lag — consider which failure mode is least acceptable.
  3. Kafka Partition Design: You're designing a Kafka topic for user activity events (clicks, views, purchases) for 10M DAU. Users must see their own events in order, but global ordering across all users is not required. How do you set the partition key, how many partitions do you create, and what breaks if you add more partitions later?

    Hint: Partition by `user_id` to guarantee per-user ordering; adding partitions later remaps some keys to new partitions, breaking the ordering guarantee for affected users during the transition window.

Advanced

  1. Fan-out vs Fan-in: An e-commerce platform must respond to order placement by: (a) sending a receipt email, (b) updating inventory, (c) triggering loyalty points, and (d) logging to analytics. Compare fan-out (one topic, multiple consumer groups) vs fan-in (multiple queues feeding one processor) on: operational coupling, failure isolation, and scalability. When would you choose each pattern?

    Hint: Fan-out decouples consumers so a slow email service doesn't affect inventory updates; fan-in centralizes logic but creates a single point of failure — fan-out is the right default when consumers have different SLAs or scaling needs.


