Chapter 5: Distributed System Patterns
Distributed systems are about managing failure gracefully. Every pattern in this chapter is an answer to the question: "What happens when part of the system breaks?"
Mind Map
Introduction
These five patterns operate at a different level than GoF patterns. They address problems that only emerge when you have multiple services, distributed data, and eventual consistency. You cannot solve them with a clever class hierarchy.
| Pattern | Core Problem | Key Trade-off |
|---|---|---|
| CQRS | Read and write models conflict | Read performance vs model complexity |
| Event Sourcing | State history is lost on update | Full audit trail vs replay cost |
| Saga | Transactions span multiple services | Distributed consistency vs compensation complexity |
| Strangler Fig | Legacy migration without downtime | Safety vs maintaining two systems |
| Sidecar | Cross-cutting concerns in polyglot services | Separation of concerns vs operational overhead |
| Transactional Outbox | Atomic DB write + event publish (dual-write) | Guaranteed delivery vs relay complexity |
| Idempotency Key | Retries cause duplicate operations | Retry safety vs key storage overhead |
| Backends for Frontends (BFF) | One API cannot serve all client types optimally | Team autonomy and performance vs operational overhead |
Cross-reference: These patterns relate closely to the System Design handbook — Ch13 Microservices, Ch14 Event-Driven Architecture, and Ch15 Data Replication. This chapter focuses on pattern mechanics and Go implementation rather than system-level architecture.
Pattern 1: CQRS (Command Query Responsibility Segregation)
The Analogy
A restaurant kitchen vs the menu board. The kitchen (command side) handles orders and cooking — it works with normalized, relational data about ingredients, recipes, and inventory. The menu board (query side) shows customers what's available and at what price — it's pre-formatted, denormalized, optimized for display. They have different schemas, different data access patterns, and different scaling needs. The menu board doesn't update in real time every time someone orders a dish; it updates periodically when the kitchen signals a change.
The Problem
A single data model serves both reads and writes. Writes need normalized data for consistency — avoid duplication, maintain referential integrity, validate business rules. Reads need denormalized, pre-joined data — fast, flat queries that avoid expensive joins. Optimizing for one hurts the other.
The orders table is normalized. But the UI needs: order ID, customer name, customer email, item names, item prices, item quantities, total, order status — which means joining orders, customers, order_items, products. On high-traffic systems, this join becomes the bottleneck.
Solution Architecture
The write side handles commands (state-changing operations). The read side handles queries (data retrieval). They use different data stores optimized for their purpose.
BEFORE — Single Model
// BEFORE: One service doing everything — read/write conflict
type OrderService struct {
db *sql.DB
}
// Write: needs normalized tables for integrity
func (s *OrderService) CreateOrder(ctx context.Context, req CreateOrderRequest) (*Order, error) {
tx, err := s.db.BeginTx(ctx, nil)
if err != nil {
return nil, err
}
defer tx.Rollback()
var order Order
err = tx.QueryRowContext(ctx,
"INSERT INTO orders (customer_id, status) VALUES ($1, 'pending') RETURNING id, created_at",
req.CustomerID,
).Scan(&order.ID, &order.CreatedAt)
if err != nil {
return nil, err
}
for _, item := range req.Items {
_, err = tx.ExecContext(ctx,
"INSERT INTO order_items (order_id, product_id, quantity, price) VALUES ($1, $2, $3, $4)",
order.ID, item.ProductID, item.Quantity, item.Price,
)
if err != nil {
return nil, err
}
}
return &order, tx.Commit()
}
// Read: same model, but expensive joins for the display view
func (s *OrderService) GetOrderSummary(ctx context.Context, orderID string) (*OrderSummary, error) {
// 5-table JOIN every time — does not scale at read volume
row := s.db.QueryRowContext(ctx, `
SELECT o.id, c.name, c.email,
COUNT(oi.id) as item_count,
SUM(oi.quantity * oi.price) as total,
o.status
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN order_items oi ON oi.order_id = o.id
WHERE o.id = $1
GROUP BY o.id, c.name, c.email, o.status
`, orderID)
// ...
}AFTER — CQRS
// AFTER: Separate command and query models
// ── Command Side ──────────────────────────────────────────────────────────────
// Commands express intent, not state
type CreateOrderCommand struct {
CustomerID string
Items []OrderItem
}
type ReserveItemsCommand struct {
OrderID string
Items []OrderItem
}
// Command handler owns business logic + write model
type OrderCommandHandler struct {
writeRepo OrderWriteRepository
eventBus EventBus
}
func (h *OrderCommandHandler) HandleCreateOrder(ctx context.Context, cmd CreateOrderCommand) (string, error) {
// Validate domain rules
if len(cmd.Items) == 0 {
return "", errors.New("order must have at least one item")
}
order := NewOrder(cmd.CustomerID, cmd.Items)
if err := h.writeRepo.Save(ctx, order); err != nil {
return "", fmt.Errorf("saving order: %w", err)
}
// Publish event for read model projector and downstream services
event := OrderCreatedEvent{
OrderID: order.ID,
CustomerID: order.CustomerID,
Items: order.Items,
CreatedAt: order.CreatedAt,
}
if err := h.eventBus.Publish(ctx, "order.created", event); err != nil {
return "", fmt.Errorf("publishing event: %w", err)
}
return order.ID, nil
}
// Write repository — normalized, transactional
type OrderWriteRepository interface {
Save(ctx context.Context, order *Order) error
FindByID(ctx context.Context, id string) (*Order, error)
Update(ctx context.Context, order *Order) error
}
// ── Query Side ────────────────────────────────────────────────────────────────
// Read model is pre-denormalized — flat structure, optimized for display
type OrderSummary struct {
OrderID string `json:"order_id"`
CustomerName string `json:"customer_name"`
CustomerEmail string `json:"customer_email"`
ItemCount int `json:"item_count"`
Items []ItemLine `json:"items"`
Total float64 `json:"total"`
Status string `json:"status"`
CreatedAt time.Time `json:"created_at"`
}
type ItemLine struct {
ProductName string `json:"product_name"`
Quantity int `json:"quantity"`
UnitPrice float64 `json:"unit_price"`
LineTotal float64 `json:"line_total"`
}
// Query handler — no joins, simple key lookups
type OrderQueryHandler struct {
readRepo OrderReadRepository
}
func (h *OrderQueryHandler) GetOrderSummary(ctx context.Context, orderID string) (*OrderSummary, error) {
return h.readRepo.FindSummaryByID(ctx, orderID)
}
func (h *OrderQueryHandler) ListCustomerOrders(ctx context.Context, customerID string) ([]*OrderSummary, error) {
return h.readRepo.FindByCustomerID(ctx, customerID)
}
// Read repository — optimized store (Redis, Elasticsearch, denormalized Postgres table)
type OrderReadRepository interface {
FindSummaryByID(ctx context.Context, id string) (*OrderSummary, error)
FindByCustomerID(ctx context.Context, customerID string) ([]*OrderSummary, error)
Upsert(ctx context.Context, summary *OrderSummary) error
}
// ── Projector — bridges command side events to read model ────────────────────
type OrderProjector struct {
readRepo OrderReadRepository
customerSvc CustomerService // to enrich with customer name/email
productSvc ProductService // to enrich with product names
}
// Projectors consume events and rebuild/update the read model
func (p *OrderProjector) OnOrderCreated(ctx context.Context, event OrderCreatedEvent) error {
customer, err := p.customerSvc.Get(ctx, event.CustomerID)
if err != nil {
return fmt.Errorf("fetching customer: %w", err)
}
var items []ItemLine
var total float64
for _, item := range event.Items {
product, err := p.productSvc.Get(ctx, item.ProductID)
if err != nil {
return fmt.Errorf("fetching product %s: %w", item.ProductID, err)
}
lineTotal := float64(item.Quantity) * item.Price
total += lineTotal
items = append(items, ItemLine{
ProductName: product.Name,
Quantity: item.Quantity,
UnitPrice: item.Price,
LineTotal: lineTotal,
})
}
summary := &OrderSummary{
OrderID: event.OrderID,
CustomerName: customer.Name,
CustomerEmail: customer.Email,
ItemCount: len(items),
Items: items,
Total: total,
Status: "pending",
CreatedAt: event.CreatedAt,
}
return p.readRepo.Upsert(ctx, summary)
}When to Use
- Read and write patterns differ significantly (e.g., complex queries vs simple writes)
- High read-to-write ratio requiring independent read scaling
- Read model needs data from multiple aggregates (denormalization required)
- Used alongside Event Sourcing (they pair naturally — see Pattern 2)
- Complex domain where optimizing writes for consistency is critical
When NOT to Use
- Simple CRUD where read and write models are nearly identical
- Eventual consistency is unacceptable (CQRS introduces read lag)
- Small team/project — adds architectural complexity without proportional benefit
- Prototyping or MVPs — implement as regular CRUD, extract CQRS when the need is clear
Real-World Usage
- Shopify — order system separates write path (cart checkout, inventory reservation) from read path (order history pages served from read replicas/cache)
- Banking systems — transaction write model is normalized and transactional; account statement read model is pre-computed and denormalized
- Content management systems — publishing (write) uses normalized content model; delivery (read) uses CDN-cached flat HTML/JSON
Related Patterns
- Event Sourcing (Pattern 2) — CQRS pairs naturally; the event store IS the write model, projectors build read models from events
- Repository (Ch04) — each side of CQRS typically implements its own repository
- System Design Ch14 Event-Driven Architecture
Pattern 2: Event Sourcing
The Analogy
A bank account statement. Instead of storing the current balance ($500), you store every transaction that produced it:
AccountOpened ($0)
→ Deposited $1,000
→ Withdrew $300
→ PurchaseCharged $200
→ Current balance = $500The balance is not stored — it is derived by replaying events. This means you can answer: "What was the balance on March 15th?" You can undo a transaction by replaying without it. You have a complete audit trail for regulators.
The Problem
Traditional CRUD overwrites state. UPDATE accounts SET balance = 500 WHERE id = 'acc-123' destroys the previous value. You lose history. Audit requirements, debugging production incidents, and "undo" features all become impossible or require expensive bolt-on audit tables.
Event Store Design
Go Implementation
// ── Core Event Types ──────────────────────────────────────────────────────────
// Event is an immutable fact about something that happened.
// It is never updated or deleted — only appended.
type Event struct {
ID string `json:"id"`
StreamID string `json:"stream_id"` // e.g., "account-123"
Type string `json:"type"` // e.g., "MoneyDeposited"
Data json.RawMessage `json:"data"`
Timestamp time.Time `json:"timestamp"`
Version int `json:"version"` // sequential within stream
CausationID string `json:"causation_id"` // which command caused this
}
// EventStore is append-only — no Update or Delete operations
type EventStore interface {
// Append adds new events to a stream. expectedVersion enables optimistic concurrency:
// if the stream's current version != expectedVersion, return a conflict error.
Append(ctx context.Context, streamID string, events []Event, expectedVersion int) error
// ReadStream returns all events for a stream from a given version onward.
ReadStream(ctx context.Context, streamID string, fromVersion int) ([]Event, error)
// ReadStreamTo returns events up to a specific version (for temporal queries).
ReadStreamTo(ctx context.Context, streamID string, toVersion int) ([]Event, error)
// SubscribeToAll registers a handler for all new events across all streams.
SubscribeToAll(ctx context.Context, handler func(Event)) error
}
// ── Domain Aggregate ──────────────────────────────────────────────────────────
// BankAccount is rebuilt by replaying events — it never stores data directly.
type BankAccount struct {
ID string
Owner string
Balance int64 // in cents
Version int
Status string
// Uncommitted events — accumulated during a command, persisted at the end
uncommitted []Event
}
// Apply updates aggregate state from a single event.
// Apply must be deterministic and side-effect free — it only mutates state.
func (a *BankAccount) Apply(event Event) error {
switch event.Type {
case "AccountOpened":
var data AccountOpenedData
if err := json.Unmarshal(event.Data, &data); err != nil {
return err
}
a.ID = data.AccountID
a.Owner = data.Owner
a.Balance = 0
a.Status = "active"
case "MoneyDeposited":
var data MoneyDepositedData
if err := json.Unmarshal(event.Data, &data); err != nil {
return err
}
a.Balance += data.AmountCents
case "MoneyWithdrawn":
var data MoneyWithdrawnData
if err := json.Unmarshal(event.Data, &data); err != nil {
return err
}
a.Balance -= data.AmountCents
case "AccountClosed":
a.Status = "closed"
default:
return fmt.Errorf("unknown event type: %s", event.Type)
}
a.Version = event.Version
return nil
}
// Withdraw is a command that validates rules and records an event (does not persist yet).
func (a *BankAccount) Withdraw(amountCents int64, reference string) error {
if a.Status != "active" {
return errors.New("account is not active")
}
if amountCents <= 0 {
return errors.New("withdrawal amount must be positive")
}
if a.Balance < amountCents {
return errors.New("insufficient funds")
}
data, _ := json.Marshal(MoneyWithdrawnData{
AmountCents: amountCents,
Reference: reference,
})
event := Event{
ID: uuid.New().String(),
StreamID: fmt.Sprintf("account-%s", a.ID),
Type: "MoneyWithdrawn",
Data: data,
Version: a.Version + 1,
}
// Apply immediately to update in-memory state
a.Apply(event)
// Queue for persistence
a.uncommitted = append(a.uncommitted, event)
return nil
}
// ── Rebuilding State from Events ──────────────────────────────────────────────
// RebuildAccount replays all events for an account stream to reconstruct current state.
func RebuildAccount(events []Event) (*BankAccount, error) {
account := &BankAccount{}
for _, event := range events {
if err := account.Apply(event); err != nil {
return nil, fmt.Errorf("applying event %s (v%d): %w", event.Type, event.Version, err)
}
}
return account, nil
}
// AccountAtVersion rebuilds state up to a specific version — temporal query.
func AccountAtVersion(store EventStore, ctx context.Context, accountID string, version int) (*BankAccount, error) {
events, err := store.ReadStreamTo(ctx, fmt.Sprintf("account-%s", accountID), version)
if err != nil {
return nil, err
}
return RebuildAccount(events)
}
// ── Snapshots — Performance Optimization ─────────────────────────────────────
// For high-traffic aggregates, replaying 10,000 events on every read is too slow.
// Snapshots capture the state at a checkpoint version; you replay only from snapshot + delta.
type AccountSnapshot struct {
AccountID string `json:"account_id"`
State *BankAccount `json:"state"`
Version int `json:"version"`
CreatedAt time.Time `json:"created_at"`
}
type SnapshotStore interface {
Save(ctx context.Context, snapshot AccountSnapshot) error
Load(ctx context.Context, accountID string) (*AccountSnapshot, error)
}
// LoadAccountWithSnapshot loads from snapshot + delta events (efficient)
func LoadAccountWithSnapshot(
ctx context.Context,
accountID string,
eventStore EventStore,
snapshotStore SnapshotStore,
) (*BankAccount, error) {
snapshot, err := snapshotStore.Load(ctx, accountID)
fromVersion := 0
var account *BankAccount
if err == nil && snapshot != nil {
// Start from snapshot version — only replay delta
account = snapshot.State
fromVersion = snapshot.Version + 1
} else {
account = &BankAccount{}
}
events, err := eventStore.ReadStream(ctx, fmt.Sprintf("account-%s", accountID), fromVersion)
if err != nil {
return nil, err
}
for _, event := range events {
if err := account.Apply(event); err != nil {
return nil, err
}
}
// Optionally save a new snapshot if delta was large
if len(events) > 100 {
_ = snapshotStore.Save(ctx, AccountSnapshot{
AccountID: accountID,
State: account,
Version: account.Version,
CreatedAt: time.Now(),
})
}
return account, nil
}
// ── Event Versioning — Schema Evolution ──────────────────────────────────────
// Events are immutable once stored. When the schema changes, use upcasting:
// transform old event format to new format on the way out of the store.
type EventUpcaster interface {
Upcast(event Event) Event
}
// MoneyDepositedV1 → MoneyDepositedV2 (added "currency" field)
type MoneyDepositedUpcaster struct{}
func (u MoneyDepositedUpcaster) Upcast(event Event) Event {
if event.Type != "MoneyDeposited" {
return event
}
var v1 struct {
AmountCents int64 `json:"amount_cents"`
}
if json.Unmarshal(event.Data, &v1) != nil {
return event
}
// Check if already v2 (has currency field)
var v2 struct {
AmountCents int64 `json:"amount_cents"`
Currency string `json:"currency"`
}
if json.Unmarshal(event.Data, &v2) == nil && v2.Currency != "" {
return event // already v2
}
// Upcast: add default currency
v2.AmountCents = v1.AmountCents
v2.Currency = "USD"
data, _ := json.Marshal(v2)
return Event{
ID: event.ID, StreamID: event.StreamID, Type: event.Type,
Data: data, Timestamp: event.Timestamp, Version: event.Version,
}
}When to Use
- Full audit trail required (finance, healthcare, legal, compliance)
- Need temporal queries ("what was the state at time X?")
- Event-driven architecture where downstream systems react to state changes
- Debugging complex distributed behavior (replay to reproduce bugs)
- "Undo" functionality in business workflows
When NOT to Use
- Simple CRUD with no audit or history requirements
- Event schemas change frequently (upcasting becomes complex fast)
- Team is new to event sourcing — high learning curve, easy to get wrong
- Eventual consistency is unacceptable for the primary use case
- Events are too coarse (the whole entity per change) — defeats the purpose
Real-World Usage
- Banking ledgers — the canonical use case; every transaction is an event
- Git — the commit log is event sourcing at the file system level;
git logreplays events;git checkout <sha>is a temporal query - Apache Kafka as event store — Kafka's log is append-only, retained forever with log compaction
- EventStoreDB — purpose-built event store (Greg Young's project, the author who named CQRS/ES)
- Axon Framework — Java event sourcing framework used in enterprise DDD systems
Related Patterns
- CQRS (Pattern 1) — the natural companion; event store = write model; projectors build read models from events
- System Design Ch14 Event-Driven Architecture
Pattern 3: Saga
The Analogy
Booking a trip online — you reserve a flight, then a hotel, then a car rental. These are three separate systems. If the car rental is unavailable at the end, you need to cancel the hotel and flight (compensating actions). There is no single database transaction that spans all three companies' systems. You coordinate through a sequence of actions, each reversible, with explicit compensation steps.
The Problem
Distributed transactions across microservices. An e-commerce order requires:
- Order Service — create the order
- Payment Service — charge the customer
- Inventory Service — reserve the items
- Shipping Service — schedule delivery
Each service owns its own database. Two-Phase Commit (2PC) would require all four databases to lock until every participant confirms — creating a slow, fragile distributed lock that fails if any participant goes down.
Choreography Saga
Each service publishes an event when its step completes. The next service listens and reacts. There is no central coordinator — services are autonomous.
Compensation flow when InventoryService fails:
Orchestration Saga
A central Saga Orchestrator coordinates each step. It knows the full workflow, calls each service explicitly, and drives compensation on failure.
Go Implementation — Orchestration
// ── Saga Orchestrator ─────────────────────────────────────────────────────────
// SagaData holds the shared context passed between steps.
// Use a concrete struct per saga type for type safety.
type OrderSagaData struct {
OrderID string
CustomerID string
Items []OrderItem
PaymentID string // populated by payment step
ShipmentID string // populated by shipping step
}
// SagaStep defines one unit of work plus its compensation.
// Every step MUST have a compensation — if compensation is a no-op, make that explicit.
type SagaStep[T any] struct {
Name string
Execute func(ctx context.Context, data *T) error
Compensate func(ctx context.Context, data *T) error
}
// SagaOrchestrator runs steps in sequence; compensates in reverse on failure.
type SagaOrchestrator[T any] struct {
steps []SagaStep[T]
store SagaStateStore // persist saga state for crash recovery
}
func NewOrderSagaOrchestrator(
orderSvc OrderService,
paymentSvc PaymentService,
inventorySvc InventoryService,
shippingSvc ShippingService,
store SagaStateStore,
) *SagaOrchestrator[OrderSagaData] {
steps := []SagaStep[OrderSagaData]{
{
Name: "CreateOrder",
Execute: func(ctx context.Context, data *OrderSagaData) error {
orderID, err := orderSvc.Create(ctx, data.CustomerID, data.Items)
if err != nil {
return fmt.Errorf("creating order: %w", err)
}
data.OrderID = orderID
return nil
},
Compensate: func(ctx context.Context, data *OrderSagaData) error {
return orderSvc.Cancel(ctx, data.OrderID)
},
},
{
Name: "ProcessPayment",
Execute: func(ctx context.Context, data *OrderSagaData) error {
paymentID, err := paymentSvc.Charge(ctx, data.CustomerID, calculateTotal(data.Items))
if err != nil {
return fmt.Errorf("processing payment: %w", err)
}
data.PaymentID = paymentID
return nil
},
Compensate: func(ctx context.Context, data *OrderSagaData) error {
return paymentSvc.Refund(ctx, data.PaymentID)
},
},
{
Name: "ReserveInventory",
Execute: func(ctx context.Context, data *OrderSagaData) error {
return inventorySvc.Reserve(ctx, data.OrderID, data.Items)
},
Compensate: func(ctx context.Context, data *OrderSagaData) error {
return inventorySvc.Release(ctx, data.OrderID, data.Items)
},
},
{
Name: "ScheduleShipping",
Execute: func(ctx context.Context, data *OrderSagaData) error {
shipmentID, err := shippingSvc.Schedule(ctx, data.OrderID, data.CustomerID)
if err != nil {
return fmt.Errorf("scheduling shipping: %w", err)
}
data.ShipmentID = shipmentID
return nil
},
Compensate: func(ctx context.Context, data *OrderSagaData) error {
return shippingSvc.Cancel(ctx, data.ShipmentID)
},
},
}
return &SagaOrchestrator[OrderSagaData]{steps: steps, store: store}
}
// Run executes the saga. On any failure, compensates completed steps in reverse order.
func (s *SagaOrchestrator[T]) Run(ctx context.Context, sagaID string, data *T) error {
// Persist initial state for crash recovery
if err := s.store.Save(ctx, sagaID, "started", data); err != nil {
return fmt.Errorf("persisting saga state: %w", err)
}
completed := make([]int, 0, len(s.steps))
for i, step := range s.steps {
if err := step.Execute(ctx, data); err != nil {
// Persist failure state before compensating
_ = s.store.Save(ctx, sagaID, fmt.Sprintf("failed-at-%s", step.Name), data)
// Compensate in reverse order
for j := len(completed) - 1; j >= 0; j-- {
compensateStep := s.steps[completed[j]]
if compErr := compensateStep.Compensate(ctx, data); compErr != nil {
// Log compensation failure — requires manual intervention
// Do NOT stop compensating other steps
slog.Error("compensation failed",
"saga_id", sagaID,
"step", compensateStep.Name,
"error", compErr,
)
}
}
_ = s.store.Save(ctx, sagaID, "compensated", data)
return fmt.Errorf("saga failed at step %s (step %d): %w", step.Name, i+1, err)
}
completed = append(completed, i)
// Persist progress after each successful step
_ = s.store.Save(ctx, sagaID, fmt.Sprintf("completed-%s", step.Name), data)
}
_ = s.store.Save(ctx, sagaID, "completed", data)
return nil
}
// SagaStateStore persists saga progress — critical for crash recovery.
// If the orchestrator crashes mid-saga, it can resume from the last saved checkpoint.
type SagaStateStore interface {
Save(ctx context.Context, sagaID string, status string, data any) error
Load(ctx context.Context, sagaID string) (status string, data json.RawMessage, err error)
}
// ── Idempotency — Steps may be retried ───────────────────────────────────────
// Steps must be idempotent: calling them twice has the same effect as calling once.
// Use idempotency keys passed as part of the saga data.
func (svc *paymentServiceImpl) Charge(ctx context.Context, customerID string, amount int64) (string, error) {
idempotencyKey := fmt.Sprintf("saga-%s-payment", ctx.Value("saga_id"))
// Check if already charged with this key
if existing, err := svc.findByIdempotencyKey(ctx, idempotencyKey); err == nil {
return existing.PaymentID, nil // idempotent: return existing result
}
// Actually charge
return svc.stripe.Charge(ctx, customerID, amount, idempotencyKey)
}Choreography vs Orchestration — Trade-offs
| Dimension | Choreography | Orchestration |
|---|---|---|
| Coupling | Loose — services only know about events | Tighter — orchestrator knows all services |
| Visibility | Hard — flow is implicit in event subscriptions | Easy — orchestrator defines the entire flow |
| Debugging | Hard — follow event chain across services | Easier — check orchestrator state |
| Single point of failure | None — no central coordinator | Orchestrator (mitigated by persistence) |
| Adding steps | Low risk — add new listener | Must modify orchestrator |
| Compensations | Each service handles its own rollback | Orchestrator drives compensation explicitly |
| Best for | Simple flows, high service autonomy | Complex flows, explicit control, long-running |
When to Use
- Distributed transactions across microservices that own their own databases
- Long-running business processes (order fulfillment, booking workflows)
- Eventual consistency is acceptable (the system will be consistent, but not immediately)
- Each step has a clear compensating action
When NOT to Use
- Single database — use a regular ACID transaction instead
- Strong consistency required — Saga only provides eventual consistency
- Operations that cannot be compensated (e.g., sending an email — you can send a correction but not "unsend")
- Team is not ready to handle eventual consistency complexity
Real-World Usage
- Uber trip lifecycle — requesting, accepting, route planning, billing across independent services
- E-commerce order fulfillment — every major retailer uses saga-like patterns for checkout
- Temporal.io — workflow engine that implements saga with durable execution (Go SDK available)
- AWS Step Functions — orchestration-style saga as a managed service
Related Patterns
- System Design Ch13 Microservices
- System Design Ch14 Event-Driven Architecture
Pattern 4: Strangler Fig
The Analogy
A strangler fig tree — it begins as a vine that climbs an existing tree. Over years, the fig grows around the host tree, eventually replacing it entirely. The old tree dies inside, but at no point was the forest without a tree. Incremental replacement, never a risky big-bang rewrite.
The Problem
A legacy monolith that cannot be rewritten all at once. You need to migrate to microservices, but:
- The monolith is too large to rewrite safely in one project
- Downtime is unacceptable
- Business cannot pause feature development for a 12-month rewrite
- The "big bang" rewrite has a 70%+ failure rate in industry
Go Implementation — Routing Facade
// ── Strangler Proxy ───────────────────────────────────────────────────────────
// StranglerProxy routes requests between the legacy monolith and new microservices.
// As more paths are extracted, migratedPaths grows and the monolith shrinks.
type StranglerProxy struct {
legacy *httputil.ReverseProxy
newService *httputil.ReverseProxy
migratedPaths []string
featureFlags FeatureFlagClient
}
func NewStranglerProxy(legacyURL, newServiceURL string, flags FeatureFlagClient) *StranglerProxy {
return &StranglerProxy{
legacy: newReverseProxy(legacyURL),
newService: newReverseProxy(newServiceURL),
featureFlags: flags,
migratedPaths: []string{
"/api/orders",
"/api/order-items",
},
}
}
func (p *StranglerProxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
// Check feature flag first — allows per-user or per-request rollout control
if p.featureFlags.IsEnabled(r.Context(), "use-new-order-service", userIDFromRequest(r)) {
for _, path := range p.migratedPaths {
if strings.HasPrefix(r.URL.Path, path) {
p.newService.ServeHTTP(w, r)
return
}
}
}
// Fall back to legacy monolith
p.legacy.ServeHTTP(w, r)
}
func newReverseProxy(targetURL string) *httputil.ReverseProxy {
target, _ := url.Parse(targetURL)
return httputil.NewSingleHostReverseProxy(target)
}
// ── Parallel Running — Dark Launch ───────────────────────────────────────────
// ParallelRunner sends requests to both old and new, compares responses.
// Traffic goes to old system; new system runs in shadow mode.
// Differences are logged for investigation — not returned to the user.
type ParallelRunner struct {
primary http.Handler // legacy — response returned to user
shadow http.Handler // new service — response compared silently
reporter DifferenceReporter
}
func (p *ParallelRunner) ServeHTTP(w http.ResponseWriter, r *http.Request) {
// Clone request for shadow (body can only be read once)
body, _ := io.ReadAll(r.Body)
r.Body = io.NopCloser(bytes.NewBuffer(body))
shadowReq := r.Clone(r.Context())
shadowReq.Body = io.NopCloser(bytes.NewBuffer(body))
// Serve primary — this is what the user sees
primaryRecorder := httptest.NewRecorder()
p.primary.ServeHTTP(primaryRecorder, r)
// Shadow call in background — do NOT block user response
go func() {
shadowRecorder := httptest.NewRecorder()
p.shadow.ServeHTTP(shadowRecorder, shadowReq)
if primaryRecorder.Code != shadowRecorder.Code ||
primaryRecorder.Body.String() != shadowRecorder.Body.String() {
p.reporter.Report(DifferenceReport{
Path: r.URL.Path,
PrimaryStatus: primaryRecorder.Code,
ShadowStatus: shadowRecorder.Code,
PrimaryBody: primaryRecorder.Body.String(),
ShadowBody: shadowRecorder.Body.String(),
Timestamp: time.Now(),
})
}
}()
// Copy primary response to actual client
for k, v := range primaryRecorder.Header() {
w.Header()[k] = v
}
w.WriteHeader(primaryRecorder.Code)
w.Write(primaryRecorder.Body.Bytes())
}
// ── Data Synchronization During Migration ────────────────────────────────────
// While migrating, both old and new systems may need to read/write the same data.
// Use a dual-write strategy: write to both, read from one (gradually shift reads).
type DualWriteOrderRepository struct {
legacy LegacyOrderRepository
newRepo OrderRepository
readFrom string // "legacy" or "new" — controlled by feature flag
flags FeatureFlagClient
}
func (r *DualWriteOrderRepository) Save(ctx context.Context, order *Order) error {
// Write to both systems — new system is kept in sync
if err := r.legacy.Save(ctx, order); err != nil {
return fmt.Errorf("legacy save: %w", err)
}
if err := r.newRepo.Save(ctx, order); err != nil {
// Log but don't fail — new system is not authoritative yet
slog.Warn("new repo save failed during migration", "error", err, "order_id", order.ID)
}
return nil
}
func (r *DualWriteOrderRepository) FindByID(ctx context.Context, id string) (*Order, error) {
if r.flags.IsEnabled(ctx, "read-from-new-order-service", "") {
return r.newRepo.FindByID(ctx, id)
}
return r.legacy.FindByID(ctx, id)
}When to Use
- Migrating a legacy monolith to microservices (the primary use case)
- Risk-averse incremental approach — each extraction is independently verifiable
- Team must maintain both old and new during migration (acceptable trade-off)
- Clear service boundaries exist in the monolith (natural seams to cut along)
When NOT to Use
- Greenfield project — no legacy to strangle
- The legacy system is too entangled to isolate any piece (the seams don't exist)
- Business has appetite for a planned big-bang rewrite with a long freeze (rare but valid)
- The monolith is small enough to rewrite in a sprint
Real-World Usage
- Shopify — decomposed a massive Rails monolith over years using the strangler pattern; Storefront Renderer was extracted first
- Amazon — Jeff Bezos's famous "you must communicate through APIs" mandate; teams extracted services from a monolith over years
- Spotify — backend services progressively extracted with a routing layer in front
- Netflix — migration from data center monolith to AWS microservices used this pattern
Feature Flags + Strangler Fig
Feature Flags are a natural companion for Strangler Fig migrations. Rather than routing by URL path alone, you can gate traffic to the new service per user, per percentage, or per region — enabling safe canary rollouts and instant rollback without a proxy config change. The code example above already shows this via FeatureFlagClient.IsEnabled(). Tools like LaunchDarkly, Unleash, and Flagsmith provide production-grade flag evaluation; alternatively a simple Redis key or environment variable works for low-scale migrations.
Related Patterns
- Proxy (Ch02) — the routing facade IS a proxy pattern
- Facade (Ch02) — the strangler facade hides migration complexity from clients
- System Design Ch13 Microservices
Pattern 5: Sidecar
The Analogy
A motorcycle sidecar. The motorcycle (main service) focuses entirely on driving — business logic. The sidecar (auxiliary process) handles passengers and cargo — cross-cutting concerns. They travel together, share a physical connection, but have separate purposes. The motorcycle doesn't need to know what's in the sidecar.
The Problem
Every microservice needs:
- Structured logging forwarded to a central aggregation system
- Metrics emitted to Prometheus or Datadog
- Mutual TLS (mTLS) for service-to-service encryption
- Service discovery and load balancing
- Config reload without restart
Implementing these in every service (especially across Go, Java, Python, Node.js polyglot services) is massive duplication. The infrastructure team needs to update logging in 47 services.
Kubernetes Pod Model
In Kubernetes, a pod is the unit of deployment — it can contain multiple containers that share:
- Network namespace — same IP address, communicate via
localhost - Volumes — shared file system paths
- Lifecycle — started and stopped together
This co-location model makes the sidecar pattern native to Kubernetes.
Go Implementation
// ── Main Service — Writes to Shared Log File ──────────────────────────────────
// The main application just writes structured JSON logs to a file.
// It has no knowledge of Elasticsearch, log aggregation, or forwarding.
// The sidecar handles all that.
func setupLogger(logPath string) (*slog.Logger, error) {
logFile, err := os.OpenFile(logPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
if err != nil {
return nil, fmt.Errorf("opening log file: %w", err)
}
// JSON handler — structured, machine-parseable
handler := slog.NewJSONHandler(logFile, &slog.HandlerOptions{
Level: slog.LevelInfo,
})
return slog.New(handler), nil
}
func main() {
// Log path is a shared volume in the Kubernetes pod spec
logger, err := setupLogger("/var/log/app/service.log")
if err != nil {
log.Fatal(err)
}
slog.SetDefault(logger)
http.HandleFunc("/orders", func(w http.ResponseWriter, r *http.Request) {
slog.Info("handling request",
"method", r.Method,
"path", r.URL.Path,
"user_agent", r.UserAgent(),
"request_id", r.Header.Get("X-Request-ID"),
)
// ... business logic
})
slog.Info("service starting", "port", 8080)
log.Fatal(http.ListenAndServe(":8080", nil))
}
// ── Sidecar: Log Forwarder ────────────────────────────────────────────────────
// This runs as a separate container in the same pod.
// It reads the shared log file and forwards to a central system.
type LogForwarder struct {
logPath string
endpoint string
client *http.Client
}
func (f *LogForwarder) Run(ctx context.Context) error {
slog.Info("log forwarder starting", "path", f.logPath, "endpoint", f.endpoint)
file, err := os.Open(f.logPath)
if err != nil {
return fmt.Errorf("opening log file: %w", err)
}
defer file.Close()
// Seek to end — only forward new lines
file.Seek(0, io.SeekEnd)
reader := bufio.NewReader(file)
ticker := time.NewTicker(100 * time.Millisecond)
defer ticker.Stop()
var buffer []string
for {
select {
case <-ctx.Done():
return ctx.Err()
case <-ticker.C:
// Read all available lines
for {
line, err := reader.ReadString('\n')
if len(line) > 0 {
buffer = append(buffer, strings.TrimRight(line, "\n"))
}
if err == io.EOF {
break
}
if err != nil {
slog.Error("reading log file", "error", err)
break
}
}
// Batch forward if we have lines
if len(buffer) > 0 {
if err := f.forward(ctx, buffer); err != nil {
slog.Error("forwarding logs", "error", err, "count", len(buffer))
// Keep buffer — retry next tick
} else {
buffer = buffer[:0] // clear on success
}
}
}
}
}
func (f *LogForwarder) forward(ctx context.Context, lines []string) error {
payload := map[string]interface{}{
"entries": lines,
"timestamp": time.Now().UTC(),
}
body, _ := json.Marshal(payload)
req, _ := http.NewRequestWithContext(ctx, "POST", f.endpoint, bytes.NewReader(body))
req.Header.Set("Content-Type", "application/json")
resp, err := f.client.Do(req)
if err != nil {
return err
}
defer resp.Body.Close()
if resp.StatusCode >= 400 {
return fmt.Errorf("endpoint returned %d", resp.StatusCode)
}
return nil
}
// ── Sidecar: Config Watcher ───────────────────────────────────────────────────
// A config-watching sidecar monitors a ConfigMap (Kubernetes) or config file
// and signals the main service to reload — without a restart.
type ConfigWatcher struct {
configPath string
onUpdate func(newConfig []byte)
}
func (w *ConfigWatcher) Watch(ctx context.Context) error {
lastHash := ""
for {
select {
case <-ctx.Done():
return ctx.Err()
case <-time.After(5 * time.Second):
data, err := os.ReadFile(w.configPath)
if err != nil {
slog.Error("reading config", "error", err)
continue
}
hash := fmt.Sprintf("%x", sha256.Sum256(data))
if hash != lastHash {
slog.Info("config changed, signaling reload")
w.onUpdate(data)
lastHash = hash
}
}
}
}Service Mesh — Sidecar at Scale and Ambient Alternatives
Emerging Trend: Ambient / Sidecar-less Mesh (2025–2026)
Newer service mesh deployments increasingly use ambient mode (Istio Ambient, Cilium with eBPF) where mTLS, observability, and policy are enforced at the node layer via eBPF rather than per-pod sidecars. This eliminates sidecar resource overhead (~100–200 MB RAM per pod). In 2026, ambient mesh is production-ready but requires Kubernetes 1.28+ and Linux kernel ≥5.10. The sidecar model remains valid and widely deployed; ambient mesh is an emerging alternative, not a replacement.
When every pod has an Envoy sidecar, the sidecars form a service mesh. The mesh handles:
- mTLS everywhere — automatic certificate rotation, no TLS code in application
- Distributed tracing — Envoy injects trace headers (Jaeger, Zipkin) transparently
- Circuit breaking — Envoy retries and circuit breaks without application changes
- Canary routing — mesh routes X% of traffic to new service version
- Observability — metrics for every RPC across the entire fleet
When to Use
- Cross-cutting concerns (logging, metrics, mTLS) across polyglot microservices
- Service mesh architectures (Istio, Linkerd, Consul Connect)
- Kubernetes deployments — pod model makes sidecar first-class
- Infrastructure concerns must be separated from business logic
- Same sidecar can be updated across fleet without touching application code
When NOT to Use
- Monolithic application — no pods, no natural sidecar deployment
- Latency-sensitive paths where even localhost overhead matters (sub-millisecond requirements)
- Simple deployments without container orchestration (Docker Compose, bare VMs)
- Team is not yet operating Kubernetes — operational complexity is real
Real-World Usage
- Istio/Envoy — the canonical service mesh sidecar; every pod in the mesh has an Envoy sidecar
- Fluentd / Filebeat / Vector — log collection sidecars in every pod
- HashiCorp Vault Agent — sidecar that fetches and renews secrets, writes to shared volume
- Dapr — Microsoft's Distributed Application Runtime deploys as a sidecar in every pod, providing pub/sub, state store, and service invocation APIs
- Linkerd — lightweight Rust-based service mesh with a small-footprint sidecar proxy
Related Patterns
- Adapter (Ch02) — sidecar is the adapter pattern at infrastructure level; it translates between the app and infrastructure APIs
- Decorator (Ch02) — sidecar decorates service with additional behavior without modifying it
- System Design Ch23 Cloud-Native
Pattern 6: Transactional Outbox
The Analogy
You write a cheque and put it in an outbox on your desk. Later, a mail assistant picks it up and sends it. The critical insight: writing the cheque and mailing it are separate steps. If you tried to write the cheque and hand-deliver it in a single action, one failure could leave you with no record of what happened. The outbox guarantees the cheque exists before anyone tries to deliver it.
The Problem
In a microservice, you need to update the database AND publish an event — but you cannot do both atomically with a standard message broker. This is the dual-write problem:
If the service crashes between the DB write and the event publish, the order exists in the database but no downstream service (inventory, shipping, email) ever reacts. Your system is now silently inconsistent.
Naive solutions fail:
- Publish first, then DB write: Event fires but order is never saved — double charge risk
- Two-phase commit across DB + broker: Too slow, most brokers don't support it
- Eventual retry without tracking: No way to know which events were successfully published
Solution Architecture
The outbox table lives in the same database as the orders table. Writing the order and writing the outbox event happen in a single ACID transaction — they either both commit or both roll back. The relay process reads unpublished events and delivers them asynchronously.
Go Implementation
// outbox.go
package outbox
import (
"context"
"database/sql"
"encoding/json"
"fmt"
"log/slog"
"time"
)
// OutboxEvent is a row in the outbox table.
// Schema:
// CREATE TABLE outbox_events (
// id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
// topic TEXT NOT NULL,
// payload JSONB NOT NULL,
// published BOOLEAN NOT NULL DEFAULT false,
// created_at TIMESTAMPTZ NOT NULL DEFAULT now()
// );
type OutboxEvent struct {
ID string
Topic string
Payload json.RawMessage
Published bool
CreatedAt time.Time
}
// OrderCreatedPayload is the event schema for order creation.
type OrderCreatedPayload struct {
OrderID string `json:"order_id"`
CustomerID string `json:"customer_id"`
TotalCents int64 `json:"total_cents"`
CreatedAt time.Time `json:"created_at"`
}
// OrderService uses a single DB transaction to write the order + outbox event atomically.
type OrderService struct {
db *sql.DB
}
// PlaceOrder writes order + outbox event in a single transaction.
// Either both are committed or neither is — no dual-write inconsistency.
func (s *OrderService) PlaceOrder(ctx context.Context, customerID string, totalCents int64) (string, error) {
tx, err := s.db.BeginTx(ctx, nil)
if err != nil {
return "", fmt.Errorf("begin tx: %w", err)
}
defer tx.Rollback()
// Step 1: write the order
var orderID string
err = tx.QueryRowContext(ctx,
`INSERT INTO orders (customer_id, total_cents, status)
VALUES ($1, $2, 'pending')
RETURNING id`,
customerID, totalCents,
).Scan(&orderID)
if err != nil {
return "", fmt.Errorf("insert order: %w", err)
}
// Step 2: write the outbox event IN THE SAME TRANSACTION
payload, _ := json.Marshal(OrderCreatedPayload{
OrderID: orderID,
CustomerID: customerID,
TotalCents: totalCents,
CreatedAt: time.Now(),
})
_, err = tx.ExecContext(ctx,
`INSERT INTO outbox_events (topic, payload) VALUES ($1, $2)`,
"order.created", payload,
)
if err != nil {
return "", fmt.Errorf("insert outbox event: %w", err)
}
// Single COMMIT covers both inserts atomically
if err := tx.Commit(); err != nil {
return "", fmt.Errorf("commit: %w", err)
}
return orderID, nil
}
// ── Outbox Relay — publishes unpublished events ───────────────────────────────
// Publisher abstracts the event broker (Kafka, SNS, RabbitMQ, etc.)
type Publisher interface {
Publish(ctx context.Context, topic string, payload json.RawMessage) error
}
// OutboxRelay polls the outbox table and delivers unpublished events.
// In production, use a CDC connector (Debezium) for lower latency
// and to avoid polling overhead. The polling approach is simpler to deploy.
type OutboxRelay struct {
db *sql.DB
publisher Publisher
interval time.Duration
batchSize int
}
func NewOutboxRelay(db *sql.DB, publisher Publisher) *OutboxRelay {
return &OutboxRelay{
db: db,
publisher: publisher,
interval: 500 * time.Millisecond,
batchSize: 50,
}
}
// Run polls the outbox and publishes pending events until ctx is cancelled.
func (r *OutboxRelay) Run(ctx context.Context) error {
slog.Info("outbox relay starting", "interval", r.interval, "batch_size", r.batchSize)
ticker := time.NewTicker(r.interval)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return ctx.Err()
case <-ticker.C:
if err := r.processBatch(ctx); err != nil {
slog.Error("outbox relay batch failed", "error", err)
// Keep running — transient errors should not stop the relay
}
}
}
}
func (r *OutboxRelay) processBatch(ctx context.Context) error {
// Fetch unpublished events — FOR UPDATE SKIP LOCKED enables
// multiple relay instances without double-processing
rows, err := r.db.QueryContext(ctx, `
SELECT id, topic, payload
FROM outbox_events
WHERE published = false
ORDER BY created_at
LIMIT $1
FOR UPDATE SKIP LOCKED
`, r.batchSize)
if err != nil {
return fmt.Errorf("query outbox: %w", err)
}
defer rows.Close()
var events []OutboxEvent
for rows.Next() {
var e OutboxEvent
if err := rows.Scan(&e.ID, &e.Topic, &e.Payload); err != nil {
return fmt.Errorf("scan outbox row: %w", err)
}
events = append(events, e)
}
if err := rows.Err(); err != nil {
return fmt.Errorf("rows error: %w", err)
}
for _, event := range events {
if err := r.publisher.Publish(ctx, event.Topic, event.Payload); err != nil {
// Log but continue — other events should not be blocked
slog.Error("publishing outbox event", "id", event.ID, "topic", event.Topic, "error", err)
continue
}
// Mark as published only after successful delivery
if _, err := r.db.ExecContext(ctx,
`UPDATE outbox_events SET published = true WHERE id = $1`,
event.ID,
); err != nil {
slog.Error("marking outbox event published", "id", event.ID, "error", err)
}
}
return nil
}CDC as an Alternative to Polling
Change Data Capture (CDC) with Debezium reads the database's write-ahead log (WAL) directly — no polling, near-zero latency, and no extra DB load. Debezium streams changes to Kafka topics. This is the production-grade approach at scale. The polling relay above is simpler to deploy and sufficient for lower-volume systems.
When to Use
- Any service that must update a database AND publish an event atomically — this is the canonical dual-write fix
- Saga choreography: each saga step writes its state + publishes its event in one transaction
- CQRS: command side writes aggregate + publishes domain event in one transaction
- Any integration where "event lost after DB write" would cause silent inconsistency
When NOT to Use
- Your broker supports distributed transactions with your DB (rare, non-standard)
- The operation is read-only — no state change, no outbox needed
- Eventual delivery is unacceptable — outbox guarantees at-least-once delivery, consumers must handle duplicates (idempotency keys help — see Pattern 7)
- Simple in-process event bus where durability is not required
Trade-offs
| Dimension | Impact |
|---|---|
| Consistency guarantee | Strong — event is guaranteed if transaction commits |
| Latency | Small added latency from relay polling (CDC reduces this) |
| Complexity | Requires outbox table + relay process; CDC adds Debezium setup |
| At-least-once delivery | Relay may re-deliver on crash; consumers need idempotency |
| DB coupling | Outbox table lives in same DB — cannot easily span DB boundaries |
Related Patterns
- Saga (Pattern 3) — Outbox is the recommended delivery mechanism for Saga events; ensures saga step events are not lost
- Event Sourcing (Pattern 2) — event store is naturally outbox-compatible; appending to the store is the "write"
- Idempotency Key (Pattern 7) — pairs with Outbox to handle at-least-once delivery safely
- System Design Ch14 Event-Driven Architecture
Pattern 7: Idempotency Key
The Analogy
You submit a bank transfer form and the bank's website times out. Did the transfer go through? You submit again — but now you risk a double transfer. Banks solve this by printing a unique transaction reference on every form. If you submit the same form twice, the bank detects the duplicate reference and returns the original result without processing the transfer a second time. The reference number is the idempotency key.
The Problem
Retries are necessary in distributed systems (see Ch04 — Retry with Backoff), but retrying without idempotency guarantees can cause duplicate operations:
Client → POST /charges (amount: $100)
Server processes payment → network drops before response
Client retries → POST /charges (amount: $100)
Server processes AGAIN → customer charged $200This is especially dangerous for non-idempotent operations: charges, fund transfers, email sends, order placements. The Retry-After problem: you don't know if the server processed the first request before the network failed.
Solution: Store Result by Key
Go Implementation
// idempotency.go
package idempotency
import (
"context"
"encoding/json"
"errors"
"fmt"
"log/slog"
"net/http"
"time"
)
// ErrKeyConflict is returned when a key is already in use by another in-flight request.
var ErrKeyConflict = errors.New("idempotency key is being processed by another request")
// Result is the stored response for a given idempotency key.
type Result struct {
StatusCode int `json:"status_code"`
Body json.RawMessage `json:"body"`
StoredAt time.Time `json:"stored_at"`
}
// Store persists and retrieves idempotency key results.
// Production implementations: Redis (with TTL), PostgreSQL (with UPSERT).
type Store interface {
// Get returns the stored result for a key, or (nil, nil) if not found.
Get(ctx context.Context, key string) (*Result, error)
// Set stores a result for a key. TTL controls expiry (e.g., 24h).
Set(ctx context.Context, key string, result Result, ttl time.Duration) error
// Lock acquires an exclusive lock on a key while it is being processed.
// Returns ErrKeyConflict if the key is already locked.
Lock(ctx context.Context, key string, ttl time.Duration) (unlock func(), err error)
}
// IdempotencyMiddleware wraps HTTP handlers with idempotency key enforcement.
// The client MUST supply the Idempotency-Key header for protected endpoints.
func IdempotencyMiddleware(store Store, ttl time.Duration) func(http.Handler) http.Handler {
return func(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
key := r.Header.Get("Idempotency-Key")
if key == "" {
// No key — pass through (non-protected endpoints)
next.ServeHTTP(w, r)
return
}
ctx := r.Context()
// 1. Check for cached result
cached, err := store.Get(ctx, key)
if err != nil {
slog.Error("idempotency store get failed", "key", key, "error", err)
http.Error(w, "internal error", http.StatusInternalServerError)
return
}
if cached != nil {
// Return stored result — no processing
slog.Info("idempotency cache hit", "key", key, "status", cached.StatusCode)
w.Header().Set("Content-Type", "application/json")
w.Header().Set("Idempotency-Replayed", "true")
w.WriteHeader(cached.StatusCode)
w.Write(cached.Body)
return
}
// 2. Lock the key to prevent concurrent duplicate requests
unlock, err := store.Lock(ctx, key, 30*time.Second)
if errors.Is(err, ErrKeyConflict) {
http.Error(w, "request with this key is already in progress", http.StatusConflict)
return
}
if err != nil {
slog.Error("idempotency lock failed", "key", key, "error", err)
http.Error(w, "internal error", http.StatusInternalServerError)
return
}
defer unlock()
// 3. Execute the handler, capturing response
recorder := &responseRecorder{header: w.Header(), statusCode: http.StatusOK}
next.ServeHTTP(recorder, r)
// 4. Store result for future retries (only for success/client-error responses)
// Do NOT cache 5xx — those should be retried against the real handler
if recorder.statusCode < 500 {
result := Result{
StatusCode: recorder.statusCode,
Body: recorder.body,
StoredAt: time.Now(),
}
if err := store.Set(ctx, key, result, ttl); err != nil {
slog.Error("idempotency store set failed", "key", key, "error", err)
// Non-fatal — response already written; log and continue
}
}
// Write actual response to client
w.WriteHeader(recorder.statusCode)
w.Write(recorder.body)
})
}
}
// responseRecorder captures the handler's response for storage.
type responseRecorder struct {
header http.Header
statusCode int
body []byte
}
func (r *responseRecorder) Header() http.Header { return r.header }
func (r *responseRecorder) WriteHeader(statusCode int) { r.statusCode = statusCode }
func (r *responseRecorder) Write(b []byte) (int, error) {
r.body = append(r.body, b...)
return len(b), nil
}
// ── Client-Side: Generating Idempotency Keys ──────────────────────────────────
// generateIdempotencyKey creates a stable key scoped to a specific user operation.
// The same user + intent + session always produces the same key — enabling safe retry.
// Different intents produce different keys — preventing cross-operation collisions.
func generateIdempotencyKey(userID, intent, sessionID string) string {
// In production: use a UUID generated client-side and persisted per user action.
// The key should be opaque to the server — just a unique string.
// Stripe recommends: generate once per user action, store on the client,
// reuse for all retries of that action.
return fmt.Sprintf("%s-%s-%s", userID, intent, sessionID)
}
// Usage: client generates key once per "charge attempt" and sends it in every retry
// req.Header.Set("Idempotency-Key", generateIdempotencyKey(userID, "checkout-2024", sessionID))IETF Draft: Idempotency-Key Header
The Idempotency-Key HTTP header is standardized as an IETF draft (draft-ietf-httpapi-idempotency-key-header). Stripe popularized the pattern; PayPal, Adyen, and most major payment APIs have adopted it. As of 2026 the draft remains in progress but the pattern is universally implemented.
When to Use
- Any non-idempotent operation that must support retries: payments, fund transfers, order placements, email sends
- APIs that return results to clients who may retry on network error
- Saga steps that can be retried by the orchestrator after partial failure
- External webhook receivers — ensure you do not process the same event twice
When NOT to Use
- Read-only operations — GET is naturally idempotent
- Operations that are already idempotent by design (PUT with full resource, DELETE)
- Low-stakes fire-and-forget writes where duplicate is acceptable
- Internal database writes in a single transaction — use the DB's own transaction semantics
Trade-offs
| Dimension | Impact |
|---|---|
| Safety | Eliminates double-charge, double-create, double-send on retry |
| Storage cost | Keys and results stored per operation (TTL limits growth) |
| Latency | One extra store lookup per request (< 1ms with Redis) |
| Key management | Client must generate, store, and reuse keys correctly |
| At-least-once delivery | Server may process before storing result; use a transactional approach for critical ops |
Related Patterns
- Retry with Backoff (Ch04) — Retry is only safe when operations are idempotent; Idempotency Key makes non-idempotent operations retry-safe
- Transactional Outbox (Pattern 6) — Outbox delivers at-least-once; consumers use idempotency keys to deduplicate
- Saga (Pattern 3) — each saga step should be idempotent so the orchestrator can safely retry failed steps
Pattern 8: Backends for Frontends (BFF)
The Analogy
A hotel concierge desk has different staff for different guest types. Business travelers get a concierge who books conference rooms, arranges car service, and prints boarding passes. Leisure guests get a concierge who books restaurants, arranges tours, and maps out local attractions. Both desks serve the same hotel, but each is optimized for a completely different set of needs. A single desk trying to serve everyone equally would serve nobody well.
The BFF pattern applies this insight to API design: each client type (mobile, web browser, voice assistant, IoT device) gets a dedicated backend service shaped for its specific needs.
The Problem
A single shared API must serve clients with radically different requirements:
Mobile app needs: small payloads, offline-first, battery-aware, push notifications
Web dashboard needs: rich data, real-time updates, large datasets, complex filters
Voice assistant needs: ultra-minimal text, no images, very fast responses
IoT device needs: tiny binary payloads, offline tolerance, rare connectivityA single backend either over-fetches for simple clients or under-fetches for rich clients. Over time, the API accumulates field additions for each client — creating a bloated, hard-to-evolve interface. Team coordination becomes a bottleneck: mobile and web teams must negotiate every API change through a shared backend team.
Solution Architecture
Each BFF is a thin aggregation layer — it calls downstream services, combines responses, and shapes data specifically for its client. Teams gain autonomy: the mobile team owns the Mobile BFF and can evolve it without coordinating with the web team.
Go Implementation
// mobile_bff.go — BFF for the mobile app client
package mobilebff
import (
"context"
"encoding/json"
"fmt"
"log/slog"
"net/http"
"time"
)
// ── Downstream service clients (injected via DI) ──────────────────────────────
type OrderSummary struct {
OrderID string `json:"order_id"`
Status string `json:"status"`
TotalCents int64 `json:"total_cents"`
CreatedAt time.Time `json:"created_at"`
}
type UserProfile struct {
UserID string `json:"user_id"`
Name string `json:"name"`
AvatarURL string `json:"avatar_url"`
}
type OrderService interface {
ListRecentOrders(ctx context.Context, userID string, limit int) ([]OrderSummary, error)
}
type UserService interface {
GetProfile(ctx context.Context, userID string) (*UserProfile, error)
}
// ── Mobile-specific response shapes ──────────────────────────────────────────
// MobileHomeResponse is shaped for the mobile app's home screen.
// It is compact, pre-aggregated, and contains only what the screen needs.
// A web dashboard would want more fields; a voice assistant would want far fewer.
type MobileHomeResponse struct {
User MobileUser `json:"user"`
RecentOrders []MobileOrder `json:"recent_orders"`
BadgeCount int `json:"badge_count"` // unread notifications
}
type MobileUser struct {
Name string `json:"name"`
AvatarURL string `json:"avatar_url"`
}
type MobileOrder struct {
OrderID string `json:"order_id"`
Status string `json:"status"`
Total string `json:"total"` // pre-formatted: "$29.99"
}
// ── BFF Handler ───────────────────────────────────────────────────────────────
type MobileBFF struct {
orders OrderService
users UserService
}
func NewMobileBFF(orders OrderService, users UserService) *MobileBFF {
return &MobileBFF{orders: orders, users: users}
}
// GetHomeScreen is the mobile-specific aggregator.
// It fetches from multiple downstream services in parallel and
// returns a pre-shaped response the mobile app can render directly.
func (b *MobileBFF) GetHomeScreen(w http.ResponseWriter, r *http.Request) {
userID := userIDFromContext(r.Context())
ctx := r.Context()
// Parallel fetch from downstream services — do NOT serialize these
type result struct {
profile *UserProfile
orders []OrderSummary
profErr error
ordErr error
}
ch := make(chan result, 1)
go func() {
var res result
// Both calls in parallel via goroutines
done := make(chan struct{}, 2)
go func() {
res.profile, res.profErr = b.users.GetProfile(ctx, userID)
done <- struct{}{}
}()
go func() {
res.orders, res.ordErr = b.orders.ListRecentOrders(ctx, userID, 5)
done <- struct{}{}
}()
<-done
<-done
ch <- res
}()
res := <-ch
if res.profErr != nil {
slog.Error("mobile bff: fetching user profile", "user_id", userID, "error", res.profErr)
http.Error(w, "failed to load profile", http.StatusInternalServerError)
return
}
if res.ordErr != nil {
slog.Error("mobile bff: fetching orders", "user_id", userID, "error", res.ordErr)
http.Error(w, "failed to load orders", http.StatusInternalServerError)
return
}
// Shape the response for mobile
mobileOrders := make([]MobileOrder, len(res.orders))
for i, o := range res.orders {
mobileOrders[i] = MobileOrder{
OrderID: o.OrderID,
Status: o.Status,
Total: fmt.Sprintf("$%.2f", float64(o.TotalCents)/100),
}
}
resp := MobileHomeResponse{
User: MobileUser{Name: res.profile.Name, AvatarURL: res.profile.AvatarURL},
RecentOrders: mobileOrders,
BadgeCount: 0, // fetched from notification service in real impl
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(resp)
}
// helper — in real code this reads from JWT claims in context
func userIDFromContext(ctx context.Context) string {
if v, ok := ctx.Value("user_id").(string); ok {
return v
}
return ""
}When to Use
- Multiple distinct client types with significantly different data needs (mobile, web, voice, IoT, partner API)
- Different team ownership — mobile team wants to evolve the mobile API independently of the web team
- Performance-critical clients — mobile apps need minimal payloads to reduce battery and bandwidth usage
- Aggregation at the edge — the BFF aggregates multiple downstream service calls so clients make one request instead of five
- Client-specific auth/security — mobile clients might use device certificates; web uses session cookies; BFF handles the translation
When NOT to Use
- Only one client type — a single API is simpler and avoids the duplication of N BFFs
- BFF logic is identical across clients — duplication without differentiation
- The team cannot own multiple BFF services operationally — overhead without staffing
- Downstream services already serve well-shaped client-specific responses
Trade-offs
| Dimension | Impact |
|---|---|
| Team autonomy | Each client team owns and evolves their BFF independently |
| Performance | BFF parallel-fetches downstream services; client makes one request |
| Duplication risk | Common aggregation logic may be duplicated across BFFs — extract shared libraries |
| Operational overhead | N more services to deploy, monitor, and scale |
| Security surface | Each BFF is a separate auth boundary — plan security per client type |
Related Patterns
- Facade (Ch02) — BFF is an application-layer facade; it hides multiple downstream services behind a client-specific interface
- Adapter (Ch02) — BFF adapts downstream service responses to client-specific shapes
- Strangler Fig (Pattern 4) — BFF is often introduced during migration to decouple clients from the monolith
- Aggregator Gateway — when a single BFF serves all clients with per-request shaping, it becomes an API gateway pattern
CQRS + Event Sourcing: The Natural Combination
These two patterns pair together so frequently they are often discussed as a single pattern. Here is why and how:
Why they fit together:
- Event store = natural write model for CQRS — commands produce events, events go to the store
- Projectors = CQRS read model builders — each projection is an optimized read model for a different query pattern
- Multiple read models from one event stream — the same events project to Postgres for queries, Elasticsearch for search, ClickHouse for analytics
- Rebuild read models from scratch — if a projector has a bug, replay all events to rebuild the read model correctly
- Temporal queries for free — replay events to point-in-time for the read model
The trade-off: Combined complexity is high. Use when you genuinely need both full audit history AND query model flexibility. Don't use for CRUD.
Pattern Comparison Table
| Pattern | Scope | Consistency Model | Complexity | Primary Trade-off |
|---|---|---|---|---|
| CQRS | Single/multi-service | Eventual (read lag) | Medium | Read performance vs model complexity |
| Event Sourcing | Single service (event store) | Eventual | High | Full history + replay vs storage cost |
| Saga | Multi-service | Eventual (with compensation) | High | Distributed consistency vs compensation complexity |
| Strangler Fig | System migration | N/A — migration tool | Medium | Safe migration vs maintaining two systems |
| Sidecar | Infrastructure layer | N/A — infrastructure | Low–Medium | Concerns separation vs operational overhead |
| Transactional Outbox | Single service ↔ broker | At-least-once guaranteed | Low–Medium | Guaranteed delivery vs relay/CDC complexity |
| Idempotency Key | Per-operation | Exactly-once semantics | Low | Retry safety vs key storage + client discipline |
| BFF | Client ↔ services boundary | N/A — aggregation | Medium | Team autonomy vs N-service operational overhead |
Decision guide:
Practice Questions
Easy
1. CQRS without Event Sourcing
When would you use CQRS without Event Sourcing? Describe a concrete scenario and explain what the write model stores (if not events) and how the read model gets updated.
Hint
A CMS with heavy read traffic but simple writes — articles are stored as rows in Postgres (write model). A read model is a denormalized Redis cache updated by a database trigger or a CDC (Change Data Capture) pipeline when articles change. No need for event history — just separate models for performance.Medium
2. Saga for Travel Booking
Design a Saga for a travel booking system: the user books a flight, hotel, and car rental as a bundle. Each is a separate service with its own database.
- Draw the happy path and the compensation path when the car rental is unavailable
- Choose choreography or orchestration and justify your choice
- Identify which steps are hardest to compensate and why
Hint
Orchestration fits better here — the booking flow is sequential and explicit, and a travel platform benefits from seeing the full saga state in one place (for customer support). Compensation: cancel car = easy (reservation not confirmed); cancel hotel = easy (refundable window); cancel flight = hard (cancellation fees, airline APIs may not be idempotent). The flight cancellation compensation must handle partial refunds and retry scenarios.3. Strangler Fig Migration Plan
You are the lead engineer at an e-commerce company with a 10-year-old Rails monolith. The monolith handles: product catalog, user accounts, orders, payments, inventory, and search. Plan a Strangler Fig migration. Which service do you extract first, and why? What is your data synchronization strategy during migration?
Hint
Extract **Search** first — it has the clearest boundary (read-only from the monolith's perspective), can be rebuilt from a product catalog snapshot, and has no write operations that need dual-write. After search, extract **payments** (hardest) or **catalog** (safer). Data sync: use CDC (Debezium on the monolith DB) to stream changes to the new service's DB during the migration period.Hard
4. Event-Sourced Banking System with Temporal Queries
Design an event-sourced bank account system that supports:
- Current balance queries (sub-10ms)
- "Balance as of date X" queries
- Monthly statement generation (all transactions in a month)
- 99.99% account accuracy — no balance can ever be lost
Address:
- How do snapshots work and when do you create them?
- How do you handle event schema evolution when the
MoneyTransferredevent gains a newcurrencyfield? - What happens if the projector that builds the balance read model crashes halfway through replaying events?
Hint
Snapshots: create after every N events (e.g., 100) or on a daily schedule, stored with the version number. For current balance, load latest snapshot + replay delta events — O(delta) not O(all). Schema evolution: use upcasters that transform old events on read (v1 → v2 by adding `currency: "USD"` default). Projector crash: projectors must be idempotent and track their last processed event version; on restart, resume from the last committed version checkpoint. Use an outbox pattern to guarantee events reach the projector exactly-once.Related Chapters
| Chapter | Relevance |
|---|---|
| Ch02 — Structural Patterns | Proxy (Strangler), Adapter (Sidecar), Decorator (Sidecar) |
| Ch04 — Modern Application Patterns | Repository (CQRS read/write repos), Circuit Breaker (Saga resilience) |
| Ch13 — Microservices | Service boundaries for Saga and Strangler Fig |
| Ch14 — Event-Driven Architecture | Event buses for CQRS projectors and Choreography Saga |
| Ch15 — Data Replication | Eventual consistency guarantees underlying CQRS and Event Sourcing |
Key Takeaway
Distributed system patterns solve problems that cannot be solved with code structure alone. CQRS and Event Sourcing address the fundamental conflict between optimizing for reads vs writes. Saga addresses the impossibility of atomic distributed transactions. Strangler Fig addresses the business reality that you cannot pause development for a big-bang rewrite. Sidecar addresses the duplication tax of polyglot infrastructure. Each pattern adds complexity — apply them only when the problem is real.
References & Further Reading
- Microsoft Azure — CQRS Pattern — Architecture reference with trade-off analysis
- Martin Fowler — Event Sourcing — Original definition and motivation
- microservices.io — Saga Pattern — Chris Richardson's canonical reference
- ByteByteGo — Saga Pattern Demystified — Orchestration vs Choreography visual guide
- AWS — Saga Orchestration Pattern — AWS prescriptive guidance
- Martin Fowler — Strangler Fig Application — Original article naming the pattern
- Microsoft Azure — Sidecar Pattern — Architecture reference
- Temporal.io — Saga Pattern — Production implementation guide

Comments powered by Giscus. Enable GitHub Discussions on the repo to activate.