
Chapter 16: Security & Reliability ​


Security is not a feature you add at the end; it is a property you design in from the start. The same is true of reliability: systems that survive failure are built to expect it.


Authentication vs Authorization ​

These two concepts are consistently confused in interviews and in code. They are distinct concerns with different scopes:

| Concept | Question Answered | Example | Enforcement Point |
|---|---|---|---|
| Authentication (AuthN) | Who are you? | Verifying username + password | Login endpoint, API gateway |
| Authorization (AuthZ) | What can you do? | Can this user delete this resource? | Business logic, middleware |

Authentication always precedes authorization. A system cannot determine what an identity is allowed to do before confirming that identity. However, authorization decisions can change without re-authenticating: a user's role may be revoked while their session remains active, which is why token expiry and revocation matter.

Common authorization models include Role-Based Access Control (RBAC), which assigns permissions to roles and roles to users, and Attribute-Based Access Control (ABAC), which evaluates policies against user, resource, and environment attributes for finer-grained decisions.
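As an illustration, an RBAC check reduces to two lookups: users map to roles, roles map to permissions. A minimal Go sketch, with hypothetical user IDs and permission names:

```go
package main

import "fmt"

// Permissions attach to roles; roles attach to users (the RBAC model).
var rolePermissions = map[string]map[string]bool{
	"admin":  {"read:reports": true, "write:settings": true, "delete:resource": true},
	"viewer": {"read:reports": true},
}

var userRoles = map[string][]string{
	"user_123": {"admin"},
	"user_456": {"viewer"},
}

// HasPermission answers the AuthZ question: can this user perform this action?
func HasPermission(userID, permission string) bool {
	for _, role := range userRoles[userID] {
		if rolePermissions[role][permission] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(HasPermission("user_456", "write:settings")) // false: viewer role lacks write
	fmt.Println(HasPermission("user_123", "write:settings")) // true: granted via admin role
}
```

ABAC would replace the static maps with a policy function evaluated over user, resource, and environment attributes.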


OAuth 2.0 Authorization Code Flow ​

OAuth 2.0 is an authorization framework (not an authentication protocol). It delegates access without sharing credentials. The most secure grant type for user-facing applications is the Authorization Code grant with PKCE.

Key security properties:

  • state parameter prevents CSRF on the redirect
  • code_challenge / code_verifier (PKCE) prevents authorization code interception
  • access_token is short-lived (15 min) to limit blast radius of leaks
  • refresh_token is long-lived but must be stored securely (httpOnly cookie, not localStorage)

JWT: Structure and Validation ​

A JSON Web Token is a self-contained credential: the resource server can verify it without calling the auth server on every request.

Structure: base64url(header).base64url(payload).base64url(signature)

```
eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9                                  ← Header
.
eyJzdWIiOiJ1c2VyXzEyMyIsInJvbGUiOiJhZG1pbiIsImV4cCI6MTcwMDAwMDAwMH0   ← Payload
.
[RSASSA-PKCS1-v1_5 signature]                                         ← Signature
```

Header (algorithm + type):

```json
{ "alg": "RS256", "typ": "JWT" }
```

Payload (claims; never put secrets here, since base64url is an encoding, not encryption):

```json
{
  "sub": "user_123",
  "role": "admin",
  "iat": 1700000000,
  "exp": 1700000900
}
```

Validation flow a resource server must execute:

  1. Verify the signature against the expected key, with the allowed algorithm pinned explicitly
  2. Check exp (token expired?) and nbf (token not yet valid?)
  3. Check iss matches the trusted issuer
  4. Check aud matches this service
  5. Only then trust sub, role, and the other claims

Refresh token rotation: When an access token expires, the client sends the refresh token to receive a new access token (and optionally a new refresh token). If a stolen refresh token is used, the legitimate client's next refresh attempt exposes the double use, triggering revocation of the entire token family.


Encryption ​

TLS 1.3 Handshake (Simplified) ​

Transport Layer Security (TLS) establishes an encrypted channel before any application data is transmitted. TLS 1.3 reduced the handshake from two round trips (TLS 1.2) to one.

The key_share uses Ephemeral Diffie-Hellman: the session key is never transmitted; it is derived independently on both sides. This provides Forward Secrecy: compromising the server's private key later does not decrypt past sessions.

At-Rest Encryption ​

| Layer | Mechanism | Who Manages Keys |
|---|---|---|
| Full disk | AES-256 (dm-crypt, FileVault) | OS / cloud provider |
| Database column | Application-level AES-256 | Application + KMS |
| Object storage | SSE-S3 / SSE-KMS | Cloud provider |
| Secrets | Vault, AWS Secrets Manager | Dedicated secrets service |

Key management is the hard part. The chain of keys encrypting other keys must terminate somewhere: a Hardware Security Module (HSM) or cloud-managed key material serves as the root of trust.
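This key hierarchy is commonly implemented as envelope encryption: a per-record data key (DEK) encrypts the data, and a root key (KEK) encrypts the data key. A Go sketch using AES-256-GCM from the standard library; in production the KEK would never leave the HSM/KMS, whereas here it sits in memory for illustration:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// sealAESGCM encrypts plaintext with AES-256-GCM, prepending the random nonce.
func sealAESGCM(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// openAESGCM splits off the nonce and decrypts (errors elided for brevity).
func openAESGCM(key, sealed []byte) ([]byte, error) {
	block, _ := aes.NewCipher(key)
	gcm, _ := cipher.NewGCM(block)
	n := gcm.NonceSize()
	return gcm.Open(nil, sealed[:n], sealed[n:], nil)
}

func main() {
	kek := make([]byte, 32) // root key: in production this lives in an HSM/KMS
	rand.Read(kek)
	dek := make([]byte, 32) // data encryption key, generated per record or file
	rand.Read(dek)

	ciphertext, _ := sealAESGCM(dek, []byte("sensitive record"))
	wrappedDEK, _ := sealAESGCM(kek, dek) // store wrappedDEK next to ciphertext

	unwrapped, _ := openAESGCM(kek, wrappedDEK)
	plaintext, _ := openAESGCM(unwrapped, ciphertext)
	fmt.Println(string(plaintext)) // prints "sensitive record"
}
```

Rotating the KEK only requires re-wrapping the small DEKs, not re-encrypting all data.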


Rate Limiting Algorithms ​

Rate limiting protects services from overload, abuse, and cost runaway. Four common algorithms each have different trade-offs:

1. Token Bucket ​

Tokens refill at a fixed rate. Burst traffic up to bucket capacity is allowed. Used by AWS API Gateway, Stripe.

2. Fixed Window Counter ​

Simple but has a boundary burst problem: 100 req/min allows 100 at 0:59 and 100 at 1:00, effectively 200 requests in 2 seconds.

3. Sliding Window Log ​

Maintains a timestamped log of each request. On each request, purge entries older than the window, count remaining entries.

  • Pros: Perfectly accurate
  • Cons: Memory grows with request volume, since every timestamp is stored

4. Sliding Window Counter ​

Approximation that combines fixed window simplicity with sliding accuracy:

estimated_count = prev_window_count × (1 − elapsed_fraction) + current_window_count

Algorithm Comparison:

| Algorithm | Accuracy | Memory | Burst Handling | Complexity |
|---|---|---|---|---|
| Token Bucket | High | O(1) | Allows bursts up to capacity | Low |
| Fixed Window | Low (boundary burst) | O(1) | Hard cutoff at boundary | Lowest |
| Sliding Window Log | Exact | O(requests) | Smooth enforcement | Medium |
| Sliding Window Counter | High (~0.003% error) | O(1) | Smooth, approximate | Low |

Distributed rate limiting requires a shared store (Redis with atomic INCR + EXPIRE). Per-node counters are simpler but allow an N×limit burst across N nodes.


DDoS Mitigation ​

A Distributed Denial of Service attack exhausts resources (bandwidth, CPU, connections) to make a service unavailable. Defense is layered:

| Strategy | Layer | How It Helps | Example |
|---|---|---|---|
| CDN absorption | L3/L4/L7 | Anycast distributes attack traffic across PoPs | Cloudflare absorbs 100 Tbps |
| Rate limiting | L7 | Caps requests per IP / ASN | Drop IPs > 1000 req/min |
| Web Application Firewall (WAF) rules | L7 | Block malformed HTTP, known attack signatures | AWS WAF, ModSecurity |
| IP reputation | L3/L4 | Block known botnet/scanner IPs | MaxMind, AbuseIPDB feeds |
| Anycast routing | L3 | Spread volumetric traffic across global PoPs | BGP anycast |
| SYN cookies | L4 | Defend TCP SYN flood without state | Linux kernel default |
| Connection limits | L4 | Cap concurrent connections per source | nginx limit_conn |

Real-World (Cloudflare DDoS Mitigation): Cloudflare operates 300+ PoPs using anycast. A volumetric attack targeting a single origin is distributed across the network, so each PoP absorbs only a fraction. Layer 7 attacks are filtered by their WAF and machine-learning-based bot detection. The 2023 largest-ever HTTP DDoS (71M req/sec) was mitigated automatically.


Input Validation ​

Never trust user input. Validate, sanitize, and parameterize at every boundary.

XSS (Cross-Site Scripting) ​

Attack: Injecting script into content rendered by other users' browsers.

Prevention checklist:

  • [ ] HTML-encode all user-supplied output (< → &lt;)
  • [ ] Use Content-Security-Policy header to restrict script sources
  • [ ] Use the httpOnly cookie flag so JavaScript cannot read session cookies
  • [ ] Avoid innerHTML; use textContent or framework templating

SQL Injection ​

Attack: Embedding SQL syntax in user input to manipulate queries.

Prevention checklist:

  • [ ] Use parameterized queries / prepared statements; never string-concatenate SQL
  • [ ] Use an ORM (Hibernate, SQLAlchemy, Prisma) that parameterizes by default
  • [ ] Apply least-privilege DB users (app user cannot DROP TABLE)
  • [ ] Validate input type and length before it reaches the database layer

CSRF (Cross-Site Request Forgery) ​

Attack: Tricking an authenticated user's browser into making unintended requests.

Prevention checklist:

  • [ ] Use CSRF tokens (unpredictable, tied to session, validated server-side)
  • [ ] Use SameSite=Strict or SameSite=Lax cookie attribute
  • [ ] Validate Origin / Referer headers on state-changing requests
  • [ ] Require re-authentication for high-impact actions (fund transfers, email change)

Reliability Patterns ​

Retry with Exponential Backoff and Jitter ​

Retrying failed requests immediately causes a thundering herd. Exponential backoff with jitter spreads retries over time:

wait = min(cap, base × 2^attempt) + random(0, base)

Do not retry on: 4xx errors (client mistakes), non-idempotent operations without idempotency keys.

Circuit Breaker ​

See Chapter 13 for the full circuit breaker pattern (Closed → Open → Half-Open state machine). In the context of security and reliability, a circuit breaker prevents a failing downstream dependency from cascading failures into your service, keeping the service degraded rather than down.

Bulkhead Pattern ​

Named after ship hull partitions that prevent one flooded compartment from sinking the entire ship.

Apply bulkheads at: connection pools per downstream service, thread pools per request type, CPU/memory limits per container (via cgroups/Kubernetes resource limits).

Graceful Degradation Strategies ​

| Scenario | Degraded Behavior | User Experience |
|---|---|---|
| Recommendation service down | Return empty recommendations | Page loads without "You may also like" |
| Search service slow | Return cached results | Stale results shown with banner |
| Payment processor timeout | Queue for async retry | "We're processing your payment" |
| Auth service flapping | Serve cached session | User remains logged in temporarily |
| Image service down | Show placeholder | Broken image replaced with fallback |

The key principle: identify which features are critical-path (cannot be degraded) vs. non-critical (can return defaults or be hidden) and design accordingly.


Disaster Recovery ​

RPO vs RTO ​

| Metric | Definition | Question It Answers | Typical Target |
|---|---|---|---|
| RPO (Recovery Point Objective) | Max acceptable data loss | "How much data can we lose?" | 0s (sync replication) to 24h |
| RTO (Recovery Time Objective) | Max acceptable downtime | "How long can we be down?" | Seconds (active-active) to hours |

Lower RPO and RTO require more expensive infrastructure. The relationship is roughly exponential: going from RTO=1h to RTO=1min may cost 10× more.

Backup Strategies ​

| Strategy | Description | RTO | RPO | Cost |
|---|---|---|---|---|
| Hot standby | Active replica in sync, traffic switchable in seconds | Seconds | Near-zero | Highest (~2× infrastructure) |
| Warm standby | Replica running, data lagging, needs promotion | Minutes | Minutes | Medium (~1.5×) |
| Cold standby | Backups stored, no running replica, restore on failure | Hours | Hours | Lowest |
| Pilot light | Minimal infrastructure pre-provisioned, scales on activation | 10–30 min | Minutes | Low-medium |

Multi-Region Failover ​

Failover checklist:

  • [ ] DNS TTL set low (30–60s) before planned failover; note that a permanently low TTL increases normal DNS query volume
  • [ ] Replica is caught up (check replication lag) before promoting
  • [ ] Application connection strings use DNS names, not hardcoded IPs
  • [ ] Run failover drills quarterly: untested DR is not DR

Real-World (Netflix Chaos Engineering): Netflix runs Chaos Monkey in production, randomly terminating EC2 instances. Chaos Kong kills entire AWS regions. The philosophy: if failures happen regularly during business hours when engineers are alert, you are forced to build genuine resilience rather than relying on MTTR.


OAuth 2.0 Authorization Flows ​

OAuth 2.0 defines several "grant types", each optimized for a different client context. The section above covered the Authorization Code + PKCE flow; this section maps all major flows and when to use each.

Flow Comparison ​

| Flow | Best For | Token Location | Security Level | Client Secret Required |
|---|---|---|---|---|
| Authorization Code + PKCE | Web apps, mobile, SPA | Server-side or httpOnly cookie | Highest | No (PKCE replaces it) |
| Authorization Code (no PKCE) | Traditional server-side web apps | Server-side session | High | Yes |
| Client Credentials | Machine-to-machine, background services | Server memory / secrets manager | High (no user) | Yes |
| Device Code | Smart TVs, CLI tools, limited-input devices | Server-side | Medium | No |
| Implicit (deprecated) | Legacy SPA | URL fragment (insecure) | Low, do not use | No |

Authorization Code + PKCE Flow (Web / Mobile) ​

This is the flow introduced earlier in the chapter. PKCE (Proof Key for Code Exchange) replaces the client secret for public clients that cannot store secrets securely (e.g., single-page apps, mobile apps).

PKCE mechanics:

  1. Client generates a random code_verifier (43–128 chars)
  2. Client computes code_challenge = BASE64URL(SHA256(code_verifier))
  3. Authorization request includes code_challenge and code_challenge_method=S256
  4. Token request includes code_verifier; the server re-hashes and compares

Even if an attacker intercepts the authorization_code, they cannot exchange it without the original code_verifier.
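The mechanics above fit in a few lines of Go using only the standard library; function names are illustrative:

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// NewVerifier generates a random code_verifier (43 chars after encoding,
// within the spec's 43-128 range).
func NewVerifier() string {
	b := make([]byte, 32) // 32 random bytes -> 43-char base64url string
	rand.Read(b)
	return base64.RawURLEncoding.EncodeToString(b)
}

// Challenge computes code_challenge = BASE64URL(SHA256(code_verifier)),
// i.e. the code_challenge_method=S256 transform.
func Challenge(verifier string) string {
	sum := sha256.Sum256([]byte(verifier))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

func main() {
	v := NewVerifier()
	c := Challenge(v)
	fmt.Println(len(v) == 43)      // true
	fmt.Println(Challenge(v) == c) // true: the server re-hashes and compares
}
```

The client sends `Challenge(v)` in the authorization request and `v` itself only in the final token request.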

Client Credentials Flow (Machine-to-Machine) ​

No user is involved. A backend service authenticates directly as itself.

Use case: Microservice A calling Microservice B, scheduled jobs calling APIs, CI/CD pipelines accessing deployment APIs.

Security note: client_secret must be stored in a secrets manager (AWS Secrets Manager, HashiCorp Vault), never in source code or in environment variables committed to git.

Device Code Flow (Input-Constrained Devices) ​

Use case: Logging into Netflix on a smart TV, GitHub CLI authentication, IoT device provisioning.


JWT Deep-Dive ​

The section above covers JWT structure and validation; this one adds claim semantics, a session vs token comparison, and security pitfalls.

Standard Claims Reference ​

| Claim | Full Name | Purpose | Example Value |
|---|---|---|---|
| iss | Issuer | Who created the token | "https://auth.example.com" |
| sub | Subject | Who the token represents (user ID) | "user_abc123" |
| aud | Audience | Which service(s) should accept this token | "api.example.com" |
| exp | Expiration | Unix timestamp after which token is invalid | 1700000900 |
| iat | Issued At | Unix timestamp when token was created | 1700000000 |
| nbf | Not Before | Token not valid before this timestamp | 1700000000 |
| jti | JWT ID | Unique token ID; enables revocation tracking | "abc-def-123" |

Custom claims (application-specific):

```json
{
  "sub": "user_123",
  "role": "admin",
  "org_id": "org_456",
  "permissions": ["read:reports", "write:settings"],
  "exp": 1700000900
}
```

JWT Algorithm Selection ​

| Algorithm | Type | Key Type | Use Case |
|---|---|---|---|
| HS256 | Symmetric HMAC | Single shared secret | Internal services (all share same secret) |
| RS256 | Asymmetric RSA | Private key signs, public key verifies | Cross-service (distribute public key only) |
| ES256 | Asymmetric ECDSA | Private key signs, public key verifies | Same as RS256 but smaller tokens |

Rule: Use RS256 or ES256 for any token that crosses a trust boundary. HS256 is fine for internal service-to-service when all parties share the secret.

Session-Based vs Token-Based Auth ​

| Property | Session (Cookie) | Token (JWT) |
|---|---|---|
| Server state | Session stored server-side (DB/Redis) | Stateless; no server state |
| Revocation | Instant: delete session from store | Hard: token valid until expiry |
| Scalability | Session store becomes hot dependency | Scales easily; no shared state |
| Token size | Cookie: ~100 bytes (session ID only) | JWT: ~500–2000 bytes in headers |
| Cross-domain | Cookies limited to same origin / CORS | Bearer token works cross-domain |
| Mobile/API clients | Awkward: cookie handling varies | Natural: Authorization header |
| Best for | Traditional web apps, instant logout critical | APIs, microservices, mobile apps |

JWT Security Pitfalls ​

| Pitfall | Risk | Mitigation |
|---|---|---|
| alg: none attack | Attacker removes signature, claims any identity | Always explicitly specify allowed algorithms in validation |
| Weak HS256 secret | Brute-forceable secret lets an attacker forge any token | Minimum 256-bit random secret; prefer RS256 |
| No aud validation | Token for Service A accepted by Service B | Always validate the aud claim matches the current service |
| Long expiry | Stolen token usable for hours/days | Access tokens: 5–15 min; use refresh tokens for long sessions |
| JWT in localStorage | Readable by any JavaScript (XSS risk) | Store in httpOnly cookie; if localStorage, accept XSS risk explicitly |
| No jti tracking | Cannot revoke individual tokens before expiry | Track jti in Redis for high-security actions; accept the cost |

Rate Limiting Algorithms β€” Full Comparison ​

The section above covered four algorithms; this one adds Leaky Bucket and provides deeper implementation guidance.

Token Bucket (Detailed) ​

Tokens accumulate up to a fixed capacity. Each request consumes one token; tokens refill at a configured rate per second.

Key properties:

  • Burst of up to capacity requests is immediately allowed
  • Long-term rate enforced by refill speed
  • Implementation: tokens = min(capacity, last_tokens + (now − last_refill) × rate); no timer needed, calculate lazily on each request

Leaky Bucket ​

Requests enter a fixed-size queue. A worker processes (drains) the queue at a constant rate. If the queue is full, the request is dropped.

Key difference from Token Bucket: Leaky Bucket produces a smooth, constant output rate regardless of input burst pattern. Token Bucket allows bursts to pass through immediately.

Fixed Window Counter ​

Window: [0s–60s] counter=0 → increments to 100 → resets at 60s → [60s–120s] counter=0

Boundary burst problem:

```
[0:59] 100 requests → allowed (window 1, counter=100)
[1:00] 100 requests → allowed (window 2 starts, counter=0 → 100)
Result: 200 requests in 2 seconds despite the "100/min" limit
```

Sliding Window Log ​

Stores a timestamp for every request in the current window. On each request:

  1. Remove entries older than window_size
  2. Count remaining entries
  3. If count < limit, allow and add the new timestamp; otherwise reject

Redis sorted-set implementation:

```
ZADD key timestamp "requestID"
ZREMRANGEBYSCORE key 0 (now - window_ms)
count = ZCARD key
```

Exact accuracy, but memory grows with request volume: O(requests_per_window) per user.

Sliding Window Counter (Hybrid) ​

Estimates the count using weighted average between current and previous window:

estimated = prev_count × (1 − elapsed/window_size) + curr_count

Example: Window=60s, prev_count=80, curr_count=10, elapsed=15s into current window:

estimated = 80 × (1 − 15/60) + 10 = 80 × 0.75 + 10 = 60 + 10 = 70

Memory: O(1) per user; only two counters are stored per window.
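The estimator is one line of Go. A minimal sketch (a full limiter would also rotate the two counters at each window boundary):

```go
package main

import "fmt"

// Estimate implements the hybrid sliding-window count:
// prev_count * (1 - elapsed/window) + curr_count
func Estimate(prevCount, currCount, elapsed, window float64) float64 {
	return prevCount*(1-elapsed/window) + currCount
}

func main() {
	// Window=60s, prev_count=80, curr_count=10, 15s into the current window.
	fmt.Println(Estimate(80, 10, 15, 60)) // 70
}
```

A request is allowed when the estimate stays below the limit; the approximation assumes the previous window's requests were evenly spread.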

Algorithm Comparison ​

| Algorithm | Burst Handling | Memory | Accuracy | Smoothness | Complexity | Best For |
|---|---|---|---|---|---|---|
| Token Bucket | Allows bursts up to capacity | O(1) | High | Bursty output | Low | APIs allowing short bursts (Stripe, AWS) |
| Leaky Bucket | Absorbs bursts, constant output | O(queue) | High | Smooth output | Low-Medium | Protecting downstream at constant rate |
| Fixed Window | Hard cutoff (boundary burst risk) | O(1) | Low | Not smooth | Lowest | Simple internal quotas |
| Sliding Window Log | Perfectly smooth | O(requests) | Exact | Smooth | Medium | Low-volume, exact enforcement |
| Sliding Window Counter | Smooth, approximate | O(1) | ~99.997% | Smooth | Low | Production APIs (Cloudflare, Kong) |

Distributed Rate Limiting with Redis ​

Single-node rate limiting is insufficient for multi-instance services. Use Redis atomic operations:

```lua
-- Token Bucket in Redis (Lua script, run atomically via EVAL)
-- KEYS[1] = bucket key; ARGV[1] = now (seconds); ARGV[2] = capacity; ARGV[3] = refill rate
local capacity = tonumber(ARGV[2])
local rate = tonumber(ARGV[3])
local now = tonumber(ARGV[1])
local tokens = tonumber(redis.call('GET', KEYS[1]) or capacity)
local last = tonumber(redis.call('GET', KEYS[1]..':ts') or now)
local refill = math.min(capacity, tokens + (now - last) * rate)
if refill >= 1 then
    redis.call('SET', KEYS[1], refill - 1)
    redis.call('SET', KEYS[1]..':ts', now)
    return 1  -- allowed
else
    return 0  -- rejected (state untouched; recomputed from ts next call)
end
```

Per-node vs centralized trade-off:

| Approach | Accuracy | Latency | Failure Mode |
|---|---|---|---|
| Per-node counter | Allows N×limit burst (N = node count) | Zero (local) | Node failure loses counter |
| Redis centralized | Accurate | +1–2ms per request | Redis outage = no rate limiting |
| Redis + local fallback | Approximate (slightly over) | +1–2ms normally, 0ms on Redis failure | Graceful degradation |

Cross-references: rate limiting at the API gateway layer is covered in Ch13 (Microservices); load balancer traffic shaping in Ch06 (Load Balancing).


Trade-offs & Comparisons ​

| Approach | Benefit | Cost | When to Choose |
|---|---|---|---|
| Sync replication (RPO=0) | No data loss on failover | Higher write latency | Financial transactions |
| Async replication (low cost) | Low write latency | Potential data loss | Analytics, content delivery |
| Active-active multi-region | RTO < 5s | Conflict resolution complexity | Global, revenue-critical |
| JWT (stateless tokens) | No server-side session store | Cannot revoke without token rotation | Scalable APIs |
| Session cookies (stateful) | Instant revocation | Session store becomes critical dependency | Traditional web apps |
| Sliding window rate limit | Smooth, accurate | Slightly more complex than fixed window | Production APIs |

Key Takeaway: Security and reliability are not features to bolt on; they emerge from deliberate design choices: short-lived tokens, layered input validation, isolated failure domains via bulkheads, and tested recovery procedures. The most dangerous assumption in system design is that your dependencies will stay up.


Case Study: Shopify's Payment Resilience ​

Shopify processes hundreds of billions of dollars in Gross Merchandise Volume annually. For a merchant, a failed or duplicated payment is existential: it means lost revenue or angry customers demanding refunds. This case study maps the reliability patterns in this chapter to Shopify's actual payment architecture.

Context and Challenges ​

| Challenge | Consequence if Ignored | Scale |
|---|---|---|
| Payment gateway failures | Lost sales during checkout | Shopify integrates 100+ payment providers |
| Double-charge prevention | Duplicate charges, chargebacks, merchant liability | Any retry without idempotency risks a duplicate charge |
| Partial failures | Payment debited but order not created | Distributed transaction across services |
| Reconciliation drift | Internal ledger disagrees with Stripe/Braintree | Discovered only at end-of-month audit |

Pattern 1: Idempotency Keys ​

Every payment request is tagged with a globally unique idempotency key generated by the client before the first attempt. If the network fails mid-request, the client retries with the same key; the payment provider de-duplicates on the key and returns the original result without re-processing the charge.

Key design rules for idempotency keys:

  • Generated client-side (not server-side) so the key survives server crashes
  • Stored with a TTL (e.g., 24h): long enough to cover retries, short enough to reclaim memory
  • Associated with the full response, not just a success flag, so clients can recover partial state
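The rules above can be sketched as an in-memory de-duplicating store in Go (production systems would use Redis with a TTL, as noted; `Store` and `Process` are illustrative names):

```go
package main

import (
	"fmt"
	"sync"
)

// Store caches the full response per idempotency key, so a retry with the
// same key returns the original result without re-running the charge.
type Store struct {
	mu        sync.Mutex
	responses map[string]string // in production: Redis with e.g. a 24h TTL
}

// Process runs charge at most once per key and replays the cached response
// on every retry.
func (s *Store) Process(key string, charge func() string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if resp, ok := s.responses[key]; ok {
		return resp // de-duplicated: charge is not executed again
	}
	resp := charge()
	s.responses[key] = resp
	return resp
}

func main() {
	s := &Store{responses: map[string]string{}}
	charges := 0
	pay := func() string { charges++; return "charge_ok_1" }
	fmt.Println(s.Process("idem-abc", pay)) // charge executes
	fmt.Println(s.Process("idem-abc", pay)) // retry: cached response, no double charge
	fmt.Println(charges)                    // 1
}
```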

Pattern 2: Circuit Breakers on Payment Providers ​

Shopify integrates multiple payment providers (Stripe, Braintree, Adyen, etc.). If one provider degrades, a circuit breaker isolates that provider and routes new requests to alternatives, maintaining checkout availability even when a provider has an incident.

The state machine is identical to the circuit breaker pattern in Chapter 13. The Shopify-specific addition: when the circuit opens, the load balancer weight for that provider drops to 0 rather than returning errors to users.

Pattern 3: Async Payment Processing ​

Not all payment operations are synchronous. Subscription renewals, delayed captures, and refunds are processed asynchronously through a queue. This isolates the checkout path from batch operations and provides guaranteed delivery even when downstream services are slow.

Architecture (see Chapter 11, Message Queues, for queue patterns):

  • Checkout publishes a payment.capture_requested event to a durable queue
  • A payment worker consumes the event, calls the provider, and emits payment.succeeded or payment.failed
  • The order service subscribes to payment.succeeded to fulfill the order
  • A dead-letter queue captures messages that fail 3 retries, for manual inspection

Why async for subscriptions specifically: Shopify processes millions of subscription renewals in a daily batch window. Processing them synchronously would require holding millions of open connections to payment providers. The queue decouples ingestion rate from processing rate, smoothing load across the window.
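A toy Go sketch of that decoupling, using channels as a stand-in for the durable queue; the event names follow the list above, while `processAll` and the worker-pool size are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// processAll drains a batch of payment events with a fixed worker pool,
// decoupling ingestion rate from processing rate. Events whose retries
// are exhausted are returned as the dead-letter set for manual inspection.
func processAll(events []string, workers int) []string {
	in := make(chan string, len(events))
	dead := make(chan string, len(events))
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ev := range in {
				if ev == "payment.capture_requested:bad" { // stand-in for exhausted retries
					dead <- ev
					continue
				}
				// Success path: call the provider, then emit payment.succeeded.
			}
		}()
	}
	for _, ev := range events {
		in <- ev
	}
	close(in)
	wg.Wait()
	close(dead)
	var out []string
	for ev := range dead {
		out = append(out, ev)
	}
	return out
}

func main() {
	batch := []string{
		"payment.capture_requested:1",
		"payment.capture_requested:bad",
		"payment.capture_requested:2",
	}
	fmt.Println(len(processAll(batch, 4))) // 1
}
```

A real pipeline would use a durable broker so events survive process crashes; the channel version only illustrates the rate decoupling.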

Pattern 4: Reconciliation Jobs ​

Even with idempotency keys and circuit breakers, state mismatches occur: network timeouts after a provider charges but before Shopify receives confirmation, provider-side corrections, partial refunds. Reconciliation jobs run on a schedule (hourly for high-value merchants, daily for standard) to detect and fix mismatches.

Reconciliation is the safety net that catches everything the online path missed. See Chapter 14 (Event-Driven Architecture) for the event sourcing approach that makes reconciliation audits tractable: each state transition is a logged event, so the full history is reconstructable.

Pattern Comparison ​

| Pattern | Problem Solved | Implementation | Trade-off |
|---|---|---|---|
| Idempotency keys | Duplicate charges on retry | Client-generated UUID + Redis lookup | Key storage cost; TTL must outlast retry window |
| Circuit breaker | Gateway outage kills checkout | Per-provider error rate threshold driving open/half-open/closed transitions | False opens under transient spikes; needs careful tuning |
| Async queue | Checkout blocked by slow provider | Durable queue + worker pool | Eventual consistency; UX must handle "payment processing" state |
| Reconciliation | Silent mismatches between systems | Periodic batch compare of internal vs external ledger | Latency: mismatches detected hours later, not instantly |

Key Takeaway ​

Financial systems require defense-in-depth: no single pattern prevents all failure modes. Idempotency prevents duplicates but not gateway outages. Circuit breakers prevent cascading failures but not data mismatches. Async queues decouple services but introduce eventual consistency. Reconciliation catches everything the online path missed but only after the fact. The complete system requires all four layers.


Code Example: Token Bucket Rate Limiter (Go) ​

```go
package main

import (
	"sync"
	"time"
)

type TokenBucket struct {
	mu         sync.Mutex
	tokens     float64
	maxTokens  float64
	refillRate float64 // tokens per second
	lastRefill time.Time
}

func (tb *TokenBucket) Allow() bool {
	tb.mu.Lock()
	defer tb.mu.Unlock()

	// Lazy refill: compute tokens earned since the last call (no background timer).
	now := time.Now()
	elapsed := now.Sub(tb.lastRefill).Seconds()
	tb.tokens = min(tb.maxTokens, tb.tokens+elapsed*tb.refillRate) // builtin min (Go 1.21+)
	tb.lastRefill = now

	if tb.tokens >= 1 {
		tb.tokens--
		return true
	}
	return false
}
```

Code Example: Circuit Breaker (Go) ​

```go
package main

import (
	"errors"
	"sync"
	"time"
)

type CircuitBreaker struct {
	mu          sync.Mutex
	state       string // "closed", "open", "half-open"
	failures    int
	threshold   int
	lastFailure time.Time
	cooldown    time.Duration
}

func (cb *CircuitBreaker) Execute(fn func() error) error {
	cb.mu.Lock()
	if cb.state == "open" {
		if time.Since(cb.lastFailure) > cb.cooldown {
			cb.state = "half-open" // cooldown elapsed: let a trial request through
		} else {
			cb.mu.Unlock()
			return errors.New("circuit breaker is open")
		}
	}
	cb.mu.Unlock()

	err := fn() // called outside the lock so slow calls don't block other goroutines

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		cb.lastFailure = time.Now()
		if cb.failures >= cb.threshold {
			cb.state = "open"
		}
		return err
	}
	cb.failures = 0
	cb.state = "closed"
	return nil
}
```
Related chapters:

| Chapter | Relevance |
|---|---|
| Ch05 (DNS) | DNSSEC and DNS-layer DDoS mitigation |
| Ch06 (Load Balancing) | Rate limiting and WAF at the LB/API gateway layer |
| Ch13 (Microservices) | Auth (JWT/OAuth2) in API gateway security model |
| Ch17 (Monitoring & Observability) | Security event detection via observability pipeline |

Practice Questions ​

Beginner ​

  1. JWT Validation: A user complains they were logged out even though their session "should still be valid." Walk through every JWT validation step that could cause a rejection: which claims (exp, iss, aud, nbf) are checked, and what failure does each indicate?

    Hint: Check `exp` (token expired), `nbf` (token not yet valid, a clock-skew issue), `iss` (wrong issuer from a misconfigured auth server), and `aud` (wrong audience: token issued for a different service); also verify the signature with the correct public key.

Intermediate ​

  1. DDoS Mitigation: Your API is receiving 500,000 requests/second from 50,000 different IP addresses. Per-IP rate limiting is ineffective. What additional mitigation layers would you apply, in what order, and at which network/application layer does each operate?

    Hint: Layer in order: CDN-level anycast absorption (Cloudflare/Akamai), BGP-level traffic scrubbing, challenge-response (CAPTCHA) for suspected bots, then application-level behavioral analysis (request pattern anomalies).
  2. Bulkhead + Circuit Breaker: Your payment service calls fraud detection, currency conversion, and ledger sequentially. If fraud detection becomes slow (P99 = 8s), all payment requests time out. Design a reliability architecture using bulkheads (separate thread pools) and circuit breakers for each dependency to isolate failures.

    Hint: Give each downstream service its own connection pool (bulkhead) so a slow fraud detection service exhausts only its pool, not the shared thread pool; add a circuit breaker per service with a 2s timeout threshold.
  3. RPO vs Cost Decision: A startup is choosing between RPO=1h ($2K/month, cold standby) and RPO=1min ($10K/month, warm standby with continuous replication). What business questions do you ask to help them decide, and how do you translate the answer into a cost-of-downtime calculation?

    Hint: Ask what the revenue per minute is during peak hours, and what a data-loss incident costs (regulatory fines, customer churn); if one hour of lost transactions exceeds $8K, the warm standby pays for itself.

Advanced ​

  1. Rate Limiting Algorithms: Compare token bucket and sliding window counter algorithms for rate limiting across: burst handling accuracy, memory usage per user, implementation complexity, and behavior at window boundaries. Which algorithm would you choose for a payment API (strict accuracy required) vs a social media feed API (burst-tolerant)?

    Hint: Token bucket allows smooth bursts (good for feeds); sliding window log is most accurate but uses O(requests) memory; fixed window counter has a boundary doubling flaw; the sliding window counter approximation balances accuracy and memory. Choose based on whether bursts are acceptable.

References & Further Reading ​

  • "Release It!" β€” Michael Nygard (circuit breaker patterns)
  • OWASP Top 10
  • OAuth 2.0 RFC 6749
  • "The SRE Book" β€” Google
  • Cloudflare rate limiting blog posts

