
Chapter 16: Security & Reliability ​


Security is not a feature you add at the end; it is a property you design in from the start. The same is true of reliability: systems that survive failure are built to expect it.


Authentication vs Authorization ​

These two concepts are consistently confused in interviews and in code. They are distinct concerns with different scopes:

| Concept | Question Answered | Example | Enforcement Point |
|---|---|---|---|
| Authentication (AuthN) | Who are you? | Verifying username + password | Login endpoint, API gateway |
| Authorization (AuthZ) | What can you do? | Can this user delete this resource? | Business logic, middleware |

Authentication always precedes authorization. A system cannot determine what an identity is allowed to do before confirming that identity. However, authorization decisions can change without re-authenticating: a user's role may be revoked while their session remains active, which is why token expiry and revocation matter.

Common authorization models include Role-Based Access Control (RBAC), which assigns permissions to roles and roles to users, and Attribute-Based Access Control (ABAC), which evaluates policies against user, resource, and environment attributes for finer-grained decisions.
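As an illustration, an RBAC check reduces to two lookups: users map to roles, roles map to permissions. A minimal Go sketch, with hypothetical user IDs and permission names:

```go
package main

import "fmt"

// Permissions attach to roles; roles attach to users (the RBAC model).
var rolePermissions = map[string]map[string]bool{
	"admin":  {"read:reports": true, "write:settings": true, "delete:resource": true},
	"viewer": {"read:reports": true},
}

var userRoles = map[string][]string{
	"user_123": {"admin"},
	"user_456": {"viewer"},
}

// HasPermission answers the AuthZ question: can this user perform this action?
func HasPermission(userID, permission string) bool {
	for _, role := range userRoles[userID] {
		if rolePermissions[role][permission] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(HasPermission("user_456", "write:settings")) // false: viewer role lacks write
	fmt.Println(HasPermission("user_123", "write:settings")) // true: granted via admin role
}
```

ABAC would replace the static maps with a policy function evaluated over user, resource, and environment attributes.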


OAuth 2.0 Authorization Code Flow ​

OAuth 2.0 is an authorization framework (not an authentication protocol). It delegates access without sharing credentials. The most secure grant type for user-facing applications is the Authorization Code grant with PKCE.

Key security properties:

  • state parameter prevents CSRF on the redirect
  • code_challenge / code_verifier (PKCE) prevents authorization code interception
  • access_token is short-lived (15 min) to limit blast radius of leaks
  • refresh_token is long-lived but must be stored securely (httpOnly cookie, not localStorage)

JWT: Structure and Validation ​

A JSON Web Token is a self-contained credential: the resource server can verify it without calling the auth server on every request.

Structure: base64url(header).base64url(payload).base64url(signature)

```
eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9                                  ← Header
.
eyJzdWIiOiJ1c2VyXzEyMyIsInJvbGUiOiJhZG1pbiIsImV4cCI6MTcwMDAwMDAwMH0   ← Payload
.
[RSASSA-PKCS1-v1_5 signature]                                         ← Signature
```

Header (algorithm + type):

```json
{ "alg": "RS256", "typ": "JWT" }
```

Payload (claims; never put secrets here, since base64url is an encoding, not encryption):

```json
{
  "sub": "user_123",
  "role": "admin",
  "iat": 1700000000,
  "exp": 1700000900
}
```

Validation flow a resource server must execute:

  1. Verify the signature against the expected key, with the allowed algorithm pinned explicitly
  2. Check exp (token expired?) and nbf (token not yet valid?)
  3. Check iss matches the trusted issuer
  4. Check aud matches this service
  5. Only then trust sub, role, and the other claims

Refresh token rotation: When an access token expires, the client sends the refresh token to receive a new access token (and optionally a new refresh token). If a stolen refresh token is used, the legitimate client's next refresh attempt exposes the double use, triggering revocation of the entire token family.


Encryption ​

TLS 1.3 Handshake (Simplified) ​

Transport Layer Security (TLS) establishes an encrypted channel before any application data is transmitted. TLS 1.3 reduced the handshake from two round trips (TLS 1.2) to one.

The key_share uses Ephemeral Diffie-Hellman: the session key is never transmitted; it is derived independently on both sides. This provides Forward Secrecy: compromising the server's private key later does not decrypt past sessions.

At-Rest Encryption ​

| Layer | Mechanism | Who Manages Keys |
|---|---|---|
| Full disk | AES-256 (dm-crypt, FileVault) | OS / cloud provider |
| Database column | Application-level AES-256 | Application + KMS |
| Object storage | SSE-S3 / SSE-KMS | Cloud provider |
| Secrets | Vault, AWS Secrets Manager | Dedicated secrets service |

Key management is the hard part. The chain of keys encrypting other keys must terminate somewhere: a Hardware Security Module (HSM) or cloud-managed key material serves as the root of trust.
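This key hierarchy is commonly implemented as envelope encryption: a per-record data key (DEK) encrypts the data, and a root key (KEK) encrypts the data key. A Go sketch using AES-256-GCM from the standard library; in production the KEK would never leave the HSM/KMS, whereas here it sits in memory for illustration:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// sealAESGCM encrypts plaintext with AES-256-GCM, prepending the random nonce.
func sealAESGCM(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// openAESGCM splits off the nonce and decrypts (errors elided for brevity).
func openAESGCM(key, sealed []byte) ([]byte, error) {
	block, _ := aes.NewCipher(key)
	gcm, _ := cipher.NewGCM(block)
	n := gcm.NonceSize()
	return gcm.Open(nil, sealed[:n], sealed[n:], nil)
}

func main() {
	kek := make([]byte, 32) // root key: in production this lives in an HSM/KMS
	rand.Read(kek)
	dek := make([]byte, 32) // data encryption key, generated per record or file
	rand.Read(dek)

	ciphertext, _ := sealAESGCM(dek, []byte("sensitive record"))
	wrappedDEK, _ := sealAESGCM(kek, dek) // store wrappedDEK next to ciphertext

	unwrapped, _ := openAESGCM(kek, wrappedDEK)
	plaintext, _ := openAESGCM(unwrapped, ciphertext)
	fmt.Println(string(plaintext)) // prints "sensitive record"
}
```

Rotating the KEK only requires re-wrapping the small DEKs, not re-encrypting all data.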


Rate Limiting Algorithms ​

Rate limiting protects services from overload, abuse, and cost runaway. Four common algorithms each have different trade-offs:

1. Token Bucket ​

Tokens refill at a fixed rate. Burst traffic up to bucket capacity is allowed. Used by AWS API Gateway, Stripe.

2. Fixed Window Counter ​

Simple but has a boundary burst problem: 100 req/min allows 100 at 0:59 and 100 at 1:00, effectively 200 requests in 2 seconds.

3. Sliding Window Log ​

Maintains a timestamped log of each request. On each request, purge entries older than the window, count remaining entries.

  • Pros: Perfectly accurate
  • Cons: Memory grows with request volume, since every timestamp is stored

4. Sliding Window Counter ​

Approximation that combines fixed window simplicity with sliding accuracy:

estimated_count = prev_window_count × (1 − elapsed_fraction) + current_window_count

Algorithm Comparison:

| Algorithm | Accuracy | Memory | Burst Handling | Complexity |
|---|---|---|---|---|
| Token Bucket | High | O(1) | Allows bursts up to capacity | Low |
| Fixed Window | Low (boundary burst) | O(1) | Hard cutoff at boundary | Lowest |
| Sliding Window Log | Exact | O(requests) | Smooth enforcement | Medium |
| Sliding Window Counter | High (~0.003% error) | O(1) | Smooth, approximate | Low |

Distributed rate limiting requires a shared store (Redis with atomic INCR + EXPIRE). Per-node counters are simpler but allow an N×limit burst across N nodes.


DDoS Mitigation ​

A Distributed Denial of Service attack exhausts resources (bandwidth, CPU, connections) to make a service unavailable. Defense is layered:

| Strategy | Layer | How It Helps | Example |
|---|---|---|---|
| CDN absorption | L3/L4/L7 | Anycast distributes attack traffic across PoPs | Cloudflare absorbs 100 Tbps |
| Rate limiting | L7 | Caps requests per IP / ASN | Drop IPs > 1000 req/min |
| Web Application Firewall (WAF) rules | L7 | Block malformed HTTP, known attack signatures | AWS WAF, ModSecurity |
| IP reputation | L3/L4 | Block known botnet/scanner IPs | MaxMind, AbuseIPDB feeds |
| Anycast routing | L3 | Spread volumetric traffic across global PoPs | BGP anycast |
| SYN cookies | L4 | Defend TCP SYN flood without state | Linux kernel default |
| Connection limits | L4 | Cap concurrent connections per source | nginx limit_conn |

Real-World (Cloudflare DDoS Mitigation): Cloudflare operates 300+ PoPs using anycast. A volumetric attack targeting a single origin is distributed across the network, so each PoP absorbs only a fraction. Layer 7 attacks are filtered by their WAF and machine-learning-based bot detection. The 2023 largest-ever HTTP DDoS (71M req/sec) was mitigated automatically.


Input Validation ​

Never trust user input. Validate, sanitize, and parameterize at every boundary.

XSS (Cross-Site Scripting) ​

Attack: Injecting script into content rendered by other users' browsers.

Prevention checklist:

  • [ ] HTML-encode all user-supplied output (< → &lt;)
  • [ ] Use Content-Security-Policy header to restrict script sources
  • [ ] Use the httpOnly cookie flag so JavaScript cannot read session cookies
  • [ ] Avoid innerHTML; use textContent or framework templating

SQL Injection ​

Attack: Embedding SQL syntax in user input to manipulate queries.

Prevention checklist:

  • [ ] Use parameterized queries / prepared statements; never string-concatenate SQL
  • [ ] Use an ORM (Hibernate, SQLAlchemy, Prisma) that parameterizes by default
  • [ ] Apply least-privilege DB users (app user cannot DROP TABLE)
  • [ ] Validate input type and length before it reaches the database layer

CSRF (Cross-Site Request Forgery) ​

Attack: Tricking an authenticated user's browser into making unintended requests.

Prevention checklist:

  • [ ] Use CSRF tokens (unpredictable, tied to session, validated server-side)
  • [ ] Use SameSite=Strict or SameSite=Lax cookie attribute
  • [ ] Validate Origin / Referer headers on state-changing requests
  • [ ] Require re-authentication for high-impact actions (fund transfers, email change)

Reliability Patterns ​

Retry with Exponential Backoff and Jitter ​

Retrying failed requests immediately causes a thundering herd. Exponential backoff with jitter spreads retries over time:

wait = min(cap, base × 2^attempt) + random(0, base)

Do not retry on: 4xx errors (client mistakes), non-idempotent operations without idempotency keys.

Circuit Breaker ​

See Chapter 13 for the full circuit breaker pattern (Closed → Open → Half-Open state machine). In the context of security and reliability, a circuit breaker prevents a failing downstream dependency from cascading failures into your service, keeping the service degraded rather than down.

Bulkhead Pattern ​

Named after ship hull partitions that prevent one flooded compartment from sinking the entire ship.

Apply bulkheads at: connection pools per downstream service, thread pools per request type, CPU/memory limits per container (via cgroups/Kubernetes resource limits).

Graceful Degradation Strategies ​

| Scenario | Degraded Behavior | User Experience |
|---|---|---|
| Recommendation service down | Return empty recommendations | Page loads without "You may also like" |
| Search service slow | Return cached results | Stale results shown with banner |
| Payment processor timeout | Queue for async retry | "We're processing your payment" |
| Auth service flapping | Serve cached session | User remains logged in temporarily |
| Image service down | Show placeholder | Broken image replaced with fallback |

The key principle: identify which features are critical-path (cannot be degraded) vs. non-critical (can return defaults or be hidden) and design accordingly.


Disaster Recovery ​

RPO vs RTO ​

| Metric | Definition | Question It Answers | Typical Target |
|---|---|---|---|
| RPO (Recovery Point Objective) | Max acceptable data loss | "How much data can we lose?" | 0s (sync replication) to 24h |
| RTO (Recovery Time Objective) | Max acceptable downtime | "How long can we be down?" | Seconds (active-active) to hours |

Lower RPO and RTO require more expensive infrastructure. The relationship is roughly exponential: going from RTO=1h to RTO=1min may cost 10× more.

Backup Strategies ​

| Strategy | Description | RTO | RPO | Cost |
|---|---|---|---|---|
| Hot standby | Active replica in sync, traffic switchable in seconds | Seconds | Near-zero | Highest (~2× infrastructure) |
| Warm standby | Replica running, data lagging, needs promotion | Minutes | Minutes | Medium (~1.5×) |
| Cold standby | Backups stored, no running replica, restore on failure | Hours | Hours | Lowest |
| Pilot light | Minimal infrastructure pre-provisioned, scales on activation | 10–30 min | Minutes | Low-medium |

Multi-Region Failover ​

Failover checklist:

  • [ ] DNS TTL set low (30–60s) before planned failover; note that a permanently low TTL increases normal DNS query volume
  • [ ] Replica is caught up (check replication lag) before promoting
  • [ ] Application connection strings use DNS names, not hardcoded IPs
  • [ ] Run failover drills quarterly: untested DR is not DR

Real-World (Netflix Chaos Engineering): Netflix runs Chaos Monkey in production, randomly terminating EC2 instances. Chaos Kong kills entire AWS regions. The philosophy: if failures happen regularly during business hours when engineers are alert, you are forced to build genuine resilience rather than relying on MTTR.


OAuth 2.0 Authorization Flows ​

OAuth 2.0 defines several "grant types", each optimized for a different client context. The section above covered the Authorization Code + PKCE flow; this section maps all major flows and when to use each.

Flow Comparison ​

| Flow | Best For | Token Location | Security Level | Client Secret Required |
|---|---|---|---|---|
| Authorization Code + PKCE | Web apps, mobile, SPA | Server-side or httpOnly cookie | Highest | No (PKCE replaces it) |
| Authorization Code (no PKCE) | Traditional server-side web apps | Server-side session | High | Yes |
| Client Credentials | Machine-to-machine, background services | Server memory / secrets manager | High (no user) | Yes |
| Device Code | Smart TVs, CLI tools, limited-input devices | Server-side | Medium | No |
| Implicit (deprecated) | Legacy SPA | URL fragment (insecure) | Low, do not use | No |

Authorization Code + PKCE Flow (Web / Mobile) ​

This is the flow introduced earlier in the chapter. PKCE (Proof Key for Code Exchange) replaces the client secret for public clients that cannot store secrets securely (e.g., single-page apps, mobile apps).

PKCE mechanics:

  1. Client generates a random code_verifier (43–128 chars)
  2. Client computes code_challenge = BASE64URL(SHA256(code_verifier))
  3. Authorization request includes code_challenge and code_challenge_method=S256
  4. Token request includes code_verifier; the server re-hashes and compares

Even if an attacker intercepts the authorization_code, they cannot exchange it without the original code_verifier.
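The mechanics above fit in a few lines of Go using only the standard library; function names are illustrative:

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// NewVerifier generates a random code_verifier (43 chars after encoding,
// within the spec's 43-128 range).
func NewVerifier() string {
	b := make([]byte, 32) // 32 random bytes -> 43-char base64url string
	rand.Read(b)
	return base64.RawURLEncoding.EncodeToString(b)
}

// Challenge computes code_challenge = BASE64URL(SHA256(code_verifier)),
// i.e. the code_challenge_method=S256 transform.
func Challenge(verifier string) string {
	sum := sha256.Sum256([]byte(verifier))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

func main() {
	v := NewVerifier()
	c := Challenge(v)
	fmt.Println(len(v) == 43)      // true
	fmt.Println(Challenge(v) == c) // true: the server re-hashes and compares
}
```

The client sends `Challenge(v)` in the authorization request and `v` itself only in the final token request.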

Client Credentials Flow (Machine-to-Machine) ​

No user is involved. A backend service authenticates directly as itself.

Use case: Microservice A calling Microservice B, scheduled jobs calling APIs, CI/CD pipelines accessing deployment APIs.

Security note: client_secret must be stored in a secrets manager (AWS Secrets Manager, HashiCorp Vault), never in source code or in environment variables committed to git.

Device Code Flow (Input-Constrained Devices) ​

Use case: Logging into Netflix on a smart TV, GitHub CLI authentication, IoT device provisioning.


JWT Deep-Dive ​

The section above covers JWT structure and validation; this one adds claim semantics, a session vs token comparison, and security pitfalls.

Standard Claims Reference ​

| Claim | Full Name | Purpose | Example Value |
|---|---|---|---|
| iss | Issuer | Who created the token | "https://auth.example.com" |
| sub | Subject | Who the token represents (user ID) | "user_abc123" |
| aud | Audience | Which service(s) should accept this token | "api.example.com" |
| exp | Expiration | Unix timestamp after which token is invalid | 1700000900 |
| iat | Issued At | Unix timestamp when token was created | 1700000000 |
| nbf | Not Before | Token not valid before this timestamp | 1700000000 |
| jti | JWT ID | Unique token ID; enables revocation tracking | "abc-def-123" |

Custom claims (application-specific):

```json
{
  "sub": "user_123",
  "role": "admin",
  "org_id": "org_456",
  "permissions": ["read:reports", "write:settings"],
  "exp": 1700000900
}
```

JWT Algorithm Selection ​

| Algorithm | Type | Key Type | Use Case |
|---|---|---|---|
| HS256 | Symmetric HMAC | Single shared secret | Internal services (all share same secret) |
| RS256 | Asymmetric RSA | Private key signs, public key verifies | Cross-service (distribute public key only) |
| ES256 | Asymmetric ECDSA | Private key signs, public key verifies | Same as RS256 but smaller tokens |

Rule: Use RS256 or ES256 for any token that crosses a trust boundary. HS256 is fine for internal service-to-service when all parties share the secret.

Session-Based vs Token-Based Auth ​

| Property | Session (Cookie) | Token (JWT) |
|---|---|---|
| Server state | Session stored server-side (DB/Redis) | Stateless; no server state |
| Revocation | Instant: delete session from store | Hard: token valid until expiry |
| Scalability | Session store becomes hot dependency | Scales easily; no shared state |
| Token size | Cookie: ~100 bytes (session ID only) | JWT: ~500–2000 bytes in headers |
| Cross-domain | Cookies limited to same origin / CORS | Bearer token works cross-domain |
| Mobile/API clients | Awkward: cookie handling varies | Natural: Authorization header |
| Best for | Traditional web apps, instant logout critical | APIs, microservices, mobile apps |

JWT Security Pitfalls ​

| Pitfall | Risk | Mitigation |
|---|---|---|
| alg: none attack | Attacker removes signature, claims any identity | Always explicitly specify allowed algorithms in validation |
| Weak HS256 secret | Brute-forceable secret lets an attacker forge any token | Minimum 256-bit random secret; prefer RS256 |
| No aud validation | Token for Service A accepted by Service B | Always validate the aud claim matches the current service |
| Long expiry | Stolen token usable for hours/days | Access tokens: 5–15 min; use refresh tokens for long sessions |
| JWT in localStorage | Readable by any JavaScript (XSS risk) | Store in httpOnly cookie; if localStorage, accept XSS risk explicitly |
| No jti tracking | Cannot revoke individual tokens before expiry | Track jti in Redis for high-security actions; accept the cost |

Rate Limiting Algorithms β€” Full Comparison ​

The section above covered four algorithms; this one adds Leaky Bucket and provides deeper implementation guidance.

Token Bucket (Detailed) ​

Tokens accumulate up to a fixed capacity. Each request consumes one token; tokens refill at a configured rate per second.

Key properties:

  • Burst of up to capacity requests is immediately allowed
  • Long-term rate enforced by refill speed
  • Implementation: tokens = min(capacity, last_tokens + (now − last_refill) × rate); no timer needed, calculate lazily on each request

Leaky Bucket ​

Requests enter a fixed-size queue. A worker processes (drains) the queue at a constant rate. If the queue is full, the request is dropped.

Key difference from Token Bucket: Leaky Bucket produces a smooth, constant output rate regardless of input burst pattern. Token Bucket allows bursts to pass through immediately.

Fixed Window Counter ​

Window: [0s–60s] counter=0 → increments to 100 → resets at 60s → [60s–120s] counter=0

Boundary burst problem:

```
[0:59] 100 requests → allowed (window 1, counter=100)
[1:00] 100 requests → allowed (window 2 starts, counter=0 → 100)
Result: 200 requests in 2 seconds despite the "100/min" limit
```

Sliding Window Log ​

Stores a timestamp for every request in the current window. On each request:

  1. Remove entries older than window_size
  2. Count remaining entries
  3. If count < limit, allow and add the new timestamp; otherwise reject

Redis sorted-set implementation:

```
ZADD key timestamp "requestID"
ZREMRANGEBYSCORE key 0 (now - window_ms)
count = ZCARD key
```

Exact accuracy, but memory grows with request volume: O(requests_per_window) per user.

Sliding Window Counter (Hybrid) ​

Estimates the count using weighted average between current and previous window:

estimated = prev_count × (1 − elapsed/window_size) + curr_count

Example: Window=60s, prev_count=80, curr_count=10, elapsed=15s into current window:

estimated = 80 × (1 − 15/60) + 10 = 80 × 0.75 + 10 = 60 + 10 = 70

Memory: O(1) per user; only two counters are stored per window.
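The estimator is one line of Go. A minimal sketch (a full limiter would also rotate the two counters at each window boundary):

```go
package main

import "fmt"

// Estimate implements the hybrid sliding-window count:
// prev_count * (1 - elapsed/window) + curr_count
func Estimate(prevCount, currCount, elapsed, window float64) float64 {
	return prevCount*(1-elapsed/window) + currCount
}

func main() {
	// Window=60s, prev_count=80, curr_count=10, 15s into the current window.
	fmt.Println(Estimate(80, 10, 15, 60)) // 70
}
```

A request is allowed when the estimate stays below the limit; the approximation assumes the previous window's requests were evenly spread.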

Algorithm Comparison ​

| Algorithm | Burst Handling | Memory | Accuracy | Smoothness | Complexity | Best For |
|---|---|---|---|---|---|---|
| Token Bucket | Allows bursts up to capacity | O(1) | High | Bursty output | Low | APIs allowing short bursts (Stripe, AWS) |
| Leaky Bucket | Absorbs bursts, constant output | O(queue) | High | Smooth output | Low-Medium | Protecting downstream at constant rate |
| Fixed Window | Hard cutoff (boundary burst risk) | O(1) | Low | Not smooth | Lowest | Simple internal quotas |
| Sliding Window Log | Perfectly smooth | O(requests) | Exact | Smooth | Medium | Low-volume, exact enforcement |
| Sliding Window Counter | Smooth, approximate | O(1) | ~99.997% | Smooth | Low | Production APIs (Cloudflare, Kong) |

Distributed Rate Limiting with Redis ​

Single-node rate limiting is insufficient for multi-instance services. Use Redis atomic operations:

```lua
-- Token Bucket in Redis (Lua script, run atomically via EVAL)
-- KEYS[1] = bucket key; ARGV[1] = now (seconds); ARGV[2] = capacity; ARGV[3] = refill rate
local capacity = tonumber(ARGV[2])
local rate = tonumber(ARGV[3])
local now = tonumber(ARGV[1])
local tokens = tonumber(redis.call('GET', KEYS[1]) or capacity)
local last = tonumber(redis.call('GET', KEYS[1]..':ts') or now)
local refill = math.min(capacity, tokens + (now - last) * rate)
if refill >= 1 then
    redis.call('SET', KEYS[1], refill - 1)
    redis.call('SET', KEYS[1]..':ts', now)
    return 1  -- allowed
else
    return 0  -- rejected (state untouched; recomputed from ts next call)
end
```

Per-node vs centralized trade-off:

| Approach | Accuracy | Latency | Failure Mode |
|---|---|---|---|
| Per-node counter | Allows N×limit burst (N = node count) | Zero (local) | Node failure loses counter |
| Redis centralized | Accurate | +1–2ms per request | Redis outage = no rate limiting |
| Redis + local fallback | Approximate (slightly over) | +1–2ms normally, 0ms on Redis failure | Graceful degradation |

Cross-references: rate limiting at the API gateway layer is covered in Ch13 (Microservices); load balancer traffic shaping in Ch06 (Load Balancing).


Trade-offs & Comparisons ​

| Approach | Benefit | Cost | When to Choose |
|---|---|---|---|
| Sync replication (RPO=0) | No data loss on failover | Higher write latency | Financial transactions |
| Async replication (low cost) | Low write latency | Potential data loss | Analytics, content delivery |
| Active-active multi-region | RTO < 5s | Conflict resolution complexity | Global, revenue-critical |
| JWT (stateless tokens) | No server-side session store | Cannot revoke without token rotation | Scalable APIs |
| Session cookies (stateful) | Instant revocation | Session store becomes critical dependency | Traditional web apps |
| Sliding window rate limit | Smooth, accurate | Slightly more complex than fixed window | Production APIs |

Key Takeaway: Security and reliability are not features to bolt on; they emerge from deliberate design choices: short-lived tokens, layered input validation, isolated failure domains via bulkheads, and tested recovery procedures. The most dangerous assumption in system design is that your dependencies will stay up.


Case Study: Shopify's Payment Resilience ​

Shopify processes hundreds of billions of dollars in Gross Merchandise Volume annually. For a merchant, a failed or duplicated payment is existential: it means lost revenue or angry customers demanding refunds. This case study maps the reliability patterns in this chapter to Shopify's actual payment architecture.

Context and Challenges ​

| Challenge | Consequence if Ignored | Scale |
|---|---|---|
| Payment gateway failures | Lost sales during checkout | Shopify integrates 100+ payment providers |
| Double-charge prevention | Duplicate charges, chargebacks, merchant liability | Any retry without idempotency risks a duplicate charge |
| Partial failures | Payment debited but order not created | Distributed transaction across services |
| Reconciliation drift | Internal ledger disagrees with Stripe/Braintree | Discovered only at end-of-month audit |

Pattern 1: Idempotency Keys ​

Every payment request is tagged with a globally unique idempotency key generated by the client before the first attempt. If the network fails mid-request, the client retries with the same key; the payment provider de-duplicates on the key and returns the original result without re-processing the charge.

Key design rules for idempotency keys:

  • Generated client-side (not server-side) so the key survives server crashes
  • Stored with a TTL (e.g., 24h): long enough to cover retries, short enough to reclaim memory
  • Associated with the full response, not just a success flag, so clients can recover partial state
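The rules above can be sketched as an in-memory de-duplicating store in Go (production systems would use Redis with a TTL, as noted; `Store` and `Process` are illustrative names):

```go
package main

import (
	"fmt"
	"sync"
)

// Store caches the full response per idempotency key, so a retry with the
// same key returns the original result without re-running the charge.
type Store struct {
	mu        sync.Mutex
	responses map[string]string // in production: Redis with e.g. a 24h TTL
}

// Process runs charge at most once per key and replays the cached response
// on every retry.
func (s *Store) Process(key string, charge func() string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if resp, ok := s.responses[key]; ok {
		return resp // de-duplicated: charge is not executed again
	}
	resp := charge()
	s.responses[key] = resp
	return resp
}

func main() {
	s := &Store{responses: map[string]string{}}
	charges := 0
	pay := func() string { charges++; return "charge_ok_1" }
	fmt.Println(s.Process("idem-abc", pay)) // charge executes
	fmt.Println(s.Process("idem-abc", pay)) // retry: cached response, no double charge
	fmt.Println(charges)                    // 1
}
```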

Pattern 2: Circuit Breakers on Payment Providers ​

Shopify integrates multiple payment providers (Stripe, Braintree, Adyen, etc.). If one provider degrades, a circuit breaker isolates that provider and routes new requests to alternatives, maintaining checkout availability even when a provider has an incident.

The state machine is identical to the circuit breaker pattern in Chapter 13. The Shopify-specific addition: when the circuit opens, the load balancer weight for that provider drops to 0 rather than returning errors to users.

Pattern 3: Async Payment Processing ​

Not all payment operations are synchronous. Subscription renewals, delayed captures, and refunds are processed asynchronously through a queue. This isolates the checkout path from batch operations and provides guaranteed delivery even when downstream services are slow.

Architecture (see Chapter 11, Message Queues, for queue patterns):

  • Checkout publishes a payment.capture_requested event to a durable queue
  • A payment worker consumes the event, calls the provider, and emits payment.succeeded or payment.failed
  • The order service subscribes to payment.succeeded to fulfill the order
  • A dead-letter queue captures messages that fail 3 retries, for manual inspection

Why async for subscriptions specifically: Shopify processes millions of subscription renewals in a daily batch window. Processing them synchronously would require holding millions of open connections to payment providers. The queue decouples ingestion rate from processing rate, smoothing load across the window.
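A toy Go sketch of that decoupling, using channels as a stand-in for the durable queue; the event names follow the list above, while `processAll` and the worker-pool size are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// processAll drains a batch of payment events with a fixed worker pool,
// decoupling ingestion rate from processing rate. Events whose retries
// are exhausted are returned as the dead-letter set for manual inspection.
func processAll(events []string, workers int) []string {
	in := make(chan string, len(events))
	dead := make(chan string, len(events))
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ev := range in {
				if ev == "payment.capture_requested:bad" { // stand-in for exhausted retries
					dead <- ev
					continue
				}
				// Success path: call the provider, then emit payment.succeeded.
			}
		}()
	}
	for _, ev := range events {
		in <- ev
	}
	close(in)
	wg.Wait()
	close(dead)
	var out []string
	for ev := range dead {
		out = append(out, ev)
	}
	return out
}

func main() {
	batch := []string{
		"payment.capture_requested:1",
		"payment.capture_requested:bad",
		"payment.capture_requested:2",
	}
	fmt.Println(len(processAll(batch, 4))) // 1
}
```

A real pipeline would use a durable broker so events survive process crashes; the channel version only illustrates the rate decoupling.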

Pattern 4: Reconciliation Jobs ​

Even with idempotency keys and circuit breakers, state mismatches occur: network timeouts after a provider charges but before Shopify receives confirmation, provider-side corrections, partial refunds. Reconciliation jobs run on a schedule (hourly for high-value merchants, daily for standard) to detect and fix mismatches.

Reconciliation is the safety net that catches everything the online path missed. See Chapter 14 (Event-Driven Architecture) for the event sourcing approach that makes reconciliation audits tractable: each state transition is a logged event, so the full history is reconstructable.

Pattern Comparison ​

| Pattern | Problem Solved | Implementation | Trade-off |
|---|---|---|---|
| Idempotency keys | Duplicate charges on retry | Client-generated UUID + Redis lookup | Key storage cost; TTL must outlast retry window |
| Circuit breaker | Gateway outage kills checkout | Per-provider error rate threshold driving open/half-open/closed transitions | False opens under transient spikes; needs careful tuning |
| Async queue | Checkout blocked by slow provider | Durable queue + worker pool | Eventual consistency; UX must handle "payment processing" state |
| Reconciliation | Silent mismatches between systems | Periodic batch compare of internal vs external ledger | Latency: mismatches detected hours later, not instantly |

Key Takeaway ​

Financial systems require defense-in-depth: no single pattern prevents all failure modes. Idempotency prevents duplicates but not gateway outages. Circuit breakers prevent cascading failures but not data mismatches. Async queues decouple services but introduce eventual consistency. Reconciliation catches everything the online path missed but only after the fact. The complete system requires all four layers.


Code Example: Token Bucket Rate Limiter (Go) ​

```go
package main

import (
	"sync"
	"time"
)

type TokenBucket struct {
	mu         sync.Mutex
	tokens     float64
	maxTokens  float64
	refillRate float64 // tokens per second
	lastRefill time.Time
}

func (tb *TokenBucket) Allow() bool {
	tb.mu.Lock()
	defer tb.mu.Unlock()

	// Lazy refill: compute tokens earned since the last call (no background timer).
	now := time.Now()
	elapsed := now.Sub(tb.lastRefill).Seconds()
	tb.tokens = min(tb.maxTokens, tb.tokens+elapsed*tb.refillRate) // builtin min (Go 1.21+)
	tb.lastRefill = now

	if tb.tokens >= 1 {
		tb.tokens--
		return true
	}
	return false
}
```

Code Example: Circuit Breaker (Go) ​

```go
package main

import (
	"errors"
	"sync"
	"time"
)

type CircuitBreaker struct {
	mu          sync.Mutex
	state       string // "closed", "open", "half-open"
	failures    int
	threshold   int
	lastFailure time.Time
	cooldown    time.Duration
}

func (cb *CircuitBreaker) Execute(fn func() error) error {
	cb.mu.Lock()
	if cb.state == "open" {
		if time.Since(cb.lastFailure) > cb.cooldown {
			cb.state = "half-open" // cooldown elapsed: let a trial request through
		} else {
			cb.mu.Unlock()
			return errors.New("circuit breaker is open")
		}
	}
	cb.mu.Unlock()

	err := fn() // called outside the lock so slow calls don't block other goroutines

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		cb.lastFailure = time.Now()
		if cb.failures >= cb.threshold {
			cb.state = "open"
		}
		return err
	}
	cb.failures = 0
	cb.state = "closed"
	return nil
}
```
Related chapters:

| Chapter | Relevance |
|---|---|
| Ch05 (DNS) | DNSSEC and DNS-layer DDoS mitigation |
| Ch06 (Load Balancing) | Rate limiting and WAF at the LB/API gateway layer |
| Ch13 (Microservices) | Auth (JWT/OAuth2) in API gateway security model |
| Ch17 (Monitoring & Observability) | Security event detection via observability pipeline |

Practice Questions ​

Beginner ​

  1. JWT Validation: A user complains they were logged out even though their session "should still be valid." Walk through every JWT validation step that could cause a rejection: which claims (exp, iss, aud, nbf) are checked, and what failure does each indicate?

    Hint: Check `exp` (token expired), `nbf` (token not yet valid, a clock-skew issue), `iss` (wrong issuer from a misconfigured auth server), and `aud` (wrong audience: token issued for a different service); also verify the signature with the correct public key.

Intermediate ​

  1. DDoS Mitigation: Your API is receiving 500,000 requests/second from 50,000 different IP addresses. Per-IP rate limiting is ineffective. What additional mitigation layers would you apply, in what order, and at which network/application layer does each operate?

    Hint: Layer in order: CDN-level anycast absorption (Cloudflare/Akamai), BGP-level traffic scrubbing, challenge-response (CAPTCHA) for suspected bots, then application-level behavioral analysis (request pattern anomalies).
  2. Bulkhead + Circuit Breaker: Your payment service calls fraud detection, currency conversion, and ledger sequentially. If fraud detection becomes slow (P99 = 8s), all payment requests time out. Design a reliability architecture using bulkheads (separate thread pools) and circuit breakers for each dependency to isolate failures.

    Hint: Give each downstream service its own connection pool (bulkhead) so a slow fraud detection service exhausts only its pool, not the shared thread pool; add a circuit breaker per service with a 2s timeout threshold.
  3. RPO vs Cost Decision: A startup is choosing between RPO=1h ($2K/month, cold standby) and RPO=1min ($10K/month, warm standby with continuous replication). What business questions do you ask to help them decide, and how do you translate the answer into a cost-of-downtime calculation?

    Hint: Ask what the revenue per minute is during peak hours, and what a data-loss incident costs (regulatory fines, customer churn); if one hour of lost transactions exceeds $8K, the warm standby pays for itself.

Advanced ​

  1. Rate Limiting Algorithms: Compare token bucket and sliding window counter algorithms for rate limiting across: burst handling accuracy, memory usage per user, implementation complexity, and behavior at window boundaries. Which algorithm would you choose for a payment API (strict accuracy required) vs a social media feed API (burst-tolerant)?

    Hint: Token bucket allows smooth bursts (good for feeds); sliding window log is most accurate but uses O(requests) memory; fixed window counter has a boundary doubling flaw; the sliding window counter approximation balances accuracy and memory. Choose based on whether bursts are acceptable.

References & Further Reading ​

  • "Release It!" β€” Michael Nygard (circuit breaker patterns)
  • OWASP Top 10
  • OAuth 2.0 RFC 6749
  • "The SRE Book" β€” Google
  • Cloudflare rate limiting blog posts

