Chapter 16: Security & Reliability

Security is not a feature you add at the end; it is a property you design in from the start. The same is true of reliability: systems that survive failure are built to expect it.
Mind Map
Authentication vs Authorization
These two concepts are consistently confused in interviews and in code. They are distinct concerns with different scopes:
| Concept | Question Answered | Example | Enforcement Point |
|---|---|---|---|
| Authentication (AuthN) | Who are you? | Verifying username + password | Login endpoint, API gateway |
| Authorization (AuthZ) | What can you do? | Can this user delete this resource? | Business logic, middleware |
Authentication always precedes authorization. A system cannot determine what an identity is allowed to do before confirming that identity. However, authorization decisions can change without re-authenticating: a user's role may be revoked while their session remains active, which is why token expiry and revocation matter.
Common authorization models include Role-Based Access Control (RBAC), which assigns permissions to roles and roles to users, and Attribute-Based Access Control (ABAC), which evaluates policies against user, resource, and environment attributes for finer-grained decisions.
OAuth 2.0 Authorization Code Flow
OAuth 2.0 is an authorization framework (not an authentication protocol). It delegates access without sharing credentials. The most secure grant type for user-facing applications is the Authorization Code grant with PKCE.
Key security properties:
- The `state` parameter prevents CSRF on the redirect
- `code_challenge` / `code_verifier` (PKCE) prevents authorization code interception
- The `access_token` is short-lived (15 min) to limit the blast radius of leaks
- The `refresh_token` is long-lived but must be stored securely (httpOnly cookie, not localStorage)
JWT: Structure and Validation
A JSON Web Token is a self-contained credential: the resource server can verify it without calling the auth server on every request.
Structure: `base64url(header).base64url(payload).base64url(signature)`

```
eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9   ← Header
.
eyJzdWIiOiJ1c2VyXzEyMyIsInJvbGUiOiJhZG1pbiIsImV4cCI6MTcwMDAwMDAwMH0   ← Payload
.
[RSASSA-PKCS1-v1_5 signature]   ← Signature
```

Header (algorithm + type):

```json
{ "alg": "RS256", "typ": "JWT" }
```

Payload (claims; never put secrets here, it is base64-encoded, not encrypted):

```json
{
  "sub": "user_123",
  "role": "admin",
  "iat": 1700000000,
  "exp": 1700000900
}
```

Validation flow a resource server must execute:
Refresh token rotation: When an access token expires, the client sends the refresh token to receive a new access token (and optionally a new refresh token). If a stolen refresh token is used, the legitimate client's next refresh attempt reveals the double-use, triggering revocation of the entire token family.
Encryption
TLS 1.3 Handshake (Simplified)
Transport Layer Security (TLS) establishes an encrypted channel before any application data is transmitted. TLS 1.3 reduced the handshake from two round trips (TLS 1.2) to one.
The key_share uses Ephemeral Diffie-Hellman: the session key is never transmitted; it is derived independently on both sides. This provides Forward Secrecy: compromising the server's private key later does not decrypt past sessions.
At-Rest Encryption
| Layer | Mechanism | Who Manages Keys |
|---|---|---|
| Full disk | AES-256 (dm-crypt, FileVault) | OS / cloud provider |
| Database column | Application-level AES-256 | Application + KMS |
| Object storage | SSE-S3 / SSE-KMS | Cloud provider |
| Secrets | Vault, AWS Secrets Manager | Dedicated secrets service |
Key management is the hard part. Keys encrypted by other keys must stop somewhere: a Hardware Security Module (HSM) or cloud-managed key material is the root of trust.
Rate Limiting Algorithms
Rate limiting protects services from overload, abuse, and cost runaway. Four common algorithms each have different trade-offs:
1. Token Bucket
Tokens refill at a fixed rate. Burst traffic up to bucket capacity is allowed. Used by AWS API Gateway, Stripe.
2. Fixed Window Counter
Simple, but it has a boundary burst problem: a 100 req/min limit allows 100 requests at 0:59 and another 100 at 1:00, effectively 200 in 2 seconds.
3. Sliding Window Log
Maintains a timestamped log of each request. On each request, purge entries older than the window, count remaining entries.
- Pros: Perfectly accurate
- Cons: Memory grows with request volume; stores every timestamp
4. Sliding Window Counter
Approximation that combines fixed window simplicity with sliding accuracy:
estimated_count = prev_window_count × (1 − elapsed_fraction) + current_window_count
Algorithm Comparison:
| Algorithm | Accuracy | Memory | Burst Handling | Complexity |
|---|---|---|---|---|
| Token Bucket | High | O(1) | Allows bursts up to capacity | Low |
| Fixed Window | Low (boundary burst) | O(1) | Hard cutoff at boundary | Lowest |
| Sliding Window Log | Exact | O(requests) | Smooth enforcement | Medium |
| Sliding Window Counter | High (~0.003% error) | O(1) | Smooth, approximate | Low |
Distributed rate limiting requires a shared store (Redis with atomic INCR + EXPIRE). Per-node counters are simpler but allow an N×limit burst across N nodes.
DDoS Mitigation
A Distributed Denial of Service attack exhausts resources (bandwidth, CPU, connections) to make a service unavailable. Defense is layered:
| Strategy | Layer | How It Helps | Example |
|---|---|---|---|
| CDN absorption | L3/L4/L7 | Anycast distributes attack traffic across PoPs | Cloudflare absorbs 100 Tbps |
| Rate limiting | L7 | Caps requests per IP / ASN | Drop IPs > 1000 req/min |
| Web Application Firewall (WAF) rules | L7 | Block malformed HTTP, known attack signatures | AWS WAF, ModSecurity |
| IP reputation | L3/L4 | Block known botnet/scanner IPs | MaxMind, AbuseIPDB feeds |
| Anycast routing | L3 | Spread volumetric traffic across global PoPs | BGP anycast |
| SYN cookies | L4 | Defend TCP SYN flood without state | Linux kernel default |
| Connection limits | L4 | Cap concurrent connections per source | nginx limit_conn |
Real-World: Cloudflare DDoS Mitigation. Cloudflare operates 300+ PoPs using anycast. A volumetric attack targeting a single origin is distributed across the network; each PoP absorbs a fraction. Layer 7 attacks are filtered by their WAF and machine-learning-based bot detection. The then-largest HTTP DDoS on record (71M req/sec, 2023) was mitigated automatically.
Input Validation
Never trust user input. Validate, sanitize, and parameterize at every boundary.
XSS (Cross-Site Scripting)
Attack: Injecting script into content rendered by other users' browsers.
Prevention checklist:
- [ ] HTML-encode all user-supplied output (`<` becomes `&lt;`)
- [ ] Use a Content-Security-Policy header to restrict script sources
- [ ] Use the `httpOnly` cookie flag so JavaScript cannot read the cookie
- [ ] Avoid `innerHTML`; use `textContent` or framework templating
SQL Injection
Attack: Embedding SQL syntax in user input to manipulate queries.
Prevention checklist:
- [ ] Use parameterized queries / prepared statements; never string-concatenate SQL
- [ ] Use an ORM (Hibernate, SQLAlchemy, Prisma) that parameterizes by default
- [ ] Apply least-privilege DB users (the app user cannot `DROP TABLE`)
- [ ] Validate input type and length before it reaches the database layer
CSRF (Cross-Site Request Forgery)
Attack: Tricking an authenticated user's browser into making unintended requests.
Prevention checklist:
- [ ] Use CSRF tokens (unpredictable, tied to the session, validated server-side)
- [ ] Use the `SameSite=Strict` or `SameSite=Lax` cookie attribute
- [ ] Validate `Origin` / `Referer` headers on state-changing requests
- [ ] Require re-authentication for high-impact actions (fund transfers, email change)
Reliability Patterns
Retry with Exponential Backoff and Jitter
Retrying failed requests immediately causes a thundering herd. Exponential backoff with jitter spreads retries over time:

wait = min(cap, base × 2^attempt) + random(0, base)

Do not retry on: 4xx errors (client mistakes) or non-idempotent operations without idempotency keys.
Circuit Breaker
See Chapter 13 for the full circuit breaker pattern (Closed → Open → Half-Open state machine). In the context of security and reliability, a circuit breaker prevents a failing downstream dependency from cascading failures into your service, keeping the service degraded rather than down.
Bulkhead Pattern
Named after ship hull partitions that prevent one flooded compartment from sinking the entire ship.
Apply bulkheads at: connection pools per downstream service, thread pools per request type, CPU/memory limits per container (via cgroups/Kubernetes resource limits).
Graceful Degradation Strategies
| Scenario | Degraded Behavior | User Experience |
|---|---|---|
| Recommendation service down | Return empty recommendations | Page loads without "You may also like" |
| Search service slow | Return cached results | Stale results shown with banner |
| Payment processor timeout | Queue for async retry | "We're processing your payment" |
| Auth service flapping | Serve cached session | User remains logged in temporarily |
| Image service down | Show placeholder | Broken image replaced with fallback |
The key principle: identify which features are critical-path (cannot be degraded) vs. non-critical (can return defaults or be hidden) and design accordingly.
Disaster Recovery
RPO vs RTO
| Metric | Definition | Question It Answers | Typical Target |
|---|---|---|---|
| RPO (Recovery Point Objective) | Max acceptable data loss | "How much data can we lose?" | 0s (sync replication) to 24h |
| RTO (Recovery Time Objective) | Max acceptable downtime | "How long can we be down?" | Seconds (active-active) to hours |
Lower RPO and RTO require more expensive infrastructure. The relationship is roughly exponential: going from RTO=1h to RTO=1min may cost 10× more.
Backup Strategies
| Strategy | Description | RTO | RPO | Cost |
|---|---|---|---|---|
| Hot standby | Active replica in sync, traffic switchable in seconds | Seconds | Near-zero | Highest (~2× infrastructure) |
| Warm standby | Replica running, data lagging, needs promotion | Minutes | Minutes | Medium (~1.5×) |
| Cold standby | Backups stored, no running replica, restore on failure | Hours | Hours | Lowest |
| Pilot light | Minimal infrastructure pre-provisioned, scales on activation | 10–30 min | Minutes | Low-medium |
Multi-Region Failover
Failover checklist:
- [ ] DNS TTL set low (30–60s) before planned failover; a low TTL costs more DNS queries in normal operation
- [ ] Replica is caught up (check replication lag) before promoting
- [ ] Application connection strings use DNS names, not hardcoded IPs
- [ ] Run failover drills quarterly; untested DR is not DR
Real-World: Netflix Chaos Engineering. Netflix runs Chaos Monkey in production, randomly terminating EC2 instances. Chaos Kong kills entire AWS regions. The philosophy: if failures happen regularly during business hours when engineers are alert, you are forced to build genuine resilience rather than relying on MTTR.
OAuth 2.0 Authorization Flows
OAuth 2.0 defines several "grant types", each optimized for a different client context. The section above covers the Authorization Code + PKCE flow; this section maps all major flows and when to use each.
Flow Comparison
| Flow | Best For | Token Location | Security Level | Client Secret Required |
|---|---|---|---|---|
| Authorization Code + PKCE | Web apps, mobile, SPA | Server-side or httpOnly cookie | Highest | No (PKCE replaces it) |
| Authorization Code (no PKCE) | Traditional server-side web apps | Server-side session | High | Yes |
| Client Credentials | Machine-to-machine, background services | Server memory / secrets manager | High (no user) | Yes |
| Device Code | Smart TVs, CLI tools, limited-input devices | Server-side | Medium | No |
| Implicit (deprecated) | Legacy SPA | URL fragment (insecure) | Low; do not use | No |
Authorization Code + PKCE Flow (Web / Mobile)
This is the flow shown earlier in the chapter. PKCE (Proof Key for Code Exchange) replaces the client secret for public clients that cannot store secrets securely (e.g., single-page apps, mobile apps).
PKCE mechanics:
- The client generates a random `code_verifier` (43–128 chars)
- The client computes `code_challenge = BASE64URL(SHA256(code_verifier))`
- The authorization request includes `code_challenge` and `code_challenge_method=S256`
- The token request includes `code_verifier`; the server re-hashes and compares
Even if an attacker intercepts the authorization_code, they cannot exchange it without the original code_verifier.
Client Credentials Flow (Machine-to-Machine)
No user is involved. A backend service authenticates directly as itself.
Use case: Microservice A calling Microservice B, scheduled jobs calling APIs, CI/CD pipelines accessing deployment APIs.
Security note: the client_secret must be stored in a secrets manager (AWS Secrets Manager, HashiCorp Vault), never in source code or in environment variables committed to git.
Device Code Flow (Input-Constrained Devices)
Use case: Logging into Netflix on a smart TV, GitHub CLI authentication, IoT device provisioning.
JWT Deep-Dive
The section above covers JWT structure and validation. This section adds claim semantics, a session vs token comparison, and security pitfalls.
Standard Claims Reference
| Claim | Full Name | Purpose | Example Value |
|---|---|---|---|
| `iss` | Issuer | Who created the token | "https://auth.example.com" |
| `sub` | Subject | Who the token represents (user ID) | "user_abc123" |
| `aud` | Audience | Which service(s) should accept this token | "api.example.com" |
| `exp` | Expiration | Unix timestamp after which the token is invalid | 1700000900 |
| `iat` | Issued At | Unix timestamp when the token was created | 1700000000 |
| `nbf` | Not Before | Token not valid before this timestamp | 1700000000 |
| `jti` | JWT ID | Unique token ID; enables revocation tracking | "abc-def-123" |
Custom claims (application-specific):

```json
{
  "sub": "user_123",
  "role": "admin",
  "org_id": "org_456",
  "permissions": ["read:reports", "write:settings"],
  "exp": 1700000900
}
```

JWT Algorithm Selection
| Algorithm | Type | Key Type | Use Case |
|---|---|---|---|
| `HS256` | Symmetric HMAC | Single shared secret | Internal services (all share the same secret) |
| `RS256` | Asymmetric RSA | Private key signs, public key verifies | Cross-service (distribute the public key only) |
| `ES256` | Asymmetric ECDSA | Private key signs, public key verifies | Same as RS256 but smaller tokens |
Rule: Use RS256 or ES256 for any token that crosses a trust boundary. HS256 is fine for internal service-to-service when all parties share the secret.
Session-Based vs Token-Based Auth
| Property | Session (Cookie) | Token (JWT) |
|---|---|---|
| Server state | Session stored server-side (DB/Redis) | Stateless; no server state |
| Revocation | Instant; delete the session from the store | Hard; token valid until expiry |
| Scalability | Session store becomes a hot dependency | Scales easily; no shared state |
| Token size | Cookie: ~100 bytes (session ID only) | JWT: ~500–2000 bytes in headers |
| Cross-domain | Cookies limited to same origin / CORS | Bearer token works cross-domain |
| Mobile/API clients | Awkward; cookie handling varies | Natural; Authorization header |
| Best for | Traditional web apps, instant logout critical | APIs, microservices, mobile apps |
JWT Security Pitfalls
| Pitfall | Risk | Mitigation |
|---|---|---|
| `alg: none` attack | Attacker removes the signature, claims any identity | Always explicitly specify allowed algorithms in validation |
| Weak HS256 secret | Brute-forceable secret; forge any token | Minimum 256-bit random secret; prefer RS256 |
| No `aud` validation | Token for Service A accepted by Service B | Always validate that the `aud` claim matches the current service |
| Long expiry | Stolen token usable for hours/days | Access tokens: 5–15 min; use refresh tokens for long sessions |
| JWT in localStorage | Readable by any JavaScript (XSS risk) | Store in an httpOnly cookie; if localStorage, accept the XSS risk explicitly |
| No `jti` tracking | Cannot revoke individual tokens before expiry | Track `jti` in Redis for high-security actions; accept the cost |
Rate Limiting Algorithms: Full Comparison
The section above covers four algorithms. This section adds Leaky Bucket and provides deeper implementation guidance.
Token Bucket (Detailed)
Tokens accumulate up to a capacity. Each request consumes one token. Tokens refill at `rate` tokens per second.
Key properties:
- A burst of up to `capacity` requests is immediately allowed
- The long-term rate is enforced by the refill speed
- Implementation: `tokens = min(capacity, last_tokens + (now - last_refill) * rate)`; no timer needed, calculate on each request
Leaky Bucket
Requests enter a fixed-size queue. A worker processes (drains) the queue at a constant rate. If the queue is full, the request is dropped.
Key difference from Token Bucket: Leaky Bucket produces a smooth, constant output rate regardless of input burst pattern. Token Bucket allows bursts to pass through immediately.
Fixed Window Counter

```
Window [0s–60s]: counter=0 → increments to 100 → resets at 60s → Window [60s–120s]: counter=0
```

Boundary burst problem:

```
[0:59] 100 requests → allowed (window 1, counter=100)
[1:00] 100 requests → allowed (window 2 starts, counter=0 → 100)
Result: 200 requests in 2 seconds despite the "100/min" limit
```

Sliding Window Log
Stores a timestamp for every request in the current window. On each request:
- Remove entries older than `window_size`
- Count the remaining entries
- If count < limit, allow and add the new timestamp; else reject

Redis sorted set:

```
ZADD key timestamp "requestID"
ZREMRANGEBYSCORE key 0 (now - window_ms)
count = ZCARD key
```

Exact accuracy, but memory grows with request volume: O(requests_per_window) per user.
Sliding Window Counter (Hybrid)
Estimates the count using a weighted average between the current and previous window:

estimated = prev_count × (1 − elapsed/window_size) + curr_count

Example: window=60s, prev_count=80, curr_count=10, 15s elapsed into the current window:

estimated = 80 × (1 − 15/60) + 10 = 80 × 0.75 + 10 = 60 + 10 = 70

Memory: O(1) per user; only two counters per window are stored.
Algorithm Comparison
| Algorithm | Burst Handling | Memory | Accuracy | Smoothness | Complexity | Best For |
|---|---|---|---|---|---|---|
| Token Bucket | Allows bursts up to capacity | O(1) | High | Bursty output | Low | APIs allowing short bursts (Stripe, AWS) |
| Leaky Bucket | Absorbs bursts, constant output | O(queue) | High | Smooth output | Low-Medium | Protecting downstream at constant rate |
| Fixed Window | Hard cutoff (boundary burst risk) | O(1) | Low | Not smooth | Lowest | Simple internal quotas |
| Sliding Window Log | Perfectly smooth | O(requests) | Exact | Smooth | Medium | Low-volume, exact enforcement |
| Sliding Window Counter | Smooth, approximate | O(1) | ~99.997% | Smooth | Low | Production APIs (Cloudflare, Kong) |
Distributed Rate Limiting with Redis
Single-node rate limiting is insufficient for multi-instance services. Use Redis atomic operations:
```lua
-- Token Bucket in Redis (Lua script for atomicity)
local key      = KEYS[1]
local now      = tonumber(ARGV[1])
local rate     = tonumber(ARGV[2])
local capacity = tonumber(ARGV[3])

local tokens = tonumber(redis.call('GET', key) or capacity)
local last   = tonumber(redis.call('GET', key..':ts') or now)
local refill = math.min(capacity, tokens + (now - last) * rate)
if refill >= 1 then
  redis.call('SET', key, refill - 1)
  redis.call('SET', key..':ts', now)
  return 1 -- allowed
else
  return 0 -- rejected
end
```

Per-node vs centralized trade-off:
| Approach | Accuracy | Latency | Failure Mode |
|---|---|---|---|
| Per-node counter | Allows N×limit burst (N = node count) | Zero (local) | Node failure loses the counter |
| Redis centralized | Accurate | +1–2ms per request | Redis outage = no rate limiting |
| Redis + local fallback | Approximate (slightly over) | +1–2ms normally, 0ms on Redis failure | Graceful degradation |
Cross-references: rate limiting at the API gateway layer is covered in Ch13 (Microservices); load balancer traffic shaping in Ch06 (Load Balancing).
Trade-offs & Comparisons
| Approach | Benefit | Cost | When to Choose |
|---|---|---|---|
| Sync replication (RPO=0) | No data loss on failover | Higher write latency | Financial transactions |
| Async replication (low cost) | Low write latency | Potential data loss | Analytics, content delivery |
| Active-active multi-region | RTO < 5s | Conflict resolution complexity | Global, revenue-critical |
| JWT (stateless tokens) | No server-side session store | Cannot revoke without token rotation | Scalable APIs |
| Session cookies (stateful) | Instant revocation | Session store becomes critical dependency | Traditional web apps |
| Sliding window rate limit | Smooth, accurate | Slightly more complex than fixed window | Production APIs |
Key Takeaway: Security and reliability are not features to bolt on; they emerge from deliberate design choices: short-lived tokens, layered input validation, isolated failure domains via bulkheads, and tested recovery procedures. The most dangerous assumption in system design is that your dependencies will stay up.
Case Study: Shopify's Payment Resilience
Shopify processes hundreds of billions of dollars in Gross Merchandise Volume annually. For a merchant, a failed or duplicated payment is existential β it means lost revenue or angry customers demanding refunds. This case study maps the reliability patterns in this chapter to Shopify's actual payment architecture.
Context and Challenges
| Challenge | Consequence if Ignored | Scale |
|---|---|---|
| Payment gateway failures | Lost sales during checkout | Shopify integrates 100+ payment providers |
| Double-charge prevention | Duplicate charges, chargebacks, merchant liability | Any retry without idempotency risks a duplicate charge |
| Partial failures | Payment debited but order not created | Distributed transaction across services |
| Reconciliation drift | Internal ledger disagrees with Stripe/Braintree | Discovered only at end-of-month audit |
Pattern 1: Idempotency Keys
Every payment request is tagged with a globally unique idempotency key generated by the client before the first attempt. If the network fails mid-request, the client retries with the same key; the payment provider de-duplicates on the key and returns the original result without re-processing the charge.
Key design rules for idempotency keys:
- Generated client-side (not server-side) so the key survives server crashes
- Stored with a TTL (e.g., 24h): long enough to cover retries, short enough to reclaim memory
- Associated with the full response, not just a success flag, so clients can recover partial state
Pattern 2: Circuit Breakers on Payment Providers
Shopify integrates multiple payment providers (Stripe, Braintree, Adyen, etc.). If one provider degrades, a circuit breaker isolates that provider and routes new requests to alternatives, maintaining checkout availability even when a provider has an incident.
The state machine is identical to the circuit breaker pattern in Chapter 13. The Shopify-specific addition: when the circuit opens, the load balancer weight for that provider drops to 0 rather than returning errors to users.
Pattern 3: Async Payment Processing
Not all payment operations are synchronous. Subscription renewals, delayed captures, and refunds are processed asynchronously through a queue. This isolates the checkout path from batch operations and provides guaranteed delivery even when downstream services are slow.
Architecture (see Chapter 11, Message Queues, for queue patterns):
- Checkout publishes a `payment.capture_requested` event to a durable queue
- A payment worker consumes the event, calls the provider, and emits `payment.succeeded` or `payment.failed`
- The order service subscribes to `payment.succeeded` to fulfill the order
- A dead-letter queue captures failed messages after 3 retries for manual inspection
Why async for subscriptions specifically: Shopify processes millions of subscription renewals in a daily batch window. Processing them synchronously would require holding millions of open connections to payment providers. The queue decouples ingestion rate from processing rate, smoothing load across the window.
Pattern 4: Reconciliation Jobs
Even with idempotency keys and circuit breakers, state mismatches occur: network timeouts after a provider charges but before Shopify receives confirmation, provider-side corrections, partial refunds. Reconciliation jobs run on a schedule (hourly for high-value merchants, daily for standard) to detect and fix mismatches.
Reconciliation is the safety net that catches everything the online path missed. See Chapter 14 (Event-Driven Architecture) for the event sourcing approach that makes reconciliation audits tractable: each state transition is a logged event, so the full history is reconstructable.
Pattern Comparison
| Pattern | Problem Solved | Implementation | Trade-off |
|---|---|---|---|
| Idempotency keys | Duplicate charges on retry | Client-generated UUID + Redis lookup | Key storage cost; TTL must outlast retry window |
| Circuit breaker | Gateway outage kills checkout | Per-provider error-rate threshold drives open/half-open/closed transitions | False opens under transient spikes; needs careful tuning |
| Async queue | Checkout blocked by slow provider | Durable queue + worker pool | Eventual consistency; UX must handle "payment processing" state |
| Reconciliation | Silent mismatches between systems | Periodic batch compare of internal vs external ledger | Latency: mismatches detected hours later, not instantly |
Key Takeaway
Financial systems require defense-in-depth: no single pattern prevents all failure modes. Idempotency prevents duplicates but not gateway outages. Circuit breakers prevent cascading failures but not data mismatches. Async queues decouple services but introduce eventual consistency. Reconciliation catches everything the online path missed but only after the fact. The complete system requires all four layers.
Code Example: Token Bucket Rate Limiter (Go)

```go
type TokenBucket struct {
	mu         sync.Mutex
	tokens     float64
	maxTokens  float64
	refillRate float64 // tokens per second
	lastRefill time.Time
}

func (tb *TokenBucket) Allow() bool {
	tb.mu.Lock()
	defer tb.mu.Unlock()
	now := time.Now()
	elapsed := now.Sub(tb.lastRefill).Seconds()
	tb.tokens = min(tb.maxTokens, tb.tokens+elapsed*tb.refillRate)
	tb.lastRefill = now
	if tb.tokens >= 1 {
		tb.tokens--
		return true
	}
	return false
}
```

Code Example: Circuit Breaker (Go)
```go
type CircuitBreaker struct {
	mu          sync.Mutex
	state       string // "closed", "open", "half-open"
	failures    int
	threshold   int
	lastFailure time.Time
	cooldown    time.Duration
}

func (cb *CircuitBreaker) Execute(fn func() error) error {
	cb.mu.Lock()
	if cb.state == "open" {
		if time.Since(cb.lastFailure) > cb.cooldown {
			cb.state = "half-open"
		} else {
			cb.mu.Unlock()
			return errors.New("circuit breaker is open")
		}
	}
	cb.mu.Unlock()

	err := fn()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		cb.lastFailure = time.Now()
		if cb.failures >= cb.threshold {
			cb.state = "open"
		}
		return err
	}
	cb.failures = 0
	cb.state = "closed"
	return nil
}
```

Related Chapters
| Chapter | Relevance |
|---|---|
| Ch05 (DNS) | DNSSEC and DNS-layer DDoS mitigation |
| Ch06 (Load Balancing) | Rate limiting and WAF at the LB/API gateway layer |
| Ch13 (Microservices) | Auth (JWT/OAuth2) in the API gateway security model |
| Ch17 (Monitoring & Observability) | Security event detection via the observability pipeline |
Practice Questions
Beginner
JWT Validation: A user complains they were logged out even though their session "should still be valid." Walk through every JWT validation step that could cause a rejection: which claims (`exp`, `iss`, `aud`, `nbf`) are checked, and what failure does each indicate?

Hint: Check `exp` (token expired), `nbf` (token not yet valid; clock-skew issue), `iss` (wrong issuer; misconfigured auth server), and `aud` (wrong audience; token issued for a different service); also verify the signature with the correct public key.
Intermediate
DDoS Mitigation: Your API is receiving 500,000 requests/second from 50,000 different IP addresses. Per-IP rate limiting is ineffective. What additional mitigation layers would you apply, in what order, and at which network/application layer does each operate?

Hint: Layer in order: CDN-level anycast absorption (Cloudflare/Akamai), BGP-level traffic scrubbing, challenge-response (CAPTCHA) for suspected bots, then application-level behavioral analysis (request pattern anomalies).

Bulkhead + Circuit Breaker: Your payment service calls fraud detection, currency conversion, and ledger sequentially. If fraud detection becomes slow (P99 = 8s), all payment requests time out. Design a reliability architecture using bulkheads (separate thread pools) and circuit breakers for each dependency to isolate failures.

Hint: Give each downstream service its own connection pool (bulkhead) so a slow fraud detection service exhausts only its own pool, not the shared thread pool; add a circuit breaker per service with a 2s timeout threshold.

RPO vs Cost Decision: A startup is choosing between RPO=1h ($2K/month, cold standby) and RPO=1min ($10K/month, warm standby with continuous replication). What business questions do you ask to help them decide, and how do you translate the answer into a cost-of-downtime calculation?

Hint: Ask what the revenue per minute is during peak hours and what a data-loss incident costs (regulatory fines, customer churn); if one hour of lost transactions exceeds the $8K/month premium, the warm standby pays for itself.
Advanced
Rate Limiting Algorithms: Compare the token bucket and sliding window counter algorithms across burst handling accuracy, memory usage per user, implementation complexity, and behavior at window boundaries. Which algorithm would you choose for a payment API (strict accuracy required) vs a social media feed API (burst-tolerant)?

Hint: Token bucket allows smooth bursts (good for feeds); the sliding window log is most accurate but uses O(requests) memory; the fixed window counter has a boundary doubling flaw; the sliding window counter approximation balances accuracy and memory. Choose based on whether bursts are acceptable.
References & Further Reading
- "Release It!" β Michael Nygard (circuit breaker patterns)
- OWASP Top 10
- OAuth 2.0 RFC 6749
- "The SRE Book" β Google
- Cloudflare rate limiting blog posts
