Chapter 17: Monitoring & Observability

Monitoring tells you when something is wrong. Observability tells you why. The difference is whether your system was built to be questioned.



The Observability Signals

Historically framed as "three pillars," observability in 2026 is better understood as four signals: metrics, logs, traces, and an emerging fourth signal, continuous profiling. OpenTelemetry (OTel), now the de facto instrumentation layer across the industry, standardizes the first three and has profiling at release-candidate stage.

| Dimension | Metrics | Logs | Traces | Profiling |
| --- | --- | --- | --- | --- |
| What is it? | Numeric measurements over time | Timestamped text/structured records | Request journey across services | Continuous CPU/memory/allocation flame graphs |
| Granularity | Aggregated (not per-request) | Per-event | Per-request, cross-service | Per-process, sampled continuously |
| Best for | Dashboards, alerting, trends | Debugging specific events | Latency attribution, dependency mapping | Finding hot code paths; unexplained CPU/memory cost |
| When to use | "Is something wrong?" | "What happened at 14:03:22?" | "Which service added 800ms?" | "Which function is burning CPU at 4 AM?" |
| Cost | Low (aggregates) | Medium (storage scales with volume) | High (sampling required at scale) | Low to medium (eBPF-based has <1% CPU overhead) |
| Common tools | Prometheus, Datadog, Atlas | ELK Stack, Loki, CloudWatch | Jaeger, Zipkin, Grafana Tempo | Parca, Pyroscope, Grafana Profiles |
| OTel status | GA (stable) | GA (stable) | GA (stable) | RC (targeting GA Q3 2026) |

You need all four. Metrics tell you latency spiked; logs tell you which user was affected; traces tell you which service in the call chain caused it; profiling tells you which function inside that service is burning CPU or allocating memory unexpectedly.

Profiling as the 4th Signal (as of 2026)

OpenTelemetry's profiling signal entered Release Candidate in Q1 2026, targeting GA by Q3 2026. It uses eBPF-based continuous CPU/memory sampling and Linux perf format, exportable via OTLP. Production tools like Grafana Pyroscope and Parca can already receive OTel profiling data. Frame it as "emerging GA": no longer experimental, but not yet fully standardized.


Metrics Types

Prometheus defines four metric types; the three below are the ones most systems share and the ones you will use most. (The fourth, the summary, computes quantiles on the client and cannot be aggregated across instances.)

| Type | Definition | Example Use Cases | Example Values |
| --- | --- | --- | --- |
| Counter | Monotonically increasing value; only goes up (resets to zero on restart) | Total HTTP requests, total errors, total bytes sent | http_requests_total{method="GET", status="200"} 10482 |
| Gauge | Arbitrary value that can go up or down | CPU %, active connections, queue depth, memory usage | queue_depth{queue="payments"} 42 |
| Histogram | Samples observations into configurable buckets; exposes _count, _sum, _bucket | Request latency distribution, request size | http_duration_seconds_bucket{le="0.1"} 8234 |
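
Because a counter restarts from zero when its process restarts, any consumer that computes "how much did it grow" has to correct for resets. The sketch below shows the idea under simplifying assumptions: `counter_increase` is a hypothetical helper (not a Prometheus API) and the sample values are invented. Prometheus's `rate()` and `increase()` apply the same kind of reset correction, plus extrapolation that is left out here.

```python
def counter_increase(samples: list[float]) -> float:
    """Total growth across consecutive counter samples, tolerating resets."""
    increase = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # The value dropped, so the process restarted and counted up
            # from zero again: everything seen since the restart is new.
            increase += curr
    return increase


# Hypothetical scrape values; the drop from 160 to 20 is a restart.
samples = [100.0, 130.0, 160.0, 20.0, 50.0]
print(counter_increase(samples))
```

A naive `last - first` on the same samples would report a negative number, which is why raw counter values are rarely graphed directly.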

Why Histograms Matter: Percentiles vs. Averages

Averages hide tail latency. A p99 latency of 2s means 1% of requests take 2 seconds or longer, which at scale can be thousands of users. Always alert on percentiles (p95, p99, p99.9) for latency metrics.

p50 = 50ms  ← Typical request
p95 = 200ms ← 95% of requests are at least this fast
p99 = 800ms ← Tail (worst 1%)

Prometheus estimates percentiles from histogram buckets server-side using histogram_quantile(0.99, ...); the accuracy of the estimate depends on where the bucket boundaries sit.
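
To see why a bucket-based percentile is an estimate, the interpolation can be sketched as follows. This is a simplified illustration of the approach, not Prometheus's actual implementation; the function name `estimate_quantile` and every bucket count except the `le="0.1"` value quoted earlier are assumptions.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if not 0 < q < 1 or total <= 0:
        raise ValueError("need 0 < q < 1 and at least one observation")
    rank = q * total  # number of observations at or below the quantile
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The overflow bucket has no upper edge: fall back to the
                # highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Cumulative request counts at or below each latency bound (seconds).
buckets = [(0.1, 8234), (0.25, 9500), (0.5, 9900), (1.0, 9990), (math.inf, 10000)]
print(f"estimated p99 = {estimate_quantile(0.99, buckets):.3f}s")
```

With these counts the true p99 could lie anywhere between 0.25s and 0.5s; the estimate can only be as precise as the bucket it falls into, so it helps to place bucket boundaries near the latency thresholds you alert on.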


Distributed Tracing

In a microservices system, a single user request may fan out to 10+ services. Traditional per-service logging cannot answer "which service is slow?" Distributed tracing reconstructs the full call tree.

Core Concepts

  • Trace: The complete journey of one request, from entry point to all leaf calls
  • Span: One unit of work within a trace (e.g., one service call, one DB query)
  • Correlation ID / Trace ID: A unique ID injected at the entry point and propagated via HTTP headers (a custom header such as X-Trace-ID, or the standard traceparent header) to every downstream call
  • Parent-child spans: Each span records its parent span ID, enabling tree reconstruction

Span Tree Visualization

abc123 [120ms total]
├── span_001: Auth (User Service)  [15ms]
└── span_002: Order Service        [98ms]
    └── span_003: DB Query         [45ms]

This immediately surfaces that the Order Service span accounts for 98ms of the 120ms total: 45ms in the DB query and the remaining 53ms in the Order Service itself. Those two are the obvious optimization targets.
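
The same attribution can be computed mechanically as each span's self-time (its duration minus the time spent in its direct children). A minimal sketch, assuming child spans run one after another rather than in parallel; `Span` and `self_times` are hypothetical names, not part of any tracing library.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: int
    children: list["Span"] = field(default_factory=list)


def self_times(span: Span) -> dict[str, int]:
    """Map each span name to the time not covered by its direct children."""
    own = span.duration_ms - sum(child.duration_ms for child in span.children)
    result = {span.name: own}
    for child in span.children:
        result.update(self_times(child))
    return result


# The span tree from the visualization above.
trace = Span("abc123", 120, [
    Span("Auth (User Service)", 15),
    Span("Order Service", 98, [Span("DB Query", 45)]),
])

for name, ms in sorted(self_times(trace).items(), key=lambda kv: -kv[1]):
    print(f"{name:22s} {ms:4d} ms")
```

Sorting by self-time puts the Order Service's own work and the DB query at the top, which is how most trace UIs rank optimization candidates.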

Sampling Strategies

At 10,000 req/sec, storing every trace is prohibitively expensive. Two approaches:

| Strategy | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| Head-based | Decide at trace start (random %) | Low overhead, simple | Misses rare errors |
| Tail-based | Buffer full trace, decide after completion based on outcome | Captures all errors/slow requests | Higher memory/processing cost |

Production recommendation: keep 1% of ordinary traces, plus 100% of traces that contain an error or exceed the p99 latency threshold. Keeping traces by outcome requires tail-based sampling, because a head-based sampler decides before the outcome is known.
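
That policy can be sketched as a single decision function applied once a trace has been fully buffered. This is an illustration only: `keep_trace` and its parameters are hypothetical, and a real deployment would normally configure this in a collector rather than write it by hand.

```python
import hashlib


def keep_trace(trace_id: str, had_error: bool, duration_ms: float,
               p99_threshold_ms: float, base_rate: float = 0.01) -> bool:
    """Tail-based decision, made after the whole trace has been seen."""
    if had_error or duration_ms > p99_threshold_ms:
        return True  # always keep the interesting traces
    # Hash the trace ID instead of rolling a die, so that every sampler
    # instance reaches the same verdict for the same trace.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < base_rate


print(keep_trace("trace-1", had_error=True, duration_ms=40, p99_threshold_ms=800))
print(keep_trace("trace-2", had_error=False, duration_ms=950, p99_threshold_ms=800))
```

Deriving the keep/drop decision from the trace ID keeps sampling consistent when several sampler replicas each see part of the traffic.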

W3C TraceContext: Trace Propagation Standard

Distributed tracing requires trace and span IDs to flow across service boundaries via HTTP headers. W3C Trace Context (a W3C Recommendation, not an IETF RFC) defines the standard:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ ^^
        version        trace-id (128-bit)         parent-span-id  flags

Every service reads this header, creates a child span whose parent is the received parent-span-id, and forwards the same trace-id to downstream calls with its own span ID in the parent-span-id position. No proprietary vendor format needed.
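
The read-then-propagate step can be sketched with the standard library alone. This is a simplified illustration, not a complete implementation of the specification (it ignores `tracestate` and version-specific rules); `parse_traceparent` and `child_traceparent` are hypothetical helper names.

```python
import re
import secrets

# version - trace-id (32 hex) - parent-id (16 hex) - flags, all lowercase hex
TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: str) -> dict[str, str]:
    match = TRACEPARENT.fullmatch(header)
    if match is None:
        raise ValueError("malformed traceparent header")
    version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero IDs are not valid")
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming: str) -> tuple[str, str]:
    """Return (header for the downstream call, this service's new span ID)."""
    ctx = parse_traceparent(incoming)
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    outgoing = f"{ctx['version']}-{ctx['trace_id']}-{span_id}-{ctx['flags']}"
    return outgoing, span_id


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing, span_id = child_traceparent(incoming)
print(outgoing)
```

The trace-id and flags pass through unchanged; only the span ID field changes at each hop, which is what lets a backend rebuild the parent-child tree.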

Tracing Tools Comparison

| Dimension | Jaeger | Zipkin | Grafana Tempo |
| --- | --- | --- | --- |
| Storage backends | Cassandra, Elasticsearch, in-memory | MySQL, Cassandra, Elasticsearch | Object storage (S3, GCS, Azure Blob) |
| UI | Full-featured: trace timeline, service dependency graph | Simpler: trace list + timeline | Minimal native UI; relies on Grafana |
| Sampling | Head-based and adaptive (remote controlled) | Head-based | Delegated to OTel Collector |
| OpenTelemetry | Native OTLP support | Via adapter | Native OTLP (primary protocol) |
| Integration | CNCF project, Kubernetes-native | Widely adopted, many client libs | Grafana stack (pairs with Prometheus + Loki) |
| Scale | Medium-large | Medium | Very large (object storage = cheap at scale) |
| Deployment | Moderate complexity | Simple | Simple (no search index needed) |
| Cost | Free (Elasticsearch costs extra) | Free | Free (storage cost only) |
| Best for | Kubernetes microservices, CNCF stack | Existing Zipkin instrumentation | Grafana-centric observability stacks |

OpenTelemetry

OpenTelemetry (OTel) is a CNCF project that provides a vendor-neutral standard for instrumentation, collection, and export of telemetry data (traces, metrics, and logs) under a single unified API and SDK.

Before OTel, each observability vendor required its own SDK. Switching from Jaeger to Datadog meant re-instrumenting every service. OpenTelemetry solves this with a single instrumentation layer and pluggable exporters.

Three Pillars Unified

OTLP (OpenTelemetry Protocol) is the wire format. It runs over gRPC (port 4317) or HTTP/protobuf (port 4318). Any backend that speaks OTLP can receive OTel data.

OTel Collector Architecture

The Collector is an optional but recommended middle tier. It decouples instrumented apps from backend destinations.

Collector benefits:

  • Tail-based sampling at the Collector level (buffer spans, sample on outcome)
  • Data transformation: filter PII from logs, rename labels, add resource attributes
  • Multi-backend export: send same data to two backends simultaneously (migration, redundancy)
  • Decoupled upgrades: swap backends without touching application code

Auto-Instrumentation vs Manual Instrumentation

| Dimension | Auto-Instrumentation | Manual Instrumentation |
| --- | --- | --- |
| How | Agent/bytecode injection at startup, no code changes | Developer adds tracer.start_span() calls in code |
| Effort | Zero code changes | Per-operation instrumentation required |
| Coverage | HTTP clients, DB drivers, frameworks (Spring, Express, Django) | Any custom business logic, critical paths |
| Span quality | Generic (framework-level, missing business context) | Rich (custom attributes: user_id, order_id, feature flags) |
| Latency | Slight overhead from agent | Minimal (only instrumented operations) |
| Maintenance | Agent version updates | Code changes when logic changes |
| Best for | Getting started, standardizing infrastructure calls | High-value business operations, SLO-critical paths |

Recommendation: Use auto-instrumentation as the baseline (catches all HTTP and DB calls), then add manual spans around critical business operations (payment processing, fraud check, recommendation engine) where custom attributes are needed for debugging.

OTel Language Support

OTel provides SDKs for all major languages: Java, Python, Go, JavaScript/Node, .NET, Ruby, Rust, C++, PHP. Auto-instrumentation is most mature for Java (via Java agent) and Node.js (via @opentelemetry/auto-instrumentations-node).

Cross-reference: Chapter 16: Reliability Patterns for SLO-driven reliability engineering. Chapter 23: Cloud-Native Architecture for OTel in Kubernetes.


eBPF-Powered Observability (2026)

Extended Berkeley Packet Filter (eBPF) allows safe, sandboxed programs to run inside the Linux kernel without modifying kernel source or loading kernel modules. In observability, this unlocks something previously impossible: full application telemetry (function calls, network flows, file I/O, system calls) with near-zero overhead and zero application code changes.

As of 2026, CNCF survey data shows 67% of teams running Kubernetes at scale have adopted at least one eBPF observability tool, with 300% year-on-year growth. eBPF is no longer an advanced topic; it is the practical answer to the "how do I instrument a polyglot service mesh without touching every service's code?" problem.

Why eBPF Changes the Observability Model

Traditional observability requires either:

  • SDK instrumentation: developers add tracing/metrics code to each service (coupled, language-specific)
  • Sidecar proxies: Envoy/Linkerd sidecars intercept traffic (works for L4/L7 but adds 50-100ms cold overhead and memory per pod)

eBPF operates at the kernel level, intercepting events from any process on the node, regardless of language, framework, or whether a sidecar is present. Key advantages:

| Property | Traditional Agent | eBPF-Based |
| --- | --- | --- |
| CPU overhead | 5-15% | < 1% |
| Code changes required | Yes (SDK) | None |
| Language-agnostic | No | Yes |
| Kernel/syscall visibility | No | Yes (security events, file I/O) |
| Precision | Millisecond | Microsecond |
| Works with sidecars | Complements | Can replace for L4 visibility |

The 2026 eBPF Observability Stack

Key tools in the stack (as of 2026):

  • Cilium: eBPF-native Container Network Interface (CNI). Provides L3-L7 network policy, service mesh connectivity, and full network flow visibility via its companion UI Hubble. When combined with Istio ambient mode (see Ch13), Cilium handles the data plane while Istio handles control-plane policy, replacing per-pod sidecar proxies entirely.
  • Tetragon: security-focused eBPF tool from Isovalent (Cilium's creators). Captures process execution, file access, and network connections at the syscall level. Useful for runtime threat detection without a dedicated security agent.
  • Pixie: auto-instrumented APM (Application Performance Monitoring) from New Relic. Captures HTTP/1.1, HTTP/2, gRPC, MySQL, PostgreSQL, and Redis traffic, including request/response bodies, without any SDK. Run px run px/http_data and get per-endpoint latency histograms in seconds.
  • Grafana Beyla: generates OTel-compatible spans from eBPF-intercepted HTTP, gRPC, and SQL calls. Designed to feed the OTel Collector directly, plugging eBPF-sourced traces into a standard Grafana/Tempo/Jaeger stack with zero application modification.

When to Adopt eBPF Observability

| Situation | Recommendation |
| --- | --- |
| Polyglot microservices (Go, Java, Python, Rust mix) | Strong fit: no per-language SDK needed |
| Kubernetes-native deployments | Strong fit: node-level eBPF agents deploy as DaemonSets |
| Security compliance requiring syscall audit | Use Tetragon for runtime enforcement |
| Replacing sidecar service mesh overhead | Cilium eBPF + ambient Istio reduces memory 90%, latency 25% |
| Legacy monolith on bare metal (non-Kubernetes) | Limited: eBPF tools are Kubernetes-optimized |
| Teams already invested in OTel SDK instrumentation | Complement with Beyla/Pixie; don't replace rich manual spans |

eBPF Observability and OTel are Complementary

eBPF tools like Beyla and Pixie export OTel-compatible telemetry, so they feed into the same OTel Collector pipeline as SDK-based instrumentation. The pattern for 2026: use eBPF for infrastructure-level and network-level visibility (zero effort), and use OTel SDKs for business-logic-level spans (high-value custom attributes like order_id, user_tier). Both flow into the same Grafana/Jaeger/Tempo backend.


Log Aggregation Pipeline

Individual service logs are useless if you cannot search them across all instances. A log aggregation pipeline centralizes logs from every container, VM, and serverless function.

Structured logging (JSON over plaintext) is essential for the search step. A structured log line:

json
{
  "timestamp": "2026-03-12T00:40:00Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "abc123",
  "user_id": "user_789",
  "message": "Payment timeout after 5000ms",
  "duration_ms": 5000
}

This allows queries like: level:ERROR AND service:order-service AND duration_ms:>3000.
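
Emitting lines in that shape does not require a logging framework change. The sketch below uses only Python's standard `logging` module; `JsonFormatter` is a hypothetical class name, and the choice of which extra fields to copy (`trace_id`, `user_id`, `duration_ms`) is an assumption taken from the sample line above.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for key in ("trace_id", "user_id", "duration_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="order-service"))
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment timeout after 5000ms",
    extra={"trace_id": "abc123", "user_id": "user_789", "duration_ms": 5000},
)
```

Keeping `duration_ms` as a number rather than embedding it in the message text is what makes range queries such as `duration_ms:>3000` possible downstream.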


SLI / SLO / SLA

Google's Site Reliability Engineering introduced the SLI → SLO → SLA hierarchy as a way to make reliability quantitative and contractual.

Definitions

| Concept | Owner | Consequence of breach | Example |
| --- | --- | --- | --- |
| SLI | Engineering | None; it is a measurement | 99.92% of requests succeeded this month |
| SLO | Engineering | Internal alert, error budget consumed | Target: 99.9% success rate |
| SLA | Business/Legal | Financial penalty, contract clause | Guarantee: 99.5% or credit issued |

SLO is always stricter than SLA. The gap between SLO (internal target) and SLA (contractual guarantee) is the safety buffer: if engineering hits 99.8% and the SLO was 99.9%, the team is alerted and investigates before breaching the 99.5% SLA.

Common SLI Examples by Metric Type

| Metric Type | Example SLI | Example SLO | Measurement Method |
| --- | --- | --- | --- |
| Availability | Fraction of successful HTTP requests (2xx/3xx) | 99.9% success rate over 30 days | Synthetic probes + real traffic |
| Latency | p99 response time | p99 < 500ms over 1-hour window | Histogram (Prometheus histogram_quantile) |
| Error rate | Fraction of 5xx responses | < 0.1% error rate | Error counter / total request counter |
| Throughput | Requests processed per second | > 1,000 RPS sustained | Gauge metric on queue consumer |
| Freshness | Age of most recent data ingested | Data lag < 5 minutes | Timestamp comparison metric |
| Durability | Fraction of written objects successfully retrieved | 99.999999999% (11 nines) | Periodic read-back verification |

Error Budgets

An error budget is the allowable unreliability within an SLO period:

Error budget = 1 − SLO target
99.9% SLO → 0.1% budget → 43.8 minutes/month of allowed downtime
99.99% SLO → 0.01% budget → 4.38 minutes/month
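
The arithmetic behind those figures is short enough to check directly. A minimal sketch, assuming an average month of 365.25 / 12 days (about 43,830 minutes), which is the convention that reproduces the numbers above; `error_budget_minutes` is a hypothetical helper name.

```python
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12  # average month, about 43,830 minutes


def error_budget_minutes(slo: float, period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Minutes of full downtime the SLO tolerates within one period."""
    return (1 - slo) * period_minutes


for slo in (0.999, 0.9999):
    print(f"SLO {slo} -> {error_budget_minutes(slo):.2f} minutes/month")
```

Passing a different `period_minutes` (for example `7 * 24 * 60` for a week) gives the budget for other windows.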

The Nines: Downtime Allowance per SLO Target

| SLO Target | Downtime per Month | Downtime per Year | Downtime per Week | Common Name |
| --- | --- | --- | --- | --- |
| 99% | 7.3 hours | 3.65 days | 1.68 hours | Two nines |
| 99.5% | 3.65 hours | 1.83 days | 50.4 minutes | - |
| 99.9% | 43.8 minutes | 8.77 hours | 10.1 minutes | Three nines |
| 99.95% | 21.9 minutes | 4.38 hours | 5.04 minutes | - |
| 99.99% | 4.38 minutes | 52.6 minutes | 60.5 seconds | Four nines |
| 99.999% | 26.3 seconds | 5.26 minutes | 6.05 seconds | Five nines |
| 99.9999% | 2.63 seconds | 31.6 seconds | 0.605 seconds | Six nines |

Cost of an additional nine: As a rule of thumb, each additional nine of availability roughly doubles infrastructure cost and operational complexity. Going from 99.9% to 99.99% cuts the allowed downtime tenfold (43.8 to 4.38 minutes per month), which in practice means eliminating planned maintenance windows, running active-active across regions, and failing over in under a minute. Most SaaS products target 99.9% (about 43 min/month), which is achievable with a single region and good health checks. Five nines (26s/month) requires active-active multi-region with automated failover in under 10 seconds.

Error budget policy: When the budget is exhausted, new feature deployments halt and reliability work takes priority. This creates a natural feedback loop: engineering teams that want to ship features are incentivized to keep the service reliable.

Real-World (Google SRE): Google's SRE teams hold joint ownership of error budgets with product teams. If a service exhausts its error budget, the SRE team can unilaterally halt launches. This removes the "reliability vs. velocity" organizational conflict by making reliability a shared engineering metric.


Connecting SLOs to Observability Signals

SLOs require SLIs, which require the right observability signals.

Burn rate alerts: Rather than alerting when the budget is fully exhausted, alert when the consumption rate predicts exhaustion. Burn rate is the observed error rate divided by the rate the SLO allows, so a burn rate of 1× spends a 30-day budget in exactly 30 days and 14.4× spends it in about 2 days. Page on fast burn, such as 14.4× sustained over a 1-hour window (2% of the 30-day budget gone in one hour), and raise a lower-urgency alert on slow burn, such as 2× (about 15 days to exhaustion). This is the multi-window, multi-burn-rate alerting pattern from Google's SRE Workbook.
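
The burn-rate numbers follow from three small formulas. A minimal sketch, assuming a 30-day SLO period; the function names are hypothetical and the 1.44% error ratio is an invented example.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1 - slo)


def days_to_exhaustion(rate: float, period_days: float = 30) -> float:
    """Days until the whole budget is gone if this burn rate persists."""
    return period_days / rate


def budget_consumed(rate: float, window_hours: float, period_days: float = 30) -> float:
    """Fraction of the period's budget burned during one alert window."""
    return rate * window_hours / (period_days * 24)


# Example: 1.44% of requests failing against a 99.9% SLO.
rate = burn_rate(error_ratio=0.0144, slo=0.999)
print(f"burn rate          : {rate:.1f}x")
print(f"days to exhaustion : {days_to_exhaustion(rate):.2f}")
print(f"budget used in 1h  : {budget_consumed(rate, window_hours=1):.1%}")
```

Framing alert thresholds as "fraction of budget consumed per window" makes it easier to justify why a given burn rate deserves a page rather than a ticket.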


Alerting Strategies

Alert Fatigue

The biggest failure mode in alerting is alert fatigue: too many low-signal alerts cause engineers to ignore them, including critical ones. Symptoms:

  • On-call engineers acknowledge without investigating
  • Alert volume exceeds 10/day on average
  • Many alerts resolve without human action

Solution: Every alert must be actionable. If an alert fires and no action is required, delete or demote it.

Severity Levels

| Severity | Definition | Response SLA | Example |
| --- | --- | --- | --- |
| P1 / Critical | Service down, revenue impact | Wake on-call immediately, < 5 min | Payment API returning 500 |
| P2 / High | Degraded, SLO at risk | Alert on-call during business hours, < 30 min | p99 latency > 2s |
| P3 / Medium | Anomaly, no immediate user impact | Ticket, fix in sprint | Disk > 80% on non-critical host |
| P4 / Low | Informational | Review weekly | Dependency approaching end of support |

On-Call Best Practices

  • Rotate on-call weekly to distribute burden and knowledge
  • Keep runbooks for every P1/P2 alert; documented steps reduce MTTR
  • Conduct blameless post-mortems within 48 hours of incidents
  • Track Mean Time To Detect (MTTD) and Mean Time To Resolve (MTTR) as team metrics
  • Alert on symptoms (user-visible impact) not causes (CPU high) where possible

Health Checks

Health checks allow orchestrators (Kubernetes, load balancers) to route traffic away from unhealthy instances automatically. Three probe types:

| Probe | Question | Failure Action | Example Endpoint |
| --- | --- | --- | --- |
| Startup | Has the app finished initializing? | Wait (don't kill yet); liveness and readiness checks are held off until it passes | /health/startup: checks migrations complete |
| Liveness | Is the process alive and not deadlocked? | Restart the container | /health/live: returns 200 if process is responsive |
| Readiness | Can the app serve traffic right now? | Remove from LB pool | /health/ready: checks DB connection, cache connection |

Cross-reference: Chapter 6 covers how load balancers use health checks to remove unhealthy backends from rotation.

Readiness check design: Be conservative. If your app cannot reach its database, it should fail readiness β€” sending traffic that will fail is worse than not sending traffic at all. However, a slow downstream service should not fail readiness if the app can degrade gracefully.
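
One way to encode that rule is to split dependencies into critical and optional sets and let only the critical ones decide the status code. The sketch below is framework-agnostic and hypothetical throughout: `readiness`, the dependency names, and the lambda checks stand in for real connection tests.

```python
from typing import Callable

Check = Callable[[], bool]


def _passes(check: Check) -> bool:
    """Run one dependency check; an exception counts as a failure."""
    try:
        return bool(check())
    except Exception:
        return False


def readiness(critical: dict[str, Check],
              optional: dict[str, Check]) -> tuple[int, dict[str, str]]:
    """Return the HTTP status and per-dependency detail for /health/ready."""
    detail: dict[str, str] = {}
    ready = True
    for name, check in critical.items():
        ok = _passes(check)
        detail[name] = "ok" if ok else "down"
        ready = ready and ok
    for name, check in optional.items():
        # Reported for visibility, but never flips the readiness verdict.
        detail[name] = "ok" if _passes(check) else "degraded"
    return (200 if ready else 503), detail


# Placeholder checks: the database and cache are up, a slow API is not.
status, detail = readiness(
    critical={"postgres": lambda: True, "redis": lambda: True},
    optional={"payment_api": lambda: False},
)
print(status, detail)
```

Real checks should also carry a short timeout so that a hanging dependency cannot make the probe itself time out.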


Tools Comparison

| Tool | Category | What It Does | Deployment | Strengths | Weaknesses | Cost |
| --- | --- | --- | --- | --- | --- | --- |
| Prometheus | Metrics | Pull-based metrics collection, storage, PromQL | Self-hosted | CNCF standard, powerful query language | No long-term storage built-in, cardinality limits | Free |
| Grafana | Visualization | Dashboards for metrics, logs, traces from many sources | Self-hosted / Cloud | Universal frontend, supports 50+ data sources | Requires data source backends | Free / Paid |
| Elasticsearch | Log storage | Distributed search and analytics engine | Self-hosted / Cloud | Full-text search, flexible schema | Resource-intensive, complex to operate | Free / Paid |
| Logstash | Log processing | ETL pipeline for logs: parse, filter, enrich | Self-hosted | Powerful filter plugins | Heavy JVM resource usage | Free |
| Jaeger | Tracing | Distributed trace collection, storage, UI | Self-hosted | CNCF, OpenTelemetry compatible | No metrics, no logs | Free |
| Datadog | All-in-one APM | Metrics + logs + traces + APM + alerting | SaaS | Low operational overhead, fast setup | Expensive at scale | Per-host pricing |
| New Relic | All-in-one APM | Full-stack observability, error tracking | SaaS | Good out-of-box instrumentation | Cost scales with data ingest | Per-GB ingest |
| AWS CloudWatch | Cloud-native | Metrics + logs for AWS resources | SaaS (AWS) | Zero setup for AWS services | Vendor lock-in, limited query capability | Per metric/log |

Practical guidance:

  • Startups: Datadog or New Relic for speed of setup
  • Mid-size, cost-conscious: Prometheus + Grafana + ELK + Jaeger (more ops burden, much cheaper)
  • AWS-native: CloudWatch + X-Ray + managed Prometheus/Grafana
  • OpenTelemetry: Use the vendor-neutral OTLP standard for instrumentation, so you can swap backends without re-instrumenting code

Real-World (Netflix Atlas): Netflix built Atlas, its internal metrics platform, to handle billions of time series from thousands of services. Atlas uses in-memory storage optimized for real-time dashboards and pattern-matching queries across tag dimensions. Netflix open-sourced Atlas; like Prometheus, it is built around a dimensional, tag-based data model.


Trade-offs & Comparisons

| Decision | Option A | Option B | Recommendation |
| --- | --- | --- | --- |
| Metrics storage | Prometheus (self-hosted) | Datadog (SaaS) | SaaS while the bill (roughly under $5K/month) is smaller than the engineering cost of running Prometheus yourself |
| Log sampling | Store all logs | Sample + retain errors | Sample at high volume (>10GB/day) |
| Trace sampling | Head-based (simple) | Tail-based (smart) | Tail-based if budget allows; it captures all errors |
| SLO target | 99.9% (43 min/month budget) | 99.99% (4 min/month budget) | Higher SLO = higher infra cost, diminishing returns |
| Alert strategy | Alert on causes (high CPU) | Alert on symptoms (error rate) | Symptom-based reduces noise |

Key Takeaway: Observability is the foundation of reliability. You cannot improve what you cannot measure, and you cannot debug what you cannot trace. Instrument before you need it, because adding tracing during an incident is too late. The four signals (metrics, logs, traces, and the emerging profiling signal) are complements, not substitutes. In 2026, OpenTelemetry is the default instrumentation layer; use it from day one so backends are swappable. For Kubernetes deployments, pair OTel SDKs with eBPF tools (Pixie, Beyla, Cilium) to achieve full-stack visibility without per-service agent sprawl.


Incident Management Lifecycle

The Five Phases

Detection

  • Automated alerts, not human discovery: if an engineer finds the issue before an alert fires, alerting has failed
  • Alert on symptoms, not causes: alert on error rate or latency degradation, not "CPU > 80%" (which may be harmless)
  • Multi-signal detection: a latency spike confirmed by both metrics and traces is higher confidence than a single signal; reduce false positives by requiring two or more signals to agree

Triage

Classify severity within minutes of detection to determine escalation path and response urgency:

| Severity | Impact | Response Time | Example |
| --- | --- | --- | --- |
| P0 / SEV1 | Total outage, data loss | Immediate, all hands | Payment system down |
| P1 / SEV2 | Major feature broken | 15 min | Login failing for 50% of users |
| P2 / SEV3 | Minor feature degraded | 1 hour | Search results slow |
| P3 / SEV4 | Cosmetic / low impact | Next business day | Dashboard chart incorrect |


Designing Effective Dashboards

The Four Golden Signals (Google SRE)

Every service dashboard should lead with these four panels; they cover the majority of user-visible failure modes:

  • Latency: how long requests take (show p50, p95, p99; never just the average)
  • Traffic: how much demand the system is under (requests/sec, events/sec)
  • Errors: rate of failed requests (5xx, timeouts, explicit errors)
  • Saturation: how "full" the service is (CPU %, queue depth, connection pool usage)

If a service is degraded, at least one of these four will deviate from baseline. Start your dashboard design here; add domain-specific panels only as supplements.

Dashboard Anti-patterns

| Anti-pattern | Problem | Fix |
| --- | --- | --- |
| Too many panels | Information overload slows incident response | Max 8-10 panels per dashboard; link to drill-down dashboards |
| Only averages | Hides tail latency affecting real users | Always show p50, p95, p99 side by side |
| No baseline | Cannot tell if a value is normal or alarming | Add SLO threshold lines and historical comparison overlays |
| Wall of text | Slow to scan under pressure | Use time-series graphs and stat panels, not tables of raw numbers |

Cost-Efficient Observability

The Cardinality Problem

High-cardinality labels cause metric storage to explode. Each unique label combination creates a separate time series, and every series then stores one sample per scrape for the whole retention period:

1,000 unique user_id values
  × 100 metrics per user
= 100,000 time series
  × 43,200 samples per series (60-second resolution, 30-day retention)
= 4.32 billion stored samples → storage and query cost balloons

Never use user_id, request_id, or session_id as Prometheus label values. Reserve labels for low-cardinality dimensions: service, method, status_code, region.
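
Cardinality is multiplicative, which is why one unbounded label is enough to cause trouble. The sketch below computes a worst-case upper bound (real systems only create series for label combinations that actually occur); the function names and the label counts other than the 1,000 users and 100 metrics are assumptions for illustration.

```python
from math import prod


def series_count(metric_names: int, label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series: every label value combined with every other."""
    return metric_names * prod(label_cardinalities.values())


def stored_samples(series: int, scrape_interval_s: int, retention_days: int) -> int:
    """Samples kept on disk: one per series per scrape over the retention window."""
    return series * (retention_days * 86_400 // scrape_interval_s)


low = series_count(100, {"method": 5, "status_code": 6})
high = series_count(100, {"method": 5, "status_code": 6, "user_id": 1_000})
print(f"without user_id: {low:>12,} series")
print(f"with user_id   : {high:>12,} series")
print(f"samples at 60s / 30d: {stored_samples(high, 60, 30):,}")
```

Adding the `user_id` label multiplies every existing series by the number of users, so the cost grows with the user base rather than with the number of services.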

Strategies

  • Sampling: Collect 1% of traces in normal production traffic; use tail-based sampling to capture 100% of error traces and traces exceeding the p99 latency threshold. You get full coverage where it matters at a fraction of the cost
  • Aggregation: Pre-aggregate metrics at collection time in the OTel Collector (e.g., sum request counts by service) rather than storing raw per-request data
  • Retention tiers: Hot storage (7 days, full resolution) → Warm storage (30 days, downsampled to 1-minute intervals) → Cold storage (1 year, aggregated to hourly). Most incidents are investigated within 7 days; cold data is for trend analysis only
  • Log levels: Emit DEBUG logs only in development environments; INFO, WARN, and ERROR in production. A single verbose service can generate gigabytes of low-value log data per day

| Chapter | Relevance |
| --- | --- |
| Ch16: Security & Reliability | Reliability SLOs and incident response complement observability |
| Ch13: Microservices | Distributed tracing across microservice boundaries |
| Ch23: Cloud-Native | Cloud-native monitoring: Prometheus, Grafana, CloudWatch |

Practice Questions

Beginner

  1. Distributed Tracing: A microservices request takes 3 seconds end-to-end, but each individual service logs less than 100ms of processing time. How would you use distributed tracing (spans, trace IDs) to locate the missing ~2.7 seconds? What are the most common hidden latency sources in microservice chains?

    Hint: Spans capture wall-clock time including network hops and queue wait time that individual service logs don't measure. Look for gaps between the end of one span and the start of the next child span.

Intermediate

  1. Error Budget: Your team's SLO is 99.9% availability (43.8 min/month error budget). After a 2-hour outage, the SRE lead says you have "used 2.7× your monthly error budget in one incident." What does this mean operationally: what features or deployments must now be frozen, and for how long?

    Hint: Burning the error budget triggers a freeze on non-critical feature releases until the budget resets (typically monthly); the team must focus entirely on reliability improvements before new features ship.
  2. Readiness Probe Design: You are designing a readiness probe for a service that depends on PostgreSQL, Redis, and a third-party payment API. The payment API is sometimes slow (2-5s). How do you design the probe so a slow payment API does not remove your service from the load balancer rotation?

    Hint: Separate critical dependencies (PostgreSQL and Redis, which the service needs in order to function) from non-critical ones (payment API); probe only critical deps for readiness, and use a separate circuit breaker for the payment API.
  3. Observability Stack Decision: Compare Prometheus + Grafana (self-hosted) vs Datadog (SaaS) for a team of 5 engineers running 50 microservices. What hidden costs on each side are rarely surfaced in vendor comparisons?

    Hint: Prometheus hidden costs: storage sizing, alert manager maintenance, and engineering time managing the stack; Datadog hidden costs: per-host + per-custom-metric pricing that scales steeply with microservice count and cardinality.

Advanced

  1. Alert Noise Reduction: Your on-call engineer receives 200 alerts per week: 180 auto-resolve in 10 minutes, 15 require investigation but no action, and 5 require actual fixes. Design an alert restructuring plan (severity tiers, grouping, inhibition rules) to reduce noise while ensuring no critical alert is missed.

    Hint: Demote self-resolving alerts to warnings or eliminate them; add alert inhibition (suppress child alerts when a parent alert fires); use Alertmanager grouping to collapse 50 pod-restart alerts into one service-level alert. Target a ratio where >80% of pages require action.

References & Further Reading

  • "Site Reliability Engineering" (Google SRE Book), chapters on Monitoring Distributed Systems and Alerting: https://sre.google/sre-book/table-of-contents/
  • "Observability Engineering" by Charity Majors, Liz Fong-Jones, George Miranda (O'Reilly, 2022): the definitive guide to high-cardinality observability and the shift from monitoring to observability
  • OpenTelemetry documentation, the vendor-neutral instrumentation standard for traces, metrics, and logs: https://opentelemetry.io/docs/
  • "The Art of Monitoring" by James Turnbull: practical guide to modern monitoring pipelines with Prometheus and the ELK stack
  • Datadog blog, "Monitoring 101: Collecting the right data": https://www.datadoghq.com/blog/monitoring-101-collecting-data/
  • Google SRE Workbook, chapter on Incident Response, covering structured incident management, severity classification, and postmortem culture: https://sre.google/workbook/incident-response/
