
Chapter 17: Monitoring & Observability ​


Monitoring tells you when something is wrong. Observability tells you why. The difference is whether your system was built to be questioned.



The Three Pillars of Observability ​

Observability is built on three complementary signal types. Each answers different questions:

| Dimension | Metrics | Logs | Traces |
|---|---|---|---|
| What is it? | Numeric measurements over time | Timestamped text/structured records | Request journey across services |
| Granularity | Aggregated (not per-request) | Per-event | Per-request, cross-service |
| Best for | Dashboards, alerting, trends | Debugging specific events | Latency attribution, dependency mapping |
| When to use | "Is something wrong?" | "What happened at 14:03:22?" | "Which service added 800ms?" |
| Cost | Low (aggregates) | Medium (storage scales with volume) | High (sampling required at scale) |
| Common tools | Prometheus, Datadog, Atlas | ELK Stack, Loki, CloudWatch | Jaeger, Zipkin, AWS X-Ray |
| Retention | Weeks–months | Days–weeks | Hours–days (sampled) |
| Cardinality risk | High (too many labels = explosion) | Low | Medium |

You need all three. Metrics alone tell you latency spiked; logs tell you which user was affected; traces tell you which service in the call chain caused it.


Metrics Types ​

Prometheus and most metrics systems build on three fundamental types (Prometheus also offers a fourth, the summary, which precomputes quantiles client-side):

| Type | Definition | Example Use Cases | Example Values |
|---|---|---|---|
| Counter | Monotonically increasing value, only goes up (resets on restart) | Total HTTP requests, total errors, total bytes sent | `http_requests_total{method="GET", status="200"} 10482` |
| Gauge | Arbitrary value that can go up or down | CPU %, active connections, queue depth, memory usage | `queue_depth{queue="payments"} 42` |
| Histogram | Samples observations into configurable buckets, exposes `_count`, `_sum`, `_bucket` | Request latency distribution, request size | `http_duration_seconds_bucket{le="0.1"} 8234` |

Why Histograms Matter: Percentiles vs. Averages ​

Averages hide tail latency. A p99 latency of 2s means the slowest 1% of requests take at least 2 seconds — at scale, that can be thousands of users. Always alert on percentiles (p95, p99, p99.9) for latency metrics.

```
p50 = 50ms  ← Typical user
p95 = 200ms ← Most users
p99 = 800ms ← Tail (worst 1%)
```

Prometheus computes percentiles from histograms server-side using histogram_quantile(0.99, ...).
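To make the mechanics concrete, here is a minimal Python sketch of the same estimation `histogram_quantile` performs: find the bucket containing the target rank, then interpolate linearly within it. The bucket bounds and counts below are illustrative, not from a real system.

```python
def estimate_quantile(q, buckets):
    """Estimate the q-th quantile (0 < q < 1) from cumulative histogram
    buckets, mirroring Prometheus-style linear interpolation.
    `buckets` is a sorted list of (upper_bound_seconds, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total                      # target cumulative count
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # interpolate linearly inside the bucket containing the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Illustrative buckets: le=0.1s -> 8000, le=0.5s -> 9500, le=2.0s -> 10000
p99 = estimate_quantile(0.99, [(0.1, 8000), (0.5, 9500), (2.0, 10000)])  # ~1.7s
```

Note the estimate's accuracy depends entirely on bucket boundaries: if the true p99 falls in a wide bucket, the interpolated value can be far from reality, which is why choosing bucket bounds near your SLO thresholds matters.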


Distributed Tracing ​

In a microservices system, a single user request may fan out to 10+ services. Traditional logging β€” per-service β€” cannot answer "which service is slow." Distributed tracing reconstructs the full call tree.

Core Concepts ​

  • Trace: The complete journey of one request, from entry point to all leaf calls
  • Span: One unit of work within a trace (e.g., one service call, one DB query)
  • Correlation ID / Trace ID: A unique ID injected at the entry point and propagated via HTTP headers (X-Trace-ID) to every downstream call
  • Parent-child spans: Each span records its parent span ID, enabling tree reconstruction
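The parent-child mechanism above is what lets a tracing backend rebuild the call tree from flat span records. A toy illustration (the span dictionaries and field names here are hypothetical, not a real tracing schema):

```python
def build_span_tree(spans):
    """Rebuild the call tree from flat span records via parent_id links
    and render it with indentation, one span per line."""
    children, root = {}, None
    for s in spans:
        if s["parent_id"] is None:
            root = s
        else:
            children.setdefault(s["parent_id"], []).append(s)

    def render(span, depth=0):
        lines = ["    " * depth + f'{span["name"]} [{span["ms"]}ms]']
        for child in children.get(span["id"], []):
            lines += render(child, depth + 1)
        return lines

    return "\n".join(render(root))

spans = [  # hypothetical span records collected from three services
    {"id": "root", "parent_id": None,   "name": "GET /order",    "ms": 120},
    {"id": "s1",   "parent_id": "root", "name": "Auth",          "ms": 15},
    {"id": "s2",   "parent_id": "root", "name": "Order Service", "ms": 98},
    {"id": "s3",   "parent_id": "s2",   "name": "DB Query",      "ms": 45},
]
print(build_span_tree(spans))
```

Real backends like Jaeger do essentially this at query time, plus clock-skew adjustment between hosts.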

Span Tree Visualization ​

```
abc123 [120ms total]
├── span_001: Auth (User Service)  [15ms]
└── span_002: Order Service        [98ms]
    └── span_003: DB Query         [45ms]
```

This immediately surfaces that the Order Service span — 98ms of the 120ms total, 45ms of it spent in the DB query — dominates the request, making the optimization target obvious.

Sampling Strategies ​

At 10,000 req/sec, storing every trace is prohibitively expensive. Two approaches:

| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Head-based | Decide at trace start (random %) | Low overhead, simple | Misses rare errors |
| Tail-based | Buffer full trace, decide after completion based on outcome | Captures all errors/slow requests | Higher memory/processing cost |

Production recommendation: sample 1% normally, 100% of errors and traces > p99 latency threshold.
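That recommendation can be sketched as a per-trace keep/drop decision made after the trace completes. The 1% base rate and 800ms p99 threshold below are the illustrative values from the text, not universal defaults.

```python
import random

def keep_trace(has_error, duration_ms, p99_ms=800, base_rate=0.01):
    """Tail-based sampling decision, made once the trace has completed:
    keep every error and every slow trace, plus a small random sample
    of the rest."""
    if has_error or duration_ms > p99_ms:
        return True                       # 100% of errors and slow traces
    return random.random() < base_rate    # 1% of normal traffic
```

In practice this logic runs in a collector tier (e.g. the OTel Collector's tail-sampling processor) rather than in the application, since the decision requires buffering all spans of a trace.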

W3C TraceContext: Trace Propagation Standard ​

Distributed tracing requires trace and span IDs to flow across service boundaries via HTTP headers. W3C Trace Context (a W3C Recommendation) defines the standard:

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ ^^
          version        trace-id (128-bit)       parent-span-id  flags
```

Every service reads this header, creates a child span with the parent-span-id, and propagates the same trace-id to downstream calls. No proprietary vendor format needed.
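A minimal parser for the four-field format shown above; a production implementation would additionally validate hex characters and reject all-zero trace/span IDs, per the spec.

```python
def parse_traceparent(header):
    """Split a traceparent header into its four dash-separated fields."""
    version, trace_id, parent_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(parent_id) == 16  # 128-bit / 64-bit hex
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),  # lowest flag bit = sampled
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
# ctx["trace_id"] is propagated unchanged; each hop mints a new span ID
```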

Tracing Tools Comparison ​

| | Jaeger | Zipkin | Grafana Tempo |
|---|---|---|---|
| Storage backends | Cassandra, Elasticsearch, in-memory | MySQL, Cassandra, Elasticsearch | Object storage (S3, GCS, Azure Blob) |
| UI | Full-featured: trace timeline, service dependency graph | Simpler: trace list + timeline | Minimal native UI; relies on Grafana |
| Sampling | Head-based and adaptive (remote controlled) | Head-based | Delegated to OTel Collector |
| OpenTelemetry | Native OTLP support | Via adapter | Native OTLP (primary protocol) |
| Integration | CNCF project, Kubernetes-native | Widely adopted, many client libs | Grafana stack (pairs with Prometheus + Loki) |
| Scale | Medium–large | Medium | Very large (object storage = cheap at scale) |
| Deployment | Moderate complexity | Simple | Simple (no search index needed) |
| Cost | Free (Elasticsearch costs extra) | Free | Free (storage cost only) |
| Best for | Kubernetes microservices, CNCF stack | Existing Zipkin instrumentation | Grafana-centric observability stacks |

OpenTelemetry ​

OpenTelemetry (OTel) is a CNCF project that provides a vendor-neutral standard for instrumentation, collection, and export of telemetry data β€” traces, metrics, and logs β€” under a single unified API and SDK.

Before OTel, each observability vendor required its own SDK. Switching from Jaeger to Datadog meant re-instrumenting every service. OpenTelemetry solves this with a single instrumentation layer and pluggable exporters.

Three Pillars Unified ​

OTLP (OpenTelemetry Protocol) is the wire format. It runs over gRPC (port 4317) or HTTP/protobuf (port 4318). Any backend that speaks OTLP can receive OTel data.

OTel Collector Architecture ​

The Collector is an optional but recommended middle tier. It decouples instrumented apps from backend destinations:

Collector benefits:

  • Tail-based sampling at the Collector level (buffer spans, sample on outcome)
  • Data transformation: filter PII from logs, rename labels, add resource attributes
  • Multi-backend export: send same data to two backends simultaneously (migration, redundancy)
  • Decoupled upgrades: swap backends without touching application code

Auto-Instrumentation vs Manual Instrumentation ​

| Dimension | Auto-Instrumentation | Manual Instrumentation |
|---|---|---|
| How | Agent/bytecode injection at startup, no code changes | Developer adds `tracer.start_span()` calls in code |
| Effort | Zero code changes | Per-operation instrumentation required |
| Coverage | HTTP clients, DB drivers, frameworks (Spring, Express, Django) | Any custom business logic, critical paths |
| Span quality | Generic (framework-level, missing business context) | Rich (custom attributes: user_id, order_id, feature flags) |
| Latency | Slight overhead from agent | Minimal (only instrumented operations) |
| Maintenance | Agent version updates | Code changes when logic changes |
| Best for | Getting started, standardizing infrastructure calls | High-value business operations, SLO-critical paths |

Recommendation: Use auto-instrumentation as the baseline (catches all HTTP and DB calls), then add manual spans around critical business operations (payment processing, fraud check, recommendation engine) where custom attributes are needed for debugging.
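The manual-instrumentation pattern can be illustrated with a toy context-manager span. This is a stand-in for a real tracer API such as OpenTelemetry's `start_as_current_span`, not the actual SDK; the span names and attributes are hypothetical.

```python
import time
from contextlib import contextmanager

@contextmanager
def span(name, **attributes):
    """Toy manual span: times a block of business logic and attaches
    custom attributes to the span record."""
    record = {"name": name, "attributes": dict(attributes)}
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000

# Manual span around a critical operation, enriched with business context
with span("process_payment", user_id="user_789", order_id="ord_42") as s:
    s["attributes"]["amount_cents"] = 1999  # attribute added mid-operation
```

The business-context attributes (`user_id`, `order_id`) are exactly what auto-instrumentation cannot provide, and what makes a trace searchable during an incident.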

OTel Language Support ​

OTel provides SDKs for all major languages: Java, Python, Go, JavaScript/Node, .NET, Ruby, Rust, C++, PHP. Auto-instrumentation is most mature for Java (via Java agent) and Node.js (via @opentelemetry/auto-instrumentations-node).

Cross-reference: Chapter 16: Reliability Patterns for SLO-driven reliability engineering. Chapter 23: Cloud-Native Architecture for OTel in Kubernetes.


Log Aggregation Pipeline ​

Individual service logs are useless if you cannot search them across all instances. A log aggregation pipeline centralizes logs from every container, VM, and serverless function.

Structured logging (JSON over plaintext) is essential for the search step. A structured log line:

```json
{
  "timestamp": "2026-03-12T00:40:00Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "abc123",
  "user_id": "user_789",
  "message": "Payment timeout after 5000ms",
  "duration_ms": 5000
}
```

This allows queries like `level:ERROR AND service:order-service AND duration_ms:>3000`.
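A sketch of emitting such lines with Python's stdlib `logging`; the `fields` attribute is a convention invented here for passing structured extras, and real setups often use a dedicated library such as structlog or python-json-logger instead.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object so aggregators can index fields."""
    def format(self, record):
        line = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }
        line.update(getattr(record, "fields", {}))  # structured extras, if any
        return json.dumps(line)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("Payment timeout after 5000ms",
             extra={"fields": {"trace_id": "abc123", "duration_ms": 5000}})
```

Including `trace_id` in every log line is what links the logs pillar to the traces pillar: during an incident you can pivot from a trace straight to the matching log lines.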


SLI / SLO / SLA ​

Google's Site Reliability Engineering introduced the SLI β†’ SLO β†’ SLA hierarchy as a way to make reliability quantitative and contractual:

Definitions ​

| Concept | Owner | Consequence of breach | Example |
|---|---|---|---|
| SLI | Engineering | None — it is a measurement | 99.92% of requests succeeded this month |
| SLO | Engineering | Internal alert, error budget consumed | Target: 99.9% success rate |
| SLA | Business/Legal | Financial penalty, contract clause | Guarantee: 99.5% or credit issued |

SLO is always stricter than SLA. The gap between SLO (internal target) and SLA (contractual guarantee) is the safety buffer β€” if engineering hits 99.8% and the SLO was 99.9%, the team is alerted and investigates before breaching the 99.5% SLA.

Common SLI Examples by Metric Type ​

| Metric Type | Example SLI | Example SLO | Measurement Method |
|---|---|---|---|
| Availability | Fraction of successful HTTP requests (2xx/3xx) | 99.9% success rate over 30 days | Synthetic probes + real traffic |
| Latency | p99 response time | p99 < 500ms over 1-hour window | Histogram (Prometheus `histogram_quantile`) |
| Error rate | Fraction of 5xx responses | < 0.1% error rate | Error counter / total request counter |
| Throughput | Requests processed per second | > 1,000 RPS sustained | Gauge metric on queue consumer |
| Freshness | Age of most recent data ingested | Data lag < 5 minutes | Timestamp comparison metric |
| Durability | Fraction of written objects successfully retrieved | 99.999999% (11 nines) | Periodic read-back verification |

Error Budgets ​

An error budget is the allowable unreliability within an SLO period:

```
Error budget = 1 − SLO target
99.9%  SLO → 0.1%  budget → 43.8 minutes/month of allowed downtime
99.99% SLO → 0.01% budget → 4.38 minutes/month
```

The Nines: Downtime Allowance per SLO Target ​

| SLO Target | Downtime per Month | Downtime per Year | Downtime per Week | Common Name |
|---|---|---|---|---|
| 99% | 7.3 hours | 3.65 days | 1.68 hours | Two nines |
| 99.5% | 3.65 hours | 1.83 days | 50.4 minutes | — |
| 99.9% | 43.8 minutes | 8.77 hours | 10.1 minutes | Three nines |
| 99.95% | 21.9 minutes | 4.38 hours | 5.04 minutes | — |
| 99.99% | 4.38 minutes | 52.6 minutes | 60.5 seconds | Four nines |
| 99.999% | 26.3 seconds | 5.26 minutes | 6.05 seconds | Five nines |
| 99.9999% | 2.63 seconds | 31.6 seconds | 0.605 seconds | Six nines |

Cost of an additional nine: Each additional nine of availability roughly doubles infrastructure cost and operational complexity. Going from 99.9% to 99.99% is not an incremental tweak — it requires eliminating every planned maintenance window, running active-active across multiple regions, and achieving sub-minute failover. Most SaaS products target 99.9% (43 min/month), which is achievable with a single region plus good health checks. Five nines (26 seconds/month) requires active-active multi-region with automated failover in under 10 seconds.

Error budget policy: When the budget is exhausted, new feature deployments halt and reliability work takes priority. This creates a natural feedback loop: engineering teams that want to ship features are incentivized to keep the service reliable.

Real-World β€” Google SRE: Google's SRE teams hold joint ownership of error budgets with product teams. If a service exhausts its error budget, the SRE team can unilaterally halt launches. This removes the "reliability vs. velocity" organizational conflict by making reliability a shared engineering metric.


Connecting SLOs to Observability Signals ​

SLOs require SLIs, which require the right observability signals.

Burn rate alerts: Rather than alerting when the budget is fully exhausted, alert when the consumption rate predicts exhaustion. A burn rate of 14.4× means you will exhaust a 30-day budget in about 2 days. Alert at a 2× burn rate (budget gone in 15 days — a slow-burn warning) and at 14.4× (gone in ~2 days — an urgent page). This is the multi-window, multi-burn-rate alerting pattern from Google's SRE Workbook.
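The burn-rate arithmetic as a sketch (the 1.44% error rate is an illustrative input, chosen so the numbers match a 14.4× burn against a 99.9% SLO):

```python
def burn_rate(observed_error_rate, slo_target=0.999):
    """Budget consumption speed: observed error fraction divided by the
    allowed error fraction (1 - SLO). 1.0 means exactly on budget."""
    return observed_error_rate / (1 - slo_target)

def days_to_exhaustion(rate, window_days=30):
    """At a constant burn rate, when does the window's budget run out?"""
    return window_days / rate

rate = burn_rate(0.0144)         # 1.44% errors vs 99.9% SLO -> ~14.4x
days = days_to_exhaustion(rate)  # -> ~2 days
```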


Alerting Strategies ​

Alert Fatigue ​

The biggest failure mode in alerting is alert fatigue β€” too many low-signal alerts cause engineers to ignore them, including critical ones. Symptoms:

  • On-call engineers acknowledge without investigating
  • Alert volume exceeds 10/day on average
  • Many alerts resolve without human action

Solution: Every alert must be actionable. If an alert fires and no action is required, delete or demote it.

Severity Levels ​

| Severity | Definition | Response SLA | Example |
|---|---|---|---|
| P1 / Critical | Service down, revenue impact | Wake on-call immediately, < 5 min | Payment API returning 500 |
| P2 / High | Degraded, SLO at risk | Alert on-call during business hours, < 30 min | p99 latency > 2s |
| P3 / Medium | Anomaly, no immediate user impact | Ticket, fix in sprint | Disk > 80% on non-critical host |
| P4 / Low | Informational | Review weekly | Dependency approaching end of support |

On-Call Best Practices ​

  • Rotate on-call weekly to distribute burden and knowledge
  • Keep runbooks for every P1/P2 alert β€” reduce MTTR with documented steps
  • Conduct blameless post-mortems within 48 hours of incidents
  • Track Mean Time To Detect (MTTD) and Mean Time To Resolve (MTTR) as team metrics
  • Alert on symptoms (user-visible impact) not causes (CPU high) where possible

Health Checks ​

Health checks allow orchestrators (Kubernetes, load balancers) to route traffic away from unhealthy instances automatically. Three probe types:

| Probe | Question | Failure Action | Example Endpoint |
|---|---|---|---|
| Startup | Has the app finished initializing? | Wait (don't kill yet) | `/health/startup` — checks migrations complete |
| Liveness | Is the process alive and not deadlocked? | Restart the container | `/health/live` — returns 200 if process is responsive |
| Readiness | Can the app serve traffic right now? | Remove from LB pool | `/health/ready` — checks DB connection, cache connection |

Cross-reference: Chapter 6 covers how load balancers use health checks to remove unhealthy backends from rotation.

Readiness check design: Be conservative. If your app cannot reach its database, it should fail readiness β€” sending traffic that will fail is worse than not sending traffic at all. However, a slow downstream service should not fail readiness if the app can degrade gracefully.
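A sketch of that readiness policy: gate only on critical dependencies, passed in as callables so the probe logic stays testable. The function names and dependency set are hypothetical.

```python
def readiness(check_db, check_cache):
    """Readiness gates only on critical dependencies, passed as callables.
    A slow non-critical dependency (e.g. a payment API) is deliberately
    excluded: it should trip a circuit breaker, not eject the instance."""
    checks = {"database": check_db(), "cache": check_cache()}
    status = 200 if all(checks.values()) else 503
    return status, checks

status, detail = readiness(lambda: True, lambda: True)  # healthy instance
```

Returning the per-dependency results alongside the status code lets the probe endpoint report *which* dependency failed, which shortens debugging when pods drop out of rotation.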


Tools Comparison ​

| Tool | Category | What It Does | Deployment | Strengths | Weaknesses | Cost |
|---|---|---|---|---|---|---|
| Prometheus | Metrics | Pull-based metrics collection, storage, PromQL | Self-hosted | CNCF standard, powerful query language | No long-term storage built-in, cardinality limits | Free |
| Grafana | Visualization | Dashboards for metrics, logs, traces from many sources | Self-hosted / Cloud | Universal frontend, supports 50+ data sources | Requires data source backends | Free / Paid |
| Elasticsearch | Log storage | Distributed search and analytics engine | Self-hosted / Cloud | Full-text search, flexible schema | Resource-intensive, complex to operate | Free / Paid |
| Logstash | Log processing | ETL pipeline for logs — parse, filter, enrich | Self-hosted | Powerful filter plugins | Heavy JVM resource usage | Free |
| Jaeger | Tracing | Distributed trace collection, storage, UI | Self-hosted | CNCF, OpenTelemetry compatible | No metrics, no logs | Free |
| Datadog | All-in-one APM | Metrics + logs + traces + APM + alerting | SaaS | Low operational overhead, fast setup | Expensive at scale | Per-host pricing |
| New Relic | All-in-one APM | Full-stack observability, error tracking | SaaS | Good out-of-box instrumentation | Cost scales with data ingest | Per-GB ingest |
| AWS CloudWatch | Cloud-native | Metrics + logs for AWS resources | SaaS (AWS) | Zero setup for AWS services | Vendor lock-in, limited query capability | Per metric/log |

Practical guidance:

  • Startups: Datadog or New Relic for speed of setup
  • Mid-size, cost-conscious: Prometheus + Grafana + ELK + Jaeger (more ops burden, much cheaper)
  • AWS-native: CloudWatch + X-Ray + managed Prometheus/Grafana
  • OpenTelemetry: Use the vendor-neutral OTLP standard for instrumentation β€” swap backends without re-instrumenting code

Real-World β€” Netflix Atlas: Netflix built Atlas, their internal metrics platform, to handle billions of time series from thousands of services. Atlas uses in-memory storage optimized for real-time dashboards and pattern-matching queries across tag dimensions. Netflix open-sourced Atlas; its design influenced Prometheus's label model.


Trade-offs & Comparisons ​

| Decision | Option A | Option B | Recommendation |
|---|---|---|---|
| Metrics storage | Prometheus (self-hosted) | Datadog (SaaS) | SaaS if the subscription fee (<$5K/month) matters less than ops cost |
| Log sampling | Store all logs | Sample + retain errors | Sample at high volume (>10GB/day) |
| Trace sampling | Head-based (simple) | Tail-based (smart) | Tail-based if budget allows — captures all errors |
| SLO target | 99.9% (43 min/month budget) | 99.99% (4 min/month budget) | Higher SLO = higher infra cost, diminishing returns |
| Alert strategy | Alert on causes (high CPU) | Alert on symptoms (error rate) | Symptom-based reduces noise |

Key Takeaway: Observability is the foundation of reliability. You cannot improve what you cannot measure, and you cannot debug what you cannot trace. Instrument before you need it β€” adding tracing during an incident is too late. The three pillars (metrics, logs, traces) are complements, not substitutes.


Incident Management Lifecycle ​

The Five Phases ​

Detection ​

  • Automated alerts, not human discovery β€” if an engineer finds the issue before an alert fires, alerting has failed
  • Alert on symptoms, not causes β€” alert on error rate or latency degradation, not "CPU > 80%" (which may be harmless)
  • Multi-signal detection β€” a latency spike confirmed by both metrics and traces is higher confidence than a single signal; reduce false positives by requiring two or more signals to agree

Triage ​

Classify severity within minutes of detection to determine escalation path and response urgency:

| Severity | Impact | Response Time | Example |
|---|---|---|---|
| P0 / SEV1 | Total outage, data loss | Immediate, all hands | Payment system down |
| P1 / SEV2 | Major feature broken | 15 min | Login failing for 50% of users |
| P2 / SEV3 | Minor feature degraded | 1 hour | Search results slow |
| P3 / SEV4 | Cosmetic / low impact | Next business day | Dashboard chart incorrect |



Designing Effective Dashboards ​

The Four Golden Signals (Google SRE) ​

Every service dashboard should lead with these four panels β€” they cover the majority of user-visible failure modes:

  • Latency β€” how long requests take (show p50, p95, p99 β€” never just average)
  • Traffic β€” how much demand the system is under (requests/sec, events/sec)
  • Errors β€” rate of failed requests (5xx, timeouts, explicit errors)
  • Saturation β€” how "full" the service is (CPU %, queue depth, connection pool usage)

If a service is degraded, at least one of these four will deviate from baseline. Start your dashboard design here; add domain-specific panels only as supplements.

Dashboard Anti-patterns ​

| Anti-pattern | Problem | Fix |
|---|---|---|
| Too many panels | Information overload slows incident response | Max 8–10 panels per dashboard; link to drill-down dashboards |
| Only averages | Hides tail latency affecting real users | Always show p50, p95, p99 side by side |
| No baseline | Cannot tell if a value is normal or alarming | Add SLO threshold lines and historical comparison overlays |
| Wall of text | Slow to scan under pressure | Use time-series graphs and stat panels, not tables of raw numbers |

Cost-Efficient Observability ​

The Cardinality Problem ​

High-cardinality labels cause metric storage to explode. Each unique label combination creates a separate time series:

```
1,000 unique user_id values
  × 100 metrics per label set
= 100,000 time series
  × 43,200 samples each (60-second resolution × 30-day retention)
≈ 4.3 billion stored samples → storage and query cost balloons
```

Never use user_id, request_id, or session_id as Prometheus label values. Reserve labels for low-cardinality dimensions: service, method, status_code, region.
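The worst-case series count is simply the product of distinct values per label, which a quick sketch makes concrete (label sets below are illustrative):

```python
from math import prod

def series_count(label_values):
    """Worst-case time-series count: the product of the number of
    distinct values each label can take."""
    return prod(len(values) for values in label_values.values())

safe = series_count({"service": ["api", "worker"],
                     "method": ["GET", "POST"],
                     "status_code": ["200", "500"]})                  # 8 series
risky = series_count({"service": ["api", "worker"],
                      "user_id": [f"u{i}" for i in range(10_000)]})   # 20,000 series
```

One unbounded label multiplies every other dimension, which is why a single `user_id` label can dominate an entire Prometheus deployment's storage.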

Strategies ​

  • Sampling: Collect 1% of traces in normal production traffic; use tail-based sampling to capture 100% of error traces and traces exceeding the p99 latency threshold β€” you get full coverage where it matters at a fraction of the cost
  • Aggregation: Pre-aggregate metrics at collection time in the OTel Collector (e.g., sum request counts by service) rather than storing raw per-request data
  • Retention tiers: Hot storage (7 days, full resolution) β†’ Warm storage (30 days, downsampled to 1-minute intervals) β†’ Cold storage (1 year, aggregated to hourly) β€” most incidents are investigated within 7 days; cold data is for trend analysis only
  • Log levels: Emit DEBUG logs only in development environments; INFO, WARN, and ERROR in production β€” a single verbose service can generate gigabytes of low-value log data per day

| Chapter | Relevance |
|---|---|
| Ch16 — Security & Reliability | Reliability SLOs and incident response complement observability |
| Ch13 — Microservices | Distributed tracing across microservice boundaries |
| Ch23 — Cloud-Native | Cloud-native monitoring: Prometheus, Grafana, CloudWatch |

Practice Questions ​

Beginner ​

  1. Distributed Tracing: A microservices request takes 3 seconds end-to-end, but each individual service logs less than 100ms of processing time. How would you use distributed tracing (spans, trace IDs) to locate the missing ~2.7 seconds? What are the most common hidden latency sources in microservice chains?

    Hint: Spans capture wall-clock time including network hops and queue wait time that individual service logs don't measure — look for gaps between the end of one span and the start of the next child span.

Intermediate ​

  1. Error Budget: Your team's SLO is 99.9% availability (43.8 min/month error budget). After a 2-hour outage, the SRE lead says you have "used 2.7Γ— your monthly error budget in one incident." What does this mean operationally β€” what features or deployments must now be frozen, and for how long?

    Hint: Burning the error budget triggers a freeze on non-critical feature releases until the budget resets (typically monthly); the team must focus entirely on reliability improvements before new features ship.
  2. Readiness Probe Design: You are designing a readiness probe for a service that depends on PostgreSQL, Redis, and a third-party payment API. The payment API is sometimes slow (2–5s). How do you design the probe so a slow payment API does not remove your service from the load balancer rotation?

    Hint: Separate critical dependencies (PostgreSQL, Redis — required for the service to function) from non-critical ones (payment API); probe only critical deps for readiness, and use a separate circuit breaker for the payment API.
  3. Observability Stack Decision: Compare Prometheus + Grafana (self-hosted) vs Datadog (SaaS) for a team of 5 engineers running 50 microservices. What hidden costs on each side are rarely surfaced in vendor comparisons?

    Hint: Prometheus hidden costs: storage sizing, alert manager maintenance, and engineering time managing the stack; Datadog hidden costs: per-host + per-custom-metric pricing that scales steeply with microservice count and cardinality.

Advanced ​

  1. Alert Noise Reduction: Your on-call engineer receives 200 alerts per week: 180 auto-resolve in 10 minutes, 15 require investigation but no action, and 5 require actual fixes. Design an alert restructuring plan (severity tiers, grouping, inhibition rules) to reduce noise while ensuring no critical alert is missed.

    Hint: Demote self-resolving alerts to warnings or eliminate them; add alert inhibition (suppress child alerts when a parent alert fires); use Alertmanager grouping to collapse 50 pod-restart alerts into one service-level alert — target a ratio where >80% of pages require action.

References & Further Reading ​

  • "Site Reliability Engineering" (Google SRE Book) β€” Chapters on Monitoring Distributed Systems and Alerting: https://sre.google/sre-book/table-of-contents/
  • "Observability Engineering" β€” Charity Majors, Liz Fong-Jones, George Miranda (O'Reilly, 2022) β€” the definitive guide to high-cardinality observability and the shift from monitoring to observability
  • OpenTelemetry documentation β€” vendor-neutral instrumentation standard for traces, metrics, and logs: https://opentelemetry.io/docs/
  • "The Art of Monitoring" β€” James Turnbull β€” practical guide to modern monitoring pipelines with Prometheus and the ELK stack
  • Datadog blog: "The Four Golden Signals" β€” https://www.datadoghq.com/blog/monitoring-101-collecting-data/
  • Google SRE Workbook β€” Chapter on Incident Response β€” covers structured incident management, severity classification, and postmortem culture: https://sre.google/workbook/incident-response/
