Chapter 17: Monitoring & Observability

Monitoring tells you when something is wrong. Observability tells you why. The difference is whether your system was built to be questioned.
Mind Map
The Three Pillars of Observability
Observability is built on three complementary signal types. Each answers different questions:
| Dimension | Metrics | Logs | Traces |
|---|---|---|---|
| What is it? | Numeric measurements over time | Timestamped text/structured records | Request journey across services |
| Granularity | Aggregated (not per-request) | Per-event | Per-request, cross-service |
| Best for | Dashboards, alerting, trends | Debugging specific events | Latency attribution, dependency mapping |
| When to use | "Is something wrong?" | "What happened at 14:03:22?" | "Which service added 800ms?" |
| Cost | Low (aggregates) | Medium (storage scales with volume) | High (sampling required at scale) |
| Common tools | Prometheus, Datadog, Atlas | ELK Stack, Loki, CloudWatch | Jaeger, Zipkin, AWS X-Ray |
| Retention | Weeks–months | Days–weeks | Hours–days (sampled) |
| Cardinality risk | High (too many labels = explosion) | Low | Medium |
You need all three. Metrics alone tell you latency spiked; logs tell you which user was affected; traces tell you which service in the call chain caused it.
Metrics Types
Prometheus defines four metric types (counter, gauge, histogram, summary); the three you will use most often are:
| Type | Definition | Example Use Cases | Example Values |
|---|---|---|---|
| Counter | Monotonically increasing value, only goes up (resets to zero on restart) | Total HTTP requests, total errors, total bytes sent | http_requests_total{method="GET", status="200"} 10482 |
| Gauge | Arbitrary value that can go up or down | CPU %, active connections, queue depth, memory usage | queue_depth{queue="payments"} 42 |
| Histogram | Samples observations into configurable buckets, exposes _count, _sum, _bucket | Request latency distribution, request size | http_duration_seconds_bucket{le="0.1"} 8234 |
Why Histograms Matter: Percentiles vs. Averages
Averages hide tail latency. A p99 latency of 2s means 1% of your users wait 2 seconds or more; at scale, that can be thousands of users. Always alert on percentiles (p95, p99, p999) for latency metrics.
p50 = 50ms → Typical user
p95 = 200ms → Most users
p99 = 800ms → Tail (worst 1%)
Prometheus computes percentiles from histograms server-side using histogram_quantile(0.99, ...).
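To make the bucket-to-percentile step concrete, here is a minimal Python sketch of what histogram_quantile does under the hood: linear interpolation inside the first cumulative bucket that contains the target rank. The bucket boundaries and counts are illustrative, not from any real service.

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets,
    interpolating linearly inside the matching bucket (in the spirit
    of Prometheus's histogram_quantile, simplified).
    buckets: sorted list of (upper_bound, cumulative_count) pairs."""
    total = buckets[-1][1]
    rank = q * total                 # observations at or below the quantile
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:  # empty bucket: no interpolation possible
                return bound
            # linear interpolation between the bucket's bounds
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 8234 requests <= 0.1s, 9800 <= 0.5s, all 10000 <= 2.5s
buckets = [(0.1, 8234), (0.5, 9800), (2.5, 10000)]
p99 = histogram_quantile(0.99, buckets)  # falls in the 0.5s-2.5s bucket
```

The estimate is only as precise as the bucket layout: with the buckets above, every p99 between 9800 and 10000 observations maps into the wide 0.5–2.5s bucket, which is why choosing bucket boundaries around your SLO threshold matters.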
Distributed Tracing
In a microservices system, a single user request may fan out to 10+ services. Traditional per-service logging cannot answer "which service is slow." Distributed tracing reconstructs the full call tree.
Core Concepts
- Trace: The complete journey of one request, from entry point to all leaf calls
- Span: One unit of work within a trace (e.g., one service call, one DB query)
- Correlation ID / Trace ID: A unique ID injected at the entry point and propagated via HTTP headers (e.g., X-Trace-ID) to every downstream call
- Parent-child spans: Each span records its parent span ID, enabling tree reconstruction
Trace Sequence Example
Span Tree Visualization
abc123 [120ms total]
├── span_001: Auth (User Service) [15ms]
└── span_002: Order Service [98ms]
    └── span_003: DB Query [45ms]
This immediately surfaces that the Order Service path accounts for 98ms of the 120ms total, with 45ms of that inside the DB query, making the optimization target obvious.
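A tree like the one above can be rebuilt from raw span records using only each span's parent ID. A minimal Python sketch (the span records, including the root gateway span, are illustrative):

```python
# Hypothetical span records: (span_id, parent_id, name, duration_ms).
# parent_id None marks the root span of the trace.
spans = [
    ("span_000", None,       "API Gateway",   120),
    ("span_001", "span_000", "Auth",           15),
    ("span_002", "span_000", "Order Service",  98),
    ("span_003", "span_002", "DB Query",       45),
]

def build_index(spans):
    """Group spans by parent ID so the call tree can be walked top-down."""
    children = {}
    for span_id, parent_id, name, dur in spans:
        children.setdefault(parent_id, []).append((span_id, name, dur))
    return children

def render(children, parent=None, depth=0, out=None):
    """Depth-first walk producing an indented call-tree listing."""
    out = [] if out is None else out
    for span_id, name, dur in children.get(parent, []):
        out.append("  " * depth + f"{name} [{dur}ms]")
        render(children, span_id, depth + 1, out)
    return out

tree = render(build_index(spans))
```

This is exactly the reconstruction a tracing UI performs: spans arrive out of order from different services, and the parent-span-id field is what lets the backend stitch them back into one tree.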
Sampling Strategies
At 10,000 req/sec, storing every trace is prohibitively expensive. Two approaches:
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Head-based | Decide at trace start (random %) | Low overhead, simple | Misses rare errors |
| Tail-based | Buffer full trace, decide after completion based on outcome | Captures all errors/slow requests | Higher memory/processing cost |
Production recommendation: sample 1% normally, 100% of errors and traces > p99 latency threshold.
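That recommendation can be expressed as a tail-sampling decision function, evaluated once the trace has completed. A sketch, with an assumed p99 threshold and baseline rate:

```python
import random

P99_LATENCY_MS = 800   # assumed p99 threshold for this service
BASELINE_RATE = 0.01   # keep 1% of ordinary traces

def keep_trace(status_code, latency_ms, rng=random):
    """Tail-based sampling decision, made after the trace completes:
    always keep errors and slow outliers, sample the rest at 1%."""
    if status_code >= 500:
        return True                      # keep 100% of errors
    if latency_ms > P99_LATENCY_MS:
        return True                      # keep 100% of slow outliers
    return rng.random() < BASELINE_RATE  # 1% of normal traffic
```

Note that this logic can only run where the full trace is visible after completion (e.g., in a collector), which is exactly the extra buffering cost the table above attributes to tail-based sampling.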
W3C Trace Context: Trace Propagation Standard
Distributed tracing requires trace and span IDs to flow across service boundaries via HTTP headers. The W3C Trace Context specification defines the standard traceparent header:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
The four dash-separated fields are: version, trace-id (128-bit), parent-span-id, and flags.
Every service reads this header, creates a child span under the parent-span-id, and propagates the same trace-id to downstream calls. No proprietary vendor format is needed.
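A minimal sketch of reading and re-emitting the header in Python (the outgoing child span ID is made up; real SDKs also validate hex digits, reject all-zero IDs, and treat flags as a bit field rather than a plain string):

```python
def parse_traceparent(header):
    """Split a W3C traceparent header into its four fields (sketch)."""
    version, trace_id, parent_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(parent_id) == 16  # hex lengths
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "sampled": flags == "01"}

def child_traceparent(ctx, new_span_id):
    """Propagate downstream: same trace-id, this service's new span
    becomes the parent-span-id of the outgoing request."""
    return f"00-{ctx['trace_id']}-{new_span_id}-{'01' if ctx['sampled'] else '00'}"

ctx = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = child_traceparent(ctx, "b7ad6b7169203331")
```

The key invariant is visible in the output: the trace-id never changes across the whole request, only the parent-span-id does.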
Tracing Tools Comparison
| Dimension | Jaeger | Zipkin | Grafana Tempo |
|---|---|---|---|
| Storage backends | Cassandra, Elasticsearch, in-memory | MySQL, Cassandra, Elasticsearch | Object storage (S3, GCS, Azure Blob) |
| UI | Full-featured: trace timeline, service dependency graph | Simpler: trace list + timeline | Minimal native UI; relies on Grafana |
| Sampling | Head-based and adaptive (remote controlled) | Head-based | Delegated to OTel Collector |
| OpenTelemetry | Native OTLP support | Via adapter | Native OTLP (primary protocol) |
| Integration | CNCF project, Kubernetes-native | Widely adopted, many client libs | Grafana stack (pairs with Prometheus + Loki) |
| Scale | Medium-large | Medium | Very large (object storage = cheap at scale) |
| Deployment | Moderate complexity | Simple | Simple (no search index needed) |
| Cost | Free (Elasticsearch costs extra) | Free | Free (storage cost only) |
| Best for | Kubernetes microservices, CNCF stack | Existing Zipkin instrumentation | Grafana-centric observability stacks |
OpenTelemetry
OpenTelemetry (OTel) is a CNCF project that provides a vendor-neutral standard for instrumentation, collection, and export of telemetry data (traces, metrics, and logs) under a single unified API and SDK.
Before OTel, each observability vendor required its own SDK. Switching from Jaeger to Datadog meant re-instrumenting every service. OpenTelemetry solves this with a single instrumentation layer and pluggable exporters.
Three Pillars Unified
OTLP (OpenTelemetry Protocol) is the wire format. It runs over gRPC (port 4317) or HTTP/protobuf (port 4318). Any backend that speaks OTLP can receive OTel data.
OTel Collector Architecture
The Collector is an optional but recommended middle tier. It decouples instrumented apps from backend destinations:
Collector benefits:
- Tail-based sampling at the Collector level (buffer spans, sample on outcome)
- Data transformation: filter PII from logs, rename labels, add resource attributes
- Multi-backend export: send the same data to two backends simultaneously (migration, redundancy)
- Decoupled upgrades: swap backends without touching application code
Auto-Instrumentation vs Manual Instrumentation
| Dimension | Auto-Instrumentation | Manual Instrumentation |
|---|---|---|
| How | Agent/bytecode injection at startup, no code changes | Developer adds tracer.start_span() calls in code |
| Effort | Zero code changes | Per-operation instrumentation required |
| Coverage | HTTP clients, DB drivers, frameworks (Spring, Express, Django) | Any custom business logic, critical paths |
| Span quality | Generic (framework-level, missing business context) | Rich (custom attributes: user_id, order_id, feature flags) |
| Latency | Slight overhead from agent | Minimal (only instrumented operations) |
| Maintenance | Agent version updates | Code changes when logic changes |
| Best for | Getting started, standardizing infrastructure calls | High-value business operations, SLO-critical paths |
Recommendation: Use auto-instrumentation as the baseline (catches all HTTP and DB calls), then add manual spans around critical business operations (payment processing, fraud check, recommendation engine) where custom attributes are needed for debugging.
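The shape of a manual span looks roughly like the snippet below. To stay self-contained it uses a toy, hand-rolled span recorder rather than the real OpenTelemetry SDK; the point is where manual spans and custom business attributes fit, not the exact API.

```python
import contextlib
import time

@contextlib.contextmanager
def start_span(name, spans, **attributes):
    """Toy stand-in for a tracer's start_span: records the span name,
    custom attributes, and wall-clock duration. Illustrates the
    manual-instrumentation pattern only, not the real OTel API."""
    start = time.perf_counter()
    span = {"name": name, "attributes": dict(attributes)}
    try:
        yield span
    finally:
        span["duration_ms"] = (time.perf_counter() - start) * 1000
        spans.append(span)

spans = []
# Auto-instrumentation would already cover the outbound HTTP call; the
# manual span adds the business context (order_id, amount) that
# debugging a payment incident actually needs.
with start_span("charge_payment", spans,
                order_id="ord_42", amount_cents=1999) as span:
    span["attributes"]["gateway"] = "stripe"  # attribute added mid-operation
```

With the real SDK the structure is the same: open a span around the business operation, attach the identifiers you will want to filter on later, and let the context manager record timing on exit.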
OTel Language Support
OTel provides SDKs for all major languages: Java, Python, Go, JavaScript/Node, .NET, Ruby, Rust, C++, PHP. Auto-instrumentation is most mature for Java (via Java agent) and Node.js (via @opentelemetry/auto-instrumentations-node).
Cross-reference: Chapter 16: Reliability Patterns for SLO-driven reliability engineering. Chapter 23: Cloud-Native Architecture for OTel in Kubernetes.
Log Aggregation Pipeline
Individual service logs are useless if you cannot search them across all instances. A log aggregation pipeline centralizes logs from every container, VM, and serverless function:
Structured logging (JSON over plaintext) is essential for the search step. A structured log line:
{
"timestamp": "2026-03-12T00:40:00Z",
"level": "ERROR",
"service": "order-service",
"trace_id": "abc123",
"user_id": "user_789",
"message": "Payment timeout after 5000ms",
"duration_ms": 5000
}
This allows queries like: level:ERROR AND service:order-service AND duration_ms:>3000.
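One way to emit such lines from an application, sketched with Python's standard logging module. The JsonFormatter class and its field names are an assumption mirroring the example above, not a standard library feature:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line, so the
    aggregation pipeline can index fields instead of grepping text."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "order-service",  # assumed service name
            "message": record.getMessage(),
        }
        # Carry structured fields passed via logging's `extra` mechanism
        for key in ("trace_id", "user_id", "duration_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("Payment timeout after 5000ms",
             extra={"trace_id": "abc123", "duration_ms": 5000})
```

Including the trace_id in every log line is what links the logs pillar to the traces pillar: a search for one trace ID returns every log line that request produced, across all services.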
SLI / SLO / SLA
Google's Site Reliability Engineering introduced the SLI → SLO → SLA hierarchy as a way to make reliability quantitative and contractual:
Definitions
| Concept | Owner | Consequence of breach | Example |
|---|---|---|---|
| SLI | Engineering | None (it is a measurement) | 99.92% of requests succeeded this month |
| SLO | Engineering | Internal alert, error budget consumed | Target: 99.9% success rate |
| SLA | Business/Legal | Financial penalty, contract clause | Guarantee: 99.5% or credit issued |
SLO is always stricter than SLA. The gap between SLO (internal target) and SLA (contractual guarantee) is the safety buffer: if engineering hits 99.8% when the SLO was 99.9%, the team is alerted and investigates before breaching the 99.5% SLA.
Common SLI Examples by Metric Type
| Metric Type | Example SLI | Example SLO | Measurement Method |
|---|---|---|---|
| Availability | Fraction of successful HTTP requests (2xx/3xx) | 99.9% success rate over 30 days | Synthetic probes + real traffic |
| Latency | p99 response time | p99 < 500ms over 1-hour window | Histogram (Prometheus histogram_quantile) |
| Error rate | Fraction of 5xx responses | < 0.1% error rate | Error counter / total request counter |
| Throughput | Requests processed per second | > 1,000 RPS sustained | Gauge metric on queue consumer |
| Freshness | Age of most recent data ingested | Data lag < 5 minutes | Timestamp comparison metric |
| Durability | Fraction of written objects successfully retrieved | 99.999999999% (11 nines) | Periodic read-back verification |
Error Budgets
An error budget is the allowable unreliability within an SLO period:
Error budget = 1 − SLO target
99.9% SLO → 0.1% budget → 43.8 minutes/month of allowed downtime
99.99% SLO → 0.01% budget → 4.38 minutes/month
The Nines: Downtime Allowance per SLO Target
| SLO Target | Downtime per Month | Downtime per Year | Downtime per Week | Common Name |
|---|---|---|---|---|
| 99% | 7.3 hours | 3.65 days | 1.68 hours | Two nines |
| 99.5% | 3.65 hours | 1.83 days | 50.4 minutes | – |
| 99.9% | 43.8 minutes | 8.77 hours | 10.1 minutes | Three nines |
| 99.95% | 21.9 minutes | 4.38 hours | 5.04 minutes | – |
| 99.99% | 4.38 minutes | 52.6 minutes | 60.5 seconds | Four nines |
| 99.999% | 26.3 seconds | 5.26 minutes | 6.05 seconds | Five nines |
| 99.9999% | 2.63 seconds | 31.6 seconds | 0.605 seconds | Six nines |
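The table values follow from simple arithmetic. A small helper, using a 730-hour average month (the figure the 43.8 min/month number implies), reproduces them:

```python
def downtime_allowed(slo_percent, period_hours):
    """Minutes of allowed downtime for a given SLO over a period."""
    budget_fraction = 1 - slo_percent / 100
    return period_hours * 60 * budget_fraction

MONTH_HOURS = 730  # average month, ~365.25 days / 12 * 24

three_nines = downtime_allowed(99.9, MONTH_HOURS)   # ~43.8 min/month
four_nines = downtime_allowed(99.99, MONTH_HOURS)   # ~4.38 min/month
```

Running the numbers yourself like this is a quick sanity check when a vendor quotes an availability figure: each extra nine divides the allowed downtime by ten, which is why the operational cost jumps so sharply.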
Cost of an additional nine: Each additional nine of availability roughly doubles infrastructure cost and operational complexity. Going from 99.9% to 99.99% is not a simple 10× improvement: it requires eliminating every planned maintenance window, deploying active-active across multiple regions, and achieving sub-minute failover. Most SaaS products target 99.9% (43 min/month), which is achievable with a single region plus good health checks. Five nines (26 s/month) requires active-active multi-region with automated failover in under 10 seconds.
Error budget policy: When the budget is exhausted, new feature deployments halt and reliability work takes priority. This creates a natural feedback loop: engineering teams that want to ship features are incentivized to keep the service reliable.
Real-World (Google SRE): Google's SRE teams hold joint ownership of error budgets with product teams. If a service exhausts its error budget, the SRE team can unilaterally halt launches. This removes the "reliability vs. velocity" organizational conflict by making reliability a shared engineering metric.
Connecting SLOs to Observability Signals
SLOs require SLIs, which require the right observability signals:
Burn rate alerts: Rather than alerting when the budget is fully exhausted, alert when the consumption rate predicts exhaustion. A burn rate of 14.4× means you will exhaust a 30-day budget in about 2 days. Alert at a 2× burn rate over a long window (budget gone in ~15 days: open a ticket) and at a 14.4× burn rate over a short window (page immediately). This is the multi-window, multi-burn-rate alerting pattern from Google's SRE Workbook.
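The burn-rate arithmetic is worth seeing once. A sketch, where the 99.9% SLO and the example rates mirror the text:

```python
def burn_rate(observed_error_rate, slo=0.999):
    """Budget consumption speed: observed error rate divided by the
    error rate the SLO allows. 1.0 means exactly on budget."""
    return observed_error_rate / (1 - slo)

def days_to_exhaustion(rate, window_days=30):
    """At this burn rate, when is the window's entire budget gone?"""
    return window_days / rate

fast = burn_rate(0.0144)   # 1.44% errors against a 0.1% allowance: ~14.4x
slow = burn_rate(0.002)    # 0.2% errors: 2x burn, budget gone in 15 days
```

A burn rate below 1.0 means the service is accumulating unused budget; sustained rates above 1.0 are what the multi-window alerts watch for.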
Alerting Strategies
Alert Fatigue
The biggest failure mode in alerting is alert fatigue: too many low-signal alerts cause engineers to ignore them, including the critical ones. Symptoms:
- On-call engineers acknowledge without investigating
- Alert volume exceeds 10/day on average
- Many alerts resolve without human action
Solution: Every alert must be actionable. If an alert fires and no action is required, delete or demote it.
Severity Levels
| Severity | Definition | Response SLA | Example |
|---|---|---|---|
| P1 / Critical | Service down, revenue impact | Wake on-call immediately, < 5 min | Payment API returning 500 |
| P2 / High | Degraded, SLO at risk | Alert on-call during business hours, < 30 min | p99 latency > 2s |
| P3 / Medium | Anomaly, no immediate user impact | Ticket, fix in sprint | Disk > 80% on non-critical host |
| P4 / Low | Informational | Review weekly | Dependency approaching end of support |
On-Call Best Practices
- Rotate on-call weekly to distribute burden and knowledge
- Keep runbooks for every P1/P2 alert; documented steps reduce MTTR
- Conduct blameless post-mortems within 48 hours of incidents
- Track Mean Time To Detect (MTTD) and Mean Time To Resolve (MTTR) as team metrics
- Alert on symptoms (user-visible impact), not causes (high CPU), where possible
Health Checks
Health checks allow orchestrators (Kubernetes, load balancers) to route traffic away from unhealthy instances automatically. Three probe types:
| Probe | Question | Failure Action | Example Endpoint |
|---|---|---|---|
| Startup | Has the app finished initializing? | Wait (don't kill yet) | /health/startup → checks migrations complete |
| Liveness | Is the process alive and not deadlocked? | Restart the container | /health/live → returns 200 if process is responsive |
| Readiness | Can the app serve traffic right now? | Remove from LB pool | /health/ready → checks DB connection, cache connection |
Cross-reference: Chapter 6 covers how load balancers use health checks to remove unhealthy backends from rotation.
Readiness check design: Be conservative. If your app cannot reach its database, it should fail readiness: sending traffic that will fail is worse than not sending traffic at all. However, a slow downstream service should not fail readiness if the app can degrade gracefully.
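That critical-vs-non-critical split can be sketched as a readiness handler. The check_* callables are hypothetical stand-ins for real connectivity probes:

```python
def readiness(check_db, check_cache, check_payment_api):
    """Readiness decision: only critical dependencies (DB, cache) gate
    the probe; a slow or failing non-critical dependency (payment API)
    is reported for observability but never pulls the instance out of
    the load balancer pool."""
    critical = {"postgres": check_db(), "redis": check_cache()}
    ready = all(critical.values())
    body = {
        "ready": ready,
        "critical": critical,
        # informational only: surfaced in the response, never fails it
        "degraded": {"payment_api": not check_payment_api()},
    }
    return (200 if ready else 503), body

# Payment API down, but both critical deps healthy: still ready.
code, body = readiness(lambda: True, lambda: True, lambda: False)
```

The degraded section gives operators visibility into the slow dependency without coupling the service's availability to it; the payment path itself would be protected by a circuit breaker instead.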
Tools Comparison
| Tool | Category | What It Does | Deployment | Strengths | Weaknesses | Cost |
|---|---|---|---|---|---|---|
| Prometheus | Metrics | Pull-based metrics collection, storage, PromQL | Self-hosted | CNCF standard, powerful query language | No long-term storage built-in, cardinality limits | Free |
| Grafana | Visualization | Dashboards for metrics, logs, traces from many sources | Self-hosted / Cloud | Universal frontend, supports 50+ data sources | Requires data source backends | Free / Paid |
| Elasticsearch | Log storage | Distributed search and analytics engine | Self-hosted / Cloud | Full-text search, flexible schema | Resource-intensive, complex to operate | Free / Paid |
| Logstash | Log processing | ETL pipeline for logs (parse, filter, enrich) | Self-hosted | Powerful filter plugins | Heavy JVM resource usage | Free |
| Jaeger | Tracing | Distributed trace collection, storage, UI | Self-hosted | CNCF, OpenTelemetry compatible | No metrics, no logs | Free |
| Datadog | All-in-one APM | Metrics + logs + traces + APM + alerting | SaaS | Low operational overhead, fast setup | Expensive at scale | Per-host pricing |
| New Relic | All-in-one APM | Full-stack observability, error tracking | SaaS | Good out-of-box instrumentation | Cost scales with data ingest | Per-GB ingest |
| AWS CloudWatch | Cloud-native | Metrics + logs for AWS resources | SaaS (AWS) | Zero setup for AWS services | Vendor lock-in, limited query capability | Per metric/log |
Practical guidance:
- Startups: Datadog or New Relic for speed of setup
- Mid-size, cost-conscious: Prometheus + Grafana + ELK + Jaeger (more ops burden, much cheaper)
- AWS-native: CloudWatch + X-Ray + managed Prometheus/Grafana
- OpenTelemetry: Use the vendor-neutral OTLP standard for instrumentation, so you can swap backends without re-instrumenting code
Real-World (Netflix Atlas): Netflix built Atlas, their internal metrics platform, to handle billions of time series from thousands of services. Atlas uses in-memory storage optimized for real-time dashboards and pattern-matching queries across tag dimensions. Netflix open-sourced Atlas; its design influenced Prometheus's label model.
Trade-offs & Comparisons
| Decision | Option A | Option B | Recommendation |
|---|---|---|---|
| Metrics storage | Prometheus (self-hosted) | Datadog (SaaS) | SaaS when a few $K/month in fees costs less than the engineering time to self-host |
| Log sampling | Store all logs | Sample + retain errors | Sample at high volume (>10GB/day) |
| Trace sampling | Head-based (simple) | Tail-based (smart) | Tail-based if budget allows; it captures all errors |
| SLO target | 99.9% (43 min/month budget) | 99.99% (4 min/month budget) | Higher SLO = higher infra cost, diminishing returns |
| Alert strategy | Alert on causes (high CPU) | Alert on symptoms (error rate) | Symptom-based reduces noise |
Key Takeaway: Observability is the foundation of reliability. You cannot improve what you cannot measure, and you cannot debug what you cannot trace. Instrument before you need it; adding tracing during an incident is too late. The three pillars (metrics, logs, traces) are complements, not substitutes.
Incident Management Lifecycle
The Five Phases
Detection
- Automated alerts, not human discovery: if an engineer finds the issue before an alert fires, alerting has failed
- Alert on symptoms, not causes: alert on error rate or latency degradation, not "CPU > 80%" (which may be harmless)
- Multi-signal detection: a latency spike confirmed by both metrics and traces is higher confidence than a single signal; reduce false positives by requiring two or more signals to agree
Triage
Classify severity within minutes of detection to determine escalation path and response urgency:
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| P0 / SEV1 | Total outage, data loss | Immediate, all hands | Payment system down |
| P1 / SEV2 | Major feature broken | 15 min | Login failing for 50% of users |
| P2 / SEV3 | Minor feature degraded | 1 hour | Search results slow |
| P3 / SEV4 | Cosmetic / low impact | Next business day | Dashboard chart incorrect |
Incident Example: Latency Spike Debugging
Designing Effective Dashboards
The Four Golden Signals (Google SRE)
Every service dashboard should lead with these four panels; together they cover the majority of user-visible failure modes:
- Latency: how long requests take (show p50, p95, p99, never just the average)
- Traffic: how much demand the system is under (requests/sec, events/sec)
- Errors: rate of failed requests (5xx, timeouts, explicit errors)
- Saturation: how "full" the service is (CPU %, queue depth, connection pool usage)
If a service is degraded, at least one of these four will deviate from baseline. Start your dashboard design here; add domain-specific panels only as supplements.
Dashboard Anti-patterns
| Anti-pattern | Problem | Fix |
|---|---|---|
| Too many panels | Information overload slows incident response | Max 8–10 panels per dashboard; link to drill-down dashboards |
| Only averages | Hides tail latency affecting real users | Always show p50, p95, p99 side by side |
| No baseline | Cannot tell if a value is normal or alarming | Add SLO threshold lines and historical comparison overlays |
| Wall of text | Slow to scan under pressure | Use time-series graphs and stat panels, not tables of raw numbers |
Cost-Efficient Observability
The Cardinality Problem
High-cardinality labels cause metric storage to explode. Each unique label combination creates a separate time series:
1,000 unique user_id values
× 100 metrics per user
× 60-second resolution
× 30-day retention
= millions of time series, and storage and query cost balloons
Never use user_id, request_id, or session_id as Prometheus label values. Reserve labels for low-cardinality dimensions: service, method, status_code, region.
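The series-count arithmetic is easy to check: the number of active time series is the product of each label's distinct-value counts (the cardinalities below are illustrative). Resolution and retention then multiply the number of stored samples per series on top of that.

```python
def series_count(label_cardinalities):
    """Active time series for one metric = the product of the number
    of distinct values each label can take."""
    total = 1
    for distinct_values in label_cardinalities.values():
        total *= distinct_values
    return total

# Safe: low-cardinality labels only -> 4,000 series
safe = series_count({"service": 50, "method": 8, "status_code": 10})

# Dangerous: a single user_id label multiplies everything by 100,000
blown = series_count({"service": 50, "method": 8, "status_code": 10,
                      "user_id": 100_000})   # 400 million series
```

This multiplicative blow-up is why one careless label can take down a Prometheus server: the cost is not additive per label, it compounds across all of them.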
Strategies
- Sampling: Collect 1% of traces in normal production traffic; use tail-based sampling to capture 100% of error traces and traces exceeding the p99 latency threshold. You get full coverage where it matters at a fraction of the cost
- Aggregation: Pre-aggregate metrics at collection time in the OTel Collector (e.g., sum request counts by service) rather than storing raw per-request data
- Retention tiers: Hot storage (7 days, full resolution) → warm storage (30 days, downsampled to 1-minute intervals) → cold storage (1 year, aggregated to hourly). Most incidents are investigated within 7 days; cold data is for trend analysis only
- Log levels: Emit DEBUG logs only in development environments; INFO, WARN, and ERROR in production. A single verbose service can generate gigabytes of low-value log data per day
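The aggregation strategy above, sketched in miniature: collapse per-request records into per-label-set counters before export, so only the aggregate leaves the collector (the request records are made up):

```python
from collections import Counter

# Raw per-request records as they might arrive at a collector.
requests = [
    {"service": "order", "status": 200},
    {"service": "order", "status": 200},
    {"service": "order", "status": 500},
    {"service": "auth",  "status": 200},
]

# Pre-aggregation: one counter per (service, status) pair instead of
# one record per request. Four events become three exported series.
aggregated = Counter((r["service"], r["status"]) for r in requests)
```

At real traffic volumes the ratio is far more dramatic: millions of requests per minute collapse into a handful of counters per service, which is exactly the saving the bullet describes.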
Related Chapters
| Chapter | Relevance |
|---|---|
| Ch16 – Security & Reliability | Reliability SLOs and incident response complement observability |
| Ch13 – Microservices | Distributed tracing across microservice boundaries |
| Ch23 – Cloud-Native | Cloud-native monitoring: Prometheus, Grafana, CloudWatch |
Practice Questions
Beginner
Distributed Tracing: A microservices request takes 3 seconds end-to-end, but each individual service logs less than 100ms of processing time. How would you use distributed tracing (spans, trace IDs) to locate the missing ~2.7 seconds? What are the most common hidden latency sources in microservice chains?
Hint
Spans capture wall-clock time, including network hops and queue wait time that individual service logs don't measure; look for gaps between the end of one span and the start of the next child span.
Intermediate
Error Budget: Your team's SLO is 99.9% availability (43.8 min/month error budget). After a 2-hour outage, the SRE lead says you have "used 2.7× your monthly error budget in one incident." What does this mean operationally: what features or deployments must now be frozen, and for how long?
Hint
Burning the error budget triggers a freeze on non-critical feature releases until the budget resets (typically monthly); the team must focus entirely on reliability improvements before new features ship.
Readiness Probe Design: You are designing a readiness probe for a service that depends on PostgreSQL, Redis, and a third-party payment API. The payment API is sometimes slow (2–5s). How do you design the probe so a slow payment API does not remove your service from the load balancer rotation?
Hint
Separate critical dependencies (PostgreSQL and Redis, which are required for the service to function) from non-critical ones (payment API); probe only critical deps for readiness, and use a separate circuit breaker for the payment API.
Observability Stack Decision: Compare Prometheus + Grafana (self-hosted) vs Datadog (SaaS) for a team of 5 engineers running 50 microservices. What hidden costs on each side are rarely surfaced in vendor comparisons?
Hint
Prometheus hidden costs: storage sizing, alert manager maintenance, and engineering time managing the stack; Datadog hidden costs: per-host + per-custom-metric pricing that scales steeply with microservice count and cardinality.
Advanced
Alert Noise Reduction: Your on-call engineer receives 200 alerts per week: 180 auto-resolve in 10 minutes, 15 require investigation but no action, and 5 require actual fixes. Design an alert restructuring plan (severity tiers, grouping, inhibition rules) to reduce noise while ensuring no critical alert is missed.
Hint
Demote self-resolving alerts to warnings or eliminate them; add alert inhibition (suppress child alerts when a parent alert fires); use Alertmanager grouping to collapse 50 pod-restart alerts into one service-level alert. Target a ratio where >80% of pages require action.
References & Further Reading
- "Site Reliability Engineering" (Google SRE Book), chapters on Monitoring Distributed Systems and Alerting: https://sre.google/sre-book/table-of-contents/
- "Observability Engineering" by Charity Majors, Liz Fong-Jones, and George Miranda (O'Reilly, 2022): the definitive guide to high-cardinality observability and the shift from monitoring to observability
- OpenTelemetry documentation, the vendor-neutral instrumentation standard for traces, metrics, and logs: https://opentelemetry.io/docs/
- "The Art of Monitoring" by James Turnbull: a practical guide to building modern monitoring pipelines
- Datadog "Monitoring 101" series, on collecting the right data and the golden signals: https://www.datadoghq.com/blog/monitoring-101-collecting-data/
- Google SRE Workbook, chapter on Incident Response, covering structured incident management, severity classification, and postmortem culture: https://sre.google/workbook/incident-response/
