Infrastructure

Monitoring and Observability

Logs, metrics, and traces — the three pillars. Design monitoring infrastructure that helps you detect, diagnose, and resolve issues in production.

A distributed system with 200 microservices is running smoothly until users start reporting slow page loads. Where is the bottleneck? Is it the database? A downstream API? A network issue? A memory leak? Without observability, you are debugging blind -- restarting services and hoping for the best. With proper observability, you can pinpoint that the User Service's 99th percentile latency spiked at 2:15 PM because its connection pool to the PostgreSQL replica filled up after a deployment removed a connection timeout.

Observability is not monitoring. Monitoring tells you when something is wrong (alerts). Observability tells you why it is wrong (diagnosis). The distinction matters in interviews.

The Three Pillars

Logs

Logs are discrete, timestamped records of events. They are the most granular signal and the oldest form of observability.

#### Unstructured vs Structured Logging

Unstructured (bad):
  2024-01-15 10:30:05 ERROR Failed to process order 12345 for user 42

Structured (good):
  {
    "timestamp": "2024-01-15T10:30:05Z",
    "level": "error",
    "service": "order-service",
    "message": "Failed to process order",
    "order_id": "12345",
    "user_id": "42",
    "error_type": "PaymentDeclined",
    "trace_id": "abc-123-def",
    "duration_ms": 1250
  }

Structured logs (JSON format) are machine-parseable, enabling queries like "show me all errors for user 42 in the last hour" or "count PaymentDeclined errors per minute."
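
One way to emit logs in this shape is a custom formatter on top of Python's standard `logging` module. The sketch below is illustrative rather than a recommended library: `JsonFormatter`, `build_logger`, and the `extra={"fields": ...}` convention are assumptions made for this example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
        }
        # Context passed as extra={"fields": {...}} is merged in
        # (a convention of this sketch, not of the logging module).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service: str, stream=None) -> logging.Logger:
    """Create a logger that writes one JSON object per line."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)  # stderr when stream is None
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("order-service")
logger.error(
    "Failed to process order",
    extra={"fields": {"order_id": "12345", "user_id": "42", "trace_id": "abc-123-def"}},
)
```

Keeping the message text constant and putting the variable parts in separate fields is what makes "count errors by `error_type`" a simple field query instead of a regular expression.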

#### Log Levels

TRACE  → Very detailed, line-by-line execution (disabled in production)
DEBUG  → Diagnostic information for developers
INFO   → Normal operations (request received, order completed)
WARN   → Unexpected but recoverable (retry succeeded, fell back to a default)
ERROR  → Something failed (request failed, exception caught)
FATAL  → System is unusable (cannot connect to database, out of memory)

In production, set the default level to INFO. Enable DEBUG dynamically for specific services during incident investigation.
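
With Python's standard `logging` module, changing the level of one named logger at runtime is a single call. The sketch below assumes each service logs through a logger named after the service. The function name `set_service_log_level` is hypothetical, and how it gets triggered (an admin endpoint, a config reload) is left open.

```python
import logging

_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def set_service_log_level(service: str, level: str) -> int:
    """Change one service's logger level at runtime and return the previous level."""
    logger = logging.getLogger(service)
    previous = logger.level
    logger.setLevel(_LEVELS[level.upper()])
    return previous


# Turn on DEBUG for one service during an incident, then restore it afterwards.
previous_level = set_service_log_level("order-service", "debug")
logging.getLogger("order-service").setLevel(previous_level)
```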

#### Centralized Log Aggregation

With hundreds of services, logs must be aggregated into a central system. The standard pipeline:

Application → Log Shipper (Fluentd/Filebeat) → Message Queue (Kafka)
           → Log Storage (Elasticsearch/Loki) → Dashboard (Kibana/Grafana)

The ELK Stack (Elasticsearch, Logstash, Kibana) is the classic solution. Grafana Loki is a newer, more cost-effective alternative that indexes only metadata (labels), not the full log text.

Metrics

Metrics are numerical measurements collected at regular intervals. They are the most efficient signal for dashboards and alerting because they are compact and aggregatable.

#### Metric Types

Counter: A monotonically increasing value. Resets to zero when the process restarts. Use for: total requests, errors, bytes transferred.

# Prometheus counter example (Python prometheus_client library)
from prometheus_client import Counter

http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Increment on each request
http_requests_total.labels(method='GET', endpoint='/api/users', status='200').inc()
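
Because a counter resets to zero on restart, anything that turns raw counter samples into a rate has to tolerate a sudden drop. The function below is a simplified sketch of that idea and not Prometheus's actual `rate()` implementation, which also extrapolates to the window boundaries. `counter_increase` is a hypothetical helper.

```python
def counter_increase(samples: list[float]) -> float:
    """Total increase across scraped counter samples, tolerating resets.

    A drop between consecutive samples is treated as a process restart:
    the counter began again from zero, so the new value is itself the increase.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        total += (curr - prev) if curr >= prev else curr
    return total


# Five scrapes, 15 seconds apart, with a restart between the third and fourth.
samples = [100, 130, 160, 10, 40]
increase = counter_increase(samples)   # 30 + 30 + 10 + 30
requests_per_second = increase / 60    # four 15-second intervals
```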

Gauge: A value that goes up and down. Use for: current connections, memory usage, queue depth, temperature.

from prometheus_client import Gauge

active_connections = Gauge(
    'active_connections',
    'Number of active WebSocket connections'
)
active_connections.set(142)
active_connections.inc()   # 143
active_connections.dec()   # 142

Histogram: Measures the distribution of values (e.g., request latency). Observations are counted into predefined buckets, and quantiles are estimated from the bucket counts at query time (in Prometheus, with histogram_quantile).

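from prometheus_client import Histogram

request_duration = Histogram(
    'http_request_duration_seconds',
    'Request latency in seconds',
    ['endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

def process_request():
    ...  # placeholder for the real request handler

# Record a latency observation: time() measures the duration of the with-block
with request_duration.labels(endpoint='/api/users').time():
    process_request()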
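
Quantiles are estimated from the cumulative bucket counts, typically at query time (Prometheus does this with `histogram_quantile`). The sketch below shows the general idea, assuming observations are spread evenly inside each bucket. `estimate_quantile` and the sample counts are made up for illustration, and this is not the exact Prometheus algorithm.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    `buckets` holds (upper_bound, cumulative_count) pairs sorted by bound and
    ending with (math.inf, total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            if math.isinf(upper_bound) or in_bucket == 0:
                return lower_bound  # nothing to interpolate over
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts: 50 requests <= 0.1s, 80 <= 0.25s, and so on.
buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
p90 = estimate_quantile(0.90, buckets)  # falls inside the 0.25s to 0.5s bucket
```

The estimate can never be more precise than the bucket boundaries, which is why buckets are usually chosen around the latency targets that matter.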

Summary: Similar to a histogram, but quantiles (p50, p95, p99) are computed on the client side. Less flexible than histograms (precomputed quantiles cannot be meaningfully aggregated across instances) but can be more accurate for the specific quantiles configured.
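
The aggregation limit is easiest to see with numbers. In the sketch below, two instances each report their own p99, and averaging those values gives a very different answer from the p99 of the combined traffic. The latencies are invented, and `percentile` is a simple nearest-rank helper written for this example.

```python
import math


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: instance A is busy and healthy, instance B is quiet and slow.
instance_a = [10.0] * 990 + [50.0] * 10
instance_b = [10.0] * 50 + [900.0] * 50

avg_of_p99 = (percentile(instance_a, 99) + percentile(instance_b, 99)) / 2
true_p99 = percentile(instance_a + instance_b, 99)

print(avg_of_p99)  # 455.0
print(true_p99)    # 900.0
```

Histogram buckets avoid this problem because bucket counts from different instances can simply be added before the quantile is estimated.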

#### Key Metrics to Track

The RED Method (for request-driven services):
  Rate:     Requests per second
  Errors:   Failed requests per second
  Duration: Latency distribution (p50, p95, p99)

The USE Method (for resources):
  Utilization: How busy is the resource? (CPU: 75%)
  Saturation:  How much queued work? (disk I/O queue depth: 12)
  Errors:      How many errors? (network packet drops: 5/sec)

The Four Golden Signals (Google SRE):
  Latency:    Time to serve a request
  Traffic:    Volume of requests
  Errors:     Rate of failed requests
  Saturation: How "full" the service is (CPU, memory, connections)
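
As a rough illustration of the RED method, the three numbers can be derived from a window of raw request records. In practice the metrics backend does this from counters and histograms. The `Request` type and `red_metrics` function are assumptions for this sketch, and only 5xx responses are counted as errors.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code
    duration_ms: float  # time taken to serve the request


def red_metrics(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Compute Rate, Errors, and Duration for one window of requests."""
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)

    def pct(p: int) -> float:
        if not durations:
            return math.nan
        rank = max(1, math.ceil(p * len(durations) / 100))  # nearest-rank percentile
        return durations[rank - 1]

    return {
        "rate_per_s": len(requests) / window_seconds,
        "errors_per_s": errors / window_seconds,
        "p50_ms": pct(50),
        "p95_ms": pct(95),
        "p99_ms": pct(99),
    }


# 95 fast successes and 5 slow failures observed over a 10-second window.
window = [Request(200, 20.0)] * 95 + [Request(500, 1000.0)] * 5
metrics = red_metrics(window, window_seconds=10)
```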

Traces (Distributed Tracing)

In a microservices architecture, a single user request might touch 10 services. A trace follows that request across all services, showing exactly where time was spent.

#### Anatomy of a Trace

Trace ID: abc-123-def

├── Span: API Gateway (total: 250ms)
│   ├── Span: Auth Service - validate token (15ms)
│   ├── Span: User Service - get profile (80ms)
│   │   └── Span: PostgreSQL query (45ms)
│   ├── Span: Order Service - get recent orders (120ms)
│   │   ├── Span: Redis cache lookup (2ms) - MISS
│   │   └── Span: DynamoDB query (95ms)
│   └── Span: Response serialization (5ms)

A trace is a tree of spans. Each span represents a unit of work with a start time, duration, and metadata (tags, logs). Spans have parent-child relationships that form the call tree.
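
The span tree maps naturally onto a small recursive data structure. The sketch below rebuilds the example trace and answers two common questions: how much time a span spent in its own code, and which chain of calls dominated the request. It assumes child spans run one after another. `Span`, `self_time_ms`, and `slowest_path` are illustrative names, not part of any tracing library.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: float
    children: "list[Span]" = field(default_factory=list)

    def self_time_ms(self) -> float:
        """Time spent in this span itself, excluding its child spans."""
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def slowest_path(span: Span) -> list[str]:
    """Follow the longest child at each level to see where the time went."""
    path = [span.name]
    while span.children:
        span = max(span.children, key=lambda c: c.duration_ms)
        path.append(span.name)
    return path


trace = Span("API Gateway", 250, [
    Span("Auth Service - validate token", 15),
    Span("User Service - get profile", 80, [Span("PostgreSQL query", 45)]),
    Span("Order Service - get recent orders", 120, [
        Span("Redis cache lookup", 2),
        Span("DynamoDB query", 95),
    ]),
    Span("Response serialization", 5),
])

print(slowest_path(trace))
# ['API Gateway', 'Order Service - get recent orders', 'DynamoDB query']
```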

#### Context Propagation

For distributed tracing to work, each service must propagate the trace context (trace ID, span ID, parent span ID) to downstream calls. This is typically done via HTTP headers:

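HTTP Request Headers:
  traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
  tracestate: vendor=value

  (traceparent fields, in order: version, trace-id as 32 hex characters, parent-id as 16 hex characters, trace-flags)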

This follows the W3C Trace Context standard.
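
In the W3C format, `traceparent` carries four dash-separated lowercase hex fields: a 2-character version, a 32-character trace ID, a 16-character parent span ID, and 2 characters of flags. The sketch below parses an incoming header and builds the header for a downstream call. It handles only the version `00` layout. `TraceContext`, `parse_traceparent`, and `child_traceparent` are names invented for this example, and real services would normally rely on a tracing SDK instead.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str   # 32 hex chars, shared by every span in the trace
    parent_id: str  # 16 hex chars, the caller's span ID
    sampled: bool   # lowest bit of the trace-flags field


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the trace context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 1))


def child_traceparent(ctx: TraceContext) -> str:
    """Header for a downstream call: same trace ID, a fresh span ID as the new parent."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    return f"00-{ctx.trace_id}-{span_id}-{'01' if ctx.sampled else '00'}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing_header = child_traceparent(incoming)
```

The trace ID never changes along the call chain. Only the parent span ID is replaced at each hop, which is what lets the backend reassemble the tree.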

#### OpenTelemetry

OpenTelemetry (OTel) is the vendor-neutral, de facto standard for observability instrumentation. It provides a single set of APIs, SDKs, and tools for generating logs, metrics, and traces across major languages, although the maturity of each signal varies by language.

Architecture:
  Application (OTel SDK) → OTel Collector → Backend (Jaeger, Zipkin, Datadog, etc.)

The Collector:
  - Receives telemetry data from applications
  - Processes it (batching, filtering, enrichment)
  - Exports to one or more backends

Benefits:
  - Vendor-agnostic: switch from Jaeger to Datadog without changing code
  - Auto-instrumentation: libraries for HTTP, gRPC, databases, etc.
  - Unified: logs, metrics, and traces through one framework

Alerting Strategies

Alert on Symptoms, Not Causes

Bad alert:  "CPU usage > 80%"          (cause — maybe CPU is busy doing useful work)
Good alert: "p99 latency > 500ms"      (symptom — users are affected)
Good alert: "Error rate > 5%"          (symptom — something is broken)
Good alert: "Successful requests < 100/min" (symptom — traffic has dropped)

Alert Fatigue

Too many alerts desensitize the on-call engineer. Every alert should be actionable -- if you cannot do anything about it, it should not wake someone up at 3 AM.

Severity levels:
  P1 (Critical): Page the on-call engineer immediately.
     Example: "Zero successful requests in the last 5 minutes"

  P2 (High): Alert to Slack/PagerDuty, respond within 1 hour.
     Example: "Error rate > 10% for 10 minutes"

  P3 (Medium): Review next business day.
     Example: "Disk usage > 80%"

  P4 (Low): Informational, review weekly.
     Example: "Certificate expires in 30 days"
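
Qualifiers such as "for 10 minutes" are one of the simplest defenses against alert fatigue, because they suppress brief spikes that resolve on their own. The sketch below is a minimal take on that check. `should_fire` is a hypothetical function, and it assumes samples arrive as (unix timestamp, value) pairs ordered oldest first.

```python
def should_fire(samples: list[tuple[float, float]], threshold: float, for_seconds: float) -> bool:
    """True when the value has stayed above `threshold` for at least `for_seconds`."""
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # start of the current unbroken breach
        else:
            breach_start = None           # any recovery resets the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds


# Error-rate samples taken once a minute for 12 minutes.
sustained = [(t * 60.0, 0.12) for t in range(12)]
brief_spike = [(t * 60.0, 0.50 if t == 8 else 0.02) for t in range(12)]

should_fire(sustained, threshold=0.10, for_seconds=600)    # True
should_fire(brief_spike, threshold=0.10, for_seconds=600)  # False
```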

Multi-Window, Multi-Burn-Rate Alerts

Google's Site Reliability Workbook describes burn-rate alerting for SLO-based monitoring. Instead of alerting on a fixed threshold, alert when you are consuming your error budget faster than the SLO allows.

SLO: 99.9% availability (error budget: 0.1% = 43.2 minutes/month)

Burn rate 1x:  Consuming budget at exactly the sustainable rate, exhausted in 30 days → no alert
Burn rate 6x:  Will exhaust monthly budget in 5 days → alert to Slack
Burn rate 14x: Will exhaust monthly budget in ~2 days → page immediately

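Evaluate each burn rate over a long and a short window (for example, 1 hour and 5 minutes for the fast burn). The long window shows the burn is sustained, and the short window shows it is still happening, so the alert clears quickly once the problem stops. Combining a fast-burn rule with a slow-burn rule catches both sudden spikes and gradual degradation.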
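
The arithmetic behind burn rates is short enough to write out. A burn rate is the observed error ratio divided by the error budget ratio (1 minus the SLO), and a 30-day budget lasts 30 divided by the burn rate days. The sketch below assumes a 30-day SLO period. The function names are illustrative, and the thresholds are parameters rather than recommended values.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    return error_ratio / (1.0 - slo)


def days_to_exhaustion(rate: float, period_days: float = 30.0) -> float:
    """Days until the whole budget is gone if the burn rate holds."""
    return period_days / rate


def should_alert(long_window_ratio: float, short_window_ratio: float,
                 slo: float, threshold: float) -> bool:
    """Fire only when both the long and the short window exceed the burn-rate threshold."""
    return (burn_rate(long_window_ratio, slo) >= threshold
            and burn_rate(short_window_ratio, slo) >= threshold)


slo = 0.999
budget_minutes = (1.0 - slo) * 30 * 24 * 60   # about 43.2 minutes per 30 days

days_to_exhaustion(14)  # about 2.1 days
days_to_exhaustion(6)   # 5.0 days

# 2% of requests failing in both the 1-hour and 5-minute windows: burn rate of about 20x.
should_alert(0.02, 0.02, slo, threshold=14)    # True
# The 1-hour window is still elevated but the last 5 minutes have recovered.
should_alert(0.02, 0.0005, slo, threshold=14)  # False
```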

Building an Observability Stack

Common Stack (Open Source)

Metrics:  Prometheus (collection) + Grafana (visualization)
Logs:     Loki or Elasticsearch (storage) + Grafana or Kibana (visualization)
Traces:   Jaeger or Tempo (storage) + Grafana (visualization)
Alerting: Prometheus Alertmanager or Grafana Alerts

Commercial Solutions

Datadog:      All-in-one (metrics, logs, traces, APM). Expensive at scale.
New Relic:    Strong APM and error tracking. Pricing based on users and data ingested.
Honeycomb:    Best for high-cardinality exploratory analysis.
Splunk:       Powerful log search and analytics. Enterprise-focused.
AWS CloudWatch: Native AWS monitoring. Good for AWS-only environments.

Correlation: Connecting the Signals

The real power of observability comes from correlating logs, metrics, and traces:

1. Alert fires: "p99 latency > 500ms on Order Service"
2. Check metrics dashboard: latency spike started at 14:15
3. Check traces: slow traces show 400ms spent in PostgreSQL queries
4. Filter logs by trace_id: "Connection pool exhausted, waiting for available connection"
5. Root cause: deployment at 14:10 reduced max_connections from 50 to 5

Resolution: revert deployment, fix configuration, redeploy.

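The trace_id is the glue. Include it in every log line and every span. Do not add it as a metric label, because a per-request value creates a new time series for every request; link metrics to traces with exemplars instead. This enables jumping from a metric anomaly to the specific traces and logs that explain it.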
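
The "filter logs by trace_id" step is usually a query in the log backend, but the logic is simple enough to sketch directly. The example below assumes one JSON object per log line with a `trace_id` field. `logs_for_trace` and the sample lines are made up for illustration.

```python
import json


def logs_for_trace(log_lines: list[str], trace_id: str) -> list[dict]:
    """Return the parsed structured log entries that belong to one trace."""
    entries = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured JSON
        if isinstance(entry, dict) and entry.get("trace_id") == trace_id:
            entries.append(entry)
    return entries


lines = [
    '{"level": "info", "message": "Request received", "trace_id": "abc-123-def"}',
    '{"level": "error", "message": "Connection pool exhausted", "trace_id": "abc-123-def"}',
    '{"level": "info", "message": "Request received", "trace_id": "zzz-999"}',
    "2024-01-15 10:30:05 ERROR unstructured line without a trace id",
]
matching = logs_for_trace(lines, "abc-123-def")
```

The unstructured line cannot be matched at all, which is the practical cost of leaving the trace ID out of a log format.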

Interview Tips

  • Name the three pillars. Logs, metrics, traces. Explain what each is best for: logs for granular debugging, metrics for dashboards and alerting, traces for understanding request flow across services.
  • Know the metric types. Counter, gauge, histogram, summary. Explain when to use each.
  • Mention the RED method or Four Golden Signals. These frameworks show you know which metrics matter.
  • Discuss structured logging. If you mention logging, specify structured (JSON) format with trace IDs. Unstructured logs are nearly useless at scale.
  • Explain context propagation. Distributed tracing requires passing trace context between services. Mention W3C Trace Context headers or OpenTelemetry.
  • Alert on symptoms, not causes. This is a key principle from Google SRE that demonstrates operational maturity.
  • Mention cardinality. High-cardinality labels (user ID, request ID) on metrics explode storage costs. Keep metric labels low-cardinality. Use logs and traces for high-cardinality data.
  • Connect observability to your design. In any system design interview, when you describe a component, briefly mention how you would monitor it. This signals production readiness.
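
The cardinality tip can be backed with quick arithmetic: in the worst case, one metric produces a time series for every combination of label values. The sketch below uses invented label counts, and `series_count` is a hypothetical helper.

```python
import math


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time series for one metric: the product of distinct values per label."""
    return math.prod(label_cardinalities.values())


low_cardinality = {"method": 5, "endpoint": 50, "status": 10}
with_user_id = {**low_cardinality, "user_id": 100_000}

print(series_count(low_cardinality))  # 2500
print(series_count(with_user_id))     # 250000000
```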