Infrastructure
Logs, metrics, and traces — the three pillars. Design monitoring infrastructure that helps you detect, diagnose, and resolve issues in production.
A distributed system with 200 microservices is running smoothly until users start reporting slow page loads. Where is the bottleneck? Is it the database? A downstream API? A network issue? A memory leak? Without observability, you are debugging blind -- restarting services and hoping for the best. With proper observability, you can pinpoint that the User Service's 99th percentile latency spiked at 2:15 PM because its connection pool to the PostgreSQL replica filled up after a deployment removed a connection timeout.
Observability is not monitoring. Monitoring tells you when something is wrong (alerts). Observability tells you why it is wrong (diagnosis). The distinction matters in interviews.
Logs are discrete, timestamped records of events. They are the most granular signal and the oldest form of observability.
#### Unstructured vs Structured Logging
Unstructured (bad):
2024-01-15 10:30:05 ERROR Failed to process order 12345 for user 42
Structured (good):
{
  "timestamp": "2024-01-15T10:30:05Z",
  "level": "error",
  "service": "order-service",
  "message": "Failed to process order",
  "order_id": "12345",
  "user_id": "42",
  "error_type": "PaymentDeclined",
  "trace_id": "abc-123-def",
  "duration_ms": 1250
}
Structured logs (JSON format) are machine-parseable, enabling queries like "show me all errors for user 42 in the last hour" or "count PaymentDeclined errors per minute."
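One way to emit logs in this shape is a custom formatter on Python's standard `logging` module. The sketch below is illustrative only: the `JsonFormatter` class, the `build_logger` helper, the hard-coded service name, and the `context` key passed through `extra=` are all assumptions made for this example, not part of any particular logging library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": "order-service",  # assumed static service name
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} become an attribute on
        # the record; "context" is a hypothetical key chosen for this sketch.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str = "order-service") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
logger.error(
    "Failed to process order",
    extra={"context": {"order_id": "12345", "user_id": "42", "error_type": "PaymentDeclined"}},
)
```

Keeping identifiers such as `order_id` as separate fields, rather than interpolating them into the message, is what makes the per-user and per-error-type queries possible.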
#### Log Levels
TRACE → Very detailed, line-by-line execution (disabled in production)
DEBUG → Diagnostic information for developers
INFO → Normal operations (request received, order completed)
WARN → Unexpected but recoverable (retry succeeded, cache miss)
ERROR → Something failed (request failed, exception caught)
FATAL → System is unusable (cannot connect to database, out of memory)
In production, set the default level to INFO. Enable DEBUG dynamically for specific services during incident investigation.
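A minimal sketch of per-service dynamic levels, using Python's standard `logging` module. The `set_service_log_level` helper and the service names are hypothetical; in a real system this function would typically sit behind an admin endpoint or a config watcher, which is not shown here.

```python
import logging

# Level names this sketch accepts, mapped to the stdlib constants.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "ERROR": logging.ERROR,
    "FATAL": logging.CRITICAL,
}


def set_service_log_level(service: str, level_name: str) -> int:
    """Change one service logger's level at runtime and return the numeric level."""
    try:
        level = LEVELS[level_name.upper()]
    except KeyError:
        raise ValueError(f"unknown log level: {level_name!r}") from None
    logging.getLogger(service).setLevel(level)
    return level


# Default for everything: INFO.
logging.getLogger().setLevel(logging.INFO)

# During an incident, turn on DEBUG for a single service only.
set_service_log_level("order-service", "DEBUG")
```

Other loggers keep inheriting INFO from the root logger, so the extra log volume stays confined to the service under investigation.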
#### Centralized Log Aggregation
With hundreds of services, logs must be aggregated into a central system. The standard pipeline:
Application → Log Shipper (Fluentd/Filebeat) → Message Queue (Kafka)
  → Log Storage (Elasticsearch/Loki) → Dashboard (Kibana/Grafana)
The ELK Stack (Elasticsearch, Logstash, Kibana) is the classic solution. Grafana Loki is a newer, more cost-effective alternative that indexes only metadata (labels), not the full log text.
Metrics are numerical measurements collected at regular intervals. They are the most efficient signal for dashboards and alerting because they are compact and aggregatable.
#### Metric Types
Counter: A monotonically increasing value. Resets to zero when the process restarts. Use for: total requests, errors, bytes transferred.
# Prometheus counter example (Python prometheus_client)
from prometheus_client import Counter

http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
# Increment on each request
http_requests_total.labels(method='GET', endpoint='/api/users', status='200').inc()
Gauge: A value that goes up and down. Use for: current connections, memory usage, queue depth, temperature.
from prometheus_client import Gauge

active_connections = Gauge(
    'active_connections',
    'Number of active WebSocket connections'
)
active_connections.set(142)
active_connections.inc()  # 143
active_connections.dec()  # 142
Histogram: Measures the distribution of values (e.g., request latency). It counts observations into predefined buckets; quantiles are then estimated from the bucket counts at query time (for example, with PromQL's histogram_quantile()).
from prometheus_client import Histogram

request_duration = Histogram(
    'http_request_duration_seconds',
    'Request latency in seconds',
    ['endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)
# Record a latency observation; process_request() stands in for your handler
with request_duration.labels(endpoint='/api/users').time():
    process_request()
Summary: Similar to a histogram but computes quantiles (p50, p95, p99) on the client side. Less flexible than a histogram, because precomputed quantiles cannot be aggregated across instances, but more accurate for the specific quantiles it tracks.
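To make the histogram trade-off concrete, here is a rough sketch of how a quantile can be estimated from cumulative bucket counts by linear interpolation inside the matching bucket. This mirrors the general idea behind server-side quantile estimation but is not Prometheus' actual implementation; the `estimate_quantile` function and the sample counts are made up for illustration, and the `+Inf` bucket is left out for simplicity.

```python
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Assumes buckets are sorted by upper bound, counts are cumulative, and
    observations are spread evenly inside each bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")

    rank = q * total  # position of the requested observation
    prev_bound, prev_count = 0.0, 0
    for upper, cumulative in buckets:
        if cumulative >= rank:
            in_bucket = cumulative - prev_count
            if in_bucket == 0:
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, cumulative
    return buckets[-1][0]


# Hypothetical cumulative counts for four latency buckets (seconds).
sample_buckets = [(0.1, 50), (0.25, 90), (0.5, 99), (1.0, 100)]
p95 = estimate_quantile(0.95, sample_buckets)
print(f"estimated p95: {p95:.3f}s")
```

Because bucket counts are plain counters, they can be summed across instances before the estimate is made, which is exactly what a client-side summary cannot offer. The price is accuracy: the estimate can never be finer than the bucket boundaries.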
#### Key Metrics to Track
The RED Method (for request-driven services):
Rate: Requests per second
Errors: Failed requests per second
Duration: Latency distribution (p50, p95, p99)
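As a rough illustration, the three RED numbers can be derived from raw request records for one time window. The `RequestRecord` type, the choice to count only 5xx responses as errors, and the nearest-rank percentile method are assumptions of this sketch; a production system would normally compute these from counters and histograms rather than raw records.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestRecord:
    duration_ms: float
    status: int  # HTTP status code


def percentile(sorted_values: List[float], p: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def red_summary(records: List[RequestRecord], window_seconds: float) -> Dict[str, float]:
    """Compute Rate, Errors, and Duration for one observation window."""
    if not records:
        return {"rate_per_sec": 0.0, "errors_per_sec": 0.0}
    durations = sorted(r.duration_ms for r in records)
    errors = sum(1 for r in records if r.status >= 500)  # assumption: 5xx = failure
    return {
        "rate_per_sec": len(records) / window_seconds,
        "errors_per_sec": errors / window_seconds,
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


# 100 synthetic requests over a 10-second window, 5 of them failing.
window = [RequestRecord(float(i), 500 if i <= 5 else 200) for i in range(1, 101)]
print(red_summary(window, window_seconds=10.0))
```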
The USE Method (for resources):
Utilization: How busy is the resource? (CPU: 75%)
Saturation: How much queued work? (disk I/O queue depth: 12)
Errors: How many errors? (network packet drops: 5/sec)
The Four Golden Signals (Google SRE):
Latency: Time to serve a request
Traffic: Volume of requests
Errors: Rate of failed requests
Saturation: How "full" the service is (CPU, memory, connections)
In a microservices architecture, a single user request might touch 10 services. A trace follows that request across all services, showing exactly where time was spent.
#### Anatomy of a Trace
Trace ID: abc-123-def
├── Span: API Gateway (total: 250ms)
│ ├── Span: Auth Service - validate token (15ms)
│ ├── Span: User Service - get profile (80ms)
│ │ └── Span: PostgreSQL query (45ms)
│ ├── Span: Order Service - get recent orders (120ms)
│ │ ├── Span: Redis cache lookup (2ms) - MISS
│ │ └── Span: DynamoDB query (95ms)
│ └── Span: Response serialization (5ms)
A trace is a tree of spans. Each span represents a unit of work with a start time, duration, and metadata (tags, logs). Spans have parent-child relationships that form the call tree.
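The tree structure can be modelled directly. The sketch below rebuilds the example trace and finds the span with the most "self time" (its duration minus that of its direct children). The `Span` class and helper functions are hypothetical, and the self-time calculation assumes child spans run sequentially without overlapping, which real traces do not guarantee.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Span:
    name: str
    duration_ms: float
    children: List["Span"] = field(default_factory=list)

    def self_time_ms(self) -> float:
        """Duration not accounted for by direct children (assumes no overlap)."""
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def walk(span: Span) -> Iterator[Span]:
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def slowest_by_self_time(root: Span) -> Span:
    """Return the span where the most time was actually spent."""
    return max(walk(root), key=lambda s: s.self_time_ms())


trace = Span("API Gateway", 250, [
    Span("Auth Service - validate token", 15),
    Span("User Service - get profile", 80, [Span("PostgreSQL query", 45)]),
    Span("Order Service - get recent orders", 120, [
        Span("Redis cache lookup", 2),
        Span("DynamoDB query", 95),
    ]),
    Span("Response serialization", 5),
])

hotspot = slowest_by_self_time(trace)
print(hotspot.name, hotspot.self_time_ms())
```

Ranking by self time rather than total duration avoids always blaming the root span, which by construction contains everything beneath it.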
#### Context Propagation
For distributed tracing to work, each service must propagate the trace context (trace ID, span ID, parent span ID) to downstream calls. This is typically done via HTTP headers:
HTTP Request Headers:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: vendor=value
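A `traceparent` value has four dash-separated hex fields: a 2-character version, a 32-character trace ID, a 16-character parent (span) ID, and 2-character flags whose lowest bit marks the trace as sampled. The following is a minimal sketch of reading an incoming header and building the one for a downstream call. The `TraceContext` type and both function names are invented for this example, and only the version `00` layout is handled.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Build the header for an outgoing call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
if ctx is not None:
    print(child_traceparent(ctx))
```

The trace ID stays constant across every hop, while each service substitutes its own span ID so the next service can record it as the parent.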
This follows the W3C Trace Context standard.
#### OpenTelemetry
OpenTelemetry (OTel) is the emerging standard for observability instrumentation. It provides a single set of APIs, SDKs, and tools for generating logs, metrics, and traces across all major languages.
Architecture:
Application (OTel SDK) → OTel Collector → Backend (Jaeger, Zipkin, Datadog, etc.)
The Collector:
- Receives telemetry data from applications
- Processes it (batching, filtering, enrichment)
- Exports to one or more backends
Benefits:
- Vendor-agnostic: switch from Jaeger to Datadog without changing code
- Auto-instrumentation: libraries for HTTP, gRPC, databases, etc.
- Unified: logs, metrics, and traces through one framework
Bad alert: "CPU usage > 80%" (cause: maybe the CPU is busy doing useful work)
Good alert: "p99 latency > 500ms" (symptom: users are affected)
Good alert: "Error rate > 5%" (symptom: something is broken)
Good alert: "Successful requests < 100/min" (symptom: traffic has dropped)
Too many alerts desensitize the on-call engineer. Every alert should be actionable -- if you cannot do anything about it, it should not wake someone up at 3 AM.
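The symptom-based conditions above can be expressed as a small rule table evaluated against a snapshot of service metrics. This is only a sketch: the `AlertRule` type, the rule names, and the metric keys are hypothetical, and a real deployment would express these as rules in the alerting system rather than in application code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AlertRule:
    name: str
    is_firing: Callable[[Dict[str, float]], bool]


# Only user-facing symptoms are listed; CPU usage is deliberately absent.
RULES: List[AlertRule] = [
    AlertRule("HighP99Latency", lambda m: m["p99_latency_ms"] > 500),
    AlertRule("HighErrorRate", lambda m: m["error_rate"] > 0.05),
    AlertRule("LowSuccessfulTraffic", lambda m: m["successful_per_min"] < 100),
]


def firing_alerts(metrics: Dict[str, float]) -> List[str]:
    """Return the names of all rules whose condition currently holds."""
    return [rule.name for rule in RULES if rule.is_firing(metrics)]


snapshot = {
    "p99_latency_ms": 650.0,
    "error_rate": 0.01,
    "successful_per_min": 800.0,
    "cpu_percent": 92.0,  # high, but no rule looks at it
}
print(firing_alerts(snapshot))
```

In this snapshot the CPU is busy, yet the only alert is the latency one, which is the signal users actually feel.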
Severity levels:
P1 (Critical): Page the on-call engineer immediately.
Example: "Zero successful requests in the last 5 minutes"
P2 (High): Alert to Slack/PagerDuty, respond within 1 hour.
Example: "Error rate > 10% for 10 minutes"
P3 (Medium): Review next business day.
Example: "Disk usage > 80%"
P4 (Low): Informational, review weekly.
Example: "Certificate expires in 30 days"
Google's SRE Workbook introduces burn rate alerting for SLO-based monitoring. Instead of alerting on a fixed threshold, alert when you are consuming your error budget faster than expected.
SLO: 99.9% availability (error budget: 0.1% = 43.2 minutes per 30-day month)
Burn rate 1x: Consuming budget at exactly the expected rate → no alert
Burn rate 14x: Will exhaust the monthly budget in ~2 days → page immediately
Burn rate 6x: Will exhaust the monthly budget in ~5 days → alert to Slack
Use multiple windows (5-minute and 1-hour) to catch both sudden spikes and gradual degradation.
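The arithmetic behind burn rates is short enough to sketch. Burn rate is the observed error ratio divided by the error ratio the SLO allows, and the time to exhaust the budget is the SLO window divided by the burn rate. The function names, the 30-day window, and the 14x paging threshold below are assumptions of this example; the two-window check requires both the long and the short window to exceed the threshold before paging.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total allowed downtime in minutes for the SLO window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until the whole budget is gone if the burn rate stays constant."""
    return float("inf") if rate <= 0 else window_days / rate


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo: float,
                threshold: float = 14.0) -> bool:
    """Page only if both windows burn faster than the threshold.

    The long window (e.g. 1 hour) shows the burn is significant; the short
    window (e.g. 5 minutes) shows it is still happening right now.
    """
    return (burn_rate(long_window_error_ratio, slo) >= threshold
            and burn_rate(short_window_error_ratio, slo) >= threshold)


slo = 0.999
print(f"budget: {error_budget_minutes(slo):.1f} min per 30 days")
print(f"14x burn exhausts the budget in {days_to_exhaustion(14):.1f} days")
print(f"6x burn exhausts the budget in {days_to_exhaustion(6):.1f} days")
print("page?", should_page(0.02, 0.02, slo))
```

Requiring the short window as well means the alert clears soon after the problem stops, instead of continuing to fire until the long window has drained.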
Metrics: Prometheus (collection) + Grafana (visualization)
Logs: Loki or Elasticsearch (storage) + Grafana or Kibana (visualization)
Traces: Jaeger or Tempo (storage) + Grafana (visualization)
Alerting: Prometheus Alertmanager or Grafana Alerts
Datadog: All-in-one (metrics, logs, traces, APM). Expensive at scale.
New Relic: Strong APM and error tracking. Per-host pricing.
Honeycomb: Best for high-cardinality exploratory analysis.
Splunk: Powerful log search and analytics. Enterprise-focused.
AWS CloudWatch: Native AWS monitoring. Good for AWS-only environments.
The real power of observability comes from correlating logs, metrics, and traces:
1. Alert fires: "p99 latency > 500ms on Order Service"
2. Check metrics dashboard: latency spike started at 14:15
3. Check traces: slow traces show 400ms spent in PostgreSQL queries
4. Filter logs by trace_id: "Connection pool exhausted, waiting for available connection"
5. Root cause: deployment at 14:10 reduced max_connections from 50 to 5
Resolution: revert deployment, fix configuration, redeploy.
The trace_id is the glue. Include it in every log line and every span, and attach it to metrics as an exemplar rather than a label, since a per-request label value would explode metric cardinality. This enables jumping from a metric anomaly to the specific traces and logs that explain it.
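One way to get the trace ID into every log line without passing it through each function call is a context variable plus a logging filter. The sketch below uses only the Python standard library; the names `current_trace_id`, `TraceIdFilter`, `build_logger`, and `handle_request` are hypothetical, and in practice the ID would come from the incoming trace context rather than being passed in by hand.

```python
import contextvars
import logging

# Holds the trace ID for whatever request is being handled right now.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Stamp every log record with the current trace ID."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record


def build_logger(name: str = "order-service") -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s"
    ))
    handler.addFilter(TraceIdFilter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, trace_id: str) -> None:
    """Set the trace ID for the duration of one request, then restore it."""
    token = current_trace_id.set(trace_id)
    try:
        logger.info("Connection pool exhausted, waiting for available connection")
    finally:
        current_trace_id.reset(token)


handle_request(build_logger(), "abc-123-def")
```

With the ID on every line, step 4 of the walkthrough becomes a single filter on `trace_id` in the log store.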