Messaging & Async
Distributed commit log for high-throughput event streaming. Covers partitions, consumer groups, exactly-once semantics, and when Kafka is the right choice.
Apache Kafka is the backbone of event-driven architectures at companies like LinkedIn, Netflix, Uber, and Airbnb. It is not a traditional message queue. It is a distributed commit log designed for high-throughput, fault-tolerant, real-time data streaming. Understanding Kafka's architecture is essential for system design interviews because it appears in nearly every design that involves asynchronous communication at scale.
A Kafka cluster consists of multiple brokers (servers). Each broker stores a subset of the data and serves client requests. Brokers coordinate through a controller (historically managed by ZooKeeper, now handled by Kafka's own Raft-based KRaft protocol). Adding more brokers increases storage capacity and throughput linearly.
Data in Kafka is organized into topics (logical channels). Each topic is divided into partitions, which are the fundamental unit of parallelism and ordering.
Topic: "orders" (3 partitions)
Partition 0: [msg0, msg1, msg4, msg7, ...] → Broker 1
Partition 1: [msg2, msg3, msg6, msg9, ...] → Broker 2
Partition 2: [msg5, msg8, msg10, ...] → Broker 3Each partition is an ordered, immutable sequence of records. Messages within a partition have a sequential offset (like an array index). Ordering is guaranteed within a partition but not across partitions.
Producers publish records to topics. They choose which partition to send each record to, either explicitly, by round-robin, or by hashing a message key. Key-based partitioning ensures all records with the same key (e.g., user ID, order ID) land in the same partition, preserving order for that entity.
# Pseudocode: producing messages
producer.send(
topic="orders",
key="user-42", # All orders for user-42 go to the same partition
value={"orderId": "abc", "items": [...], "total": 59.99}
)Consumers read from topics. A consumer group is a set of consumers that cooperate to consume a topic. Each partition is assigned to exactly one consumer in the group, enabling parallel processing.
Topic: "orders" (4 partitions)
Consumer Group "order-processing":
Consumer A → Partition 0, Partition 1
Consumer B → Partition 2, Partition 3
Consumer Group "analytics":
Consumer C → Partition 0, Partition 1, Partition 2, Partition 3Key properties:
Each partition is replicated across multiple brokers for fault tolerance. One replica is the leader (handles all reads and writes), and the rest are followers (replicate data from the leader). The replication factor (typically 3) determines how many copies exist.
Partition 0:
Leader: Broker 1
Follower: Broker 2 (in-sync replica)
Follower: Broker 3 (in-sync replica)If the leader fails, one of the in-sync replicas (ISR) is elected as the new leader. No data is lost as long as at least one ISR survives.
Kafka supports three delivery guarantees:
Kafka achieves exactly-once semantics through two mechanisms:
The producer assigns a sequence number to each message. If a network failure causes a retry, the broker detects the duplicate sequence number and discards it. Enable with `enable.idempotence=true`.
For consume-transform-produce workflows, Kafka supports transactions. A consumer reads from an input topic, processes the data, writes to an output topic, and commits the consumer offset -- all atomically.
# Pseudocode: transactional consume-produce
producer.begin_transaction()
try:
records = consumer.poll()
for record in records:
result = transform(record)
producer.send(topic="output", value=result)
producer.send_offsets_to_transaction(consumer.offsets(), consumer_group="my-group")
producer.commit_transaction()
except Exception:
producer.abort_transaction()If any step fails, the entire transaction is aborted. Downstream consumers configured with `isolation.level=read_committed` will only see committed messages.
By default, Kafka retains messages for a configured time period (e.g., 7 days) and then deletes them. Log compaction offers an alternative: Kafka retains only the latest value for each key, discarding older records with the same key.
Before compaction:
offset 0: key=A, value=v1
offset 1: key=B, value=v1
offset 2: key=A, value=v2 ← newer value for key A
offset 3: key=C, value=v1
offset 4: key=B, value=v2 ← newer value for key B
After compaction:
offset 2: key=A, value=v2
offset 3: key=C, value=v1
offset 4: key=B, value=v2This is ideal for maintaining a changelog of the latest state. Use cases include database CDC (change data capture), configuration distribution, and maintaining materialized views.
| Feature | Kafka | Traditional Queue (RabbitMQ, SQS) |
|---|---|---|
| Message retention | Persistent log (days/weeks/forever) | Deleted after consumption |
| Consumer model | Pull-based, consumer groups | Push or pull, competing consumers |
| Ordering | Per-partition ordering | Queue-level ordering (limited) |
| Replay | Yes, consumers can seek to any offset | No, messages are gone after ack |
| Throughput | Millions of messages/sec | Thousands to hundreds of thousands |
| Delivery semantics | At-most-once, at-least-once, exactly-once | At-most-once, at-least-once |
| Use case | Event streaming, CDC, analytics pipelines | Task queues, work distribution |
Choose Kafka when: You need event replay, multiple consumers reading the same data, high throughput, or stream processing.
Choose a traditional queue when: You need simple task distribution, message-level routing, priority queues, or delayed messages. RabbitMQ excels at complex routing with exchanges and bindings. SQS excels at zero-ops simplicity in AWS.
Kafka was born at LinkedIn. It handles over 7 trillion messages per day, powering activity feeds, metrics collection, and data pipeline synchronization across hundreds of microservices.
Uber uses Kafka to process real-time ride events, driver location updates, and pricing calculations. The event stream feeds into Apache Flink for real-time surge pricing computations.
Netflix uses Kafka as the central nervous system for its data pipeline. Every user interaction (play, pause, search, browse) is published as an event. These events feed real-time dashboards, A/B test analysis, and the recommendation engine.
A useful back-of-the-envelope calculation for interviews:
Given:
- 100,000 messages/sec
- Average message size: 1 KB
- Retention: 7 days
- Replication factor: 3
Throughput: 100,000 × 1 KB = 100 MB/sec
Daily volume: 100 MB/sec × 86,400 sec = 8.64 TB/day
7-day retention: 8.64 × 7 = 60.5 TB
With replication: 60.5 × 3 = 181.5 TB total storage
If each broker has 10 TB of disk: ~19 brokers minimum