← back

Messaging & Async

Kafka and Event Streaming

Distributed commit log for high-throughput event streaming. Covers partitions, consumer groups, exactly-once semantics, and when Kafka is the right choice.

Kafka and Event Streaming

Apache Kafka is the backbone of event-driven architectures at companies like LinkedIn, Netflix, Uber, and Airbnb. It is not a traditional message queue. It is a distributed commit log designed for high-throughput, fault-tolerant, real-time data streaming. Understanding Kafka's architecture is essential for system design interviews because it appears in nearly every design that involves asynchronous communication at scale.

Kafka's Core Architecture

Brokers

A Kafka cluster consists of multiple brokers (servers). Each broker stores a subset of the data and serves client requests. Brokers coordinate through a controller (historically managed by ZooKeeper, now handled by Kafka's own Raft-based KRaft protocol). Adding more brokers increases storage capacity and throughput linearly.

Topics and Partitions

Data in Kafka is organized into topics (logical channels). Each topic is divided into partitions, which are the fundamental unit of parallelism and ordering.

1
2
3
4
5
Topic: "orders" (3 partitions)

  Partition 0: [msg0, msg1, msg4, msg7, ...]  → Broker 1
  Partition 1: [msg2, msg3, msg6, msg9, ...]  → Broker 2
  Partition 2: [msg5, msg8, msg10, ...]        → Broker 3

Each partition is an ordered, immutable sequence of records. Messages within a partition have a sequential offset (like an array index). Ordering is guaranteed within a partition but not across partitions.

Producers

Producers publish records to topics. They choose which partition to send each record to, either explicitly, by round-robin, or by hashing a message key. Key-based partitioning ensures all records with the same key (e.g., user ID, order ID) land in the same partition, preserving order for that entity.

1
2
3
4
5
6
# Pseudocode: producing messages
producer.send(
    topic="orders",
    key="user-42",       # All orders for user-42 go to the same partition
    value={"orderId": "abc", "items": [...], "total": 59.99}
)

Consumer Groups

Consumers read from topics. A consumer group is a set of consumers that cooperate to consume a topic. Each partition is assigned to exactly one consumer in the group, enabling parallel processing.

1
2
3
4
5
6
7
8
Topic: "orders" (4 partitions)

Consumer Group "order-processing":
  Consumer A → Partition 0, Partition 1
  Consumer B → Partition 2, Partition 3

Consumer Group "analytics":
  Consumer C → Partition 0, Partition 1, Partition 2, Partition 3

Key properties:

  • Adding consumers to a group (up to the number of partitions) increases throughput.
  • Multiple consumer groups can independently read the same topic. This is how Kafka enables pub/sub semantics.
  • If a consumer fails, its partitions are rebalanced to surviving consumers.

Replication

Each partition is replicated across multiple brokers for fault tolerance. One replica is the leader (handles all reads and writes), and the rest are followers (replicate data from the leader). The replication factor (typically 3) determines how many copies exist.

1
2
3
4
Partition 0:
  Leader:   Broker 1
  Follower: Broker 2  (in-sync replica)
  Follower: Broker 3  (in-sync replica)

If the leader fails, one of the in-sync replicas (ISR) is elected as the new leader. No data is lost as long as at least one ISR survives.

Exactly-Once Semantics

Kafka supports three delivery guarantees:

  • At-most-once: Messages may be lost but are never duplicated. The consumer commits the offset before processing.
  • At-least-once: Messages are never lost but may be duplicated. The consumer commits the offset after processing.
  • Exactly-once: Messages are processed exactly once. This is the hardest to achieve.

Kafka achieves exactly-once semantics through two mechanisms:

Idempotent Producer

The producer assigns a sequence number to each message. If a network failure causes a retry, the broker detects the duplicate sequence number and discards it. Enable with `enable.idempotence=true`.

Transactional Writes

For consume-transform-produce workflows, Kafka supports transactions. A consumer reads from an input topic, processes the data, writes to an output topic, and commits the consumer offset -- all atomically.

1
2
3
4
5
6
7
8
9
10
11
# Pseudocode: transactional consume-produce
producer.begin_transaction()
try:
    records = consumer.poll()
    for record in records:
        result = transform(record)
        producer.send(topic="output", value=result)
    producer.send_offsets_to_transaction(consumer.offsets(), consumer_group="my-group")
    producer.commit_transaction()
except Exception:
    producer.abort_transaction()

If any step fails, the entire transaction is aborted. Downstream consumers configured with `isolation.level=read_committed` will only see committed messages.

Log Compaction

By default, Kafka retains messages for a configured time period (e.g., 7 days) and then deletes them. Log compaction offers an alternative: Kafka retains only the latest value for each key, discarding older records with the same key.

1
2
3
4
5
6
7
8
9
10
11
Before compaction:
  offset 0: key=A, value=v1
  offset 1: key=B, value=v1
  offset 2: key=A, value=v2    ← newer value for key A
  offset 3: key=C, value=v1
  offset 4: key=B, value=v2    ← newer value for key B

After compaction:
  offset 2: key=A, value=v2
  offset 3: key=C, value=v1
  offset 4: key=B, value=v2

This is ideal for maintaining a changelog of the latest state. Use cases include database CDC (change data capture), configuration distribution, and maintaining materialized views.

When to Use Kafka vs Traditional Message Queues

FeatureKafkaTraditional Queue (RabbitMQ, SQS)
Message retentionPersistent log (days/weeks/forever)Deleted after consumption
Consumer modelPull-based, consumer groupsPush or pull, competing consumers
OrderingPer-partition orderingQueue-level ordering (limited)
ReplayYes, consumers can seek to any offsetNo, messages are gone after ack
ThroughputMillions of messages/secThousands to hundreds of thousands
Delivery semanticsAt-most-once, at-least-once, exactly-onceAt-most-once, at-least-once
Use caseEvent streaming, CDC, analytics pipelinesTask queues, work distribution

Choose Kafka when: You need event replay, multiple consumers reading the same data, high throughput, or stream processing.

Choose a traditional queue when: You need simple task distribution, message-level routing, priority queues, or delayed messages. RabbitMQ excels at complex routing with exchanges and bindings. SQS excels at zero-ops simplicity in AWS.

Kafka in Real-World Systems

LinkedIn

Kafka was born at LinkedIn. It handles over 7 trillion messages per day, powering activity feeds, metrics collection, and data pipeline synchronization across hundreds of microservices.

Uber

Uber uses Kafka to process real-time ride events, driver location updates, and pricing calculations. The event stream feeds into Apache Flink for real-time surge pricing computations.

Netflix

Netflix uses Kafka as the central nervous system for its data pipeline. Every user interaction (play, pause, search, browse) is published as an event. These events feed real-time dashboards, A/B test analysis, and the recommendation engine.

Capacity Planning

A useful back-of-the-envelope calculation for interviews:

1
2
3
4
5
6
7
8
9
10
11
12
Given:
  - 100,000 messages/sec
  - Average message size: 1 KB
  - Retention: 7 days
  - Replication factor: 3

Throughput: 100,000 × 1 KB = 100 MB/sec
Daily volume: 100 MB/sec × 86,400 sec = 8.64 TB/day
7-day retention: 8.64 × 7 = 60.5 TB
With replication: 60.5 × 3 = 181.5 TB total storage

If each broker has 10 TB of disk: ~19 brokers minimum

Common Pitfalls

  • Too few partitions: Partition count is the upper bound on consumer parallelism. You cannot have more active consumers than partitions.
  • Too many partitions: Each partition has overhead (file handles, memory, leader election time). Thousands of partitions per broker can degrade performance.
  • Large messages: Kafka is optimized for small messages (< 1 MB). For large payloads, store the data in object storage (S3) and publish a reference in Kafka.
  • Consumer lag: If consumers fall behind, they may read from disk instead of the OS page cache, dramatically reducing throughput. Monitor consumer lag and scale consumers accordingly.

Interview Tips

  • Draw the architecture. Sketch brokers, partitions, producer, consumer groups, and replication. This shows you understand how the pieces fit together.
  • Explain partition key choice. The partition key determines ordering and data locality. A poor key choice (e.g., a timestamp) leads to hot partitions. A good key choice (e.g., user ID) distributes load evenly while preserving per-user ordering.
  • Discuss exactly-once carefully. True exactly-once is only achievable within the Kafka ecosystem (consume-transform-produce). End-to-end exactly-once (including external systems) requires idempotent consumers.
  • Know the alternatives. Mention AWS Kinesis (managed, simpler, lower throughput), Apache Pulsar (multi-tenancy, tiered storage), and Google Pub/Sub (fully managed, global) as alternatives to Kafka.
  • Mention Kafka Streams and ksqlDB. For stream processing use cases, Kafka Streams is a lightweight library that runs inside your application. ksqlDB provides SQL-like syntax over streams. These are alternatives to heavier frameworks like Apache Flink or Spark Streaming.