← back

Classic Designs

Chat System

Design a real-time messaging system like WhatsApp or Slack. Covers WebSocket connections, message ordering, presence, and offline message delivery.

Designing a Chat System

Designing a real-time chat system like WhatsApp or Slack is a staple of system design interviews. It tests your understanding of WebSockets, message ordering, delivery guarantees, presence detection, and the fan-out problem for group chats. The challenge is not just sending messages -- it is ensuring they arrive in order, exactly once, and even when the recipient is offline.

Requirements

Functional Requirements

  • One-on-one messaging between users.
  • Group chat (up to thousands of members for Slack-like systems, up to 256 for WhatsApp-like systems).
  • Message delivery: sent, delivered, and read receipts.
  • Presence detection (online/offline/typing indicators).
  • Offline message delivery (messages arrive when the user reconnects).
  • Message history and search.

Non-Functional Requirements

  • Low latency: messages should arrive within 200ms for online users.
  • Reliability: no messages should be lost. At-least-once delivery.
  • Ordering: messages in a conversation must appear in the correct order.
  • Scale: support 500 million daily active users (WhatsApp scale).

WebSocket Gateway

HTTP is request-response: the client must ask the server for new messages. For real-time chat, you need a persistent, bidirectional connection. WebSockets provide exactly this.

1
2
3
4
5
6
7
┌────────┐         WebSocket          ┌─────────────────┐
│ Client │ ◄══════════════════════► │ WebSocket Gateway │
└────────┘    persistent, full-duplex └────────┬────────┘
                                               │
                                        ┌──────┴──────┐
                                        │ Chat Service │
                                        └─────────────┘

Connection Management

Each user maintains a WebSocket connection to a gateway server. The gateway maintains an in-memory mapping of `user_id → connection`. When a message arrives for a user, the system looks up which gateway server holds their connection and routes the message there.

1
2
3
4
Connection Registry (Redis or in-memory):
  user_42 → gateway-server-3
  user_99 → gateway-server-1
  user_17 → gateway-server-3

At scale, you need many gateway servers behind a load balancer. Since WebSocket connections are stateful, you cannot use simple round-robin. Options:

  • Sticky sessions: Route a user to the same gateway server using consistent hashing on user ID.
  • Connection registry: Store the mapping in Redis. Any service that needs to send a message to a user queries the registry to find the right gateway.

Message Flow

One-on-One Messages

1
2
3
4
5
6
7
8
9
10
1. Alice sends a message to Bob via her WebSocket connection.
2. Gateway receives the message and forwards it to the Chat Service.
3. Chat Service:
   a. Generates a message ID and timestamp.
   b. Stores the message in the database.
   c. Looks up Bob's gateway in the connection registry.
   d. If Bob is online: pushes the message to Bob's gateway → Bob's client.
   e. If Bob is offline: stores in an offline queue.
4. Bob's client receives the message and sends a delivery acknowledgment.
5. Chat Service updates the message status to "delivered" and notifies Alice.

Message Storage

Chat messages are write-heavy (billions per day) and access patterns are sequential (users scroll through recent messages). This favors a wide-column store like Cassandra or HBase.

1
2
3
4
5
6
7
8
9
10
11
Table: messages
Partition Key: conversation_id
Clustering Key: message_id (time-ordered)

┌──────────────────┬────────────┬──────────┬─────────────────────┬───────────┐
│ conversation_id  │ message_id │ sender   │ timestamp           │ content   │
├──────────────────┼────────────┼──────────┼─────────────────────┼───────────┤
│ conv_alice_bob   │ msg_001    │ alice    │ 2024-01-15T10:30:00 │ "Hey!"    │
│ conv_alice_bob   │ msg_002    │ bob      │ 2024-01-15T10:30:05 │ "Hi!"     │
│ conv_alice_bob   │ msg_003    │ alice    │ 2024-01-15T10:30:10 │ "How are..│
└──────────────────┴────────────┴──────────┴─────────────────────┴───────────┘

Using `conversation_id` as the partition key ensures all messages in a conversation are stored on the same node, making range queries (fetching recent messages) efficient.

Message Ordering

Ordering is deceptively hard in distributed systems. Two messages sent at "the same time" from different clients may arrive at the server in different orders. Wall-clock timestamps are unreliable because clocks drift between machines.

Lamport Timestamps

A Lamport timestamp is a logical clock that provides a partial ordering of events. Each process maintains a counter. When sending a message, increment the counter and include it. When receiving a message, set your counter to `max(local_counter, received_counter) + 1`.

1
2
3
4
5
6
7
8
9
10
11
class LamportClock:
    def __init__(self):
        self.counter = 0

    def send(self):
        self.counter += 1
        return self.counter

    def receive(self, received_timestamp):
        self.counter = max(self.counter, received_timestamp) + 1
        return self.counter

Server-Assigned Ordering

A simpler approach for chat: the server assigns a monotonically increasing sequence number to each message within a conversation. Since all messages in a conversation flow through the chat service, the service can maintain a per-conversation counter. This is the approach WhatsApp uses.

1
2
3
Message from Alice → Server assigns seq=1 for conv_alice_bob
Message from Bob   → Server assigns seq=2 for conv_alice_bob
Message from Alice → Server assigns seq=3 for conv_alice_bob

The client displays messages sorted by sequence number, not by local timestamps.

Presence Detection

Heartbeat-Based Presence

The client sends a periodic heartbeat (e.g., every 30 seconds) to the server. If the server does not receive a heartbeat within the timeout window, it marks the user as offline.

1
2
3
Presence Store (Redis with TTL):
  user_42: { status: "online", last_seen: 1705312200, ttl: 60s }
  user_99: { status: "online", last_seen: 1705312195, ttl: 60s }

Fan-Out Presence Updates

When Alice comes online, who needs to know? Only users who have Alice in their contact list AND are currently online. Broadcasting presence to all of Alice's contacts would be wasteful for users with thousands of contacts.

Optimization: Only send presence updates to users who are actively viewing a chat with Alice. The client subscribes to presence for visible conversations, not all contacts.

Group Chat Fan-Out

Group messages require fan-out: one incoming message must be delivered to N group members. There are two approaches:

Write-Time Fan-Out (Small Groups)

When a message is sent to a group, the server creates a copy of the message in each member's inbox. This is WhatsApp's approach for groups up to 256 members.

1
2
3
4
Alice sends "Hello" to Group G (members: Alice, Bob, Carol, Dave)
→ Write to Bob's inbox: { from: "Alice", group: "G", content: "Hello" }
→ Write to Carol's inbox: { from: "Alice", group: "G", content: "Hello" }
→ Write to Dave's inbox: { from: "Alice", group: "G", content: "Hello" }

Pros: Read path is simple. Each user reads from their own inbox. Cons: Write amplification. A message to a 256-member group creates 255 writes.

Read-Time Fan-Out (Large Groups)

Store the message once in the group's message log. When a member opens the group, they read from the shared log. This is Slack's approach for channels with thousands of members.

1
2
3
4
5
Alice sends "Hello" to Channel #general (5000 members)
→ Write once to channel_general message log

When Bob opens #general:
→ Read from channel_general log where seq > Bob's last_read_seq

Pros: Single write regardless of group size. Cons: Read path is heavier. Notifications and unread counts require additional logic.

Hybrid Approach

Use write-time fan-out for small groups (< 100 members) and read-time fan-out for large channels. This is what most production systems do.

Offline Message Delivery

When a user is offline, messages must be queued and delivered when they reconnect.

1
2
3
4
5
6
7
1. Chat Service detects Bob is offline (not in connection registry).
2. Message is stored in the database (it always is) and Bob's unread counter is incremented.
3. A push notification is sent via APNs/FCM.
4. When Bob reconnects:
   a. Client sends its last known message sequence number.
   b. Server sends all messages with sequence numbers greater than that.
   c. Bob's client acknowledges receipt.

This "sync" protocol ensures no messages are lost, even if the push notification fails. The client is the ultimate source of truth for what it has received.

End-to-End Encryption

WhatsApp uses the Signal Protocol for end-to-end encryption. The server never sees plaintext messages. Key points for interviews:

  • Each user has a public/private key pair.
  • Messages are encrypted with the recipient's public key.
  • The server stores and forwards encrypted blobs.
  • Group encryption uses a shared group key, rotated when members join or leave.

This means the server cannot implement server-side search on message content. Clients must handle search locally.

High-Level Architecture

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
                        ┌──────────────────┐
    Clients ──────────> │  Load Balancer   │
                        └────────┬─────────┘
                                 │
              ┌──────────────────┼──────────────────┐
              ▼                  ▼                  ▼
       ┌────────────┐   ┌────────────┐   ┌────────────┐
       │  WS Gateway │   │  WS Gateway │   │  WS Gateway │
       │  Server 1  │   │  Server 2  │   │  Server 3  │
       └─────┬──────┘   └─────┬──────┘   └─────┬──────┘
             │                 │                 │
             └─────────────────┼─────────────────┘
                               │
                        ┌──────┴───────┐
                        │ Chat Service │
                        └──────┬───────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
       ┌────────────┐  ┌────────────┐  ┌────────────┐
       │  Message DB │  │   Redis    │  │Push Notif. │
       │ (Cassandra) │  │(Presence + │  │  Service   │
       │             │  │ Registry)  │  │(APNs/FCM)  │
       └─────────────┘  └────────────┘  └────────────┘

Interview Tips

  • Start with the WebSocket gateway. This is the most distinctive component. Explain why HTTP polling is insufficient and how you manage stateful connections across multiple servers.
  • Address ordering explicitly. Interviewers love to probe this. Mention server-assigned sequence numbers for simplicity, and Lamport timestamps if they push on distributed ordering.
  • Distinguish small group vs large channel strategies. This shows you understand the trade-offs between write amplification and read complexity.
  • Discuss the sync protocol for offline delivery. Explain how the client sends its last known sequence number on reconnect, and the server sends everything newer.
  • Mention push notifications as a complement, not a replacement. Push notifications alert the user; the sync protocol ensures completeness.
  • Scale numbers matter. If asked about WhatsApp scale (2 billion users, 100 billion messages/day), discuss how to partition the message store by conversation ID, shard the WebSocket gateways by user ID, and use regional deployments.