Message Queues

Decouple producers from consumers using queues. Covers RabbitMQ, SQS, and when to use point-to-point queues vs publish-subscribe topics.

A message queue is a middleware component that enables asynchronous communication between services. Instead of Service A calling Service B directly (synchronous), Service A puts a message on a queue and moves on. Service B picks up the message later and processes it.

This decoupling is one of the most powerful patterns in distributed systems. It improves reliability (the producer does not fail if the consumer is down), scalability (consumers can be scaled independently), and resilience (messages are buffered during load spikes).

Why Use Message Queues?

Consider an e-commerce order flow without queues:

Synchronous (fragile):
  User places order -> API
    -> Charge payment (500ms)
    -> Reserve inventory (200ms)
    -> Send confirmation email (300ms)
    -> Update analytics (100ms)
  Total: 1100ms, all must succeed for order to complete

Now with a queue:

Asynchronous (resilient):
  User places order -> API
    -> Charge payment (500ms)
    -> Reserve inventory (200ms)
    -> Publish "order.created" event to queue
  Total: 700ms, user gets a response as soon as the critical steps finish

  Asynchronously:
    Email service picks up event -> sends email
    Analytics service picks up event -> records data

The user gets a faster response. If the email service is down, the message waits in the queue and is delivered once the service recovers. The payment and inventory steps, which are critical, remain synchronous.
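The split between synchronous and asynchronous steps can be sketched in a few lines. This is a minimal illustration only: an in-process `queue.Queue` stands in for a real broker, and `charge_payment`, `reserve_inventory`, `place_order`, and `drain_events` are hypothetical names, not part of any library.

```python
import queue

# Stand-in for a real broker; an in-process queue keeps the sketch runnable.
order_events = queue.Queue()

def charge_payment(order_id):
    """Hypothetical critical step; a real one would call a payment provider."""
    return True

def reserve_inventory(order_id):
    """Hypothetical critical step; a real one would update stock levels."""
    return True

def place_order(order_id):
    """Run the critical steps synchronously, then publish an event and return."""
    if not charge_payment(order_id):
        return "payment_failed"
    if not reserve_inventory(order_id):
        return "out_of_stock"
    order_events.put({"type": "order.created", "order_id": order_id})
    return "confirmed"

def drain_events(handler):
    """Consumer side: handle whatever is waiting; returns how many were handled."""
    handled = 0
    while True:
        try:
            event = order_events.get_nowait()
        except queue.Empty:
            return handled
        handler(event)
        handled += 1
```

`place_order` returns without waiting for email or analytics; those consumers call something like `drain_events` on their own schedule, so an outage on their side only delays their work.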

Messaging Models

Point-to-Point (Queue)

A message is sent to a queue and delivered to one consumer. Once that consumer has consumed it, the message is removed from the queue.

Producer -> [Queue] -> Consumer A
                    -> Consumer B  (only one gets each message)
                    -> Consumer C

Messages are distributed among consumers (competing consumers pattern).

Use case: Task distribution. You have 10,000 images to process and 5 worker instances. Each image is processed by exactly one worker.

Examples: Amazon SQS, RabbitMQ (default mode), Celery tasks.
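As a rough sketch of competing consumers, the following uses threads and Python's standard `queue.Queue` in place of a broker. `run_workers` is a hypothetical helper written for this example; the point is only that each task is claimed by exactly one worker.

```python
import queue
import threading

def run_workers(tasks, num_workers):
    """Process each task exactly once across several competing worker threads."""
    work = queue.Queue()
    for task in tasks:
        work.put(task)

    processed = []            # (worker_id, task) pairs
    lock = threading.Lock()   # guards the shared result list

    def worker(worker_id):
        while True:
            try:
                task = work.get_nowait()
            except queue.Empty:
                return  # nothing left to claim
            with lock:
                processed.append((worker_id, task))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed
```

Which worker gets which task is not deterministic, but the set of processed tasks always equals the set of submitted tasks, with no task handled twice.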

Publish-Subscribe (Pub/Sub)

A message is published to a topic and delivered to all subscribers. Each subscriber gets a copy of every message.

Publisher -> [Topic] -> Subscriber A (gets all messages)
                     -> Subscriber B (gets all messages)
                     -> Subscriber C (gets all messages)

Use case: Event notification. When an order is placed, the email service, analytics service, and inventory service all need to know.

Examples: Amazon SNS, Kafka topics, Google Cloud Pub/Sub, Redis Pub/Sub.
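A minimal in-memory sketch of the topic model, assuming subscribers are plain callbacks. The `Topic` class is hypothetical and delivers synchronously with no buffering, so a subscriber that registers late misses earlier messages.

```python
class Topic:
    """Minimal in-memory topic: every subscriber sees every message."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        """Register a callable that will receive each published message."""
        self._subscribers.append(callback)

    def publish(self, message):
        """Deliver a copy of the message to every subscriber; return the count."""
        for callback in self._subscribers:
            callback(message)
        return len(self._subscribers)
```

For example, after `topic.subscribe(email_handler)` and `topic.subscribe(analytics_handler)`, a single `topic.publish(event)` call invokes both handlers with the same event.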

Combining Both: Fan-Out

A common pattern is to combine pub/sub with queues. An event is published to a topic, which fans out to multiple queues — one per consuming service. Each queue is processed independently.

Publisher -> [Topic: order.created]
                -> [Queue: email-service]    -> Email Consumer
                -> [Queue: analytics-service] -> Analytics Consumer
                -> [Queue: inventory-service]  -> Inventory Consumer

In AWS, an SNS topic fans out to multiple SQS queues. In Kafka, a single topic is read by multiple consumer groups, each of which processes all messages independently.

This gives you the broadcast semantics of pub/sub with the durability and independent scaling of queues.
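The fan-out idea can be sketched with one buffer per consuming service. `FanOutTopic` and `bind` are hypothetical names for this illustration, and `collections.deque` stands in for durable queues.

```python
from collections import deque

class FanOutTopic:
    """Topic that copies each message into one queue per consuming service."""

    def __init__(self):
        self.queues = {}

    def bind(self, service_name):
        """Create (or return) the queue that belongs to a service."""
        return self.queues.setdefault(service_name, deque())

    def publish(self, message):
        """Append the message to every bound queue."""
        for q in self.queues.values():
            q.append(message)
```

Because each service drains its own queue, a slow analytics consumer builds up a backlog in its queue without delaying the email consumer.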

Delivery Guarantees

This is one of the most important concepts in messaging and a frequent interview topic.

At-Most-Once

The message is delivered zero or one times. If something goes wrong, the message may be lost.

Producer -> Queue -> Consumer
  If consumer crashes before processing -> message is lost
  (The broker already removed it from the queue upon delivery)

Implementation: The queue removes the message when it is delivered, without waiting for acknowledgment.

When acceptable: Metrics, logging, and other telemetry where losing a few data points is fine.

At-Least-Once

The message is delivered one or more times. It will never be lost, but it may be delivered multiple times.

Producer -> Queue -> Consumer -> ACK back to Queue
  If consumer processes but crashes before ACK:
    -> Queue re-delivers the message
    -> Consumer processes it again (duplicate)

Implementation: The queue keeps the message until the consumer explicitly acknowledges it. If no ACK within a timeout, the message is redelivered.
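The keep-until-ACK behavior can be sketched as follows. `AckQueue` is a hypothetical in-memory class, loosely modeled on a visibility timeout; the current time is passed in as `now` (an assumption made to keep the example deterministic) rather than read from a clock.

```python
import itertools

class AckQueue:
    """In-memory queue that redelivers messages not acknowledged in time."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._ready = []        # (receipt, body) pairs waiting for delivery
        self._in_flight = {}    # receipt -> (body, redelivery deadline)
        self._ids = itertools.count(1)

    def send(self, body):
        self._ready.append((next(self._ids), body))

    def receive(self, now):
        """Return (receipt, body), or None if nothing is deliverable."""
        # Anything whose deadline has passed becomes deliverable again.
        for receipt, (body, deadline) in list(self._in_flight.items()):
            if now >= deadline:
                del self._in_flight[receipt]
                self._ready.append((receipt, body))
        if not self._ready:
            return None
        receipt, body = self._ready.pop(0)
        self._in_flight[receipt] = (body, now + self.visibility_timeout)
        return receipt, body

    def ack(self, receipt):
        """Delete the message for good; only now is it really gone."""
        self._in_flight.pop(receipt, None)
```

A consumer that receives a message at time 0 and never acknowledges it will see the same message again on the first `receive` call at or after time 0 plus the timeout, which is exactly where duplicates come from.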

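This is the most common guarantee. SQS, RabbitMQ (with manual acknowledgments), and Kafka (when offsets are committed after processing) all provide at-least-once delivery in their typical configurations.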

Critical requirement: Consumers must be idempotent, meaning that processing the same message twice must produce the same result as processing it once. For example, "set balance to 100" is idempotent, but "add 10 to balance" is not (without an idempotency key).

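# db, queue, and charge_payment are placeholders for your own storage,
# broker client, and payment call.
def process_order(message):
    order_id = message["order_id"]

    # Idempotency check: have we already processed this order?
    if db.exists("processed_orders", order_id):
        # Duplicate delivery: acknowledge it so the broker stops redelivering
        queue.ack(message)
        return

    # Process the order. A crash between this call and the insert below
    # would still cause a second charge on redelivery, so charge_payment
    # should itself deduplicate on order_id.
    charge_payment(order_id)

    # Record that we processed it, then acknowledge the message
    db.insert("processed_orders", order_id)
    queue.ack(message)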
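One way to close the gap between "do the work" and "record that it was done" is to put both in a single database transaction guarded by a unique key. The sketch below uses Python's built-in `sqlite3`; the table names, the `amount` field, and the `make_db` helper are assumptions for illustration, and it only applies when the side effect lives in the same database as the deduplication table.

```python
import sqlite3

def make_db():
    """Create an in-memory database with a dedup table and a charges table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE processed_orders (order_id TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE charges (order_id TEXT, amount INTEGER)")
    return db

def process_order(db, message):
    """Apply the charge at most once per order_id. Returns True if work was done."""
    try:
        with db:  # one transaction: commits on success, rolls back on error
            # The primary key makes a second insert for the same order fail.
            db.execute("INSERT INTO processed_orders VALUES (?)",
                       (message["order_id"],))
            db.execute("INSERT INTO charges VALUES (?, ?)",
                       (message["order_id"], message["amount"]))
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery; safe to acknowledge and move on
```

When the side effect is a call to an external system, such as a payment provider, the same idea is usually expressed by sending the order ID as an idempotency key and letting that system reject repeats.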

Exactly-Once

The message is delivered exactly one time. This is the holy grail of messaging but is extremely difficult to achieve in distributed systems.

In theory: True exactly-once delivery across independent systems is impossible (it reduces to the Two Generals Problem). What systems actually provide is effectively exactly-once — at-least-once delivery combined with idempotent processing.

Kafka's approach: Kafka 0.11+ supports exactly-once semantics within the Kafka ecosystem using idempotent producers and transactional reads/writes. This works when both the source and sink are Kafka topics. When the consumer writes to an external database, you need application-level idempotency.

Practical advice: Design for at-least-once and make your consumers idempotent. This is simpler, more robust, and works across any messaging system.

Dead Letter Queues (DLQ)

When a message repeatedly fails processing (e.g., the consumer crashes or throws an error every time), it should not block the queue forever. After a configurable number of retries, the message is moved to a dead letter queue for investigation.

Main Queue -> Consumer -> fails -> retry 1 -> fails -> retry 2 -> fails
    -> Dead Letter Queue (for manual inspection and replay)

Best practices:

  • Set a reasonable retry count (e.g., 3-5 retries with exponential backoff)
  • Monitor the DLQ — messages in the DLQ indicate bugs or data issues
  • Build tooling to replay messages from the DLQ back to the main queue after fixing the issue
  • Include metadata (failure reason, timestamp, retry count) with each DLQ message
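The retry-then-dead-letter flow and the metadata advice above can be sketched as below. `MAX_ATTEMPTS`, `BASE_DELAY_SECONDS`, `backoff_delay`, and `consume` are hypothetical names, the queues are plain `deque` objects, and the backoff delay is only computed and recorded rather than actually waited for.

```python
import time
from collections import deque

MAX_ATTEMPTS = 3         # assumed budget: first try plus two retries
BASE_DELAY_SECONDS = 1   # assumed delay before the first retry

def backoff_delay(attempt):
    """Exponential backoff: 1s, 2s, 4s, ... after failed attempt 1, 2, 3, ..."""
    return BASE_DELAY_SECONDS * 2 ** (attempt - 1)

def consume(main_queue, dead_letters, handler):
    """Drain main_queue, retrying failures and dead-lettering after MAX_ATTEMPTS."""
    while main_queue:
        message = main_queue.popleft()
        try:
            handler(message["body"])
        except Exception as exc:
            # Attach failure metadata so the DLQ entry can be debugged later.
            message["attempts"] = message.get("attempts", 0) + 1
            message["last_error"] = repr(exc)
            message["failed_at"] = time.time()
            if message["attempts"] >= MAX_ATTEMPTS:
                dead_letters.append(message)
            else:
                # A real consumer would wait this long before retrying.
                message["retry_in"] = backoff_delay(message["attempts"])
                main_queue.append(message)
```

Healthy messages pass straight through; a message that fails every time ends up in `dead_letters` carrying its attempt count, last error, and failure timestamp, and the main queue is left empty.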

Backpressure

When producers publish messages faster than consumers can process them, the queue grows without bound. Left unchecked, this can exhaust the broker's memory or disk.

Strategies

Bounded queues: Set a maximum queue size. When full, either block the producer (producer waits until space is available) or reject new messages (producer gets an error).

Producer -> [Queue: max 10,000 messages] -> Consumer
  Queue full: producer blocks or gets error
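Both behaviors of a bounded queue can be shown with Python's standard `queue.Queue(maxsize=...)`. The helper names `try_publish` and `publish_blocking` are made up for this sketch; the underlying `put_nowait`, `put(..., timeout=...)`, and `queue.Full` are part of the standard library.

```python
import queue

def try_publish(q, message):
    """Reject-style backpressure: return False instead of waiting when full."""
    try:
        q.put_nowait(message)
        return True
    except queue.Full:
        return False

def publish_blocking(q, message, timeout_seconds):
    """Block-style backpressure: wait for space, give up after the timeout."""
    try:
        q.put(message, timeout=timeout_seconds)
        return True
    except queue.Full:
        return False
```

With `q = queue.Queue(maxsize=2)`, the third `try_publish` call returns `False` until a consumer removes an item. Rejecting pushes the decision to the producer (retry, shed load, or surface an error), while blocking slows the producer down to the consumer's pace.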

Rate limiting producers: Limit how fast producers can publish.

Auto-scaling consumers: Monitor queue depth and automatically add consumer instances when the queue grows. This is the most common approach in cloud environments.

Queue depth > 10,000 -> scale consumers from 2 to 10
Queue depth < 1,000  -> scale consumers back to 2
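The two thresholds above amount to a small decision function. This is a sketch with a hypothetical `desired_consumers` helper; the default values mirror the numbers in the example, and a real autoscaler would typically also add cooldown periods.

```python
def desired_consumers(queue_depth, current, low=1_000, high=10_000,
                      min_consumers=2, max_consumers=10):
    """Scale out above `high`, scale in below `low`, otherwise hold steady."""
    if queue_depth > high:
        return max_consumers
    if queue_depth < low:
        return min_consumers
    return current  # between thresholds: no change, which avoids flapping
```

Keeping a gap between the scale-out and scale-in thresholds means a queue hovering around 5,000 messages does not cause consumers to be added and removed repeatedly.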

Dropping messages: For non-critical data (e.g., analytics events), it may be acceptable to drop messages during extreme load.

Message Ordering

Why Ordering Is Hard

In a distributed queue with multiple consumers, messages may be processed out of order:

Producer sends: [M1, M2, M3]
Consumer A picks up M1, takes 500ms
Consumer B picks up M2, takes 100ms
Consumer C picks up M3, takes 200ms

Completion order: M2, M3, M1 (out of order!)

Solutions

Single consumer: Only one consumer processes messages. Ordering is preserved but throughput is limited.

Partitioned queues: Messages with the same key (e.g., user_id) are always routed to the same partition, and each partition has a single consumer. This preserves per-key ordering while allowing parallelism across keys.

Partition 0 (user_ids hashing to 0): [M1, M4] -> Consumer A
Partition 1 (user_ids hashing to 1): [M2, M5] -> Consumer B
Partition 2 (user_ids hashing to 2): [M3, M6] -> Consumer C

Ordering is preserved per user_id, not globally.

This is how Kafka works. Each topic has multiple partitions. Messages with the same key go to the same partition. Within a partition, order is guaranteed.
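Key-based partitioning can be sketched as a stable hash of the key modulo the partition count. The functions `partition_for` and `route` are hypothetical, and SHA-256 is used here only because it is stable across processes (Python's built-in `hash()` for strings is not); Kafka's own default partitioner uses a different hash function.

```python
import hashlib

def partition_for(key, num_partitions):
    """Map a key to a partition using a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def route(messages, num_partitions):
    """Append each (key, value) message to its partition, preserving arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions
```

All events for `"user-1"` land in the same list in the order they were sent, so a single consumer on that partition sees them in sequence. Note that changing `num_partitions` changes the mapping, which is why repartitioning a live topic can break per-key ordering.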

SQS FIFO queues: SQS offers FIFO queues that guarantee ordering within a message group ID (similar to Kafka's partition key). Standard SQS queues provide best-effort ordering only.

Comparing Message Queue Systems

RabbitMQ

A traditional message broker implementing the AMQP protocol.

Strengths:

  • Flexible routing with exchanges (direct, fanout, topic, headers)
  • Priority queues
  • Per-message TTL and dead letter exchanges
  • Good for task queues with complex routing

Weaknesses:

  • Throughput can decrease as queue depth increases (classic queues have historically kept messages in memory by default)
  • Scaling horizontally is more complex (clustering, federation)
  • Not designed for message replay (messages are deleted after consumption)

Best for: Task queues, RPC patterns, workloads with complex routing needs, moderate throughput (tens of thousands of messages per second).

Amazon SQS

A fully managed, serverless message queue.

Strengths:

  • Zero operational overhead — no servers to manage
  • Virtually unlimited throughput (standard queues)
  • Highly durable (messages replicated across AZs)
  • Integrates natively with Lambda, SNS, and other AWS services
  • FIFO queues for ordering guarantees

Weaknesses:

  • Higher latency than self-hosted solutions (~20-50ms)
  • FIFO queues limited to 3,000 messages/sec per queue by default (with batching; high-throughput mode raises this)
  • No message replay (once consumed and deleted, gone)
  • Limited to the AWS ecosystem

Best for: AWS-native applications, serverless architectures, workloads that need zero-ops queuing.

Apache Kafka

A distributed event streaming platform. Fundamentally different from traditional queues — it is a distributed commit log.

Strengths:

  • Extremely high throughput (millions of messages per second)
  • Messages are retained for a configurable period (days, weeks) — consumers can replay
  • Consumer groups allow pub/sub and competing consumer patterns simultaneously
  • Strong ordering guarantees within partitions
  • Exactly-once semantics (within Kafka ecosystem)
  • Built for stream processing (Kafka Streams, ksqlDB)

Weaknesses:

  • Operationally complex (ZooKeeper dependency in older versions, now KRaft)
  • Higher latency than RabbitMQ for individual messages
  • No per-message priority or routing (routing is by partition key only)
  • Overkill for simple task queues

Best for: High-throughput event streaming, event sourcing, log aggregation, real-time data pipelines, cases where message replay is needed.

Comparison Table

Feature    | RabbitMQ               | SQS                        | Kafka
Throughput | ~50K msg/s             | Virtually unlimited        | Millions msg/s
Latency    | Low (~1ms)             | Medium (~20-50ms)          | Low-Medium (~5-10ms)
Ordering   | Per-queue              | Per-message-group (FIFO)   | Per-partition
Replay     | No                     | No                         | Yes (retention period)
Delivery   | At-least-once          | At-least-once              | At-least-once / exactly-once
Routing    | Flexible (exchanges)   | Simple (queue per consumer)| Partition key only
Operations | Self-managed or hosted | Fully managed              | Self-managed or hosted
Best for   | Task queues, routing   | Serverless, AWS-native     | Event streaming, high throughput

Design Patterns

Request-Reply Over Queues

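Use two queues, a request queue and a reply queue, to implement async RPC. The client tags each request with a unique correlation ID, and the worker copies that ID onto the reply.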

Client -> [Request Queue] -> Worker -> [Reply Queue] -> Client
  (Client includes a correlation_id to match responses to requests)
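A minimal sketch of the correlation ID mechanics, using two in-process `queue.Queue` objects in place of broker queues. `send_request`, `worker_step`, and `collect_replies` are hypothetical helpers for this example.

```python
import queue
import uuid

request_queue = queue.Queue()
reply_queue = queue.Queue()

def send_request(payload):
    """Client side: enqueue a request and return its correlation_id."""
    correlation_id = str(uuid.uuid4())
    request_queue.put({"correlation_id": correlation_id, "payload": payload})
    return correlation_id

def worker_step(handler):
    """Worker side: handle one request and echo its correlation_id on the reply."""
    request = request_queue.get_nowait()
    reply_queue.put({"correlation_id": request["correlation_id"],
                     "result": handler(request["payload"])})

def collect_replies():
    """Client side: index all available replies by correlation_id."""
    replies = {}
    while not reply_queue.empty():
        reply = reply_queue.get_nowait()
        replies[reply["correlation_id"]] = reply["result"]
    return replies
```

The client can have several requests outstanding at once and still match each reply to its request, regardless of the order in which workers finish.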

Competing Consumers

Multiple consumers process messages from the same queue. The queue distributes messages among them. Scale consumers up or down based on load.

[Queue] -> Consumer 1 (processing M1)
        -> Consumer 2 (processing M2)
        -> Consumer 3 (processing M3)

Saga Pattern

For distributed transactions spanning multiple services, use a sequence of messages and compensating actions.

Order Service -> [Queue] -> Payment Service (charge)
  If payment fails -> [Queue] -> Order Service (cancel order)
  If payment succeeds -> [Queue] -> Inventory Service (reserve)
    If inventory fails -> [Queue] -> Payment Service (refund)
                       -> [Queue] -> Order Service (cancel order)

Each service is responsible for its own transaction and for publishing events that drive the next step (or trigger compensating actions).
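The compensation logic can be sketched in isolation. Note the simplification: this version replaces the queues with direct function calls and uses a single coordinator (`run_saga`, a hypothetical helper), whereas the flow above is driven by events passed between services. The rule it illustrates is the same: when a step fails, undo the completed steps in reverse order.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps; undo completed ones on failure.

    Returns (succeeded, log) where log lists what happened, in order.
    """
    log = []
    completed = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception:
            log.append(f"{name}: failed")
            # Compensate in reverse order, most recent step first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"{done_name}: compensated")
            return False, log
        log.append(f"{name}: ok")
        completed.append((name, compensation))
    return True, log
```

If the steps are create order, charge payment, and reserve inventory, and the inventory step raises, the log shows the payment being compensated (refund) before the order (cancel), matching the failure branch in the diagram.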

Interview Tips

Know when to use queues vs direct calls. Queues add latency and complexity. Use them when you need decoupling, buffering, or async processing. Do not use them for simple synchronous request-response where latency matters.

Always specify the delivery guarantee. Say "at-least-once delivery with idempotent consumers" rather than just "we will use a queue." This shows you understand the subtleties.

Discuss idempotency. This is the most common follow-up. If you say "at-least-once," the interviewer will ask how you handle duplicates. Have an answer ready (idempotency keys, deduplication tables).

Know Kafka vs SQS vs RabbitMQ. You do not need to be an expert in all three, but you should know which to recommend and why. Kafka for high-throughput event streaming, SQS for simple managed queues, RabbitMQ for flexible routing.

Address ordering. If your system requires message ordering (e.g., processing events for a single user in sequence), explain how you achieve it (partition key in Kafka, message group ID in SQS FIFO).

Mention dead letter queues. This shows you think about failure handling. "Failed messages go to a DLQ after 3 retries with exponential backoff, and we have monitoring to alert when the DLQ grows."

Connect to the bigger picture. Message queues are a building block. They enable event-driven architecture, CQRS, saga patterns, and stream processing. Mentioning these connections shows architectural maturity.

For the standalone design question ("Design a message queue"), cover: persistence (how messages are stored), delivery guarantees, ordering, consumer groups, dead letter handling, and scaling (partitioning).