Notification System

Design a multi-channel notification system supporting push, SMS, and email. Covers priority queues, rate limiting, user preferences, and delivery guarantees.

Designing a Notification System

Every major application needs notifications. A like on your photo, a shipped package, a suspicious login flagged by your bank: each is a notification, delivered through a different channel. Designing a notification system that handles multiple channels (push, SMS, email), respects user preferences, guarantees delivery, and does not spam users is a nuanced problem.

Requirements

Functional Requirements

  • Support multiple notification channels: mobile push (iOS/Android), SMS, email, and in-app.
  • Support different notification types: transactional (order confirmation), marketing (promotional), social (someone liked your post), and system alerts.
  • Users can configure per-channel preferences and opt out of specific notification types.
  • Notifications can be scheduled for future delivery.
  • Delivery tracking: know whether a notification was sent, delivered, opened, or failed.

Non-Functional Requirements

  • High throughput: on the order of 10 billion notifications per day at Facebook scale, or about 100 million per day for a mid-size platform.
  • No duplicates: a user should never receive the same notification twice.
  • Priority handling: a security alert must not wait behind a batch of marketing emails.
  • Retry on failure with exponential backoff.
  • Soft real-time: transactional notifications within seconds, marketing can tolerate minutes.
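To make the throughput requirement concrete, the daily volumes above can be converted to average per-second rates. This is only back-of-the-envelope arithmetic: real traffic is bursty, so peak rates will be some multiple of these averages, and that multiple is not specified here.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_per_second(per_day: int) -> float:
    """Average send rate implied by a daily notification volume."""
    return per_day / SECONDS_PER_DAY


for label, per_day in [
    ("mid-size platform", 100_000_000),
    ("Facebook scale", 10_000_000_000),
]:
    print(f"{label}: ~{average_per_second(per_day):,.0f} notifications/s on average")
```

Roughly a thousand per second versus over a hundred thousand per second is the difference between a handful of workers and a heavily partitioned pipeline.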

High-Level Architecture

                                     ┌──────────────┐
  Services ────── Notification API ─>│ Validation & │
  (Order Svc,                        │ Rate Limiter │
   Social Svc,                       └──────┬───────┘
   Auth Svc)                                │
                                     ┌──────┴───────┐
                                     │ Preference   │
                                     │ Lookup       │
                                     └──────┬───────┘
                                            │
                              ┌─────────────┼─────────────┐
                              ▼             ▼             ▼
                        ┌──────────┐ ┌──────────┐ ┌──────────┐
                        │ Priority │ │ Standard │ │   Bulk   │
                        │  Queue   │ │  Queue   │ │  Queue   │
                        └────┬─────┘ └────┬─────┘ └────┬─────┘
                             │            │            │
                             └────────────┼────────────┘
                                          │
                              ┌───────────┼───────────┐
                              ▼           ▼           ▼
                        ┌──────────┐ ┌────────┐ ┌────────┐
                        │Push Svc  │ │SMS Svc │ │Email   │
                        │(APNs/FCM)│ │(Twilio)│ │Svc     │
                        └──────────┘ └────────┘ └────────┘

Notification Flow

Step 1: Notification Request

An internal service (e.g., the Order Service) sends a notification request to the Notification API:

notification_request = {
    "user_id": "user_42",
    "type": "order_shipped",
    "template_id": "tmpl_order_shipped",
    "data": {
        "order_id": "ORD-12345",
        "tracking_number": "1Z999AA10123456784",
        "estimated_delivery": "2024-01-20"
    },
    "channels": ["push", "email"],  # Preferred channels
    "priority": "high",
    "idempotency_key": "order_shipped_ORD-12345"
}

Step 2: Validation and Rate Limiting

The system validates the request and applies rate limiting to prevent notification storms:

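import time

rate_limits = {
    "push": {"per_hour": 10, "per_day": 50},
    "sms": {"per_hour": 3, "per_day": 10},
    "email": {"per_hour": 5, "per_day": 20},
}

def current_hour():
    # Hours since the Unix epoch; identifies the current fixed window
    return int(time.time() // 3600)

def check_rate_limit(user_id, channel):
    # `redis` is assumed to be a connected client, e.g. redis.Redis()
    key = f"rate:{user_id}:{channel}:{current_hour()}"
    count = redis.incr(key)   # Atomic, so concurrent workers cannot undercount
    redis.expire(key, 3600)   # Let the counter expire once its window has passed
    # Only the hourly cap is enforced here; per_day needs a second counter
    return count <= rate_limits[channel]["per_hour"]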
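The rate_limits table defines both hourly and daily caps, while check_rate_limit enforces only the hourly one. The following is a minimal in-memory sketch of checking both fixed windows together. The `FixedWindowLimiter` class and its injectable `clock` are illustrative assumptions, not part of the design above; a production version would keep the counters in a shared store such as Redis and expire old windows.

```python
import time
from collections import defaultdict

RATE_LIMITS = {
    "push": {"per_hour": 10, "per_day": 50},
    "sms": {"per_hour": 3, "per_day": 10},
    "email": {"per_hour": 5, "per_day": 20},
}


class FixedWindowLimiter:
    """In-memory stand-in for shared counters; not safe across processes."""

    def __init__(self, limits, clock=time.time):
        self.limits = limits
        self.clock = clock  # injectable so tests can control time
        self.counts = defaultdict(int)

    def allow(self, user_id, channel):
        now = self.clock()
        hour_key = (user_id, channel, "h", int(now // 3600))
        day_key = (user_id, channel, "d", int(now // 86400))
        caps = self.limits[channel]
        # Check both windows before counting, so a rejected request
        # does not consume quota in either window.
        if self.counts[hour_key] >= caps["per_hour"]:
            return False
        if self.counts[day_key] >= caps["per_day"]:
            return False
        self.counts[hour_key] += 1
        self.counts[day_key] += 1
        return True
```

With the SMS limits above, a user can receive three messages in each of the first three hours of a day, one more in the fourth, and is then blocked by the daily cap even though the hourly window still has room. Note that this sketch never evicts old window keys.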

Step 3: User Preference Lookup

Before sending, check the user's notification preferences:

User Preferences (stored in database):
  user_42:
    push: enabled
    sms: disabled (user opted out)
    email: enabled
    quiet_hours: 22:00 - 08:00 (user's timezone)
    muted_types: ["marketing", "social_like"]

If the user has disabled SMS, the system skips that channel. If the current time in the user's timezone falls within their quiet hours, non-urgent notifications are delayed until the window ends.
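The preference check can be sketched as a function that filters the requested channels and, during quiet hours, returns a deferred send time. The `PREFS` layout, the `urgent` flag, and the function names are assumptions made for illustration. `local_now` is assumed to be already converted to the user's timezone, and daylight-saving edge cases are ignored.

```python
from datetime import datetime, time, timedelta

# Hypothetical in-memory copy of the stored preferences shown above.
PREFS = {
    "user_42": {
        "channels": {"push": True, "sms": False, "email": True},
        "quiet_hours": (time(22, 0), time(8, 0)),
        "muted_types": {"marketing", "social_like"},
    }
}


def in_quiet_hours(now: time, start: time, end: time) -> bool:
    if start <= end:
        return start <= now < end
    # The window wraps past midnight, e.g. 22:00 - 08:00.
    return now >= start or now < end


def resolve_delivery(user_id, notif_type, requested, urgent, local_now: datetime):
    """Return (channels, send_at); send_at is None for immediate delivery."""
    prefs = PREFS[user_id]
    if notif_type in prefs["muted_types"]:
        return [], None
    channels = [c for c in requested if prefs["channels"].get(c, False)]
    start, end = prefs["quiet_hours"]
    if urgent or not in_quiet_hours(local_now.time(), start, end):
        return channels, None
    # Defer to the next time the quiet window ends.
    send_at = local_now.replace(hour=end.hour, minute=end.minute, second=0, microsecond=0)
    if send_at <= local_now:
        send_at += timedelta(days=1)
    return channels, send_at
```

For example, an order update requested on push, SMS, and email at 23:15 resolves to push and email only, scheduled for 08:00 the next morning, while an urgent notification at the same time is sent immediately.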

Step 4: Queue and Prioritize

Notifications are placed into priority queues:

  • Priority queue (P0): Security alerts, 2FA codes, fraud detection. Processed immediately.
  • Standard queue (P1): Transactional notifications (order updates, payment confirmations). Processed within seconds.
  • Bulk queue (P2): Marketing emails, weekly digests. Processed in batches, can tolerate minutes of delay.

Priority Queue:  [2FA code for user_99] [Fraud alert for user_17]
Standard Queue:  [Order shipped for user_42] [Payment received for user_88]
Bulk Queue:      [Weekly digest batch_001] [Promo campaign_holiday_2024]
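One way to picture the three tiers is a router that maps each notification type to a queue, plus a consumer that always drains the highest-priority non-empty queue first. This is an in-memory sketch only: the type names in `QUEUE_FOR_TYPE` are hypothetical, and a real deployment would use separate topics or queues in a message broker.

```python
from collections import deque

# Hypothetical mapping from notification type to queue tier.
QUEUE_FOR_TYPE = {
    "security_alert": "priority",
    "2fa_code": "priority",
    "fraud_alert": "priority",
    "order_shipped": "standard",
    "payment_received": "standard",
    "weekly_digest": "bulk",
    "promo_campaign": "bulk",
}
DRAIN_ORDER = ("priority", "standard", "bulk")


class TieredQueues:
    def __init__(self):
        self.queues = {name: deque() for name in DRAIN_ORDER}

    def enqueue(self, notification):
        # Unknown types fall back to the standard tier.
        tier = QUEUE_FOR_TYPE.get(notification["type"], "standard")
        self.queues[tier].append(notification)
        return tier

    def next(self):
        """Pop from the highest-priority non-empty queue, or return None."""
        for name in DRAIN_ORDER:
            if self.queues[name]:
                return self.queues[name].popleft()
        return None
```

Strict priority like this can starve the bulk queue under sustained high-priority load. Dedicated worker pools per queue, or a weighted drain, are common ways around that.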

Step 5: Channel-Specific Delivery

Each channel has its own delivery service that handles the specifics:

Push Notifications:

  • iOS: Apple Push Notification Service (APNs)
  • Android: Firebase Cloud Messaging (FCM)
  • Must manage device tokens. Users may have multiple devices.

Email:

  • Use a transactional email provider (SendGrid, SES, Mailgun).
  • Template rendering with user-specific data.
  • Handle bounces and unsubscribes.

SMS:

  • Use a provider like Twilio or Amazon SNS.
  • Most expensive channel. Reserve for high-priority notifications.
  • Comply with opt-in regulations (TCPA in the US).
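Because a user may have several devices, the push service has to fan one notification out to every registered token and drop tokens the provider reports as no longer valid. A minimal sketch follows, assuming a hypothetical `provider_send` callable that returns `"ok"`, `"invalid_token"`, or some other error string. The real APNs and FCM clients have their own response formats, which are not modeled here.

```python
from collections import defaultdict


class DeviceRegistry:
    """Tracks the (platform, token) pairs registered for each user."""

    def __init__(self):
        self._tokens = defaultdict(set)

    def register(self, user_id, platform, token):
        self._tokens[user_id].add((platform, token))

    def remove(self, user_id, platform, token):
        self._tokens[user_id].discard((platform, token))

    def devices(self, user_id):
        # Returns a sorted copy, so callers may remove tokens while iterating.
        return sorted(self._tokens[user_id])


def send_push(registry, user_id, payload, provider_send):
    """Fan out to every device; return how many sends the provider accepted."""
    accepted = 0
    for platform, token in registry.devices(user_id):
        result = provider_send(platform, token, payload)
        if result == "ok":
            accepted += 1
        elif result == "invalid_token":
            # The app was uninstalled or the token rotated; stop sending to it.
            registry.remove(user_id, platform, token)
    return accepted
```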

Deduplication

Duplicate notifications are one of the worst user experiences. They occur when:

  • The producer retries a failed API call.
  • A message is consumed twice from the queue (at-least-once delivery).
  • A race condition causes the same event to trigger multiple notifications.

Idempotency Key

The most effective solution: require an idempotency key in every notification request. Before processing, check if this key has been seen:

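def process_notification(request):
    # `redis` is assumed to be a connected client, e.g. redis.Redis()
    dedup_key = f"dedup:{request['idempotency_key']}"

    # Claim the key atomically (SET NX with a 24-hour TTL). A separate
    # exists() check followed by a later write would let two workers
    # both pass the check and both send.
    if not redis.set(dedup_key, "1", nx=True, ex=86400):
        return  # Already processed or in progress

    try:
        send_notification(request)
    except Exception:
        # Release the claim so a retry can attempt delivery again
        redis.delete(dedup_key)
        raise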

Retry with Exponential Backoff

When a delivery attempt fails (e.g., APNs is temporarily unavailable), retry with exponential backoff and jitter:

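import random
import time

# deliver, update_status, move_to_dead_letter_queue, and TemporaryError
# are assumed to be defined elsewhere.
def send_with_retry(notification, max_retries=5):
    for attempt in range(max_retries):
        try:
            result = deliver(notification)
            if result.success:
                update_status(notification.id, "delivered")
                return
        except TemporaryError:
            pass  # Transient failure; fall through to the backoff below

        if attempt == max_retries - 1:
            break  # No further attempt follows, so skip the final sleep

        # Exponential backoff with jitter
        base_delay = 2 ** attempt  # 1, 2, 4, 8 seconds
        jitter = random.uniform(0, base_delay)
        delay = base_delay + jitter  # Between base_delay and 2 * base_delay
        time.sleep(delay)

    # All retries exhausted
    update_status(notification.id, "failed")
    move_to_dead_letter_queue(notification)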

The jitter is critical. Without it, if a downstream service goes down and recovers, all retrying clients would hit it simultaneously (thundering herd), potentially bringing it down again.
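To see the effect, the delay calculation can be pulled into a pure function and run for two clients that fail at the same moment. The `cap` parameter and the seeded `random.Random` instances are additions for illustration, not part of the retry code above. The seeds only make the example repeatable.

```python
import random


def backoff_delay(attempt, rng, cap=60.0):
    """Delay before retry `attempt` (0-based): a capped base plus up to one base of jitter."""
    base = min(cap, 2 ** attempt)
    return base + rng.uniform(0, base)


def schedule(seed, retries=5):
    rng = random.Random(seed)
    return [round(backoff_delay(a, rng), 2) for a in range(retries)]


# Two clients that failed at the same instant end up retrying at different times.
client_a = schedule(seed=1)
client_b = schedule(seed=2)
print(client_a)
print(client_b)
```

Each delay still falls between the base and twice the base, so the exponential shape is preserved while the retries are spread out. The cap keeps late attempts from waiting arbitrarily long.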

Dead Letter Queue

After exhausting retries, move the failed notification to a dead letter queue (DLQ). An operations team or automated system can investigate and reprocess these later.

Delivery Tracking

Track the lifecycle of every notification:

Notification Lifecycle:
  created → queued → sent → delivered → opened → clicked
                       ↘ failed → retrying → (delivered | dead_letter)

Table: notification_log
┌────────────┬──────────┬─────────┬──────────┬─────────────────────┬─────────┐
│ notif_id   │ user_id  │ channel │ status   │ timestamp           │ attempt │
├────────────┼──────────┼─────────┼──────────┼─────────────────────┼─────────┤
│ n_001      │ user_42  │ push    │ sent     │ 2024-01-15T10:30:00 │ 1       │
│ n_001      │ user_42  │ push    │ delivered│ 2024-01-15T10:30:01 │ 1       │
│ n_002      │ user_42  │ email   │ failed   │ 2024-01-15T10:30:05 │ 1       │
│ n_002      │ user_42  │ email   │ sent     │ 2024-01-15T10:30:08 │ 2       │
└────────────┴──────────┴─────────┴──────────┴─────────────────────┴─────────┘

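For push notifications, the APNs/FCM send response confirms only that the provider accepted the message; device-level delivery and open events generally have to be reported back by the client app. For email, use tracking pixels (for opens) and redirect links (for clicks). For SMS, rely on delivery receipts from the provider.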
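The lifecycle can be enforced as a small state machine so that out-of-order or duplicate callbacks cannot corrupt a notification's status. The sketch below encodes the transitions exactly as drawn in the lifecycle diagram. The `ALLOWED` table and `advance` function are illustrative names, and a real system might permit extra edges (for example, a retry that returns to `sent`, as the sample log rows suggest).

```python
ALLOWED = {
    "created": {"queued"},
    "queued": {"sent"},
    "sent": {"delivered", "failed"},
    "delivered": {"opened"},
    "opened": {"clicked"},
    "clicked": set(),
    "failed": {"retrying"},
    "retrying": {"delivered", "dead_letter"},
    "dead_letter": set(),
}


class InvalidTransition(Exception):
    pass


def advance(current: str, new: str) -> str:
    """Return the new status if the transition is allowed, else raise."""
    if new not in ALLOWED[current]:
        raise InvalidTransition(f"{current} -> {new}")
    return new


status = "created"
for step in ("queued", "sent", "failed", "retrying", "delivered"):
    status = advance(status, step)
print(status)
```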

Template System

Notifications use templates to maintain consistency and enable localization:

templates = {
    "tmpl_order_shipped": {
        "push": {
            "title": "Your order has shipped!",
            "body": "Order {{order_id}} is on its way. Track: {{tracking_number}}"
        },
        "email": {
            "subject": "Your order {{order_id}} has shipped",
            "html_template": "order_shipped.html"
        },
        "sms": {
            "body": "Your order {{order_id}} shipped. Track at: {{tracking_url}}"
        }
    }
}
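A renderer for the `{{name}}` placeholders can be sketched in a few lines. The `render` function and its fail-on-missing behavior are assumptions for illustration; a production system would more likely use an established template engine. Failing loudly is useful here: the SMS template above references `tracking_url`, which the example request in Step 1 does not supply.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, data: dict) -> str:
    """Replace each {{name}} with data[name]; raise if any name is missing."""
    missing = sorted({name for name in PLACEHOLDER.findall(template) if name not in data})
    if missing:
        raise KeyError(f"missing template variables: {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda m: str(data[m.group(1)]), template)


data = {"order_id": "ORD-12345", "tracking_number": "1Z999AA10123456784"}
push_body = render("Order {{order_id}} is on its way. Track: {{tracking_number}}", data)
print(push_body)
```

Rendering the SMS body with the same `data` raises `KeyError`, which surfaces the mismatch at validation time rather than sending a message with a blank link.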

Notification Aggregation

Avoid sending ten separate "X liked your post" notifications. Aggregate similar notifications:

Instead of:
  "Alice liked your photo"
  "Bob liked your photo"
  "Carol liked your photo"

Send:
  "Alice, Bob, and Carol liked your photo"
  or
  "Alice and 2 others liked your photo"

Implementation: buffer similar notifications for a short window (e.g., 5 minutes). If more arrive within the window, merge them into a single aggregated notification.
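The buffering approach might look like the following in-memory sketch. The `LikeAggregator` class, the `(recipient, target)` grouping key, and the rule of naming up to three actors before switching to "and N others" are all illustrative choices. A real implementation would keep the buffers in a shared store and flush them from a scheduled job.

```python
WINDOW_SECONDS = 300  # the 5-minute buffer window


def summarize(actors, target):
    if len(actors) == 1:
        who = actors[0]
    elif len(actors) == 2:
        who = f"{actors[0]} and {actors[1]}"
    elif len(actors) == 3:
        who = f"{actors[0]}, {actors[1]}, and {actors[2]}"
    else:
        who = f"{actors[0]} and {len(actors) - 1} others"
    return f"{who} liked your {target}"


class LikeAggregator:
    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.buffers = {}  # (recipient, target) -> {"first_seen": ts, "actors": [...]}

    def add(self, recipient, target, actor, now):
        buf = self.buffers.setdefault(
            (recipient, target), {"first_seen": now, "actors": []}
        )
        if actor not in buf["actors"]:  # ignore repeat likes from the same actor
            buf["actors"].append(actor)

    def flush_due(self, now):
        """Return one merged message per buffer whose window has elapsed."""
        due = [k for k, b in self.buffers.items() if now - b["first_seen"] >= self.window]
        out = []
        for recipient, target in due:
            buf = self.buffers.pop((recipient, target))
            out.append((recipient, summarize(buf["actors"], target)))
        return out
```

The window starts at the first event and is not extended by later ones, so the worst-case added latency is bounded by the window length.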

Scaling Considerations

Channel-Specific Workers

Scale each channel independently. SMS providers enforce strict per-sender throughput limits that depend on the number type (with Twilio, for example, a standard long code is allowed far fewer messages per second than a short code). Email sending has warm-up requirements for new IP addresses. Push notification throughput depends on APNs/FCM capacity.

Push workers: 50 instances (high throughput)
Email workers: 20 instances (moderate throughput, batching)
SMS workers: 5 instances (low throughput, expensive)

Database Partitioning

The notification log grows rapidly. Partition by user_id for efficient per-user queries, and by timestamp for time-range queries and data retention.
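One possible way to combine the two dimensions is to hash `user_id` to a shard and name a monthly partition within it. The shard count, the monthly granularity, and the naming scheme below are all hypothetical. `hashlib` is used rather than Python's built-in `hash()` because the latter is randomized per process for strings.

```python
import hashlib
from datetime import datetime

NUM_SHARDS = 64  # hypothetical shard count


def shard_for(user_id: str) -> int:
    """Stable shard index derived from the user ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def partition_name(user_id: str, ts: datetime) -> str:
    """Shard by user, then partition by calendar month within the shard."""
    return f"notification_log_s{shard_for(user_id):02d}_{ts:%Y_%m}"


print(partition_name("user_42", datetime(2024, 1, 15, 10, 30)))
```

All rows for one user land in the same shard, so per-user queries touch a single shard, and retention becomes a matter of dropping whole monthly partitions.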

Interview Tips

  • Start with the flow. Walk through a notification from creation to delivery. This gives the interviewer a clear mental model before you dive into specifics.
  • Emphasize deduplication. Interviewers will ask "what if a notification is sent twice?" Have the idempotency key answer ready.
  • Discuss priority explicitly. A 2FA code cannot wait behind a million marketing emails. Priority queues are essential.
  • Mention user preferences and quiet hours. This shows you think about the user experience, not just the technical architecture.
  • Explain exponential backoff with jitter. The jitter detail separates strong candidates from average ones.
  • Address compliance. CAN-SPAM for email, TCPA for SMS, GDPR for EU users. Briefly mentioning these shows maturity.
  • Discuss notification fatigue. Too many notifications cause users to disable them entirely. Rate limiting and aggregation are mitigations.