Design a multi-channel notification system supporting push, SMS, and email. Covers priority queues, rate limiting, user preferences, and delivery guarantees.
Every major application needs notifications. When someone likes your photo, when your package ships, when your bank detects a suspicious login -- these are all notifications delivered through different channels. Designing a notification system that handles multiple channels (push, SMS, email), respects user preferences, guarantees delivery, and does not spam users is a nuanced problem.
                                   ┌──────────────┐
Services ────── Notification API ─>│ Validation & │
(Order Svc,                        │ Rate Limiter │
 Social Svc,                       └──────┬───────┘
 Auth Svc)                                │
                                   ┌──────┴───────┐
                                   │  Preference  │
                                   │    Lookup    │
                                   └──────┬───────┘
                                          │
                            ┌─────────────┼─────────────┐
                            ▼             ▼             ▼
                       ┌──────────┐  ┌──────────┐  ┌──────────┐
                       │ Priority │  │ Standard │  │   Bulk   │
                       │  Queue   │  │  Queue   │  │  Queue   │
                       └────┬─────┘  └────┬─────┘  └────┬─────┘
                            │             │             │
                            └─────────────┼─────────────┘
                                          │
                              ┌───────────┼───────────┐
                              ▼           ▼           ▼
                        ┌──────────┐ ┌────────┐  ┌────────┐
                        │Push Svc  │ │SMS Svc │  │Email   │
                        │(APNs/FCM)│ │(Twilio)│  │Svc     │
                        └──────────┘ └────────┘  └────────┘
An internal service (e.g., the Order Service) sends a notification request to the Notification API:
notification_request = {
    "user_id": "user_42",
    "type": "order_shipped",
    "template_id": "tmpl_order_shipped",
    "data": {
        "order_id": "ORD-12345",
        "tracking_number": "1Z999AA10123456784",
        "estimated_delivery": "2024-01-20"
    },
    "channels": ["push", "email"],  # Preferred channels
    "priority": "high",
    "idempotency_key": "order_shipped_ORD-12345"
}
The system validates the request and applies rate limiting to prevent notification storms:
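import time

rate_limits = {
    "push": {"per_hour": 10, "per_day": 50},
    "sms": {"per_hour": 3, "per_day": 10},
    "email": {"per_hour": 5, "per_day": 20},
}

def current_hour():
    # Hour bucket since the Unix epoch; one counter key per user, channel, and hour
    return int(time.time() // 3600)

def check_rate_limit(user_id, channel):
    # `redis` is assumed to be a connected redis-py client.
    # Only the hourly limit is enforced here; per_day would need a second counter.
    key = f"rate:{user_id}:{channel}:{current_hour()}"
    count = redis.incr(key)
    if count == 1:
        # Set the TTL once, when the key is created, instead of on every call
        redis.expire(key, 3600)
    return count <= rate_limits[channel]["per_hour"]
Before sending, check the user's notification preferences: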
User Preferences (stored in database):
  user_42:
    push: enabled
    sms: disabled (user opted out)
    email: enabled
    quiet_hours: 22:00 - 08:00 (user's timezone)
    muted_types: ["marketing", "social_like"]
If the user has disabled SMS, the system skips that channel. If it is within quiet hours, non-urgent notifications are delayed until the window ends.
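The preference rules above can be sketched as a small planning function. This is a minimal illustration, not a definitive implementation: the `Preferences` class, `in_quiet_hours`, `plan_delivery`, and the `urgent` flag are hypothetical names, and the user's timezone is assumed to be available as a `tzinfo` object.

```python
from dataclasses import dataclass, field
from datetime import datetime, time, timedelta, timezone, tzinfo


@dataclass
class Preferences:
    # Hypothetical shape mirroring the preference record shown above.
    enabled_channels: set
    muted_types: set = field(default_factory=set)
    quiet_start: time = time(22, 0)
    quiet_end: time = time(8, 0)
    tz: tzinfo = timezone.utc  # the user's timezone (assumed to be stored)


def in_quiet_hours(prefs, now_utc):
    """Return True if now_utc falls inside the user's local quiet window."""
    local = now_utc.astimezone(prefs.tz).time()
    if prefs.quiet_start <= prefs.quiet_end:
        return prefs.quiet_start <= local < prefs.quiet_end
    # The window wraps past midnight (e.g., 22:00 - 08:00).
    return local >= prefs.quiet_start or local < prefs.quiet_end


def plan_delivery(prefs, notif_type, channels, urgent, now_utc):
    """Return (channels_to_use, action); action is 'send', 'delay', or 'drop'."""
    if notif_type in prefs.muted_types:
        return [], "drop"
    allowed = [c for c in channels if c in prefs.enabled_channels]
    if not allowed:
        return [], "drop"
    if not urgent and in_quiet_hours(prefs, now_utc):
        return allowed, "delay"
    return allowed, "send"
```

Urgent notifications such as 2FA codes bypass the quiet-hours check in this sketch; whether they should also bypass muted types is a product decision the text leaves open.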
Notifications are placed into priority queues:
Priority Queue: [2FA code for user_99] [Fraud alert for user_17]
Standard Queue: [Order shipped for user_42] [Payment received for user_88]
Bulk Queue: [Weekly digest batch_001] [Promo campaign_holiday_2024]
Each channel has its own delivery service that handles the specifics:
Push Notifications:
Email:
SMS:
Duplicate notifications are one of the worst user experiences. They occur when:
The most effective solution: require an idempotency key in every notification request. Before processing, check if this key has been seen:
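def process_notification(request):
    key = request["idempotency_key"]
    # SET with NX and EX claims the key atomically, so two workers that
    # receive the same request cannot both pass the check.
    # The TTL (86400 s = 24 hours) bounds how long keys are remembered.
    if not redis.set(f"dedup:{key}", "1", nx=True, ex=86400):
        return  # Already processed (or currently being processed)
    try:
        send_notification(request)
    except Exception:
        # Release the claim so a later retry is not silently dropped
        redis.delete(f"dedup:{key}")
        raise
When a delivery attempt fails (e.g., APNs is temporarily unavailable), retry with exponential backoff and jitter: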
import random
import time

def send_with_retry(notification, max_retries=5):
    for attempt in range(max_retries):
        try:
            result = deliver(notification)
            if result.success:
                update_status(notification.id, "delivered")
                return
        except TemporaryError:
            pass  # Transient failure: fall through to the backoff below
        if attempt == max_retries - 1:
            break  # No point sleeping after the final attempt
        # Exponential backoff with jitter
        base_delay = 2 ** attempt  # 1, 2, 4, 8 seconds between attempts
        jitter = random.uniform(0, base_delay)
        delay = base_delay + jitter
        time.sleep(delay)
    # All retries exhausted
    update_status(notification.id, "failed")
    move_to_dead_letter_queue(notification)
The jitter is critical. Without it, if a downstream service goes down and recovers, all retrying clients would hit it simultaneously (thundering herd), potentially bringing it down again.
After exhausting retries, move the failed notification to a dead letter queue (DLQ). An operations team or automated system can investigate and reprocess these later.
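To make the reprocessing step concrete, here is a minimal in-memory sketch of a dead letter queue. The `DeadLetter` and `DeadLetterQueue` names and the `deliver` callback are hypothetical; a real system would presumably back this with a durable queue or database table rather than a Python `deque`.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class DeadLetter:
    notification_id: str
    channel: str
    reason: str    # last error seen before giving up
    attempts: int  # delivery attempts made so far


class DeadLetterQueue:
    """In-memory stand-in for a durable dead letter queue."""

    def __init__(self):
        self._items = deque()

    def add(self, letter):
        self._items.append(letter)

    def __len__(self):
        return len(self._items)

    def reprocess(self, deliver):
        """Try each parked entry once and keep the ones that still fail.

        `deliver` is a caller-supplied function returning True on success.
        Returns the number of entries delivered on this pass.
        """
        recovered = 0
        for _ in range(len(self._items)):
            letter = self._items.popleft()
            if deliver(letter):
                recovered += 1
            else:
                letter.attempts += 1
                self._items.append(letter)
        return recovered
```

Keeping the failure reason and attempt count on each entry lets an operator separate transient provider outages from permanently bad requests (for example, an invalid device token) before replaying anything.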
Track the lifecycle of every notification:
Notification Lifecycle:
created → queued → sent → delivered → opened → clicked
↘ failed → retrying → (delivered | dead_letter)
Table: notification_log
┌────────────┬──────────┬─────────┬──────────┬─────────────────────┬─────────┐
│ notif_id │ user_id │ channel │ status │ timestamp │ attempt │
├────────────┼──────────┼─────────┼──────────┼─────────────────────┼─────────┤
│ n_001 │ user_42 │ push │ sent │ 2024-01-15T10:30:00 │ 1 │
│ n_001 │ user_42 │ push │ delivered│ 2024-01-15T10:30:01 │ 1 │
│ n_002 │ user_42 │ email │ failed │ 2024-01-15T10:30:05 │ 1 │
│ n_002 │ user_42 │ email │ sent │ 2024-01-15T10:30:08 │ 2 │
└────────────┴──────────┴─────────┴──────────┴─────────────────────┴─────────┘
For push notifications, the APNs/FCM response confirms only that the provider accepted the message; confirming that it reached the device generally requires an acknowledgement from the client app. For email, use tracking pixels (for opens) and redirect links (for clicks). For SMS, use delivery receipts from the provider.
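Because the log above holds one row per status change, a notification's current state is simply its most recent row. The following sketch derives that from the sample rows; `latest_status` is a hypothetical helper, and timestamps are assumed to be ISO 8601 strings as in the table.

```python
from datetime import datetime

# Rows copied from the sample notification_log table above:
# (notif_id, user_id, channel, status, timestamp, attempt)
LOG_ROWS = [
    ("n_001", "user_42", "push", "sent", "2024-01-15T10:30:00", 1),
    ("n_001", "user_42", "push", "delivered", "2024-01-15T10:30:01", 1),
    ("n_002", "user_42", "email", "failed", "2024-01-15T10:30:05", 1),
    ("n_002", "user_42", "email", "sent", "2024-01-15T10:30:08", 2),
]


def latest_status(rows):
    """Map each notif_id to its most recent (status, attempt) by timestamp."""
    latest = {}
    for notif_id, _user_id, _channel, status, timestamp, attempt in rows:
        when = datetime.fromisoformat(timestamp)
        # Later rows win; ties go to the row that appears last in the log.
        if notif_id not in latest or when >= latest[notif_id][0]:
            latest[notif_id] = (when, status, attempt)
    return {nid: (status, attempt) for nid, (_when, status, attempt) in latest.items()}
```

With the sample rows, `n_001` resolves to `delivered` and `n_002` to `sent` on its second attempt, which matches reading the table from top to bottom.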
Notifications use templates to maintain consistency and enable localization:
templates = {
    "tmpl_order_shipped": {
        "push": {
            "title": "Your order has shipped!",
            "body": "Order {{order_id}} is on its way. Track: {{tracking_number}}"
        },
        "email": {
            "subject": "Your order {{order_id}} has shipped",
            "html_template": "order_shipped.html"
        },
        "sms": {
            "body": "Your order {{order_id}} shipped. Track at: {{tracking_url}}"
        }
    }
}
Avoid sending ten separate "X liked your post" notifications. Aggregate similar notifications:
Instead of:
  "Alice liked your photo"
  "Bob liked your photo"
  "Carol liked your photo"
Send:
  "Alice, Bob, and Carol liked your photo"
  or
  "Alice and 2 others liked your photo"
Implementation: buffer similar notifications for a short window (e.g., 5 minutes). If more arrive within the window, merge them into a single aggregated notification.
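Scale each channel independently. SMS providers enforce strict throughput limits that depend on the sender type (e.g., a Twilio short code is commonly cited at around 100 messages/sec, while a standard long code is limited to far less). Email sending has warm-up requirements for new IP addresses. Push notification throughput depends on APNs/FCM capacity.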
Push workers: 50 instances (high throughput)
Email workers: 20 instances (moderate throughput, batching)
SMS workers: 5 instances (low throughput, expensive)
The notification log grows rapidly. Partition by user_id for efficient per-user queries, and by timestamp for time-range queries and data retention.
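One way to combine the two partitioning dimensions is a composite scheme: a monthly time partition subdivided into hash buckets of user_id. The sketch below only computes partition names; the bucket count, the naming pattern, and the helper functions are assumptions made for illustration, not a prescribed layout.

```python
import hashlib
from datetime import datetime

NUM_USER_BUCKETS = 64  # assumed bucket count, for illustration only
PREFIX = "notification_log_"


def user_bucket(user_id, num_buckets=NUM_USER_BUCKETS):
    """Stable hash bucket for a user (the built-in hash() is salted per process)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def partition_name(user_id, timestamp):
    """Partition for a log row, formatted notification_log_<YYYY>_<MM>_b<bucket>."""
    return f"{PREFIX}{timestamp:%Y_%m}_b{user_bucket(user_id):02d}"


def partitions_to_drop(names, cutoff):
    """Partitions whose month is strictly before the cutoff month."""
    cutoff_key = f"{cutoff:%Y_%m}"
    # Zero-padded year and month compare correctly as plain strings.
    return [n for n in names if n[len(PREFIX):len(PREFIX) + 7] < cutoff_key]
```

A per-user query then touches one bucket per month in the requested range, and retention becomes dropping whole monthly partitions instead of deleting rows one by one.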