Payment System

Design a payment system like Stripe. Covers idempotency, double-entry bookkeeping, payment state machines, webhooks, and reconciliation.

Designing a Payment System

Designing a payment system like Stripe is one of the most demanding system design questions because correctness is paramount. A bug in a social media feed shows the wrong post; a bug in a payment system loses real money. This question tests your understanding of idempotency, state machines, double-entry bookkeeping, webhook reliability, and the unique challenge of building a system where "at least once" and "at most once" are both unacceptable -- you need exactly-once effects, which in practice means at-least-once delivery combined with idempotent processing.

Requirements

Functional Requirements

  • Process payments: charge a customer's payment method (credit card, bank account) and transfer funds to a merchant.
  • Support the payment lifecycle: authorize, capture, void, and refund.
  • Provide webhooks to notify merchants of payment status changes.
  • Handle multiple currencies and payment methods.
  • Provide a ledger for transaction history and reconciliation.

Non-Functional Requirements

  • Correctness: No double charges. No lost payments. Every cent must be accounted for.
  • Availability: 99.999% uptime. Payment downtime directly costs merchants revenue.
  • Idempotency: Retrying a payment request must not result in duplicate charges.
  • Latency: Payment authorization should complete within 2 seconds.
  • Compliance: PCI DSS compliance for handling cardholder data.
  • Auditability: Every state change must be logged for regulatory and dispute purposes.

Capacity Estimation

Assumptions:
  - 10 million transactions per day (Stripe processes more, but this is a good starting point)
  - Average: 10M / 86,400 s = ~116 transactions/sec; peak: 5x average = ~580 transactions/sec
  - Average transaction record: 1 KB
  - Storage: 10M × 1 KB × 365 days = 3.65 TB/year
  - Ledger entries: at least 2 entries per transaction (double-entry), ~1 KB each = 7.3 TB/year
  - Webhook deliveries: 3-4 events per transaction = 30-40M events/day
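
The arithmetic behind these estimates can be checked with a short script. This is only a sketch: it assumes decimal units (1 KB = 1,000 bytes) and the minimum of two ledger entries per transaction, and all variable names are illustrative.

```python
# Back-of-envelope capacity math using the assumptions listed above.
SECONDS_PER_DAY = 24 * 60 * 60

tx_per_day = 10_000_000
peak_multiplier = 5
record_bytes = 1_000           # 1 KB per record (decimal units assumed)
ledger_entries_per_tx = 2      # double-entry minimum
events_per_tx = (3, 4)         # webhook events per transaction (low, high)

avg_tps = tx_per_day / SECONDS_PER_DAY
peak_tps = avg_tps * peak_multiplier

tx_storage_tb_per_year = tx_per_day * record_bytes * 365 / 1e12
ledger_storage_tb_per_year = tx_storage_tb_per_year * ledger_entries_per_tx
events_per_day = tuple(tx_per_day * n for n in events_per_tx)

print(f"average {avg_tps:.0f} tx/s, peak {peak_tps:.0f} tx/s")
print(f"transactions {tx_storage_tb_per_year:.2f} TB/year")
print(f"ledger {ledger_storage_tb_per_year:.2f} TB/year")
print(f"webhooks {events_per_day[0]:,} to {events_per_day[1]:,} events/day")
```

Even at peak, the write rate is modest for a relational database. The hard part of this design is correctness, not throughput.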

Idempotency Keys

The most critical concept in payment systems. Network failures are inevitable: a client sends a payment request, the server processes it, but the response is lost. The client retries. Without idempotency, the customer is charged twice.

How Idempotency Keys Work

The client generates a unique idempotency key (typically a UUID) for each logical payment and includes it in the request, reusing the same key on every retry. The server stores the key alongside the result. On retry, the server recognizes the key and returns the stored result without reprocessing.

class PaymentService:
    def process_payment(self, idempotency_key, amount, currency, customer_id):
        # 1. Fast path: if this key already has a stored result, return it
        existing = self.db.get_by_idempotency_key(idempotency_key)
        if existing:
            return existing.result  # return cached result, no reprocessing

        # 2. Acquire a lock on the idempotency key to prevent concurrent duplicates
        lock = self.lock_manager.acquire(f"idem:{idempotency_key}", ttl=30)
        if not lock:
            raise ConflictError("Request already in progress")

        try:
            # 3. Re-check under the lock: another request with the same key may
            #    have finished between the first lookup and the lock acquisition
            existing = self.db.get_by_idempotency_key(idempotency_key)
            if existing:
                return existing.result

            # 4. Process the payment
            result = self._charge_payment_method(amount, currency, customer_id)

            # 5. Store the result keyed by idempotency key
            self.db.save_idempotency_record(idempotency_key, result)

            return result
        finally:
            lock.release()

Key Design Decisions

  • Client-generated keys: The client (merchant's server) generates the key, ensuring the same logical operation always uses the same key, even across retries.
  • Key TTL: Idempotency keys expire after 24-48 hours. After that, the same key can be reused (though this is rare in practice).
  • Scope: Each idempotency key is scoped to a merchant account. Two different merchants can use the same key string without conflict.
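
The scoping and expiry rules above can be illustrated with a small in-memory store. This is a minimal sketch, not a production design: the `IdempotencyStore` class, its method names, and the injectable `clock` are all hypothetical, and a real system would back this with a database table and a unique constraint on the (merchant, key) pair.

```python
import time


class IdempotencyStore:
    """In-memory stand-in for a table keyed by (merchant_id, key)."""

    def __init__(self, ttl_seconds=24 * 60 * 60, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._records = {}         # (merchant_id, key) -> (expires_at, result)

    def get(self, merchant_id, key):
        record = self._records.get((merchant_id, key))
        if record is None:
            return None
        expires_at, result = record
        if self._clock() >= expires_at:
            # Expired keys behave as if they had never been seen.
            del self._records[(merchant_id, key)]
            return None
        return result

    def save(self, merchant_id, key, result):
        expires_at = self._clock() + self._ttl
        self._records[(merchant_id, key)] = (expires_at, result)


# Usage with a fake clock so the TTL can be exercised without waiting.
fake_now = [0.0]
store = IdempotencyStore(ttl_seconds=60, clock=lambda: fake_now[0])
store.save("merchant_a", "key-1", {"charge_id": "ch_1"})

same_merchant = store.get("merchant_a", "key-1")   # stored result
other_merchant = store.get("merchant_b", "key-1")  # None: different scope

fake_now[0] = 61.0
after_expiry = store.get("merchant_a", "key-1")    # None: key has expired
```

Including the merchant ID in the lookup key is what lets two merchants use the same key string without colliding.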

Double-Entry Bookkeeping

In accounting, every transaction is recorded as at least two entries: one or more debits and one or more credits. Within each transaction, and therefore across the whole ledger, the sum of all debits must equal the sum of all credits. This invariant makes it easy to detect errors and ensures the system is always balanced.

Payment of $100 from Customer to Merchant:

  ┌───────────────────┬──────────┬──────────┐
  │ Account           │ Debit    │ Credit   │
  ├───────────────────┼──────────┼──────────┤
  │ Customer Wallet   │          │ $100.00  │  ← money leaves customer
  │ Merchant Balance  │ $97.00   │          │  ← merchant receives $97
  │ Platform Revenue  │ $3.00    │          │  ← platform takes $3 fee
  ├───────────────────┼──────────┼──────────┤
  │ TOTAL             │ $100.00  │ $100.00  │  ← balanced!
  └───────────────────┴──────────┴──────────┘

Refund of $100:
  ┌───────────────────┬──────────┬──────────┐
  │ Account           │ Debit    │ Credit   │
  ├───────────────────┼──────────┼──────────┤
  │ Customer Wallet   │ $100.00  │          │  ← money returns to customer
  │ Merchant Balance  │          │ $97.00   │  ← merchant gives back $97
  │ Platform Revenue  │          │ $3.00    │  ← platform gives back fee
  ├───────────────────┼──────────┼──────────┤
  │ TOTAL             │ $100.00  │ $100.00  │  ← balanced!
  └───────────────────┴──────────┴──────────┘

class Ledger:
    def record_payment(self, tx_id, amount, fee, customer_id, merchant_id):
        entries = [
            LedgerEntry(tx_id, account=f"customer:{customer_id}",
                       credit=amount, debit=0),
            LedgerEntry(tx_id, account=f"merchant:{merchant_id}",
                       credit=0, debit=amount - fee),
            LedgerEntry(tx_id, account="platform:revenue",
                       credit=0, debit=fee),
        ]
        # All entries are written in a single transaction
        self.db.insert_batch(entries)

    def verify_balance(self):
        total_debits = self.db.sum_all_debits()
        total_credits = self.db.sum_all_credits()
        # Raise explicitly: assert statements are skipped under `python -O`
        if total_debits != total_credits:
            raise RuntimeError("LEDGER IMBALANCE DETECTED")

Why double-entry? If you just increment and decrement balances, a bug or crash could leave the system in an inconsistent state (money created or destroyed). Double-entry bookkeeping makes imbalances immediately detectable. Run the balance verification as a scheduled job and alert if it ever fails.
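
To make the invariant concrete, the sketch below builds the $100 payment and its refund from the tables above and checks that each set of entries balances. It assumes amounts are stored as integer cents to avoid floating-point rounding. The `Entry` type and the helper functions are illustrative names, not part of the `Ledger` class shown earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    tx_id: str
    account: str
    debit: int = 0    # amounts in cents
    credit: int = 0


def payment_entries(tx_id, amount, fee, customer_id, merchant_id):
    # Same debit/credit convention as the tables above.
    return [
        Entry(tx_id, f"customer:{customer_id}", credit=amount),
        Entry(tx_id, f"merchant:{merchant_id}", debit=amount - fee),
        Entry(tx_id, "platform:revenue", debit=fee),
    ]


def refund_entries(refund_tx_id, original_entries):
    # A full refund mirrors the original: debits and credits swap sides.
    return [
        Entry(refund_tx_id, e.account, debit=e.credit, credit=e.debit)
        for e in original_entries
    ]


def is_balanced(entries):
    return sum(e.debit for e in entries) == sum(e.credit for e in entries)


def net_by_account(entries):
    # Net position per account: debits minus credits.
    totals = {}
    for e in entries:
        totals[e.account] = totals.get(e.account, 0) + e.debit - e.credit
    return totals


payment = payment_entries("tx_1", 10_000, 300, "c1", "m1")
refund = refund_entries("tx_1_refund", payment)

print(is_balanced(payment), is_balanced(refund))
print(net_by_account(payment + refund))
```

After a payment followed by a full refund, every account nets to zero, which is the property the scheduled balance check relies on.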

Payment State Machine

A payment goes through a well-defined lifecycle. Modeling it as an explicit state machine prevents invalid transitions and ensures every state change is auditable.

Payment State Machine:

  CREATED → PENDING → AUTHORIZED → CAPTURED → SETTLED
                │           │           │
                ▼           ▼           ▼
             FAILED       VOIDED     REFUNDED
                                   (partial or full)

States:
  CREATED:    Payment intent created, not yet submitted to payment processor.
  PENDING:    Submitted to payment processor, awaiting response.
  AUTHORIZED: Funds reserved on the customer's card (not yet charged).
  CAPTURED:   Funds actually charged and transferred.
  SETTLED:    Funds deposited into the merchant's bank account.
  FAILED:     Payment processor declined the transaction.
  VOIDED:     Authorization cancelled before capture.
  REFUNDED:   Captured payment returned to the customer.

Two-Phase Payment: Authorize then Capture

Many merchants use a two-phase flow. When a customer places an order, the system authorizes the payment (reserves the funds). Later, when the order ships, the system captures the payment (actually charges the card). This prevents charging for items that are out of stock or cannot be fulfilled.

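from dataclasses import dataclass
from datetime import datetime, timezone


class InvalidTransitionError(Exception):
    """Raised when a requested state change is not allowed."""


@dataclass
class PaymentEvent:
    payment_id: str
    state: str


class Payment:
    # FAILED, VOIDED, and REFUNDED are terminal: no outgoing transitions.
    VALID_TRANSITIONS = {
        "CREATED": ["PENDING"],
        "PENDING": ["AUTHORIZED", "FAILED"],
        "AUTHORIZED": ["CAPTURED", "VOIDED"],
        "CAPTURED": ["SETTLED", "REFUNDED"],
        "SETTLED": ["REFUNDED"],
    }

    def __init__(self, payment_id, webhook_queue):
        self.id = payment_id
        self.state = "CREATED"
        self.updated_at = datetime.now(timezone.utc)
        self.audit_log = []
        self.webhook_queue = webhook_queue  # assumed to expose enqueue(event)

    def transition(self, new_state, metadata=None):
        if new_state not in self.VALID_TRANSITIONS.get(self.state, []):
            raise InvalidTransitionError(
                f"Cannot go from {self.state} to {new_state}"
            )
        old_state = self.state
        self.state = new_state
        self.updated_at = datetime.now(timezone.utc)

        # Log every state change for auditing
        self.audit_log.append({
            "from": old_state,
            "to": new_state,
            "timestamp": self.updated_at,
            "metadata": metadata,
        })

        # Trigger webhooks
        self.webhook_queue.enqueue(PaymentEvent(self.id, new_state))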

Webhook Delivery

Merchants need to know when payment states change (e.g., a payment succeeded, a refund was processed). Webhooks are HTTP callbacks: your system POSTs an event to a URL the merchant configured.

Reliable Webhook Delivery

Webhooks must be delivered at least once. Network failures, merchant server downtime, and timeouts are common. The system must retry failed deliveries.

Webhook Delivery Pipeline:

  Payment State Change → Webhook Queue (persistent)
                              │
                              ▼
                     Webhook Delivery Worker
                              │
                    ┌─────────┼─────────┐
                    ▼         ▼         ▼
                 Success   Timeout    5xx Error
                 (200 OK)  (>10s)    (500, 502)
                    │         │         │
                    ▼         ▼         ▼
                  Done     Retry      Retry
                          (exponential backoff)

Retry schedule:
  Attempt 1: immediately
  Attempt 2: 5 minutes later
  Attempt 3: 30 minutes later
  Attempt 4: 2 hours later
  Attempt 5: 8 hours later
  Attempt 6: 24 hours later
  After 6 failures: mark as failed, alert merchant via dashboard
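
The retry schedule can be expressed as a lookup table. The sketch below assumes each delay is measured from the previous failed attempt, which the schedule does not state explicitly. `RETRY_DELAYS` and `next_attempt_at` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Delay before attempts 1..6, taken from the schedule above.
# Assumption: each delay is measured from the previous attempt.
RETRY_DELAYS = [
    timedelta(0),
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=8),
    timedelta(hours=24),
]


def next_attempt_at(failed_attempts, last_attempt_at):
    """Return when to try again, or None once all attempts are used up.

    failed_attempts: how many delivery attempts have been made so far.
    """
    if failed_attempts >= len(RETRY_DELAYS):
        return None  # give up: mark as failed and alert the merchant
    return last_attempt_at + RETRY_DELAYS[failed_attempts]


first_try = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
second_try = next_attempt_at(1, first_try)   # 5 minutes after the first
gave_up = next_attempt_at(6, first_try)      # None after six failures
```

Storing the next attempt time with each queued event lets a worker poll for due deliveries instead of sleeping, so a worker restart does not lose the schedule.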

Webhook Security

  • Signature verification: Each webhook includes an HMAC signature computed with a shared secret. The merchant verifies the signature to ensure the webhook came from your system, not an attacker.
  • Timestamp validation: Include a timestamp in the signed message. Merchants should reject webhooks whose timestamp is more than 5 minutes old to prevent replay attacks. Each retry must therefore be signed with a fresh timestamp, or later attempts in the retry schedule would be rejected.

import hmac
import hashlib
import time

def sign_webhook(payload, secret):
    timestamp = str(int(time.time()))
    message = f"{timestamp}.{payload}"
    signature = hmac.new(
        secret.encode(), message.encode(), hashlib.sha256
    ).hexdigest()
    return timestamp, signature

# Merchant verification:
def verify_webhook(payload, timestamp, signature, secret):
    if abs(time.time() - int(timestamp)) > 300:  # 5 minute tolerance
        return False
    expected = hmac.new(
        secret.encode(), f"{timestamp}.{payload}".encode(), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(signature, expected)
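
Because delivery is at least once, a merchant's handler also has to tolerate duplicates. The sketch below combines signature verification with duplicate detection on the merchant side. It assumes each event payload is JSON with a unique `id` field, which the text above does not specify. `WebhookReceiver` and `sign` are illustrative names, and the in-memory set stands in for a durable store.

```python
import hashlib
import hmac
import json
import time


def sign(payload, timestamp, secret):
    message = f"{timestamp}.{payload}".encode()
    return hmac.new(secret.encode(), message, hashlib.sha256).hexdigest()


class WebhookReceiver:
    def __init__(self, secret, tolerance_seconds=300):
        self.secret = secret
        self.tolerance = tolerance_seconds
        self.seen_event_ids = set()  # a durable store in a real system

    def handle(self, payload, timestamp, signature):
        """Return "rejected", "duplicate", or "processed"."""
        if abs(time.time() - int(timestamp)) > self.tolerance:
            return "rejected"
        expected = sign(payload, timestamp, self.secret)
        if not hmac.compare_digest(signature, expected):
            return "rejected"
        # Parse only after the signature checks out.
        event_id = json.loads(payload)["id"]  # assumed unique per event
        if event_id in self.seen_event_ids:
            return "duplicate"  # acknowledge, but do not act twice
        self.seen_event_ids.add(event_id)
        return "processed"


secret = "whsec_example"
payload = json.dumps({"id": "evt_1", "state": "CAPTURED"})
timestamp = str(int(time.time()))
signature = sign(payload, timestamp, secret)

receiver = WebhookReceiver(secret)
first = receiver.handle(payload, timestamp, signature)
second = receiver.handle(payload, timestamp, signature)
forged = receiver.handle(payload, timestamp, "0" * 64)
```

A duplicate should still be acknowledged with a success response. Otherwise the sender keeps retrying an event the merchant has already handled.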

PCI Compliance

The Payment Card Industry Data Security Standard (PCI DSS) governs how cardholder data (card numbers, expiration dates) and sensitive authentication data (such as the CVV) must be handled. PCI compliance is not optional -- it is contractually required by the card networks and acquiring banks for any system that stores, processes, or transmits cardholder data.

Key Principles

  • Minimize scope: The fewer systems that touch cardholder data, the smaller the PCI audit scope. This is a large part of why Stripe exists: the customer's browser sends card data directly to Stripe through its frontend SDK, and Stripe returns a token. The merchant's backend never sees the actual card number.
  • Tokenization: Replace sensitive card data with a non-sensitive token. The token is meaningless outside your system. Store the mapping (token → card) in a heavily secured, isolated vault.
  • Encryption: All cardholder data must be encrypted at rest (e.g., AES-256) and in transit (TLS 1.2+).
  • Access control: Restrict access to cardholder data to the minimum number of personnel and systems. Log all access.

Tokenization Flow:

  Customer Browser → Stripe.js (client-side SDK)
                         │
                         ▼
                    Stripe Token Service (PCI-compliant)
                         │ returns token: "tok_abc123"
                         ▼
                    Merchant Backend (never sees card number)
                         │ sends token + amount to Stripe API
                         ▼
                    Stripe Payment Service
                         │ detokenizes, charges card via card network
                         ▼
                    Result returned to merchant
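
The token-to-card mapping at the heart of this flow can be sketched in a few lines. This is a toy, in-memory illustration only: the `TokenVault` class is hypothetical, and a real vault would be an isolated service with encryption at rest, strict access control, and access logging.

```python
import secrets


class TokenVault:
    """Toy vault mapping random tokens to card numbers (illustration only)."""

    def __init__(self):
        self._cards = {}  # token -> card number; encrypted storage in reality

    def tokenize(self, card_number):
        # The token is random, so it reveals nothing about the card.
        token = "tok_" + secrets.token_urlsafe(16)
        self._cards[token] = card_number
        return token

    def detokenize(self, token):
        # Only the payment service inside the PCI boundary should call this.
        return self._cards[token]


vault = TokenVault()
card = "4242424242424242"  # a commonly used test card number
token = vault.tokenize(card)

# The merchant backend only ever handles the token.
charge_request = {"token": token, "amount_cents": 10_000, "currency": "USD"}
```

Because the token is generated randomly and not derived from the card number, a leaked token is useless without access to the vault.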

Reconciliation

Reconciliation is the process of verifying that your internal records match the external records from payment processors and banks. Discrepancies indicate bugs, fraud, or timing issues.

Daily Reconciliation Process

1. Export all transactions from your database for the day.
2. Download settlement reports from each payment processor or acquiring bank (the parties that connect you to card networks such as Visa and Mastercard).
3. Match each internal transaction to the corresponding external record.
4. Flag discrepancies:
   - Transaction in your system but not in processor report (we think it succeeded, but it did not).
   - Transaction in processor report but not in your system (we missed recording it).
   - Amount mismatches (currency conversion issues, fee calculation errors).
5. Investigate and resolve each discrepancy.

Common causes of discrepancies:
  - Race conditions during payment processing.
  - Network timeouts where the payment succeeded but the response was lost.
  - Currency conversion rounding differences.
  - Chargebacks processed by the bank but not yet reflected in your system.

def reconcile(internal_transactions, processor_report):
    internal_map = {tx.processor_ref: tx for tx in internal_transactions}
    external_map = {entry.ref_id: entry for entry in processor_report}

    discrepancies = []

    for ref, tx in internal_map.items():
        if ref not in external_map:
            discrepancies.append(("MISSING_EXTERNAL", ref, tx))
        elif tx.amount != external_map[ref].amount:
            discrepancies.append(("AMOUNT_MISMATCH", ref, tx, external_map[ref]))

    for ref, entry in external_map.items():
        if ref not in internal_map:
            discrepancies.append(("MISSING_INTERNAL", ref, entry))

    return discrepancies

Handling Failures

Payment processing involves multiple external systems (card networks, banks, fraud detection), each of which can fail. The system must handle failures gracefully without losing money.

Timeout Handling

The most dangerous failure mode: you send a charge request to the payment processor, and the connection times out. Did the charge go through? You do not know.

  • Never retry blindly. The original charge may have succeeded. Retrying would double-charge the customer.
  • Use idempotency keys with the payment processor. Most processors support this. The retry with the same key returns the original result.
  • Query the processor. After a timeout, query the transaction status endpoint before deciding whether to retry.
  • Reconcile later. If the status is still unknown, record it as "UNCERTAIN" and resolve it during reconciliation.
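
The four rules above can be combined into one control flow. The sketch below is written against a hypothetical processor client exposing `charge()` and `get_status()`. Real processor APIs differ, so treat the names, signatures, and the `ProcessorTimeout` exception as assumptions.

```python
class ProcessorTimeout(Exception):
    """Hypothetical error: the processor did not answer in time."""


def charge_safely(processor, idempotency_key, amount_cents, currency):
    """Charge once and return the processor's status, or "UNCERTAIN"."""
    try:
        return processor.charge(idempotency_key, amount_cents, currency)
    except ProcessorTimeout:
        pass  # Outcome unknown: the charge may or may not have gone through.

    try:
        # Ask the processor what happened before deciding anything.
        status = processor.get_status(idempotency_key)
        if status is not None:
            return status
        # No record of the charge: retry with the SAME key, so that even if
        # the first request lands late, the processor deduplicates it.
        return processor.charge(idempotency_key, amount_cents, currency)
    except ProcessorTimeout:
        # Still unknown: leave it for reconciliation to resolve.
        return "UNCERTAIN"


class LostResponseProcessor:
    """Test double: the first charge succeeds but its response is lost."""

    def __init__(self):
        self.charges = {}
        self.charge_calls = 0

    def charge(self, key, amount_cents, currency):
        self.charge_calls += 1
        first_time = key not in self.charges
        self.charges.setdefault(key, "CAPTURED")
        if first_time:
            raise ProcessorTimeout()
        return self.charges[key]

    def get_status(self, key):
        return self.charges.get(key)


processor = LostResponseProcessor()
status = charge_safely(processor, "key-123", 10_000, "USD")
```

In this run the status query reveals that the original charge succeeded, so no second charge request is sent.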

Distributed Transaction Pattern

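A payment often involves multiple steps (debit customer, credit merchant, record in ledger). If one step fails, the others must be rolled back. Because these steps span separate services and external processors, a single database transaction cannot cover them; the saga pattern instead pairs each step with a compensating action that undoes it.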

Saga Pattern for Payment:

  Step 1: Reserve funds (authorize)
    Compensation: Release funds (void authorization)

  Step 2: Record ledger entries
    Compensation: Reverse ledger entries

  Step 3: Capture funds
    Compensation: Refund

  If Step 3 fails:
    Execute compensation for Step 2, then Step 1 (reverse order)
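
A minimal saga executor for the steps above might look like the following. It is a sketch under simplifying assumptions: it runs in a single process, and compensations are assumed to succeed. A real implementation would persist saga progress and retry failed compensations. `run_saga` and the step functions are illustrative names.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, run the compensations of the steps that already
    completed, in reverse order, and return False.
    """
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True


log = []


def failing_capture():
    raise RuntimeError("capture failed")


steps = [
    (lambda: log.append("authorize"),
     lambda: log.append("void authorization")),
    (lambda: log.append("record ledger entries"),
     lambda: log.append("reverse ledger entries")),
    (failing_capture,
     lambda: log.append("refund")),
]

succeeded = run_saga(steps)
print(succeeded, log)
```

The failed step's own compensation is never run, because the capture did not happen and there is nothing to refund.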

High-Level Architecture

              ┌──────────────┐
 Merchants──> │ API Gateway  │
              └──────┬───────┘
                     │
        ┌────────────┼────────────┐
        ▼            ▼            ▼
  ┌──────────┐ ┌──────────┐ ┌──────────┐
  │ Payment  │ │ Webhook  │ │ Ledger   │
  │ Service  │ │ Service  │ │ Service  │
  └────┬─────┘ └────┬─────┘ └────┬─────┘
       │             │             │
       ▼             ▼             ▼
  ┌──────────┐ ┌──────────┐ ┌──────────┐
  │Payment DB│ │ Message  │ │ Ledger DB│
  │(Postgres)│ │  Queue   │ │(Postgres)│
  └────┬─────┘ └──────────┘ └──────────┘
       │
       ▼
  ┌──────────────────────┐
  │ Payment Processor    │
  │ (Visa/MC/Bank APIs)  │
  └──────────────────────┘

Interview Tips

  • Lead with idempotency. This is the most important concept. Explain the problem (network failures cause retries) and the solution (client-generated idempotency keys stored server-side).
  • Draw the payment state machine. Walk through each state and the transitions. Explain why authorize-then-capture exists (merchants do not want to charge for unfulfillable orders).
  • Explain double-entry bookkeeping. This is what separates a toy payment system from a real one. Every transaction has balanced debits and credits. The ledger is the source of truth.
  • Discuss webhook reliability. Explain exponential backoff retries and HMAC signature verification. Mention that webhooks guarantee at-least-once delivery, so merchants must handle duplicates.
  • Mention PCI compliance and tokenization. This shows you understand the regulatory and security landscape. The key insight is minimizing PCI scope: cardholder data never touches the merchant's servers, and inside the payment system it is confined to an isolated vault.
  • Address the timeout problem explicitly. "What happens if the payment processor times out?" is the hardest failure mode. Explain idempotency keys with the processor, status queries, and reconciliation as the fallback.