← back

Randomized Algorithms

Reservoir Sampling

Maintain a reservoir of k items. For the i-th item, replace a random element in the reservoir with probability k/i. After processing all items, the reservoir is a uniform random sample without knowing the stream length upfront.

Reservoir Sampling

Imagine a data stream of unknown length: a live log feed, a database cursor, a network socket. You want to select k items uniformly at random, but you cannot store all the elements or know the total count ahead of time. Reservoir sampling solves this in a single pass using O(k)O(k) space regardless of how large the stream turns out to be.

The Problem

If you knew the stream had nn elements you could pick k indices at random and collect those items. But you do not know n in advance. You need a strategy that is correct at every moment: if the stream stops after i elements, the k items currently held should be a uniform random sample from those i elements.

The Algorithm

The algorithm has two phases. For the first k elements, accept them unconditionally: they fill the reservoir. For every subsequent element at position i (using 0-based indexing, so i ≥ k), pick a random integer j in [0, i]. If j is less than k, replace reservoir[j] with the new element. Otherwise discard it.

The probability of accepting element ii is k/(i+1)k / (i + 1), and the probability of any existing reservoir element surviving that step is 11/(i+1)=i/(i+1)1 - 1/(i+1) = i/(i+1). These two probabilities multiply through correctly so that every element always has exactly k/nk/n probability of being in the final reservoir.

Why Every Element Has Equal Probability: An Inductive Sketch

After processing the first kk elements each one is in the reservoir with probability 1, which equals k/kk/k. For the base case, that is fine.

Now suppose after ii elements each is in the reservoir with probability k/ik/i. Element i+1i+1 arrives. It is accepted into the reservoir with probability k/(i+1)k/(i+1) (since we pick a uniform j in [0, i] and accept iff j < k). For each currently held element, the probability of being evicted is the chance the new element is accepted and the random slot lands on this element: ki+11k=1i+1\frac{k}{i+1} \cdot \frac{1}{k} = \frac{1}{i+1}. So it survives with probability 11i+1=ii+11 - \frac{1}{i+1} = \frac{i}{i+1}, and its new overall probability of being in the reservoir is kiii+1=ki+1\frac{k}{i} \cdot \frac{i}{i+1} = \frac{k}{i+1}. The invariant is preserved. By induction, when the stream ends at nn elements, every element has probability k/nk/n.

Walked Example: k=2, Stream of 5 Elements

Stream: [A, B, C, D, E]. We want 2 uniform random samples.

1
2
3
4
5
i=0: reservoir = [A]           (accept unconditionally)
i=1: reservoir = [A, B]        (accept unconditionally, reservoir full)
i=2: j = randint(0,2) = 1 → j < 2, replace reservoir[1]: reservoir = [A, C]
i=3: j = randint(0,3) = 3 → j >= 2, discard D:           reservoir = [A, C]
i=4: j = randint(0,4) = 0 → j < 2, replace reservoir[0]: reservoir = [E, C]

Final reservoir: [E, C]. Each of A, B, C, D, E had a 2/5 chance of ending up in the result.

Note that the random numbers above are just one possible sequence; the outcome varies per run. What is guaranteed is that the distribution is uniform over all possible runs.

The Implementation

1
2
3
4
5
6
7
8
9
10
11
12
import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)   # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

The loop is a single pass over the stream. At each step the work is O(1)O(1): generate one random number and possibly overwrite one slot. Total time is O(n)O(n); space is O(k)O(k).

The k=1 Case: Random Node in a Linked List

When k equals 1, the reservoir holds a single element. At step ii, replace the current pick with probability 1/i1/i. This is the classic "random node from a linked list without knowing its length" problem: LeetCode 382. Linked List Random Node.

1
2
3
4
5
6
7
8
9
10
11
12
import random

def random_list_node(head):
    result = head.val
    node = head.next
    i = 2
    while node:
        if random.randint(1, i) == 1:   # probability 1/i
            result = node.val
        node = node.next
        i += 1
    return result

After seeing the first node, result is set to it with probability 1/1 = 1. When the second node arrives, it replaces result with probability 1/21/2, so each of the two has probability 1/21/2. When the third arrives, it wins with probability 1/31/3, and the previous winner survives with probability 2/32/3, giving each candidate a 1/31/3 shot overall. The pattern continues to nn nodes, each with probability 1/n1/n.

Weighted Reservoir Sampling

Sometimes elements are not equally important: a user with 100 purchases should be more likely to appear in a sample than one with 1 purchase. Weighted reservoir sampling assigns each element a key of u1/wu^{1/w} where uu is a uniform random number and ww is the element's weight, then keeps the k elements with the largest keys. This guarantees selection probability proportional to weight while still requiring only a single pass and O(k)O(k) space.

1
2
3
4
5
6
7
8
9
10
11
12
13
import random
import heapq

def weighted_reservoir_sample(stream, k):
    # stream yields (item, weight) pairs
    heap = []   # min-heap of (key, item)
    for item, weight in stream:
        key = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

The heap tracks the k largest keys seen so far. A new element displaces the weakest current member only if its key is larger, which happens with the correct weighted probability.

Random Pick Index

LeetCode 398. Random Pick Index gives an array nums and asks you to return a uniformly random index of a target value. You could precompute a hash from value to list of indices, but reservoir sampling solves it without the extra memory:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
import random

class Solution:
    def __init__(self, nums):
        self.nums = nums

    def pick(self, target):
        chosen = -1
        count = 0
        for i, v in enumerate(self.nums):
            if v == target:
                count += 1
                if random.randint(1, count) == 1:
                    chosen = i
        return chosen

This is exactly the k=1 reservoir over the subsequence of matching indices. Each candidate ends up chosen with probability 1 / (number of matches).

When to Reach for It

Reservoir sampling is the right tool whenever you need a uniform random sample from a stream whose length you don't know, can't store, or don't want to traverse twice. The signal phrase in problems is "without knowing the size in advance" or "single pass." If you're allowed two passes, a simpler approach (count first, then pick) often beats it. If the stream is small enough to materialize, just shuffle.

Complexity

Time is O(n)O(n) for a stream of nn elements: a single pass with O(1)O(1) work per element. Space is O(k)O(k) for the reservoir. The weighted variant adds a heap that costs O(logk)O(\log k) per element, making it O(nlogk)O(n \log k) time overall.

Recognizing this pattern

You're probably looking at reservoir sampling when:

  • You need to pick a uniformly random element (or k of them) from a stream of unknown length.
  • The input is too large to materialize in memory or you're only allowed one pass.
  • The data structure is a singly linked list with no length stored and no random access.
  • Elements have weights and you need probability proportional to weight without precomputing a CDF.

Common templates:

  • Single-element reservoir: keep current pick with probability 1/i at element i. Example: Linked List Random Node.
  • Multi-target index reservoir: sample uniformly among all indices matching a target value. Example: Random Pick Index.
  • Size-k reservoir (Algorithm R): fill first k, then replace position randint(0, i) if it falls in range. Example: Random Pick with Blacklist.
  • Weighted reservoir (A-Res): key each item by u^(1/w) and keep top k via a min-heap. Example: Random Pick with Weight.