Randomized Algorithms
Maintain a reservoir of k items. For the i-th item, replace a random element in the reservoir with probability k/i. After processing all items, the reservoir is a uniform random sample without knowing the stream length upfront.
Imagine a data stream of unknown length: a live log feed, a database cursor, a network socket. You want to select k items uniformly at random, but you cannot store all the elements or know the total count ahead of time. Reservoir sampling solves this in a single pass using space regardless of how large the stream turns out to be.
If you knew the stream had elements you could pick k indices at random and collect those items. But you do not know n in advance. You need a strategy that is correct at every moment: if the stream stops after i elements, the k items currently held should be a uniform random sample from those i elements.
The algorithm has two phases. For the first k elements, accept them unconditionally: they fill the reservoir. For every subsequent element at position i (using 0-based indexing, so i ≥ k), pick a random integer j in [0, i]. If j is less than k, replace reservoir[j] with the new element. Otherwise discard it.
The probability of accepting element is , and the probability of any existing reservoir element surviving that step is . These two probabilities multiply through correctly so that every element always has exactly probability of being in the final reservoir.
After processing the first elements each one is in the reservoir with probability 1, which equals . For the base case, that is fine.
Now suppose after elements each is in the reservoir with probability . Element arrives. It is accepted into the reservoir with probability (since we pick a uniform j in [0, i] and accept iff j < k). For each currently held element, the probability of being evicted is the chance the new element is accepted and the random slot lands on this element: . So it survives with probability , and its new overall probability of being in the reservoir is . The invariant is preserved. By induction, when the stream ends at elements, every element has probability .
Stream: [A, B, C, D, E]. We want 2 uniform random samples.
i=0: reservoir = [A] (accept unconditionally)
i=1: reservoir = [A, B] (accept unconditionally, reservoir full)
i=2: j = randint(0,2) = 1 → j < 2, replace reservoir[1]: reservoir = [A, C]
i=3: j = randint(0,3) = 3 → j >= 2, discard D: reservoir = [A, C]
i=4: j = randint(0,4) = 0 → j < 2, replace reservoir[0]: reservoir = [E, C]Final reservoir: [E, C]. Each of A, B, C, D, E had a 2/5 chance of ending up in the result.
Note that the random numbers above are just one possible sequence; the outcome varies per run. What is guaranteed is that the distribution is uniform over all possible runs.
import random
def reservoir_sample(stream, k):
reservoir = []
for i, item in enumerate(stream):
if i < k:
reservoir.append(item)
else:
j = random.randint(0, i) # inclusive on both ends
if j < k:
reservoir[j] = item
return reservoirThe loop is a single pass over the stream. At each step the work is : generate one random number and possibly overwrite one slot. Total time is ; space is .
When k equals 1, the reservoir holds a single element. At step , replace the current pick with probability . This is the classic "random node from a linked list without knowing its length" problem: LeetCode 382. Linked List Random Node.
import random
def random_list_node(head):
result = head.val
node = head.next
i = 2
while node:
if random.randint(1, i) == 1: # probability 1/i
result = node.val
node = node.next
i += 1
return resultAfter seeing the first node, result is set to it with probability 1/1 = 1. When the second node arrives, it replaces result with probability , so each of the two has probability . When the third arrives, it wins with probability , and the previous winner survives with probability , giving each candidate a shot overall. The pattern continues to nodes, each with probability .
Sometimes elements are not equally important: a user with 100 purchases should be more likely to appear in a sample than one with 1 purchase. Weighted reservoir sampling assigns each element a key of where is a uniform random number and is the element's weight, then keeps the k elements with the largest keys. This guarantees selection probability proportional to weight while still requiring only a single pass and space.
import random
import heapq
def weighted_reservoir_sample(stream, k):
# stream yields (item, weight) pairs
heap = [] # min-heap of (key, item)
for item, weight in stream:
key = random.random() ** (1.0 / weight)
if len(heap) < k:
heapq.heappush(heap, (key, item))
elif key > heap[0][0]:
heapq.heapreplace(heap, (key, item))
return [item for _, item in heap]The heap tracks the k largest keys seen so far. A new element displaces the weakest current member only if its key is larger, which happens with the correct weighted probability.
LeetCode 398. Random Pick Index gives an array nums and asks you to return a uniformly random index of a target value. You could precompute a hash from value to list of indices, but reservoir sampling solves it without the extra memory:
import random
class Solution:
def __init__(self, nums):
self.nums = nums
def pick(self, target):
chosen = -1
count = 0
for i, v in enumerate(self.nums):
if v == target:
count += 1
if random.randint(1, count) == 1:
chosen = i
return chosenThis is exactly the k=1 reservoir over the subsequence of matching indices. Each candidate ends up chosen with probability 1 / (number of matches).
Reservoir sampling is the right tool whenever you need a uniform random sample from a stream whose length you don't know, can't store, or don't want to traverse twice. The signal phrase in problems is "without knowing the size in advance" or "single pass." If you're allowed two passes, a simpler approach (count first, then pick) often beats it. If the stream is small enough to materialize, just shuffle.
Time is for a stream of elements: a single pass with work per element. Space is for the reservoir. The weighted variant adds a heap that costs per element, making it time overall.
You're probably looking at reservoir sampling when:
k of them) from a stream of unknown length.Common templates:
1/i at element i. Example: Linked List Random Node.k reservoir (Algorithm R): fill first k, then replace position randint(0, i) if it falls in range. Example: Random Pick with Blacklist.u^(1/w) and keep top k via a min-heap. Example: Random Pick with Weight.