← back

Infrastructure

Bloom Filters

Space-efficient probabilistic data structure for membership testing. Covers false positive rates, sizing, and real-world uses in databases and caches.

Bloom Filters

A bloom filter answers one question with extraordinary efficiency: "Is this element definitely not in the set, or possibly in the set?" It never produces false negatives (if it says "no," the element is absolutely not in the set), but it can produce false positives (it might say "yes" when the element is actually absent). This trade-off -- giving up certainty for dramatic savings in space and time -- makes bloom filters one of the most practical data structures in system design.

Why Bloom Filters Matter

Consider Chrome's Safe Browsing feature. Google maintains a list of millions of known malicious URLs. Before every page load, Chrome needs to check: "Is this URL on the blocklist?" Downloading the entire list to every browser would use gigabytes of storage. Querying a remote server for every page load would add latency. A bloom filter compresses millions of URLs into roughly 10-20 MB. Chrome checks the local bloom filter first. If it says "no," the URL is safe -- no network request needed. If it says "possibly yes," Chrome makes a server call to confirm.

This pattern -- using a bloom filter as a fast, space-efficient first check before a more expensive lookup -- appears everywhere in distributed systems.

How a Bloom Filter Works

A bloom filter is a bit array of m bits, initialized to all zeros, combined with k independent hash functions. Each hash function maps an element to one of the m bit positions.

Insertion

To add an element, hash it with all k hash functions. Set each resulting bit position to 1.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
Bloom filter: m=16 bits, k=3 hash functions

Initial state:
  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
   0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15

Insert "apple":
  h1("apple") = 2,  h2("apple") = 7,  h3("apple") = 13

  [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0]
         ↑                 ↑                 ↑

Insert "banana":
  h1("banana") = 4,  h2("banana") = 7,  h3("banana") = 11

  [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0]
               ↑           (already 1)      ↑

Notice that bit 7 was already set by "apple." This overlap is the source of false positives.

Lookup

To check if an element is in the set, hash it with all k hash functions. If all resulting bit positions are 1, the element is possibly in the set. If any bit is 0, the element is definitely not in the set.

1
2
3
4
5
6
7
Check "cherry":
  h1("cherry") = 2,  h2("cherry") = 4,  h3("cherry") = 11
  Bits: [1, 1, 1] → All set → "Possibly in the set" (FALSE POSITIVE!)

Check "grape":
  h1("grape") = 1,  h2("grape") = 7,  h3("grape") = 15
  Bits: [0, 1, 0] → Bit 1 is 0 → "Definitely not in the set" ✓

"Cherry" was never inserted, but its hash positions happen to overlap with bits set by "apple" and "banana." This is a false positive.

Deletion

Standard bloom filters do not support deletion. You cannot set a bit back to 0 because it might be shared by multiple elements. This is a fundamental limitation.

False Positive Probability

The false positive rate depends on three parameters:

  • m: number of bits in the array
  • n: number of elements inserted
  • k: number of hash functions

The probability of a false positive is approximately:

1
P(false positive) ≈ (1 - e^(-kn/m))^k

Optimal Number of Hash Functions

Given m and n, the optimal k that minimizes false positives is:

1
k_optimal = (m/n) × ln(2) ≈ 0.693 × (m/n)

Sizing a Bloom Filter

Given a desired false positive rate p and expected number of elements n:

1
2
3
4
5
6
7
8
m = -(n × ln(p)) / (ln(2))^2

Example: n = 1,000,000 elements, p = 1% false positive rate
  m = -(1,000,000 × ln(0.01)) / (ln(2))^2
  m = -(1,000,000 × (-4.605)) / 0.4805
  m ≈ 9,585,058 bits ≈ 1.14 MB

  k = 0.693 × (9,585,058 / 1,000,000) ≈ 7 hash functions

So a million elements can be checked with just 1.14 MB of memory and a 1% false positive rate. Compare this to storing the actual elements, which might require hundreds of megabytes.

Implementation

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
import math
import mmh3  # MurmurHash3 -- fast, well-distributed

class BloomFilter:
    def __init__(self, expected_elements, false_positive_rate=0.01):
        self.n = expected_elements
        self.p = false_positive_rate

        # Calculate optimal size and number of hash functions
        self.m = int(-self.n * math.log(self.p) / (math.log(2) ** 2))
        self.k = int((self.m / self.n) * math.log(2))

        self.bit_array = [0] * self.m

    def _hashes(self, item):
        """Generate k hash values using double hashing technique."""
        h1 = mmh3.hash(item, 0) % self.m
        h2 = mmh3.hash(item, 1) % self.m
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item):
        for pos in self._hashes(str(item)):
            self.bit_array[pos] = 1

    def might_contain(self, item):
        return all(self.bit_array[pos] for pos in self._hashes(str(item)))

# Usage
bf = BloomFilter(expected_elements=1_000_000, false_positive_rate=0.01)
bf.add("malicious-url-1.com")
bf.add("malicious-url-2.com")

print(bf.might_contain("malicious-url-1.com"))  # True (correct)
print(bf.might_contain("safe-url.com"))          # Almost certainly False
print(f"Size: {bf.m / 8 / 1024:.1f} KB, Hash functions: {bf.k}")

Note the double hashing technique: instead of k independent hash functions, we use two hash functions (h1, h2) and combine them as h1 + i*h2. This is mathematically proven to be as effective as k independent hash functions.

Counting Bloom Filters

Since standard bloom filters cannot support deletion, counting bloom filters replace each bit with a counter (typically 4 bits). Instead of setting a bit to 1, you increment the counter. To delete, you decrement.

1
2
3
4
5
Standard bloom filter: [0, 1, 0, 1, 1, 0]  (bits)
Counting bloom filter: [0, 2, 0, 1, 3, 0]  (counters)

Delete "apple" (positions 1, 3, 4):
  [0, 1, 0, 0, 2, 0]  (decremented)

Trade-off: Counting bloom filters use 4x more space (4 bits per position instead of 1). A counter overflow (exceeding the max value, typically 15 for 4-bit counters) is extremely rare but must be handled.

Real-World Applications

Cassandra: Reducing Disk Reads

Cassandra stores data in SSTables on disk. When a read request arrives, Cassandra must check multiple SSTables to find the value. Each SSTable has an associated bloom filter. Before reading an SSTable from disk, Cassandra checks the bloom filter. If the bloom filter says "no," the SSTable definitely does not contain the key, and the expensive disk read is skipped.

1
2
3
4
Read request for key "user:42":
  SSTable 1 bloom filter: "possibly yes" → read from disk → not found
  SSTable 2 bloom filter: "definitely no" → SKIP (saved a disk read!)
  SSTable 3 bloom filter: "possibly yes" → read from disk → FOUND

Cassandra's default bloom filter false positive rate is 1%, configurable per table. Lowering the rate uses more memory but reduces unnecessary disk reads.

Chrome Safe Browsing

Google Safe Browsing maintains a bloom filter of known malicious URLs, distributed to browsers. The filter is checked locally on every navigation. A positive result triggers a server-side verification. The bloom filter is updated periodically (every 30 minutes).

Medium: Avoiding Duplicate Recommendations

Medium uses bloom filters to track which articles a user has already seen. Before recommending an article, check the user's bloom filter. If it says "possibly seen," skip it. The occasional false positive (skipping an unseen article) is acceptable -- it is far better than showing the same article twice.

Spell Checkers

A bloom filter of all valid dictionary words provides a fast first check. If a word is "definitely not" in the dictionary, it is misspelled. If it "might be" in the dictionary, verify against the full dictionary.

CDN Caching (Akamai)

Akamai uses bloom filters to decide whether to cache a URL. On the first request, the URL is added to a bloom filter but not cached. On the second request, the bloom filter says "possibly seen before," and the URL is cached. This prevents one-hit wonders from polluting the cache.

Variations

Cuckoo Filters

A more recent alternative that supports deletion, has lower false positive rates for the same memory, and offers faster lookups. It uses cuckoo hashing internally. Recommended when deletion is needed.

Scalable Bloom Filters

When the number of elements is not known in advance, scalable bloom filters dynamically add new bloom filters as the set grows, maintaining the target false positive rate.

Trade-Offs Summary

PropertyBloom FilterHash SetSorted List
SpaceO(n) bits, very smallO(n) elements, largeO(n) elements, large
Lookup timeO(k) hashesO(1) averageO(log n)
False positivesYes (tunable)NoNo
False negativesNoNoNo
DeletionNo (yes with counting variant)YesYes
DistributedEasy to merge (bitwise OR)HardHard

Interview Tips

  • Lead with the use case. Start by explaining why you need a bloom filter: "We have a large set and need fast membership checks without storing all elements."
  • State the guarantees clearly. "No false negatives, tunable false positive rate." This is the bloom filter's contract.
  • Know the sizing math. Being able to calculate that 1 million elements with 1% FP needs ~1.2 MB and 7 hash functions impresses interviewers.
  • Mention real-world examples. Cassandra SSTable lookup, Chrome Safe Browsing, and CDN caching are the strongest examples.
  • Discuss the deletion limitation. Mention counting bloom filters or cuckoo filters as alternatives when deletion is required.
  • Describe the "two-tier" pattern. Bloom filter as a fast local check, followed by a definitive remote check on a positive result. This pattern is applicable in many system design scenarios.