Infrastructure
Space-efficient probabilistic data structure for membership testing. Covers false positive rates, sizing, and real-world uses in databases and caches.
A bloom filter answers one question with extraordinary efficiency: "Is this element definitely not in the set, or possibly in the set?" It never produces false negatives (if it says "no," the element is absolutely not in the set), but it can produce false positives (it might say "yes" when the element is actually absent). This trade-off -- giving up certainty for dramatic savings in space and time -- makes bloom filters one of the most practical data structures in system design.
Consider Chrome's Safe Browsing feature. Google maintains a list of millions of known malicious URLs. Before every page load, Chrome needs to check: "Is this URL on the blocklist?" Downloading the entire list to every browser would use gigabytes of storage. Querying a remote server for every page load would add latency. A bloom filter compresses millions of URLs into roughly 10-20 MB. Chrome checks the local bloom filter first. If it says "no," the URL is safe -- no network request needed. If it says "possibly yes," Chrome makes a server call to confirm.
This pattern -- using a bloom filter as a fast, space-efficient first check before a more expensive lookup -- appears everywhere in distributed systems.
A bloom filter is a bit array of m bits, initialized to all zeros, combined with k independent hash functions. Each hash function maps an element to one of the m bit positions.
To add an element, hash it with all k hash functions. Set each resulting bit position to 1.
Bloom filter: m=16 bits, k=3 hash functions
Initial state:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Insert "apple":
h1("apple") = 2, h2("apple") = 7, h3("apple") = 13
[0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0]
↑ ↑ ↑
Insert "banana":
h1("banana") = 4, h2("banana") = 7, h3("banana") = 11
[0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0]
↑ (already 1) ↑Notice that bit 7 was already set by "apple." This overlap is the source of false positives.
To check if an element is in the set, hash it with all k hash functions. If all resulting bit positions are 1, the element is possibly in the set. If any bit is 0, the element is definitely not in the set.
Check "cherry":
h1("cherry") = 2, h2("cherry") = 4, h3("cherry") = 11
Bits: [1, 1, 1] → All set → "Possibly in the set" (FALSE POSITIVE!)
Check "grape":
h1("grape") = 1, h2("grape") = 7, h3("grape") = 15
Bits: [0, 1, 0] → Bit 1 is 0 → "Definitely not in the set" ✓"Cherry" was never inserted, but its hash positions happen to overlap with bits set by "apple" and "banana." This is a false positive.
Standard bloom filters do not support deletion. You cannot set a bit back to 0 because it might be shared by multiple elements. This is a fundamental limitation.
The false positive rate depends on three parameters:
The probability of a false positive is approximately:
P(false positive) ≈ (1 - e^(-kn/m))^kGiven m and n, the optimal k that minimizes false positives is:
k_optimal = (m/n) × ln(2) ≈ 0.693 × (m/n)Given a desired false positive rate p and expected number of elements n:
m = -(n × ln(p)) / (ln(2))^2
Example: n = 1,000,000 elements, p = 1% false positive rate
m = -(1,000,000 × ln(0.01)) / (ln(2))^2
m = -(1,000,000 × (-4.605)) / 0.4805
m ≈ 9,585,058 bits ≈ 1.14 MB
k = 0.693 × (9,585,058 / 1,000,000) ≈ 7 hash functionsSo a million elements can be checked with just 1.14 MB of memory and a 1% false positive rate. Compare this to storing the actual elements, which might require hundreds of megabytes.
import math
import mmh3 # MurmurHash3 -- fast, well-distributed
class BloomFilter:
def __init__(self, expected_elements, false_positive_rate=0.01):
self.n = expected_elements
self.p = false_positive_rate
# Calculate optimal size and number of hash functions
self.m = int(-self.n * math.log(self.p) / (math.log(2) ** 2))
self.k = int((self.m / self.n) * math.log(2))
self.bit_array = [0] * self.m
def _hashes(self, item):
"""Generate k hash values using double hashing technique."""
h1 = mmh3.hash(item, 0) % self.m
h2 = mmh3.hash(item, 1) % self.m
for i in range(self.k):
yield (h1 + i * h2) % self.m
def add(self, item):
for pos in self._hashes(str(item)):
self.bit_array[pos] = 1
def might_contain(self, item):
return all(self.bit_array[pos] for pos in self._hashes(str(item)))
# Usage
bf = BloomFilter(expected_elements=1_000_000, false_positive_rate=0.01)
bf.add("malicious-url-1.com")
bf.add("malicious-url-2.com")
print(bf.might_contain("malicious-url-1.com")) # True (correct)
print(bf.might_contain("safe-url.com")) # Almost certainly False
print(f"Size: {bf.m / 8 / 1024:.1f} KB, Hash functions: {bf.k}")Note the double hashing technique: instead of k independent hash functions, we use two hash functions (h1, h2) and combine them as h1 + i*h2. This is mathematically proven to be as effective as k independent hash functions.
Since standard bloom filters cannot support deletion, counting bloom filters replace each bit with a counter (typically 4 bits). Instead of setting a bit to 1, you increment the counter. To delete, you decrement.
Standard bloom filter: [0, 1, 0, 1, 1, 0] (bits)
Counting bloom filter: [0, 2, 0, 1, 3, 0] (counters)
Delete "apple" (positions 1, 3, 4):
[0, 1, 0, 0, 2, 0] (decremented)Trade-off: Counting bloom filters use 4x more space (4 bits per position instead of 1). A counter overflow (exceeding the max value, typically 15 for 4-bit counters) is extremely rare but must be handled.
Cassandra stores data in SSTables on disk. When a read request arrives, Cassandra must check multiple SSTables to find the value. Each SSTable has an associated bloom filter. Before reading an SSTable from disk, Cassandra checks the bloom filter. If the bloom filter says "no," the SSTable definitely does not contain the key, and the expensive disk read is skipped.
Read request for key "user:42":
SSTable 1 bloom filter: "possibly yes" → read from disk → not found
SSTable 2 bloom filter: "definitely no" → SKIP (saved a disk read!)
SSTable 3 bloom filter: "possibly yes" → read from disk → FOUNDCassandra's default bloom filter false positive rate is 1%, configurable per table. Lowering the rate uses more memory but reduces unnecessary disk reads.
Google Safe Browsing maintains a bloom filter of known malicious URLs, distributed to browsers. The filter is checked locally on every navigation. A positive result triggers a server-side verification. The bloom filter is updated periodically (every 30 minutes).
Medium uses bloom filters to track which articles a user has already seen. Before recommending an article, check the user's bloom filter. If it says "possibly seen," skip it. The occasional false positive (skipping an unseen article) is acceptable -- it is far better than showing the same article twice.
A bloom filter of all valid dictionary words provides a fast first check. If a word is "definitely not" in the dictionary, it is misspelled. If it "might be" in the dictionary, verify against the full dictionary.
Akamai uses bloom filters to decide whether to cache a URL. On the first request, the URL is added to a bloom filter but not cached. On the second request, the bloom filter says "possibly seen before," and the URL is cached. This prevents one-hit wonders from polluting the cache.
A more recent alternative that supports deletion, has lower false positive rates for the same memory, and offers faster lookups. It uses cuckoo hashing internally. Recommended when deletion is needed.
When the number of elements is not known in advance, scalable bloom filters dynamically add new bloom filters as the set grows, maintaining the target false positive rate.
| Property | Bloom Filter | Hash Set | Sorted List |
|---|---|---|---|
| Space | O(n) bits, very small | O(n) elements, large | O(n) elements, large |
| Lookup time | O(k) hashes | O(1) average | O(log n) |
| False positives | Yes (tunable) | No | No |
| False negatives | No | No | No |
| Deletion | No (yes with counting variant) | Yes | Yes |
| Distributed | Easy to merge (bitwise OR) | Hard | Hard |