← back

String Algorithms

Rolling Hash / Rabin-Karp

Compute a hash for the first window and roll it across the string by adding the new character and removing the old one in O(1) per step. Used for repeated substring detection and multiple-pattern matching.

Rolling Hash and the Rabin-Karp Algorithm

String matching sounds simple: slide a window of length m across a text of length n and check each position. The problem is that each character-by-character comparison costs O(m)O(m), pushing the total to O(nm)O(nm). The Rabin-Karp algorithm escapes this trap by replacing direct comparison with hash comparison. Two strings can only be equal if their hashes match, and a well-chosen hash can be updated in O(1)O(1) as the window slides, bringing the expected runtime down to O(n+m)O(n + m).

Polynomial Hashing

The hash of a string is computed by treating each character as a digit in a base-B number system. For a string s of length m:

1
hash = s[0]*B^(m-1) + s[1]*B^(m-2) + ... + s[m-1]*B^0

All arithmetic is done modulo a large prime M to keep values bounded. A common setup is B = 31 (or 26 for lowercase letters) and M=109+7M = 10^9 + 7. Converting characters to integers typically subtracts ord('a') - 1 so that 'a' maps to 1, avoiding zero-valued characters that would collapse distinct strings to the same hash.

Sliding the Window in O(1)O(1)

The key insight is that when the window advances one position to the right, you only need to remove the contribution of the leftmost character and fold in the new rightmost character:

1
new_hash = (old_hash - s[i] * B^(m-1)) * B + s[i + m]

The term B^(m-1) is a constant that can be precomputed once. Every slide is a fixed-cost arithmetic operation rather than a fresh traversal of m characters.

Walked Example

Suppose the text is "abcabc" and the pattern is "cab". Let B = 31, M=109+7M = 10^9 + 7.

First, compute the pattern hash and the hash of the first window "abc". Then slide:

  • Window "abc": hash doesn't match pattern hash "cab", skip.
  • Slide: remove 'a', add 'a' → window "bca", still no match.
  • Slide: remove 'b', add 'b' → window "cab", hash matches. Verify characters: "cab" == "cab". Found at index 2.

The verification step on hash match is cheap (one comparison of length-m strings) and guards against false positives.

The Code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
def rabin_karp(text: str, pattern: str) -> int:
    base, mod = 31, 10**9 + 7
    m, n = len(pattern), len(text)
    if m > n:
        return -1

    # Precompute base^(m-1) mod M
    power = pow(base, m - 1, mod)

    # Build initial hashes
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod

    for i in range(n - m + 1):
        if t_hash == p_hash and text[i:i + m] == pattern:
            return i
        if i + m < n:
            # Roll the window: evict text[i], admit text[i + m]
            t_hash = (t_hash - ord(text[i]) * power) % mod
            t_hash = (t_hash * base + ord(text[i + m])) % mod
            t_hash %= mod  # ensure non-negative

    return -1

The check text[i:i+m] == pattern only executes on hash matches. In practice, with a good hash function, this almost never triggers on non-matching windows, keeping the average runtime at O(n+m)O(n + m).

Hash Collisions and Double Hashing

A collision happens when two different strings produce the same hash. With a single large prime modulus the probability per window is roughly 1/M1091/M \approx 10^{-9}, which is usually acceptable. For problems where correctness is critical (competitive programming, adversarial inputs), use double hashing: maintain two independent hashes with different base/modulus pairs and only accept a match when both agree. The collision probability drops to roughly 1/(M1M2)10181/(M_1 \cdot M_2) \approx 10^{-18}.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
def rabin_karp_double(text: str, pattern: str) -> int:
    B1, M1 = 31, 10**9 + 7
    B2, M2 = 37, 10**9 + 9
    m, n = len(pattern), len(text)
    pw1 = pow(B1, m - 1, M1)
    pw2 = pow(B2, m - 1, M2)

    ph1 = ph2 = th1 = th2 = 0
    for i in range(m):
        c = ord(pattern[i])
        ph1 = (ph1 * B1 + c) % M1
        ph2 = (ph2 * B2 + c) % M2
        c2 = ord(text[i])
        th1 = (th1 * B1 + c2) % M1
        th2 = (th2 * B2 + c2) % M2

    for i in range(n - m + 1):
        if th1 == ph1 and th2 == ph2:
            return i  # collision probability negligible; skip verify
        if i + m < n:
            th1 = (th1 - ord(text[i]) * pw1) * B1 % M1 + ord(text[i + m])
            th1 %= M1
            th2 = (th2 - ord(text[i]) * pw2) * B2 % M2 + ord(text[i + m])
            th2 %= M2
    return -1

Application: Longest Duplicate Substring

LeetCode 1044. Longest Duplicate Substring asks for the longest substring that appears at least twice. A naive approach is O(n2)O(n^2) in the length of candidates. Rolling hash enables a cleaner O(nlogn)O(n \log n) solution by combining binary search with a hash set.

The key observation: if a duplicate substring of length k exists, one of length k - 1 certainly exists too. This monotone property makes binary search valid: search for the largest k such that some window of that length repeats.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
def longest_dup_substring(s: str) -> str:
    def has_duplicate(length: int) -> str:
        base, mod = 31, 10**9 + 7
        power = pow(base, length - 1, mod)
        h = 0
        for c in s[:length]:
            h = (h * base + ord(c)) % mod
        seen = {h: [0]}
        for i in range(1, len(s) - length + 1):
            h = (h - ord(s[i - 1]) * power) % mod
            h = (h * base + ord(s[i + length - 1])) % mod
            if h in seen:
                # Verify to avoid false positives
                candidate = s[i:i + length]
                for j in seen[h]:
                    if s[j:j + length] == candidate:
                        return candidate
                seen[h].append(i)
            else:
                seen[h] = [i]
        return ""

    lo, hi, result = 1, len(s) - 1, ""
    while lo <= hi:
        mid = (lo + hi) // 2
        found = has_duplicate(mid)
        if found:
            result = found
            lo = mid + 1
        else:
            hi = mid - 1
    return result

Each call to has_duplicate runs in O(n)O(n); binary search calls it O(logn)O(\log n) times, giving O(nlogn)O(n \log n) overall.

Complexity

Time: O(n+m)O(n + m) expected for single-pattern search (O(nm)O(nm) worst case if every window produces a hash collision). O(nlogn)O(n \log n) for longest duplicate substring via binary search.

Space: O(1)O(1) extra for basic Rabin-Karp; O(n)O(n) for the hash set in the duplicate-substring variant.

Other Applications

LeetCode 187. Repeated DNA Sequences is the textbook Rabin-Karp warm-up: find every 10-letter substring that appears more than once. A rolling hash over a length-10 window plus a set captures every repeat in one pass.

LeetCode 1147. Longest Chunked Palindrome Decomposition greedily matches a growing prefix against the corresponding suffix. Each comparison is one rolling-hash equality check, keeping the whole algorithm linear.

LeetCode 1923. Longest Common Subpath finds the longest path appearing in every one of mm input paths. Binary search on the answer length combined with rolling hash gives O((NlogN))O((N \log N)) where NN is total path length, infeasible without hashing.

Recognizing this pattern

You're probably looking at rolling hash when:

  • You need to compare many fixed-length substrings, a sliding window of equal-length comparisons.
  • The problem asks for the longest duplicate or longest common substring, where binary search on length pairs with hash-set lookups.
  • The strings are long enough that O(nm)O(nm) naive matching is too slow but KMP doesn't fit (multiple patterns, unknown lengths, k-grams).
  • Inputs are k-mers over a small alphabet: DNA, fixed-length sliding windows.

Common templates: