← back

BM25 Ranking

#90 · NLP · Medium

⊣ Solve on deep-ml.com

Problem

Implement the BM25 (Best Matching 25) ranking function. Given a corpus of documents and a query, compute the BM25 score for each document with respect to the query. BM25 is a bag-of-words retrieval function that ranks documents based on the query terms appearing in each document.

Solution

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import math
from collections import Counter

def bm25(corpus: list[list[str]], query: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    N = len(corpus)
    doc_lens = [len(doc) for doc in corpus]
    avgdl = sum(doc_lens) / N

    # Document frequency for each term
    df = {}
    for doc in corpus:
        seen = set(doc)
        for term in seen:
            df[term] = df.get(term, 0) + 1

    # Term frequencies per document
    doc_tfs = [Counter(doc) for doc in corpus]

    scores = []
    for i in range(N):
        score = 0.0
        for term in query:
            if term not in df:
                continue
            tf = doc_tfs[i].get(term, 0)
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            numerator = tf * (k1 + 1)
            denominator = tf + k1 * (1 - b + b * doc_lens[i] / avgdl)
            score += idf * numerator / denominator
        scores.append(round(score, 4))
    return scores

Explanation

  1. Compute document statistics: Calculate the average document length and document frequency (number of documents containing each term).
  2. IDF (Inverse Document Frequency): For each query term, compute log((N - df + 0.5) / (df + 0.5) + 1) which gives higher weight to rarer terms.
  3. TF component: The term frequency is normalized using the formula tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl)). Parameter k1 controls term frequency saturation and b controls length normalization.
  4. Sum the IDF * TF product for all query terms to get each document's BM25 score.

Complexity

  • Time: O(N Q D) where N is number of documents, Q is query length, D is average document length
  • Space: O(V * N) where V is vocabulary size for storing term frequencies