← back

Implement TF-IDF (Term Frequency-Inverse Document Frequency)

#60 · NLP · Medium

⊣ Solve on deep-ml.com

Problem

Implement TF-IDF (Term Frequency-Inverse Document Frequency) from scratch. Given a list of documents (each a string), compute the TF-IDF matrix where each row corresponds to a document and each column corresponds to a unique term in the corpus.

Solution

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
import math

def compute_tfidf(corpus):
    docs = [doc.lower().split() for doc in corpus]
    vocab = sorted(set(word for doc in docs for word in doc))
    word_to_idx = {w: i for i, w in enumerate(vocab)}
    n_docs = len(docs)

    # Document frequency
    df = {}
    for word in vocab:
        df[word] = sum(1 for doc in docs if word in doc)

    tfidf_matrix = []
    for doc in docs:
        word_count = {}
        for word in doc:
            word_count[word] = word_count.get(word, 0) + 1
        total_terms = len(doc)

        row = [0.0] * len(vocab)
        for word, count in word_count.items():
            tf = count / total_terms
            idf = math.log(n_docs / df[word])
            row[word_to_idx[word]] = round(tf * idf, 4)
        tfidf_matrix.append(row)

    return tfidf_matrix

Explanation

  1. Tokenize each document by lowercasing and splitting on whitespace.
  2. Build a sorted vocabulary of all unique terms.
  3. Compute document frequency (DF) for each term: how many documents contain it.
  4. For each document, compute term frequency (TF) as count / total terms.
  5. Compute IDF as log(N / DF) where N is the total number of documents.
  6. TF-IDF for each term in each document is TF * IDF.

Complexity

  • Time: O(N * V) where N is number of documents and V is vocabulary size
  • Space: O(N * V) for the TF-IDF matrix