← back

Compute Pointwise Mutual Information

#111 · NLP · Medium

⊣ Solve on deep-ml.com

Problem

Compute the Pointwise Mutual Information (PMI) between pairs of words in a corpus. PMI measures how much more (or less) likely two words are to co-occur compared to what we would expect if they were independent.

Solution

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
import math
from collections import Counter, defaultdict

def pointwise_mutual_information(corpus: list[list[str]], word1: str, word2: str, window_size: int = None) -> float:
    total_words = sum(len(doc) for doc in corpus)
    word_counts = Counter()
    co_occur_count = 0

    for doc in corpus:
        word_counts.update(doc)
        if window_size is None:
            # Document-level co-occurrence
            words_in_doc = set(doc)
            if word1 in words_in_doc and word2 in words_in_doc:
                co_occur_count += 1
        else:
            # Window-based co-occurrence
            for i, w in enumerate(doc):
                if w == word1:
                    start = max(0, i - window_size)
                    end = min(len(doc), i + window_size + 1)
                    if word2 in doc[start:end]:
                        co_occur_count += 1

    p_word1 = word_counts[word1] / total_words
    p_word2 = word_counts[word2] / total_words

    if window_size is None:
        n_contexts = len(corpus)
        p_co_occur = co_occur_count / n_contexts
    else:
        p_co_occur = co_occur_count / total_words

    if p_co_occur == 0 or p_word1 == 0 or p_word2 == 0:
        return 0.0

    pmi = math.log2(p_co_occur / (p_word1 * p_word2))
    return round(pmi, 4)

Explanation

  1. Word probabilities: Count occurrences of each word and divide by total words to get P(word).
  2. Co-occurrence probability: Count how often word1 and word2 appear together (either in the same document or within a sliding window) and normalize.
  3. PMI formula: PMI(x, y) = log2(P(x, y) / (P(x) * P(y))).
  4. - PMI > 0: words co-occur more than expected (positively associated).
  5. - PMI = 0: words are independent.
  6. - PMI < 0: words co-occur less than expected (negatively associated).
  7. PMI is widely used in NLP for collocation detection, word embeddings, and topic modeling.

Complexity

  • Time: O(N * W) where N is corpus size and W is window size (or O(N) for document-level)
  • Space: O(V) where V is vocabulary size