Calculate Unigram Probability from Corpus

#129 · NLP · Easy

Problem

Given a text corpus, compute the unigram probability of each word. The unigram probability of a word is its frequency count divided by the total number of words in the corpus.
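
The definition can be written as a formula. This is a sketch in notation introduced here, not taken from the problem statement: count(w) is the number of times word w appears, and N is the total number of words in the corpus.

```latex
P(w) = \frac{\operatorname{count}(w)}{N}
```

For example, in the six-word corpus "the cat sat on the mat", the word "the" appears twice, so P(the) = 2/6 ≈ 0.333, and each of the other four words has probability 1/6.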

Solution

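def unigram_probabilities(corpus: list[str]) -> dict[str, float]:
    """Return each word's relative frequency across all sentences in corpus."""
    word_counts: dict[str, int] = {}
    total = 0  # running count of all tokens seen so far
    for sentence in corpus:
        # Lowercase first so "The" and "the" share a count, then split on whitespace.
        words = sentence.lower().split()
        for word in words:
            word_counts[word] = word_counts.get(word, 0) + 1
            total += 1

    # An empty corpus gives an empty dict, so the division never runs with total == 0.
    return {word: count / total for word, count in word_counts.items()}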

Explanation

  1. Iterate over each sentence in the corpus, convert it to lowercase, and split it on whitespace into words. Punctuation stays attached, so "cat." and "cat" count as different words.
  2. Count occurrences of each unique word and track the total word count.
  3. Divide each word's count by the total to get its unigram probability.
  4. For a non-empty corpus, the probabilities sum to 1.0 across all unique words, up to floating-point rounding.
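
The steps above can be checked on a small corpus. The sketch below is self-contained, so it re-implements the function compactly with `collections.Counter` (a convenience swapped in here, not the explicit loop used in the solution). The two-sentence corpus is made up for illustration.

```python
import math
from collections import Counter


def unigram_probabilities(corpus: list[str]) -> dict[str, float]:
    # Lowercase each sentence, split on whitespace, and count every token.
    counts = Counter(
        word for sentence in corpus for word in sentence.lower().split()
    )
    total = sum(counts.values())
    # With an empty corpus, counts is empty and no division takes place.
    return {word: count / total for word, count in counts.items()}


# Hypothetical corpus: 5 tokens, 4 unique words after lowercasing.
corpus = ["The cat sat", "the dog"]
probs = unigram_probabilities(corpus)

print(probs["the"])  # 2 / 5 = 0.4
print(math.isclose(sum(probs.values()), 1.0))  # True
```

Comparing the sum with `math.isclose` rather than `==` allows for floating-point rounding, which can leave the total a hair away from exactly 1.0 on larger vocabularies.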

Complexity

  • Time: O(W) where W is the total number of words in the corpus, treating word length as bounded by a constant
  • Space: O(V) where V is the vocabulary size (number of unique words)