Calculate Unigram Probability from Corpus

#129 · NLP · Easy

Problem

Given a text corpus, compute the unigram probability of each word. The unigram probability of a word is its frequency count divided by the total number of words in the corpus.
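As a small worked instance of the definition, the sketch below applies count / total directly to a toy corpus. The corpus and the names `tokens`, `total`, and `probs` are illustrative assumptions, not part of the problem statement.

```python
# Toy corpus, already flattened into lowercase tokens (illustrative data).
tokens = "the cat saw the dog".split()
total = len(tokens)  # 5 words in the corpus

# P(w) = count(w) / total, applied directly from the definition.
probs = {word: tokens.count(word) / total for word in set(tokens)}

print(probs["the"])  # 2 / 5 -> 0.4
print(probs["cat"])  # 1 / 5 -> 0.2
```

Calling `tokens.count` once per distinct word rescans the list each time, so this form is only meant to make the arithmetic visible on a tiny input.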

Solution

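def unigram_probabilities(corpus: list[str]) -> dict[str, float]:
    """Map each lowercased, whitespace-separated word to count / total."""
    word_counts: dict[str, int] = {}
    total = 0
    for sentence in corpus:
        # Lowercase first so "The" and "the" are counted as one word.
        words = sentence.lower().split()
        for word in words:
            word_counts[word] = word_counts.get(word, 0) + 1
            total += 1

    # An empty corpus yields an empty dict, so no division by zero occurs.
    return {word: count / total for word, count in word_counts.items()}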

Explanation

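Iterate over each sentence in the corpus, convert it to lowercase, and split it on whitespace into words. Punctuation is not stripped, so "cat." and "cat" count as different words.
Count occurrences of each unique word and track the total word count.
Divide each word's count by the total to get its unigram probability.
For a non-empty corpus, the probabilities sum to 1.0 across all unique words, up to floating-point rounding.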
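The points above can be checked on a small input. This is a minimal sketch: the two-sentence corpus is made up for illustration, and the function is repeated only so the snippet runs on its own.

```python
import math

def unigram_probabilities(corpus: list[str]) -> dict[str, float]:
    # Same logic as the solution above, repeated so this snippet runs alone.
    word_counts: dict[str, int] = {}
    total = 0
    for sentence in corpus:
        for word in sentence.lower().split():
            word_counts[word] = word_counts.get(word, 0) + 1
            total += 1
    return {word: count / total for word, count in word_counts.items()}

# Illustrative two-sentence corpus (6 whitespace-separated tokens).
probs = unigram_probabilities(["The cat sat", "the cat ran."])

print(math.isclose(probs["the"], 2 / 6))       # True: "The" and "the" merge
print("ran." in probs)                         # True: punctuation stays attached
print("ran" in probs)                          # False
print(math.isclose(sum(probs.values()), 1.0))  # True, up to rounding
```

Comparing the sum with `math.isclose` rather than `==` avoids depending on exact floating-point results.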

Complexity

Time: O(W) where W is the total number of words in the corpus
Space: O(V) where V is the vocabulary size (number of unique words)
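To show where these bounds come from, here is a sketch of the same computation using `collections.Counter` from the standard library. The function name `unigram_probabilities_counter` is hypothetical, and the bounds assume average-case constant-time dictionary updates.

```python
from collections import Counter

def unigram_probabilities_counter(corpus: list[str]) -> dict[str, float]:
    # Hypothetical variant name; not part of the original solution.
    counts = Counter()
    for sentence in corpus:
        # One pass over the words: O(W) dictionary updates in total.
        counts.update(sentence.lower().split())
    total = sum(counts.values())  # O(V): one term per distinct word
    # The result holds one entry per distinct word: O(V) space.
    return {word: count / total for word, count in counts.items()}

probs = unigram_probabilities_counter(["a b a", "b a"])
print(probs)  # {'a': 0.6, 'b': 0.4}
```

Summing the counts afterwards adds an O(V) step, which does not change the overall O(W) time since V is at most W.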
