← back

Evaluate Translation Quality with METEOR Score

#110 · NLP · Medium

⊣ Solve on deep-ml.com

Problem

Evaluate translation quality with the METEOR (Metric for Evaluation of Translation with Explicit ORdering) score. Given a candidate translation and one or more reference translations, compute the METEOR score which considers precision, recall, and alignment with a penalty for fragmentation.

Solution

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
from collections import Counter

def meteor_score(candidate: str, reference: str, alpha: float = 0.9, beta: float = 3.0, gamma: float = 0.5) -> float:
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()

    cand_counts = Counter(cand_tokens)
    ref_counts = Counter(ref_tokens)

    # Unigram matches
    matches = sum((cand_counts & ref_counts).values())

    if matches == 0:
        return 0.0

    precision = matches / len(cand_tokens)
    recall = matches / len(ref_tokens)

    # F-mean with alpha weighting recall more heavily
    f_score = (precision * recall) / (alpha * precision + (1 - alpha) * recall)

    # Count chunks (consecutive matched sequences)
    chunks = 0
    in_chunk = False
    ref_set = set(ref_tokens)
    for token in cand_tokens:
        if token in ref_set:
            if not in_chunk:
                chunks += 1
                in_chunk = True
        else:
            in_chunk = False

    # Fragmentation penalty
    if matches > 0:
        frag = chunks / matches
    else:
        frag = 0
    penalty = gamma * (frag ** beta)

    score = f_score * (1 - penalty)
    return round(max(0.0, score), 4)

Explanation

  1. Tokenize and count: Lowercase and split both candidate and reference into tokens. Count unigram matches using counter intersection.
  2. Precision and recall: Precision is matches/candidate_length, recall is matches/reference_length.
  3. F-score: Weighted harmonic mean of precision and recall. Alpha controls the weight (default 0.9 emphasizes recall).
  4. Fragmentation penalty: Count the number of contiguous chunks of matched words. More chunks means more fragmentation. Penalty is gamma * (chunks/matches)^beta.
  5. Final score: F * (1 - penalty). Perfectly ordered translations get low penalty; scrambled translations get high penalty.

Complexity

  • Time: O(n + m) where n and m are candidate and reference lengths
  • Space: O(n + m) for token counters