Pairwise Preference Judge for LLM Comparison

#323 · LLM · Medium

Problem

Implement a pairwise preference judge that compares two LLM responses and determines which one is better based on specified criteria (relevance, coherence, completeness). Return the preferred response index and a score breakdown.
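As a rough illustration of the expected inputs, the sketch below builds one possible set of arguments. The prompt, both responses, and every keyword list are hypothetical values made up for this example; only the three criterion names come from the problem statement.

```python
from typing import Dict, List

# Hypothetical inputs: these strings and keyword lists are illustrative only.
prompt = "Explain what overfitting is."
response_a = (
    "Overfitting is when a model fits noise in the training data "
    "and therefore generalizes poorly."
)
response_b = "It is bad."

# One keyword list per criterion; each criterion is scored independently.
criteria_keywords: Dict[str, List[str]] = {
    "relevance": ["overfitting", "model"],
    "coherence": ["because", "therefore"],
    "completeness": ["training", "generalizes"],
}
```

The solution below takes these four arguments and returns a dictionary with the keys "preferred", "scores_a", "scores_b", "total_a", and "total_b".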

Solution

from typing import Dict, List

def pairwise_preference_judge(
    prompt: str,
    response_a: str,
    response_b: str,
    criteria_keywords: Dict[str, List[str]]
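) -> Dict:
    """Compare two responses by keyword coverage per criterion plus a length score.

    Returns the preferred label ("A", "B", or "TIE"), the per-criterion
    scores for each response, and the rounded totals.
    """
    scores_a: Dict[str, float] = {}
    scores_b: Dict[str, float] = {}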

    # Lowercase once so keyword matching is case-insensitive
    a_lower = response_a.lower()
    b_lower = response_b.lower()

    for criterion, keywords in criteria_keywords.items():
        # Count keywords that occur as substrings of each response
        a_matches = sum(1 for kw in keywords if kw.lower() in a_lower)
        b_matches = sum(1 for kw in keywords if kw.lower() in b_lower)

        # Score = fraction of keywords found; an empty keyword list scores 0.0
        scores_a[criterion] = a_matches / len(keywords) if keywords else 0.0
        scores_b[criterion] = b_matches / len(keywords) if keywords else 0.0

    # Length score: response-to-prompt word ratio divided by 10, capped at 1.0.
    # A response at least 10x the prompt's length earns the maximum; very short
    # responses score low, and excessive length is not penalized.
    prompt_len = len(prompt.split())
    for resp, scores in [(response_a, scores_a), (response_b, scores_b)]:
        resp_len = len(resp.split())
        ratio = resp_len / max(prompt_len, 1)  # guard against an empty prompt
        length_score = min(ratio / 10.0, 1.0)
        scores["length_appropriateness"] = round(length_score, 4)

    total_a = sum(scores_a.values())
    total_b = sum(scores_b.values())

    if total_a > total_b:
        preferred = "A"
    elif total_b > total_a:
        preferred = "B"
    else:
        preferred = "TIE"

    return {
        "preferred": preferred,
        "scores_a": scores_a,
        "scores_b": scores_b,
        "total_a": round(total_a, 4),
        "total_b": round(total_b, 4)
    }

Explanation

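  1. For each evaluation criterion, count how many of its expected keywords appear in each response, using case-insensitive substring matching.
  2. Divide the match count by the number of keywords to get a score between 0 and 1 per criterion; an empty keyword list scores 0.
  3. Add a length_appropriateness score: the response-to-prompt word-count ratio divided by 10, capped at 1. Short responses score low, and any response at least 10 times the prompt's length scores the maximum, so excessive length is not penalized.
  4. Sum all scores for each response and declare the one with the higher total as preferred; equal totals yield "TIE".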
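Steps 1 to 3 can be sketched as two standalone helpers. This is a minimal illustration rather than part of the solution: the helper names keyword_coverage and length_score, and the sample prompt, response, and keywords, are assumptions introduced here.

```python
from typing import List


def keyword_coverage(response: str, keywords: List[str]) -> float:
    """Fraction of keywords found in the response (case-insensitive substring match)."""
    if not keywords:
        return 0.0
    text = response.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)


def length_score(prompt: str, response: str) -> float:
    """Response-to-prompt word ratio divided by 10, capped at 1.0."""
    prompt_words = max(len(prompt.split()), 1)  # avoid dividing by zero
    ratio = len(response.split()) / prompt_words
    return round(min(ratio / 10.0, 1.0), 4)


# Hypothetical inputs, chosen only to keep the arithmetic easy to follow.
prompt = "What is overfitting?"  # 3 words
response = "Overfitting means the model memorizes noise in the training data."  # 10 words

# "overfitting" and "training" are present, "validation" is not: 2 of 3 keywords.
coverage = keyword_coverage(response, ["overfitting", "training", "validation"])

# Ratio is 10 / 3, so the score is (10 / 3) / 10, rounded to 4 places.
length = length_score(prompt, response)

print(round(coverage, 4), length)  # prints: 0.6667 0.3333
```

Because matching is by substring, a short keyword such as "art" would also count as found inside "start"; matching on whole words would need tokenization, which the solution does not do.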

Complexity

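  • Time: O(c · k · n), where c is the number of criteria, k is the number of keywords per criterion, and n is the response length (treating each keyword lookup as a linear scan of the response)
  • Space: O(c) for the score dictionaries, plus O(n) for the lowercased copies and word splits of the responses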