Rubric-Based LLM Judge Evaluation

#317 · LLM · Medium

Problem

Implement a rubric-based LLM judge evaluation system. Given a response text and a rubric with scored criteria, evaluate the response against each criterion and produce an overall score.

Solution

from typing import Any, Dict


def rubric_based_evaluation(
    response: str,
    rubric: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Score a response against each rubric criterion by keyword matching."""
    results = {}
    total_score = 0.0
    total_weight = 0.0

    # Lowercase once so every keyword check is case-insensitive.
    response_lower = response.lower()

    for criterion, details in rubric.items():
        keywords = details.get("keywords", [])
        weight = details.get("weight", 1.0)
        max_score = details.get("max_score", 5)

        # Substring match: a keyword counts if it appears anywhere in the response.
        matches = sum(1 for kw in keywords if kw.lower() in response_lower)
        # A criterion with no keywords scores 0 rather than dividing by zero.
        ratio = matches / len(keywords) if keywords else 0
        score = round(ratio * max_score, 2)

        results[criterion] = {
            "score": score,
            "max_score": max_score,
            "weight": weight,
            "matched_keywords": matches,
            "total_keywords": len(keywords)
        }

        # Accumulate weighted points earned and weighted points available.
        total_score += score * weight
        total_weight += max_score * weight

    # Normalize to [0, 1]; an empty or zero-weight rubric yields 0.0.
    overall = round(total_score / total_weight, 4) if total_weight > 0 else 0.0

    return {
        "criteria_scores": results,
        "overall_score": overall,
        "weighted_total": round(total_score, 2),
        "weighted_max": round(total_weight, 2)
    }

Explanation

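  1. Iterate over each criterion in the rubric. Each criterion has a list of expected keywords, a weight (default 1.0), and a max score (default 5).
  2. For each criterion, count how many keywords appear in the response. Matching is case-insensitive and substring-based, so a keyword also matches when it occurs inside a longer word.
  3. The criterion score is the fraction of keywords matched, scaled by the max score and rounded to two decimals. A criterion with no keywords scores 0.
  4. The overall score is the weighted sum of criterion scores divided by the weighted sum of max scores, which gives a value between 0 and 1 (rounded to four decimals).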
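
To make the arithmetic in the steps above concrete, here is a small worked example. The response text, the criterion names ("accuracy", "clarity"), and the keyword lists are illustrative assumptions, not part of the original problem. The function is repeated in compact form only so the snippet runs on its own.

```python
from typing import Any, Dict


def rubric_based_evaluation(
    response: str,
    rubric: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    # Compact restatement of the solution so this example is self-contained.
    results, total_score, total_weight = {}, 0.0, 0.0
    response_lower = response.lower()
    for criterion, details in rubric.items():
        keywords = details.get("keywords", [])
        weight = details.get("weight", 1.0)
        max_score = details.get("max_score", 5)
        matches = sum(1 for kw in keywords if kw.lower() in response_lower)
        ratio = matches / len(keywords) if keywords else 0
        score = round(ratio * max_score, 2)
        results[criterion] = {
            "score": score,
            "max_score": max_score,
            "weight": weight,
            "matched_keywords": matches,
            "total_keywords": len(keywords),
        }
        total_score += score * weight
        total_weight += max_score * weight
    overall = round(total_score / total_weight, 4) if total_weight > 0 else 0.0
    return {
        "criteria_scores": results,
        "overall_score": overall,
        "weighted_total": round(total_score, 2),
        "weighted_max": round(total_weight, 2),
    }


# Hypothetical inputs, chosen only to illustrate the scoring.
response = "Gradient descent updates the weights by following the negative gradient of the loss."
rubric = {
    "accuracy": {
        "keywords": ["gradient", "loss", "learning rate", "convergence"],
        "weight": 2.0,
        "max_score": 5,
    },
    "clarity": {
        "keywords": ["weights", "updates"],
        "weight": 1.0,
        "max_score": 5,
    },
}

result = rubric_based_evaluation(response, rubric)

# accuracy: 2 of 4 keywords matched -> 0.5 * 5 = 2.5, weighted 2.5 * 2.0 = 5.0
# clarity:  2 of 2 keywords matched -> 1.0 * 5 = 5.0, weighted 5.0 * 1.0 = 5.0
# overall:  (5.0 + 5.0) / (5 * 2.0 + 5 * 1.0) = 10 / 15 -> 0.6667
print(result["criteria_scores"]["accuracy"]["score"])  # 2.5
print(result["overall_score"])                         # 0.6667
```

Because "accuracy" carries twice the weight of "clarity", its two missing keywords pull the overall score down more than a miss on "clarity" would.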

Complexity

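  • Time: O(c k n) where c is the number of criteria, k is keywords per criterion, and n is response length, treating each keyword's length as bounded by a constant
  • Space: O(c + n): one result entry per criterion plus the lowercased copy of the response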