Elo Rating System for Model Comparison

#315 · Machine Learning · Medium

Problem

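Implement the Elo rating system for comparing ML models (or players). Every model starts from the same initial rating. Given a sequence of pairwise comparison outcomes, update the two models' ratings after each comparison, in order, using the Elo formula, and return the final rating of every model.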

Solution

def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """
    score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for draw
    Returns updated (rating_a, rating_b)
    """
    e_a = expected_score(rating_a, rating_b)
    e_b = 1.0 - e_a

    new_rating_a = rating_a + k * (score_a - e_a)
    new_rating_b = rating_b + k * ((1.0 - score_a) - e_b)

    return new_rating_a, new_rating_b

def elo_rating_system(models: list[str], matchups: list[dict],
                      initial_rating: float = 1500.0,
                      k: float = 32.0) -> dict:
    """
    matchups: list of {"model_a": str, "model_b": str, "score_a": float}
    Returns: dict of model name -> final rating
    """
    ratings = {model: initial_rating for model in models}

    for match in matchups:
        a = match["model_a"]
        b = match["model_b"]
        score_a = match["score_a"]

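        new_a, new_b = update_elo(ratings[a], ratings[b], score_a, k)
        # Ratings are rounded after every match, so later matches
        # start from the rounded values, not the exact ones.
        ratings[a] = round(new_a, 2)
        ratings[b] = round(new_b, 2)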

    return ratings
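
A small usage sketch may help here. The block restates the three functions above in compact, algebraically equivalent form so that it runs on its own. The model names and matchups are made up for illustration and are not part of the original problem.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    e_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - e_a)  # A's gain is exactly B's loss
    return rating_a + delta, rating_b - delta


def elo_rating_system(models, matchups, initial_rating=1500.0, k=32.0):
    ratings = {m: initial_rating for m in models}
    for match in matchups:
        a, b = match["model_a"], match["model_b"]
        new_a, new_b = update_elo(ratings[a], ratings[b], match["score_a"], k)
        ratings[a], ratings[b] = round(new_a, 2), round(new_b, 2)
    return ratings


# Hypothetical models and outcomes, invented for this example.
models = ["model_x", "model_y", "model_z"]
matchups = [
    {"model_a": "model_x", "model_b": "model_y", "score_a": 1.0},  # x beats y
    {"model_a": "model_y", "model_b": "model_z", "score_a": 0.5},  # draw
    {"model_a": "model_z", "model_b": "model_x", "score_a": 0.0},  # x beats z
]

ratings = elo_rating_system(models, matchups)

# Print the models from highest to lowest rating.
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating}")
```

Since model_x wins both of its games, it should end on top. The ratings should still sum to roughly 3 × 1500, with any small drift coming from the per-match rounding.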

Explanation

  1. Expected score for model A against B: E_A = 1 / (1 + 10^((R_B - R_A) / 400)). Higher-rated models are expected to win more often.
  2. Rating update: R_new = R_old + K * (actual_score - expected_score). K controls update magnitude.
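  3. The update is zero-sum when both models use the same K: whatever A gains, B loses, so the total rating in the pool stays constant (up to the per-match rounding in this implementation).
  4. In ML model comparison (as used in Chatbot Arena), models are paired and human judges pick the winner. With enough comparisons the ratings settle around values that reflect relative strength, although with a fixed K they keep fluctuating and the final numbers depend on the order of the matchups.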
  5. K factor controls sensitivity: larger K means faster adaptation but more volatility.
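
The points above can be checked with a few lines of arithmetic. The sketch below uses made-up ratings (1500 vs 1500 and 1600 vs 1400) and a hypothetical helper, `single_update`. It is an illustration under those assumptions, not part of the reference solution.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def single_update(rating_a, rating_b, score_a, k=32.0):
    """Hypothetical helper: one unrounded Elo update, returns (new_a, new_b)."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Point 1: equal ratings give an expected score of 0.5;
# a 200-point favorite is expected to score roughly 0.76.
even_expectation = expected_score(1500, 1500)
favorite_expectation = expected_score(1600, 1400)

# Points 2 and 3: the winner's gain equals the loser's loss, and an
# upset moves the ratings more than an expected result does.
even_win = single_update(1500, 1500, 1.0)      # (1516.0, 1484.0)
expected_win = single_update(1600, 1400, 1.0)  # favorite gains about 7.7
upset = single_update(1600, 1400, 0.0)         # favorite loses about 24.3

# Point 5: the update scales linearly with K.
small_k = single_update(1600, 1400, 0.0, k=16.0)
large_k = single_update(1600, 1400, 0.0, k=64.0)

print("expected scores:", even_expectation, round(favorite_expectation, 4))
print("even win:", even_win)
print("expected win:", tuple(round(r, 2) for r in expected_win))
print("upset:", tuple(round(r, 2) for r in upset))
print("favorite's loss at K=16 vs K=64:",
      round(1600 - small_k[0], 2), round(1600 - large_k[0], 2))
```

The upset case shows why Elo rewards surprising results: the further the actual score is from the expected score, the larger the correction.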

Complexity

  • Time: O(m) where m is the number of matchups
  • Space: O(n) where n is the number of models