
Group Relative Advantage for GRPO

#224 · Reinforcement Learning · Easy


Problem

Compute the Group Relative Advantage used in GRPO (Group Relative Policy Optimization). Given a set of rewards for completions in a group, compute the advantage for each completion by normalizing rewards using the group mean and standard deviation.

Solution

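def group_relative_advantage(rewards: list[float]) -> list[float]:
    """Return the z-scored reward of each completion in one group."""
    n = len(rewards)
    if n == 0:
        return []

    # Group statistics; the variance divides by n (population form).
    mean_r = sum(rewards) / n
    variance = sum((r - mean_r) ** 2 for r in rewards) / n
    std_r = variance ** 0.5

    # Identical rewards give (near-)zero std; return zeros instead of dividing.
    if std_r < 1e-8:
        return [0.0] * n

    # Normalize each reward and round to 4 decimal places.
    advantages = [(r - mean_r) / std_r for r in rewards]
    return [round(a, 4) for a in advantages]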

Explanation

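  1. GRPO samples a group of completions for each prompt and scores each one with a reward model or other reward function.
  2. Unlike PPO, which trains a value function to serve as a baseline, GRPO normalizes rewards within the group: A_i = (r_i - mean(r)) / std(r). In the solution, std(r) is the population standard deviation (the variance divides by n).
  3. Subtracting the group mean centers the advantages at zero, and dividing by the group standard deviation makes them invariant to shifting the rewards or rescaling them by a positive factor.
  4. If all rewards in the group are equal, the standard deviation is zero and the formula is undefined; the solution returns zero for every advantage in that case.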
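
The properties listed above can be checked on a small group. The following is a minimal sketch, not the reference solution: `normalize_group` and its `eps` parameter are hypothetical names chosen for illustration, and it assumes the same population standard deviation and 4-decimal rounding as the solution, computed here with `statistics.fmean` and `statistics.pstdev` from the Python standard library.

```python
from statistics import fmean, pstdev


def normalize_group(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Hypothetical helper: z-score rewards within one group."""
    if not rewards:
        return []
    mu = fmean(rewards)
    sigma = pstdev(rewards, mu)  # population std (divides by n)
    if sigma < eps:
        # All rewards (nearly) equal: no completion is better than another.
        return [0.0] * len(rewards)
    return [round((r - mu) / sigma, 4) for r in rewards]


# mean = 2, population variance = 2/3, std ≈ 0.8165
base = normalize_group([1.0, 2.0, 3.0])

# Same rewards under the map r -> 10 * r + 5.
shifted_scaled = normalize_group([15.0, 25.0, 35.0])

# Degenerate group where every completion gets the same reward.
constant = normalize_group([0.5, 0.5, 0.5])

print(base)            # [-1.2247, 0.0, 1.2247]
print(shifted_scaled)  # [-1.2247, 0.0, 1.2247]
print(constant)        # [0.0, 0.0, 0.0]
```

The first two groups produce the same advantages because the shift cancels in r_i - mean(r) and the positive scale factor cancels between the numerator and std(r).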

Complexity

  • Time: O(n) where n is the group size
  • Space: O(n) for the advantages