Group Relative Advantage for GRPO

#224 · Reinforcement Learning · Easy

Problem

Compute the Group Relative Advantage used in GRPO (Group Relative Policy Optimization). Given a set of rewards for completions in a group, compute the advantage for each completion by normalizing rewards using the group mean and standard deviation.

Solution

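def group_relative_advantage(rewards: list[float]) -> list[float]:
    """Normalize each reward by the group mean and standard deviation."""
    n = len(rewards)
    if n == 0:
        return []  # empty group: nothing to normalize

    # Group mean and population standard deviation (divide by n, not n - 1).
    mean_r = sum(rewards) / n
    variance = sum((r - mean_r) ** 2 for r in rewards) / n
    std_r = variance ** 0.5

    # Near-identical rewards carry no relative signal; avoid dividing by ~0.
    if std_r < 1e-8:
        return [0.0] * n

    # Standardize, then round to 4 decimal places for output.
    advantages = [(r - mean_r) / std_r for r in rewards]
    return [round(a, 4) for a in advantages]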

Explanation

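GRPO samples a group of completions for each prompt and assigns each one a scalar reward, for example from a reward model.
Instead of training a separate value function as PPO does, GRPO normalizes rewards within the group: A_i = (r_i - mean(r)) / std(r), where the solution uses the population standard deviation (dividing by n).
The resulting advantages have zero mean and unit standard deviation within the group, so they do not change if every reward is shifted by a constant or multiplied by a positive constant.
If all rewards in the group are equal, the standard deviation is zero and the formula is undefined; the solution returns an advantage of zero for every completion in that case.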
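The normalization can be checked on a small group. The sketch below is illustrative only: `normalize_group` is a hypothetical helper that repeats the same population-std formula without the final rounding, and the reward values are made up for the example.

```python
import math


def normalize_group(rewards):
    """Hypothetical helper: (r - mean) / population std, zeros if std is ~0."""
    n = len(rewards)
    mean_r = sum(rewards) / n
    std_r = math.sqrt(sum((r - mean_r) ** 2 for r in rewards) / n)
    if std_r < 1e-8:
        return [0.0] * n
    return [(r - mean_r) / std_r for r in rewards]


rewards = [1.0, 2.0, 3.0, 4.0]  # made-up rewards for one group of 4 completions

# mean = 2.5, population std = sqrt(1.25), roughly 1.1180
base = normalize_group(rewards)
print([round(a, 4) for a in base])  # [-1.3416, -0.4472, 0.4472, 1.3416]

# Shifting and positively rescaling every reward gives the same advantages.
rescaled = normalize_group([10.0 * r + 5.0 for r in rewards])
print(all(math.isclose(a, b) for a, b in zip(base, rescaled)))  # True

# A group with identical rewards falls back to all-zero advantages.
print(normalize_group([2.0, 2.0, 2.0]))  # [0.0, 0.0, 0.0]
```

Completions scoring above the group mean get positive advantages and those below get negative ones, regardless of the absolute reward scale.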

Complexity

Time: O(n) where n is the group size
Space: O(n) for the advantages
