Compute the group-relative advantage used in GRPO (Group Relative Policy Optimization). Given the rewards for the completions in one group, compute each completion's advantage by normalizing its reward with the group mean and standard deviation.
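def group_relative_advantage(rewards: list[float]) -> list[float]:
    """Return (r_i - mean) / std for each reward, rounded to 4 decimals."""
    n = len(rewards)
    if n == 0:
        return []
    mean_r = sum(rewards) / n
    # Population variance: divide by n, not n - 1.
    variance = sum((r - mean_r) ** 2 for r in rewards) / n
    std_r = variance ** 0.5
    # Rewards are (nearly) identical, so there is no relative signal;
    # return zeros instead of dividing by a near-zero std.
    if std_r < 1e-8:
        return [0.0] * n
    advantages = [(r - mean_r) / std_r for r in rewards]
    return [round(a, 4) for a in advantages]

A_i = (r_i - mean(r)) / std(r).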
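
As a quick sanity check of the normalization, here is a minimal self-contained sketch that computes the same quantity with the standard library's `statistics` module. The function name `grpo_advantages_stdlib` and the `eps` parameter are hypothetical names chosen for this illustration. It assumes the same conventions as the solution: population standard deviation, a zero vector when the group has no spread, and rounding to 4 decimals.

```python
from statistics import fmean, pstdev


def grpo_advantages_stdlib(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Illustrative variant: normalize rewards by the group mean and population std."""
    if not rewards:
        return []
    mean_r = fmean(rewards)
    std_r = pstdev(rewards)  # population std (divides by n)
    if std_r < eps:
        # Every completion scored (almost) the same: no relative advantage.
        return [0.0] * len(rewards)
    return [round((r - mean_r) / std_r, 4) for r in rewards]


if __name__ == "__main__":
    # Binary rewards, two successes out of four: mean 0.5, std 0.5.
    print(grpo_advantages_stdlib([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
    # Identical rewards give a zero advantage for every completion.
    print(grpo_advantages_stdlib([0.5, 0.5, 0.5]))  # [0.0, 0.0, 0.0]
```

In the binary example, each successful completion sits one standard deviation above the group mean and each failed one sits one below, so the advantages sum to zero across the group.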