Implement the GRPO Objective Function

#101 · Reinforcement Learning · Hard

⊣ Solve on deep-ml.com

Problem

Implement the GRPO (Group Relative Policy Optimization) objective function used in reinforcement learning for language models. Given old and new log probabilities, advantages, a clipping parameter epsilon, and a KL penalty coefficient beta, compute the GRPO objective: the clipped surrogate term minus the beta-weighted KL penalty.
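
As a sketch, the quantity computed by the solution on this page can be written out as follows. The symbols r_i, A_i, a_i and N are notation assumed here for the per-sample ratio, advantage, sampled action and sample count. The KL term is the simple sample-mean estimate used in the solution's code, which is not necessarily the estimator used in every GRPO implementation.

```latex
r_i = \exp\bigl(\log \pi_{\text{new}}(a_i) - \log \pi_{\text{old}}(a_i)\bigr)

J = \frac{1}{N} \sum_{i=1}^{N}
      \min\bigl(r_i A_i,\ \operatorname{clip}(r_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i\bigr)
    \;-\; \beta \cdot \frac{1}{N} \sum_{i=1}^{N}
      \bigl(\log \pi_{\text{old}}(a_i) - \log \pi_{\text{new}}(a_i)\bigr)
```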

Solution

import numpy as np

def grpo_objective(
    log_probs_new: np.ndarray,
    log_probs_old: np.ndarray,
    advantages: np.ndarray,
    epsilon: float = 0.2,
    beta: float = 0.01
) -> float:
    # Importance sampling ratio
    ratio = np.exp(log_probs_new - log_probs_old)

    # Clipped surrogate objective
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantages

    # Take the minimum (pessimistic bound)
    surrogate = np.minimum(unclipped, clipped)

    # KL penalty term: sample-mean estimate of KL(old || new),
    # assuming the samples were drawn from the old policy
    kl_div = np.mean(log_probs_old - log_probs_new)

    # GRPO objective: maximize surrogate - beta * KL
    objective = np.mean(surrogate) - beta * kl_div
    return round(float(objective), 4)
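
To make the individual steps concrete, here is a small walkthrough that recomputes the same intermediate quantities on three made-up samples. The input values are illustrative assumptions chosen so that clipping is triggered. They are not taken from the problem's test cases.

```python
import numpy as np

# Illustrative inputs (assumed for this walkthrough only)
log_probs_old = np.log(np.array([0.5, 0.4, 0.2]))
log_probs_new = np.log(np.array([0.6, 0.3, 0.3]))
advantages = np.array([1.0, -1.0, 0.5])
epsilon, beta = 0.2, 0.01

# Importance sampling ratio: approximately [1.2, 0.75, 1.5]
ratio = np.exp(log_probs_new - log_probs_old)

# Ratio clipped to [0.8, 1.2]: approximately [1.2, 0.8, 1.2]
clipped_ratio = np.clip(ratio, 1 - epsilon, 1 + epsilon)

# Element-wise minimum of the two surrogates: approximately [1.2, -0.8, 0.6]
surrogate = np.minimum(ratio * advantages, clipped_ratio * advantages)

# Sample-mean log-ratio used as the KL estimate: approximately -0.1000
kl_estimate = np.mean(log_probs_old - log_probs_new)

objective = np.mean(surrogate) - beta * kl_estimate
print(round(float(objective), 4))  # 0.3343
```

In the second sample the advantage is negative and the ratio falls below 1 - epsilon, so the clipped term (-0.8) is the smaller one and is kept. In the third sample the ratio exceeds 1 + epsilon with a positive advantage, so the clipped term (0.6) is kept instead of 0.75. The KL estimate comes out negative on this tiny batch, which a true KL divergence cannot be; this is a property of the simple sample-mean estimator.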

Explanation

  1. Importance sampling ratio: r = exp(log_pi_new - log_pi_old) measures how much the new policy differs from the old one for each action.
  2. Clipped surrogate: Clip the ratio to [1-epsilon, 1+epsilon] to prevent excessively large policy updates. Taking the minimum of clipped and unclipped ensures a conservative (pessimistic) update.
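  3. KL penalty: The mean of log_pi_old - log_pi_new estimates the KL divergence between the old and new policies, and beta scales how strongly the new policy is penalized for deviating from the old one. Separately, GRPO uses group-relative advantages, where rewards are normalized within each group of sampled responses; this function expects those advantages as an input and does not compute them.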
  4. The final objective balances reward maximization (surrogate) with policy stability (KL penalty).
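
The group-relative advantages mentioned in step 3 are passed into the solution rather than computed by it. A minimal sketch of one common way to produce them is shown below, assuming rewards arrive as a 2-D array with one row per group. The helper name `group_relative_advantages` and the `eps` stabilizer are hypothetical choices made for illustration and are not part of the original solution.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Hypothetical helper: normalize rewards within each group (row).

    Each reward is shifted by its group's mean and scaled by its group's
    standard deviation; eps guards against division by zero when all
    rewards in a group are identical.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Two groups of four sampled responses each (made-up rewards)
rewards = np.array([[1.0, 0.0, 0.0, 1.0],
                    [2.0, 4.0, 6.0, 8.0]])
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because each row is centered on its own mean, responses are scored only relative to the other responses in the same group, which is what lets GRPO avoid a separate learned value function.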

Complexity

  • Time: O(n) where n is the number of samples
  • Space: O(n) for intermediate ratio and clipped arrays