Implement the GRPO Objective Function

#101 · Reinforcement Learning · Hard

Problem

Implement the GRPO (Group Relative Policy Optimization) objective function used in reinforcement learning for language models. Given old and new log probabilities, advantages, a clipping parameter epsilon, and a KL penalty coefficient beta, compute the GRPO objective: the mean clipped surrogate minus beta times a KL penalty.

Solution

import numpy as np

def grpo_objective(
    log_probs_new: np.ndarray,
    log_probs_old: np.ndarray,
    advantages: np.ndarray,
    epsilon: float = 0.2,
    beta: float = 0.01
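) -> float:
    """Return the mean clipped surrogate minus beta * KL, rounded to 4 decimals.

    All arrays hold one entry per sample and are expected to share a shape.
    """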
    # Importance sampling ratio
    ratio = np.exp(log_probs_new - log_probs_old)

    # Clipped surrogate objective
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantages

    # Take the minimum (pessimistic bound)
    surrogate = np.minimum(unclipped, clipped)

    # KL penalty term: sample-mean log-ratio estimate of KL(old || new),
    # which assumes the samples were drawn from the old policy
    kl_div = np.mean(log_probs_old - log_probs_new)

    # GRPO objective: maximize surrogate - beta * KL
    objective = np.mean(surrogate) - beta * kl_div
    return round(float(objective), 4)
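
To make the steps concrete, here is a small trace of the same computation on made-up numbers. The three probabilities and advantages are illustrative assumptions, not values from the problem; the code repeats the solution's arithmetic inline so it can run on its own.

```python
import numpy as np

# Made-up inputs for three samples (illustrative only).
log_probs_old = np.log(np.array([0.5, 0.5, 0.5]))
log_probs_new = np.log(np.array([0.75, 0.25, 0.5]))
advantages = np.array([1.0, -1.0, 1.0])
epsilon, beta = 0.2, 0.01

# Ratios are approximately [1.5, 0.5, 1.0].
ratio = np.exp(log_probs_new - log_probs_old)

# Clipping to [0.8, 1.2] gives approximately [1.2, 0.8, 1.0].
clipped_ratio = np.clip(ratio, 1 - epsilon, 1 + epsilon)

# Elementwise minimum of the two candidate terms:
#   sample 0: min(1.5, 1.2)   = 1.2   (gain is capped)
#   sample 1: min(-0.5, -0.8) = -0.8  (the more pessimistic value is kept)
#   sample 2: min(1.0, 1.0)   = 1.0   (ratio already inside the clip range)
surrogate = np.minimum(ratio * advantages, clipped_ratio * advantages)

# Same log-ratio KL estimate and final combination as in the solution.
kl_div = np.mean(log_probs_old - log_probs_new)
objective = np.mean(surrogate) - beta * kl_div
print(round(float(objective), 4))
```

With beta this small, the result is dominated by the surrogate mean; the KL term only shifts it slightly.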

Explanation

Importance sampling ratio: r = exp(log_pi_new - log_pi_old) measures how much the new policy differs from the old one for each action.
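Clipped surrogate: Clip the ratio to [1-epsilon, 1+epsilon] to prevent excessively large policy updates. Taking the elementwise minimum of the clipped and unclipped terms gives a conservative (pessimistic) bound on the improvement.
KL penalty: A KL divergence term penalizes the new policy for deviating too far from the old policy. In this solution it is estimated as the mean of log_pi_old - log_pi_new over the samples and scaled by beta; the original GRPO formulation measures the penalty against a separate frozen reference policy, for which the old policy stands in here.
Group-relative advantages: GRPO computes advantages by normalizing rewards within each group of responses sampled for the same prompt. This function takes those advantages as input.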
The final objective balances reward maximization (surrogate) with policy stability (KL penalty).
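
The advantages passed to the objective are computed beforehand. A minimal sketch of the group-relative normalization, assuming one scalar reward per sampled response and a single group: the helper name `group_relative_advantages`, the use of the population standard deviation, and the small `eps` guard against division by zero are assumptions for illustration, not part of the problem statement.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Hypothetical helper: standardize rewards within one group of responses."""
    rewards = np.asarray(rewards, dtype=float)
    # Subtract the group mean and divide by the group standard deviation;
    # eps avoids dividing by zero when every reward in the group is equal.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four responses to the same prompt, two rewarded and two not (made-up values).
rewards = np.array([1.0, 0.0, 0.0, 1.0])
advantages = group_relative_advantages(rewards)
print(advantages)  # close to [1, -1, -1, 1]
```

Responses that beat the group average get positive advantages and the rest get negative ones, so no separate value network is needed to provide a baseline.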

Complexity

Time: O(n) where n is the number of samples
Space: O(n) for intermediate ratio and clipped arrays
