#101 · Reinforcement Learning · Hard
⊣ Solve on deep-ml.comImplement the GRPO (Group Relative Policy Optimization) objective function used in reinforcement learning for language models. Given old and new log probabilities, advantages, and a clipping parameter epsilon, compute the GRPO objective with clipped surrogate loss.
import numpy as np
def grpo_objective(
log_probs_new: np.ndarray,
log_probs_old: np.ndarray,
advantages: np.ndarray,
epsilon: float = 0.2,
beta: float = 0.01
) -> float:
# Importance sampling ratio
ratio = np.exp(log_probs_new - log_probs_old)
# Clipped surrogate objective
unclipped = ratio * advantages
clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantages
# Take the minimum (pessimistic bound)
surrogate = np.minimum(unclipped, clipped)
# KL penalty term (approximation)
kl_div = np.mean(log_probs_old - log_probs_new)
# GRPO objective: maximize surrogate - beta * KL
objective = np.mean(surrogate) - beta * kl_div
return round(float(objective), 4)r = exp(log_pi_new - log_pi_old) measures how much the new policy differs from the old one for each action.[1-epsilon, 1+epsilon] to prevent excessively large policy updates. Taking the minimum of clipped and unclipped ensures a conservative (pessimistic) update.