← back

Policy Gradient with REINFORCE

#122 · Reinforcement Learning · Hard

⊣ Solve on deep-ml.com

Problem

Implement the REINFORCE policy gradient algorithm. Given a policy network (parameterized by weights), a set of episode trajectories (states, actions, rewards), and a discount factor gamma, compute the policy gradient update and return updated weights.

Solution

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
import numpy as np

def reinforce(policy_weights: np.ndarray, episodes: list, learning_rate: float = 0.01, gamma: float = 0.99) -> np.ndarray:
    weight_update = np.zeros_like(policy_weights)

    for episode in episodes:
        states, actions, rewards = episode

        # Compute discounted returns for each timestep
        T = len(rewards)
        returns = np.zeros(T)
        G = 0.0
        for t in reversed(range(T)):
            G = rewards[t] + gamma * G
            returns[t] = G

        # Normalize returns for stability
        if len(returns) > 1 and np.std(returns) > 0:
            returns = (returns - np.mean(returns)) / (np.std(returns) + 1e-8)

        for t in range(T):
            state = np.array(states[t])
            action = actions[t]

            # Softmax policy: pi(a|s) = softmax(s @ W)
            logits = state @ policy_weights
            logits = logits - np.max(logits)  # numerical stability
            exp_logits = np.exp(logits)
            probs = exp_logits / np.sum(exp_logits)

            # Gradient of log-policy: d/dW log pi(a|s)
            # = state^T (one_hot(a) - probs)
            one_hot = np.zeros_like(probs)
            one_hot[action] = 1.0
            grad = np.outer(state, one_hot - probs)

            weight_update += returns[t] * grad

    policy_weights += learning_rate * weight_update / len(episodes)
    return policy_weights

Explanation

  1. For each episode, compute discounted returns G_t at every timestep by working backwards from the last reward.
  2. Optionally normalize returns across the episode for training stability.
  3. Compute the softmax policy probabilities from the state and weight matrix.
  4. The policy gradient for log pi(a|s) under softmax is state^T * (one_hot(action) - probs).
  5. Multiply each gradient by the corresponding return G_t and accumulate.
  6. Average over episodes and apply the learning rate to update weights.

Complexity

  • Time: O(E T S * A) where E = episodes, T = max timesteps, S = state dim, A = action dim
  • Space: O(S * A) for the weight matrix and gradient accumulator