← back

Gradient Bandit Action Selection

#163 · Reinforcement Learning · Medium

⊣ Solve on deep-ml.com

Problem

Implement Gradient Bandit action selection using a softmax preference-based method. Each action has a numerical preference H(a), and action probabilities are computed via softmax. Update preferences using a gradient ascent rule.

Solution

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
import numpy as np

def gradient_bandit_select(preferences: np.ndarray) -> tuple:
    preferences = preferences - np.max(preferences)
    exp_prefs = np.exp(preferences)
    probs = exp_prefs / np.sum(exp_prefs)
    action = np.random.choice(len(preferences), p=probs)
    return int(action), probs

def gradient_bandit_update(preferences: np.ndarray, action: int,
                           reward: float, avg_reward: float,
                           probs: np.ndarray, alpha: float) -> np.ndarray:
    baseline = reward - avg_reward
    for a in range(len(preferences)):
        if a == action:
            preferences[a] += alpha * baseline * (1 - probs[a])
        else:
            preferences[a] -= alpha * baseline * probs[a]
    return preferences

Explanation

  1. Action selection: Convert preferences to probabilities via softmax (subtract max for numerical stability), then sample an action from the resulting distribution.
  2. Update: After receiving a reward, compute baseline = reward - avg_reward.
  3. For the selected action: increase its preference proportional to alpha * baseline * (1 - pi(a)).
  4. For all other actions: decrease their preference proportional to alpha * baseline * pi(a).
  5. This is stochastic gradient ascent on expected reward. The baseline reduces variance without introducing bias.

Complexity

  • Time: O(n) where n is the number of actions
  • Space: O(n) for the probability vector