Gradient Bandit Action Selection

#163 · Reinforcement Learning · Medium

Problem

Implement Gradient Bandit action selection using a softmax preference-based method. Each action has a numerical preference H(a), and action probabilities are computed via softmax. Update preferences using a gradient ascent rule.

Solution

import numpy as np

def gradient_bandit_select(preferences: np.ndarray) -> tuple:
    preferences = preferences - np.max(preferences)
    exp_prefs = np.exp(preferences)
    probs = exp_prefs / np.sum(exp_prefs)
    action = np.random.choice(len(preferences), p=probs)
    return int(action), probs

def gradient_bandit_update(preferences: np.ndarray, action: int,
                           reward: float, avg_reward: float,
                           probs: np.ndarray, alpha: float) -> np.ndarray:
    baseline = reward - avg_reward
    for a in range(len(preferences)):
        if a == action:
            preferences[a] += alpha * baseline * (1 - probs[a])
        else:
            preferences[a] -= alpha * baseline * probs[a]
    return preferences

Explanation

Action selection: Convert preferences to probabilities via softmax (subtract max for numerical stability), then sample an action from the resulting distribution.
Update: After receiving a reward, compute baseline = reward - avg_reward.
For the selected action: increase its preference proportional to alpha * baseline * (1 - pi(a)).
For all other actions: decrease their preference proportional to alpha * baseline * pi(a).
This is stochastic gradient ascent on expected reward. The baseline reduces variance without introducing bias.

Complexity

Time: O(n) where n is the number of actions
Space: O(n) for the probability vector

← #162 #164 →