Upper Confidence Bound (UCB) Action Selection

#162 · Reinforcement Learning · Easy

Problem

Implement Upper Confidence Bound (UCB) action selection for the multi-armed bandit problem. UCB selects the action that maximizes Q(a) + c * sqrt(ln(t) / N(a)), balancing exploitation (high Q) with exploration (high uncertainty).

Solution

import numpy as np

def ucb_action_selection(q_values: np.ndarray, action_counts: np.ndarray,
                         t: int, c: float = 2.0) -> int:
    n_actions = len(q_values)
    for a in range(n_actions):
        if action_counts[a] == 0:
            return a

    ucb_values = q_values + c * np.sqrt(np.log(t) / action_counts)
    return int(np.argmax(ucb_values))

Explanation

If any action has never been tried (N(a) = 0), select it immediately (ensures every action is tried at least once).
Otherwise compute the UCB value for each action: Q(a) + c * sqrt(ln(t) / N(a)).
The first term is the current estimated value (exploitation). The second term is the confidence bonus which shrinks as an action is tried more often.
The parameter c controls the degree of exploration. Higher c favors exploration.

Complexity

Time: O(n) where n is the number of actions
Space: O(1) beyond the input arrays

← #161 #163 →