← back

Upper Confidence Bound (UCB) Action Selection

#162 · Reinforcement Learning · Easy

⊣ Solve on deep-ml.com

Problem

Implement Upper Confidence Bound (UCB) action selection for the multi-armed bandit problem. UCB selects the action that maximizes Q(a) + c * sqrt(ln(t) / N(a)), balancing exploitation (high Q) with exploration (high uncertainty).

Solution

1
2
3
4
5
6
7
8
9
10
11
import numpy as np

def ucb_action_selection(q_values: np.ndarray, action_counts: np.ndarray,
                         t: int, c: float = 2.0) -> int:
    n_actions = len(q_values)
    for a in range(n_actions):
        if action_counts[a] == 0:
            return a

    ucb_values = q_values + c * np.sqrt(np.log(t) / action_counts)
    return int(np.argmax(ucb_values))

Explanation

  1. If any action has never been tried (N(a) = 0), select it immediately (ensures every action is tried at least once).
  2. Otherwise compute the UCB value for each action: Q(a) + c * sqrt(ln(t) / N(a)).
  3. The first term is the current estimated value (exploitation). The second term is the confidence bonus which shrinks as an action is tried more often.
  4. The parameter c controls the degree of exploration. Higher c favors exploration.

Complexity

  • Time: O(n) where n is the number of actions
  • Space: O(1) beyond the input arrays