#162 · Reinforcement Learning · Easy
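Implement Upper Confidence Bound (UCB) action selection for the multi-armed bandit problem. UCB selects the action that maximizes Q(a) + c * sqrt(ln(t) / N(a)), balancing exploitation (high Q) with exploration (high uncertainty). Here Q(a) is the estimated value of action a, N(a) is the number of times it has been selected, t is the current timestep, and c is the exploration coefficient.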
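To make the formula concrete, the sketch below evaluates it for two arms with made-up numbers. The values of Q, N, and t, and the helper name `ucb_scores`, are illustrative assumptions and not part of the problem statement.

```python
import numpy as np

def ucb_scores(q, n, t, c):
    # Hypothetical helper: Q(a) + c * sqrt(ln(t) / N(a)) for every arm at once.
    return q + c * np.sqrt(np.log(t) / n)

q = np.array([0.8, 0.5])   # assumed value estimates
n = np.array([90, 10])     # assumed pull counts; arm 1 is rarely tried
t = 100                    # assumed current timestep

low_c = ucb_scores(q, n, t, c=0.5)    # small exploration bonus
high_c = ucb_scores(q, n, t, c=2.0)   # large exploration bonus
print(int(np.argmax(low_c)), int(np.argmax(high_c)))
```

With these numbers, c = 0.5 keeps arm 0 (the higher estimate) on top, while c = 2.0 makes the bonus for the rarely tried arm 1 large enough that it is selected instead.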
import numpy as np

def ucb_action_selection(q_values: np.ndarray, action_counts: np.ndarray,
                         t: int, c: float = 2.0) -> int:
    n_actions = len(q_values)
    # Try every action once before trusting any estimate.
    for a in range(n_actions):
        if action_counts[a] == 0:
            return a
    # All counts are positive here, so the division is safe.
    ucb_values = q_values + c * np.sqrt(np.log(t) / action_counts)
    return int(np.argmax(ucb_values))

If an action has never been tried (N(a) = 0), select it immediately (ensures every action is tried at least once).
Otherwise, select the action that maximizes Q(a) + c * sqrt(ln(t) / N(a)).
c controls the degree of exploration. Higher c favors exploration.
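As a usage illustration, the selection rule can be dropped into a simple bandit loop. The following is a minimal sketch under stated assumptions: the Bernoulli reward model, the arm probabilities, the incremental-mean update, and the names `select_ucb` and `run_bandit` are all hypothetical and do not come from the original problem.

```python
import numpy as np

def select_ucb(q_values, action_counts, t, c=2.0):
    # Same rule as the solution: untried arms first, then the largest UCB score.
    untried = np.flatnonzero(action_counts == 0)
    if untried.size > 0:
        return int(untried[0])
    return int(np.argmax(q_values + c * np.sqrt(np.log(t) / action_counts)))

def run_bandit(true_means, steps=1000, c=2.0, seed=0):
    # Assumed environment: each arm pays 1 with its own fixed probability.
    rng = np.random.default_rng(seed)
    k = len(true_means)
    q = np.zeros(k)                  # running value estimates Q(a)
    counts = np.zeros(k, dtype=int)  # pull counts N(a)
    for t in range(1, steps + 1):
        a = select_ucb(q, counts, t, c)
        reward = float(rng.random() < true_means[a])  # Bernoulli reward
        counts[a] += 1
        q[a] += (reward - q[a]) / counts[a]           # incremental sample mean
    return q, counts

q_est, pulls = run_bandit([0.2, 0.5, 0.8])
print(pulls)
```

The first three steps pull each arm once because of the untried-arm check. After that, the exploration bonus shrinks for arms that have been pulled often, so pulls tend to concentrate on the arms with higher estimated value.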