← back

Epsilon-Greedy Action Selection for n-Armed Bandit

#158 · Reinforcement Learning · Medium

⊣ Solve on deep-ml.com

Problem

Implement Epsilon-Greedy Action Selection for the n-armed bandit problem. With probability epsilon, select a random action (exploration); otherwise, select the action with the highest estimated value (exploitation).

Solution

1
2
3
4
5
6
7
import numpy as np

def epsilon_greedy(q_values: np.ndarray, epsilon: float) -> int:
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    else:
        return int(np.argmax(q_values))

Explanation

  1. Generate a uniform random number between 0 and 1.
  2. If it falls below epsilon, choose a random action uniformly (explore).
  3. Otherwise, choose the action with the highest Q-value estimate (exploit).
  4. This simple strategy balances exploration and exploitation. Setting epsilon = 0 gives pure greedy; epsilon = 1 gives pure random.

Complexity

  • Time: O(n) where n is the number of actions (for argmax)
  • Space: O(1)