Epsilon-Greedy Action Selection for n-Armed Bandit

#158 · Reinforcement Learning · Medium

Problem

Implement Epsilon-Greedy Action Selection for the n-armed bandit problem. With probability epsilon, select a random action (exploration); otherwise, select the action with the highest estimated value (exploitation).

Solution

1

2

3

4

5

6

7

import numpy as np

def epsilon_greedy(q_values: np.ndarray, epsilon: float) -> int:
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    else:
        return int(np.argmax(q_values))

Explanation

Generate a uniform random number between 0 and 1.
If it falls below epsilon, choose a random action uniformly (explore).
Otherwise, choose the action with the highest Q-value estimate (exploit).
This simple strategy balances exploration and exploitation. Setting epsilon = 0 gives pure greedy; epsilon = 1 gives pure random.

Complexity

Time: O(n) where n is the number of actions (for argmax)
Space: O(1)

← #157 #159 →