Implement the Bellman Equation for Value Iteration

#157 · Reinforcement Learning · Medium

Problem

Implement the Bellman Equation for Value Iteration in a Markov Decision Process (MDP). Given states, actions, transition probabilities, rewards, and a discount factor, iteratively compute the optimal value function.

Solution

import numpy as np

def value_iteration(states: int, actions: int, transitions: np.ndarray,
                    rewards: np.ndarray, gamma: float,
                    theta: float = 1e-6, max_iter: int = 1000) -> np.ndarray:
    V = np.zeros(states)
    for _ in range(max_iter):
        V_new = np.zeros(states)
        for s in range(states):
            q_values = []
            for a in range(actions):
                q = 0.0
                for s_next in range(states):
                    q += transitions[s, a, s_next] * (rewards[s, a, s_next] + gamma * V[s_next])
                q_values.append(q)
            V_new[s] = max(q_values)
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    return V

Explanation

Initialize the value function V(s) = 0 for all states.
For each state, compute the action-value Q(s, a) for every action using the Bellman equation: Q(s,a) = sum over s' of T(s,a,s') * [R(s,a,s') + gamma * V(s')].
Set V(s) to the maximum Q-value across all actions (greedy selection).
Repeat until the maximum change in V is below threshold theta.

Complexity

Time: O(iterations |S|^2 |A|) per iteration over all state-action-next-state triples
Space: O(|S|) for the value function

← #156 #158 →