Evaluate Expected Value in a Markov Decision Process

#166 · Reinforcement Learning · Medium

Problem

Evaluate the Expected Value of a state in a Markov Decision Process (MDP) given a fixed policy. Compute V^pi(s) = sum_a pi(a|s) * sum_s' T(s,a,s') * [R(s,a,s') + gamma * V^pi(s')] using iterative policy evaluation.

Solution

import numpy as np

def policy_evaluation(states: int, actions: int, policy: np.ndarray,
                      transitions: np.ndarray, rewards: np.ndarray,
                      gamma: float, theta: float = 1e-6,
                      max_iter: int = 1000) -> np.ndarray:
    V = np.zeros(states)
    for _ in range(max_iter):
        V_new = np.zeros(states)
        for s in range(states):
            v = 0.0
            for a in range(actions):
                if policy[s, a] == 0:
                    continue
                q = 0.0
                for s_next in range(states):
                    q += transitions[s, a, s_next] * (
                        rewards[s, a, s_next] + gamma * V[s_next]
                    )
                v += policy[s, a] * q
            V_new[s] = v
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    return V

Explanation

Initialize V(s) = 0 for all states.
For each state, compute the expected value under the policy by weighting each action's Q-value by the policy probability pi(a|s).
Each Q-value is computed as the expected immediate reward plus discounted future value, summed over all possible next states weighted by transition probabilities.
Repeat until convergence (max change below theta).

Complexity

Time: O(iterations |S|^2 |A|)
Space: O(|S|) for the value function

← #165 #167 →