Compute the temporal difference (TD) error for a single step in reinforcement learning, given the current state value, the reward, the next state value, the discount factor, and (optionally) whether the episode terminated.
TD error = reward + gamma * V(next_state) * (1 - done) - V(current_state), where done is 1 if the episode terminated at this step and 0 otherwise.
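def td_error(
    v_current: float,
    reward: float,
    v_next: float,
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    # TD target: drop the bootstrapped term when the episode has ended.
    target = reward + gamma * v_next * (1.0 - float(done))
    # TD error: target minus the current value estimate.
    error = target - v_current
    return round(error, 6)

The TD target is r + gamma * V(s') if the episode continues, or just r if the episode has ended.
The TD error delta is the TD target minus the current estimate V(s).
The value estimate is then updated as V(s) <- V(s) + alpha * delta.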
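
The update rule V(s) <- V(s) + alpha * delta can be sketched as a single tabular step. This is a minimal illustration, not part of the original solution: the helper name td0_update, the dict-based value table, and the step size alpha = 0.1 are assumptions made for the example, and td_error is repeated only so the block runs on its own.

```python
def td_error(v_current, reward, v_next, gamma=0.99, done=False):
    """Same formula as in the text, repeated so this sketch is self-contained."""
    target = reward + gamma * v_next * (1.0 - float(done))
    return round(target - v_current, 6)


def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.99, done=False):
    """Hypothetical helper: apply V(s) <- V(s) + alpha * delta in place.

    `values` is assumed to be a dict mapping each state to its value estimate.
    Returns the TD error delta used for the update.
    """
    delta = td_error(values[state], reward, values[next_state], gamma, done)
    values[state] += alpha * delta
    return delta


# Illustrative numbers only: two states with made-up value estimates.
values = {"A": 0.5, "B": 1.0}
delta = td0_update(values, "A", reward=1.0, next_state="B", alpha=0.1, gamma=0.9)
print(delta, values["A"])
```

With these numbers the target is 1.0 + 0.9 * 1.0 = 1.9, so delta is about 1.4 and V(A) moves from 0.5 toward the target by one tenth of that error. Passing done=True would reduce the target to the reward alone.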