#272 · Reinforcement Learning · Medium
⊣ Solve on deep-ml.comImplement First-Visit Monte Carlo Prediction to estimate the state-value function. Given episodes of (state, reward) pairs, compute the average return for the first visit to each state across all episodes.
For each episode, identify the first visit to each state, compute the return from that point forward, and maintain running averages.
def first_visit_mc_prediction(
episodes: list[list[tuple]],
gamma: float = 0.99,
) -> dict[str, float]:
returns_sum: dict[str, float] = {}
returns_count: dict[str, int] = {}
for episode in episodes:
# episode is a list of (state, reward) tuples
visited: set[str] = set()
T = len(episode)
# Pre-compute returns from each timestep
G = [0.0] * T
G[T - 1] = episode[T - 1][1]
for t in range(T - 2, -1, -1):
G[t] = episode[t][1] + gamma * G[t + 1]
for t in range(T):
state = str(episode[t][0])
if state not in visited:
visited.add(state)
returns_sum[state] = returns_sum.get(state, 0.0) + G[t]
returns_count[state] = returns_count.get(state, 0) + 1
value_function: dict[str, float] = {}
for state in returns_sum:
value_function[state] = round(returns_sum[state] / returns_count[state], 6)
return value_functionG_t = r_t + gamma * G_{t+1} backwards from the terminal state.V(s) = average of all first-visit returns for state s.