This chapter covers the two foundational TD control algorithms: Q-Learning (off-policy) and SARSA (on-policy). You will learn how to store state-action values in Q-tables, implement both algorithms from scratch, and understand when to use each approach through the classic Cliff Walking problem.
Learning Objectives
By completing this chapter, you will be able to:
- Understand the Q-table concept and when tabular methods are appropriate
- Explain the Q-Learning update rule and why it is off-policy
- Explain the SARSA update rule and why it is on-policy
- Compare Q-Learning vs SARSA behavior in risk-sensitive environments
- Implement multiple exploration strategies: epsilon-greedy, Boltzmann, and UCB
- Apply both algorithms to the Cliff Walking problem
- Understand Expected SARSA as a variance-reduction technique
2.1 Introduction to Tabular Methods
The Q-Table Concept
In reinforcement learning with discrete state and action spaces, we can store the value of every state-action pair in a table. This table, called a Q-table, maps each (state, action) pair to a value representing the expected cumulative reward.
The Q-table is a 2D array where:
- Rows correspond to states $s \in S$
- Columns correspond to actions $a \in A$
- Each cell $Q(s, a)$ stores the estimated value of taking action $a$ in state $s$
When Tabular Methods Work (and When They Do Not)
| Scenario | Tabular Methods | Reason |
|---|---|---|
| Small discrete state space | Excellent | Can enumerate all states |
| Grid worlds, board games | Good | Finite, countable states |
| Continuous states (e.g., robot joints) | Not applicable | Infinite states, cannot store |
| High-dimensional states (e.g., images) | Not applicable | Exponential state explosion |
| Very large discrete spaces | Impractical | Memory and sample efficiency issues |
Rule of Thumb: If $|S| \times |A| < 10^6$, tabular methods are feasible. Beyond that, consider function approximation (Chapter 3: DQN).
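To make the rule of thumb tangible, the memory footprint of a Q-table is easy to estimate. The helper below is a quick sketch (the function name and example sizes are illustrative, assuming float64 entries at 8 bytes each):

```python
def q_table_size_mb(n_states: int, n_actions: int, bytes_per_entry: int = 8) -> float:
    """Estimate Q-table memory in megabytes (float64 = 8 bytes per entry)."""
    return n_states * n_actions * bytes_per_entry / 1e6

# A 4x4 FrozenLake grid: trivially small
print(q_table_size_mb(16, 4))        # well under 1 MB
# Near the 10^6-entry rule of thumb: still only a few megabytes
print(q_table_size_mb(250_000, 4))   # 8.0 MB
# By contrast, an 84x84 binary image has 2^(84*84) states: no table can hold that
```

Memory is rarely the binding constraint, though: sample efficiency fails first, since every one of those entries must be visited many times to be estimated accurately.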
State-Action Value Storage
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0
# - gymnasium>=0.29.0
import numpy as np
import gymnasium as gym
class QTable:
"""
Q-Table implementation for discrete state-action spaces.
This class provides a clean interface for storing and retrieving
Q-values, with support for initialization strategies.
"""
def __init__(self, n_states: int, n_actions: int, init_value: float = 0.0):
"""
Initialize Q-table.
Args:
n_states: Number of discrete states
n_actions: Number of discrete actions
init_value: Initial Q-value for all pairs (optimistic init if > 0)
"""
self.n_states = n_states
self.n_actions = n_actions
self.table = np.full((n_states, n_actions), init_value)
# Track visit counts for analysis
self.visit_counts = np.zeros((n_states, n_actions), dtype=int)
def get_value(self, state: int, action: int) -> float:
"""Get Q(s, a)"""
return self.table[state, action]
def set_value(self, state: int, action: int, value: float):
"""Set Q(s, a)"""
self.table[state, action] = value
self.visit_counts[state, action] += 1
def get_best_action(self, state: int) -> int:
"""Get argmax_a Q(s, a)"""
return int(np.argmax(self.table[state]))
def get_max_value(self, state: int) -> float:
"""Get max_a Q(s, a)"""
return float(np.max(self.table[state]))
def get_action_values(self, state: int) -> np.ndarray:
"""Get Q(s, :) for all actions"""
return self.table[state].copy()
# Demonstration with FrozenLake environment
print("=== Q-Table Demonstration ===\n")
env = gym.make('FrozenLake-v1', is_slippery=False)
n_states = env.observation_space.n # 16 states (4x4 grid)
n_actions = env.action_space.n # 4 actions (left, down, right, up)
print(f"State space size: {n_states}")
print(f"Action space size: {n_actions}")
print(f"Q-table shape: ({n_states}, {n_actions})")
print(f"Total entries: {n_states * n_actions}")
# Create Q-table with optimistic initialization
q_table = QTable(n_states, n_actions, init_value=1.0)
print(f"\nInitial Q-table (first 4 states):")
print(q_table.table[:4])
env.close()
2.2 Q-Learning Algorithm
Off-Policy TD Control
Q-Learning is an off-policy temporal difference control algorithm. It learns the optimal action-value function $Q^*$ directly, regardless of the policy being followed.
The Q-Learning Update Rule
The core update equation for Q-Learning is:
$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] $$
Where:
- $\alpha \in (0, 1]$ is the learning rate
- $\gamma \in [0, 1]$ is the discount factor
- $r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')$ is the TD target
- $\delta_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)$ is the TD error
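To make the update rule concrete, here is a single Q-Learning step worked out with toy numbers (all values below are made up for illustration):

```python
import numpy as np

alpha, gamma = 0.1, 0.9
Q = np.zeros((2, 2))
Q[0, 1] = 0.5                 # current estimate Q(s=0, a=1)
Q[1] = [0.0, 2.0]             # Q-values of the next state s'=1

reward = 1.0
td_target = reward + gamma * np.max(Q[1])   # 1.0 + 0.9 * 2.0 = 2.8
td_error = td_target - Q[0, 1]              # 2.8 - 0.5 = 2.3
Q[0, 1] += alpha * td_error                 # 0.5 + 0.1 * 2.3

print(f"{Q[0, 1]:.2f}")  # 0.73
```

Note that the max over `Q[1]` is taken regardless of which action the agent actually executes next; that is the off-policy property discussed below.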
(Figure: the Q-Learning update cycle. The agent in state s selects action a via epsilon-greedy, the environment returns reward r and next state s', and the update forms the TD target r + gamma * max_a' Q(s', a') to adjust Q(s, a).)
Why Q-Learning is Off-Policy
The key insight is the max operator in the update:
- Behavior policy: The agent follows an exploratory policy (e.g., epsilon-greedy)
- Target policy: The update uses $\max_{a'} Q(s_{t+1}, a')$, which is the greedy action
The agent can explore freely while still learning the optimal policy. This separation of behavior and target policies is what makes Q-Learning "off-policy."
Convergence Properties
Q-Learning converges to $Q^*$ under the following conditions:
- All state-action pairs are visited infinitely often
- The learning rate $\alpha_t$ satisfies: $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$
- The reward is bounded
In practice, a fixed small learning rate (e.g., $\alpha = 0.1$) works well for most tabular problems.
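A schedule such as $\alpha_t = 1/t$ satisfies both learning-rate conditions: its partial sums grow without bound (harmonic series) while the sum of squares converges to $\pi^2/6$. The snippet below is a numerical illustration of this, not a proof:

```python
import numpy as np

t = np.arange(1, 1_000_001, dtype=np.float64)
alphas = 1.0 / t

print(alphas.sum())          # ~14.39 after 10^6 terms, and still growing
print((alphas ** 2).sum())   # ~1.6449, converging toward pi^2 / 6
```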
Complete Q-Learning Implementation
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0
# - gymnasium>=0.29.0
# - matplotlib>=3.7.0
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt
from typing import Tuple, List
class QLearningAgent:
"""
Q-Learning Agent with epsilon-greedy exploration.
Implements the off-policy TD control algorithm that learns
Q* directly by bootstrapping with the max over next actions.
"""
def __init__(
self,
n_states: int,
n_actions: int,
alpha: float = 0.1,
gamma: float = 0.99,
epsilon: float = 0.1,
epsilon_decay: float = 1.0,
epsilon_min: float = 0.01
):
"""
Initialize Q-Learning agent.
Args:
n_states: Number of discrete states
n_actions: Number of discrete actions
alpha: Learning rate
gamma: Discount factor
epsilon: Initial exploration rate
epsilon_decay: Decay factor for epsilon after each episode
epsilon_min: Minimum epsilon value
"""
self.n_states = n_states
self.n_actions = n_actions
self.alpha = alpha
self.gamma = gamma
self.epsilon = epsilon
self.epsilon_decay = epsilon_decay
self.epsilon_min = epsilon_min
# Initialize Q-table with zeros
self.Q = np.zeros((n_states, n_actions))
def select_action(self, state: int) -> int:
"""
Select action using epsilon-greedy policy.
Args:
state: Current state
Returns:
Selected action
"""
if np.random.random() < self.epsilon:
# Exploration: random action
return np.random.randint(self.n_actions)
else:
# Exploitation: greedy action
return int(np.argmax(self.Q[state]))
def update(
self,
state: int,
action: int,
reward: float,
next_state: int,
done: bool
):
"""
Update Q-value using Q-Learning update rule.
Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
Args:
state: Current state
action: Action taken
reward: Reward received
next_state: Next state
done: Whether episode terminated
"""
if done:
# Terminal state has no future value
td_target = reward
else:
# Q-Learning: use max over next state's Q-values
td_target = reward + self.gamma * np.max(self.Q[next_state])
# TD error
td_error = td_target - self.Q[state, action]
# Update Q-value
self.Q[state, action] += self.alpha * td_error
def decay_epsilon(self):
"""Decay exploration rate after each episode."""
self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
def train_q_learning(
env: gym.Env,
agent: QLearningAgent,
n_episodes: int = 1000,
max_steps: int = 200
) -> Tuple[List[float], List[int]]:
"""
Train Q-Learning agent on environment.
Args:
env: Gymnasium environment
agent: Q-Learning agent
n_episodes: Number of training episodes
max_steps: Maximum steps per episode
Returns:
Tuple of (episode_rewards, episode_lengths)
"""
episode_rewards = []
episode_lengths = []
for episode in range(n_episodes):
state, _ = env.reset()
total_reward = 0
steps = 0
for step in range(max_steps):
# Select action
action = agent.select_action(state)
# Take action
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
# Update Q-value
agent.update(state, action, reward, next_state, done)
total_reward += reward
steps += 1
if done:
break
state = next_state
# Decay epsilon
agent.decay_epsilon()
episode_rewards.append(total_reward)
episode_lengths.append(steps)
# Progress logging
if (episode + 1) % 100 == 0:
avg_reward = np.mean(episode_rewards[-100:])
print(f"Episode {episode + 1}: Avg Reward = {avg_reward:.2f}, "
f"Epsilon = {agent.epsilon:.3f}")
return episode_rewards, episode_lengths
# Training demonstration on FrozenLake
print("=== Q-Learning Training on FrozenLake ===\n")
env = gym.make('FrozenLake-v1', is_slippery=False)
agent = QLearningAgent(
n_states=env.observation_space.n,
n_actions=env.action_space.n,
alpha=0.1,
gamma=0.99,
epsilon=1.0,
epsilon_decay=0.995,
epsilon_min=0.01
)
rewards, lengths = train_q_learning(env, agent, n_episodes=500)
# Display learned Q-table
print("\n=== Learned Q-Table (reshaped as 4x4 grid) ===")
print("Best action per state:")
action_names = ['Left', 'Down', 'Right', 'Up']
policy = [action_names[np.argmax(agent.Q[s])] for s in range(16)]
for i in range(4):
print(policy[i*4:(i+1)*4])
env.close()
2.3 SARSA Algorithm
On-Policy TD Control
SARSA (State-Action-Reward-State-Action) is an on-policy TD control algorithm. Unlike Q-Learning, SARSA learns the value of the policy it actually follows.
The SARSA Update Rule
The core update equation for SARSA is:
$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] $$
Notice the difference from Q-Learning:
- Q-Learning: Uses $\max_{a'} Q(s_{t+1}, a')$ (the best action)
- SARSA: Uses $Q(s_{t+1}, a_{t+1})$ (the actual next action)
(Figure: the SARSA update cycle. The agent takes A_t in S_t, observes R_{t+1} and S_{t+1}, selects A_{t+1} with the same epsilon-greedy policy, and updates Q(S_t, A_t) using Q(S_{t+1}, A_{t+1}).)
Why SARSA is On-Policy
The name SARSA comes from the quintuple used in each update: $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.
Because SARSA uses the actual next action $a_{t+1}$ (chosen by the same policy), it learns the value of the policy it follows. If the agent explores suboptimally, SARSA accounts for this in its value estimates.
Key Insight: SARSA learns a "safer" policy because it considers the possibility of exploratory (potentially bad) actions in the future.
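The contrast with Q-Learning shows up in a single update. With toy numbers (made up for illustration), suppose the next state's Q-values are [0.0, 2.0] but the epsilon-greedy policy happened to sample the worse action:

```python
import numpy as np

alpha, gamma = 0.1, 0.9
q_sa = 0.5                      # current estimate Q(s, a)
q_next = np.array([0.0, 2.0])   # Q-values at next state s'
reward = 1.0
next_action = 0                 # exploratory choice, not the greedy one

sarsa_target = reward + gamma * q_next[next_action]  # 1.0 + 0.9 * 0.0 = 1.0
qlearn_target = reward + gamma * q_next.max()        # 1.0 + 0.9 * 2.0 = 2.8

print(f"SARSA update:      {q_sa + alpha * (sarsa_target - q_sa):.2f}")   # 0.55
print(f"Q-Learning update: {q_sa + alpha * (qlearn_target - q_sa):.2f}")  # 0.73
```

SARSA's target is dragged down by the exploratory action actually taken, which is exactly why its value estimates account for exploration risk.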
Complete SARSA Implementation
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0
# - gymnasium>=0.29.0
import numpy as np
import gymnasium as gym
from typing import Tuple, List, Optional
class SARSAAgent:
"""
SARSA Agent with epsilon-greedy exploration.
Implements the on-policy TD control algorithm that learns
the value of the policy being followed.
"""
def __init__(
self,
n_states: int,
n_actions: int,
alpha: float = 0.1,
gamma: float = 0.99,
epsilon: float = 0.1,
epsilon_decay: float = 1.0,
epsilon_min: float = 0.01
):
"""
Initialize SARSA agent.
Args:
n_states: Number of discrete states
n_actions: Number of discrete actions
alpha: Learning rate
gamma: Discount factor
epsilon: Initial exploration rate
epsilon_decay: Decay factor for epsilon
epsilon_min: Minimum epsilon value
"""
self.n_states = n_states
self.n_actions = n_actions
self.alpha = alpha
self.gamma = gamma
self.epsilon = epsilon
self.epsilon_decay = epsilon_decay
self.epsilon_min = epsilon_min
self.Q = np.zeros((n_states, n_actions))
def select_action(self, state: int) -> int:
"""Select action using epsilon-greedy policy."""
if np.random.random() < self.epsilon:
return np.random.randint(self.n_actions)
else:
return int(np.argmax(self.Q[state]))
def update(
self,
state: int,
action: int,
reward: float,
next_state: int,
next_action: Optional[int],
done: bool
):
"""
Update Q-value using SARSA update rule.
Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]
Args:
state: Current state
action: Action taken
reward: Reward received
next_state: Next state
next_action: Next action (from same policy), None if terminal
done: Whether episode terminated
"""
if done:
td_target = reward
else:
# SARSA: use the actual next action's Q-value
td_target = reward + self.gamma * self.Q[next_state, next_action]
td_error = td_target - self.Q[state, action]
self.Q[state, action] += self.alpha * td_error
def decay_epsilon(self):
"""Decay exploration rate."""
self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
def train_sarsa(
env: gym.Env,
agent: SARSAAgent,
n_episodes: int = 1000,
max_steps: int = 200
) -> Tuple[List[float], List[int]]:
"""
Train SARSA agent on environment.
Note the difference from Q-Learning: we select the next action
BEFORE updating, because SARSA needs a_{t+1} for the update.
Args:
env: Gymnasium environment
agent: SARSA agent
n_episodes: Number of training episodes
max_steps: Maximum steps per episode
Returns:
Tuple of (episode_rewards, episode_lengths)
"""
episode_rewards = []
episode_lengths = []
for episode in range(n_episodes):
state, _ = env.reset()
action = agent.select_action(state) # Select first action
total_reward = 0
steps = 0
for step in range(max_steps):
# Take action
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
# Select next action (for SARSA update)
if not done:
next_action = agent.select_action(next_state)
else:
next_action = None
# Update Q-value using (s, a, r, s', a')
agent.update(state, action, reward, next_state, next_action, done)
total_reward += reward
steps += 1
if done:
break
# Move to next state and action
state = next_state
action = next_action
agent.decay_epsilon()
episode_rewards.append(total_reward)
episode_lengths.append(steps)
if (episode + 1) % 100 == 0:
avg_reward = np.mean(episode_rewards[-100:])
print(f"Episode {episode + 1}: Avg Reward = {avg_reward:.2f}, "
f"Epsilon = {agent.epsilon:.3f}")
return episode_rewards, episode_lengths
# Training demonstration
print("=== SARSA Training on FrozenLake ===\n")
env = gym.make('FrozenLake-v1', is_slippery=False)
sarsa_agent = SARSAAgent(
n_states=env.observation_space.n,
n_actions=env.action_space.n,
alpha=0.1,
gamma=0.99,
epsilon=1.0,
epsilon_decay=0.995,
epsilon_min=0.01
)
sarsa_rewards, sarsa_lengths = train_sarsa(env, sarsa_agent, n_episodes=500)
print("\n=== SARSA Learned Policy ===")
action_names = ['Left', 'Down', 'Right', 'Up']
policy = [action_names[np.argmax(sarsa_agent.Q[s])] for s in range(16)]
for i in range(4):
print(policy[i*4:(i+1)*4])
env.close()
2.4 Q-Learning vs SARSA Comparison
Key Differences
| Aspect | Q-Learning | SARSA |
|---|---|---|
| Policy Type | Off-policy | On-policy |
| Update Target | $r + \gamma \max_{a'} Q(s', a')$ | $r + \gamma Q(s', a')$ with $a' \sim \pi$ |
| Learns | Optimal policy $\pi^*$ | Policy being followed |
| Exploration Impact | Exploration does not affect target | Exploration affects value estimates |
| Risk Sensitivity | Risk-neutral (ignores exploration mistakes) | Risk-aware (accounts for mistakes) |
| Best Use Case | Simulated environments | Real-world, safety-critical systems |
| Convergence | To $Q^*$ (optimal) | To $Q^\pi$ (policy-dependent) |
On-Policy vs Off-Policy: Practical Implications
Q-Learning (Off-Policy):
- Can learn from data collected by any policy (including random or old policies)
- More sample efficient when reusing experience
- May learn risky policies that assume perfect execution
SARSA (On-Policy):
- Must learn from the policy being executed
- Accounts for the agent's own exploration mistakes
- Learns safer policies in dangerous environments
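This difference can be demonstrated on a toy chain MDP (the environment below is hypothetical, built just for this sketch): Q-Learning fed purely random actions still recovers the greedy-optimal policy, something SARSA cannot do under the same data, since it would instead converge to the value of the random policy itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-state chain: state 2 is terminal. Action 1 (right) moves forward,
# action 0 (left) stays put. Only the final transition 1 -> 2 pays reward 1.
def step(state, action):
    if action == 1:
        next_state = state + 1
        return next_state, (1.0 if next_state == 2 else 0.0), next_state == 2
    return state, 0.0, False

Q = np.zeros((3, 2))
alpha, gamma = 0.5, 0.9

for _ in range(500):
    state = 0
    for _ in range(50):
        action = rng.integers(2)          # purely random behavior policy
        next_state, reward, done = step(state, action)
        target = reward if done else reward + gamma * Q[next_state].max()
        Q[state, action] += alpha * (target - Q[state, action])
        if done:
            break
        state = next_state

print(Q.argmax(axis=1)[:2])  # greedy policy: action 1 (right) in both states
```

Despite never following the greedy policy during training, the learned Q-values approach the optimal ones (Q(1, right) ≈ 1.0, Q(0, right) ≈ 0.9): the max operator in the target decouples what is learned from how the data was collected.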
When to Choose Which
- Use Q-Learning when:
- Training in simulation where failures are cheap
- You want the theoretically optimal policy
- You plan to use experience replay (requires off-policy)
- Use SARSA when:
- Training on real hardware (robotics)
- Failures during exploration are costly
- You want a policy that is safe given exploration
2.5 Exploration Strategies Deep Dive
Effective exploration is crucial for reinforcement learning. Here we examine three popular strategies.
1. Epsilon-Greedy with Decay
The simplest exploration strategy:
$$ a_t = \begin{cases} \text{random action} & \text{with probability } \epsilon \\ \arg\max_a Q(s_t, a) & \text{with probability } 1 - \epsilon \end{cases} $$
Epsilon typically decays over time: $\epsilon_t = \epsilon_0 \cdot \text{decay}^t$
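For instance, with $\epsilon_0 = 1.0$ and a per-episode decay of 0.995 (the values used in the training runs earlier in this chapter), solving $\epsilon_0 \cdot \text{decay}^t < \text{threshold}$ for $t$ shows how long exploration stays active. A quick sketch (the helper name is made up):

```python
import math

epsilon_0, decay = 1.0, 0.995

def episodes_to_reach(threshold: float) -> int:
    """Episodes needed for epsilon_0 * decay^t to fall below a threshold."""
    return math.ceil(math.log(threshold / epsilon_0) / math.log(decay))

print(episodes_to_reach(0.1))   # 460 episodes until epsilon < 0.1
print(episodes_to_reach(0.01))  # 919 episodes until epsilon < 0.01
```

This kind of back-of-envelope check helps ensure the decay rate matches the training budget; decaying to the floor in the first 10% of episodes wastes the schedule.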
2. Boltzmann (Softmax) Exploration
Selects actions probabilistically based on their Q-values:
$$ P(a | s) = \frac{\exp(Q(s, a) / \tau)}{\sum_{a'} \exp(Q(s, a') / \tau)} $$
Where $\tau$ is the temperature parameter:
- $\tau \to 0$: Greedy (deterministic)
- $\tau \to \infty$: Uniform random
- $\tau = 1$: Standard softmax
3. Upper Confidence Bound (UCB)
Balances exploration and exploitation using uncertainty:
$$ a_t = \arg\max_a \left[ Q(s_t, a) + c \sqrt{\frac{\ln t}{N(s_t, a)}} \right] $$
Where:
- $N(s, a)$ is the number of times action $a$ was selected in state $s$
- $c$ is the exploration coefficient
- The bonus term $\sqrt{\frac{\ln t}{N(s, a)}}$ encourages trying less-visited actions
Exploration Strategy Comparison
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0
# - matplotlib>=3.7.0
import numpy as np
import matplotlib.pyplot as plt
class ExplorationStrategy:
"""Base class for exploration strategies."""
def select_action(self, q_values: np.ndarray, **kwargs) -> int:
raise NotImplementedError
class EpsilonGreedy(ExplorationStrategy):
"""Epsilon-greedy exploration."""
def __init__(self, epsilon: float = 0.1):
self.epsilon = epsilon
def select_action(self, q_values: np.ndarray, **kwargs) -> int:
if np.random.random() < self.epsilon:
return np.random.randint(len(q_values))
return int(np.argmax(q_values))
class BoltzmannExploration(ExplorationStrategy):
"""Boltzmann (Softmax) exploration."""
def __init__(self, temperature: float = 1.0):
self.temperature = temperature
def select_action(self, q_values: np.ndarray, **kwargs) -> int:
# Numerical stability: subtract max
exp_values = np.exp((q_values - np.max(q_values)) / self.temperature)
probabilities = exp_values / np.sum(exp_values)
return np.random.choice(len(q_values), p=probabilities)
class UCBExploration(ExplorationStrategy):
"""Upper Confidence Bound exploration."""
def __init__(self, c: float = 2.0):
self.c = c
def select_action(
self,
q_values: np.ndarray,
visit_counts: np.ndarray = None,
total_steps: int = 1,
**kwargs
) -> int:
if visit_counts is None:
return np.random.randint(len(q_values))
# Handle unvisited actions
if np.min(visit_counts) == 0:
return int(np.argmin(visit_counts))
# UCB formula
ucb_values = q_values + self.c * np.sqrt(
np.log(total_steps) / visit_counts
)
return int(np.argmax(ucb_values))
# Compare exploration strategies
print("=== Exploration Strategy Comparison ===\n")
# Simulated Q-values for a 4-action problem
q_values = np.array([1.0, 2.0, 1.5, 0.5]) # Action 1 is best
visit_counts = np.array([100, 50, 75, 25])
strategies = {
'Epsilon-Greedy (eps=0.1)': EpsilonGreedy(epsilon=0.1),
'Epsilon-Greedy (eps=0.3)': EpsilonGreedy(epsilon=0.3),
'Boltzmann (tau=0.5)': BoltzmannExploration(temperature=0.5),
'Boltzmann (tau=1.0)': BoltzmannExploration(temperature=1.0),
'UCB (c=2)': UCBExploration(c=2.0),
}
n_samples = 10000
print(f"Q-values: {q_values}")
print(f"Visit counts: {visit_counts}\n")
fig, axes = plt.subplots(1, len(strategies), figsize=(15, 4))
for idx, (name, strategy) in enumerate(strategies.items()):
action_counts = np.zeros(4)
for step in range(n_samples):
if isinstance(strategy, UCBExploration):
action = strategy.select_action(
q_values,
visit_counts=visit_counts,
total_steps=step + 1
)
else:
action = strategy.select_action(q_values)
action_counts[action] += 1
action_probs = action_counts / n_samples
axes[idx].bar(range(4), action_probs, color=['#e74c3c', '#27ae60', '#3498db', '#9b59b6'])
axes[idx].set_xlabel('Action')
axes[idx].set_ylabel('Selection Probability')
axes[idx].set_title(name)
axes[idx].set_xticks(range(4))
axes[idx].set_ylim(0, 1)
print(f"{name}:")
print(f" Action probabilities: {action_probs}")
plt.tight_layout()
plt.savefig('exploration_strategies.png', dpi=150, bbox_inches='tight')
print("\nSaved: exploration_strategies.png")
plt.close()
2.6 Cliff Walking Problem
Environment Description
The Cliff Walking environment is a classic problem that perfectly illustrates the difference between Q-Learning and SARSA. It is a 4x12 grid world:
- Start: Bottom-left corner (state 36)
- Goal: Bottom-right corner (state 47)
- Cliff: Bottom row between start and goal (states 37-46)
- Reward: -1 per step, -100 for falling off cliff (and reset to start)
Why Cliff Walking is Perfect for Comparison
In this environment:
- Q-Learning learns the optimal path: Walk right along the cliff edge (shortest but risky)
- SARSA learns a safer path: Walk up, across, then down (longer but safer)
The difference arises because SARSA accounts for the possibility of epsilon-greedy exploration causing the agent to accidentally step off the cliff.
Full Cliff Walking Implementation
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0
# - gymnasium>=0.29.0
# - matplotlib>=3.7.0
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt
from typing import Tuple, List
def train_and_compare_cliff_walking(
n_episodes: int = 500,
alpha: float = 0.5,
gamma: float = 1.0,
epsilon: float = 0.1
) -> Tuple[np.ndarray, np.ndarray, List[float], List[float]]:
"""
Train both Q-Learning and SARSA on Cliff Walking and compare.
Args:
n_episodes: Number of training episodes
alpha: Learning rate
gamma: Discount factor (1.0 for episodic, undiscounted)
epsilon: Exploration rate (fixed, no decay)
Returns:
Tuple of (Q_qlearning, Q_sarsa, qlearning_rewards, sarsa_rewards)
"""
# Create environments
env_q = gym.make('CliffWalking-v0')
env_s = gym.make('CliffWalking-v0')
n_states = env_q.observation_space.n # 48 states (4x12)
n_actions = env_q.action_space.n # 4 actions (up, right, down, left)
# Initialize Q-tables
Q_qlearning = np.zeros((n_states, n_actions))
Q_sarsa = np.zeros((n_states, n_actions))
qlearning_rewards = []
sarsa_rewards = []
for episode in range(n_episodes):
# === Q-Learning episode ===
state_q, _ = env_q.reset()
total_reward_q = 0
while True:
# Epsilon-greedy action selection
if np.random.random() < epsilon:
action = np.random.randint(n_actions)
else:
action = int(np.argmax(Q_qlearning[state_q]))
next_state, reward, terminated, truncated, _ = env_q.step(action)
done = terminated or truncated
# Q-Learning update: use max over next state
if done:
td_target = reward
else:
td_target = reward + gamma * np.max(Q_qlearning[next_state])
Q_qlearning[state_q, action] += alpha * (
td_target - Q_qlearning[state_q, action]
)
total_reward_q += reward
state_q = next_state
if done:
break
qlearning_rewards.append(total_reward_q)
# === SARSA episode ===
state_s, _ = env_s.reset()
# Select initial action
if np.random.random() < epsilon:
action_s = np.random.randint(n_actions)
else:
action_s = int(np.argmax(Q_sarsa[state_s]))
total_reward_s = 0
while True:
next_state, reward, terminated, truncated, _ = env_s.step(action_s)
done = terminated or truncated
# Select next action (for SARSA)
if not done:
if np.random.random() < epsilon:
next_action = np.random.randint(n_actions)
else:
next_action = int(np.argmax(Q_sarsa[next_state]))
# SARSA update: use actual next action
if done:
td_target = reward
else:
td_target = reward + gamma * Q_sarsa[next_state, next_action]
Q_sarsa[state_s, action_s] += alpha * (
td_target - Q_sarsa[state_s, action_s]
)
total_reward_s += reward
state_s = next_state
if done:
break
action_s = next_action
sarsa_rewards.append(total_reward_s)
env_q.close()
env_s.close()
return Q_qlearning, Q_sarsa, qlearning_rewards, sarsa_rewards
def visualize_cliff_walking_results(
Q_qlearning: np.ndarray,
Q_sarsa: np.ndarray,
qlearning_rewards: List[float],
sarsa_rewards: List[float]
):
"""Visualize comparison between Q-Learning and SARSA."""
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# --- Plot 1: Learning Curves ---
ax1 = axes[0, 0]
# Smooth rewards using moving average
window = 10
q_smooth = np.convolve(qlearning_rewards, np.ones(window)/window, mode='valid')
s_smooth = np.convolve(sarsa_rewards, np.ones(window)/window, mode='valid')
ax1.plot(q_smooth, label='Q-Learning', color='#e74c3c', linewidth=2)
ax1.plot(s_smooth, label='SARSA', color='#3498db', linewidth=2)
ax1.set_xlabel('Episode')
ax1.set_ylabel('Sum of Rewards (Smoothed)')
ax1.set_title('Learning Curves: Q-Learning vs SARSA')
ax1.legend()
ax1.grid(True, alpha=0.3)
# --- Plot 2: Q-Learning Policy Visualization ---
ax2 = axes[0, 1]
visualize_policy(Q_qlearning, ax2, "Q-Learning Policy (Optimal but Risky)")
# --- Plot 3: SARSA Policy Visualization ---
ax3 = axes[1, 0]
visualize_policy(Q_sarsa, ax3, "SARSA Policy (Safe Path)")
# --- Plot 4: Final Performance Comparison ---
ax4 = axes[1, 1]
# Last 100 episodes average
q_final = np.mean(qlearning_rewards[-100:])
s_final = np.mean(sarsa_rewards[-100:])
bars = ax4.bar(['Q-Learning', 'SARSA'], [q_final, s_final],
color=['#e74c3c', '#3498db'])
ax4.set_ylabel('Average Reward (Last 100 Episodes)')
ax4.set_title('Final Performance Comparison')
ax4.axhline(y=-13, color='green', linestyle='--',
label='Optimal (no exploration)', linewidth=2)
ax4.legend()
# Add value labels on bars
for bar, val in zip(bars, [q_final, s_final]):
ax4.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
f'{val:.1f}', ha='center', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.savefig('cliff_walking_comparison.png', dpi=150, bbox_inches='tight')
print("Saved: cliff_walking_comparison.png")
plt.close()
def visualize_policy(Q: np.ndarray, ax, title: str):
"""Visualize policy as arrows on the cliff walking grid."""
# Action arrows: up=0, right=1, down=2, left=3
# The y-axis is inverted below (row 0 at the top), so "up" is a negative dy
arrows = {0: (0, -0.3), 1: (0.3, 0), 2: (0, 0.3), 3: (-0.3, 0)}
# Create grid
grid = np.zeros((4, 12))
# Mark cliff
for j in range(1, 11):
grid[3, j] = -1 # Cliff
ax.imshow(grid, cmap='RdYlBu', alpha=0.3)
# Draw policy arrows
for state in range(48):
row = state // 12
col = state % 12
# Skip cliff cells and goal
if row == 3 and 1 <= col <= 10:
ax.text(col, row, 'X', ha='center', va='center',
fontsize=12, color='red', fontweight='bold')
continue
if row == 3 and col == 11:
ax.text(col, row, 'G', ha='center', va='center',
fontsize=14, color='green', fontweight='bold')
continue
if row == 3 and col == 0:
ax.text(col, row, 'S', ha='center', va='center',
fontsize=14, color='blue', fontweight='bold')
best_action = np.argmax(Q[state])
dx, dy = arrows[best_action]
ax.arrow(col, row, dx, dy, head_width=0.15, head_length=0.1,
fc='black', ec='black')
ax.set_xlim(-0.5, 11.5)
ax.set_ylim(3.5, -0.5)
ax.set_xticks(range(12))
ax.set_yticks(range(4))
ax.set_title(title)
ax.grid(True, alpha=0.3)
# Run the comparison
print("=== Cliff Walking: Q-Learning vs SARSA ===\n")
Q_q, Q_s, rewards_q, rewards_s = train_and_compare_cliff_walking(
n_episodes=500,
alpha=0.5,
gamma=1.0,
epsilon=0.1
)
print(f"Q-Learning final avg reward (last 100): {np.mean(rewards_q[-100:]):.2f}")
print(f"SARSA final avg reward (last 100): {np.mean(rewards_s[-100:]):.2f}")
visualize_cliff_walking_results(Q_q, Q_s, rewards_q, rewards_s)
print("\n=== Path Analysis ===")
print("Q-Learning learns to walk right along the cliff edge (optimal but risky)")
print("SARSA learns to go up first, then across, avoiding the cliff")
print("With epsilon=0.1, Q-Learning occasionally falls off during training,")
print("while SARSA learns a safer path that accounts for exploration mistakes.")
Understanding the Path Difference
Why does Q-Learning take the risky path?
Q-Learning uses $\max_{a'} Q(s', a')$ in its update. This assumes the agent will act optimally in the future, ignoring the possibility of exploratory actions. It learns the optimal policy as if epsilon were 0.
Why does SARSA take the safe path?
SARSA uses $Q(s', a')$ where $a'$ is the actual next action (which may be exploratory). Near the cliff, SARSA "knows" that sometimes it will randomly step toward the cliff due to epsilon-greedy exploration, so it learns to stay away.
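A small calculation makes this concrete. Suppose (hypothetically) a cliff-edge state has Q-values [-15, -12, -100, -14], where the -100 entry is the action that steps off the cliff. Under epsilon-greedy with epsilon = 0.1, the value SARSA backs up on average is noticeably worse than the greedy maximum Q-Learning uses:

```python
import numpy as np

epsilon = 0.1
q_edge = np.array([-15.0, -12.0, -100.0, -14.0])  # hypothetical cliff-edge Q-values
n_actions = len(q_edge)

greedy_value = q_edge.max()  # -12.0: what Q-Learning backs up
# Average backup under epsilon-greedy: epsilon/|A| over all actions, (1-epsilon) on greedy
expected_value = (epsilon / n_actions) * q_edge.sum() + (1 - epsilon) * q_edge.max()

print(f"{greedy_value:.3f}")    # -12.000
print(f"{expected_value:.3f}")  # -14.325
```

Because cliff-adjacent states look more than two steps worse on average, SARSA's value estimates push the greedy path away from the edge, producing the safer route.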
2.7 Expected SARSA
Taking the Expectation Over Next Actions
Expected SARSA is a variant that uses the expected value under the policy, rather than a single sampled action:
$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \sum_{a'} \pi(a'|s_{t+1}) Q(s_{t+1}, a') - Q(s_t, a_t) \right] $$
For an epsilon-greedy policy:
$$ \sum_{a'} \pi(a'|s_{t+1}) Q(s_{t+1}, a') = \frac{\epsilon}{|A|} \sum_{a'} Q(s_{t+1}, a') + (1-\epsilon) \max_{a'} Q(s_{t+1}, a') $$
Properties of Expected SARSA
- Lower variance than SARSA (uses expectation instead of sample)
- Can be on-policy or off-policy depending on the policy used in expectation
- Reduces to Q-Learning when $\epsilon = 0$ (greedy policy)
- Often outperforms both Q-Learning and SARSA in practice, due to its lower update variance
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0
import numpy as np
def expected_sarsa_update(
Q: np.ndarray,
state: int,
action: int,
reward: float,
next_state: int,
done: bool,
alpha: float,
gamma: float,
epsilon: float
) -> None:
"""
Expected SARSA update rule.
Uses the expected value over next actions under the epsilon-greedy
policy, rather than sampling a single next action.
"""
n_actions = Q.shape[1]
if done:
td_target = reward
else:
# Expected value under epsilon-greedy policy
# E[Q(s', a')] = eps/n * sum(Q(s', :)) + (1-eps) * max(Q(s', :))
greedy_action = np.argmax(Q[next_state])
expected_q = (
(epsilon / n_actions) * np.sum(Q[next_state]) +
(1 - epsilon) * Q[next_state, greedy_action]
)
td_target = reward + gamma * expected_q
td_error = td_target - Q[state, action]
Q[state, action] += alpha * td_error
# Example usage
print("=== Expected SARSA Example ===\n")
# Simple Q-table example
Q = np.array([
[1.0, 2.0, 1.5, 0.5], # State 0
[0.8, 1.2, 0.9, 1.5], # State 1
])
epsilon = 0.1
n_actions = 4
# For state 1:
# max Q = 1.5 (action 3)
# sum Q = 0.8 + 1.2 + 0.9 + 1.5 = 4.4
# Expected Q = (0.1/4) * 4.4 + 0.9 * 1.5 = 0.11 + 1.35 = 1.46
expected_q = (epsilon / n_actions) * np.sum(Q[1]) + (1 - epsilon) * np.max(Q[1])
print(f"Q-values for state 1: {Q[1]}")
print(f"Expected Q under epsilon-greedy (epsilon={epsilon}): {expected_q:.3f}")
print(f"Max Q (used by Q-Learning): {np.max(Q[1]):.3f}")
print(f"Sample Q (used by SARSA): depends on sampled action")
Summary
In this chapter, we covered the two fundamental TD control algorithms for tabular reinforcement learning:
Key Takeaways
| Concept | Key Point |
|---|---|
| Q-Table | Stores Q(s,a) for all state-action pairs in discrete spaces |
| Q-Learning | Off-policy: uses $\max_{a'} Q(s', a')$, learns optimal policy |
| SARSA | On-policy: uses $Q(s', a')$, learns policy being followed |
| Epsilon-Greedy | Simple exploration: random with prob epsilon, greedy otherwise |
| Boltzmann | Probabilistic: action selection based on Q-value magnitudes |
| UCB | Optimistic: bonus for less-visited actions |
| Cliff Walking | Q-Learning: risky optimal path; SARSA: safe conservative path |
| Expected SARSA | Lower variance by taking expectation over next actions |
Algorithm Selection Guide
| Situation | Recommended Algorithm | Reason |
|---|---|---|
| Simulation, failures are cheap | Q-Learning | Learns optimal policy efficiently |
| Real hardware, robotics | SARSA | Safer during exploration |
| Experience replay needed | Q-Learning | Off-policy requirement |
| Low variance critical | Expected SARSA | Averaging reduces variance |
Next Steps
In Chapter 3: Deep Q-Network (DQN), we will learn how to handle large and continuous state spaces using neural network function approximation. Key topics include:
- Why tabular methods fail for large state spaces
- Neural network function approximation
- Experience Replay for stable learning
- Target Networks to prevent oscillation
- DQN variants: Double DQN, Dueling DQN, Rainbow
Exercises
Exercise 1: Implement Epsilon Decay Strategies
Implement and compare three epsilon decay strategies on the CliffWalking environment:
- Linear decay: $\epsilon_t = \epsilon_0 - t \cdot \text{rate}$
- Exponential decay: $\epsilon_t = \epsilon_0 \cdot \text{decay}^t$
- Inverse decay: $\epsilon_t = \epsilon_0 / (1 + t / k)$
Plot the learning curves and final performance for each strategy.
Exercise 2: Double Q-Learning
Implement Double Q-Learning, which uses two Q-tables to reduce overestimation bias:
$$ Q_1(s, a) \leftarrow Q_1(s, a) + \alpha [r + \gamma Q_2(s', \arg\max_{a'} Q_1(s', a')) - Q_1(s, a)] $$
Compare the Q-value estimates of standard Q-Learning vs Double Q-Learning on a simple environment.
Exercise 3: n-Step SARSA
Implement n-step SARSA, which uses n-step returns instead of 1-step:
$$ G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n}) $$
Compare n=1, n=3, and n=5 on FrozenLake-v1.
Exercise 4: Adaptive Learning Rate
Implement state-action specific learning rates that decay based on visit counts:
$$ \alpha(s, a) = \frac{1}{1 + N(s, a)} $$
Compare performance with fixed learning rate on the Taxi-v3 environment.
Exercise 5: Windy Gridworld
Implement and solve the Windy Gridworld problem:
- 7x10 grid with wind pushing the agent upward in certain columns
- Compare Q-Learning, SARSA, and Expected SARSA
- Visualize the learned policies