Last updated: 2026-01

Chapter 2: Q-Learning and SARSA

Mastering Tabular TD Control Methods for Discrete State-Action Spaces

Reading Time: 25-30 minutes | Difficulty: Intermediate | Code Examples: 8 | Exercises: 5

This chapter covers the two foundational TD control algorithms: Q-Learning (off-policy) and SARSA (on-policy). You will learn how to store state-action values in Q-tables, implement both algorithms from scratch, and understand when to use each approach through the classic Cliff Walking problem.

Learning Objectives

By completing this chapter, you will be able to:

  - Explain how a Q-table stores the value of every state-action pair
  - Implement Q-Learning and SARSA from scratch with epsilon-greedy exploration
  - Distinguish on-policy from off-policy TD control and explain why the difference matters
  - Compare exploration strategies (epsilon-greedy, Boltzmann, UCB)
  - Use the Cliff Walking problem to decide when each algorithm is appropriate


2.1 Introduction to Tabular Methods

The Q-Table Concept

In reinforcement learning with discrete state and action spaces, we can store the value of every state-action pair in a table. This table, called a Q-table, maps each (state, action) pair to a value representing the expected cumulative reward.

The Q-table is a 2D array where:

  - Rows correspond to states $s$
  - Columns correspond to actions $a$
  - Each entry holds $Q(s, a)$, the estimated expected cumulative reward for taking action $a$ in state $s$

graph TD
    subgraph Q-Table
        direction TB
        H["      | a0 | a1 | a2 | a3 |"]
        R0["s0 | 0.5| 0.8| 0.2| 0.1|"]
        R1["s1 | 0.3| 0.9| 0.4| 0.2|"]
        R2["s2 | 0.7| 0.1| 0.6| 0.3|"]
        R3["...| ...| ...| ...| ...|"]
    end
    Agent["Agent"] --> Lookup["Q(s, a) Lookup"]
    Lookup --> Action["Select Action"]
    Action --> Env["Environment"]
    Env --> Update["Update Q(s, a)"]
    Update --> Agent
    style Agent fill:#e3f2fd
    style Env fill:#fff3e0
    style Lookup fill:#e8f5e9
    style Update fill:#ffab91

When Tabular Methods Work (and When They Do Not)

| Scenario | Tabular Methods | Reason |
|---|---|---|
| Small discrete state space | Excellent | Can enumerate all states |
| Grid worlds, board games | Good | Finite, countable states |
| Continuous states (e.g., robot joints) | Not applicable | Infinite states, cannot store |
| High-dimensional states (e.g., images) | Not applicable | Exponential state explosion |
| Very large discrete spaces | Impractical | Memory and sample efficiency issues |

Rule of Thumb: If $|S| \times |A| < 10^6$, tabular methods are feasible. Beyond that, consider function approximation (Chapter 3: DQN).
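
The rule of thumb above can be turned into a quick sanity check. The following sketch (the $10^6$ threshold and one-float64-per-entry storage are assumptions taken from this chapter, not hard limits) estimates whether a problem is tabular-friendly:

```python
import numpy as np

def tabular_feasibility(n_states: int, n_actions: int, threshold: int = 10**6) -> dict:
    """Estimate whether a tabular Q-table is practical for a given problem size."""
    n_entries = n_states * n_actions
    # Assume one float64 (8 bytes) per Q-value entry
    memory_mb = n_entries * np.dtype(np.float64).itemsize / 1e6
    return {
        "entries": n_entries,
        "memory_mb": memory_mb,
        "feasible": n_entries < threshold,
    }

# FrozenLake 4x4: 16 states x 4 actions -> trivially feasible
print(tabular_feasibility(16, 4))

# A hypothetical 1000x1000 grid with 4 actions: 4 million entries,
# beyond the rule of thumb even though memory alone would still fit
print(tabular_feasibility(1_000_000, 4))
```

Note that memory is rarely the first bottleneck; sample efficiency usually gives out earlier, since every entry must be visited many times to converge.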

State-Action Value Storage

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0
# - gymnasium>=0.29.0

import numpy as np
import gymnasium as gym

class QTable:
    """
    Q-Table implementation for discrete state-action spaces.

    This class provides a clean interface for storing and retrieving
    Q-values, with support for initialization strategies.
    """

    def __init__(self, n_states: int, n_actions: int, init_value: float = 0.0):
        """
        Initialize Q-table.

        Args:
            n_states: Number of discrete states
            n_actions: Number of discrete actions
            init_value: Initial Q-value for all pairs (optimistic init if > 0)
        """
        self.n_states = n_states
        self.n_actions = n_actions
        self.table = np.full((n_states, n_actions), init_value)

        # Track visit counts for analysis
        self.visit_counts = np.zeros((n_states, n_actions), dtype=int)

    def get_value(self, state: int, action: int) -> float:
        """Get Q(s, a)"""
        return self.table[state, action]

    def set_value(self, state: int, action: int, value: float):
        """Set Q(s, a)"""
        self.table[state, action] = value
        self.visit_counts[state, action] += 1

    def get_best_action(self, state: int) -> int:
        """Get argmax_a Q(s, a)"""
        return int(np.argmax(self.table[state]))

    def get_max_value(self, state: int) -> float:
        """Get max_a Q(s, a)"""
        return float(np.max(self.table[state]))

    def get_action_values(self, state: int) -> np.ndarray:
        """Get Q(s, :) for all actions"""
        return self.table[state].copy()


# Demonstration with FrozenLake environment
print("=== Q-Table Demonstration ===\n")

env = gym.make('FrozenLake-v1', is_slippery=False)

n_states = env.observation_space.n  # 16 states (4x4 grid)
n_actions = env.action_space.n      # 4 actions (left, down, right, up)

print(f"State space size: {n_states}")
print(f"Action space size: {n_actions}")
print(f"Q-table shape: ({n_states}, {n_actions})")
print(f"Total entries: {n_states * n_actions}")

# Create Q-table with optimistic initialization
q_table = QTable(n_states, n_actions, init_value=1.0)

print(f"\nInitial Q-table (first 4 states):")
print(q_table.table[:4])

env.close()

2.2 Q-Learning Algorithm

Off-Policy TD Control

Q-Learning is an off-policy temporal difference control algorithm. It learns the optimal action-value function $Q^*$ directly, regardless of the policy being followed.

The Q-Learning Update Rule

The core update equation for Q-Learning is:

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] $$

Where:

  - $\alpha$ is the learning rate (step size)
  - $\gamma$ is the discount factor
  - $r_{t+1}$ is the reward received after taking $a_t$ in $s_t$
  - $\max_{a'} Q(s_{t+1}, a')$ is the highest Q-value available in the next state
  - The bracketed term is the TD error

graph LR
    S["State s"] --> A["Action a (epsilon-greedy)"]
    A --> E["Environment"]
    E --> R["Reward r"]
    E --> Sp["Next State s'"]
    Sp --> Max["max Q(s', a')"]
    Max --> Target["TD Target: r + gamma * max Q"]
    Target --> Update["Update Q(s,a)"]
    style S fill:#b3e5fc
    style A fill:#c5e1a5
    style R fill:#fff9c4
    style Max fill:#ffab91
    style Update fill:#f8bbd9

Why Q-Learning is Off-Policy

The key insight is the max operator in the update:

  - The behavior policy (epsilon-greedy) generates the experience, including exploratory actions.
  - The target policy (greedy, implied by the max) is the one whose value function is actually learned.

The agent can explore freely while still learning the optimal policy. This separation of behavior and target policies is what makes Q-Learning "off-policy."

Convergence Properties

Q-Learning converges to $Q^*$ under the following conditions:

  1. All state-action pairs are visited infinitely often
  2. The learning rate $\alpha_t$ satisfies: $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$
  3. The reward is bounded

In practice, a fixed small learning rate (e.g., $\alpha = 0.1$) works well for most tabular problems.
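
Condition 2 is the classic Robbins-Monro step-size requirement. A minimal numeric sketch (the per-visit $\alpha_t = 1/N(s,a)$ schedule is an illustrative choice, not the chapter's recommendation) shows why a decaying schedule qualifies while a constant one technically does not:

```python
import numpy as np

# Per-visit schedule alpha_t = 1/N(s,a): the partial sums below illustrate
# that sum(alpha_t) diverges (condition met) while sum(alpha_t^2) converges.
n = np.arange(1, 100_001, dtype=np.float64)
alphas = 1.0 / n

print(f"sum of alpha_t over 1e5 visits:   {alphas.sum():.2f} (keeps growing, ~ln N)")
print(f"sum of alpha_t^2 over 1e5 visits: {(alphas ** 2).sum():.4f} (bounded by pi^2/6)")

# A constant alpha = 0.1 satisfies the first condition but not the second:
# sum(0.1^2) over infinitely many steps also diverges, so convergence is
# not guaranteed in theory -- yet it works well in practice, as noted above.
```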

Complete Q-Learning Implementation

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0
# - gymnasium>=0.29.0
# - matplotlib>=3.7.0

import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt
from typing import Tuple, List

class QLearningAgent:
    """
    Q-Learning Agent with epsilon-greedy exploration.

    Implements the off-policy TD control algorithm that learns
    Q* directly by bootstrapping with the max over next actions.
    """

    def __init__(
        self,
        n_states: int,
        n_actions: int,
        alpha: float = 0.1,
        gamma: float = 0.99,
        epsilon: float = 0.1,
        epsilon_decay: float = 1.0,
        epsilon_min: float = 0.01
    ):
        """
        Initialize Q-Learning agent.

        Args:
            n_states: Number of discrete states
            n_actions: Number of discrete actions
            alpha: Learning rate
            gamma: Discount factor
            epsilon: Initial exploration rate
            epsilon_decay: Decay factor for epsilon after each episode
            epsilon_min: Minimum epsilon value
        """
        self.n_states = n_states
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min

        # Initialize Q-table with zeros
        self.Q = np.zeros((n_states, n_actions))

    def select_action(self, state: int) -> int:
        """
        Select action using epsilon-greedy policy.

        Args:
            state: Current state

        Returns:
            Selected action
        """
        if np.random.random() < self.epsilon:
            # Exploration: random action
            return np.random.randint(self.n_actions)
        else:
            # Exploitation: greedy action
            return int(np.argmax(self.Q[state]))

    def update(
        self,
        state: int,
        action: int,
        reward: float,
        next_state: int,
        done: bool
    ):
        """
        Update Q-value using Q-Learning update rule.

        Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]

        Args:
            state: Current state
            action: Action taken
            reward: Reward received
            next_state: Next state
            done: Whether episode terminated
        """
        if done:
            # Terminal state has no future value
            td_target = reward
        else:
            # Q-Learning: use max over next state's Q-values
            td_target = reward + self.gamma * np.max(self.Q[next_state])

        # TD error
        td_error = td_target - self.Q[state, action]

        # Update Q-value
        self.Q[state, action] += self.alpha * td_error

    def decay_epsilon(self):
        """Decay exploration rate after each episode."""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)


def train_q_learning(
    env: gym.Env,
    agent: QLearningAgent,
    n_episodes: int = 1000,
    max_steps: int = 200
) -> Tuple[List[float], List[int]]:
    """
    Train Q-Learning agent on environment.

    Args:
        env: Gymnasium environment
        agent: Q-Learning agent
        n_episodes: Number of training episodes
        max_steps: Maximum steps per episode

    Returns:
        Tuple of (episode_rewards, episode_lengths)
    """
    episode_rewards = []
    episode_lengths = []

    for episode in range(n_episodes):
        state, _ = env.reset()
        total_reward = 0
        steps = 0

        for step in range(max_steps):
            # Select action
            action = agent.select_action(state)

            # Take action
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Update Q-value
            agent.update(state, action, reward, next_state, done)

            total_reward += reward
            steps += 1

            if done:
                break

            state = next_state

        # Decay epsilon
        agent.decay_epsilon()

        episode_rewards.append(total_reward)
        episode_lengths.append(steps)

        # Progress logging
        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(episode_rewards[-100:])
            print(f"Episode {episode + 1}: Avg Reward = {avg_reward:.2f}, "
                  f"Epsilon = {agent.epsilon:.3f}")

    return episode_rewards, episode_lengths


# Training demonstration on FrozenLake
print("=== Q-Learning Training on FrozenLake ===\n")

env = gym.make('FrozenLake-v1', is_slippery=False)

agent = QLearningAgent(
    n_states=env.observation_space.n,
    n_actions=env.action_space.n,
    alpha=0.1,
    gamma=0.99,
    epsilon=1.0,
    epsilon_decay=0.995,
    epsilon_min=0.01
)

rewards, lengths = train_q_learning(env, agent, n_episodes=500)

# Display learned Q-table
print("\n=== Learned Q-Table (reshaped as 4x4 grid) ===")
print("Best action per state:")
action_names = ['Left', 'Down', 'Right', 'Up']
policy = [action_names[np.argmax(agent.Q[s])] for s in range(16)]
for i in range(4):
    print(policy[i*4:(i+1)*4])

env.close()

2.3 SARSA Algorithm

On-Policy TD Control

SARSA (State-Action-Reward-State-Action) is an on-policy TD control algorithm. Unlike Q-Learning, SARSA learns the value of the policy it actually follows.

The SARSA Update Rule

The core update equation for SARSA is:

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] $$

Notice the difference from Q-Learning:

  - Q-Learning bootstraps with $\max_{a'} Q(s_{t+1}, a')$, the best possible next action.
  - SARSA bootstraps with $Q(s_{t+1}, a_{t+1})$, the next action actually chosen by the same policy.

graph LR
    S1["S_t"] --> A1["A_t"]
    A1 --> R["R_t+1"]
    R --> S2["S_t+1"]
    S2 --> A2["A_t+1 (same policy)"]
    A2 --> Update["Update Q(S_t, A_t) using Q(S_t+1, A_t+1)"]
    style S1 fill:#b3e5fc
    style A1 fill:#c5e1a5
    style R fill:#fff9c4
    style S2 fill:#b3e5fc
    style A2 fill:#c5e1a5
    style Update fill:#ffab91

Why SARSA is On-Policy

The name SARSA comes from the quintuple used in each update: $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.

Because SARSA uses the actual next action $a_{t+1}$ (chosen by the same policy), it learns the value of the policy it follows. If the agent explores suboptimally, SARSA accounts for this in its value estimates.

Key Insight: SARSA learns a "safer" policy because it considers the possibility of exploratory (potentially bad) actions in the future.

Complete SARSA Implementation

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0
# - gymnasium>=0.29.0

import numpy as np
import gymnasium as gym
from typing import Tuple, List, Optional

class SARSAAgent:
    """
    SARSA Agent with epsilon-greedy exploration.

    Implements the on-policy TD control algorithm that learns
    the value of the policy being followed.
    """

    def __init__(
        self,
        n_states: int,
        n_actions: int,
        alpha: float = 0.1,
        gamma: float = 0.99,
        epsilon: float = 0.1,
        epsilon_decay: float = 1.0,
        epsilon_min: float = 0.01
    ):
        """
        Initialize SARSA agent.

        Args:
            n_states: Number of discrete states
            n_actions: Number of discrete actions
            alpha: Learning rate
            gamma: Discount factor
            epsilon: Initial exploration rate
            epsilon_decay: Decay factor for epsilon
            epsilon_min: Minimum epsilon value
        """
        self.n_states = n_states
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min

        self.Q = np.zeros((n_states, n_actions))

    def select_action(self, state: int) -> int:
        """Select action using epsilon-greedy policy."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        else:
            return int(np.argmax(self.Q[state]))

    def update(
        self,
        state: int,
        action: int,
        reward: float,
        next_state: int,
        next_action: Optional[int],
        done: bool
    ):
        """
        Update Q-value using SARSA update rule.

        Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]

        Args:
            state: Current state
            action: Action taken
            reward: Reward received
            next_state: Next state
            next_action: Next action (from same policy), None if terminal
            done: Whether episode terminated
        """
        if done:
            td_target = reward
        else:
            # SARSA: use the actual next action's Q-value
            td_target = reward + self.gamma * self.Q[next_state, next_action]

        td_error = td_target - self.Q[state, action]
        self.Q[state, action] += self.alpha * td_error

    def decay_epsilon(self):
        """Decay exploration rate."""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)


def train_sarsa(
    env: gym.Env,
    agent: SARSAAgent,
    n_episodes: int = 1000,
    max_steps: int = 200
) -> Tuple[List[float], List[int]]:
    """
    Train SARSA agent on environment.

    Note the difference from Q-Learning: we select the next action
    BEFORE updating, because SARSA needs a_{t+1} for the update.

    Args:
        env: Gymnasium environment
        agent: SARSA agent
        n_episodes: Number of training episodes
        max_steps: Maximum steps per episode

    Returns:
        Tuple of (episode_rewards, episode_lengths)
    """
    episode_rewards = []
    episode_lengths = []

    for episode in range(n_episodes):
        state, _ = env.reset()
        action = agent.select_action(state)  # Select first action

        total_reward = 0
        steps = 0

        for step in range(max_steps):
            # Take action
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Select next action (for SARSA update)
            if not done:
                next_action = agent.select_action(next_state)
            else:
                next_action = None

            # Update Q-value using (s, a, r, s', a')
            agent.update(state, action, reward, next_state, next_action, done)

            total_reward += reward
            steps += 1

            if done:
                break

            # Move to next state and action
            state = next_state
            action = next_action

        agent.decay_epsilon()

        episode_rewards.append(total_reward)
        episode_lengths.append(steps)

        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(episode_rewards[-100:])
            print(f"Episode {episode + 1}: Avg Reward = {avg_reward:.2f}, "
                  f"Epsilon = {agent.epsilon:.3f}")

    return episode_rewards, episode_lengths


# Training demonstration
print("=== SARSA Training on FrozenLake ===\n")

env = gym.make('FrozenLake-v1', is_slippery=False)

sarsa_agent = SARSAAgent(
    n_states=env.observation_space.n,
    n_actions=env.action_space.n,
    alpha=0.1,
    gamma=0.99,
    epsilon=1.0,
    epsilon_decay=0.995,
    epsilon_min=0.01
)

sarsa_rewards, sarsa_lengths = train_sarsa(env, sarsa_agent, n_episodes=500)

print("\n=== SARSA Learned Policy ===")
action_names = ['Left', 'Down', 'Right', 'Up']
policy = [action_names[np.argmax(sarsa_agent.Q[s])] for s in range(16)]
for i in range(4):
    print(policy[i*4:(i+1)*4])

env.close()

2.4 Q-Learning vs SARSA Comparison

Key Differences

| Aspect | Q-Learning | SARSA |
|---|---|---|
| Policy Type | Off-policy | On-policy |
| Update Target | $r + \gamma \max_{a'} Q(s', a')$ | $r + \gamma Q(s', a')$ |
| Learns | Optimal policy $\pi^*$ | Policy being followed |
| Exploration Impact | Exploration does not affect target | Exploration affects value estimates |
| Risk Sensitivity | Risk-neutral (ignores exploration mistakes) | Risk-aware (accounts for mistakes) |
| Best Use Case | Simulated environments | Real-world, safety-critical systems |
| Convergence | To $Q^*$ (optimal) | To $Q^\pi$ (policy-dependent) |
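
The difference in update targets can be seen directly on a single transition. In this sketch the Q-values for the next state are made-up numbers, purely for illustration:

```python
import numpy as np

# Hypothetical Q-values for the next state s' (illustrative numbers only)
q_next = np.array([0.2, 1.0, -0.5, 0.4])
reward, gamma = -1.0, 0.99

# Q-Learning target: bootstrap with the best next action, whatever is taken
q_learning_target = reward + gamma * np.max(q_next)   # -1 + 0.99 * 1.0

# SARSA target: bootstrap with the action the policy actually sampled,
# which may be exploratory (suppose epsilon-greedy happened to pick action 2)
sampled_next_action = 2
sarsa_target = reward + gamma * q_next[sampled_next_action]  # -1 + 0.99 * (-0.5)

print(f"Q-Learning target: {q_learning_target:.3f}")
print(f"SARSA target:      {sarsa_target:.3f}")
```

On the same transition Q-Learning's target is unaffected by the exploratory choice, while SARSA's target drops to reflect it; averaged over many updates, this is exactly what makes SARSA risk-aware.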

On-Policy vs Off-Policy: Practical Implications

Q-Learning (Off-Policy):

  - Learns $Q^*$ regardless of how aggressively the agent explores
  - Can learn from experience generated by a different policy (e.g., replayed or demonstrated data)
  - May behave riskily during training, since exploration mistakes never enter the target

SARSA (On-Policy):

  - Value estimates reflect the policy actually being executed, exploration included
  - Tends toward more conservative behavior near costly states
  - Approaches the optimal policy only if exploration is gradually reduced (e.g., decaying epsilon)

When to Choose Which

  - Choose Q-Learning when training in simulation, where exploration mistakes are free and the goal is the optimal policy.
  - Choose SARSA when the agent learns while acting in the real world and exploration mistakes carry real costs.


2.5 Exploration Strategies Deep Dive

Effective exploration is crucial for reinforcement learning. Here we examine three popular strategies.

1. Epsilon-Greedy with Decay

The simplest exploration strategy:

$$ a_t = \begin{cases} \text{random action} & \text{with probability } \epsilon \\ \arg\max_a Q(s_t, a) & \text{with probability } 1 - \epsilon \end{cases} $$

Epsilon typically decays over time: $\epsilon_t = \epsilon_0 \cdot \text{decay}^t$
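
A quick sketch of how this schedule behaves over training. The starting value, decay factor, and floor below are illustrative, matching the hyperparameters used in the training runs earlier in the chapter:

```python
epsilon_0, decay, epsilon_min = 1.0, 0.995, 0.01

def epsilon_at(episode: int) -> float:
    """Exponentially decayed epsilon, clipped at a minimum floor."""
    return max(epsilon_min, epsilon_0 * decay ** episode)

# Exploration fades from fully random toward (almost) fully greedy
for ep in [0, 100, 500, 1000, 2000]:
    print(f"episode {ep:4d}: epsilon = {epsilon_at(ep):.4f}")
```

The floor `epsilon_min` keeps a small amount of exploration forever, which helps satisfy the "visit every pair infinitely often" convergence condition.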

2. Boltzmann (Softmax) Exploration

Selects actions probabilistically based on their Q-values:

$$ P(a | s) = \frac{\exp(Q(s, a) / \tau)}{\sum_{a'} \exp(Q(s, a') / \tau)} $$

Where $\tau$ is the temperature parameter:

  - High $\tau$: probabilities approach uniform over actions (more exploration)
  - Low $\tau$: probability mass concentrates on the highest-valued action (more exploitation)
  - As $\tau \to 0$, softmax selection approaches greedy selection

3. Upper Confidence Bound (UCB)

Balances exploration and exploitation using uncertainty:

$$ a_t = \arg\max_a \left[ Q(s_t, a) + c \sqrt{\frac{\ln t}{N(s_t, a)}} \right] $$

Where:

  - $N(s_t, a)$ is the number of times action $a$ has been tried in state $s_t$
  - $t$ is the total number of steps taken so far
  - $c$ controls the strength of the exploration bonus; rarely tried actions receive a larger bonus

Exploration Strategy Comparison

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0
# - matplotlib>=3.7.0

import numpy as np
import matplotlib.pyplot as plt

class ExplorationStrategy:
    """Base class for exploration strategies."""

    def select_action(self, q_values: np.ndarray, **kwargs) -> int:
        raise NotImplementedError


class EpsilonGreedy(ExplorationStrategy):
    """Epsilon-greedy exploration."""

    def __init__(self, epsilon: float = 0.1):
        self.epsilon = epsilon

    def select_action(self, q_values: np.ndarray, **kwargs) -> int:
        if np.random.random() < self.epsilon:
            return np.random.randint(len(q_values))
        return int(np.argmax(q_values))


class BoltzmannExploration(ExplorationStrategy):
    """Boltzmann (Softmax) exploration."""

    def __init__(self, temperature: float = 1.0):
        self.temperature = temperature

    def select_action(self, q_values: np.ndarray, **kwargs) -> int:
        # Numerical stability: subtract max
        exp_values = np.exp((q_values - np.max(q_values)) / self.temperature)
        probabilities = exp_values / np.sum(exp_values)
        return np.random.choice(len(q_values), p=probabilities)


class UCBExploration(ExplorationStrategy):
    """Upper Confidence Bound exploration."""

    def __init__(self, c: float = 2.0):
        self.c = c

    def select_action(
        self,
        q_values: np.ndarray,
        visit_counts: np.ndarray = None,
        total_steps: int = 1,
        **kwargs
    ) -> int:
        if visit_counts is None:
            return np.random.randint(len(q_values))

        # Handle unvisited actions
        if np.min(visit_counts) == 0:
            return int(np.argmin(visit_counts))

        # UCB formula
        ucb_values = q_values + self.c * np.sqrt(
            np.log(total_steps) / visit_counts
        )
        return int(np.argmax(ucb_values))


# Compare exploration strategies
print("=== Exploration Strategy Comparison ===\n")

# Simulated Q-values for a 4-action problem
q_values = np.array([1.0, 2.0, 1.5, 0.5])  # Action 1 is best
visit_counts = np.array([100, 50, 75, 25])

strategies = {
    'Epsilon-Greedy (eps=0.1)': EpsilonGreedy(epsilon=0.1),
    'Epsilon-Greedy (eps=0.3)': EpsilonGreedy(epsilon=0.3),
    'Boltzmann (tau=0.5)': BoltzmannExploration(temperature=0.5),
    'Boltzmann (tau=1.0)': BoltzmannExploration(temperature=1.0),
    'UCB (c=2)': UCBExploration(c=2.0),
}

n_samples = 10000

print(f"Q-values: {q_values}")
print(f"Visit counts: {visit_counts}\n")

fig, axes = plt.subplots(1, len(strategies), figsize=(15, 4))

for idx, (name, strategy) in enumerate(strategies.items()):
    action_counts = np.zeros(4)

    for step in range(n_samples):
        if isinstance(strategy, UCBExploration):
            action = strategy.select_action(
                q_values,
                visit_counts=visit_counts,
                total_steps=step + 1
            )
        else:
            action = strategy.select_action(q_values)
        action_counts[action] += 1

    action_probs = action_counts / n_samples

    axes[idx].bar(range(4), action_probs, color=['#e74c3c', '#27ae60', '#3498db', '#9b59b6'])
    axes[idx].set_xlabel('Action')
    axes[idx].set_ylabel('Selection Probability')
    axes[idx].set_title(name)
    axes[idx].set_xticks(range(4))
    axes[idx].set_ylim(0, 1)

    print(f"{name}:")
    print(f"  Action probabilities: {action_probs}")

plt.tight_layout()
plt.savefig('exploration_strategies.png', dpi=150, bbox_inches='tight')
print("\nSaved: exploration_strategies.png")
plt.close()

2.6 Cliff Walking Problem

Environment Description

The Cliff Walking environment is a classic problem that perfectly illustrates the difference between Q-Learning and SARSA. It is a 4x12 grid world:

graph TD
    subgraph CliffWalking["Cliff Walking Environment (4x12 grid)"]
        R0["Row 0: Safe path (upper)"]
        R1["Row 1: Safe path"]
        R2["Row 2: Safe path"]
        R3["Row 3: S | CLIFF CLIFF CLIFF CLIFF CLIFF CLIFF CLIFF CLIFF CLIFF CLIFF | G"]
    end
    subgraph Legend
        Start["S = Start (bottom-left)"]
        Goal["G = Goal (bottom-right)"]
        Cliff["CLIFF = -100 reward, reset to start"]
        Safe["Other cells = -1 per step"]
    end

Why Cliff Walking is Perfect for Comparison

In this environment:

  - The optimal path (return of -13) runs directly along the cliff edge
  - A safer path detours through the upper rows at the cost of a few extra -1 steps
  - Q-Learning converges to the risky optimal path; SARSA converges to the safer detour

The difference arises because SARSA accounts for the possibility of epsilon-greedy exploration causing the agent to accidentally step off the cliff.

Full Cliff Walking Implementation

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0
# - gymnasium>=0.29.0
# - matplotlib>=3.7.0

import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt
from typing import Tuple, List

def train_and_compare_cliff_walking(
    n_episodes: int = 500,
    alpha: float = 0.5,
    gamma: float = 1.0,
    epsilon: float = 0.1
) -> Tuple[np.ndarray, np.ndarray, List[float], List[float]]:
    """
    Train both Q-Learning and SARSA on Cliff Walking and compare.

    Args:
        n_episodes: Number of training episodes
        alpha: Learning rate
        gamma: Discount factor (1.0 for episodic, undiscounted)
        epsilon: Exploration rate (fixed, no decay)

    Returns:
        Tuple of (Q_qlearning, Q_sarsa, qlearning_rewards, sarsa_rewards)
    """
    # Create environments
    env_q = gym.make('CliffWalking-v0')
    env_s = gym.make('CliffWalking-v0')

    n_states = env_q.observation_space.n   # 48 states (4x12)
    n_actions = env_q.action_space.n       # 4 actions (up, right, down, left)

    # Initialize Q-tables
    Q_qlearning = np.zeros((n_states, n_actions))
    Q_sarsa = np.zeros((n_states, n_actions))

    qlearning_rewards = []
    sarsa_rewards = []

    for episode in range(n_episodes):
        # === Q-Learning episode ===
        state_q, _ = env_q.reset()
        total_reward_q = 0

        while True:
            # Epsilon-greedy action selection
            if np.random.random() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q_qlearning[state_q]))

            next_state, reward, terminated, truncated, _ = env_q.step(action)
            done = terminated or truncated

            # Q-Learning update: use max over next state
            if done:
                td_target = reward
            else:
                td_target = reward + gamma * np.max(Q_qlearning[next_state])

            Q_qlearning[state_q, action] += alpha * (
                td_target - Q_qlearning[state_q, action]
            )

            total_reward_q += reward
            state_q = next_state

            if done:
                break

        qlearning_rewards.append(total_reward_q)

        # === SARSA episode ===
        state_s, _ = env_s.reset()

        # Select initial action
        if np.random.random() < epsilon:
            action_s = np.random.randint(n_actions)
        else:
            action_s = int(np.argmax(Q_sarsa[state_s]))

        total_reward_s = 0

        while True:
            next_state, reward, terminated, truncated, _ = env_s.step(action_s)
            done = terminated or truncated

            # Select next action (for SARSA)
            if not done:
                if np.random.random() < epsilon:
                    next_action = np.random.randint(n_actions)
                else:
                    next_action = int(np.argmax(Q_sarsa[next_state]))

            # SARSA update: use actual next action
            if done:
                td_target = reward
            else:
                td_target = reward + gamma * Q_sarsa[next_state, next_action]

            Q_sarsa[state_s, action_s] += alpha * (
                td_target - Q_sarsa[state_s, action_s]
            )

            total_reward_s += reward
            state_s = next_state

            if done:
                break

            action_s = next_action

        sarsa_rewards.append(total_reward_s)

    env_q.close()
    env_s.close()

    return Q_qlearning, Q_sarsa, qlearning_rewards, sarsa_rewards


def visualize_cliff_walking_results(
    Q_qlearning: np.ndarray,
    Q_sarsa: np.ndarray,
    qlearning_rewards: List[float],
    sarsa_rewards: List[float]
):
    """Visualize comparison between Q-Learning and SARSA."""

    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # --- Plot 1: Learning Curves ---
    ax1 = axes[0, 0]

    # Smooth rewards using moving average
    window = 10
    q_smooth = np.convolve(qlearning_rewards, np.ones(window)/window, mode='valid')
    s_smooth = np.convolve(sarsa_rewards, np.ones(window)/window, mode='valid')

    ax1.plot(q_smooth, label='Q-Learning', color='#e74c3c', linewidth=2)
    ax1.plot(s_smooth, label='SARSA', color='#3498db', linewidth=2)
    ax1.set_xlabel('Episode')
    ax1.set_ylabel('Sum of Rewards (Smoothed)')
    ax1.set_title('Learning Curves: Q-Learning vs SARSA')
    ax1.legend()
    ax1.grid(True, alpha=0.3)

    # --- Plot 2: Q-Learning Policy Visualization ---
    ax2 = axes[0, 1]
    visualize_policy(Q_qlearning, ax2, "Q-Learning Policy (Optimal but Risky)")

    # --- Plot 3: SARSA Policy Visualization ---
    ax3 = axes[1, 0]
    visualize_policy(Q_sarsa, ax3, "SARSA Policy (Safe Path)")

    # --- Plot 4: Final Performance Comparison ---
    ax4 = axes[1, 1]

    # Last 100 episodes average
    q_final = np.mean(qlearning_rewards[-100:])
    s_final = np.mean(sarsa_rewards[-100:])

    bars = ax4.bar(['Q-Learning', 'SARSA'], [q_final, s_final],
                   color=['#e74c3c', '#3498db'])
    ax4.set_ylabel('Average Reward (Last 100 Episodes)')
    ax4.set_title('Final Performance Comparison')
    ax4.axhline(y=-13, color='green', linestyle='--',
                label='Optimal (no exploration)', linewidth=2)
    ax4.legend()

    # Add value labels on bars
    for bar, val in zip(bars, [q_final, s_final]):
        ax4.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
                f'{val:.1f}', ha='center', fontsize=12, fontweight='bold')

    plt.tight_layout()
    plt.savefig('cliff_walking_comparison.png', dpi=150, bbox_inches='tight')
    print("Saved: cliff_walking_comparison.png")
    plt.close()


def visualize_policy(Q: np.ndarray, ax, title: str):
    """Visualize policy as arrows on the cliff walking grid."""

    # Action arrows: up=0, right=1, down=2, left=3
    arrows = {0: (0, 0.3), 1: (0.3, 0), 2: (0, -0.3), 3: (-0.3, 0)}

    # Create grid
    grid = np.zeros((4, 12))

    # Mark cliff
    for j in range(1, 11):
        grid[3, j] = -1  # Cliff

    ax.imshow(grid, cmap='RdYlBu', alpha=0.3)

    # Draw policy arrows
    for state in range(48):
        row = state // 12
        col = state % 12

        # Skip cliff cells and goal
        if row == 3 and 1 <= col <= 10:
            ax.text(col, row, 'X', ha='center', va='center',
                   fontsize=12, color='red', fontweight='bold')
            continue

        if row == 3 and col == 11:
            ax.text(col, row, 'G', ha='center', va='center',
                   fontsize=14, color='green', fontweight='bold')
            continue

        if row == 3 and col == 0:
            ax.text(col, row, 'S', ha='center', va='center',
                   fontsize=14, color='blue', fontweight='bold')

        best_action = np.argmax(Q[state])
        dx, dy = arrows[best_action]
        ax.arrow(col, row, dx, dy, head_width=0.15, head_length=0.1,
                fc='black', ec='black')

    ax.set_xlim(-0.5, 11.5)
    ax.set_ylim(3.5, -0.5)
    ax.set_xticks(range(12))
    ax.set_yticks(range(4))
    ax.set_title(title)
    ax.grid(True, alpha=0.3)


# Run the comparison
print("=== Cliff Walking: Q-Learning vs SARSA ===\n")

Q_q, Q_s, rewards_q, rewards_s = train_and_compare_cliff_walking(
    n_episodes=500,
    alpha=0.5,
    gamma=1.0,
    epsilon=0.1
)

print(f"Q-Learning final avg reward (last 100): {np.mean(rewards_q[-100:]):.2f}")
print(f"SARSA final avg reward (last 100): {np.mean(rewards_s[-100:]):.2f}")

visualize_cliff_walking_results(Q_q, Q_s, rewards_q, rewards_s)

print("\n=== Path Analysis ===")
print("Q-Learning learns to walk right along the cliff edge (optimal but risky)")
print("SARSA learns to go up first, then across, avoiding the cliff")
print("With epsilon=0.1, Q-Learning occasionally falls off during training,")
print("while SARSA learns a safer path that accounts for exploration mistakes.")

Understanding the Path Difference

Why does Q-Learning take the risky path?
Q-Learning uses $\max_{a'} Q(s', a')$ in its update. This assumes the agent will act optimally in the future, ignoring the possibility of exploratory actions. It learns the optimal policy as if epsilon were 0.

Why does SARSA take the safe path?
SARSA uses $Q(s', a')$ where $a'$ is the actual next action (which may be exploratory). Near the cliff, SARSA "knows" that sometimes it will randomly step toward the cliff due to epsilon-greedy exploration, so it learns to stay away.
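
To make this concrete, consider a state on the cliff edge where moving right continues the optimal path but moving down falls off. With made-up Q-values (illustrative only; the action ordering matches CliffWalking-v0), the greedy max ignores the cliff, while the epsilon-greedy expectation that SARSA tracks on average does not:

```python
import numpy as np

# Hypothetical Q-values at a cliff-edge state: [up, right, down, left]
# 'down' leads off the cliff; all numbers are illustrative
q_edge = np.array([-15.0, -12.0, -100.0, -17.0])
epsilon, gamma, reward = 0.1, 1.0, -1.0

# Q-Learning's bootstrap assumes the greedy 'right' is always taken
q_learning_target = reward + gamma * np.max(q_edge)   # -1 + (-12) = -13.0

# SARSA samples a' epsilon-greedily; on average its bootstrap equals the
# epsilon-greedy expectation, which includes the occasional fall
n = len(q_edge)
expected_q = (epsilon / n) * q_edge.sum() + (1 - epsilon) * q_edge.max()
sarsa_avg_target = reward + gamma * expected_q        # -1 + (-14.4) = -15.4

print(f"Q-Learning target:          {q_learning_target:.2f}")
print(f"SARSA target (on average):  {sarsa_avg_target:.2f}")
```

The cliff-edge state thus looks noticeably worse to SARSA than to Q-Learning, which is exactly the pressure that pushes SARSA's policy away from the edge.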


2.7 Expected SARSA

Taking the Expectation Over Next Actions

Expected SARSA is a variant that uses the expected value under the policy, rather than a single sampled action:

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \sum_{a'} \pi(a'|s_{t+1}) Q(s_{t+1}, a') - Q(s_t, a_t) \right] $$

For an epsilon-greedy policy:

$$ \sum_{a'} \pi(a'|s_{t+1}) Q(s_{t+1}, a') = \frac{\epsilon}{|A|} \sum_{a'} Q(s_{t+1}, a') + (1-\epsilon) \max_{a'} Q(s_{t+1}, a') $$

Properties of Expected SARSA

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0

import numpy as np

def expected_sarsa_update(
    Q: np.ndarray,
    state: int,
    action: int,
    reward: float,
    next_state: int,
    done: bool,
    alpha: float,
    gamma: float,
    epsilon: float
) -> None:
    """
    Expected SARSA update rule.

    Uses the expected value over next actions under the epsilon-greedy
    policy, rather than sampling a single next action.
    """
    n_actions = Q.shape[1]

    if done:
        td_target = reward
    else:
        # Expected value under epsilon-greedy policy
        # E[Q(s', a')] = eps/n * sum(Q(s', :)) + (1-eps) * max(Q(s', :))
        greedy_action = np.argmax(Q[next_state])
        expected_q = (
            (epsilon / n_actions) * np.sum(Q[next_state]) +
            (1 - epsilon) * Q[next_state, greedy_action]
        )
        td_target = reward + gamma * expected_q

    td_error = td_target - Q[state, action]
    Q[state, action] += alpha * td_error


# Example usage
print("=== Expected SARSA Example ===\n")

# Simple Q-table example
Q = np.array([
    [1.0, 2.0, 1.5, 0.5],  # State 0
    [0.8, 1.2, 0.9, 1.5],  # State 1
])

epsilon = 0.1
n_actions = 4

# For state 1:
# max Q = 1.5 (action 3)
# sum Q = 0.8 + 1.2 + 0.9 + 1.5 = 4.4
# Expected Q = (0.1/4) * 4.4 + 0.9 * 1.5 = 0.11 + 1.35 = 1.46

expected_q = (epsilon / n_actions) * np.sum(Q[1]) + (1 - epsilon) * np.max(Q[1])
print(f"Q-values for state 1: {Q[1]}")
print(f"Expected Q under epsilon-greedy (epsilon={epsilon}): {expected_q:.3f}")
print(f"Max Q (used by Q-Learning): {np.max(Q[1]):.3f}")
print(f"Sample Q (used by SARSA): depends on sampled action")

Summary

In this chapter, we covered the two fundamental TD control algorithms for tabular reinforcement learning:

Key Takeaways

| Concept | Key Point |
|---------|-----------|
| Q-Table | Stores Q(s,a) for all state-action pairs in discrete spaces |
| Q-Learning | Off-policy: uses $\max_{a'} Q(s', a')$, learns optimal policy |
| SARSA | On-policy: uses $Q(s', a')$, learns policy being followed |
| Epsilon-Greedy | Simple exploration: random with prob epsilon, greedy otherwise |
| Boltzmann | Probabilistic: action selection based on Q-value magnitudes |
| UCB | Optimistic: bonus for less-visited actions |
| Cliff Walking | Q-Learning: risky optimal path; SARSA: safe conservative path |
| Expected SARSA | Lower variance by taking expectation over next actions |

Algorithm Selection Guide

| Situation | Recommended Algorithm | Reason |
|-----------|----------------------|--------|
| Simulation, failures are cheap | Q-Learning | Learns the optimal policy efficiently |
| Real hardware, robotics | SARSA | Safer during exploration |
| Experience replay needed | Q-Learning | Replay requires off-policy learning |
| Low variance critical | Expected SARSA | Averaging reduces variance |

Next Steps

In Chapter 3: Deep Q-Network (DQN), we will learn how to handle large and continuous state spaces using neural network function approximation. Key topics include:


Exercises

Exercise 1: Implement Epsilon Decay Strategies

Implement and compare three epsilon decay strategies on the CliffWalking environment:

  1. Linear decay: $\epsilon_t = \epsilon_0 - t \cdot \text{rate}$
  2. Exponential decay: $\epsilon_t = \epsilon_0 \cdot \text{decay}^t$
  3. Step decay: $\epsilon_t = \epsilon_0 / (1 + t / k)$

Plot the learning curves and final performance for each strategy.
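As a starting point (not a full solution), the three schedules can be written as pure functions of the timestep. The constants `eps0`, `rate`, `decay`, `k`, and the `eps_min` floor below are illustrative choices you will want to tune:

```python
def linear_decay(t: int, eps0: float = 1.0, rate: float = 0.002, eps_min: float = 0.01) -> float:
    # epsilon_t = eps0 - t * rate, clipped at a floor so exploration never stops entirely
    return max(eps_min, eps0 - t * rate)

def exponential_decay(t: int, eps0: float = 1.0, decay: float = 0.995, eps_min: float = 0.01) -> float:
    # epsilon_t = eps0 * decay^t
    return max(eps_min, eps0 * decay ** t)

def step_decay(t: int, eps0: float = 1.0, k: float = 100.0, eps_min: float = 0.01) -> float:
    # epsilon_t = eps0 / (1 + t / k)
    return max(eps_min, eps0 / (1.0 + t / k))

for t in (0, 100, 400):
    print(t, linear_decay(t), exponential_decay(t), step_decay(t))
```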

Exercise 2: Double Q-Learning

Implement Double Q-Learning, which uses two Q-tables to reduce overestimation bias:

$$ Q_1(s, a) \leftarrow Q_1(s, a) + \alpha [r + \gamma Q_2(s', \arg\max_{a'} Q_1(s', a')) - Q_1(s, a)] $$

Compare the Q-value estimates of standard Q-Learning vs Double Q-Learning on a simple environment.
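One possible sketch of the update rule above, as a hint: the table to update is chosen by a coin flip each step, and the symmetric $Q_2$ update follows by swapping roles.

```python
import numpy as np

def double_q_update(Q1, Q2, state, action, reward, next_state, done,
                    alpha, gamma, rng):
    """One Double Q-Learning step; updates one of the two tables in place."""
    # Coin flip: which table is updated this step (the other one evaluates)
    if rng.random() < 0.5:
        Q1, Q2 = Q2, Q1  # swap local roles; both names still refer to the shared arrays

    if done:
        td_target = reward
    else:
        # Decoupled selection/evaluation: Q1 picks the greedy action,
        # Q2 estimates its value, which reduces overestimation bias
        a_star = int(np.argmax(Q1[next_state]))
        td_target = reward + gamma * Q2[next_state, a_star]

    Q1[state, action] += alpha * (td_target - Q1[state, action])

# Example call on empty tables
Q1 = np.zeros((3, 2)); Q2 = np.zeros((3, 2))
rng = np.random.default_rng(42)
double_q_update(Q1, Q2, state=0, action=1, reward=1.0, next_state=2,
                done=False, alpha=0.5, gamma=0.9, rng=rng)
```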

Exercise 3: n-Step SARSA

Implement n-step SARSA, which uses n-step returns instead of 1-step:

$$ G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n}) $$

Compare n=1, n=3, and n=5 on FrozenLake-v1.
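The n-step return above can be computed directly from a buffer of the last n rewards plus the bootstrap value. A sketch (truncation at episode end, where the bootstrap term is dropped, is left to the exercise):

```python
def n_step_return(rewards, bootstrap_q, gamma):
    """Compute G_t^(n) from rewards [R_{t+1}, ..., R_{t+n}] and Q(S_{t+n}, A_{t+n})."""
    G = 0.0
    for k, r in enumerate(rewards):
        G += gamma ** k * r  # gamma^k * R_{t+k+1}
    return G + gamma ** len(rewards) * bootstrap_q  # gamma^n bootstrap term

# n = 3, gamma = 0.9: 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*5.0 ≈ 6.265
print(n_step_return([1.0, 0.0, 2.0], bootstrap_q=5.0, gamma=0.9))
```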

Exercise 4: Adaptive Learning Rate

Implement state-action specific learning rates that decay based on visit counts:

$$ \alpha(s, a) = \frac{1}{1 + N(s, a)} $$

Compare performance with fixed learning rate on the Taxi-v3 environment.

Exercise 5: Windy Gridworld

Implement and solve the Windy Gridworld problem:


Disclaimer