
Chapter 2: Q-Learning and SARSA

Value Function Estimation and Policy Optimization using Temporal Difference Learning

📖 Reading Time: 25-30 minutes 📊 Difficulty: Beginner to Intermediate 💻 Code Examples: 8 📝 Exercises: 5

This chapter covers Q-learning and SARSA, two temporal difference (TD) control algorithms. You will learn how each algorithm updates its action-value estimates, how their off-policy and on-policy characteristics differ, and how to balance exploration and exploitation.

Learning Objectives

By reading this chapter, you will master the following:

- The fundamentals of temporal difference (TD) learning and how it differs from Monte Carlo methods
- The Q-learning algorithm and its off-policy update rule
- The SARSA algorithm and its on-policy update rule
- The ε-greedy strategy for balancing exploration and exploitation
- How the learning rate α, the discount factor γ, and ε affect training


2.1 Fundamentals of Temporal Difference (TD) Learning

Challenges of Monte Carlo Methods

The Monte Carlo method we learned in Chapter 1 had the constraint of needing to wait until the end of an episode:

$$ V(s_t) \leftarrow V(s_t) + \alpha [G_t - V(s_t)] $$

where $G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$ is the actual return.
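To see why this is a constraint, here is a minimal sketch (with illustrative values) of computing $G_t$: the full reward sequence must be known before a single update can be made, so no learning happens until the episode ends.

# Monte Carlo: G_t is computed by sweeping backwards over a completed episode
rewards = [0.0, 0.0, 1.0]    # r_1, r_2, r_3 observed in one finished episode
gamma = 0.99

G = 0.0
returns = []
for r in reversed(rewards):
    G = r + gamma * G        # G_t = r_{t+1} + gamma * G_{t+1}
    returns.append(G)
returns.reverse()            # returns[t] is the target G_t for V(s_t)

print(returns)               # only now can V(s_t) <- V(s_t) + alpha * (G_t - V(s_t)) run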

Basic Idea of Temporal Difference Learning

Temporal Difference (TD) learning updates the value function at each step without waiting for the episode to end:

$$ V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)] $$

where $\alpha$ is the learning rate, $r_{t+1} + \gamma V(s_{t+1})$ is the TD target, and the bracketed term $r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ is the TD error $\delta_t$.

graph LR
    S1["State s_t"] --> A["Action a_t"]
    A --> S2["State s_t+1"]
    S2 --> R["Reward r_t+1"]
    R --> Update["Update V(s_t)"]
    S2 --> Update
    style S1 fill:#b3e5fc
    style S2 fill:#c5e1a5
    style R fill:#fff9c4
    style Update fill:#ffab91

Implementation of TD(0)

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - gym>=0.26.0

import numpy as np
import gym

def td_0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """
    State value function estimation using TD(0)

    Args:
        env: Environment
        policy: Policy (state -> action probability distribution)
        num_episodes: Number of episodes
        alpha: Learning rate
        gamma: Discount factor

    Returns:
        V: State value function
    """
    # Initialize state value function
    V = np.zeros(env.observation_space.n)

    for episode in range(num_episodes):
        state, _ = env.reset()

        while True:
            # Select action according to policy
            action = np.random.choice(env.action_space.n, p=policy[state])

            # Interact with environment
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # TD(0) update
            td_target = reward + gamma * V[next_state]
            td_error = td_target - V[state]
            V[state] = V[state] + alpha * td_error

            if done:
                break

            state = next_state

    return V


# Example usage: FrozenLake environment
print("=== Value Function Estimation using TD(0) ===")

env = gym.make('FrozenLake-v1', is_slippery=False)

# Random policy
policy = np.ones((env.observation_space.n, env.action_space.n)) / env.action_space.n

# Execute TD(0)
V = td_0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99)

print(f"State value function:\n{V.reshape(4, 4)}")
env.close()

Comparison of Monte Carlo Methods and TD Learning

| Aspect | Monte Carlo Method | TD Learning |
|---|---|---|
| Update timing | After episode ends | After each step |
| Return calculation | Actual return $G_t$ | Estimated return $r + \gamma V(s')$ |
| Bias | None (unbiased) | Present (depends on initial values) |
| Variance | High | Low |
| Continuing tasks | Not applicable | Applicable |
| Convergence speed | Slow | Fast |

"TD learning achieves efficient learning through bootstrapping (updating using its own estimates)"


2.2 Q-Learning

Action-Value Function Q(s, a)

Instead of the state value function $V(s)$, we learn the action-value function $Q(s, a)$:

$$ Q(s, a) = \mathbb{E}[G_t | S_t = s, A_t = a] $$

This represents "the expected return after taking action $a$ in state $s$".
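In the tabular setting, $Q$ is simply a 2-D array indexed by state and action, and a greedy policy can be read off with an argmax over the action axis. A minimal sketch (the sizes are illustrative):

import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))   # Q[s, a]: estimated return for action a in state s

greedy_policy = np.argmax(Q, axis=1)  # best known action for each state, shape (n_states,)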

Q-Learning Update Equation

Q-learning applies TD learning to the action-value function:

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] $$

Key points:

- The TD target uses $\max_{a'} Q(s_{t+1}, a')$, the value of the best next action, regardless of which action is actually taken next.
- Because the behavior policy (e.g., ε-greedy) and the target policy (greedy) differ, Q-learning is off-policy.
- Under standard conditions (every state-action pair visited infinitely often, appropriately decaying learning rate), $Q$ converges to the optimal $Q^*$.

graph TB
    Start["State s, Action a"] --> Execute["Execute in environment"]
    Execute --> Observe["Observe s', r"]
    Observe --> MaxQ["max_a' Q(s', a')"]
    MaxQ --> Target["TD target = r + γ max Q(s', a')"]
    Target --> Update["Update Q(s,a)"]
    Update --> Next["Next step"]
    style Start fill:#b3e5fc
    style MaxQ fill:#fff59d
    style Update fill:#ffab91

Implementation of Q-Learning Algorithm

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - gym>=0.26.0

import numpy as np
import gym

class QLearningAgent:
    """Q-Learning Agent"""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        """
        Args:
            n_states: Number of states
            n_actions: Number of actions
            alpha: Learning rate
            gamma: Discount factor
            epsilon: ε for ε-greedy
        """
        self.Q = np.zeros((n_states, n_actions))
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.n_actions = n_actions

    def select_action(self, state):
        """Select action using ε-greedy policy"""
        if np.random.rand() < self.epsilon:
            # Random action (exploration)
            return np.random.randint(self.n_actions)
        else:
            # Best action (exploitation)
            return np.argmax(self.Q[state])

    def update(self, state, action, reward, next_state, done):
        """Update Q-value"""
        if done:
            # Terminal state
            td_target = reward
        else:
            # Q-learning update equation
            td_target = reward + self.gamma * np.max(self.Q[next_state])

        td_error = td_target - self.Q[state, action]
        self.Q[state, action] += self.alpha * td_error


def train_q_learning(env, agent, num_episodes=1000):
    """Train Q-learning"""
    episode_rewards = []

    for episode in range(num_episodes):
        state, _ = env.reset()
        total_reward = 0

        while True:
            # Select action
            action = agent.select_action(state)

            # Execute in environment
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Update Q-value
            agent.update(state, action, reward, next_state, done)

            total_reward += reward

            if done:
                break

            state = next_state

        episode_rewards.append(total_reward)

        # Display progress
        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(episode_rewards[-100:])
            print(f"Episode {episode + 1}, Avg Reward: {avg_reward:.2f}")

    return episode_rewards


# Example usage: FrozenLake
print("\n=== Training Q-Learning ===")

env = gym.make('FrozenLake-v1', is_slippery=False)

agent = QLearningAgent(
    n_states=env.observation_space.n,
    n_actions=env.action_space.n,
    alpha=0.1,
    gamma=0.99,
    epsilon=0.1
)

rewards = train_q_learning(env, agent, num_episodes=1000)

print(f"\nLearned Q-table (partial):")
print(agent.Q[:16].reshape(4, 4, -1)[:, :, 0])  # Q-values for action 0
env.close()

Visualizing the Q-Table

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - seaborn>=0.12.0

import matplotlib.pyplot as plt
import seaborn as sns

def visualize_q_table(Q, env_shape=(4, 4)):
    """Visualize Q-table"""
    n_states = Q.shape[0]
    n_actions = Q.shape[1]

    fig, axes = plt.subplots(1, n_actions, figsize=(16, 4))

    action_names = ['LEFT', 'DOWN', 'RIGHT', 'UP']

    for action in range(n_actions):
        Q_action = Q[:, action].reshape(env_shape)

        sns.heatmap(Q_action, annot=True, fmt='.2f', cmap='YlOrRd',
                   ax=axes[action], cbar=True, square=True)
        axes[action].set_title(f'Q-value: {action_names[action]}')
        axes[action].set_xlabel('Column')
        axes[action].set_ylabel('Row')

    plt.tight_layout()
    plt.savefig('q_table_visualization.png', dpi=150, bbox_inches='tight')
    print("Q-table saved: q_table_visualization.png")
    plt.close()


# Visualize Q-table
visualize_q_table(agent.Q)

Visualizing Learning Curves

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

import matplotlib.pyplot as plt
import numpy as np

def plot_learning_curve(rewards, window=100):
    """Plot learning curve"""
    # Calculate moving average
    smoothed_rewards = np.convolve(rewards, np.ones(window)/window, mode='valid')

    plt.figure(figsize=(12, 5))

    # Episode rewards
    plt.subplot(1, 2, 1)
    plt.plot(rewards, alpha=0.3, label='Episode Reward')
    plt.plot(range(window-1, len(rewards)), smoothed_rewards,
             linewidth=2, label=f'{window}-Episode Moving Average')
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.title('Q-Learning Learning Curve')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # Cumulative rewards
    plt.subplot(1, 2, 2)
    cumulative_rewards = np.cumsum(rewards)
    plt.plot(cumulative_rewards, linewidth=2, color='green')
    plt.xlabel('Episode')
    plt.ylabel('Cumulative Reward')
    plt.title('Cumulative Reward')
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('q_learning_curve.png', dpi=150, bbox_inches='tight')
    print("Learning curve saved: q_learning_curve.png")
    plt.close()


plot_learning_curve(rewards)

2.3 SARSA (State-Action-Reward-State-Action)

Basic Principle of SARSA

SARSA is the on-policy counterpart of Q-learning. It updates using the action the agent actually takes next:

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] $$

Key difference: whereas Q-learning bootstraps from $\max_{a'} Q(s_{t+1}, a')$, SARSA bootstraps from $Q(s_{t+1}, a_{t+1})$ using the action $a_{t+1}$ that the current (e.g., ε-greedy) policy actually selects. The update therefore follows the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, which gives the algorithm its name.

graph LR
    S1["S_t"] --> A1["A_t"]
    A1 --> R["R_t+1"]
    R --> S2["S_t+1"]
    S2 --> A2["A_t+1"]
    A2 --> Update["Update Q(S_t, A_t)"]
    style S1 fill:#b3e5fc
    style A1 fill:#c5e1a5
    style R fill:#fff9c4
    style S2 fill:#b3e5fc
    style A2 fill:#c5e1a5
    style Update fill:#ffab91

Comparison of Q-Learning and SARSA

| Aspect | Q-Learning | SARSA |
|---|---|---|
| Learning type | Off-policy | On-policy |
| Update target | $r + \gamma \max_{a'} Q(s', a')$ | $r + \gamma Q(s', a')$ |
| Exploration impact | Does not affect the learning target | Affects the learning target |
| Convergence target | Optimal policy | Value of the current policy |
| Safety | Does not account for exploration risk | Accounts for exploration risk |
| Typical application | Simulation environments | Learning in real environments |

Implementation of SARSA

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - gym>=0.26.0

import numpy as np
import gym

class SARSAAgent:
    """SARSA Agent"""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        """
        Args:
            n_states: Number of states
            n_actions: Number of actions
            alpha: Learning rate
            gamma: Discount factor
            epsilon: ε for ε-greedy
        """
        self.Q = np.zeros((n_states, n_actions))
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.n_actions = n_actions

    def select_action(self, state):
        """Select action using ε-greedy policy"""
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)
        else:
            return np.argmax(self.Q[state])

    def update(self, state, action, reward, next_state, next_action, done):
        """Update Q-value (SARSA)"""
        if done:
            td_target = reward
        else:
            # SARSA update equation (uses action actually taken next)
            td_target = reward + self.gamma * self.Q[next_state, next_action]

        td_error = td_target - self.Q[state, action]
        self.Q[state, action] += self.alpha * td_error


def train_sarsa(env, agent, num_episodes=1000):
    """Train SARSA"""
    episode_rewards = []

    for episode in range(num_episodes):
        state, _ = env.reset()
        action = agent.select_action(state)  # Initial action selection
        total_reward = 0

        while True:
            # Execute in environment
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            if not done:
                # Select next action (characteristic of SARSA)
                next_action = agent.select_action(next_state)
            else:
                next_action = None

            # Update Q-value
            agent.update(state, action, reward, next_state, next_action, done)

            total_reward += reward

            if done:
                break

            state = next_state
            action = next_action  # Transition to next action

        episode_rewards.append(total_reward)

        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(episode_rewards[-100:])
            print(f"Episode {episode + 1}, Avg Reward: {avg_reward:.2f}")

    return episode_rewards


# Example usage
print("\n=== Training SARSA ===")

env = gym.make('FrozenLake-v1', is_slippery=False)

sarsa_agent = SARSAAgent(
    n_states=env.observation_space.n,
    n_actions=env.action_space.n,
    alpha=0.1,
    gamma=0.99,
    epsilon=0.1
)

sarsa_rewards = train_sarsa(env, sarsa_agent, num_episodes=1000)

print(f"\nLearned Q-table (SARSA):")
print(sarsa_agent.Q[:16].reshape(4, 4, -1)[:, :, 0])
env.close()

2.4 ε-Greedy Exploration Strategy

Exploration vs Exploitation Trade-off

In reinforcement learning, balancing exploration and exploitation is crucial:

- Exploration: try actions whose values are still uncertain in order to gather information.
- Exploitation: choose the action with the highest estimated value to collect reward now.

Too little exploration risks locking onto a suboptimal policy; too much wastes steps on actions already known to be poor.

ε-Greedy Policy

The simplest exploration strategy:

$$ a = \begin{cases} \text{random action} & \text{with probability } \epsilon \\ \arg\max_a Q(s, a) & \text{with probability } 1 - \epsilon \end{cases} $$

Epsilon Decay

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

import numpy as np
import matplotlib.pyplot as plt

class EpsilonGreedy:
    """ε-greedy policy (with decay functionality)"""

    def __init__(self, epsilon_start=1.0, epsilon_end=0.01, epsilon_decay=0.995):
        """
        Args:
            epsilon_start: Initial ε
            epsilon_end: Minimum ε
            epsilon_decay: Decay rate
        """
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay

    def select_action(self, Q, state, n_actions):
        """Select action"""
        if np.random.rand() < self.epsilon:
            return np.random.randint(n_actions)
        else:
            return np.argmax(Q[state])

    def decay(self):
        """Decay ε"""
        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)


# Visualize epsilon decay patterns
print("\n=== Visualizing ε Decay Patterns ===")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Different decay rates
decay_rates = [0.99, 0.995, 0.999]

for i, decay_rate in enumerate(decay_rates):
    epsilon_greedy = EpsilonGreedy(epsilon_start=1.0, epsilon_end=0.01,
                                   epsilon_decay=decay_rate)
    epsilons = [epsilon_greedy.epsilon]

    for _ in range(1000):
        epsilon_greedy.decay()
        epsilons.append(epsilon_greedy.epsilon)

    axes[i].plot(epsilons, linewidth=2)
    axes[i].set_xlabel('Episode')
    axes[i].set_ylabel('ε')
    axes[i].set_title(f'Decay Rate = {decay_rate}')
    axes[i].grid(True, alpha=0.3)
    axes[i].set_ylim([0, 1.1])

plt.tight_layout()
plt.savefig('epsilon_decay.png', dpi=150, bbox_inches='tight')
print("ε decay patterns saved: epsilon_decay.png")
plt.close()

Other Exploration Strategies

Softmax (Boltzmann) Exploration

Probabilistic selection based on action values:

$$ P(a | s) = \frac{\exp(Q(s,a) / \tau)}{\sum_{a'} \exp(Q(s,a') / \tau)} $$

$\tau$ is the temperature parameter (higher means more random)
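A minimal sketch of softmax selection (the function name is illustrative; subtracting the max before exponentiating is a standard numerical-stability trick):

import numpy as np

def softmax_action(Q, state, tau=1.0):
    """Sample an action with probability proportional to exp(Q(s, a) / tau)."""
    prefs = Q[state] / tau
    prefs = prefs - np.max(prefs)                  # avoid overflow in exp
    probs = np.exp(prefs) / np.sum(np.exp(prefs))
    return np.random.choice(len(probs), p=probs)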

Upper Confidence Bound (UCB)

Exploration considering uncertainty:

$$ a = \arg\max_a \left[ Q(s,a) + c \sqrt{\frac{\ln t}{N(s,a)}} \right] $$

$N(s,a)$ is the number of times action $a$ was selected, $c$ is the exploration coefficient
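A minimal tabular sketch (names are illustrative; `N` is a visit-count table of the same shape as `Q`, and the small constant avoids division by zero for unvisited pairs):

import numpy as np

def ucb_action(Q, N, state, t, c=2.0):
    """Pick the action maximizing Q(s, a) plus an exploration bonus."""
    bonus = c * np.sqrt(np.log(t + 1) / (N[state] + 1e-8))
    return int(np.argmax(Q[state] + bonus))

# After each step, the count is updated with N[state, action] += 1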


2.5 Impact of Hyperparameters

Learning Rate α

The learning rate $\alpha$ controls the strength of each update:

- Large $\alpha$ (e.g., 0.5): fast learning, but estimates oscillate and can diverge with noisy rewards.
- Small $\alpha$ (e.g., 0.01): stable, but convergence is slow.
- A common default is $\alpha \approx 0.1$, or a rate that decays as learning progresses.

Discount Factor γ

The discount factor $\gamma$ determines the importance of future rewards:

- $\gamma$ close to 0: myopic behavior; the agent optimizes mostly immediate rewards.
- $\gamma$ close to 1: far-sighted behavior; rewards many steps ahead still carry weight.
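A quick check of how the effective horizon scales with $\gamma$ (the weight of a reward $k$ steps ahead is $\gamma^k$):

for gamma in [0.5, 0.9, 0.99]:
    # weight of a reward 10 and 100 steps in the future
    print(f"gamma={gamma}: 10 steps -> {gamma**10:.4f}, 100 steps -> {gamma**100:.6f}")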

Implementation of Hyperparameter Search

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - gym>=0.26.0

import numpy as np
import matplotlib.pyplot as plt
import gym

def hyperparameter_search(env_name, param_name, param_values, num_episodes=500):
    """Investigate the impact of hyperparameters"""
    results = {}

    for value in param_values:
        print(f"\nTraining with {param_name} = {value}...")

        env = gym.make(env_name)

        if param_name == 'alpha':
            agent = QLearningAgent(env.observation_space.n, env.action_space.n,
                                  alpha=value, gamma=0.99, epsilon=0.1)
        elif param_name == 'gamma':
            agent = QLearningAgent(env.observation_space.n, env.action_space.n,
                                  alpha=0.1, gamma=value, epsilon=0.1)
        elif param_name == 'epsilon':
            agent = QLearningAgent(env.observation_space.n, env.action_space.n,
                                  alpha=0.1, gamma=0.99, epsilon=value)

        rewards = train_q_learning(env, agent, num_episodes=num_episodes)
        results[value] = rewards
        env.close()

    return results


# Investigate impact of learning rate
print("=== Investigating Impact of Learning Rate α ===")

alpha_values = [0.01, 0.05, 0.1, 0.3, 0.5]
alpha_results = hyperparameter_search('FrozenLake-v1', 'alpha', alpha_values)

# Visualization
plt.figure(figsize=(14, 5))

plt.subplot(1, 2, 1)
for alpha, rewards in alpha_results.items():
    smoothed = np.convolve(rewards, np.ones(50)/50, mode='valid')
    plt.plot(smoothed, label=f'α = {alpha}', linewidth=2)

plt.xlabel('Episode')
plt.ylabel('Average Reward')
plt.title('Impact of Learning Rate α')
plt.legend()
plt.grid(True, alpha=0.3)

# Investigate impact of discount factor
gamma_values = [0.5, 0.9, 0.95, 0.99, 0.999]
gamma_results = hyperparameter_search('FrozenLake-v1', 'gamma', gamma_values)

plt.subplot(1, 2, 2)
for gamma, rewards in gamma_results.items():
    smoothed = np.convolve(rewards, np.ones(50)/50, mode='valid')
    plt.plot(smoothed, label=f'γ = {gamma}', linewidth=2)

plt.xlabel('Episode')
plt.ylabel('Average Reward')
plt.title('Impact of Discount Factor γ')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('hyperparameter_impact.png', dpi=150, bbox_inches='tight')
print("\nHyperparameter impact saved: hyperparameter_impact.png")
plt.close()

2.6 Practice: Taxi-v3 Environment

Overview of Taxi-v3 Environment

Taxi-v3 is an environment where a taxi picks up a passenger and delivers them to a destination:

- State space: 500 discrete states (taxi position × passenger location × destination)
- Action space: 6 actions (South, North, East, West, Pickup, Dropoff)
- Rewards: +20 for a successful dropoff, -10 for an illegal pickup or dropoff, -1 for every other step

Q-Learning on Taxi-v3

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - gym>=0.26.0

"""
Example: Q-Learning on Taxi-v3

Purpose: Train a tabular Q-learning agent on Taxi-v3 and visualize its learning curve
Target: Beginner to Intermediate
Execution time: 1-5 minutes
Dependencies: QLearningAgent and train_q_learning from Section 2.2
"""

import numpy as np
import gym
import matplotlib.pyplot as plt

# Taxi-v3 environment
print("=== Q-Learning on Taxi-v3 Environment ===")

env = gym.make('Taxi-v3', render_mode=None)

print(f"State space: {env.observation_space.n}")
print(f"Action space: {env.action_space.n}")
print(f"Actions: {['South', 'North', 'East', 'West', 'Pickup', 'Dropoff']}")

# Q-learning agent
taxi_agent = QLearningAgent(
    n_states=env.observation_space.n,
    n_actions=env.action_space.n,
    alpha=0.1,
    gamma=0.99,
    epsilon=0.1
)

# Training
taxi_rewards = train_q_learning(env, taxi_agent, num_episodes=5000)

# Learning curve
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
smoothed = np.convolve(taxi_rewards, np.ones(100)/100, mode='valid')
plt.plot(smoothed, linewidth=2, color='blue')
plt.xlabel('Episode')
plt.ylabel('Average Reward')
plt.title('Taxi-v3 Q-Learning Learning Curve')
plt.grid(True, alpha=0.3)

# Calculate success rate
success_rate = []
window = 100
for i in range(len(taxi_rewards) - window):
    success = np.sum(np.array(taxi_rewards[i:i+window]) > 0) / window
    success_rate.append(success)

plt.subplot(1, 2, 2)
plt.plot(success_rate, linewidth=2, color='green')
plt.xlabel('Episode')
plt.ylabel('Success Rate')
plt.title('Task Success Rate (100-Episode Moving Average)')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('taxi_training.png', dpi=150, bbox_inches='tight')
print("Taxi training results saved: taxi_training.png")
plt.close()

env.close()

Evaluating the Trained Agent

def evaluate_agent(env, agent, num_episodes=100, render=False):
    """Evaluate trained agent"""
    total_rewards = []
    total_steps = []

    for episode in range(num_episodes):
        state, _ = env.reset()
        episode_reward = 0
        steps = 0

        while steps < 200:  # Maximum steps
            # Select best action (no exploration)
            action = np.argmax(agent.Q[state])

            state, reward, terminated, truncated, _ = env.step(action)
            episode_reward += reward
            steps += 1

            if terminated or truncated:
                break

        total_rewards.append(episode_reward)
        total_steps.append(steps)

    return total_rewards, total_steps


# Evaluation
print("\n=== Evaluating Trained Agent ===")

env = gym.make('Taxi-v3', render_mode=None)
eval_rewards, eval_steps = evaluate_agent(env, taxi_agent, num_episodes=100)

print(f"Average reward: {np.mean(eval_rewards):.2f} ± {np.std(eval_rewards):.2f}")
print(f"Average steps: {np.mean(eval_steps):.2f} ± {np.std(eval_steps):.2f}")
print(f"Success rate: {np.sum(np.array(eval_rewards) > 0) / len(eval_rewards) * 100:.1f}%")

env.close()

2.7 Practice: Cliff Walking Environment

Definition of Cliff Walking Environment

Cliff Walking is an environment where you must reach the goal while avoiding cliffs. It clearly demonstrates the difference between Q-learning and SARSA:

- 4×12 grid; the agent starts at the bottom-left and the goal is at the bottom-right.
- Every step costs -1; stepping into the cliff (the cells between start and goal) costs -100 and sends the agent back to the start.
- The shortest path runs directly along the cliff edge, so speed and safety are in direct conflict.

Implementation on Cliff Walking Environment

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - matplotlib>=3.7.0
# - gym>=0.26.0

"""
Example: Q-Learning vs SARSA on the Cliff Walking Environment

Purpose: Compare the learning curves and learned paths of off-policy and on-policy agents
Target: Intermediate
Execution time: 1-5 minutes
Dependencies: QLearningAgent, SARSAAgent, train_q_learning, and train_sarsa from earlier sections
"""

import numpy as np
import gym
import matplotlib.pyplot as plt

# Cliff Walking environment
print("=== Cliff Walking Environment ===")

env = gym.make('CliffWalking-v0')

print(f"State space: {env.observation_space.n}")
print(f"Action space: {env.action_space.n}")
print(f"Grid size: 4×12")

# Q-learning agent
cliff_q_agent = QLearningAgent(
    n_states=env.observation_space.n,
    n_actions=env.action_space.n,
    alpha=0.5,
    gamma=0.99,
    epsilon=0.1
)

# SARSA agent
cliff_sarsa_agent = SARSAAgent(
    n_states=env.observation_space.n,
    n_actions=env.action_space.n,
    alpha=0.5,
    gamma=0.99,
    epsilon=0.1
)

# Training
print("\nTraining with Q-learning...")
q_rewards = train_q_learning(env, cliff_q_agent, num_episodes=500)

env.close()
env = gym.make('CliffWalking-v0')
print("\nTraining with SARSA...")
sarsa_rewards = train_sarsa(env, cliff_sarsa_agent, num_episodes=500)

# Comparison visualization
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
smoothed_q = np.convolve(q_rewards, np.ones(10)/10, mode='valid')
smoothed_sarsa = np.convolve(sarsa_rewards, np.ones(10)/10, mode='valid')

plt.plot(smoothed_q, label='Q-Learning', linewidth=2)
plt.plot(smoothed_sarsa, label='SARSA', linewidth=2)
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('Cliff Walking: Q-Learning vs SARSA')
plt.legend()
plt.grid(True, alpha=0.3)

# Visualize policy (display with arrows)
plt.subplot(1, 2, 2)

def visualize_policy(Q, shape=(4, 12), ax=None):
    """Visualize the learned policy with arrows on the given (or current) axes"""
    if ax is None:
        ax = plt.gca()  # draw into the current subplot instead of opening a new figure
    policy = np.argmax(Q, axis=1)

    # Arrow directions (CliffWalking-v0: 0=UP, 1=RIGHT, 2=DOWN, 3=LEFT)
    arrows = {0: '↑', 1: '→', 2: '↓', 3: '←'}

    ax.imshow(np.zeros(shape), cmap='Blues', alpha=0.3)

    for i in range(shape[0]):
        for j in range(shape[1]):
            state = i * shape[1] + j
            action = policy[state]

            # Display cliff area in red
            if i == 3 and 1 <= j <= 10:
                ax.add_patch(plt.Rectangle((j-0.5, i-0.5), 1, 1,
                                          fill=True, color='red', alpha=0.3))

            # Goal
            if i == 3 and j == 11:
                ax.add_patch(plt.Rectangle((j-0.5, i-0.5), 1, 1,
                                          fill=True, color='green', alpha=0.3))

            # Arrow
            ax.text(j, i, arrows[action], ha='center', va='center',
                   fontsize=16, fontweight='bold')

    ax.set_xlim(-0.5, shape[1]-0.5)
    ax.set_ylim(shape[0]-0.5, -0.5)
    ax.set_xticks(range(shape[1]))
    ax.set_yticks(range(shape[0]))
    ax.grid(True)
    ax.set_title('Learned Policy (Q-Learning)')


visualize_policy(cliff_q_agent.Q)

plt.tight_layout()
plt.savefig('cliff_walking_comparison.png', dpi=150, bbox_inches='tight')
print("\nCliff Walking comparison saved: cliff_walking_comparison.png")
plt.close()

env.close()

Path Differences between Q-Learning and SARSA

Important Observation: In Cliff Walking, Q-learning learns the shortest path (close to cliff), while SARSA learns a safe path (away from cliff). This is because SARSA incorporates accidental cliff falls from ε-greedy exploration into its learning.
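A rough back-of-the-envelope calculation shows the effect (assuming ε = 0.1 and that the random branch of ε-greedy picks uniformly among the 4 actions, so the cliff-ward action fires with probability ε/4):

eps = 0.1

# Expected one-step reward in a cell directly above the cliff under epsilon-greedy:
# the random branch steps into the cliff (-100) with probability eps/4,
# otherwise the move costs the usual -1.
expected_edge_step = (1 - eps / 4) * (-1) + (eps / 4) * (-100)
print(expected_edge_step)   # about -3.48 per step, versus -1 on a safe row

# SARSA's target Q(s', a') averages over these mishaps, so the edge path looks
# expensive; Q-learning's max over a' ignores them and keeps the shortest path.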


Exercises

Exercise 1: Comparing Convergence Speed of Q-Learning and SARSA

Train Q-learning and SARSA on the FrozenLake environment with the same hyperparameters and compare their convergence speeds.

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - gym>=0.26.0

"""
Exercise 1: Train Q-learning and SARSA on the FrozenLake environment with
the same hyperparameters and compare their convergence speeds.

Purpose: Compare the convergence behavior of off-policy and on-policy TD control
Target: Beginner to Intermediate
Execution time: 1-5 minutes
Dependencies: QLearningAgent and SARSAAgent from this chapter
"""

import gym
import numpy as np

# Exercise: Train Q-learning and SARSA with same settings
# Exercise: Plot episode rewards
# Exercise: Compare number of episodes needed for convergence
# Expected: Convergence speed differs depending on environment

Exercise 2: Optimizing ε Decay Schedules

Implement different ε decay patterns (linear decay, exponential decay, step decay) and compare their performance on Taxi-v3.

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0

"""
Example: Implement different ε decay patterns (linear decay, exponent

Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: None
"""

import numpy as np

# Exercise: Implement 3 types of ε decay schedules
# Exercise: Evaluate each schedule on Taxi-v3
# Exercise: Compare learning curves and final performance
# Hint: Emphasize exploration early, exploitation later

Exercise 3: Implementing Double Q-Learning

Implement Double Q-Learning to prevent overestimation and compare performance with standard Q-learning.

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0

"""
Example: Implement Double Q-Learning to prevent overestimation and co

Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner to Intermediate
Execution time: 1-5 minutes
Dependencies: None
"""

import numpy as np

# Exercise: Implement Double Q-Learning using two Q-tables
# Exercise: Train on FrozenLake environment
# Exercise: Compare Q-value estimation error with standard Q-learning
# Theory: Double algorithm reduces overestimation bias

Exercise 4: Adaptive Learning Rate Adjustment

Implement an adaptive learning rate that adjusts based on visit count and compare it with fixed learning rate.

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0

"""
Example: Implement an adaptive learning rate that adjusts based on vi

Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: None
"""

import numpy as np

# Exercise: Implement adaptive learning rate α(s,a) = 1 / (1 + N(s,a))
# Exercise: Compare performance with fixed learning rate
# Exercise: Visualize visit counts for each state
# Expected: Adaptive learning rate leads to more stable convergence

Exercise 5: Experiments on Custom Environments

Discretize the state space for other OpenAI Gym environments (CartPole-v1, MountainCar-v0, etc.) and apply Q-learning.

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - gym>=0.26.0

"""
Exercise 5: Discretize the state space of other OpenAI Gym environments
(CartPole-v1, MountainCar-v0, etc.) and apply Q-learning.

Purpose: Apply tabular Q-learning to continuous state spaces via discretization
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: None
"""

import gym
import numpy as np

# Exercise: Implement function to discretize continuous state space
# Exercise: Apply Q-learning on discretized CartPole environment
# Exercise: Investigate relationship between discretization granularity and performance
# Challenge: Appropriate discretization of continuous space is critical

Summary

In this chapter, we learned Q-learning and SARSA based on temporal difference learning.

Key Points

- TD learning updates value estimates at every step by bootstrapping from its own predictions, trading a little bias for much lower variance than Monte Carlo methods.
- Q-learning is off-policy: its target $r + \gamma \max_{a'} Q(s', a')$ converges toward the optimal policy regardless of how the agent explores.
- SARSA is on-policy: its target $r + \gamma Q(s', a')$ evaluates the policy actually being followed, which leads to safer behavior during exploration.
- ε-greedy is the simplest effective exploration strategy; decaying ε shifts the balance from exploration to exploitation over training.
- The learning rate $\alpha$ and the discount factor $\gamma$ strongly influence stability and the effective planning horizon.

When to Use Q-Learning vs SARSA

| Situation | Recommended Algorithm | Reason |
|---|---|---|
| Simulation environment | Q-Learning | Efficiently learns the optimal policy |
| Real environment / robotics | SARSA | Learns a safe policy |
| Dangerous states present | SARSA | Risk-averse tendency |
| Fast convergence needed | Q-Learning | Off-policy learning is flexible |

Next Steps

In the next chapter, we will learn about Deep Q-Network (DQN). We will master techniques for approximating action-value functions using neural networks for large-scale and continuous state spaces that cannot be handled by Q-tables, including Experience Replay, Target Network, and applications to Atari games.
