This chapter covers Q-Learning and SARSA. You will learn how these temporal difference algorithms update their action values, how their on-policy and off-policy characteristics differ, and how to balance exploration and exploitation.
Learning Objectives
By reading this chapter, you will master the following:
- ✅ Understanding the basic principles of Temporal Difference (TD) learning and its differences from Monte Carlo methods
- ✅ Explaining the mechanism and update equations of the Q-Learning algorithm (off-policy)
- ✅ Understanding the characteristics and application scenarios of the SARSA algorithm (on-policy)
- ✅ Balancing exploration and exploitation using ε-greedy policies
- ✅ Explaining the effects of hyperparameters: learning rate and discount factor
- ✅ Implementing solutions for OpenAI Gym's Taxi-v3 and Cliff Walking environments
2.1 Fundamentals of Temporal Difference (TD) Learning
Challenges of Monte Carlo Methods
The Monte Carlo method we learned in Chapter 1 has the constraint that it cannot update until an episode ends:
$$ V(s_t) \leftarrow V(s_t) + \alpha [G_t - V(s_t)] $$

where $G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$ is the actual return.
Basic Idea of Temporal Difference Learning
Temporal Difference (TD) learning updates the value function at each step without waiting for the episode to end:
$$ V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)] $$

where:
- $r_{t+1} + \gamma V(s_{t+1})$: TD target
- $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$: TD error
- $\alpha$: learning rate
Implementation of TD(0)
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
import numpy as np
import gym
def td_0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
"""
State value function estimation using TD(0)
Args:
env: Environment
policy: Policy (state -> action probability distribution)
num_episodes: Number of episodes
alpha: Learning rate
gamma: Discount factor
Returns:
V: State value function
"""
# Initialize state value function
V = np.zeros(env.observation_space.n)
for episode in range(num_episodes):
state, _ = env.reset()
while True:
# Select action according to policy
action = np.random.choice(env.action_space.n, p=policy[state])
# Interact with environment
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
# TD(0) update
td_target = reward + gamma * V[next_state]
td_error = td_target - V[state]
V[state] = V[state] + alpha * td_error
if done:
break
state = next_state
return V
# Example usage: FrozenLake environment
print("=== Value Function Estimation using TD(0) ===")
env = gym.make('FrozenLake-v1', is_slippery=False)
# Random policy
policy = np.ones((env.observation_space.n, env.action_space.n)) / env.action_space.n
# Execute TD(0)
V = td_0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99)
print(f"State value function:\n{V.reshape(4, 4)}")
env.close()
Comparison of Monte Carlo Methods and TD Learning
| Aspect | Monte Carlo Method | TD Learning |
|---|---|---|
| Update Timing | After episode ends | After each step |
| Return Calculation | Actual return $G_t$ | Estimated return $r + \gamma V(s')$ |
| Bias | None (unbiased) | Present (depends on initial values) |
| Variance | High | Low |
| Continuing Tasks | Not applicable | Applicable |
| Convergence Speed | Slow | Fast |
"TD learning achieves efficient learning through bootstrapping (updating using its own estimates)"
2.2 Q-Learning
Action-Value Function Q(s, a)
Instead of the state value function $V(s)$, we learn the action-value function $Q(s, a)$:
$$ Q(s, a) = \mathbb{E}[G_t | S_t = s, A_t = a] $$

This represents "the expected return after taking action $a$ in state $s$".
Q-Learning Update Equation
Q-learning applies TD learning to the action-value function:
$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] $$

Key points:
- $\max_{a'} Q(s_{t+1}, a')$: Uses the value of the best action in the next state
- Off-policy: the action actually executed (from the ε-greedy behavior policy) can differ from the greedy action used in the update target
- Can directly learn the optimal policy
Implementation of Q-Learning Algorithm
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
import numpy as np
import gym
class QLearningAgent:
"""Q-Learning Agent"""
def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
"""
Args:
n_states: Number of states
n_actions: Number of actions
alpha: Learning rate
gamma: Discount factor
epsilon: ε for ε-greedy
"""
self.Q = np.zeros((n_states, n_actions))
self.alpha = alpha
self.gamma = gamma
self.epsilon = epsilon
self.n_actions = n_actions
def select_action(self, state):
"""Select action using ε-greedy policy"""
if np.random.rand() < self.epsilon:
# Random action (exploration)
return np.random.randint(self.n_actions)
else:
# Best action (exploitation)
return np.argmax(self.Q[state])
def update(self, state, action, reward, next_state, done):
"""Update Q-value"""
if done:
# Terminal state
td_target = reward
else:
# Q-learning update equation
td_target = reward + self.gamma * np.max(self.Q[next_state])
td_error = td_target - self.Q[state, action]
self.Q[state, action] += self.alpha * td_error
def train_q_learning(env, agent, num_episodes=1000):
"""Train Q-learning"""
episode_rewards = []
for episode in range(num_episodes):
state, _ = env.reset()
total_reward = 0
while True:
# Select action
action = agent.select_action(state)
# Execute in environment
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
# Update Q-value
agent.update(state, action, reward, next_state, done)
total_reward += reward
if done:
break
state = next_state
episode_rewards.append(total_reward)
# Display progress
if (episode + 1) % 100 == 0:
avg_reward = np.mean(episode_rewards[-100:])
print(f"Episode {episode + 1}, Avg Reward: {avg_reward:.2f}")
return episode_rewards
# Example usage: FrozenLake
print("\n=== Training Q-Learning ===")
env = gym.make('FrozenLake-v1', is_slippery=False)
agent = QLearningAgent(
n_states=env.observation_space.n,
n_actions=env.action_space.n,
alpha=0.1,
gamma=0.99,
epsilon=0.1
)
rewards = train_q_learning(env, agent, num_episodes=1000)
print(f"\nLearned Q-table (partial):")
print(agent.Q[:16].reshape(4, 4, -1)[:, :, 0]) # Q-values for action 0
env.close()
Visualizing the Q-Table
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - seaborn>=0.12.0
import matplotlib.pyplot as plt
import seaborn as sns
def visualize_q_table(Q, env_shape=(4, 4)):
"""Visualize Q-table"""
n_states = Q.shape[0]
n_actions = Q.shape[1]
fig, axes = plt.subplots(1, n_actions, figsize=(16, 4))
action_names = ['LEFT', 'DOWN', 'RIGHT', 'UP']
for action in range(n_actions):
Q_action = Q[:, action].reshape(env_shape)
sns.heatmap(Q_action, annot=True, fmt='.2f', cmap='YlOrRd',
ax=axes[action], cbar=True, square=True)
axes[action].set_title(f'Q-value: {action_names[action]}')
axes[action].set_xlabel('Column')
axes[action].set_ylabel('Row')
plt.tight_layout()
plt.savefig('q_table_visualization.png', dpi=150, bbox_inches='tight')
print("Q-table saved: q_table_visualization.png")
plt.close()
# Visualize Q-table
visualize_q_table(agent.Q)
Visualizing Learning Curves
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
import matplotlib.pyplot as plt
import numpy as np
def plot_learning_curve(rewards, window=100):
"""Plot learning curve"""
# Calculate moving average
smoothed_rewards = np.convolve(rewards, np.ones(window)/window, mode='valid')
plt.figure(figsize=(12, 5))
# Episode rewards
plt.subplot(1, 2, 1)
plt.plot(rewards, alpha=0.3, label='Episode Reward')
plt.plot(range(window-1, len(rewards)), smoothed_rewards,
linewidth=2, label=f'{window}-Episode Moving Average')
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('Q-Learning Learning Curve')
plt.legend()
plt.grid(True, alpha=0.3)
# Cumulative rewards
plt.subplot(1, 2, 2)
cumulative_rewards = np.cumsum(rewards)
plt.plot(cumulative_rewards, linewidth=2, color='green')
plt.xlabel('Episode')
plt.ylabel('Cumulative Reward')
plt.title('Cumulative Reward')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('q_learning_curve.png', dpi=150, bbox_inches='tight')
print("Learning curve saved: q_learning_curve.png")
plt.close()
plot_learning_curve(rewards)
2.3 SARSA (State-Action-Reward-State-Action)
Basic Principle of SARSA
SARSA is the on-policy version of Q-learning. It updates using the action actually taken:
$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] $$

Key difference:
- Q-learning: Uses $\max_{a'} Q(s_{t+1}, a')$ (best action)
- SARSA: Uses $Q(s_{t+1}, a_{t+1})$ (action actually taken)
Comparison of Q-Learning and SARSA
| Aspect | Q-Learning | SARSA |
|---|---|---|
| Learning Type | Off-policy | On-policy |
| Update Equation | $r + \gamma \max_a Q(s', a)$ | $r + \gamma Q(s', a')$ |
| Exploration Impact | Does not affect learning | Affects learning |
| Convergence Target | Optimal policy | Value of current policy |
| Safety | Does not consider risk | Considers risk |
| Application Scenario | Simulation environments | Real environment learning |
Implementation of SARSA
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
import numpy as np
import gym
class SARSAAgent:
"""SARSA Agent"""
def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
"""
Args:
n_states: Number of states
n_actions: Number of actions
alpha: Learning rate
gamma: Discount factor
epsilon: ε for ε-greedy
"""
self.Q = np.zeros((n_states, n_actions))
self.alpha = alpha
self.gamma = gamma
self.epsilon = epsilon
self.n_actions = n_actions
def select_action(self, state):
"""Select action using ε-greedy policy"""
if np.random.rand() < self.epsilon:
return np.random.randint(self.n_actions)
else:
return np.argmax(self.Q[state])
def update(self, state, action, reward, next_state, next_action, done):
"""Update Q-value (SARSA)"""
if done:
td_target = reward
else:
# SARSA update equation (uses action actually taken next)
td_target = reward + self.gamma * self.Q[next_state, next_action]
td_error = td_target - self.Q[state, action]
self.Q[state, action] += self.alpha * td_error
def train_sarsa(env, agent, num_episodes=1000):
"""Train SARSA"""
episode_rewards = []
for episode in range(num_episodes):
state, _ = env.reset()
action = agent.select_action(state) # Initial action selection
total_reward = 0
while True:
# Execute in environment
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
if not done:
# Select next action (characteristic of SARSA)
next_action = agent.select_action(next_state)
else:
next_action = None
# Update Q-value
agent.update(state, action, reward, next_state, next_action, done)
total_reward += reward
if done:
break
state = next_state
action = next_action # Transition to next action
episode_rewards.append(total_reward)
if (episode + 1) % 100 == 0:
avg_reward = np.mean(episode_rewards[-100:])
print(f"Episode {episode + 1}, Avg Reward: {avg_reward:.2f}")
return episode_rewards
# Example usage
print("\n=== Training SARSA ===")
env = gym.make('FrozenLake-v1', is_slippery=False)
sarsa_agent = SARSAAgent(
n_states=env.observation_space.n,
n_actions=env.action_space.n,
alpha=0.1,
gamma=0.99,
epsilon=0.1
)
sarsa_rewards = train_sarsa(env, sarsa_agent, num_episodes=1000)
print(f"\nLearned Q-table (SARSA):")
print(sarsa_agent.Q[:16].reshape(4, 4, -1)[:, :, 0])
env.close()
2.4 ε-Greedy Exploration Strategy
Exploration vs Exploitation Trade-off
In reinforcement learning, balancing exploration and exploitation is crucial:
- Exploration: Try new states and actions to understand the environment
- Exploitation: Select the best action based on current knowledge
ε-Greedy Policy
The simplest exploration strategy:
$$ a = \begin{cases} \text{random action} & \text{with probability } \epsilon \\ \arg\max_a Q(s, a) & \text{with probability } 1 - \epsilon \end{cases} $$

Epsilon Decay
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
import numpy as np
import matplotlib.pyplot as plt
class EpsilonGreedy:
"""ε-greedy policy (with decay functionality)"""
def __init__(self, epsilon_start=1.0, epsilon_end=0.01, epsilon_decay=0.995):
"""
Args:
epsilon_start: Initial ε
epsilon_end: Minimum ε
epsilon_decay: Decay rate
"""
self.epsilon = epsilon_start
self.epsilon_end = epsilon_end
self.epsilon_decay = epsilon_decay
def select_action(self, Q, state, n_actions):
"""Select action"""
if np.random.rand() < self.epsilon:
return np.random.randint(n_actions)
else:
return np.argmax(Q[state])
def decay(self):
"""Decay ε"""
self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)
# Visualize epsilon decay patterns
print("\n=== Visualizing ε Decay Patterns ===")
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Different decay rates
decay_rates = [0.99, 0.995, 0.999]
for i, decay_rate in enumerate(decay_rates):
epsilon_greedy = EpsilonGreedy(epsilon_start=1.0, epsilon_end=0.01,
epsilon_decay=decay_rate)
epsilons = [epsilon_greedy.epsilon]
for _ in range(1000):
epsilon_greedy.decay()
epsilons.append(epsilon_greedy.epsilon)
axes[i].plot(epsilons, linewidth=2)
axes[i].set_xlabel('Episode')
axes[i].set_ylabel('ε')
axes[i].set_title(f'Decay Rate = {decay_rate}')
axes[i].grid(True, alpha=0.3)
axes[i].set_ylim([0, 1.1])
plt.tight_layout()
plt.savefig('epsilon_decay.png', dpi=150, bbox_inches='tight')
print("ε decay patterns saved: epsilon_decay.png")
plt.close()
Other Exploration Strategies
Softmax (Boltzmann) Exploration
Probabilistic selection based on action values:
$$ P(a | s) = \frac{\exp(Q(s,a) / \tau)}{\sum_{a'} \exp(Q(s,a') / \tau)} $$

where $\tau$ is the temperature parameter (higher values make selection more random).
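A minimal sketch of softmax action selection over one row of a Q-table is shown below; the function name softmax_action and the temperature values are illustrative choices, and subtracting the maximum is only for numerical stability:

import numpy as np

def softmax_action(Q, state, tau=1.0):
    """Sample an action with probability proportional to exp(Q(s, a) / tau)."""
    prefs = Q[state] / tau
    prefs = prefs - prefs.max()                  # numerical stability; probabilities are unchanged
    probs = np.exp(prefs) / np.sum(np.exp(prefs))
    return np.random.choice(len(probs), p=probs)

# Illustrative Q-table with a single state and four actions
Q = np.array([[1.0, 2.0, 0.5, 0.0]])
print(softmax_action(Q, state=0, tau=5.0))       # high temperature: nearly uniform
print(softmax_action(Q, state=0, tau=0.1))       # low temperature: almost always action 1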
Upper Confidence Bound (UCB)
Exploration considering uncertainty:
$$ a = \arg\max_a \left[ Q(s,a) + c \sqrt{\frac{\ln t}{N(s,a)}} \right] $$

where $N(s,a)$ is the number of times action $a$ has been selected in state $s$, and $c$ is the exploration coefficient.
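Below is a minimal sketch of UCB action selection for the tabular case; it assumes a visit-count table N with the same shape as Q, and the rule of trying untried actions first is an implementation choice rather than part of the formula:

import numpy as np

def ucb_action(Q, N, state, t, c=2.0):
    """Pick the action maximizing Q(s, a) + c * sqrt(ln t / N(s, a))."""
    counts = N[state]
    if np.any(counts == 0):                      # try every action at least once
        return int(np.argmin(counts))
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(Q[state] + bonus))

# Illustrative tables: all Q-values equal, so the bonus term decides
Q = np.zeros((1, 4))
N = np.array([[3.0, 1.0, 5.0, 2.0]])
print(ucb_action(Q, N, state=0, t=11))           # favors the least-tried action (1)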
2.5 Impact of Hyperparameters
Learning Rate α
The learning rate $\alpha$ controls the strength of updates:
- Large α (e.g., 0.5): Fast learning but unstable
- Small α (e.g., 0.01): Stable but slow convergence
- Recommended value: 0.1 - 0.3
Discount Factor γ
The discount factor $\gamma$ determines the importance of future rewards:
- γ = 0: Only considers immediate rewards (myopic)
- γ → 1: Considers distant future (far-sighted)
- Recommended value: 0.95 - 0.99
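The effect of γ is easiest to see on a fixed reward sequence. The short sketch below computes the discounted return $G = \sum_k \gamma^k r_{k+1}$ for an illustrative sequence in which a large reward arrives late:

rewards = [1, 1, 1, 1, 1, 10]        # illustrative: the +10 arrives only at the end
for gamma in [0.0, 0.5, 0.9, 0.99]:
    G = sum((gamma ** k) * r for k, r in enumerate(rewards))
    print(f"gamma = {gamma:4.2f} -> G = {G:.2f}")
# With gamma = 0 only the immediate reward counts;
# as gamma approaches 1, the delayed +10 dominates the return.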
Implementation of Hyperparameter Search
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
import numpy as np
import matplotlib.pyplot as plt
import gym
def hyperparameter_search(env_name, param_name, param_values, num_episodes=500):
"""Investigate the impact of hyperparameters"""
results = {}
for value in param_values:
print(f"\nTraining with {param_name} = {value}...")
env = gym.make(env_name)
if param_name == 'alpha':
agent = QLearningAgent(env.observation_space.n, env.action_space.n,
alpha=value, gamma=0.99, epsilon=0.1)
elif param_name == 'gamma':
agent = QLearningAgent(env.observation_space.n, env.action_space.n,
alpha=0.1, gamma=value, epsilon=0.1)
elif param_name == 'epsilon':
agent = QLearningAgent(env.observation_space.n, env.action_space.n,
alpha=0.1, gamma=0.99, epsilon=value)
rewards = train_q_learning(env, agent, num_episodes=num_episodes)
results[value] = rewards
env.close()
return results
# Investigate impact of learning rate
print("=== Investigating Impact of Learning Rate α ===")
alpha_values = [0.01, 0.05, 0.1, 0.3, 0.5]
alpha_results = hyperparameter_search('FrozenLake-v1', 'alpha', alpha_values)
# Visualization
plt.figure(figsize=(14, 5))
plt.subplot(1, 2, 1)
for alpha, rewards in alpha_results.items():
smoothed = np.convolve(rewards, np.ones(50)/50, mode='valid')
plt.plot(smoothed, label=f'α = {alpha}', linewidth=2)
plt.xlabel('Episode')
plt.ylabel('Average Reward')
plt.title('Impact of Learning Rate α')
plt.legend()
plt.grid(True, alpha=0.3)
# Investigate impact of discount factor
gamma_values = [0.5, 0.9, 0.95, 0.99, 0.999]
gamma_results = hyperparameter_search('FrozenLake-v1', 'gamma', gamma_values)
plt.subplot(1, 2, 2)
for gamma, rewards in gamma_results.items():
smoothed = np.convolve(rewards, np.ones(50)/50, mode='valid')
plt.plot(smoothed, label=f'γ = {gamma}', linewidth=2)
plt.xlabel('Episode')
plt.ylabel('Average Reward')
plt.title('Impact of Discount Factor γ')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('hyperparameter_impact.png', dpi=150, bbox_inches='tight')
print("\nHyperparameter impact saved: hyperparameter_impact.png")
plt.close()
2.6 Practice: Taxi-v3 Environment
Overview of Taxi-v3 Environment
Taxi-v3 is an environment where a taxi picks up a passenger and delivers them to a destination:
- State space: 500 states (5×5 grid × 5 passenger locations × 4 destinations)
- Action space: 6 actions (south, north, east, west, pickup, dropoff)
- Rewards: +20 for reaching correct destination, -1 per step, -10 for illegal pickup/dropoff
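To see how these 500 states are composed, the underlying Taxi environment exposes a decode method (accessed through env.unwrapped in recent Gym versions); the snippet below is a small sketch under that assumption:

import gym

env = gym.make('Taxi-v3')
state, _ = env.reset()
# decode returns (taxi_row, taxi_col, passenger_location, destination)
taxi_row, taxi_col, passenger_loc, destination = env.unwrapped.decode(state)
print(f"state {state}: taxi at ({taxi_row}, {taxi_col}), "
      f"passenger index {passenger_loc}, destination index {destination}")
env.close()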
Q-Learning on Taxi-v3
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Q-Learning on Taxi-v3
Purpose: Train a tabular Q-learning agent on Taxi-v3 and visualize its learning progress
Target: Beginner to Intermediate
Execution time: 1-5 minutes
Dependencies: QLearningAgent and train_q_learning defined earlier in this chapter
"""
import numpy as np
import gym
import matplotlib.pyplot as plt
# Taxi-v3 environment
print("=== Q-Learning on Taxi-v3 Environment ===")
env = gym.make('Taxi-v3', render_mode=None)
print(f"State space: {env.observation_space.n}")
print(f"Action space: {env.action_space.n}")
print(f"Actions: {['South', 'North', 'East', 'West', 'Pickup', 'Dropoff']}")
# Q-learning agent
taxi_agent = QLearningAgent(
n_states=env.observation_space.n,
n_actions=env.action_space.n,
alpha=0.1,
gamma=0.99,
epsilon=0.1
)
# Training
taxi_rewards = train_q_learning(env, taxi_agent, num_episodes=5000)
# Learning curve
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
smoothed = np.convolve(taxi_rewards, np.ones(100)/100, mode='valid')
plt.plot(smoothed, linewidth=2, color='blue')
plt.xlabel('Episode')
plt.ylabel('Average Reward')
plt.title('Taxi-v3 Q-Learning Learning Curve')
plt.grid(True, alpha=0.3)
# Calculate success rate
success_rate = []
window = 100
for i in range(len(taxi_rewards) - window):
success = np.sum(np.array(taxi_rewards[i:i+window]) > 0) / window
success_rate.append(success)
plt.subplot(1, 2, 2)
plt.plot(success_rate, linewidth=2, color='green')
plt.xlabel('Episode')
plt.ylabel('Success Rate')
plt.title('Task Success Rate (100-Episode Moving Average)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('taxi_training.png', dpi=150, bbox_inches='tight')
print("Taxi training results saved: taxi_training.png")
plt.close()
env.close()
Evaluating the Trained Agent
def evaluate_agent(env, agent, num_episodes=100, render=False):
"""Evaluate trained agent"""
total_rewards = []
total_steps = []
for episode in range(num_episodes):
state, _ = env.reset()
episode_reward = 0
steps = 0
while steps < 200: # Maximum steps
# Select best action (no exploration)
action = np.argmax(agent.Q[state])
state, reward, terminated, truncated, _ = env.step(action)
episode_reward += reward
steps += 1
if terminated or truncated:
break
total_rewards.append(episode_reward)
total_steps.append(steps)
return total_rewards, total_steps
# Evaluation
print("\n=== Evaluating Trained Agent ===")
env = gym.make('Taxi-v3', render_mode=None)
eval_rewards, eval_steps = evaluate_agent(env, taxi_agent, num_episodes=100)
print(f"Average reward: {np.mean(eval_rewards):.2f} ± {np.std(eval_rewards):.2f}")
print(f"Average steps: {np.mean(eval_steps):.2f} ± {np.std(eval_steps):.2f}")
print(f"Success rate: {np.sum(np.array(eval_rewards) > 0) / len(eval_rewards) * 100:.1f}%")
env.close()
2.7 Practice: Cliff Walking Environment
Definition of Cliff Walking Environment
Cliff Walking is an environment where you must reach the goal while avoiding cliffs. It clearly demonstrates the difference between Q-learning and SARSA:
- 4×12 grid: Start at bottom-left, goal at bottom-right
- Cliff area: the bottom row between start and goal (stepping into it gives a -100 penalty and returns the agent to the start)
- Rewards: -1 per step (including the final step into the goal), -100 for stepping into the cliff
Implementation on Cliff Walking Environment
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - matplotlib>=3.7.0
"""
Example: Implementation on Cliff Walking Environment
Purpose: Compare Q-learning and SARSA on the Cliff Walking environment
Target: Intermediate
Execution time: 1-5 minutes
Dependencies: QLearningAgent and SARSAAgent defined earlier in this chapter
"""
import numpy as np
import gym
import matplotlib.pyplot as plt
# Cliff Walking environment
print("=== Cliff Walking Environment ===")
env = gym.make('CliffWalking-v0')
print(f"State space: {env.observation_space.n}")
print(f"Action space: {env.action_space.n}")
print(f"Grid size: 4×12")
# Q-learning agent
cliff_q_agent = QLearningAgent(
n_states=env.observation_space.n,
n_actions=env.action_space.n,
alpha=0.5,
gamma=0.99,
epsilon=0.1
)
# SARSA agent
cliff_sarsa_agent = SARSAAgent(
n_states=env.observation_space.n,
n_actions=env.action_space.n,
alpha=0.5,
gamma=0.99,
epsilon=0.1
)
# Training
print("\nTraining with Q-learning...")
q_rewards = train_q_learning(env, cliff_q_agent, num_episodes=500)
env = gym.make('CliffWalking-v0')
print("\nTraining with SARSA...")
sarsa_rewards = train_sarsa(env, cliff_sarsa_agent, num_episodes=500)
# Comparison visualization
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
smoothed_q = np.convolve(q_rewards, np.ones(10)/10, mode='valid')
smoothed_sarsa = np.convolve(sarsa_rewards, np.ones(10)/10, mode='valid')
plt.plot(smoothed_q, label='Q-Learning', linewidth=2)
plt.plot(smoothed_sarsa, label='SARSA', linewidth=2)
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('Cliff Walking: Q-Learning vs SARSA')
plt.legend()
plt.grid(True, alpha=0.3)
# Visualize policy (display with arrows)
plt.subplot(1, 2, 2)
def visualize_policy(Q, shape=(4, 12)):
"""Visualize learned policy"""
policy = np.argmax(Q, axis=1)
policy_grid = policy.reshape(shape)
# Arrow directions
arrows = {0: '↑', 1: '→', 2: '↓', 3: '←'}
ax = plt.gca()  # draw into the current subplot rather than opening a new figure
ax.imshow(np.zeros(shape), cmap='Blues', alpha=0.3)
for i in range(shape[0]):
for j in range(shape[1]):
state = i * shape[1] + j
action = policy[state]
# Display cliff area in red
if i == 3 and 1 <= j <= 10:
ax.add_patch(plt.Rectangle((j-0.5, i-0.5), 1, 1,
fill=True, color='red', alpha=0.3))
# Goal
if i == 3 and j == 11:
ax.add_patch(plt.Rectangle((j-0.5, i-0.5), 1, 1,
fill=True, color='green', alpha=0.3))
# Arrow
ax.text(j, i, arrows[action], ha='center', va='center',
fontsize=16, fontweight='bold')
ax.set_xlim(-0.5, shape[1]-0.5)
ax.set_ylim(shape[0]-0.5, -0.5)
ax.set_xticks(range(shape[1]))
ax.set_yticks(range(shape[0]))
ax.grid(True)
ax.set_title('Learned Policy (Q-Learning)')
visualize_policy(cliff_q_agent.Q)
plt.tight_layout()
plt.savefig('cliff_walking_comparison.png', dpi=150, bbox_inches='tight')
print("\nCliff Walking comparison saved: cliff_walking_comparison.png")
plt.close()
env.close()
Path Differences between Q-Learning and SARSA
Important Observation: In Cliff Walking, Q-learning learns the shortest path (right along the cliff edge), while SARSA learns a safer path that keeps away from the cliff. The reason is that SARSA's update target includes the ε-greedy exploratory actions, so occasional falls off the cliff lower its value estimates for states next to the edge, whereas Q-learning's max-based target ignores exploration and values the edge path by its greedy return.
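This difference can be checked directly from the two learned Q-tables. The sketch below rolls out each agent's greedy policy once (no exploration) and prints the visited grid cells; it reuses cliff_q_agent and cliff_sarsa_agent from the code above, and converts the state index to (row, column) using the 4×12 layout:

import numpy as np
import gym

def greedy_path(Q, max_steps=50):
    """Roll out the greedy policy in CliffWalking-v0 and return the visited (row, col) cells."""
    env = gym.make('CliffWalking-v0')
    state, _ = env.reset()
    path = [divmod(state, 12)]                   # state index = row * 12 + col
    for _ in range(max_steps):
        state, _, terminated, truncated, _ = env.step(int(np.argmax(Q[state])))
        path.append(divmod(state, 12))
        if terminated or truncated:
            break
    env.close()
    return path

print("Q-learning greedy path:", greedy_path(cliff_q_agent.Q))
print("SARSA greedy path:     ", greedy_path(cliff_sarsa_agent.Q))
# Typically the Q-learning path runs along row 3 next to the cliff,
# while the SARSA path detours through the upper rows.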
Exercises
Exercise 1: Comparing Convergence Speed of Q-Learning and SARSA
Train Q-learning and SARSA on the FrozenLake environment with the same hyperparameters and compare their convergence speeds.
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: Train Q-learning and SARSA on the FrozenLake environment with the same hyperparameters
Purpose: Compare the convergence speed of Q-learning and SARSA
Target: Beginner to Intermediate
Execution time: 1-5 minutes
Dependencies: None
"""
import gym
import numpy as np
# Exercise: Train Q-learning and SARSA with same settings
# Exercise: Plot episode rewards
# Exercise: Compare number of episodes needed for convergence
# Expected: Convergence speed differs depending on environment
Exercise 2: Optimizing ε Decay Schedules
Implement different ε decay patterns (linear decay, exponential decay, step decay) and compare their performance on Taxi-v3.
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: Implement different ε decay patterns (linear, exponential, and step decay) and compare them on Taxi-v3
Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: None
"""
import numpy as np
# Exercise: Implement 3 types of ε decay schedules
# Exercise: Evaluate each schedule on Taxi-v3
# Exercise: Compare learning curves and final performance
# Hint: Emphasize exploration early, exploitation later
Exercise 3: Implementing Double Q-Learning
Implement Double Q-Learning to prevent overestimation and compare performance with standard Q-learning.
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: Implement Double Q-Learning to prevent overestimation and compare it with standard Q-learning
Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner to Intermediate
Execution time: 1-5 minutes
Dependencies: None
"""
import numpy as np
# Exercise: Implement Double Q-Learning using two Q-tables
# Exercise: Train on FrozenLake environment
# Exercise: Compare Q-value estimation error with standard Q-learning
# Theory: Double algorithm reduces overestimation bias
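As a hint rather than a full solution, here is a minimal sketch of the Double Q-learning update step; it assumes two tables Q1 and Q2 of shape (n_states, n_actions) and leaves the ε-greedy behavior policy and the training loop to the exercise:

import numpy as np

def double_q_update(Q1, Q2, state, action, reward, next_state, done,
                    alpha=0.1, gamma=0.99):
    """One Double Q-learning update: select the action with one table, evaluate it with the other."""
    # Randomly choose which table to update this step
    if np.random.rand() < 0.5:
        Q_update, Q_eval = Q1, Q2
    else:
        Q_update, Q_eval = Q2, Q1
    if done:
        td_target = reward
    else:
        best_action = np.argmax(Q_update[next_state])                    # selection
        td_target = reward + gamma * Q_eval[next_state, best_action]     # evaluation
    Q_update[state, action] += alpha * (td_target - Q_update[state, action])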
Exercise 4: Adaptive Learning Rate Adjustment
Implement an adaptive learning rate that adjusts based on visit count and compare it with fixed learning rate.
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: Implement an adaptive learning rate based on visit counts and compare it with a fixed learning rate
Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: None
"""
import numpy as np
# Exercise: Implement adaptive learning rate α(s,a) = 1 / (1 + N(s,a))
# Exercise: Compare performance with fixed learning rate
# Exercise: Visualize visit counts for each state
# Expected: Adaptive learning rate leads to more stable convergence
Exercise 5: Experiments on Custom Environments
Discretize the state space for other OpenAI Gym environments (CartPole-v1, MountainCar-v0, etc.) and apply Q-learning.
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: Discretize the state space of other Gym environments (CartPole-v1, MountainCar-v0, etc.) and apply Q-learning
Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: None
"""
import gym
import numpy as np
# Exercise: Implement function to discretize continuous state space
# Exercise: Apply Q-learning on discretized CartPole environment
# Exercise: Investigate relationship between discretization granularity and performance
# Challenge: Appropriate discretization of continuous space is critical
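As a starting point, the sketch below shows one common way to discretize a continuous observation with np.digitize; the bin edges for CartPole-v1 are illustrative choices, not tuned values:

import numpy as np

# Illustrative bin edges for CartPole-v1's 4-dimensional observation
# (cart position, cart velocity, pole angle, pole angular velocity)
bins = [
    np.linspace(-2.4, 2.4, 9),
    np.linspace(-3.0, 3.0, 9),
    np.linspace(-0.21, 0.21, 9),
    np.linspace(-3.0, 3.0, 9),
]

def discretize(observation, bins):
    """Map a continuous observation to a single integer state index."""
    indices = [np.digitize(x, b) for x, b in zip(observation, bins)]
    state = 0
    for idx, b in zip(indices, bins):
        state = state * (len(b) + 1) + idx       # mixed-radix encoding of the bin indices
    return state

print(discretize([0.1, -0.5, 0.02, 1.0], bins))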
Summary
In this chapter, we learned Q-learning and SARSA based on temporal difference learning.
Key Points
- TD Learning: Updates at each step without waiting for episode end
- Q-Learning: Off-policy type, directly learns optimal policy
- SARSA: On-policy type, learning that considers exploration impact
- ε-greedy: Simple policy that controls exploration-exploitation balance
- Learning rate α: Controls update strength (0.1-0.3 recommended)
- Discount factor γ: Importance of future rewards (0.95-0.99 recommended)
- Q-table: Stores value for each state-action pair
- Application: Tasks with discrete state and action spaces
When to Use Q-Learning vs SARSA
| Situation | Recommended Algorithm | Reason |
|---|---|---|
| Simulation environment | Q-Learning | Efficiently learns optimal policy |
| Real environment/Robotics | SARSA | Learns safe policy |
| Dangerous states present | SARSA | Risk-averse tendency |
| Fast convergence needed | Q-Learning | Learns the optimal policy directly, independent of the exploration policy |
Next Steps
In the next chapter, we will learn about Deep Q-Network (DQN). We will master techniques for approximating action-value functions using neural networks for large-scale and continuous state spaces that cannot be handled by Q-tables, including Experience Replay, Target Network, and applications to Atari games.