This chapter covers optimization theory: how to formulate optimization problems, how gradient descent and its variants work, and how Lagrange multipliers and KKT conditions handle constraints.
Learning Objectives
By reading this chapter, you will be able to:
- ✅ Understand the formulation of optimization problems and the concept of convexity
- ✅ Master the principles and implementation methods of gradient descent
- ✅ Choose appropriate optimization algorithms such as Momentum and Adam
- ✅ Understand Lagrange multipliers and KKT conditions
- ✅ Understand the theory and practice of convex optimization
- ✅ Implement optimization of machine learning models
3.1 Fundamentals of Optimization
Formulation of Optimization Problems
An Optimization Problem is a problem of minimizing or maximizing an objective function.
$$ \begin{aligned} \min_{\mathbf{x}} \quad & f(\mathbf{x}) \\ \text{subject to} \quad & g_i(\mathbf{x}) \leq 0, \quad i = 1, \ldots, m \\ & h_j(\mathbf{x}) = 0, \quad j = 1, \ldots, p \end{aligned} $$
- $f(\mathbf{x})$: Objective function (to be minimized)
- $g_i(\mathbf{x}) \leq 0$: Inequality constraints
- $h_j(\mathbf{x}) = 0$: Equality constraints
Convex Functions and Convex Sets
Convex Set: The line segment connecting any two points of the set lies entirely within the set
$$ \mathbf{x}, \mathbf{y} \in C, \ \theta \in [0, 1] \Rightarrow \theta \mathbf{x} + (1-\theta) \mathbf{y} \in C $$
Convex Function: The function value at any point between two points lies on or below the chord (line segment) connecting them
$$ f(\theta \mathbf{x} + (1-\theta) \mathbf{y}) \leq \theta f(\mathbf{x}) + (1-\theta) f(\mathbf{y}) $$
Importance: In convex optimization problems, every local optimum is also a global optimum
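The defining inequality can be spot-checked numerically. A minimal sketch (the test functions and sampling range here are illustrative; random sampling can disprove convexity with a single counterexample, but can never prove it):

```python
import numpy as np

def is_convex_on_samples(f, n_trials=1000, low=-5.0, high=5.0, seed=0):
    """Randomly test f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y).

    Passing every trial does not prove convexity; one failure disproves it.
    """
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        x, y = rng.uniform(low, high, size=2)
        theta = rng.uniform(0.0, 1.0)
        lhs = f(theta * x + (1 - theta) * y)
        rhs = theta * f(x) + (1 - theta) * f(y)
        if lhs > rhs + 1e-9:
            return False
    return True

print(is_convex_on_samples(lambda t: t**2))   # quadratic: convex
print(is_convex_on_samples(lambda t: -t**2))  # concave: not convex
```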
Visualization of Convexity
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Visualization of Convexity
Purpose: Visualize a convex and a non-convex function in 3D
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Example of convex function: quadratic function
def convex_function(x, y):
return x**2 + y**2
# Example of non-convex function: Himmelblau function
def non_convex_function(x, y):
return (x**2 + y - 11)**2 + (x + y**2 - 7)**2
# Create grid
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z_convex = convex_function(X, Y)
Z_non_convex = non_convex_function(X, Y)
# Visualization
fig = plt.figure(figsize=(15, 6))
# Convex function
ax1 = fig.add_subplot(121, projection='3d')
ax1.plot_surface(X, Y, Z_convex, cmap='viridis', alpha=0.8)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_zlabel('f(x, y)')
ax1.set_title('Convex Function: $f(x, y) = x^2 + y^2$', fontsize=14)
# Non-convex function
ax2 = fig.add_subplot(122, projection='3d')
ax2.plot_surface(X, Y, Z_non_convex, cmap='plasma', alpha=0.8)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_zlabel('f(x, y)')
ax2.set_title('Non-Convex Function: Himmelblau Function', fontsize=14)
plt.tight_layout()
plt.show()
print("=== Verification of Convexity ===")
print("Convex function: Single global optimal solution (origin)")
print("Non-convex function: Multiple local optimal solutions exist")
Gradient and Hessian
Gradient: Vector of first-order partial derivatives; it points in the direction of steepest ascent
$$ \nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} $$
Hessian Matrix: Matrix of second-order partial derivatives
$$ \mathbf{H}(f) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} $$
Optimality Conditions
First-order condition (necessary condition):
$$\nabla f(\mathbf{x}^*) = \mathbf{0}$$
Second-order condition (sufficient, together with the first-order condition):
$$\mathbf{H}(f)(\mathbf{x}^*) \succ 0 \quad \text{(positive definite; positive semidefiniteness is only a necessary condition)}$$
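These conditions can be checked concretely. A minimal sketch for the illustrative function $f(x, y) = x^2 + 2y^2$, whose unique minimizer is the origin:

```python
import numpy as np

# Illustrative function: f(x, y) = x^2 + 2y^2, minimized at the origin.
def grad(p):
    x, y = p
    return np.array([2 * x, 4 * y])

def hessian(p):
    # The Hessian of a quadratic function is constant
    return np.array([[2.0, 0.0],
                     [0.0, 4.0]])

candidate = np.array([0.0, 0.0])

# First-order (necessary) condition: the gradient vanishes
first_order = np.allclose(grad(candidate), 0.0)

# Second-order (sufficient) condition: all Hessian eigenvalues are positive
eigvals = np.linalg.eigvalsh(hessian(candidate))
second_order = bool(np.all(eigvals > 0))

print(first_order, second_order)  # True True
```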
3.2 Gradient Descent
Principles of Gradient Descent
Gradient Descent searches for the optimal solution by iteratively moving in the opposite direction of the gradient.
$$ \mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \nabla f(\mathbf{x}_t) $$
- $\alpha$: Learning rate (step size)
- $\nabla f(\mathbf{x}_t)$: Gradient at the current position
Choosing the Learning Rate
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Choosing the Learning Rate
Purpose: Show how the learning rate affects gradient-descent convergence
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""
import numpy as np
import matplotlib.pyplot as plt
# Objective function: f(x) = x^2 + 4x + 4
def f(x):
return x**2 + 4*x + 4
# Gradient: f'(x) = 2x + 4
def grad_f(x):
return 2*x + 4
# Gradient descent
def gradient_descent(x0, lr, n_iterations):
x = x0
trajectory = [x]
for _ in range(n_iterations):
x = x - lr * grad_f(x)
trajectory.append(x)
return np.array(trajectory)
# Experiments with different learning rates
learning_rates = [0.1, 0.5, 0.9, 1.1]
x0 = 5.0
n_iter = 20
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()
x_range = np.linspace(-3, 6, 100)
y_range = f(x_range)
for i, lr in enumerate(learning_rates):
trajectory = gradient_descent(x0, lr, n_iter)
axes[i].plot(x_range, y_range, 'b-', linewidth=2, label='$f(x) = x^2 + 4x + 4$')
axes[i].plot(trajectory, f(trajectory), 'ro-', markersize=6,
linewidth=1.5, alpha=0.7, label='Optimization Trajectory')
axes[i].plot(-2, 0, 'g*', markersize=20, label='Optimal Solution')
axes[i].set_xlabel('x', fontsize=12)
axes[i].set_ylabel('f(x)', fontsize=12)
axes[i].set_title(f'Learning Rate α = {lr}', fontsize=14)
axes[i].legend()
axes[i].grid(True, alpha=0.3)
# Convergence determination
if abs(trajectory[-1] - (-2)) < 0.01:
status = "✓ Converged"
elif lr >= 1.0:
status = "✗ Diverged"
else:
status = "△ Slow Convergence"
axes[i].text(0.05, 0.95, status, transform=axes[i].transAxes,
fontsize=12, verticalalignment='top',
bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
plt.tight_layout()
plt.show()
print("=== Impact of Learning Rate ===")
print("α = 0.1: Slow convergence (error shrinks by factor |1-2α| = 0.8 per step)")
print("α = 0.5: Converges in a single step (1-2α = 0)")
print("α = 0.9: Oscillating convergence (factor -0.8 per step)")
print("α = 1.1: Divergence (|1-2α| = 1.2 > 1)")
Stochastic Gradient Descent (SGD)
Batch Gradient Descent: computes the gradient on the entire dataset (accurate but slow per update)
Stochastic Gradient Descent (SGD): computes the gradient on a single sample (fast but noisy)
Mini-batch Gradient Descent: computes the gradient on small batches (a balance of both)
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Comparison of batch, stochastic, and mini-batch gradient descent
Purpose: Compare convergence behavior on a linear regression problem
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
import numpy as np
import matplotlib.pyplot as plt
# Generate sample data (linear regression)
np.random.seed(42)
n_samples = 100
X = 2 * np.random.rand(n_samples, 1)
y = 4 + 3 * X + np.random.randn(n_samples, 1)
# Initialize parameters
theta_batch = np.random.randn(2, 1)
theta_sgd = theta_batch.copy()
theta_minibatch = theta_batch.copy()
# Add bias term
X_b = np.c_[np.ones((n_samples, 1)), X]
# Hyperparameters
n_epochs = 50
learning_rate = 0.01
batch_size = 10
# Loss function (MSE)
def compute_loss(X, y, theta):
m = len(y)
predictions = X.dot(theta)
loss = (1/(2*m)) * np.sum((predictions - y)**2)
return loss
# Gradient calculation
def compute_gradient(X, y, theta):
m = len(y)
predictions = X.dot(theta)
gradient = (1/m) * X.T.dot(predictions - y)
return gradient
# Training history
history_batch = []
history_sgd = []
history_minibatch = []
# Batch gradient descent
for epoch in range(n_epochs):
gradient = compute_gradient(X_b, y, theta_batch)
theta_batch -= learning_rate * gradient
history_batch.append(compute_loss(X_b, y, theta_batch))
# Stochastic gradient descent
for epoch in range(n_epochs):
for i in range(n_samples):
random_index = np.random.randint(n_samples)
xi = X_b[random_index:random_index+1]
yi = y[random_index:random_index+1]
gradient = compute_gradient(xi, yi, theta_sgd)
theta_sgd -= learning_rate * gradient
history_sgd.append(compute_loss(X_b, y, theta_sgd))
# Mini-batch gradient descent
for epoch in range(n_epochs):
shuffled_indices = np.random.permutation(n_samples)
X_shuffled = X_b[shuffled_indices]
y_shuffled = y[shuffled_indices]
for i in range(0, n_samples, batch_size):
xi = X_shuffled[i:i+batch_size]
yi = y_shuffled[i:i+batch_size]
gradient = compute_gradient(xi, yi, theta_minibatch)
theta_minibatch -= learning_rate * gradient
history_minibatch.append(compute_loss(X_b, y, theta_minibatch))
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
# Learning curves
axes[0].plot(history_batch, label='Batch GD', linewidth=2)
axes[0].plot(history_sgd, label='SGD', alpha=0.7, linewidth=2)
axes[0].plot(history_minibatch, label='Mini-batch GD', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss (MSE)', fontsize=12)
axes[0].set_title('Comparison of Learning Curves', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
# Regression lines
axes[1].scatter(X, y, alpha=0.5, label='Data')
x_plot = np.array([[0], [2]])
x_plot_b = np.c_[np.ones((2, 1)), x_plot]
axes[1].plot(x_plot, x_plot_b.dot(theta_batch), 'r-',
             linewidth=2, label='Batch GD')
axes[1].plot(x_plot, x_plot_b.dot(theta_sgd), 'g--',
             linewidth=2, label='SGD')
axes[1].plot(x_plot, x_plot_b.dot(theta_minibatch), 'b:',
             linewidth=2, label='Mini-batch GD')
axes[1].set_xlabel('X', fontsize=12)
axes[1].set_ylabel('y', fontsize=12)
axes[1].set_title('Learned Regression Lines', fontsize=14)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("=== Final Parameters ===")
print(f"Batch GD: θ0={theta_batch[0][0]:.3f}, θ1={theta_batch[1][0]:.3f}")
print(f"SGD: θ0={theta_sgd[0][0]:.3f}, θ1={theta_sgd[1][0]:.3f}")
print(f"Mini-batch GD: θ0={theta_minibatch[0][0]:.3f}, θ1={theta_minibatch[1][0]:.3f}")
print(f"\nTrue values: θ0=4.000, θ1=3.000")
Advanced Optimization Algorithms
Momentum
Uses a moving average of gradients to suppress oscillations.
$$ \begin{aligned} \mathbf{v}_{t+1} &= \beta \mathbf{v}_t - \alpha \nabla f(\mathbf{x}_t) \\ \mathbf{x}_{t+1} &= \mathbf{x}_t + \mathbf{v}_{t+1} \end{aligned} $$
Adam (Adaptive Moment Estimation)
Maintains estimates of the first and second moments of the gradient and adapts the step size for each parameter.
$$ \begin{aligned} \mathbf{m}_t &= \beta_1 \mathbf{m}_{t-1} + (1-\beta_1) \nabla f(\mathbf{x}_t) \\ \mathbf{v}_t &= \beta_2 \mathbf{v}_{t-1} + (1-\beta_2) (\nabla f(\mathbf{x}_t))^2 \\ \hat{\mathbf{m}}_t &= \frac{\mathbf{m}_t}{1-\beta_1^t} \\ \hat{\mathbf{v}}_t &= \frac{\mathbf{v}_t}{1-\beta_2^t} \\ \mathbf{x}_{t+1} &= \mathbf{x}_t - \alpha \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} \end{aligned} $$
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Comparison of SGD, Momentum, and Adam on the Rosenbrock function
Purpose: Visualize optimizer trajectories on a non-convex function
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""
import numpy as np
import matplotlib.pyplot as plt
# Rosenbrock function (difficult non-convex optimization function)
def rosenbrock(x, y):
return (1 - x)**2 + 100 * (y - x**2)**2
def rosenbrock_grad(x, y):
dx = -2 * (1 - x) - 400 * x * (y - x**2)
dy = 200 * (y - x**2)
return np.array([dx, dy])
# Implementation of optimization algorithms
def sgd(start, grad_func, lr=0.001, n_iterations=1000):
x = start.copy()
trajectory = [x.copy()]
for _ in range(n_iterations):
grad = grad_func(x[0], x[1])
x -= lr * grad
trajectory.append(x.copy())
return np.array(trajectory)
def momentum(start, grad_func, lr=0.001, beta=0.9, n_iterations=1000):
x = start.copy()
v = np.zeros_like(x)
trajectory = [x.copy()]
for _ in range(n_iterations):
grad = grad_func(x[0], x[1])
v = beta * v - lr * grad
x += v
trajectory.append(x.copy())
return np.array(trajectory)
def adam(start, grad_func, lr=0.01, beta1=0.9, beta2=0.999,
epsilon=1e-8, n_iterations=1000):
x = start.copy()
m = np.zeros_like(x)
v = np.zeros_like(x)
trajectory = [x.copy()]
for t in range(1, n_iterations + 1):
grad = grad_func(x[0], x[1])
m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * (grad ** 2)
m_hat = m / (1 - beta1 ** t)
v_hat = v / (1 - beta2 ** t)
x -= lr * m_hat / (np.sqrt(v_hat) + epsilon)
trajectory.append(x.copy())
return np.array(trajectory)
# Execute optimization
start_point = np.array([-1.0, 1.0])
n_iter = 500
traj_sgd = sgd(start_point, rosenbrock_grad, lr=0.0005, n_iterations=n_iter)
traj_momentum = momentum(start_point, rosenbrock_grad, lr=0.0005, n_iterations=n_iter)
traj_adam = adam(start_point, rosenbrock_grad, lr=0.01, n_iterations=n_iter)
# Contour plot
x = np.linspace(-1.5, 1.5, 100)
y = np.linspace(-0.5, 2.5, 100)
X, Y = np.meshgrid(x, y)
Z = rosenbrock(X, Y)
plt.figure(figsize=(15, 5))
# SGD
plt.subplot(131)
plt.contour(X, Y, Z, levels=np.logspace(-1, 3, 20), cmap='viridis', alpha=0.6)
plt.plot(traj_sgd[:, 0], traj_sgd[:, 1], 'r.-', markersize=3,
linewidth=1, alpha=0.7, label='SGD')
plt.plot(1, 1, 'g*', markersize=20, label='Optimal Solution')
plt.xlabel('x')
plt.ylabel('y')
plt.title('SGD', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
# Momentum
plt.subplot(132)
plt.contour(X, Y, Z, levels=np.logspace(-1, 3, 20), cmap='viridis', alpha=0.6)
plt.plot(traj_momentum[:, 0], traj_momentum[:, 1], 'b.-', markersize=3,
linewidth=1, alpha=0.7, label='Momentum')
plt.plot(1, 1, 'g*', markersize=20, label='Optimal Solution')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Momentum', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
# Adam
plt.subplot(133)
plt.contour(X, Y, Z, levels=np.logspace(-1, 3, 20), cmap='viridis', alpha=0.6)
plt.plot(traj_adam[:, 0], traj_adam[:, 1], 'm.-', markersize=3,
linewidth=1, alpha=0.7, label='Adam')
plt.plot(1, 1, 'g*', markersize=20, label='Optimal Solution')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Adam', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("=== Comparison of Optimization Algorithms ===")
print(f"SGD Final Position: ({traj_sgd[-1][0]:.4f}, {traj_sgd[-1][1]:.4f})")
print(f"Momentum Final Position: ({traj_momentum[-1][0]:.4f}, {traj_momentum[-1][1]:.4f})")
print(f"Adam Final Position: ({traj_adam[-1][0]:.4f}, {traj_adam[-1][1]:.4f})")
print(f"True Optimal Solution: (1.0000, 1.0000)")
Comparison of Optimization Algorithms
| Algorithm | Advantages | Disadvantages | Recommended Usage |
|---|---|---|---|
| SGD | Simple, memory efficient | Slow convergence, oscillations | Large-scale data |
| Momentum | Suppresses oscillations, fast convergence | Inertia causes overshooting | Valley-shaped functions |
| AdaGrad | Adjusts learning rate per feature | Learning rate decreases rapidly | Sparse data |
| RMSprop | Mitigates learning rate decay | Requires parameter tuning | RNN |
| Adam | Adaptive, high performance | May generalize worse than SGD in some settings | General purpose (default) |
3.3 Constrained Optimization
Lagrange Multipliers
Equality-constrained optimization:
$$ \begin{aligned} \min_{\mathbf{x}} \quad & f(\mathbf{x}) \\ \text{subject to} \quad & h(\mathbf{x}) = 0 \end{aligned} $$
Lagrange Function:
$$ \mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda h(\mathbf{x}) $$
Optimality Conditions:
$$ \begin{aligned} \nabla_{\mathbf{x}} \mathcal{L} &= 0 \\ \nabla_{\lambda} \mathcal{L} &= 0 \end{aligned} $$
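Solving the system $\nabla \mathcal{L} = 0$ symbolically recovers both the optimum and the multiplier. A minimal sketch (assuming sympy is installed) for the problem $\min (x-2)^2 + (y-1)^2$ subject to $x + y = 2$, the same problem solved numerically in the next example:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)

f = (x - 2)**2 + (y - 1)**2   # objective
h = x + y - 2                 # equality constraint h(x, y) = 0
L = f + lam * h               # Lagrange function

# Stationarity in x, y, and the multiplier yields the constrained optimum
sol = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)[0]
print(sol[x], sol[y], sol[lam])  # 3/2 1/2 1
```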
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Equality-constrained optimization with Lagrange multipliers
Purpose: Solve and visualize an equality-constrained problem using SLSQP
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize
# Objective function: f(x, y) = (x - 2)^2 + (y - 1)^2
def objective(x):
return (x[0] - 2)**2 + (x[1] - 1)**2
# Equality constraint: h(x, y) = x + y - 2 = 0
def constraint_eq(x):
return x[0] + x[1] - 2
# Define constraints in dictionary format
constraints = {'type': 'eq', 'fun': constraint_eq}
# Initial point
x0 = np.array([0.0, 0.0])
# Optimization
result = minimize(objective, x0, method='SLSQP', constraints=constraints)
# Visualization
x_range = np.linspace(-1, 4, 100)
y_range = np.linspace(-1, 4, 100)
X, Y = np.meshgrid(x_range, y_range)
Z = (X - 2)**2 + (Y - 1)**2
plt.figure(figsize=(10, 8))
plt.contour(X, Y, Z, levels=20, cmap='viridis', alpha=0.6)
plt.colorbar(label='$f(x, y)$')
# Constraint line
x_constraint = np.linspace(-1, 4, 100)
y_constraint = 2 - x_constraint
plt.plot(x_constraint, y_constraint, 'r-', linewidth=3, label='Constraint: $x + y = 2$')
# Optimal point
plt.plot(result.x[0], result.x[1], 'r*', markersize=20, label='Optimal Solution')
# Unconstrained optimal point
plt.plot(2, 1, 'g*', markersize=20, label='Unconstrained Optimal Solution')
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Equality-Constrained Optimization with Lagrange Multipliers', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.tight_layout()
plt.show()
print("=== Lagrange Multiplier Method Results ===")
print(f"Optimal solution: x = {result.x[0]:.4f}, y = {result.x[1]:.4f}")
print(f"Objective function value: f(x*) = {result.fun:.4f}")
print(f"Constraint satisfaction: h(x*) = {constraint_eq(result.x):.6f}")
print(f"Unconstrained optimal solution: x = 2.0, y = 1.0, f = 0.0")
KKT Conditions
Necessary conditions for optimality in inequality-constrained optimization (under suitable constraint qualifications)
$$ \begin{aligned} \min_{\mathbf{x}} \quad & f(\mathbf{x}) \\ \text{subject to} \quad & g_i(\mathbf{x}) \leq 0 \end{aligned} $$
KKT Conditions (Karush-Kuhn-Tucker Conditions):
- Stationarity: $\nabla f(\mathbf{x}^*) + \sum_i \mu_i \nabla g_i(\mathbf{x}^*) = 0$
- Primal feasibility: $g_i(\mathbf{x}^*) \leq 0$
- Dual feasibility: $\mu_i \geq 0$
- Complementarity: $\mu_i g_i(\mathbf{x}^*) = 0$
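A minimal sketch checking all four conditions by hand for the illustrative one-dimensional problem $\min (x-1)^2$ subject to $x \geq 2$, written in standard form with $g(x) = 2 - x \leq 0$ (this toy problem is not from the text):

```python
# Illustrative 1-D problem: minimize f(x) = (x - 1)^2 subject to x >= 2.
# The constraint is active at the optimum x* = 2 with multiplier mu = 2.
x_star, mu = 2.0, 2.0

grad_f = 2 * (x_star - 1)   # f'(x*) = 2
grad_g = -1.0               # g'(x)  = -1
g_val = 2 - x_star          # g(x*)  = 0  (active constraint)

print(abs(grad_f + mu * grad_g) < 1e-12)  # stationarity
print(g_val <= 0)                         # primal feasibility
print(mu >= 0)                            # dual feasibility
print(abs(mu * g_val) < 1e-12)            # complementarity
```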
Application to SVM
Support Vector Machine (SVM) is formulated as a constrained optimization problem.
$$ \begin{aligned} \min_{\mathbf{w}, b} \quad & \frac{1}{2} \|\mathbf{w}\|^2 \\ \text{subject to} \quad & y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1, \quad i = 1, \ldots, n \end{aligned} $$
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: SVM as a constrained optimization problem
Purpose: Visualize the maximum-margin boundary and support vectors
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_blobs
# Generate linearly separable data
np.random.seed(42)
X, y = make_blobs(n_samples=100, centers=2, n_features=2,
cluster_std=1.0, center_box=(-5, 5))
# SVM model (linear kernel)
svm = SVC(kernel='linear', C=1000) # Large C approximates hard margin
svm.fit(X, y)
# Visualization of decision boundary
def plot_svm_decision_boundary(ax, X, y, model):
# Create grid
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
np.linspace(y_min, y_max, 100))
# Decision boundary
Z = model.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot
ax.contour(xx, yy, Z, levels=[-1, 0, 1],
linestyles=['--', '-', '--'], colors=['r', 'k', 'b'], linewidths=2)
ax.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm',
s=50, edgecolors='k', alpha=0.7)
# Support vectors
ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
s=200, facecolors='none', edgecolors='g', linewidths=2,
label='Support Vectors')
ax.set_xlabel('$x_1$', fontsize=12)
ax.set_ylabel('$x_2$', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)
plt.figure(figsize=(10, 8))
ax = plt.gca()
plot_svm_decision_boundary(ax, X, y, svm)
plt.title('Linear Separation with SVM (Optimization via KKT Conditions)', fontsize=14)
plt.tight_layout()
plt.show()
print("=== SVM Optimization Results ===")
print(f"Weight vector w: {svm.coef_[0]}")
print(f"Bias b: {svm.intercept_[0]:.4f}")
print(f"Number of support vectors: {len(svm.support_vectors_)}")
print(f"Margin: {2 / np.linalg.norm(svm.coef_):.4f}")
3.4 Convex Optimization
Properties of Convex Optimization
Important Property: In convex optimization problems, every local optimum is also a global optimum
This allows efficient algorithms to reliably find optimal solutions.
Linear Programming
$$ \begin{aligned} \min_{\mathbf{x}} \quad & \mathbf{c}^T \mathbf{x} \\ \text{subject to} \quad & \mathbf{A} \mathbf{x} \leq \mathbf{b} \\ & \mathbf{x} \geq 0 \end{aligned} $$
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Linear programming with scipy.optimize.linprog
Purpose: Solve and visualize a two-variable linear program
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
import numpy as np
from scipy.optimize import linprog
import matplotlib.pyplot as plt
# Linear programming problem
# Objective function: minimize -x - 2y (= maximize x + 2y)
c = [-1, -2]
# Inequality constraints: A_ub * x <= b_ub
# Constraint 1: x + y <= 4
# Constraint 2: 2x + y <= 5
# Constraint 3: x >= 0, y >= 0 (specified with bounds)
A_ub = np.array([[1, 1],
[2, 1]])
b_ub = np.array([4, 5])
# Variable bounds
bounds = [(0, None), (0, None)]
# Optimization
result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method='highs')
# Visualization
x = np.linspace(0, 5, 100)
plt.figure(figsize=(10, 8))
# Visualize constraints
y1 = 4 - x
y2 = 5 - 2*x
plt.plot(x, y1, 'r-', linewidth=2, label=r'$x + y \leq 4$')
plt.plot(x, y2, 'b-', linewidth=2, label=r'$2x + y \leq 5$')
plt.axhline(y=0, color='k', linewidth=0.5)
plt.axvline(x=0, color='k', linewidth=0.5)
# Feasible region
x_fill = np.linspace(0, 2.5, 100)
y_upper = np.minimum(4 - x_fill, 5 - 2*x_fill)
y_upper = np.maximum(y_upper, 0)
plt.fill_between(x_fill, 0, y_upper, alpha=0.3, color='green',
label='Feasible Region')
# Objective function contours
X, Y = np.meshgrid(np.linspace(0, 5, 100), np.linspace(0, 5, 100))
Z = X + 2*Y
plt.contour(X, Y, Z, levels=10, alpha=0.3, cmap='viridis')
# Optimal solution
plt.plot(result.x[0], result.x[1], 'r*', markersize=20, label='Optimal Solution')
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Linear Programming', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.xlim(0, 5)
plt.ylim(0, 5)
plt.tight_layout()
plt.show()
print("=== Linear Programming Results ===")
print(f"Optimal solution: x = {result.x[0]:.4f}, y = {result.x[1]:.4f}")
print(f"Objective function value (maximization): {-result.fun:.4f}")
print(f"Optimization successful: {result.success}")
Quadratic Programming
$$ \begin{aligned} \min_{\mathbf{x}} \quad & \frac{1}{2} \mathbf{x}^T \mathbf{Q} \mathbf{x} + \mathbf{c}^T \mathbf{x} \\ \text{subject to} \quad & \mathbf{A} \mathbf{x} \leq \mathbf{b} \end{aligned} $$
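For an unconstrained quadratic program with positive definite $\mathbf{Q}$, the first-order condition gives the closed form $\mathbf{Q}\mathbf{x}^* = -\mathbf{c}$, which a general-purpose solver should reproduce. A minimal sketch (assuming scipy is available; the matrices are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative unconstrained QP: minimize (1/2) x^T Q x + c^T x
Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])   # symmetric positive definite
c = np.array([1.0, -2.0])

def qp_objective(x):
    return 0.5 * x @ Q @ x + c @ x

# With no constraints, setting the gradient to zero gives Q x* = -c
x_closed_form = np.linalg.solve(Q, -c)

# A numerical solver should agree with the closed-form answer
result = minimize(qp_objective, x0=np.zeros(2))
print(np.allclose(result.x, x_closed_form, atol=1e-4))  # True
```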
Convex Optimization with CVXPY
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Convex Optimization with CVXPY
Purpose: Solve a portfolio optimization (quadratic programming) problem
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
import cvxpy as cp
import numpy as np
import matplotlib.pyplot as plt
# Portfolio optimization problem
# Objective: Minimize risk while achieving a target expected return
# Expected returns and covariance matrix of assets (sample data)
np.random.seed(42)
n_assets = 5
expected_returns = np.random.uniform(0.05, 0.15, n_assets)
cov_matrix = np.random.randn(n_assets, n_assets)
cov_matrix = cov_matrix @ cov_matrix.T / 100 # Positive definite matrix
# Variable: Asset allocation
w = cp.Variable(n_assets)
# Parameter: Target return
target_return = 0.10
# Objective function: Minimize risk (variance)
risk = cp.quad_form(w, cov_matrix)
# Constraints
constraints = [
cp.sum(w) == 1, # Sum of asset allocation = 1
w >= 0, # No short selling
expected_returns @ w >= target_return # Achieve target return
]
# Define and solve problem
problem = cp.Problem(cp.Minimize(risk), constraints)
problem.solve()
# Calculate efficient frontier
target_returns = np.linspace(0.06, 0.14, 20)
risks = []
portfolios = []
for target in target_returns:
constraints = [
cp.sum(w) == 1,
w >= 0,
expected_returns @ w >= target
]
problem = cp.Problem(cp.Minimize(risk), constraints)
problem.solve()
if problem.status == 'optimal':
risks.append(np.sqrt(problem.value))
portfolios.append(w.value)
# Re-solve for the target return so the plots and summary use the selected portfolio
constraints = [
    cp.sum(w) == 1,
    w >= 0,
    expected_returns @ w >= target_return
]
problem = cp.Problem(cp.Minimize(risk), constraints)
problem.solve()
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
# Efficient frontier
axes[0].plot(risks, target_returns, 'b-', linewidth=2, label='Efficient Frontier')
axes[0].plot(np.sqrt(problem.value), target_return, 'r*',
markersize=15, label=f'Selected Portfolio (Return={target_return})')
axes[0].set_xlabel('Risk (Standard Deviation)', fontsize=12)
axes[0].set_ylabel('Expected Return', fontsize=12)
axes[0].set_title('Efficient Frontier', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
# Asset allocation
if problem.status == 'optimal':
axes[1].bar(range(n_assets), w.value, alpha=0.7, edgecolor='black')
axes[1].set_xlabel('Asset', fontsize=12)
axes[1].set_ylabel('Allocation Ratio', fontsize=12)
axes[1].set_title('Optimal Asset Allocation', fontsize=14)
axes[1].set_xticks(range(n_assets))
axes[1].set_xticklabels([f'Asset{i+1}' for i in range(n_assets)])
axes[1].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
print("=== Portfolio Optimization Results ===")
print(f"Target return: {target_return:.2%}")
print(f"Achieved return: {(expected_returns @ w.value):.2%}")
print(f"Risk (standard deviation): {np.sqrt(problem.value):.2%}")
print(f"\nOptimal asset allocation:")
for i, weight in enumerate(w.value):
print(f" Asset{i+1}: {weight:.2%}")
Note: To use CVXPY, install it with pip install cvxpy.
3.5 Practical Application: Machine Learning
Optimization of Logistic Regression
Logistic regression can be solved as a convex optimization problem.
$$ \min_{\mathbf{w}} \sum_{i=1}^n \log(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)) + \lambda \|\mathbf{w}\|^2 $$
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Logistic regression via gradient descent
Purpose: Minimize the regularized logistic loss and visualize the decision boundary
Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Generate data
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
n_redundant=0, n_clusters_per_class=1,
random_state=42)
y = 2 * y - 1 # {0, 1} -> {-1, 1}
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Sigmoid function
def sigmoid(z):
return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
# Loss function (cross entropy + L2 regularization)
def logistic_loss(w, X, y, lambda_reg=0.01):
z = X @ w
loss = np.mean(np.log(1 + np.exp(-y * z)))
reg = lambda_reg * np.sum(w**2)
return loss + reg
# Gradient
def logistic_gradient(w, X, y, lambda_reg=0.01):
z = X @ w
grad = -X.T @ (y * sigmoid(-y * z)) / len(y)
grad += 2 * lambda_reg * w
return grad
# Optimization with gradient descent
def train_logistic_regression(X, y, lr=0.1, n_iterations=1000, lambda_reg=0.01):
n_features = X.shape[1]
w = np.zeros(n_features)
losses = []
for i in range(n_iterations):
grad = logistic_gradient(w, X, y, lambda_reg)
w -= lr * grad
if i % 10 == 0:
loss = logistic_loss(w, X, y, lambda_reg)
losses.append(loss)
return w, losses
# Training
w_optimal, losses = train_logistic_regression(
X_train_scaled, y_train, lr=0.5, n_iterations=1000, lambda_reg=0.01
)
# Prediction
def predict(X, w):
z = X @ w
return np.sign(z)
y_pred_train = predict(X_train_scaled, w_optimal)
y_pred_test = predict(X_test_scaled, w_optimal)
train_acc = np.mean(y_pred_train == y_train)
test_acc = np.mean(y_pred_test == y_test)
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
# Learning curve
axes[0].plot(losses, linewidth=2)
axes[0].set_xlabel('Iteration (×10)', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Learning Curve', fontsize=14)
axes[0].grid(True, alpha=0.3)
# Decision boundary
x_min, x_max = X_train_scaled[:, 0].min() - 1, X_train_scaled[:, 0].max() + 1
y_min, y_max = X_train_scaled[:, 1].min() - 1, X_train_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
np.linspace(y_min, y_max, 100))
Z = predict(np.c_[xx.ravel(), yy.ravel()], w_optimal)
Z = Z.reshape(xx.shape)
axes[1].contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm', levels=1)
axes[1].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1],
c=y_train, cmap='coolwarm', edgecolors='k', s=50, alpha=0.7)
axes[1].set_xlabel('Feature 1', fontsize=12)
axes[1].set_ylabel('Feature 2', fontsize=12)
axes[1].set_title(f'Decision Boundary (Accuracy: {train_acc:.2%})', fontsize=14)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("=== Logistic Regression Results ===")
print(f"Training accuracy: {train_acc:.2%}")
print(f"Test accuracy: {test_acc:.2%}")
print(f"Optimal parameters: {w_optimal}")
Effect of Regularization
Regularization prevents overfitting.
- L1 Regularization (Lasso): $\lambda \|\mathbf{w}\|_1$ → Sparse solution
- L2 Regularization (Ridge): $\lambda \|\mathbf{w}\|_2^2$ → Weight decay
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Effect of L1 and L2 regularization
Purpose: Compare Ridge and Lasso across regularization strengths
Target: Intermediate
Execution time: 1-5 minutes
Dependencies: None
"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Generate data
X, y = make_regression(n_samples=100, n_features=50, n_informative=10,
noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Training with different regularization parameters
alphas = np.logspace(-3, 3, 50)
ridge_train_scores = []
ridge_test_scores = []
lasso_train_scores = []
lasso_test_scores = []
for alpha in alphas:
# Ridge
ridge = Ridge(alpha=alpha)
ridge.fit(X_train_scaled, y_train)
ridge_train_scores.append(ridge.score(X_train_scaled, y_train))
ridge_test_scores.append(ridge.score(X_test_scaled, y_test))
# Lasso
lasso = Lasso(alpha=alpha, max_iter=10000)
lasso.fit(X_train_scaled, y_train)
lasso_train_scores.append(lasso.score(X_train_scaled, y_train))
lasso_test_scores.append(lasso.score(X_test_scaled, y_test))
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
# Ridge
axes[0].semilogx(alphas, ridge_train_scores, 'b-', linewidth=2, label='Training')
axes[0].semilogx(alphas, ridge_test_scores, 'r-', linewidth=2, label='Test')
axes[0].set_xlabel('Regularization Parameter α', fontsize=12)
axes[0].set_ylabel('R² Score', fontsize=12)
axes[0].set_title('Ridge Regression (L2 Regularization)', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
# Lasso
axes[1].semilogx(alphas, lasso_train_scores, 'b-', linewidth=2, label='Training')
axes[1].semilogx(alphas, lasso_test_scores, 'r-', linewidth=2, label='Test')
axes[1].set_xlabel('Regularization Parameter α', fontsize=12)
axes[1].set_ylabel('R² Score', fontsize=12)
axes[1].set_title('Lasso Regression (L1 Regularization)', fontsize=14)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Coefficient comparison with optimal α
ridge_best = Ridge(alpha=1.0)
ridge_best.fit(X_train_scaled, y_train)
lasso_best = Lasso(alpha=0.1, max_iter=10000)
lasso_best.fit(X_train_scaled, y_train)
print("=== Effect of Regularization ===")
print(f"Ridge - Non-zero coefficients: {np.sum(np.abs(ridge_best.coef_) > 0.01)}/{len(ridge_best.coef_)}")
print(f"Lasso - Non-zero coefficients: {np.sum(np.abs(lasso_best.coef_) > 0.01)}/{len(lasso_best.coef_)}")
print(f"\nRidge best score: {ridge_best.score(X_test_scaled, y_test):.3f}")
print(f"Lasso best score: {lasso_best.score(X_test_scaled, y_test):.3f}")
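In the comparison above, the "optimal" α values (1.0 for Ridge, 0.1 for Lasso) are fixed by hand. scikit-learn's `RidgeCV` and `LassoCV` can instead select α by cross-validation; the following is a sketch on the same synthetic data, regenerated here so the snippet runs on its own (variable names `X_train_s`, `X_test_s` are local to this example):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=50, n_informative=10,
                       noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Search the same alpha grid as above, but let cross-validation choose
alphas = np.logspace(-3, 3, 50)
ridge_cv = RidgeCV(alphas=alphas).fit(X_train_s, y_train)
lasso_cv = LassoCV(alphas=alphas, max_iter=10000, cv=5,
                   random_state=42).fit(X_train_s, y_train)
print(f"RidgeCV: alpha = {ridge_cv.alpha_:.4f}, "
      f"test R^2 = {ridge_cv.score(X_test_s, y_test):.3f}")
print(f"LassoCV: alpha = {lasso_cv.alpha_:.4f}, "
      f"test R^2 = {lasso_cv.score(X_test_s, y_test):.3f}")
```

The selected α values will generally differ from the hand-picked ones, since they are tuned to the cross-validation folds rather than to this particular test split.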
3.6 Chapter Summary
What We Learned
Fundamentals of Optimization
- Importance of convexity: Convex optimization guarantees global optimal solutions
- Determining optimality using gradients and Hessians
Gradient Descent
- Learning rate selection determines convergence speed and stability
- Advanced methods such as SGD, Momentum, and Adam
- Understanding implementation and convergence
Constrained Optimization
- Handling equality constraints with Lagrange multipliers
- Theory of inequality constraints via KKT conditions
- Application to SVM
Convex Optimization
- Linear programming and quadratic programming
- Implementation with CVXPY
- Applications such as portfolio optimization
Application to Machine Learning
- Optimization of logistic regression
- Preventing overfitting through regularization
- Implementation with real data
Guidelines for Choosing Optimization Algorithms
| Problem Nature | Recommended Method | Reason |
|---|---|---|
| Convex function, small scale | Gradient Descent | Reliably finds optimal solution |
| Convex function, large scale | SGD, Adam | Computational efficiency |
| Non-convex, deep learning | Adam, RMSprop | Avoids local optima |
| With equality constraints | Lagrange Multipliers | Guarantees constraint satisfaction |
| With inequality constraints | KKT conditions, SLSQP | Feasible solutions |
| Linear or convex quadratic | Specialized solvers (CVXPY) | Fast and stable |
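As a concrete companion to the table, here is a minimal, self-contained sketch of the Adam update rule applied to the simple convex function $f(x) = x^2$. The hyperparameter names ($\beta_1$, $\beta_2$, $\epsilon$) follow the common convention; this is an illustrative sketch, not the chapter's full implementation:

```python
import numpy as np

def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_iter=200):
    """Minimize a 1-D function given its gradient, using the Adam update."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, n_iter + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # biased first-moment estimate
        v = beta2 * v + (1 - beta2) * g**2     # biased second-moment estimate
        m_hat = m / (1 - beta1**t)             # bias correction
        v_hat = v / (1 - beta2**t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# f(x) = x^2, so grad f = 2x; the minimum is at x* = 0
x_star = adam_minimize(lambda x: 2 * x, x0=5.0)
print(f"x after 200 Adam steps: {x_star:.6f}")
```

Note that the early steps have magnitude roughly equal to the learning rate regardless of the gradient size, which is exactly the scale-invariance property that makes Adam popular for deep learning.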
Next Chapter
In Chapter 4, we will study Fundamentals of Probability and Statistics:
- Probability distributions and expected values
- Maximum likelihood estimation and Bayesian estimation
- Hypothesis testing and confidence intervals
- Fundamentals of information theory
Exercises
Problem 1 (Difficulty: Easy)
Find the value of $x$ that minimizes the function $f(x) = x^2 + 4x + 4$ using the gradient (analytically).
Sample Solution
Solution:
Find the point where the gradient equals zero.
$$ \nabla f(x) = \frac{df}{dx} = 2x + 4 $$
Optimality condition:
$$ 2x + 4 = 0 \Rightarrow x^* = -2 $$
Second-order condition (sufficient condition):
$$ \frac{d^2 f}{dx^2} = 2 > 0 $$
Therefore, the minimum value $f(x^*) = 0$ is achieved at $x^* = -2$.
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Analytical minimization of f(x) = x^2 + 4x + 4
Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""
import numpy as np
import matplotlib.pyplot as plt
# Function and gradient
def f(x):
    return x**2 + 4*x + 4

def grad_f(x):
    return 2*x
# Visualization
x = np.linspace(-5, 2, 100)
y = f(x)
plt.figure(figsize=(10, 6))
plt.plot(x, y, 'b-', linewidth=2, label='$f(x) = x^2 + 4x + 4$')
plt.plot(-2, 0, 'r*', markersize=20, label='Optimal Solution: $x^* = -2$')
plt.axhline(y=0, color='k', linewidth=0.5, linestyle='--')
plt.axvline(x=-2, color='r', linewidth=0.5, linestyle='--')
plt.xlabel('x', fontsize=12)
plt.ylabel('f(x)', fontsize=12)
plt.title('Analytical Solution to Optimization Problem', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()
print(f"Optimal solution: x* = -2")
print(f"Minimum value: f(x*) = {f(-2)}")
print(f"Gradient: f'(x*) = {grad_f(-2)}")
Problem 2 (Difficulty: Medium)
Using gradient descent with learning rate $\alpha = 0.1$ and initial value $x_0 = 5$, minimize $f(x) = x^2$. Report the values of $x$ after 5 iterations.
Sample Solution
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: Gradient descent on f(x) = x^2 with alpha = 0.1, x0 = 5
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""
import numpy as np
# Objective function and gradient
def f(x):
    return x**2

def grad_f(x):
    return 2*x
# Gradient descent
x = 5.0
alpha = 0.1
n_iterations = 5
print("=== Gradient Descent Trajectory ===")
print(f"Iteration 0: x = {x:.6f}, f(x) = {f(x):.6f}")
for i in range(1, n_iterations + 1):
    grad = grad_f(x)
    x = x - alpha * grad
    print(f"Iteration {i}: x = {x:.6f}, f(x) = {f(x):.6f}, grad = {grad:.6f}")
print(f"\nFinal value: x = {x:.6f}")
print(f"True optimal solution: x* = 0")
Output:
=== Gradient Descent Trajectory ===
Iteration 0: x = 5.000000, f(x) = 25.000000
Iteration 1: x = 4.000000, f(x) = 16.000000, grad = 10.000000
Iteration 2: x = 3.200000, f(x) = 10.240000, grad = 8.000000
Iteration 3: x = 2.560000, f(x) = 6.553600, grad = 6.400000
Iteration 4: x = 2.048000, f(x) = 4.194304, grad = 5.120000
Iteration 5: x = 1.638400, f(x) = 2.684355, grad = 4.096000
Final value: x = 1.638400
True optimal solution: x* = 0
After 5 iterations, it has not fully converged, but it is approaching the optimal solution.
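The convergence pattern is exactly geometric: each update computes $x_{k+1} = x_k - 0.1 \cdot 2x_k = 0.8\,x_k$, so $x_k = 5 \cdot 0.8^k$. A two-line check reproduces the table above:

```python
# Each update multiplies x by (1 - 2*alpha) = 0.8, so x_k = 5 * 0.8**k
for k in range(6):
    print(f"Iteration {k}: x = {5 * 0.8**k:.6f}")
```

This closed form also shows why the learning rate matters: for $\alpha \geq 1$ the factor $|1 - 2\alpha| \geq 1$ and the iteration no longer contracts toward the optimum.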
Problem 3 (Difficulty: Medium)
Under the equality constraint $x + y = 1$, solve the problem of minimizing $f(x, y) = x^2 + y^2$ using Lagrange multipliers.
Sample Solution
Solution:
Lagrange function:
$$ \mathcal{L}(x, y, \lambda) = x^2 + y^2 + \lambda(x + y - 1) $$
Optimality conditions:
$$ \begin{aligned} \frac{\partial \mathcal{L}}{\partial x} &= 2x + \lambda = 0 \\ \frac{\partial \mathcal{L}}{\partial y} &= 2y + \lambda = 0 \\ \frac{\partial \mathcal{L}}{\partial \lambda} &= x + y - 1 = 0 \end{aligned} $$
From equations 1 and 2, $x = y$. Substituting into equation 3: $2x = 1 \Rightarrow x = y = 0.5$
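The three stationarity conditions can also be solved symbolically. A quick check with SymPy (assumed to be installed) confirms $x = y = 0.5$ and $\lambda = -1$:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
L = x**2 + y**2 + lam * (x + y - 1)   # the Lagrangian defined above

# Set all three partial derivatives to zero and solve the linear system
sol = sp.solve([sp.diff(L, v) for v in (x, y, lam)], (x, y, lam), dict=True)[0]
print(sol)   # x = 1/2, y = 1/2, lambda = -1
```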
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Equality-constrained minimization of x^2 + y^2 subject to x + y = 1
Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize
# Objective function
def objective(vars):
    x, y = vars
    return x**2 + y**2

# Equality constraint
def constraint(vars):
    x, y = vars
    return x + y - 1
# Define constraints
constraints = {'type': 'eq', 'fun': constraint}
# Initial point
x0 = [0.0, 0.0]
# Optimization
result = minimize(objective, x0, method='SLSQP', constraints=constraints)
print("=== Lagrange Multiplier Method Solution ===")
print(f"Optimal solution: x = {result.x[0]:.4f}, y = {result.x[1]:.4f}")
print(f"Objective function value: f(x*, y*) = {result.fun:.4f}")
print(f"Constraint satisfaction: x + y = {result.x[0] + result.x[1]:.6f}")
# Visualization
x = np.linspace(-0.5, 1.5, 100)
y = np.linspace(-0.5, 1.5, 100)
X, Y = np.meshgrid(x, y)
Z = X**2 + Y**2
plt.figure(figsize=(10, 8))
plt.contour(X, Y, Z, levels=20, cmap='viridis', alpha=0.6)
plt.colorbar(label='$f(x, y)$')
# Constraint line
x_line = np.linspace(-0.5, 1.5, 100)
y_line = 1 - x_line
plt.plot(x_line, y_line, 'r-', linewidth=3, label='Constraint: $x + y = 1$')
# Optimal point
plt.plot(result.x[0], result.x[1], 'r*', markersize=20, label='Optimal Solution')
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Equality-Constrained Optimization', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.tight_layout()
plt.show()
Problem 4 (Difficulty: Hard)
Using the Momentum method with $\beta = 0.9$, minimize the Rosenbrock function $f(x, y) = (1-x)^2 + 100(y-x^2)^2$ starting from the initial point $(-1, 1)$. Iterate 100 times with learning rate $\alpha = 0.001$ and visualize the trajectory.
Sample Solution
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Optimization of the Rosenbrock function with the Momentum method
Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""
import numpy as np
import matplotlib.pyplot as plt
# Rosenbrock function
def rosenbrock(x, y):
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(x, y):
    dx = -2 * (1 - x) - 400 * x * (y - x**2)
    dy = 200 * (y - x**2)
    return np.array([dx, dy])
# Momentum method
def momentum(start, grad_func, lr=0.001, beta=0.9, n_iterations=100):
    pos = start.copy()
    velocity = np.zeros_like(pos)
    trajectory = [pos.copy()]
    for i in range(n_iterations):
        grad = grad_func(pos[0], pos[1])
        velocity = beta * velocity - lr * grad
        pos += velocity
        trajectory.append(pos.copy())
    return np.array(trajectory)
# Optimization
start = np.array([-1.0, 1.0])
trajectory = momentum(start, rosenbrock_grad, lr=0.001, beta=0.9, n_iterations=100)
# Visualization
x = np.linspace(-1.5, 1.5, 100)
y = np.linspace(-0.5, 2.5, 100)
X, Y = np.meshgrid(x, y)
Z = rosenbrock(X, Y)
plt.figure(figsize=(12, 9))
plt.contour(X, Y, Z, levels=np.logspace(-1, 3, 20), cmap='viridis', alpha=0.6)
plt.colorbar(label='$f(x, y)$')
plt.plot(trajectory[:, 0], trajectory[:, 1], 'r.-', markersize=4,
         linewidth=1.5, alpha=0.7, label='Momentum Trajectory')
plt.plot(start[0], start[1], 'bo', markersize=10, label='Starting Point')
plt.plot(1, 1, 'g*', markersize=20, label='Optimal Solution')
plt.plot(trajectory[-1, 0], trajectory[-1, 1], 'rs', markersize=10, label='Ending Point')
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Optimization of Rosenbrock Function with Momentum Method', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("=== Momentum Method Results ===")
print(f"Initial point: ({start[0]:.2f}, {start[1]:.2f})")
print(f"Final point: ({trajectory[-1][0]:.4f}, {trajectory[-1][1]:.4f})")
print(f"Optimal solution: (1.0000, 1.0000)")
print(f"Final function value: {rosenbrock(trajectory[-1][0], trajectory[-1][1]):.6f}")
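To see what the momentum term actually contributes, the sketch below reruns the same setup with $\beta = 0$ (which reduces the update to plain gradient descent) and $\beta = 0.9$, and compares the final function values. The definitions are repeated so the snippet runs on its own; the exact values depend on the learning rate and iteration count:

```python
import numpy as np

def rosenbrock(x, y):
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(x, y):
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2),
                     200 * (y - x**2)])

def run(beta, lr=0.001, n_iter=100):
    # beta = 0 turns the momentum update into plain gradient descent
    pos = np.array([-1.0, 1.0])
    velocity = np.zeros(2)
    for _ in range(n_iter):
        velocity = beta * velocity - lr * rosenbrock_grad(pos[0], pos[1])
        pos += velocity
    return rosenbrock(pos[0], pos[1])

print(f"Plain GD (beta=0.0): f = {run(beta=0.0):.6f}")
print(f"Momentum (beta=0.9): f = {run(beta=0.9):.6f}")
```

On this narrow curved valley, the accumulated velocity typically lets the momentum variant travel further along the valley floor in the same number of iterations.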
Problem 5 (Difficulty: Hard)
Explain the difference between L1 and L2 regularization, and describe which one is more likely to generate sparse solutions (many coefficients equal to zero), along with the reason.
Sample Solution
Solution:
L2 Regularization (Ridge):
- Penalty term: $\lambda \sum_i w_i^2$
- Characteristic: Makes coefficients small but not zero
- Gradient: $\nabla (\lambda w^2) = 2\lambda w$ (the pull toward zero weakens in proportion to $w$, so coefficients shrink but never reach exactly zero)
L1 Regularization (Lasso):
- Penalty term: $\lambda \sum_i |w_i|$
- Characteristic: Makes coefficients exactly zero (sparse solution)
- Gradient: $\nabla (\lambda |w|) = \lambda \cdot \text{sign}(w)$ (constant magnitude $\lambda$; $|w|$ is non-differentiable at $w = 0$)
Reason for Sparsity:
- Geometric interpretation: The feasible region of L1 constraints has corners (e.g., diamond shape). Contour lines of the objective function easily intersect at corners, where some coefficients become zero.
- Gradient properties: The gradient of L1 is constant regardless of coefficient magnitude ($\pm \lambda$), so both small and large coefficients receive the same penalty, making small coefficients more likely to be pushed to zero.
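The gradient argument can be made concrete with the proximal operators of the two penalties: for the L1 penalty this is soft-thresholding, which sets small coefficients exactly to zero, while for the squared L2 penalty it is a uniform multiplicative shrinkage that never produces exact zeros. A minimal numerical sketch:

```python
import numpy as np

def prox_l1(w, lam):
    # Soft-thresholding: prox of lam*||w||_1; entries with |w| <= lam
    # are set exactly to zero
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def prox_l2(w, lam):
    # Prox of lam*||w||_2^2: multiplicative shrinkage, never exactly zero
    return w / (1.0 + 2.0 * lam)

w = np.array([3.0, 0.4, -0.2, -2.0])
print("L1:", prox_l1(w, 0.5))   # the small entries (0.4, -0.2) become 0
print("L2:", prox_l2(w, 0.5))   # every entry is halved, none becomes 0
```

This is exactly the mechanism coordinate-descent Lasso solvers exploit: each coordinate update is a soft-thresholding step, so zeros appear naturally during optimization.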
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Geometry of L1 and L2 constraint regions
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 1-5 minutes
Dependencies: None
"""
import numpy as np
import matplotlib.pyplot as plt
# Visualization of L1 and L2 constraint regions
theta = np.linspace(0, 2*np.pi, 1000)
# L1 constraint (|w1| + |w2| <= 1)
# Dividing each unit-circle point by |cos| + |sin| normalizes it onto
# the diamond |w1| + |w2| = 1
w1_l1 = np.cos(theta) / (np.abs(np.cos(theta)) + np.abs(np.sin(theta)))
w2_l1 = np.sin(theta) / (np.abs(np.cos(theta)) + np.abs(np.sin(theta)))
# L2 constraint (w1^2 + w2^2 <= 1)
w1_l2 = np.cos(theta)
w2_l2 = np.sin(theta)
# Objective function contours (hypothetical example; the center is chosen
# so that the constrained optimum under L1 falls exactly on a corner)
w1 = np.linspace(-2, 2, 100)
w2 = np.linspace(-2, 2, 100)
W1, W2 = np.meshgrid(w1, w2)
Z = (W1 - 0.5)**2 + (W2 - 1.5)**2
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# L1 regularization: the optimum sits on a corner of the diamond, so w1 = 0
axes[0].contour(W1, W2, Z, levels=15, alpha=0.6, cmap='viridis')
axes[0].fill(w1_l1, w2_l1, alpha=0.3, color='red', label='L1 Constraint Region')
axes[0].plot(0, 1, 'r*', markersize=20, label='Sparse Solution ($w_1 = 0$)')
axes[0].set_xlabel('$w_1$', fontsize=12)
axes[0].set_ylabel('$w_2$', fontsize=12)
axes[0].set_title('L1 Regularization (Lasso)', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
axes[0].axis('equal')
# L2 regularization: the optimum is the radial projection of the
# unconstrained minimum (0.5, 1.5) onto the unit circle, so both
# coefficients remain non-zero
axes[1].contour(W1, W2, Z, levels=15, alpha=0.6, cmap='viridis')
axes[1].fill(w1_l2, w2_l2, alpha=0.3, color='blue', label='L2 Constraint Region')
l2_opt = np.array([0.5, 1.5]) / np.linalg.norm([0.5, 1.5])
axes[1].plot(l2_opt[0], l2_opt[1], 'b*',
             markersize=20, label='Non-Sparse Solution')
axes[1].set_xlabel('$w_1$', fontsize=12)
axes[1].set_ylabel('$w_2$', fontsize=12)
axes[1].set_title('L2 Regularization (Ridge)', fontsize=14)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
axes[1].axis('equal')
plt.tight_layout()
plt.show()
print("=== L1 vs L2 Regularization ===")
print("L1 (Lasso): Generates sparse solutions (many coefficients are zero)")
print("L2 (Ridge): Generates small but non-zero coefficients")
print("\nApplications:")
print("- L1: When feature selection is needed")
print("- L2: When using all features while preventing overfitting")
References
- Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
- Nocedal, J., & Wright, S. (2006). Numerical Optimization (2nd ed.). Springer.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- Ruder, S. (2016). "An overview of gradient descent optimization algorithms". arXiv:1609.04747.
Disclaimer
- This content is provided solely for educational, research, and informational purposes and does not constitute professional advice (legal, accounting, technical warranty, etc.).
- This content and accompanying code examples are provided "AS IS" without any warranty, express or implied, including but not limited to merchantability, fitness for a particular purpose, non-infringement, accuracy, completeness, operation, or safety.
- The author and Tohoku University assume no responsibility for the content, availability, or safety of external links, third-party data, tools, libraries, etc.
- To the maximum extent permitted by applicable law, the author and Tohoku University shall not be liable for any direct, indirect, incidental, special, consequential, or punitive damages arising from the use, execution, or interpretation of this content.
- The content may be changed, updated, or discontinued without notice.
- The copyright and license of this content are subject to the stated conditions (e.g., CC BY 4.0). Such licenses typically include no-warranty clauses.