Chapter 3: Optimization Theory

This chapter covers Optimization Theory. You will learn formulation of optimization problems and Lagrange multipliers.

Learning Objectives

By reading this chapter, you will be able to:

✅ Understand the formulation of optimization problems and the concept of convexity
✅ Master the principles and implementation methods of gradient descent
✅ Choose appropriate optimization algorithms such as Momentum and Adam
✅ Understand Lagrange multipliers and KKT conditions
✅ Understand the theory and practice of convex optimization
✅ Implement optimization of machine learning models

3.1 Fundamentals of Optimization

Formulation of Optimization Problems

An Optimization Problem is a problem of minimizing or maximizing an objective function.

$$ \begin{aligned} \min_{\mathbf{x}} \quad & f(\mathbf{x}) \\ \text{subject to} \quad & g_i(\mathbf{x}) \leq 0, \quad i = 1, \ldots, m \\ & h_j(\mathbf{x}) = 0, \quad j = 1, \ldots, p \end{aligned} $$

$f(\mathbf{x})$: Objective function (to be minimized)
$g_i(\mathbf{x}) \leq 0$: Inequality constraints
$h_j(\mathbf{x}) = 0$: Equality constraints

Convex Functions and Convex Sets

Convex Set: A line segment connecting two points is contained within the set

$$ \mathbf{x}, \mathbf{y} \in C, \ \theta \in [0, 1] \Rightarrow \theta \mathbf{x} + (1-\theta) \mathbf{y} \in C $$

Convex Function: Function values between two points lie below the line segment

$$ f(\theta \mathbf{x} + (1-\theta) \mathbf{y}) \leq \theta f(\mathbf{x}) + (1-\theta) f(\mathbf{y}) $$

Importance: Convex optimization problems guarantee that local optimal solutions equal global optimal solutions

Visualization of Convexity

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: Visualization of Convexity

Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Example of convex function: quadratic function
def convex_function(x, y):
    return x**2 + y**2

# Example of non-convex function: Himmelblau function
def non_convex_function(x, y):
    return (x**2 + y - 11)**2 + (x + y**2 - 7)**2

# Create grid
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)

Z_convex = convex_function(X, Y)
Z_non_convex = non_convex_function(X, Y)

# Visualization
fig = plt.figure(figsize=(15, 6))

# Convex function
ax1 = fig.add_subplot(121, projection='3d')
ax1.plot_surface(X, Y, Z_convex, cmap='viridis', alpha=0.8)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_zlabel('f(x, y)')
ax1.set_title('Convex Function: $f(x, y) = x^2 + y^2$', fontsize=14)

# Non-convex function
ax2 = fig.add_subplot(122, projection='3d')
ax2.plot_surface(X, Y, Z_non_convex, cmap='plasma', alpha=0.8)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_zlabel('f(x, y)')
ax2.set_title('Non-Convex Function: Himmelblau Function', fontsize=14)

plt.tight_layout()
plt.show()

print("=== Verification of Convexity ===")
print("Convex function: Single global optimal solution (origin)")
print("Non-convex function: Multiple local optimal solutions exist")

Gradient and Hessian

Gradient: Vector representing the rate of change of the function

$$ \nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} $$

Hessian Matrix: Matrix of second-order partial derivatives

$$ \mathbf{H}(f) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} $$

Optimality Conditions

First-order condition (necessary condition):

$$\nabla f(\mathbf{x}^*) = \mathbf{0}$$

Second-order condition (sufficient condition):

$$\mathbf{H}(f)(\mathbf{x}^*) \succeq 0 \quad \text{(positive semidefinite)}$$

3.2 Gradient Descent

Principles of Gradient Descent

Gradient Descent searches for the optimal solution by iteratively moving in the opposite direction of the gradient.

$$ \mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \nabla f(\mathbf{x}_t) $$

$\alpha$: Learning rate (step size)
$\nabla f(\mathbf{x}_t)$: Gradient at the current position

Choosing the Learning Rate

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: Choosing the Learning Rate

Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""

import numpy as np
import matplotlib.pyplot as plt

# Objective function: f(x) = x^2 + 4x + 4
def f(x):
    return x**2 + 4*x + 4

# Gradient: f'(x) = 2x + 4
def grad_f(x):
    return 2*x + 4

# Gradient descent
def gradient_descent(x0, lr, n_iterations):
    x = x0
    trajectory = [x]

    for _ in range(n_iterations):
        x = x - lr * grad_f(x)
        trajectory.append(x)

    return np.array(trajectory)

# Experiments with different learning rates
learning_rates = [0.1, 0.5, 0.9, 1.1]
x0 = 5.0
n_iter = 20

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

x_range = np.linspace(-3, 6, 100)
y_range = f(x_range)

for i, lr in enumerate(learning_rates):
    trajectory = gradient_descent(x0, lr, n_iter)

    axes[i].plot(x_range, y_range, 'b-', linewidth=2, label='$f(x) = x^2 + 4x + 4$')
    axes[i].plot(trajectory, f(trajectory), 'ro-', markersize=6,
                 linewidth=1.5, alpha=0.7, label='Optimization Trajectory')
    axes[i].plot(-2, 0, 'g*', markersize=20, label='Optimal Solution')

    axes[i].set_xlabel('x', fontsize=12)
    axes[i].set_ylabel('f(x)', fontsize=12)
    axes[i].set_title(f'Learning Rate α = {lr}', fontsize=14)
    axes[i].legend()
    axes[i].grid(True, alpha=0.3)

    # Convergence determination
    if abs(trajectory[-1] - (-2)) < 0.01:
        status = "✓ Converged"
    elif lr >= 1.0:
        status = "✗ Diverged"
    else:
        status = "△ Slow Convergence"

    axes[i].text(0.05, 0.95, status, transform=axes[i].transAxes,
                fontsize=12, verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.show()

print("=== Impact of Learning Rate ===")
print("α = 0.1: Slow convergence")
print("α = 0.5: Appropriate convergence")
print("α = 0.9: Fast convergence")
print("α = 1.1: Divergence (learning rate too large)")

Stochastic Gradient Descent (SGD)

Batch Gradient Descent: Gradient calculation with all data (slow)

Stochastic Gradient Descent (SGD): Gradient calculation with 1 sample (fast, noisy)

Mini-batch Gradient Descent: Gradient calculation with small batches (balanced)

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: Mini-batch Gradient Descent: Gradient calculation with small

Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""

import numpy as np
import matplotlib.pyplot as plt

# Generate sample data (linear regression)
np.random.seed(42)
n_samples = 100
X = 2 * np.random.rand(n_samples, 1)
y = 4 + 3 * X + np.random.randn(n_samples, 1)

# Initialize parameters
theta_batch = np.random.randn(2, 1)
theta_sgd = theta_batch.copy()
theta_minibatch = theta_batch.copy()

# Add bias term
X_b = np.c_[np.ones((n_samples, 1)), X]

# Hyperparameters
n_epochs = 50
learning_rate = 0.01
batch_size = 10

# Loss function (MSE)
def compute_loss(X, y, theta):
    m = len(y)
    predictions = X.dot(theta)
    loss = (1/(2*m)) * np.sum((predictions - y)**2)
    return loss

# Gradient calculation
def compute_gradient(X, y, theta):
    m = len(y)
    predictions = X.dot(theta)
    gradient = (1/m) * X.T.dot(predictions - y)
    return gradient

# Training history
history_batch = []
history_sgd = []
history_minibatch = []

# Batch gradient descent
for epoch in range(n_epochs):
    gradient = compute_gradient(X_b, y, theta_batch)
    theta_batch -= learning_rate * gradient
    history_batch.append(compute_loss(X_b, y, theta_batch))

# Stochastic gradient descent
for epoch in range(n_epochs):
    for i in range(n_samples):
        random_index = np.random.randint(n_samples)
        xi = X_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradient = compute_gradient(xi, yi, theta_sgd)
        theta_sgd -= learning_rate * gradient
    history_sgd.append(compute_loss(X_b, y, theta_sgd))

# Mini-batch gradient descent
for epoch in range(n_epochs):
    shuffled_indices = np.random.permutation(n_samples)
    X_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]

    for i in range(0, n_samples, batch_size):
        xi = X_shuffled[i:i+batch_size]
        yi = y_shuffled[i:i+batch_size]
        gradient = compute_gradient(xi, yi, theta_minibatch)
        theta_minibatch -= learning_rate * gradient
    history_minibatch.append(compute_loss(X_b, y, theta_minibatch))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Learning curves
axes[0].plot(history_batch, label='Batch GD', linewidth=2)
axes[0].plot(history_sgd, label='SGD', alpha=0.7, linewidth=2)
axes[0].plot(history_minibatch, label='Mini-batch GD', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss (MSE)', fontsize=12)
axes[0].set_title('Comparison of Learning Curves', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Regression lines
axes[1].scatter(X, y, alpha=0.5, label='Data')
x_plot = np.array([[0], [2]])
x_plot_b = np.c_[np.ones((2, 1)), x_plot]

axes[1].plot(x_plot, x_plot_b.dot(theta_batch), 'r-',
             linewidth=2, label=f'Batch GD')
axes[1].plot(x_plot, x_plot_b.dot(theta_sgd), 'g--',
             linewidth=2, label=f'SGD')
axes[1].plot(x_plot, x_plot_b.dot(theta_minibatch), 'b:',
             linewidth=2, label=f'Mini-batch GD')
axes[1].set_xlabel('X', fontsize=12)
axes[1].set_ylabel('y', fontsize=12)
axes[1].set_title('Learned Regression Lines', fontsize=14)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("=== Final Parameters ===")
print(f"Batch GD:      θ0={theta_batch[0][0]:.3f}, θ1={theta_batch[1][0]:.3f}")
print(f"SGD:           θ0={theta_sgd[0][0]:.3f}, θ1={theta_sgd[1][0]:.3f}")
print(f"Mini-batch GD: θ0={theta_minibatch[0][0]:.3f}, θ1={theta_minibatch[1][0]:.3f}")
print(f"\nTrue values:   θ0=4.000, θ1=3.000")

Advanced Optimization Algorithms

Momentum

Uses a moving average of gradients to suppress oscillations.

$$ \begin{aligned} \mathbf{v}_{t+1} &= \beta \mathbf{v}_t - \alpha \nabla f(\mathbf{x}_t) \\ \mathbf{x}_{t+1} &= \mathbf{x}_t + \mathbf{v}_{t+1} \end{aligned} $$

Adam (Adaptive Moment Estimation)

Adaptively adjusts first and second moments of the gradient.

$$ \begin{aligned} \mathbf{m}_t &= \beta_1 \mathbf{m}_{t-1} + (1-\beta_1) \nabla f(\mathbf{x}_t) \\ \mathbf{v}_t &= \beta_2 \mathbf{v}_{t-1} + (1-\beta_2) (\nabla f(\mathbf{x}_t))^2 \\ \hat{\mathbf{m}}_t &= \frac{\mathbf{m}_t}{1-\beta_1^t} \\ \hat{\mathbf{v}}_t &= \frac{\mathbf{v}_t}{1-\beta_2^t} \\ \mathbf{x}_{t+1} &= \mathbf{x}_t - \alpha \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} \end{aligned} $$

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: $$
\begin{aligned}
\mathbf{m}_t &= \beta_1 \mathbf{m}_{t-1} 

Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""

import numpy as np
import matplotlib.pyplot as plt

# Rosenbrock function (difficult non-convex optimization function)
def rosenbrock(x, y):
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(x, y):
    dx = -2 * (1 - x) - 400 * x * (y - x**2)
    dy = 200 * (y - x**2)
    return np.array([dx, dy])

# Implementation of optimization algorithms
def sgd(start, grad_func, lr=0.001, n_iterations=1000):
    x = start.copy()
    trajectory = [x.copy()]

    for _ in range(n_iterations):
        grad = grad_func(x[0], x[1])
        x -= lr * grad
        trajectory.append(x.copy())

    return np.array(trajectory)

def momentum(start, grad_func, lr=0.001, beta=0.9, n_iterations=1000):
    x = start.copy()
    v = np.zeros_like(x)
    trajectory = [x.copy()]

    for _ in range(n_iterations):
        grad = grad_func(x[0], x[1])
        v = beta * v - lr * grad
        x += v
        trajectory.append(x.copy())

    return np.array(trajectory)

def adam(start, grad_func, lr=0.01, beta1=0.9, beta2=0.999,
         epsilon=1e-8, n_iterations=1000):
    x = start.copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    trajectory = [x.copy()]

    for t in range(1, n_iterations + 1):
        grad = grad_func(x[0], x[1])

        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * (grad ** 2)

        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)

        x -= lr * m_hat / (np.sqrt(v_hat) + epsilon)
        trajectory.append(x.copy())

    return np.array(trajectory)

# Execute optimization
start_point = np.array([-1.0, 1.0])
n_iter = 500

traj_sgd = sgd(start_point, rosenbrock_grad, lr=0.0005, n_iterations=n_iter)
traj_momentum = momentum(start_point, rosenbrock_grad, lr=0.0005, n_iterations=n_iter)
traj_adam = adam(start_point, rosenbrock_grad, lr=0.01, n_iterations=n_iter)

# Contour plot
x = np.linspace(-1.5, 1.5, 100)
y = np.linspace(-0.5, 2.5, 100)
X, Y = np.meshgrid(x, y)
Z = rosenbrock(X, Y)

plt.figure(figsize=(15, 5))

# SGD
plt.subplot(131)
plt.contour(X, Y, Z, levels=np.logspace(-1, 3, 20), cmap='viridis', alpha=0.6)
plt.plot(traj_sgd[:, 0], traj_sgd[:, 1], 'r.-', markersize=3,
         linewidth=1, alpha=0.7, label='SGD')
plt.plot(1, 1, 'g*', markersize=20, label='Optimal Solution')
plt.xlabel('x')
plt.ylabel('y')
plt.title('SGD', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)

# Momentum
plt.subplot(132)
plt.contour(X, Y, Z, levels=np.logspace(-1, 3, 20), cmap='viridis', alpha=0.6)
plt.plot(traj_momentum[:, 0], traj_momentum[:, 1], 'b.-', markersize=3,
         linewidth=1, alpha=0.7, label='Momentum')
plt.plot(1, 1, 'g*', markersize=20, label='Optimal Solution')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Momentum', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)

# Adam
plt.subplot(133)
plt.contour(X, Y, Z, levels=np.logspace(-1, 3, 20), cmap='viridis', alpha=0.6)
plt.plot(traj_adam[:, 0], traj_adam[:, 1], 'm.-', markersize=3,
         linewidth=1, alpha=0.7, label='Adam')
plt.plot(1, 1, 'g*', markersize=20, label='Optimal Solution')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Adam', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("=== Comparison of Optimization Algorithms ===")
print(f"SGD Final Position:      ({traj_sgd[-1][0]:.4f}, {traj_sgd[-1][1]:.4f})")
print(f"Momentum Final Position: ({traj_momentum[-1][0]:.4f}, {traj_momentum[-1][1]:.4f})")
print(f"Adam Final Position:     ({traj_adam[-1][0]:.4f}, {traj_adam[-1][1]:.4f})")
print(f"True Optimal Solution:   (1.0000, 1.0000)")

Comparison of Optimization Algorithms

Algorithm	Advantages	Disadvantages	Recommended Usage
SGD	Simple, memory efficient	Slow convergence, oscillations	Large-scale data
Momentum	Suppresses oscillations, fast convergence	Inertia causes overshooting	Valley-shaped functions
AdaGrad	Adjusts learning rate per feature	Learning rate decreases rapidly	Sparse data
RMSprop	Mitigates learning rate decay	Requires parameter tuning	RNN
Adam	Adaptive, high performance	Overfitting risk	General purpose (default)

3.3 Constrained Optimization

Lagrange Multipliers

Equality-constrained optimization:

$$ \begin{aligned} \min_{\mathbf{x}} \quad & f(\mathbf{x}) \\ \text{subject to} \quad & h(\mathbf{x}) = 0 \end{aligned} $$

Lagrange Function:

$$ \mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda h(\mathbf{x}) $$

Optimality Conditions:

$$ \begin{aligned} \nabla_{\mathbf{x}} \mathcal{L} &= 0 \\ \nabla_{\lambda} \mathcal{L} &= 0 \end{aligned} $$

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: $$
\begin{aligned}
\nabla_{\mathbf{x}} \mathcal{L} &= 0 \\
\

Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize

# Objective function: f(x, y) = (x - 2)^2 + (y - 1)^2
def objective(x):
    return (x[0] - 2)**2 + (x[1] - 1)**2

# Equality constraint: h(x, y) = x + y - 2 = 0
def constraint_eq(x):
    return x[0] + x[1] - 2

# Define constraints in dictionary format
constraints = {'type': 'eq', 'fun': constraint_eq}

# Initial point
x0 = np.array([0.0, 0.0])

# Optimization
result = minimize(objective, x0, method='SLSQP', constraints=constraints)

# Visualization
x_range = np.linspace(-1, 4, 100)
y_range = np.linspace(-1, 4, 100)
X, Y = np.meshgrid(x_range, y_range)
Z = (X - 2)**2 + (Y - 1)**2

plt.figure(figsize=(10, 8))
plt.contour(X, Y, Z, levels=20, cmap='viridis', alpha=0.6)
plt.colorbar(label='$f(x, y)$')

# Constraint line
x_constraint = np.linspace(-1, 4, 100)
y_constraint = 2 - x_constraint
plt.plot(x_constraint, y_constraint, 'r-', linewidth=3, label='Constraint: $x + y = 2$')

# Optimal point
plt.plot(result.x[0], result.x[1], 'r*', markersize=20, label='Optimal Solution')

# Unconstrained optimal point
plt.plot(2, 1, 'g*', markersize=20, label='Unconstrained Optimal Solution')

plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Equality-Constrained Optimization with Lagrange Multipliers', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.tight_layout()
plt.show()

print("=== Lagrange Multiplier Method Results ===")
print(f"Optimal solution: x = {result.x[0]:.4f}, y = {result.x[1]:.4f}")
print(f"Objective function value: f(x*) = {result.fun:.4f}")
print(f"Constraint satisfaction: h(x*) = {constraint_eq(result.x):.6f}")
print(f"Unconstrained optimal solution: x = 2.0, y = 1.0, f = 0.0")

KKT Conditions

Necessary conditions for inequality-constrained optimization

$$ \begin{aligned} \min_{\mathbf{x}} \quad & f(\mathbf{x}) \\ \text{subject to} \quad & g_i(\mathbf{x}) \leq 0 \end{aligned} $$

KKT Conditions (Karush-Kuhn-Tucker Conditions):

Stationarity: $\nabla f(\mathbf{x}^*) + \sum_i \mu_i \nabla g_i(\mathbf{x}^*) = 0$
Primal feasibility: $g_i(\mathbf{x}^*) \leq 0$
Dual feasibility: $\mu_i \geq 0$
Complementarity: $\mu_i g_i(\mathbf{x}^*) = 0$

Application to SVM

Support Vector Machine (SVM) is formulated as a constrained optimization problem.

$$ \begin{aligned} \min_{\mathbf{w}, b} \quad & \frac{1}{2} \|\mathbf{w}\|^2 \\ \text{subject to} \quad & y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1, \quad i = 1, \ldots, n \end{aligned} $$

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: $$
\begin{aligned}
\min_{\mathbf{w}, b} \quad & \frac{1}{2} 

Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Generate linearly separable data
np.random.seed(42)
X, y = make_blobs(n_samples=100, centers=2, n_features=2,
                  cluster_std=1.0, center_box=(-5, 5))

# SVM model (linear kernel)
svm = SVC(kernel='linear', C=1000)  # Large C approximates hard margin
svm.fit(X, y)

# Visualization of decision boundary
def plot_svm_decision_boundary(ax, X, y, model):
    # Create grid
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))

    # Decision boundary
    Z = model.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot
    ax.contour(xx, yy, Z, levels=[-1, 0, 1],
               linestyles=['--', '-', '--'], colors=['r', 'k', 'b'], linewidths=2)
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm',
               s=50, edgecolors='k', alpha=0.7)

    # Support vectors
    ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
               s=200, facecolors='none', edgecolors='g', linewidths=2,
               label='Support Vectors')

    ax.set_xlabel('$x_1$', fontsize=12)
    ax.set_ylabel('$x_2$', fontsize=12)
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.figure(figsize=(10, 8))
ax = plt.gca()
plot_svm_decision_boundary(ax, X, y, svm)
plt.title('Linear Separation with SVM (Optimization via KKT Conditions)', fontsize=14)
plt.tight_layout()
plt.show()

print("=== SVM Optimization Results ===")
print(f"Weight vector w: {svm.coef_[0]}")
print(f"Bias b: {svm.intercept_[0]:.4f}")
print(f"Number of support vectors: {len(svm.support_vectors_)}")
print(f"Margin: {2 / np.linalg.norm(svm.coef_):.4f}")

3.4 Convex Optimization

Properties of Convex Optimization

Important Property: In convex optimization problems, local optimal solution equals global optimal solution

This allows efficient algorithms to reliably find optimal solutions.

Linear Programming

$$ \begin{aligned} \min_{\mathbf{x}} \quad & \mathbf{c}^T \mathbf{x} \\ \text{subject to} \quad & \mathbf{A} \mathbf{x} \leq \mathbf{b} \\ & \mathbf{x} \geq 0 \end{aligned} $$

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: $$
\begin{aligned}
\min_{\mathbf{x}} \quad & \mathbf{c}^T \m

Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""

import numpy as np
from scipy.optimize import linprog
import matplotlib.pyplot as plt

# Linear programming problem
# Objective function: minimize -x - 2y (= maximize x + 2y)
c = [-1, -2]

# Inequality constraints: A_ub * x <= b_ub
# Constraint 1: x + y <= 4
# Constraint 2: 2x + y <= 5
# Constraint 3: x >= 0, y >= 0 (specified with bounds)
A_ub = np.array([[1, 1],
                 [2, 1]])
b_ub = np.array([4, 5])

# Variable bounds
bounds = [(0, None), (0, None)]

# Optimization
result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method='highs')

# Visualization
x = np.linspace(0, 5, 100)

plt.figure(figsize=(10, 8))

# Visualize constraints
y1 = 4 - x
y2 = 5 - 2*x

plt.plot(x, y1, 'r-', linewidth=2, label='$x + y \leq 4$')
plt.plot(x, y2, 'b-', linewidth=2, label='$2x + y \leq 5$')
plt.axhline(y=0, color='k', linewidth=0.5)
plt.axvline(x=0, color='k', linewidth=0.5)

# Feasible region
x_fill = np.linspace(0, 2.5, 100)
y_upper = np.minimum(4 - x_fill, 5 - 2*x_fill)
y_upper = np.maximum(y_upper, 0)

plt.fill_between(x_fill, 0, y_upper, alpha=0.3, color='green',
                 label='Feasible Region')

# Objective function contours
X, Y = np.meshgrid(np.linspace(0, 5, 100), np.linspace(0, 5, 100))
Z = X + 2*Y
plt.contour(X, Y, Z, levels=10, alpha=0.3, cmap='viridis')

# Optimal solution
plt.plot(result.x[0], result.x[1], 'r*', markersize=20, label='Optimal Solution')

plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Linear Programming', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.xlim(0, 5)
plt.ylim(0, 5)
plt.tight_layout()
plt.show()

print("=== Linear Programming Results ===")
print(f"Optimal solution: x = {result.x[0]:.4f}, y = {result.x[1]:.4f}")
print(f"Objective function value (maximization): {-result.fun:.4f}")
print(f"Optimization successful: {result.success}")

Quadratic Programming

$$ \begin{aligned} \min_{\mathbf{x}} \quad & \frac{1}{2} \mathbf{x}^T \mathbf{Q} \mathbf{x} + \mathbf{c}^T \mathbf{x} \\ \text{subject to} \quad & \mathbf{A} \mathbf{x} \leq \mathbf{b} \end{aligned} $$

Convex Optimization with CVXPY

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: Convex Optimization with CVXPY

Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""

import cvxpy as cp
import numpy as np
import matplotlib.pyplot as plt

# Portfolio optimization problem
# Objective: Minimize risk while achieving a target expected return

# Expected returns and covariance matrix of assets (sample data)
np.random.seed(42)
n_assets = 5
expected_returns = np.random.uniform(0.05, 0.15, n_assets)
cov_matrix = np.random.randn(n_assets, n_assets)
cov_matrix = cov_matrix @ cov_matrix.T / 100  # Positive definite matrix

# Variable: Asset allocation
w = cp.Variable(n_assets)

# Parameter: Target return
target_return = 0.10

# Objective function: Minimize risk (variance)
risk = cp.quad_form(w, cov_matrix)

# Constraints
constraints = [
    cp.sum(w) == 1,           # Sum of asset allocation = 1
    w >= 0,                    # No short selling
    expected_returns @ w >= target_return  # Achieve target return
]

# Define and solve problem
problem = cp.Problem(cp.Minimize(risk), constraints)
problem.solve()

# Calculate efficient frontier
target_returns = np.linspace(0.06, 0.14, 20)
risks = []
portfolios = []

for target in target_returns:
    constraints = [
        cp.sum(w) == 1,
        w >= 0,
        expected_returns @ w >= target
    ]
    problem = cp.Problem(cp.Minimize(risk), constraints)
    problem.solve()

    if problem.status == 'optimal':
        risks.append(np.sqrt(problem.value))
        portfolios.append(w.value)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Efficient frontier
axes[0].plot(risks, target_returns, 'b-', linewidth=2, label='Efficient Frontier')
axes[0].plot(np.sqrt(problem.value), target_return, 'r*',
             markersize=15, label=f'Selected Portfolio (Return={target_return})')
axes[0].set_xlabel('Risk (Standard Deviation)', fontsize=12)
axes[0].set_ylabel('Expected Return', fontsize=12)
axes[0].set_title('Efficient Frontier', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Asset allocation
if problem.status == 'optimal':
    axes[1].bar(range(n_assets), w.value, alpha=0.7, edgecolor='black')
    axes[1].set_xlabel('Asset', fontsize=12)
    axes[1].set_ylabel('Allocation Ratio', fontsize=12)
    axes[1].set_title('Optimal Asset Allocation', fontsize=14)
    axes[1].set_xticks(range(n_assets))
    axes[1].set_xticklabels([f'Asset{i+1}' for i in range(n_assets)])
    axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("=== Portfolio Optimization Results ===")
print(f"Target return: {target_return:.2%}")
print(f"Achieved return: {(expected_returns @ w.value):.2%}")
print(f"Risk (standard deviation): {np.sqrt(problem.value):.2%}")
print(f"\nOptimal asset allocation:")
for i, weight in enumerate(w.value):
    print(f"  Asset{i+1}: {weight:.2%}")

Note: To use CVXPY, install with pip install cvxpy.

3.5 Practical Application: Machine Learning

Optimization of Logistic Regression

Logistic regression can be solved as a convex optimization problem.

$$ \min_{\mathbf{w}} \sum_{i=1}^n \log(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)) + \lambda \|\mathbf{w}\|^2 $$

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: $$
\min_{\mathbf{w}} \sum_{i=1}^n \log(1 + \exp(-y_i \mathbf

Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Generate data
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                          n_redundant=0, n_clusters_per_class=1,
                          random_state=42)
y = 2 * y - 1  # {0, 1} -> {-1, 1}

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

# Loss function (cross entropy + L2 regularization)
def logistic_loss(w, X, y, lambda_reg=0.01):
    z = X @ w
    loss = np.mean(np.log(1 + np.exp(-y * z)))
    reg = lambda_reg * np.sum(w**2)
    return loss + reg

# Gradient
def logistic_gradient(w, X, y, lambda_reg=0.01):
    z = X @ w
    grad = -X.T @ (y * sigmoid(-y * z)) / len(y)
    grad += 2 * lambda_reg * w
    return grad

# Optimization with gradient descent
def train_logistic_regression(X, y, lr=0.1, n_iterations=1000, lambda_reg=0.01):
    n_features = X.shape[1]
    w = np.zeros(n_features)

    losses = []

    for i in range(n_iterations):
        grad = logistic_gradient(w, X, y, lambda_reg)
        w -= lr * grad

        if i % 10 == 0:
            loss = logistic_loss(w, X, y, lambda_reg)
            losses.append(loss)

    return w, losses

# Training
w_optimal, losses = train_logistic_regression(
    X_train_scaled, y_train, lr=0.5, n_iterations=1000, lambda_reg=0.01
)

# Prediction
def predict(X, w):
    z = X @ w
    return np.sign(z)

y_pred_train = predict(X_train_scaled, w_optimal)
y_pred_test = predict(X_test_scaled, w_optimal)

train_acc = np.mean(y_pred_train == y_train)
test_acc = np.mean(y_pred_test == y_test)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Learning curve
axes[0].plot(losses, linewidth=2)
axes[0].set_xlabel('Iteration (×10)', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Learning Curve', fontsize=14)
axes[0].grid(True, alpha=0.3)

# Decision boundary
x_min, x_max = X_train_scaled[:, 0].min() - 1, X_train_scaled[:, 0].max() + 1
y_min, y_max = X_train_scaled[:, 1].min() - 1, X_train_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))

Z = predict(np.c_[xx.ravel(), yy.ravel()], w_optimal)
Z = Z.reshape(xx.shape)

axes[1].contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm', levels=1)
axes[1].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1],
                c=y_train, cmap='coolwarm', edgecolors='k', s=50, alpha=0.7)
axes[1].set_xlabel('Feature 1', fontsize=12)
axes[1].set_ylabel('Feature 2', fontsize=12)
axes[1].set_title(f'Decision Boundary (Accuracy: {train_acc:.2%})', fontsize=14)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("=== Logistic Regression Results ===")
print(f"Training accuracy: {train_acc:.2%}")
print(f"Test accuracy: {test_acc:.2%}")
print(f"Optimal parameters: {w_optimal}")

Effect of Regularization

Regularization prevents overfitting.

L1 Regularization (Lasso): $\lambda \|\mathbf{w}\|_1$ → Sparse solution
L2 Regularization (Ridge): $\lambda \|\mathbf{w}\|_2^2$ → Weight decay

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: Regularization prevents overfitting.

Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 1-5 minutes
Dependencies: None
"""

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Generate data
X, y = make_regression(n_samples=100, n_features=50, n_informative=10,
                       noise=10, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training with different regularization parameters
alphas = np.logspace(-3, 3, 50)

ridge_train_scores = []
ridge_test_scores = []
lasso_train_scores = []
lasso_test_scores = []

for alpha in alphas:
    # Ridge
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_scaled, y_train)
    ridge_train_scores.append(ridge.score(X_train_scaled, y_train))
    ridge_test_scores.append(ridge.score(X_test_scaled, y_test))

    # Lasso
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_train_scaled, y_train)
    lasso_train_scores.append(lasso.score(X_train_scaled, y_train))
    lasso_test_scores.append(lasso.score(X_test_scaled, y_test))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Ridge
axes[0].semilogx(alphas, ridge_train_scores, 'b-', linewidth=2, label='Training')
axes[0].semilogx(alphas, ridge_test_scores, 'r-', linewidth=2, label='Test')
axes[0].set_xlabel('Regularization Parameter α', fontsize=12)
axes[0].set_ylabel('R² Score', fontsize=12)
axes[0].set_title('Ridge Regression (L2 Regularization)', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Lasso
axes[1].semilogx(alphas, lasso_train_scores, 'b-', linewidth=2, label='Training')
axes[1].semilogx(alphas, lasso_test_scores, 'r-', linewidth=2, label='Test')
axes[1].set_xlabel('Regularization Parameter α', fontsize=12)
axes[1].set_ylabel('R² Score', fontsize=12)
axes[1].set_title('Lasso Regression (L1 Regularization)', fontsize=14)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Coefficient comparison with optimal α
ridge_best = Ridge(alpha=1.0)
ridge_best.fit(X_train_scaled, y_train)

lasso_best = Lasso(alpha=0.1, max_iter=10000)
lasso_best.fit(X_train_scaled, y_train)

print("=== Effect of Regularization ===")
print(f"Ridge - Non-zero coefficients: {np.sum(np.abs(ridge_best.coef_) > 0.01)}/{len(ridge_best.coef_)}")
print(f"Lasso - Non-zero coefficients: {np.sum(np.abs(lasso_best.coef_) > 0.01)}/{len(lasso_best.coef_)}")
print(f"\nRidge best score: {ridge_best.score(X_test_scaled, y_test):.3f}")
print(f"Lasso best score: {lasso_best.score(X_test_scaled, y_test):.3f}")

3.6 Chapter Summary

What We Learned

Fundamentals of Optimization
- Importance of convexity: Convex optimization guarantees global optimal solutions
- Determining optimality using gradients and Hessians
Gradient Descent
- Learning rate selection determines convergence speed and stability
- Advanced methods such as SGD, Momentum, and Adam
- Understanding implementation and convergence
Constrained Optimization
- Handling equality constraints with Lagrange multipliers
- Theory of inequality constraints via KKT conditions
- Application to SVM
Convex Optimization
- Linear programming and quadratic programming
- Implementation with CVXPY
- Applications such as portfolio optimization
Application to Machine Learning
- Optimization of logistic regression
- Preventing overfitting through regularization
- Implementation with real data

Guidelines for Choosing Optimization Algorithms

Problem Nature	Recommended Method	Reason
Convex function, small scale	Gradient Descent	Reliably finds optimal solution
Convex function, large scale	SGD, Adam	Computational efficiency
Non-convex, deep learning	Adam, RMSprop	Avoids local optima
With equality constraints	Lagrange Multipliers	Guarantees constraint satisfaction
With inequality constraints	KKT conditions, SLSQP	Feasible solutions
Linear or convex quadratic	Specialized solvers (CVXPY)	Fast and stable

Next Chapter

In Chapter 4, we will study Fundamentals of Probability and Statistics:

Probability distributions and expected values
Maximum likelihood estimation and Bayesian estimation
Hypothesis testing and confidence intervals
Fundamentals of information theory

Exercises

Problem 1 (Difficulty: Easy)

Find the value of $x$ that minimizes the function $f(x) = x^2 + 4x + 4$ using the gradient (analytically).

Sample Solution

Solution:

Find the point where the gradient equals zero.

$$ \nabla f(x) = \frac{df}{dx} = 2x + 4 $$

Optimality condition:

$$ 2x + 4 = 0 \Rightarrow x^* = -2 $$

Second-order condition (sufficient condition):

$$ \frac{d^2 f}{dx^2} = 2 > 0 $$

Therefore, the minimum value $f(x^*) = 0$ is achieved at $x^* = -2$.

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: Therefore, the minimum value $f(x^*) = 0$ is achieved at $x^

Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""

import numpy as np
import matplotlib.pyplot as plt

# Function and gradient
def f(x):
    return x**2 + 4*x + 4

def grad_f(x):
    return 2*x + 4

# Visualization
x = np.linspace(-5, 2, 100)
y = f(x)

plt.figure(figsize=(10, 6))
plt.plot(x, y, 'b-', linewidth=2, label='$f(x) = x^2 + 4x + 4$')
plt.plot(-2, 0, 'r*', markersize=20, label='Optimal Solution: $x^* = -2$')
plt.axhline(y=0, color='k', linewidth=0.5, linestyle='--')
plt.axvline(x=-2, color='r', linewidth=0.5, linestyle='--')
plt.xlabel('x', fontsize=12)
plt.ylabel('f(x)', fontsize=12)
plt.title('Analytical Solution to Optimization Problem', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

print(f"Optimal solution: x* = -2")
print(f"Minimum value: f(x*) = {f(-2)}")
print(f"Gradient: f'(x*) = {grad_f(-2)}")

Problem 2 (Difficulty: Medium)

Using gradient descent with learning rate $\alpha = 0.1$ and initial value $x_0 = 5$, minimize $f(x) = x^2$. Report the values of $x$ after 5 iterations.

Sample Solution

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0

"""
Example: Using gradient descent with learning rate $\alpha = 0.1$ and

Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""

import numpy as np

# Objective function and gradient
def f(x):
    return x**2

def grad_f(x):
    return 2*x

# Gradient descent
x = 5.0
alpha = 0.1
n_iterations = 5

print("=== Gradient Descent Trajectory ===")
print(f"Iteration 0: x = {x:.6f}, f(x) = {f(x):.6f}")

for i in range(1, n_iterations + 1):
    grad = grad_f(x)
    x = x - alpha * grad
    print(f"Iteration {i}: x = {x:.6f}, f(x) = {f(x):.6f}, grad = {grad:.6f}")

print(f"\nFinal value: x = {x:.6f}")
print(f"True optimal solution: x* = 0")

Output:

=== Gradient Descent Trajectory ===
Iteration 0: x = 5.000000, f(x) = 25.000000
Iteration 1: x = 4.000000, f(x) = 16.000000, grad = 10.000000
Iteration 2: x = 3.200000, f(x) = 10.240000, grad = 8.000000
Iteration 3: x = 2.560000, f(x) = 6.553600, grad = 6.400000
Iteration 4: x = 2.048000, f(x) = 4.194304, grad = 5.120000
Iteration 5: x = 1.638400, f(x) = 2.684355, grad = 4.096000

Final value: x = 1.638400
True optimal solution: x* = 0

After 5 iterations, it has not fully converged, but it is approaching the optimal solution.

Problem 3 (Difficulty: Medium)

Under the equality constraint $x + y = 1$, solve the problem of minimizing $f(x, y) = x^2 + y^2$ using Lagrange multipliers.

Sample Solution

Solution:

Lagrange function:

$$ \mathcal{L}(x, y, \lambda) = x^2 + y^2 + \lambda(x + y - 1) $$

Optimality conditions:

$$ \begin{aligned} \frac{\partial \mathcal{L}}{\partial x} &= 2x + \lambda = 0 \\ \frac{\partial \mathcal{L}}{\partial y} &= 2y + \lambda = 0 \\ \frac{\partial \mathcal{L}}{\partial \lambda} &= x + y - 1 = 0 \end{aligned} $$

From equations 1 and 2, $x = y$. Substituting into equation 3: $2x = 1 \Rightarrow x = y = 0.5$

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: From equations 1 and 2, $x = y$. Substituting into equation 

Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize

# Objective function
def objective(vars):
    x, y = vars
    return x**2 + y**2

# Equality constraint
def constraint(vars):
    x, y = vars
    return x + y - 1

# Define constraints
constraints = {'type': 'eq', 'fun': constraint}

# Initial point
x0 = [0.0, 0.0]

# Optimization
result = minimize(objective, x0, method='SLSQP', constraints=constraints)

print("=== Lagrange Multiplier Method Solution ===")
print(f"Optimal solution: x = {result.x[0]:.4f}, y = {result.x[1]:.4f}")
print(f"Objective function value: f(x*, y*) = {result.fun:.4f}")
print(f"Constraint satisfaction: x + y = {result.x[0] + result.x[1]:.6f}")

# Visualization
x = np.linspace(-0.5, 1.5, 100)
y = np.linspace(-0.5, 1.5, 100)
X, Y = np.meshgrid(x, y)
Z = X**2 + Y**2

plt.figure(figsize=(10, 8))
plt.contour(X, Y, Z, levels=20, cmap='viridis', alpha=0.6)
plt.colorbar(label='$f(x, y)$')

# Constraint line
x_line = np.linspace(-0.5, 1.5, 100)
y_line = 1 - x_line
plt.plot(x_line, y_line, 'r-', linewidth=3, label='Constraint: $x + y = 1$')

# Optimal point
plt.plot(result.x[0], result.x[1], 'r*', markersize=20, label='Optimal Solution')

plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Equality-Constrained Optimization', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.tight_layout()
plt.show()

Problem 4 (Difficulty: Hard)

Using Momentum method ($\beta = 0.9$), optimize the Rosenbrock function $f(x, y) = (1-x)^2 + 100(y-x^2)^2$ from initial point $(-1, 1)$. Iterate 100 times with learning rate $\alpha = 0.001$ and visualize the trajectory.

Sample Solution

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: Using Momentum method ($\beta = 0.9$), optimize the Rosenbro

Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""

import numpy as np
import matplotlib.pyplot as plt

# Rosenbrock function
def rosenbrock(x, y):
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(x, y):
    dx = -2 * (1 - x) - 400 * x * (y - x**2)
    dy = 200 * (y - x**2)
    return np.array([dx, dy])

# Momentum method
def momentum(start, grad_func, lr=0.001, beta=0.9, n_iterations=100):
    pos = start.copy()
    velocity = np.zeros_like(pos)
    trajectory = [pos.copy()]

    for i in range(n_iterations):
        grad = grad_func(pos[0], pos[1])
        velocity = beta * velocity - lr * grad
        pos += velocity
        trajectory.append(pos.copy())

    return np.array(trajectory)

# Optimization
start = np.array([-1.0, 1.0])
trajectory = momentum(start, rosenbrock_grad, lr=0.001, beta=0.9, n_iterations=100)

# Visualization
x = np.linspace(-1.5, 1.5, 100)
y = np.linspace(-0.5, 2.5, 100)
X, Y = np.meshgrid(x, y)
Z = rosenbrock(X, Y)

plt.figure(figsize=(12, 9))
plt.contour(X, Y, Z, levels=np.logspace(-1, 3, 20), cmap='viridis', alpha=0.6)
plt.colorbar(label='$f(x, y)$')

plt.plot(trajectory[:, 0], trajectory[:, 1], 'r.-', markersize=4,
         linewidth=1.5, alpha=0.7, label='Momentum Trajectory')
plt.plot(start[0], start[1], 'bo', markersize=10, label='Starting Point')
plt.plot(1, 1, 'g*', markersize=20, label='Optimal Solution')
plt.plot(trajectory[-1, 0], trajectory[-1, 1], 'rs', markersize=10, label='Ending Point')

plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Optimization of Rosenbrock Function with Momentum Method', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("=== Momentum Method Results ===")
print(f"Initial point: ({start[0]:.2f}, {start[1]:.2f})")
print(f"Final point: ({trajectory[-1][0]:.4f}, {trajectory[-1][1]:.4f})")
print(f"Optimal solution: (1.0000, 1.0000)")
print(f"Final function value: {rosenbrock(trajectory[-1][0], trajectory[-1][1]):.6f}")

Problem 5 (Difficulty: Hard)

Explain the difference between L1 and L2 regularization, and describe which one is more likely to generate sparse solutions (many coefficients equal to zero), along with the reason.

Sample Solution

Solution:

L2 Regularization (Ridge):

Penalty term: $\lambda \sum_i w_i^2$
Characteristic: Makes coefficients small but not zero
Gradient: $\nabla (\lambda w^2) = 2\lambda w$ (continuously approaches zero)

L1 Regularization (Lasso):

Penalty term: $\lambda \sum_i |w_i|$
Characteristic: Makes coefficients exactly zero (sparse solution)
Gradient: $\nabla (\lambda |w|) = \lambda \cdot \text{sign}(w)$ (discontinuous near zero)

Reason for Sparsity:

Geometric interpretation: The feasible region of L1 constraints has corners (e.g., diamond shape). Contour lines of the objective function easily intersect at corners, where some coefficients become zero.
Gradient properties: The gradient of L1 is constant regardless of coefficient magnitude ($\pm \lambda$), so both small and large coefficients receive the same penalty, making small coefficients more likely to be pushed to zero.

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: Reason for Sparsity:

Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 1-5 minutes
Dependencies: None
"""

import numpy as np
import matplotlib.pyplot as plt

# Visualization of L1 and L2 constraint regions
theta = np.linspace(0, 2*np.pi, 1000)

# L1 constraint (|w1| + |w2| <= 1)
w1_l1 = np.cos(theta) * (np.abs(np.cos(theta)) + np.abs(np.sin(theta)))
w2_l1 = np.sin(theta) * (np.abs(np.cos(theta)) + np.abs(np.sin(theta)))

# L2 constraint (w1^2 + w2^2 <= 1)
w1_l2 = np.cos(theta)
w2_l2 = np.sin(theta)

# Objective function contours (hypothetical example)
w1 = np.linspace(-2, 2, 100)
w2 = np.linspace(-2, 2, 100)
W1, W2 = np.meshgrid(w1, w2)
Z = (W1 - 1.5)**2 + (W2 - 1.0)**2

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# L1 regularization
axes[0].contour(W1, W2, Z, levels=15, alpha=0.6, cmap='viridis')
axes[0].fill(w1_l1, w2_l1, alpha=0.3, color='red', label='L1 Constraint Region')
axes[0].plot(0, 0, 'r*', markersize=20, label='Sparse Solution (w1=0)')
axes[0].set_xlabel('$w_1$', fontsize=12)
axes[0].set_ylabel('$w_2$', fontsize=12)
axes[0].set_title('L1 Regularization (Lasso)', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
axes[0].axis('equal')

# L2 regularization
axes[1].contour(W1, W2, Z, levels=15, alpha=0.6, cmap='viridis')
axes[1].fill(w1_l2, w2_l2, alpha=0.3, color='blue', label='L2 Constraint Region')
circle_intersect_x = 0.8
circle_intersect_y = 0.6
axes[1].plot(circle_intersect_x, circle_intersect_y, 'b*',
             markersize=20, label='Non-Sparse Solution')
axes[1].set_xlabel('$w_1$', fontsize=12)
axes[1].set_ylabel('$w_2$', fontsize=12)
axes[1].set_title('L2 Regularization (Ridge)', fontsize=14)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
axes[1].axis('equal')

plt.tight_layout()
plt.show()

print("=== L1 vs L2 Regularization ===")
print("L1 (Lasso): Generates sparse solutions (many coefficients are zero)")
print("L2 (Ridge): Generates small but non-zero coefficients")
print("\nApplications:")
print("- L1: When feature selection is needed")
print("- L2: When using all features while preventing overfitting")

References

Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
Nocedal, J., & Wright, S. (2006). Numerical Optimization (2nd ed.). Springer.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
Ruder, S. (2016). "An overview of gradient descent optimization algorithms". arXiv:1609.04747.

Disclaimer

This content is provided solely for educational, research, and informational purposes and does not constitute professional advice (legal, accounting, technical warranty, etc.).
This content and accompanying code examples are provided "AS IS" without any warranty, express or implied, including but not limited to merchantability, fitness for a particular purpose, non-infringement, accuracy, completeness, operation, or safety.
The author and Tohoku University assume no responsibility for the content, availability, or safety of external links, third-party data, tools, libraries, etc.
To the maximum extent permitted by applicable law, the author and Tohoku University shall not be liable for any direct, indirect, incidental, special, consequential, or punitive damages arising from the use, execution, or interpretation of this content.
The content may be changed, updated, or discontinued without notice.
The copyright and license of this content are subject to the stated conditions (e.g., CC BY 4.0). Such licenses typically include no-warranty clauses.