This chapter covers the information theory that underpins machine learning, from entropy to the VAE, through both theory and implementation. You will learn:
- Mathematical definition and intuitive understanding of entropy and information content
- Relationship between KL divergence and cross-entropy
- Theoretical foundation of feature selection using mutual information
- Information-theoretic interpretation of VAE and information bottleneck
- Information-theoretic meaning of loss functions in machine learning
1. Entropy
1.1 Information Content and Shannon Entropy
The foundation of information theory begins with quantifying "the amount of information in an event." When an event x occurs with probability P(x), its self-information is defined as
\[
I(x) = -\log P(x).
\]
This definition has deep meaning:
- Events with lower probability have greater information content: Rare events are more surprising
- Certain events (P(x)=1) have zero information content: There is no surprise in predictable events
- Information content of independent events is additive: I(x,y) = I(x) + I(y)
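These properties can be checked numerically. Below is a minimal sketch (the helper self_information is ours, not from a library) that evaluates \(I(x) = \log_2 (1/P(x))\) for a few probabilities and verifies additivity for independent events.
import numpy as np

def self_information(p, base=2):
    """Self-information I(x) = log_base(1 / P(x)) of an event with probability p."""
    return np.log(1.0 / p) / np.log(base)

# Rarer events carry more information; a certain event (P = 1) carries none
for p in [0.5, 0.1, 0.01, 1.0]:
    print(f"P(x) = {p:5.2f} -> I(x) = {self_information(p):.4f} bits")

# Additivity for independent events: I(x, y) = I(x) + I(y)
p_a, p_b = 0.5, 0.25
print(f"I(x, y)     = {self_information(p_a * p_b):.4f} bits")
print(f"I(x) + I(y) = {self_information(p_a) + self_information(p_b):.4f} bits")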
Shannon entropy is the average self-information over an entire probability distribution:
\[
H(X) = -\sum_{x} P(x) \log P(x).
\]
1.2 Conditional Entropy
The entropy of X when the value of variable Y is given is called the conditional entropy:
\[
H(X|Y) = \sum_{y} P(y)\, H(X \mid Y = y) = -\sum_{x,y} P(x,y) \log P(x|y).
\]
As an important property, the chain rule holds:
\[
H(X,Y) = H(X) + H(Y|X).
\]
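The chain rule follows directly from the factorization \(P(x,y) = P(x)P(y|x)\):
\[
H(X,Y) = -\sum_{x,y} P(x,y)\log P(x,y) = -\sum_{x,y} P(x,y)\big[\log P(x) + \log P(y|x)\big] = H(X) + H(Y|X).
\]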
Implementation Example 1: Entropy Calculation and Visualization
import numpy as np
import matplotlib.pyplot as plt
class InformationMeasures:
"""Class for calculating fundamental quantities in information theory"""
@staticmethod
def entropy(p, base=2):
"""
Calculate Shannon entropy
Parameters:
-----------
p : array-like
Probability distribution (must sum to 1)
base : int
Logarithm base (2: bits, e: nats)
Returns:
--------
float : Entropy
"""
p = np.array(p)
# Define 0 * log(0) = 0 (numerically stable implementation)
p = p[p > 0] # Consider only positive probabilities
if base == 2:
return -np.sum(p * np.log2(p))
elif base == np.e:
return -np.sum(p * np.log(p))
else:
return -np.sum(p * np.log(p)) / np.log(base)
@staticmethod
def conditional_entropy(joint_p, axis=1):
"""
Calculate conditional entropy H(X|Y)
Parameters:
-----------
joint_p : ndarray
Joint probability distribution P(X,Y)
axis : int
Axis of the conditioning variable (0: H(Y|X), 1: H(X|Y))
Returns:
--------
float : Conditional entropy
"""
joint_p = np.array(joint_p)
        # Marginal probability of the conditioning variable (sum over the other axis)
        marginal_p = np.sum(joint_p, axis=1 - axis)
# Calculate conditional entropy
h_cond = 0
for i, p_y in enumerate(marginal_p):
if p_y > 0:
if axis == 1:
conditional_p = joint_p[:, i] / p_y
else:
conditional_p = joint_p[i, :] / p_y
h_cond += p_y * InformationMeasures.entropy(conditional_p)
return h_cond
@staticmethod
def joint_entropy(joint_p):
"""
Calculate joint entropy H(X,Y)
"""
joint_p = np.array(joint_p).flatten()
return InformationMeasures.entropy(joint_p)
# Usage Example 1: Entropy of binary variable
print("=" * 50)
print("Entropy of Binary Variable")
print("=" * 50)
# Coin probability distribution
probs = np.linspace(0.01, 0.99, 99)
entropies = [InformationMeasures.entropy([p, 1-p]) for p in probs]
plt.figure(figsize=(10, 5))
plt.plot(probs, entropies, 'b-', linewidth=2)
plt.axvline(x=0.5, color='r', linestyle='--', label='Maximum entropy (p=0.5)')
plt.xlabel('Probability P(X=1)')
plt.ylabel('Entropy H(X) [bits]')
plt.title('Entropy of Binary Random Variable')
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.savefig('entropy_binary.png', dpi=150, bbox_inches='tight')
print(f"Maximum entropy: {max(entropies):.4f} bits (p=0.5)")
# Usage Example 2: Joint probability distribution and conditional entropy
print("\n" + "=" * 50)
print("Example of Conditional Entropy")
print("=" * 50)
# Joint probability distribution P(X,Y)
joint_prob = np.array([
[0.2, 0.1], # P(X=0, Y=0), P(X=0, Y=1)
[0.15, 0.55] # P(X=1, Y=0), P(X=1, Y=1)
])
# Marginal probabilities
p_x = np.sum(joint_prob, axis=1)
p_y = np.sum(joint_prob, axis=0)
# Various entropies
h_x = InformationMeasures.entropy(p_x)
h_y = InformationMeasures.entropy(p_y)
h_xy = InformationMeasures.joint_entropy(joint_prob)
h_x_given_y = InformationMeasures.conditional_entropy(joint_prob, axis=1)
h_y_given_x = InformationMeasures.conditional_entropy(joint_prob, axis=0)
print(f"H(X) = {h_x:.4f} bits")
print(f"H(Y) = {h_y:.4f} bits")
print(f"H(X,Y) = {h_xy:.4f} bits")
print(f"H(X|Y) = {h_x_given_y:.4f} bits")
print(f"H(Y|X) = {h_y_given_x:.4f} bits")
# Verify chain rule: H(X,Y) = H(X) + H(Y|X)
print(f"\nChain rule verification:")
print(f"H(X) + H(Y|X) = {h_x + h_y_given_x:.4f}")
print(f"H(X,Y) = {h_xy:.4f}")
print(f"Difference: {abs(h_xy - (h_x + h_y_given_x)):.10f}")
2. KL Divergence and Cross-Entropy
2.1 KL Divergence (Kullback-Leibler Divergence)
KL divergence quantifies the "difference" between two probability distributions P(x) and Q(x):
\[
D_{KL}(P\|Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}.
\]
Important properties:
- Asymmetry: \(D_{KL}(P||Q) \neq D_{KL}(Q||P)\) (not a distance metric)
- Non-negativity: \(D_{KL}(P||Q) \geq 0\), equality holds when \(P = Q\)
- Gibbs' inequality: \(\mathbb{E}_P[\log P(x)] \geq \mathbb{E}_P[\log Q(x)]\)
2.2 Cross-Entropy
Cross-entropy is the expected code length (in bits when using base-2 logarithms) for encoding events drawn from the true distribution P with a code optimized for the model distribution Q:
\[
H(P, Q) = -\sum_{x} P(x) \log Q(x) = H(P) + D_{KL}(P\|Q).
\]
From this decomposition, minimizing cross-entropy with respect to Q is equivalent to minimizing the KL divergence, since H(P) does not depend on Q.
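As a quick numerical check of this decomposition (a minimal sketch with an arbitrary pair of discrete distributions), the identity can be verified directly with NumPy:
import numpy as np

p = np.array([0.1, 0.4, 0.5])   # true distribution P
q = np.array([0.8, 0.1, 0.1])   # model distribution Q

h_p  = -np.sum(p * np.log(p))           # entropy H(P), in nats
h_pq = -np.sum(p * np.log(q))           # cross-entropy H(P, Q)
kl   = np.sum(p * np.log(p / q))        # D_KL(P || Q)

print(f"H(P,Q)      = {h_pq:.4f}")
print(f"H(P) + D_KL = {h_p + kl:.4f}")  # identical up to floating-point error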
Implementation Example 2: KL Divergence and Cross-Entropy
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import softmax
class DivergenceMeasures:
"""Calculation of divergence metrics"""
@staticmethod
def kl_divergence(p, q, epsilon=1e-10):
"""
Calculate KL divergence D_KL(P||Q)
Parameters:
-----------
p, q : array-like
Probability distributions (must be normalized)
epsilon : float
Small value for numerical stability
Returns:
--------
float : KL divergence [nats or bits]
"""
p = np.array(p)
q = np.array(q)
# Prevent division by zero
q = np.clip(q, epsilon, 1.0)
p = np.clip(p, epsilon, 1.0)
return np.sum(p * np.log(p / q))
@staticmethod
def cross_entropy(p, q, epsilon=1e-10):
"""
Calculate cross-entropy H(P,Q)
Returns:
--------
float : Cross-entropy
"""
p = np.array(p)
q = np.array(q)
q = np.clip(q, epsilon, 1.0)
return -np.sum(p * np.log(q))
@staticmethod
def js_divergence(p, q):
"""
Jensen-Shannon divergence (symmetric version of KL)
JS(P||Q) = 0.5 * D_KL(P||M) + 0.5 * D_KL(Q||M)
where M = 0.5 * (P + Q)
"""
p = np.array(p)
q = np.array(q)
m = 0.5 * (p + q)
return 0.5 * DivergenceMeasures.kl_divergence(p, m) + \
0.5 * DivergenceMeasures.kl_divergence(q, m)
# Usage Example 1: KL divergence of Gaussian distributions
print("=" * 50)
print("KL Divergence of Gaussian Distributions")
print("=" * 50)
# Two normal distributions
x = np.linspace(-5, 5, 1000)
p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi) # N(0, 1)
q = np.exp(-0.5 * (x - 1)**2) / np.sqrt(2 * np.pi) # N(1, 1)
# Normalize
p = p / np.sum(p)
q = q / np.sum(q)
kl_pq = DivergenceMeasures.kl_divergence(p, q)
kl_qp = DivergenceMeasures.kl_divergence(q, p)
js = DivergenceMeasures.js_divergence(p, q)
print(f"D_KL(P||Q) = {kl_pq:.4f}")
print(f"D_KL(Q||P) = {kl_qp:.4f}")
print(f"JS(P||Q) = {js:.4f}")
print(f"Asymmetry: |D_KL(P||Q) - D_KL(Q||P)| = {abs(kl_pq - kl_qp):.4f}")
# Visualization
plt.figure(figsize=(10, 5))
# Rescale the normalized probabilities back to densities (divide by the bin width)
plt.plot(x, p / (x[1] - x[0]), 'b-', linewidth=2, label='P: N(0,1)')
plt.plot(x, q / (x[1] - x[0]), 'r-', linewidth=2, label='Q: N(1,1)')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.title(f'KL Divergence: D_KL(P||Q) = {kl_pq:.4f}')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('kl_divergence.png', dpi=150, bbox_inches='tight')
# Usage Example 2: Cross-entropy loss in classification
print("\n" + "=" * 50)
print("Cross-Entropy Loss in Classification")
print("=" * 50)
# True label (one-hot encoding)
true_label = np.array([0, 1, 0, 0]) # Class 1 is correct
# Model predictions (logits)
logits_good = np.array([1.0, 3.5, 0.5, 0.8]) # Good prediction
logits_bad = np.array([2.0, 0.5, 1.5, 1.0]) # Poor prediction
# Convert to probabilities with softmax
pred_good = softmax(logits_good)
pred_bad = softmax(logits_bad)
# Cross-entropy loss
ce_good = DivergenceMeasures.cross_entropy(true_label, pred_good)
ce_bad = DivergenceMeasures.cross_entropy(true_label, pred_bad)
print(f"Good prediction probability distribution: {pred_good}")
print(f"Cross-entropy loss: {ce_good:.4f}\n")
print(f"Poor prediction probability distribution: {pred_bad}")
print(f"Cross-entropy loss: {ce_bad:.4f}\n")
print(f"Loss difference: {ce_bad - ce_good:.4f}")
print("→ Good prediction has lower loss")
3. Mutual Information
3.1 Definition of Mutual Information
Mutual information measures the statistical dependence between two random variables X and Y:
\[
I(X;Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)\,P(y)} = D_{KL}\big(P(X,Y)\,\|\,P(X)P(Y)\big).
\]
Mutual information can also be expressed in terms of entropies:
\[
I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y).
\]
Important properties:
- Symmetry: \(I(X;Y) = I(Y;X)\)
- Non-negativity: \(I(X;Y) \geq 0\), equality holds when X and Y are independent
- Upper bound: \(I(X;Y) \leq \min(H(X), H(Y))\)
3.2 Application to Feature Selection
In machine learning, mutual information can be used to select important features: compute the mutual information I(X_i;Y) between each feature X_i and the target variable Y, and keep the features with the largest scores.
Implementation Example 3: Feature Selection Using Mutual Information
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import make_classification
class MutualInformation:
"""Mutual information calculation and feature selection"""
@staticmethod
def mutual_information_discrete(x, y):
"""
Calculate mutual information for discrete variables
I(X;Y) = H(X) + H(Y) - H(X,Y)
Parameters:
-----------
x, y : array-like
Discrete-valued random variables
Returns:
--------
float : Mutual information
"""
x = np.array(x)
y = np.array(y)
# Create joint frequency matrix
xy = np.c_[x, y]
unique_xy, counts_xy = np.unique(xy, axis=0, return_counts=True)
joint_prob = counts_xy / len(x)
# Marginal probabilities
unique_x, counts_x = np.unique(x, return_counts=True)
p_x = counts_x / len(x)
unique_y, counts_y = np.unique(y, return_counts=True)
p_y = counts_y / len(y)
# Calculate entropies
from scipy.stats import entropy
h_x = entropy(p_x, base=2)
h_y = entropy(p_y, base=2)
h_xy = entropy(joint_prob, base=2)
# Mutual information
mi = h_x + h_y - h_xy
return mi
@staticmethod
def feature_selection_by_mi(X, y, n_features=5):
"""
Feature selection using mutual information
Parameters:
-----------
X : ndarray of shape (n_samples, n_features)
Feature matrix
y : array-like
Target variable
n_features : int
Number of features to select
Returns:
--------
selected_indices : array
Indices of selected features
mi_scores : array
Mutual information scores for each feature
"""
# scikit-learn's mutual information calculation
mi_scores = mutual_info_classif(X, y, random_state=42)
# Sort in descending order of scores
selected_indices = np.argsort(mi_scores)[::-1][:n_features]
return selected_indices, mi_scores
# Usage Example 1: Mutual information of discrete variables
print("=" * 50)
print("Mutual Information of Discrete Variables")
print("=" * 50)
# Example: Weather (X) and umbrella use (Y)
# 0: sunny/not used, 1: rainy/used
np.random.seed(42)
# Case with strong correlation
weather = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 1] * 10)
umbrella_corr = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 1] * 10) # Correlated with weather
umbrella_rand = np.random.randint(0, 2, 100) # Random
mi_corr = MutualInformation.mutual_information_discrete(weather, umbrella_corr)
mi_rand = MutualInformation.mutual_information_discrete(weather, umbrella_rand)
print(f"Mutual information of weather and umbrella (correlated): {mi_corr:.4f} bits")
print(f"Mutual information of weather and umbrella (random): {mi_rand:.4f} bits")
print(f"→ Mutual information is larger in the correlated case")
# Usage Example 2: Feature selection
print("\n" + "=" * 50)
print("Feature Selection Using Mutual Information")
print("=" * 50)
# Generate synthetic data (20 features, 5 of which are useful)
X, y = make_classification(
n_samples=1000,
n_features=20,
n_informative=5, # Useful features
n_redundant=5, # Redundant features
n_repeated=0,
n_classes=2,
random_state=42
)
# Feature selection with mutual information
selected_idx, mi_scores = MutualInformation.feature_selection_by_mi(X, y, n_features=5)
print(f"Total number of features: {X.shape[1]}")
print(f"\nMutual information scores (top 10 features):")
for i in range(10):
print(f" Feature {i}: MI = {mi_scores[i]:.4f}")
print(f"\nSelected features (top 5): {selected_idx}")
print(f"MI scores of selected features: {mi_scores[selected_idx]}")
# Visualization
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.bar(range(len(mi_scores)), mi_scores, color='steelblue')
plt.bar(selected_idx, mi_scores[selected_idx], color='crimson', label='Selected features')
plt.xlabel('Feature Index')
plt.ylabel('Mutual Information Score')
plt.title('Mutual Information of Each Feature')
plt.legend()
plt.grid(True, alpha=0.3)
plt.subplot(1, 2, 2)
sorted_idx = np.argsort(mi_scores)[::-1]
plt.plot(range(1, len(mi_scores)+1), mi_scores[sorted_idx], 'o-', linewidth=2)
plt.axvline(x=5, color='r', linestyle='--', label='Number of selections=5')
plt.xlabel('Rank')
plt.ylabel('Mutual Information Score')
plt.title('Rank of Mutual Information Scores')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('mutual_information.png', dpi=150, bbox_inches='tight')
print("\nVisualization of mutual information saved")
4. Information Theory and Machine Learning
4.1 Variational Autoencoder (VAE)
VAE can be understood from an information-theoretic perspective. When learning the relationship between latent variable z and data x, the log marginal likelihood is bounded from below, and the following objective is maximized:
\[
\log p(x) \geq \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}\big(q(z|x)\,\|\,p(z)\big).
\]
The right-hand side is called ELBO (Evidence Lower BOund) and consists of:
- First term (reconstruction term): Data reconstruction quality
- Second term (KL term): Closeness between latent distribution and prior distribution
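When the approximate posterior q(z|x) is a diagonal Gaussian \(\mathcal{N}(\mu, \sigma^2)\) and the prior p(z) is \(\mathcal{N}(0, 1)\), the KL term has a closed form (summed over latent dimensions), which is what the VAE implementation below uses:
\[
D_{KL}\big(\mathcal{N}(\mu,\sigma^2)\,\|\,\mathcal{N}(0,1)\big) = \frac{1}{2}\left(\mu^2 + \sigma^2 - \log\sigma^2 - 1\right).
\]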
4.2 Information Bottleneck Theory
Information bottleneck theory formulates representation learning in information-theoretic terms. For input X and label Y, the representation Z is chosen to minimize
\[
\mathcal{L}_{IB} = I(X;Z) - \beta\, I(Z;Y).
\]
This expresses the tradeoff of "compressing input information (minimizing I(X;Z)) while retaining label information (maximizing I(Z;Y))."
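As a toy illustration (a minimal sketch with discrete variables; the joint table, the encoder matrix, the helper mutual_info, and the choice of β are all made up for demonstration), the snippet below evaluates the bottleneck objective for one fixed stochastic encoder p(z|x).
import numpy as np

def mutual_info(joint):
    """Mutual information I(A;B), in bits, from a joint distribution table P(A,B)."""
    pa = joint.sum(axis=1, keepdims=True)   # P(A) as a column vector
    pb = joint.sum(axis=0, keepdims=True)   # P(B) as a row vector
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask]))

# Illustrative joint distribution P(X, Y): 3 values of X, 2 values of Y
p_xy = np.array([[0.30, 0.05],
                 [0.05, 0.30],
                 [0.15, 0.15]])
p_x = p_xy.sum(axis=1)

# A fixed (made-up) stochastic encoder p(z|x): rows = x, columns = z (2 latent states)
p_z_given_x = np.array([[0.9, 0.1],
                        [0.1, 0.9],
                        [0.5, 0.5]])

# Joint tables involving Z; Z depends on Y only through X (Markov chain Y - X - Z)
p_xz = p_x[:, None] * p_z_given_x          # P(X, Z) = P(X) p(z|x)
p_zy = p_z_given_x.T @ p_xy                # P(Z, Y) = sum_x p(z|x) P(x, y)

beta = 4.0                                 # arbitrary tradeoff weight
i_xz = mutual_info(p_xz)
i_zy = mutual_info(p_zy)
print(f"I(X;Z) = {i_xz:.4f} bits, I(Z;Y) = {i_zy:.4f} bits")
print(f"IB objective I(X;Z) - beta*I(Z;Y) = {i_xz - beta * i_zy:.4f}")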
Implementation Example 4: ELBO Calculation in VAE
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
class VAE(nn.Module):
"""Variational Autoencoder"""
def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
"""
Parameters:
-----------
input_dim : int
Input dimension (e.g., 28x28=784 for MNIST)
hidden_dim : int
Hidden layer dimension
latent_dim : int
Latent variable dimension
"""
super(VAE, self).__init__()
# Encoder q(z|x)
self.fc1 = nn.Linear(input_dim, hidden_dim)
self.fc_mu = nn.Linear(hidden_dim, latent_dim)
self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
# Decoder p(x|z)
self.fc3 = nn.Linear(latent_dim, hidden_dim)
self.fc4 = nn.Linear(hidden_dim, input_dim)
def encode(self, x):
"""
Encoder: x → (μ, log σ²)
Returns:
--------
mu, logvar : Parameters of latent distribution
"""
h = F.relu(self.fc1(x))
mu = self.fc_mu(h)
logvar = self.fc_logvar(h)
return mu, logvar
def reparameterize(self, mu, logvar):
"""
Reparameterization trick: z = μ + σ * ε, ε ~ N(0,1)
This makes stochastic sampling differentiable
"""
std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)
z = mu + eps * std
return z
def decode(self, z):
"""
Decoder: z → x̂
"""
h = F.relu(self.fc3(z))
x_recon = torch.sigmoid(self.fc4(h))
return x_recon
def forward(self, x):
"""
Forward pass: x → z → x̂
"""
mu, logvar = self.encode(x.view(-1, 784))
z = self.reparameterize(mu, logvar)
x_recon = self.decode(z)
return x_recon, mu, logvar
def vae_loss(x, x_recon, mu, logvar, beta=1.0):
"""
VAE loss function (negative ELBO)
ELBO = E[log p(x|z)] - β * D_KL(q(z|x)||p(z))
Parameters:
-----------
x : Tensor
Original data
x_recon : Tensor
Reconstructed data
mu, logvar : Tensor
Parameters of latent distribution
beta : float
Weight of KL term (β-VAE)
Returns:
--------
loss, recon_loss, kl_loss : Total loss, reconstruction loss, KL loss
"""
# Reconstruction loss (negative log likelihood)
# For binary data: BCE loss
recon_loss = F.binary_cross_entropy(
x_recon, x.view(-1, 784), reduction='sum'
)
# KL divergence loss
# D_KL(N(μ,σ²)||N(0,1)) = 0.5 * Σ(μ² + σ² - log(σ²) - 1)
kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
# Total loss (negative ELBO)
total_loss = recon_loss + beta * kl_loss
return total_loss, recon_loss, kl_loss
# Usage example
print("=" * 50)
print("ELBO Calculation in VAE")
print("=" * 50)
# Model initialization
vae = VAE(input_dim=784, hidden_dim=400, latent_dim=20)
# Dummy data (batch size 32, 28x28 images)
torch.manual_seed(42)
x = torch.rand(32, 1, 28, 28)
# Forward pass
x_recon, mu, logvar = vae(x)
# Loss calculation
loss, recon_loss, kl_loss = vae_loss(x, x_recon, mu, logvar, beta=1.0)
print(f"Total loss (-ELBO): {loss.item():.2f}")
print(f" Reconstruction loss: {recon_loss.item():.2f}")
print(f" KL loss: {kl_loss.item():.2f}")
# Check the effect of β
print("\n" + "=" * 50)
print("β-VAE: Effect of β Parameter")
print("=" * 50)
betas = [0.5, 1.0, 2.0, 5.0]
for beta in betas:
loss, recon, kl = vae_loss(x, x_recon, mu, logvar, beta=beta)
print(f"β={beta:.1f}: Total loss={loss.item():.2f}, "
f"Reconstruction={recon.item():.2f}, KL={kl.item():.2f}")
print("\nInterpretation:")
print("- Large β: Emphasizes KL term → Latent space approaches normal distribution")
print("- Small β: Emphasizes reconstruction → Reconstruction quality improves")
5. Practical Applications
5.1 Cross-Entropy Loss Function
The most common loss function in classification problems is the cross-entropy loss
\[
\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c},
\]
where \(y_{i,c}\) is the true label (one-hot) and \(\hat{y}_{i,c}\) is the model's predicted probability for sample i and class c.
5.2 KL Loss and Label Smoothing
Label smoothing is a regularization technique that prevents overconfidence. With smoothing parameter ε and K classes, it replaces a hard label such as [0, 1, 0] with [ε/K, 1 - ε + ε/K, ε/K]. (The implementation example below uses the closely related variant that spreads ε over the K - 1 incorrect classes.)
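As a small illustration of this transformation (a minimal sketch; smooth_labels is our own helper, not a library function), with ε = 0.1 and K = 3 the hard label [0, 1, 0] becomes approximately [0.033, 0.933, 0.033]:
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Uniform-mixture label smoothing: (1 - eps) * one_hot + eps / K."""
    k = len(one_hot)
    return (1.0 - eps) * np.asarray(one_hot, dtype=float) + eps / k

print(smooth_labels([0, 1, 0], eps=0.1))   # ≈ [0.0333, 0.9333, 0.0333]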
Implementation Example 5: Cross-Entropy and KL Loss
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
class LossFunctions:
"""Loss functions based on information theory"""
@staticmethod
def cross_entropy_loss(logits, targets):
"""
Cross-entropy loss
Parameters:
-----------
logits : Tensor of shape (batch_size, num_classes)
Model output (before softmax)
targets : Tensor of shape (batch_size,)
True class labels
Returns:
--------
loss : Tensor
Cross-entropy loss
"""
return F.cross_entropy(logits, targets)
@staticmethod
def kl_div_loss(logits, target_dist):
"""
KL divergence loss
Calculate D_KL(target || pred)
Parameters:
-----------
logits : Tensor
Model output
target_dist : Tensor
Target distribution (probability distribution)
Returns:
--------
loss : Tensor
KL loss
"""
log_pred = F.log_softmax(logits, dim=-1)
return F.kl_div(log_pred, target_dist, reduction='batchmean')
@staticmethod
def label_smoothing_loss(logits, targets, smoothing=0.1):
"""
Cross-entropy with label smoothing
Parameters:
-----------
smoothing : float
Smoothing parameter (0: none, 1: completely uniform)
"""
n_classes = logits.size(-1)
log_pred = F.log_softmax(logits, dim=-1)
        # Label smoothing (variant that spreads ε over the K - 1 incorrect classes)
        #   True class: 1 - ε
        #   Other classes: ε / (K - 1)
with torch.no_grad():
true_dist = torch.zeros_like(log_pred)
true_dist.fill_(smoothing / (n_classes - 1))
true_dist.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
return torch.mean(torch.sum(-true_dist * log_pred, dim=-1))
# Usage Example 1: Comparison of loss functions
print("=" * 50)
print("Comparison of Loss Functions")
print("=" * 50)
torch.manual_seed(42)
# Data preparation
batch_size = 4
num_classes = 3
logits = torch.randn(batch_size, num_classes) * 2
targets = torch.tensor([0, 1, 2, 1])
# Calculate each loss
ce_loss = LossFunctions.cross_entropy_loss(logits, targets)
# Label smoothing loss
ls_loss_01 = LossFunctions.label_smoothing_loss(logits, targets, smoothing=0.1)
ls_loss_03 = LossFunctions.label_smoothing_loss(logits, targets, smoothing=0.3)
print(f"Cross-entropy loss: {ce_loss.item():.4f}")
print(f"Label smoothing loss (ε=0.1): {ls_loss_01.item():.4f}")
print(f"Label smoothing loss (ε=0.3): {ls_loss_03.item():.4f}")
# Display prediction probabilities
probs = F.softmax(logits, dim=-1)
print(f"\nPrediction probabilities:")
for i in range(batch_size):
print(f" Sample {i} (true label={targets[i]}): {probs[i].numpy()}")
# Usage Example 2: Relationship between confidence and loss
print("\n" + "=" * 50)
print("Relationship Between Model Confidence and Loss")
print("=" * 50)
# Vary confidence
confidences = np.linspace(0.1, 0.99, 50)
losses = []
for conf in confidences:
# Create logits with probability conf for the correct class
# [conf, (1-conf)/2, (1-conf)/2]
logits_conf = torch.tensor([[
np.log(conf),
np.log((1-conf)/2),
np.log((1-conf)/2)
]])
target = torch.tensor([0])
loss = LossFunctions.cross_entropy_loss(logits_conf, target)
losses.append(loss.item())
# Visualization
plt.figure(figsize=(10, 5))
plt.plot(confidences, losses, 'b-', linewidth=2)
plt.xlabel('Predicted Probability for Correct Class')
plt.ylabel('Cross-Entropy Loss')
plt.title('Relationship Between Prediction Confidence and Loss')
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='r', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.savefig('confidence_loss.png', dpi=150, bbox_inches='tight')
print("Visualized the relationship between confidence and loss")
print(f"\nObservations:")
print(f"- Loss at confidence 0.5: {losses[20]:.4f}")
print(f"- Loss at confidence 0.9: {losses[40]:.4f}")
print(f"- Loss at confidence 0.99: {losses[-1]:.4f}")
print("→ Loss decreases as confidence increases (exponential decrease)")
5.3 ELBO (Evidence Lower Bound)
The ELBO, which is central to training latent-variable generative models, provides a lower bound on the log marginal likelihood:
\[
\log p(x) = \mathrm{ELBO}(x) + D_{KL}\big(q(z|x)\,\|\,p(z|x)\big) \geq \mathrm{ELBO}(x).
\]
Implementation Example 6: Detailed ELBO Calculation
import torch
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
class ELBOAnalysis:
"""Detailed analysis of ELBO"""
@staticmethod
def compute_elbo_components(x, x_recon, mu, logvar, n_samples=1000):
"""
Calculate each component of ELBO in detail
Parameters:
-----------
x : Tensor
Original data
x_recon : Tensor
Reconstructed data
mu, logvar : Tensor
Encoder output (parameters of latent distribution)
n_samples : int
Number of Monte Carlo samples
Returns:
--------
dict : Components of ELBO
"""
batch_size = x.size(0)
latent_dim = mu.size(1)
# 1. Reconstruction term: E_q[log p(x|z)]
# Log likelihood for binary data
recon_term = -F.binary_cross_entropy(
x_recon, x.view(batch_size, -1), reduction='sum'
) / batch_size
# 2. KL term (analytical calculation): D_KL(q(z|x)||p(z))
# q(z|x) = N(μ, σ²), p(z) = N(0, I)
kl_term = -0.5 * torch.sum(
1 + logvar - mu.pow(2) - logvar.exp()
) / batch_size
# 3. ELBO calculation
elbo = recon_term - kl_term
# 4. Verification by Monte Carlo estimation
# Sample estimation of E_q[log p(x|z)]
std = torch.exp(0.5 * logvar)
recon_mc = 0
for _ in range(n_samples):
eps = torch.randn_like(std)
z = mu + eps * std
            # Simplified: a full Monte Carlo estimate would decode each sampled z;
            # here the fixed x_recon is reused, so this matches the analytic term
recon_mc += -F.binary_cross_entropy(
x_recon, x.view(batch_size, -1), reduction='sum'
)
recon_mc = recon_mc / (n_samples * batch_size)
return {
'elbo': elbo.item(),
'reconstruction': recon_term.item(),
'kl_divergence': kl_term.item(),
'reconstruction_mc': recon_mc.item(),
'log_marginal_lower_bound': elbo.item()
}
@staticmethod
def analyze_latent_distribution(mu, logvar):
"""
Analyze statistics of latent distribution
Returns:
--------
dict : Statistics
"""
std = torch.exp(0.5 * logvar)
return {
'mean_mu': mu.mean().item(),
'std_mu': mu.std().item(),
'mean_sigma': std.mean().item(),
'std_sigma': std.std().item(),
'min_sigma': std.min().item(),
'max_sigma': std.max().item()
}
# Usage example
print("=" * 50)
print("Detailed Analysis of ELBO")
print("=" * 50)
# Dummy VAE model output
torch.manual_seed(42)
batch_size = 16
input_dim = 784
latent_dim = 20
x = torch.rand(batch_size, 1, 28, 28)
mu = torch.randn(batch_size, latent_dim) * 0.5
logvar = torch.randn(batch_size, latent_dim) * 0.5
# Reparameterization
std = torch.exp(0.5 * logvar)
z = mu + torch.randn_like(std) * std
x_recon = torch.sigmoid(torch.randn(batch_size, input_dim))
# ELBO calculation
elbo_components = ELBOAnalysis.compute_elbo_components(
x, x_recon, mu, logvar, n_samples=100
)
print("ELBO components:")
for key, value in elbo_components.items():
print(f" {key}: {value:.4f}")
# Analysis of latent distribution
latent_stats = ELBOAnalysis.analyze_latent_distribution(mu, logvar)
print("\nStatistics of latent distribution:")
for key, value in latent_stats.items():
print(f" {key}: {value:.4f}")
# Visualization: μ and σ for each latent dimension
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Distribution of mean μ
axes[0].hist(mu.detach().numpy().flatten(), bins=30, alpha=0.7, color='blue')
axes[0].set_xlabel('μ')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Latent Variable Mean (μ)')
axes[0].axvline(x=0, color='r', linestyle='--', label='Prior mean')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Distribution of standard deviation σ
axes[1].hist(std.detach().numpy().flatten(), bins=30, alpha=0.7, color='green')
axes[1].set_xlabel('σ')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Latent Variable Standard Deviation (σ)')
axes[1].axvline(x=1, color='r', linestyle='--', label='Prior standard deviation')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('elbo_analysis.png', dpi=150, bbox_inches='tight')
print("\nSaved visualization of ELBO analysis")
print("\n" + "=" * 50)
print("Information-Theoretic Interpretation")
print("=" * 50)
print("ELBO = Reconstruction term - KL term")
print(" Reconstruction term: Ability to explain data with latent variables")
print(" KL term: How far the latent distribution deviates from the prior")
print(" → Tradeoff: Good reconstruction vs regularized latent space")
Summary
In this chapter, we learned the fundamentals of information theory that support machine learning.
- Entropy: Quantification of uncertainty and conditional entropy
- KL Divergence: Asymmetric metric measuring the difference between probability distributions
- Cross-Entropy: Theoretical foundation of loss functions in classification problems
- Mutual Information: Dependence between variables and application to feature selection
- VAE and ELBO: Information-theoretic interpretation of generative models
Exercises
- Calculate and compare the entropy of a fair six-sided die and a biased coin with P(heads) = 0.8
- Analytically calculate the KL divergence between two Gaussian distributions N(0,1) and N(2,2)
- Explain what kind of relationship exists between X and Y when mutual information I(X;Y)=0
- Experimentally verify the difference in latent space when β=0.5, 1.0, 2.0 in β-VAE
- Explain from an information-theoretic perspective why label smoothing prevents model overconfidence
References
- Claude E. Shannon, "A Mathematical Theory of Communication" (1948)
- Thomas M. Cover and Joy A. Thomas, "Elements of Information Theory" (2006)
- D.P. Kingma and M. Welling, "Auto-Encoding Variational Bayes" (2013)
- Naftali Tishby et al., "The Information Bottleneck Method" (2000)
Disclaimer
- This content is provided solely for educational, research, and informational purposes and does not constitute professional advice (legal, accounting, technical warranty, etc.).
- This content and accompanying code examples are provided "AS IS" without any warranty, express or implied, including but not limited to merchantability, fitness for a particular purpose, non-infringement, accuracy, completeness, operation, or safety.
- The author and Tohoku University assume no responsibility for the content, availability, or safety of external links, third-party data, tools, libraries, etc.
- To the maximum extent permitted by applicable law, the author and Tohoku University shall not be liable for any direct, indirect, incidental, special, consequential, or punitive damages arising from the use, execution, or interpretation of this content.
- The content may be changed, updated, or discontinued without notice.
- The copyright and license of this content are subject to the stated conditions (e.g., CC BY 4.0). Such licenses typically include no-warranty clauses.