## Introduction

The Transformer architecture is the foundation of virtually all modern Large Language Models (LLMs). In this chapter, we'll dive deep into the core mechanisms that make LLMs work, including the Self-Attention mechanism, various position encoding schemes, and the Mixture of Experts (MoE) architecture that powers models like DeepSeek and Mixtral.
## What You'll Learn
- How Self-Attention computes relationships between all tokens
- Multi-Head Attention and why it matters
- Position encoding: from sinusoidal to RoPE and ALiBi
- Decoder-Only vs Encoder-Decoder architectures
- NEW: Mixture of Experts (MoE) - the secret behind efficient giant models
- NEW: FlashAttention and efficient attention mechanisms
## 2.1 Self-Attention Mechanism

### The Core Idea
Self-Attention (also called Scaled Dot-Product Attention) allows each token in a sequence to attend to all other tokens, capturing dependencies regardless of their distance in the sequence.
The key insight is: instead of processing tokens sequentially like RNNs, Self-Attention computes relationships between all pairs of tokens in parallel.
### Mathematical Formulation
Given an input sequence, Self-Attention computes three matrices:
- Query (Q): "What am I looking for?"
- Key (K): "What information do I contain?"
- Value (V): "What information do I provide?"
The attention output is computed as:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Where \(d_k\) is the dimension of the key vectors. The \(\sqrt{d_k}\) scaling prevents the dot products from becoming too large.
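To see why this scaling matters, here is a small standalone sanity check: for random vectors with unit-variance entries, raw dot products have a standard deviation of roughly \(\sqrt{d_k}\), which would push the softmax into a saturated, low-gradient regime; dividing by \(\sqrt{d_k}\) brings it back to about 1.

```python
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(10_000, d_k)
k = torch.randn(10_000, d_k)

raw = (q * k).sum(dim=-1)       # unscaled dot products
scaled = raw / d_k ** 0.5       # scaled as in the attention formula

print(f"std of raw scores:    {raw.std():.1f}")     # ~ sqrt(512) ~ 22.6
print(f"std of scaled scores: {scaled.std():.2f}")  # ~ 1.0
```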
### Implementation in Python

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    """Scaled Dot-Product Self-Attention"""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.embed_dim = embed_dim
        # Linear projections for Q, K, V
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        # Output projection
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        # Scaling factor
        self.scale = math.sqrt(embed_dim)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        """
        Args:
            x: Input tensor of shape (batch, seq_len, embed_dim)
            mask: Optional attention mask
        Returns:
            Output tensor of shape (batch, seq_len, embed_dim)
        """
        # Compute Q, K, V
        Q = self.q_proj(x)  # (batch, seq_len, embed_dim)
        K = self.k_proj(x)
        V = self.v_proj(x)
        # Compute attention scores
        # (batch, seq_len, embed_dim) @ (batch, embed_dim, seq_len)
        # = (batch, seq_len, seq_len)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        # Apply mask if provided (for causal/autoregressive models)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        # Softmax to get attention weights
        attn_weights = F.softmax(scores, dim=-1)
        # Apply attention to values
        output = torch.matmul(attn_weights, V)
        # Final projection
        return self.out_proj(output)

# Example usage
batch_size, seq_len, embed_dim = 2, 10, 64
x = torch.randn(batch_size, seq_len, embed_dim)
attention = SelfAttention(embed_dim)
output = attention(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
# Input shape: torch.Size([2, 10, 64])
# Output shape: torch.Size([2, 10, 64])
```
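The `mask` argument above is how causal (autoregressive) attention is implemented. A minimal standalone sketch of building a lower-triangular causal mask and checking that it zeroes out attention to future positions:

```python
import torch
import torch.nn.functional as F

seq_len = 5
# 1 where attention is allowed: position i may attend to positions j <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(causal_mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)

print(weights.triu(diagonal=1).sum())  # weight on future tokens: exactly 0
print(weights.sum(dim=-1))             # each row still sums to 1
```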
## 2.2 Multi-Head Attention

### Why Multiple Heads?
A single attention head can only focus on one type of relationship at a time. Multi-Head Attention allows the model to jointly attend to information from different representation subspaces.
### Intuition
Think of it like having multiple "experts" looking at the same text:
- Head 1 might focus on syntactic relationships (subject-verb)
- Head 2 might capture semantic similarity
- Head 3 might track coreference (pronouns to nouns)
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$
where \(\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\)
```python
class MultiHeadAttention(nn.Module):
    """Multi-Head Self-Attention as used in the Transformer"""

    def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # Combined projection for Q, K, V (more efficient than three separate ones)
        self.qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)
        self.scale = math.sqrt(self.head_dim)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        batch_size, seq_len, _ = x.shape
        # Project and reshape for multi-head
        qkv = self.qkv_proj(x)  # (batch, seq_len, 3 * embed_dim)
        qkv = qkv.reshape(batch_size, seq_len, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, batch, heads, seq_len, head_dim)
        Q, K, V = qkv[0], qkv[1], qkv[2]
        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        # Apply attention to values
        output = torch.matmul(attn_weights, V)
        # Reshape back to (batch, seq_len, embed_dim) and project
        output = output.transpose(1, 2).reshape(batch_size, seq_len, self.embed_dim)
        return self.out_proj(output)

# Example: 8-head attention
mha = MultiHeadAttention(embed_dim=512, num_heads=8)
x = torch.randn(2, 100, 512)  # batch=2, seq_len=100
output = mha(x)
print(f"Multi-head output shape: {output.shape}")
# Multi-head output shape: torch.Size([2, 100, 512])
```
## 2.3 Position Encoding
Self-Attention is permutation-invariant - it doesn't inherently know the order of tokens. Position encodings inject positional information into the model.
### Sinusoidal Position Encoding (Original Transformer)
The original Transformer used sinusoidal functions:
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
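The two formulas above translate directly into code; a minimal sketch (the function name `sinusoidal_pe` is ours):

```python
import torch

def sinusoidal_pe(max_len: int, d_model: int, base: float = 10000.0):
    """Return a (max_len, d_model) table of sinusoidal position encodings."""
    pos = torch.arange(max_len).float().unsqueeze(1)   # (max_len, 1)
    two_i = torch.arange(0, d_model, 2).float()        # (d_model/2,)
    angles = pos / (base ** (two_i / d_model))         # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)                    # even dimensions: sine
    pe[:, 1::2] = torch.cos(angles)                    # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(128, 64)
print(pe.shape)   # torch.Size([128, 64])
print(pe[0, :4])  # position 0 alternates sin(0)=0 and cos(0)=1
```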
### RoPE (Rotary Position Embedding)
RoPE is the position encoding used by most modern LLMs including LLaMA, Mistral, and Qwen. It encodes position by rotating the query and key vectors.
#### Why RoPE is Superior
- Relative position: Naturally captures relative distances between tokens
- Length extrapolation: Can generalize to sequences longer than those seen in training
- Efficient: Applied only to Q and K, not V
```python
def apply_rope(x: torch.Tensor, freqs_cos: torch.Tensor, freqs_sin: torch.Tensor):
    """Apply Rotary Position Embedding (RoPE).

    Args:
        x: Query or Key tensor of shape (batch, heads, seq_len, head_dim)
        freqs_cos, freqs_sin: Precomputed cosine and sine frequencies
    """
    # Split x into interleaved pairs for rotation
    x1 = x[..., ::2]   # Even indices
    x2 = x[..., 1::2]  # Odd indices
    # Apply rotation:
    # [x1, x2] * [cos, -sin; sin, cos] = [x1*cos - x2*sin, x1*sin + x2*cos]
    rotated = torch.stack([
        x1 * freqs_cos - x2 * freqs_sin,
        x1 * freqs_sin + x2 * freqs_cos
    ], dim=-1).flatten(-2)
    return rotated

def precompute_rope_freqs(dim: int, max_seq_len: int, base: float = 10000.0):
    """Precompute RoPE frequencies"""
    # Compute inverse frequencies
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # Compute position indices
    positions = torch.arange(max_seq_len).float()
    # Outer product to get a (seq_len, dim/2) frequency matrix
    freqs = torch.outer(positions, inv_freq)
    return torch.cos(freqs), torch.sin(freqs)

# Example
head_dim = 64
max_len = 2048
freqs_cos, freqs_sin = precompute_rope_freqs(head_dim, max_len)
print(f"RoPE frequency shapes: cos={freqs_cos.shape}, sin={freqs_sin.shape}")
# RoPE frequency shapes: cos=torch.Size([2048, 32]), sin=torch.Size([2048, 32])
```
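The "relative position" claim can be checked numerically. In this standalone sketch (the helper `rope_rotate` is ours), the score between a rotated query and key is unchanged when both positions shift by the same offset:

```python
import torch

def rope_rotate(x: torch.Tensor, pos: float, dim: int, base: float = 10000.0):
    """Rotate a single (dim,) vector as RoPE would at position `pos`."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = pos * inv_freq                      # one angle per pair of dims
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[::2], x[1::2]
    return torch.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], dim=-1).flatten()

torch.manual_seed(0)
dim = 8
q, k = torch.randn(dim), torch.randn(dim)

# Same relative offset (3) at two different absolute positions
s1 = torch.dot(rope_rotate(q, 5, dim), rope_rotate(k, 2, dim))
s2 = torch.dot(rope_rotate(q, 105, dim), rope_rotate(k, 102, dim))
print(torch.allclose(s1, s2, atol=1e-4))  # True: the score depends only on the offset
```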
### ALiBi (Attention with Linear Biases)
ALiBi adds a linear bias to attention scores based on distance, without modifying embeddings:
$$\text{softmax}(q_i \cdot k_j - m \cdot |i - j|)$$
where \(m\) is a head-specific slope. ALiBi is used by models like BLOOM and MPT.
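A minimal sketch of constructing ALiBi's per-head bias matrix (the slopes follow the geometric sequence from the ALiBi paper; in a causal model only the \(j \le i\) entries matter, since future positions are masked anyway):

```python
import torch

def alibi_bias(num_heads: int, seq_len: int):
    """Return (num_heads, seq_len, seq_len) additive attention biases."""
    # Head-specific slopes: 2^(-8/num_heads), 2^(-16/num_heads), ...
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads)
                           for h in range(num_heads)])
    pos = torch.arange(seq_len)
    rel = pos[None, :] - pos[:, None]   # rel[i, j] = j - i (<= 0 for past tokens)
    return slopes[:, None, None] * rel  # bias[h, i, j] = -m_h * (i - j)

bias = alibi_bias(num_heads=8, seq_len=6)
print(bias.shape)      # torch.Size([8, 6, 6])
print(bias[0, 3, :4])  # distances 3, 2, 1, 0 scaled by slope m_0 = 0.5
```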
## 2.4 Decoder-Only vs Encoder-Decoder

### Architecture Comparison
| Architecture | Models | Use Case | Attention Type |
|---|---|---|---|
| Encoder-Only | BERT, RoBERTa | Understanding, Classification | Bidirectional |
| Decoder-Only | GPT, LLaMA, Claude | Generation | Causal (unidirectional) |
| Encoder-Decoder | T5, BART | Translation, Summarization | Mixed |
### Why Decoder-Only Dominates (2024-2025)
Most modern LLMs use Decoder-Only architecture because:
- Unified training: Same objective (next token prediction) for all tasks
- Simplicity: Fewer components, easier to scale
- Emergence: Scaling decoder-only models shows remarkable emergent abilities
- In-context learning: Natural fit for few-shot prompting
## 2.5 Mixture of Experts (MoE)
**2025-2026 key development:** Mixture of Experts has become a dominant architecture for frontier models. DeepSeek-V3, Mixtral, and Llama 4 all use MoE to achieve better performance per unit of compute.
### What is MoE?
Mixture of Experts (MoE) replaces the dense Feed-Forward Network (FFN) layer with multiple "expert" networks. A router network selects which experts to activate for each token.
### Key Benefits
- Compute Efficiency: Often only a small fraction (roughly 5-40%) of parameters are active per token
- Parameter Scaling: Total parameter counts can scale to the trillions
- Specialization: Different experts learn different skills
### MoE Implementation

```python
class MoELayer(nn.Module):
    """Mixture of Experts Layer (Simplified)"""

    def __init__(
        self,
        embed_dim: int,
        num_experts: int = 8,
        top_k: int = 2,
        expert_dim: int = None
    ):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        expert_dim = expert_dim or embed_dim * 4
        # Router: determines which experts to use
        self.router = nn.Linear(embed_dim, num_experts, bias=False)
        # Expert networks (simplified as MLPs)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(embed_dim, expert_dim),
                nn.GELU(),
                nn.Linear(expert_dim, embed_dim)
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor):
        """
        Args:
            x: Input tensor of shape (batch, seq_len, embed_dim)
        """
        batch_size, seq_len, embed_dim = x.shape
        # Compute router logits
        router_logits = self.router(x)  # (batch, seq_len, num_experts)
        # Get top-k experts for each token
        router_probs = F.softmax(router_logits, dim=-1)
        top_k_probs, top_k_indices = torch.topk(router_probs, self.top_k, dim=-1)
        # Renormalize the top-k probabilities so they sum to 1
        top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)
        # Initialize output
        output = torch.zeros_like(x)
        # Process each expert (simplified - real implementations are more efficient)
        for expert_idx in range(self.num_experts):
            # Find tokens routed to this expert
            expert_mask = (top_k_indices == expert_idx).any(dim=-1)
            if expert_mask.any():
                # Get this expert's routing weight for each token
                weight_mask = (top_k_indices == expert_idx)
                expert_weights = (top_k_probs * weight_mask.float()).sum(dim=-1)
                # Apply expert
                expert_input = x[expert_mask]
                expert_output = self.experts[expert_idx](expert_input)
                # Add weighted expert output
                output[expert_mask] += expert_weights[expert_mask].unsqueeze(-1) * expert_output
        return output

# Example: MoE with 8 experts, top-2 routing
moe = MoELayer(embed_dim=512, num_experts=8, top_k=2)
x = torch.randn(2, 100, 512)
output = moe(x)
print(f"MoE output shape: {output.shape}")
# MoE output shape: torch.Size([2, 100, 512])
```
### DeepSeek's Innovation: Shared + Routed Experts
DeepSeek introduced a novel MoE variant with two types of experts:
- Shared Experts: Always activated, learn common knowledge (grammar, etc.)
- Routed Experts: Dynamically selected, learn specialized knowledge
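A conceptual sketch of the idea (not DeepSeek's actual implementation; top-1 routing and a single shared expert for brevity): the shared expert always runs, and each token adds the output of one routed expert on top.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedRoutedMoE(nn.Module):
    """Simplified shared + routed MoE sketch (class name and sizes are ours)."""

    def __init__(self, embed_dim: int, num_routed: int = 4, hidden: int = 128):
        super().__init__()
        mlp = lambda: nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.GELU(), nn.Linear(hidden, embed_dim))
        self.shared = mlp()                                    # always active
        self.routed = nn.ModuleList(mlp() for _ in range(num_routed))
        self.router = nn.Linear(embed_dim, num_routed, bias=False)

    def forward(self, x: torch.Tensor):
        # Shared path: common knowledge, applied to every token
        out = self.shared(x)
        # Routed path: pick one specialized expert per token (top-1 here)
        probs = F.softmax(self.router(x), dim=-1)
        top_p, top_i = probs.max(dim=-1)
        for e in range(len(self.routed)):
            m = top_i == e
            if m.any():
                out[m] = out[m] + top_p[m].unsqueeze(-1) * self.routed[e](x[m])
        return out

x = torch.randn(2, 10, 32)
out = SharedRoutedMoE(embed_dim=32)(x)
print(out.shape)  # torch.Size([2, 10, 32])
```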
#### DeepSeek MoE Results
DeepSeekMoE 16B achieves comparable performance to DeepSeek 7B with only 40.5% of computations, and outperforms LLaMA2 7B on most benchmarks with 39.6% of computations.
| Model | Total Params | Active Params | Architecture |
|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | 8 experts, top-2 |
| DeepSeek-V2 | 236B | 21B | Shared + Routed |
| DeepSeek-V3 | 685B | ~37B | Fine-grained MoE |
| Llama 4 Scout | 109B | ~17B | 16 experts, top-1 |
## 2.6 Efficient Attention Mechanisms

### The Quadratic Problem
Standard self-attention has O(n^2) time and memory complexity in the sequence length n, making it expensive for long contexts. With context windows of a million tokens or more appearing in 2025-era models, efficient attention is crucial.
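To make the quadratic cost concrete, here is a back-of-the-envelope calculation of the memory needed just to materialize one fp16 attention matrix, per head:

```python
# seq_len x seq_len matrix, 2 bytes per fp16 entry, per head (and per layer)
for n in [1_024, 8_192, 131_072]:
    gib = n * n * 2 / 2**30
    print(f"seq_len={n:>7}: {gib:10.4f} GiB per attention matrix")
# At 131k tokens a single head's matrix is already 32 GiB, which is why
# FlashAttention avoids materializing it at all.
```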
### FlashAttention
FlashAttention (Dao et al., 2022) is an IO-aware algorithm that computes exact attention 2-4x faster with less memory by:
- Tiling: Processing attention in blocks that fit in fast SRAM
- Recomputation: Trading compute for memory by not storing attention matrices
- Kernel fusion: Combining operations to reduce memory bandwidth
#### FlashAttention in Practice
PyTorch 2.0+ exposes FlashAttention through torch.nn.functional.scaled_dot_product_attention, which automatically dispatches to a FlashAttention kernel when the hardware, dtypes, and arguments allow it.
```python
# Using FlashAttention via PyTorch 2.0+
import torch.nn.functional as F

def efficient_attention(Q, K, V, is_causal=True):
    """Use PyTorch's optimized attention (dispatches to FlashAttention when available)"""
    return F.scaled_dot_product_attention(
        Q, K, V,
        is_causal=is_causal,
        dropout_p=0.0,  # No dropout for inference
        scale=None      # Auto-compute scale as 1/sqrt(head_dim)
    )

# Example
Q = torch.randn(2, 8, 1024, 64)  # batch, heads, seq_len, head_dim
K = torch.randn(2, 8, 1024, 64)
V = torch.randn(2, 8, 1024, 64)
# This automatically uses FlashAttention on compatible hardware
output = efficient_attention(Q, K, V)
print(f"Efficient attention output: {output.shape}")
# Efficient attention output: torch.Size([2, 8, 1024, 64])
```
### Other Efficient Attention Variants
| Method | Complexity | Used By |
|---|---|---|
| Standard Attention | O(n^2) | Original Transformer |
| FlashAttention | O(n^2) but IO-efficient | LLaMA, Mistral, etc. |
| Sparse Attention | O(n * sqrt(n)) | GPT-3, Longformer |
| Linear Attention / SSMs | O(n) | Mamba, RWKV |
| Ring Attention | O(n^2 / devices) | Distributed training |
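As an example of the O(n) family, here is a minimal non-causal linear attention sketch in the style of Katharopoulos et al. (2020): replacing the softmax with a kernel feature map lets us reassociate \((QK^T)V\) as \(Q(K^TV)\), so the \(n \times n\) matrix is never materialized.

```python
import torch
import torch.nn.functional as F

def linear_attention(Q, K, V, eps=1e-6):
    """Non-causal linear attention with feature map phi(x) = elu(x) + 1.
    Q, K: (batch, heads, n, d); V: (batch, heads, n, e)."""
    Q, K = F.elu(Q) + 1, F.elu(K) + 1          # nonnegative features
    # (Q K^T) V == Q (K^T V); the right-hand side costs O(n*d*e), not O(n^2*d)
    KV = torch.einsum('bhnd,bhne->bhde', K, V)
    # Per-query normalizer: phi(q_i) . sum_j phi(k_j)
    Z = 1.0 / (torch.einsum('bhnd,bhd->bhn', Q, K.sum(dim=2)) + eps)
    return torch.einsum('bhnd,bhde,bhn->bhne', Q, KV, Z)

Q = torch.randn(2, 8, 1024, 64)
K = torch.randn(2, 8, 1024, 64)
V = torch.randn(2, 8, 1024, 64)
out = linear_attention(Q, K, V)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```

Causal variants require a running prefix sum over `K^T V`, which is what RWKV-style recurrences exploit.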
## 2.7 Chapter Summary

### Key Takeaways
- Self-Attention computes relationships between all token pairs, enabling long-range dependencies
- Multi-Head Attention allows models to attend to multiple aspects simultaneously
- RoPE is the dominant position encoding for modern LLMs, enabling length extrapolation
- Decoder-Only architecture dominates due to simplicity and scaling properties
- MoE enables massive models with sparse activation, used by DeepSeek, Mixtral, Llama 4
- FlashAttention makes long-context models practical through IO-aware computation
### Exercises

#### Exercise 1: Attention Visualization
Implement a function that visualizes attention weights as a heatmap. Use it to see how a simple self-attention layer attends to tokens in the sentence "The cat sat on the mat".
**Solution**
```python
import matplotlib.pyplot as plt
import seaborn as sns

def visualize_attention(attn_weights, tokens):
    """Visualize attention weights as a heatmap"""
    plt.figure(figsize=(8, 6))
    sns.heatmap(
        attn_weights.detach().numpy(),
        xticklabels=tokens,
        yticklabels=tokens,
        cmap='Blues',
        annot=True,
        fmt='.2f'
    )
    plt.xlabel('Key Tokens')
    plt.ylabel('Query Tokens')
    plt.title('Self-Attention Weights')
    plt.tight_layout()
    plt.show()

# Example (would need an actual model to get real weights)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
# attn_weights would come from your attention layer
# visualize_attention(attn_weights[0, 0], tokens)  # First batch, first head
```
#### Exercise 2: RoPE vs Sinusoidal
Compare the positional embeddings from sinusoidal encoding and RoPE. Plot the first 8 dimensions for positions 0-100.
#### Exercise 3: MoE Load Balancing
Implement a load balancing loss for MoE that encourages even distribution of tokens across experts:
$$\mathcal{L}_{balance} = N \cdot \sum_{i=1}^{N} f_i \cdot P_i$$
where \(f_i\) is the fraction of tokens routed to expert \(i\) and \(P_i\) is the average routing probability for expert \(i\).
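For reference, here is a minimal sketch in the style of the Switch Transformer auxiliary loss (the function name is ours); with top-k routing, \(f_i\) counts a token once for each expert in its top-k set, so the loss is minimized when tokens are spread evenly:

```python
import torch
import torch.nn.functional as F

def load_balance_loss(router_logits: torch.Tensor, top_k: int):
    """router_logits: (num_tokens, num_experts)."""
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)                 # (tokens, experts)
    top_k_indices = probs.topk(top_k, dim=-1).indices        # (tokens, top_k)
    # f_i: fraction of routing assignments given to expert i
    one_hot = F.one_hot(top_k_indices, num_experts).float()  # (tokens, top_k, experts)
    f = one_hot.sum(dim=1).mean(dim=0)
    # P_i: average routing probability for expert i
    P = probs.mean(dim=0)
    return num_experts * (f * P).sum()

logits = torch.randn(1000, 8)
loss = load_balance_loss(logits, top_k=2)
print(loss)
```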
#### Exercise 4: Attention Complexity
Measure the time and memory usage of standard attention vs FlashAttention for sequence lengths [512, 1024, 2048, 4096]. Plot the results.
### Next Chapter
In the next chapter, we'll explore LLM Training and Alignment - how models are pre-trained on massive datasets, and how techniques like RLHF, DPO, and Constitutional AI align them with human values. We'll also cover the revolutionary inference-time scaling that powers reasoning models like o1 and o3.