Chapter 2: Transformer Architecture

Self-Attention, Position Encoding, MoE, and Efficient Attention

Reading Time: 30-35 min | Difficulty: Intermediate | Code Examples: 8 | Exercises: 4

Introduction

The Transformer architecture is the foundation of virtually all modern Large Language Models (LLMs). In this chapter, we'll dive deep into the core mechanisms that make LLMs work, including the Self-Attention mechanism, various position encoding schemes, and the Mixture of Experts (MoE) architecture that powers models like DeepSeek and Mixtral.

What You'll Learn

- How scaled dot-product Self-Attention works, with a from-scratch PyTorch implementation
- Why Multi-Head Attention helps, and how position encodings (sinusoidal, RoPE, ALiBi) inject order information
- The trade-offs between Encoder-Only, Decoder-Only, and Encoder-Decoder architectures
- How Mixture of Experts (MoE) scales model capacity through sparse activation
- How FlashAttention and other efficient attention variants tame the quadratic cost

2.1 Self-Attention Mechanism

The Core Idea

Self-Attention (typically implemented as Scaled Dot-Product Attention) allows each token in a sequence to attend to all other tokens, capturing dependencies regardless of their distance in the sequence.

The key insight: instead of processing tokens sequentially as RNNs do, Self-Attention computes relationships between all pairs of tokens in parallel.

Mathematical Formulation

Given an input sequence \(X\), Self-Attention computes three matrices via learned linear projections:

- Query: \(Q = XW^Q\), representing what each token is looking for
- Key: \(K = XW^K\), representing what each token offers for matching
- Value: \(V = XW^V\), the content each token contributes to the output

The attention output is computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where \(d_k\) is the dimension of the key vectors. The \(\sqrt{d_k}\) scaling prevents the dot products from growing with \(d_k\); without it, large scores would push the softmax into saturated regions with near-zero gradients.
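Why the scale matters can be seen numerically: for random vectors with unit-variance components, the variance of a dot product grows linearly with \(d_k\). A quick sketch (the sample count here is arbitrary):

```python
import torch

torch.manual_seed(0)
d_k = 64
# 10,000 random query/key pairs with unit-variance components
q = torch.randn(10_000, d_k)
k = torch.randn(10_000, d_k)

dots = (q * k).sum(dim=-1)       # raw dot products: variance grows ~ d_k
scaled = dots / (d_k ** 0.5)     # after scaling: variance ~ 1

print(f"raw variance:    {dots.var().item():.1f}")    # close to 64
print(f"scaled variance: {scaled.var().item():.2f}")  # close to 1
```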

graph LR
    A[Input X] --> B[Linear: Q]
    A --> C[Linear: K]
    A --> D[Linear: V]
    B --> E[MatMul Q*K^T]
    C --> E
    E --> F[Scale by sqrt dk]
    F --> G[Softmax]
    G --> H[MatMul with V]
    D --> H
    H --> I[Output]
    style A fill:#e3f2fd
    style I fill:#e8f5e9

Implementation in Python

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    """Scaled Dot-Product Self-Attention"""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.embed_dim = embed_dim

        # Linear projections for Q, K, V
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)

        # Output projection
        self.out_proj = nn.Linear(embed_dim, embed_dim)

        # Scaling factor
        self.scale = math.sqrt(embed_dim)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        """
        Args:
            x: Input tensor of shape (batch, seq_len, embed_dim)
            mask: Optional attention mask
        Returns:
            Output tensor of shape (batch, seq_len, embed_dim)
        """
        # Compute Q, K, V
        Q = self.q_proj(x)  # (batch, seq_len, embed_dim)
        K = self.k_proj(x)
        V = self.v_proj(x)

        # Compute attention scores
        # (batch, seq_len, embed_dim) @ (batch, embed_dim, seq_len)
        # = (batch, seq_len, seq_len)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale

        # Apply mask if provided (for causal/autoregressive models)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        # Softmax to get attention weights
        attn_weights = F.softmax(scores, dim=-1)

        # Apply attention to values
        output = torch.matmul(attn_weights, V)

        # Final projection
        return self.out_proj(output)

# Example usage
batch_size, seq_len, embed_dim = 2, 10, 64
x = torch.randn(batch_size, seq_len, embed_dim)

attention = SelfAttention(embed_dim)
output = attention(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
# Input shape: torch.Size([2, 10, 64])
# Output shape: torch.Size([2, 10, 64])

2.2 Multi-Head Attention

Why Multiple Heads?

A single attention head can only focus on one type of relationship at a time. Multi-Head Attention allows the model to jointly attend to information from different representation subspaces.

Intuition

Think of it like having multiple "experts" looking at the same text: one head might track syntactic structure, another coreference, another local word order. Each head operates in a lower-dimensional subspace (head_dim = embed_dim / num_heads), so the total compute stays roughly the same as a single full-dimension head.

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$

where \(\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\)

class MultiHeadAttention(nn.Module):
    """Multi-Head Self-Attention as used in Transformer"""

    def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"

        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        # Combined projection for Q, K, V (more efficient)
        self.qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

        self.dropout = nn.Dropout(dropout)
        self.scale = math.sqrt(self.head_dim)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        batch_size, seq_len, _ = x.shape

        # Project and reshape for multi-head
        qkv = self.qkv_proj(x)  # (batch, seq_len, 3 * embed_dim)
        qkv = qkv.reshape(batch_size, seq_len, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, batch, heads, seq_len, head_dim)
        Q, K, V = qkv[0], qkv[1], qkv[2]

        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Apply attention to values
        output = torch.matmul(attn_weights, V)

        # Reshape and project
        output = output.transpose(1, 2).reshape(batch_size, seq_len, self.embed_dim)
        return self.out_proj(output)

# Example: 8-head attention
mha = MultiHeadAttention(embed_dim=512, num_heads=8)
x = torch.randn(2, 100, 512)  # batch=2, seq_len=100
output = mha(x)
print(f"Multi-head output shape: {output.shape}")
# Multi-head output shape: torch.Size([2, 100, 512])

2.3 Position Encoding

Self-Attention is permutation-equivariant: reordering the input tokens merely reorders the outputs, so the mechanism itself has no notion of token order. Position encodings inject positional information into the model.

Sinusoidal Position Encoding (Original Transformer)

The original Transformer used sinusoidal functions:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
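These two formulas can be computed directly. A minimal sketch (the helper name sinusoidal_pe is ours, not from any library):

```python
import torch

def sinusoidal_pe(max_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal position encoding table of shape (max_len, d_model)."""
    positions = torch.arange(max_len).float().unsqueeze(1)             # (max_len, 1)
    div_term = 10000.0 ** (torch.arange(0, d_model, 2).float() / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(positions / div_term)  # even dimensions: sine
    pe[:, 1::2] = torch.cos(positions / div_term)  # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(max_len=2048, d_model=512)
print(pe.shape)  # torch.Size([2048, 512])
```

At position 0 every sine dimension is 0 and every cosine dimension is 1, a handy sanity check.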

RoPE (Rotary Position Embedding)

RoPE is the position encoding used by most modern LLMs including LLaMA, Mistral, and Qwen. It encodes position by rotating the query and key vectors.

Why RoPE is Superior

- Relative encoding: the attention score between two tokens depends only on their offset, not their absolute positions
- No extra parameters: the rotation is applied directly to Q and K, with no separate embedding table
- Length extension: RoPE generalizes better to longer sequences than learned absolute embeddings, and supports context-extension tricks such as frequency scaling

def apply_rope(x: torch.Tensor, freqs_cos: torch.Tensor, freqs_sin: torch.Tensor):
    """Apply Rotary Position Embedding (RoPE)

    Args:
        x: Query or Key tensor of shape (batch, heads, seq_len, head_dim)
        freqs_cos, freqs_sin: Precomputed cosine and sine frequencies
    """
    # Split x into pairs for rotation
    x1 = x[..., ::2]   # Even indices
    x2 = x[..., 1::2]  # Odd indices

    # Apply rotation
    # [x1, x2] * [cos, -sin; sin, cos] = [x1*cos - x2*sin, x1*sin + x2*cos]
    rotated = torch.stack([
        x1 * freqs_cos - x2 * freqs_sin,
        x1 * freqs_sin + x2 * freqs_cos
    ], dim=-1).flatten(-2)

    return rotated

def precompute_rope_freqs(dim: int, max_seq_len: int, base: float = 10000.0):
    """Precompute RoPE frequencies"""
    # Compute inverse frequencies
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

    # Compute position indices
    positions = torch.arange(max_seq_len).float()

    # Outer product to get (seq_len, dim/2) frequency matrix
    freqs = torch.outer(positions, inv_freq)

    return torch.cos(freqs), torch.sin(freqs)

# Example
head_dim = 64
max_len = 2048
freqs_cos, freqs_sin = precompute_rope_freqs(head_dim, max_len)
print(f"RoPE frequency shapes: cos={freqs_cos.shape}, sin={freqs_sin.shape}")
# RoPE frequency shapes: cos=torch.Size([2048, 32]), sin=torch.Size([2048, 32])
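A useful sanity check on RoPE's defining property: the dot product between a rotated query and a rotated key depends only on their relative offset, not on absolute positions. The sketch below repeats the two helpers so it runs standalone:

```python
import torch

def precompute_rope_freqs(dim, max_seq_len, base=10000.0):
    # Repeated from above so this demo is self-contained
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    freqs = torch.outer(torch.arange(max_seq_len).float(), inv_freq)
    return torch.cos(freqs), torch.sin(freqs)

def apply_rope(x, freqs_cos, freqs_sin):
    # Rotate consecutive (even, odd) pairs of dimensions
    x1, x2 = x[..., ::2], x[..., 1::2]
    return torch.stack(
        [x1 * freqs_cos - x2 * freqs_sin,
         x1 * freqs_sin + x2 * freqs_cos], dim=-1
    ).flatten(-2)

torch.manual_seed(0)
head_dim = 64
cos, sin = precompute_rope_freqs(head_dim, 128)
q, k = torch.randn(head_dim), torch.randn(head_dim)

def dot_at(q_pos, k_pos):
    return (apply_rope(q, cos[q_pos], sin[q_pos]) *
            apply_rope(k, cos[k_pos], sin[k_pos])).sum()

# Same relative offset (7) at different absolute positions -> same score
print(torch.allclose(dot_at(10, 3), dot_at(50, 43), atol=1e-4))  # True
```

Because rotation preserves norms, RoPE also leaves vector magnitudes untouched.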

ALiBi (Attention with Linear Biases)

ALiBi adds a linear bias to attention scores based on distance, without modifying embeddings:

$$\text{softmax}(q_i \cdot k_j - m \cdot |i - j|)$$

where \(m\) is a head-specific slope. ALiBi is used by models like BLOOM and MPT.
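A minimal sketch of the bias construction, assuming the ALiBi paper's geometric slope schedule for power-of-two head counts (the helper name alibi_bias is ours):

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """ALiBi bias of shape (num_heads, seq_len, seq_len).

    Slope schedule for power-of-two head counts:
    m_i = 2^(-8 * i / num_heads) for heads i = 1..num_heads.
    """
    slopes = torch.tensor(
        [2.0 ** (-8.0 * (i + 1) / num_heads) for i in range(num_heads)]
    )
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs().float()  # |i - j|
    # The bias is added to attention scores before the softmax
    return -slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(num_heads=8, seq_len=4)
print(bias.shape)     # torch.Size([8, 4, 4])
print(bias[0, 0, 1])  # tensor(-0.5000): head 0 has slope 1/2, distance 1
```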

2.4 Decoder-Only vs Encoder-Decoder

Architecture Comparison

| Architecture | Models | Use Case | Attention Type |
|---|---|---|---|
| Encoder-Only | BERT, RoBERTa | Understanding, Classification | Bidirectional |
| Decoder-Only | GPT, LLaMA, Claude | Generation | Causal (unidirectional) |
| Encoder-Decoder | T5, BART | Translation, Summarization | Mixed |
graph TB
    subgraph "Encoder-Only (BERT)"
        E1[Token 1] <--> E2[Token 2]
        E2 <--> E3[Token 3]
        E1 <--> E3
    end
    subgraph "Decoder-Only (GPT)"
        D1[Token 1] --> D2[Token 2]
        D1 --> D3[Token 3]
        D2 --> D3
    end
    subgraph "Encoder-Decoder (T5)"
        ED1[Encoder] --> ED2[Cross-Attention]
        ED3[Decoder] --> ED2
    end
    style E1 fill:#e3f2fd
    style D1 fill:#fff3e0
    style ED1 fill:#e8f5e9

Why Decoder-Only Dominates (2024-2025)

Most modern LLMs use the Decoder-Only architecture because:

- A single next-token-prediction objective handles both understanding and generation
- The training signal is dense: every position in every sequence is a prediction target
- Inference is efficient with KV caching, since past keys and values are reused unchanged
- One stack of identical blocks scales cleanly to very large parameter counts
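The causal attention pattern that defines Decoder-Only models is just a lower-triangular mask applied before the softmax, matching the masked_fill convention used in the SelfAttention code above. A minimal sketch:

```python
import torch
import torch.nn.functional as F

seq_len = 5
# Causal mask: 1 where query i may attend to key j (j <= i), 0 elsewhere
mask = torch.tril(torch.ones(seq_len, seq_len))

scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)

print(weights[0])  # tensor([1., 0., 0., 0., 0.]): token 0 sees only itself
# No probability mass above the diagonal (no peeking at future tokens)
print(weights.triu(1).abs().max())  # tensor(0.)
```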

2.5 Mixture of Experts (MoE)

2025-2026 Key Development

Mixture of Experts has become the dominant architecture for frontier models. DeepSeek-V3, Mixtral, and Llama 4 all use MoE to achieve better performance with less compute.

What is MoE?

Mixture of Experts (MoE) replaces the dense Feed-Forward Network (FFN) layer with multiple "expert" networks. A router network selects which experts to activate for each token.

graph TD
    A[Input Token] --> B[Router Network]
    B --> |Top-K Selection| C[Expert 1]
    B --> |Top-K Selection| D[Expert 2]
    B -.-> E[Expert 3]
    B -.-> F[Expert N]
    C --> G[Weighted Sum]
    D --> G
    G --> H[Output]
    style B fill:#fff3e0
    style C fill:#e8f5e9
    style D fill:#e8f5e9
    style E fill:#f5f5f5
    style F fill:#f5f5f5

Key Benefits

- Capacity without proportional compute: total parameters grow with the number of experts, but each token activates only the top-k of them
- Typically better quality per FLOP than a dense model with the same active parameter count
- Experts can specialize (by domain, language, or token type), improving sample efficiency

MoE Implementation

class MoELayer(nn.Module):
    """Mixture of Experts Layer (Simplified)"""

    def __init__(
        self,
        embed_dim: int,
        num_experts: int = 8,
        top_k: int = 2,
        expert_dim: int = None
    ):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        expert_dim = expert_dim or embed_dim * 4

        # Router: determines which experts to use
        self.router = nn.Linear(embed_dim, num_experts, bias=False)

        # Expert networks (simplified as MLPs)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(embed_dim, expert_dim),
                nn.GELU(),
                nn.Linear(expert_dim, embed_dim)
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor):
        """
        Args:
            x: Input tensor of shape (batch, seq_len, embed_dim)
        """
        batch_size, seq_len, embed_dim = x.shape

        # Compute router logits
        router_logits = self.router(x)  # (batch, seq_len, num_experts)

        # Get top-k experts for each token
        router_probs = F.softmax(router_logits, dim=-1)
        top_k_probs, top_k_indices = torch.topk(router_probs, self.top_k, dim=-1)

        # Normalize top-k probabilities
        top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)

        # Initialize output
        output = torch.zeros_like(x)

        # Process each expert (simplified - real implementations are more efficient)
        for expert_idx in range(self.num_experts):
            # Find tokens routed to this expert
            expert_mask = (top_k_indices == expert_idx).any(dim=-1)

            if expert_mask.any():
                # Get weight for this expert
                weight_mask = (top_k_indices == expert_idx)
                expert_weights = (top_k_probs * weight_mask.float()).sum(dim=-1)

                # Apply expert
                expert_input = x[expert_mask]
                expert_output = self.experts[expert_idx](expert_input)

                # Add weighted expert output
                output[expert_mask] += expert_weights[expert_mask].unsqueeze(-1) * expert_output

        return output

# Example: MoE with 8 experts, top-2 routing
moe = MoELayer(embed_dim=512, num_experts=8, top_k=2)
x = torch.randn(2, 100, 512)
output = moe(x)
print(f"MoE output shape: {output.shape}")
# MoE output shape: torch.Size([2, 100, 512])

DeepSeek's Innovation: Shared + Routed Experts

DeepSeek introduced a novel MoE variant with two types of experts:

- Shared experts: always active for every token, capturing common knowledge so the routed experts don't each have to relearn it
- Routed experts: many fine-grained specialists, of which the router selects only a small subset per token

DeepSeek MoE Results

DeepSeekMoE 16B achieves comparable performance to DeepSeek 7B with only 40.5% of computations, and outperforms LLaMA2 7B on most benchmarks with 39.6% of computations.

| Model | Total Params | Active Params | Architecture |
|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | 8 experts, top-2 |
| DeepSeek-V2 | 236B | 21B | Shared + Routed |
| DeepSeek-V3 | 685B | ~37B | Fine-grained MoE |
| Llama 4 Scout | Unknown | ~17B | 16 experts, top-1 |

2.6 Efficient Attention Mechanisms

The Quadratic Problem

Standard self-attention has O(n^2) time and memory complexity in sequence length, making it expensive for long contexts. With context windows of 1M+ tokens appearing in frontier models, efficient attention is essential.
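A back-of-the-envelope calculation makes the quadratic cost concrete. The sketch below assumes fp16 scores (2 bytes per element), a single head, and no batching:

```python
def attn_matrix_gib(seq_len: int, bytes_per_elem: int = 2) -> float:
    """GiB needed to materialize one seq_len x seq_len attention score matrix."""
    return seq_len ** 2 * bytes_per_elem / 2 ** 30

for n in (4_096, 32_768, 1_000_000):
    print(f"n={n:>9,}: {attn_matrix_gib(n):,.2f} GiB per head")
# At n = 32,768 the matrix is already 2 GiB per head;
# at n = 1,000,000 it would be ~1.8 TiB per head, which is why
# FlashAttention avoids materializing it at all.
```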

FlashAttention

FlashAttention (Dao et al., 2022) is an IO-aware algorithm that computes exact attention 2-4x faster with less memory by:

- Tiling: processing Q, K, and V in blocks small enough to fit in fast on-chip SRAM, never materializing the full n x n attention matrix in GPU HBM
- Kernel fusion: computing matmul, masking, and softmax in a single kernel, avoiding round trips to slow memory
- Recomputation: recomputing attention blocks on the fly during the backward pass instead of storing the full matrix

FlashAttention in Practice

FlashAttention is available in PyTorch 2.0+ via torch.nn.functional.scaled_dot_product_attention, which automatically dispatches to a FlashAttention kernel when the hardware and inputs support it.

# Using FlashAttention via PyTorch 2.0+
import torch.nn.functional as F

def efficient_attention(Q, K, V, is_causal=True):
    """Use PyTorch's optimized attention (uses FlashAttention when available)"""
    return F.scaled_dot_product_attention(
        Q, K, V,
        is_causal=is_causal,
        dropout_p=0.0,  # No dropout for inference
        scale=None  # Auto-compute scale
    )

# Example
Q = torch.randn(2, 8, 1024, 64)  # batch, heads, seq_len, head_dim
K = torch.randn(2, 8, 1024, 64)
V = torch.randn(2, 8, 1024, 64)

# This automatically uses FlashAttention on compatible hardware
output = efficient_attention(Q, K, V)
print(f"Efficient attention output: {output.shape}")
# Efficient attention output: torch.Size([2, 8, 1024, 64])

Other Efficient Attention Variants

| Method | Complexity | Used By |
|---|---|---|
| Standard Attention | O(n^2) | Original Transformer |
| FlashAttention | O(n^2), but IO-efficient | LLaMA, Mistral, etc. |
| Sparse Attention | O(n * sqrt(n)) | GPT-3, Longformer |
| Linear Attention | O(n) | Mamba, RWKV |
| Ring Attention | O(n^2 / devices) | Distributed training |

2.7 Chapter Summary

Key Takeaways

  1. Self-Attention computes relationships between all token pairs, enabling long-range dependencies
  2. Multi-Head Attention allows models to attend to multiple aspects simultaneously
  3. RoPE is the dominant position encoding for modern LLMs, enabling length extrapolation
  4. Decoder-Only architecture dominates due to simplicity and scaling properties
  5. MoE enables massive models with sparse activation, used by DeepSeek, Mixtral, Llama 4
  6. FlashAttention makes long-context models practical through IO-aware computation

Exercises

Exercise 1: Attention Visualization

Implement a function that visualizes attention weights as a heatmap. Use it to see how a simple self-attention layer attends to tokens in the sentence "The cat sat on the mat".

Solution
import matplotlib.pyplot as plt
import seaborn as sns

def visualize_attention(attn_weights, tokens):
    """Visualize attention weights as a heatmap"""
    plt.figure(figsize=(8, 6))
    sns.heatmap(
        attn_weights.detach().numpy(),
        xticklabels=tokens,
        yticklabels=tokens,
        cmap='Blues',
        annot=True,
        fmt='.2f'
    )
    plt.xlabel('Key Tokens')
    plt.ylabel('Query Tokens')
    plt.title('Self-Attention Weights')
    plt.tight_layout()
    plt.show()

# Example (would need actual model to get real weights)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
# attn_weights would be from your attention layer
# visualize_attention(attn_weights[0, 0], tokens)  # First batch, first head

Exercise 2: RoPE vs Sinusoidal

Compare the positional embeddings from sinusoidal encoding and RoPE. Plot the first 8 dimensions for positions 0-100.

Exercise 3: MoE Load Balancing

Implement a load balancing loss for MoE that encourages even distribution of tokens across experts:

$$\mathcal{L}_{balance} = N \cdot \sum_{i=1}^{N} f_i \cdot P_i$$

where \(f_i\) is the fraction of tokens routed to expert \(i\) and \(P_i\) is the average routing probability for expert \(i\).

Exercise 4: Attention Complexity

Measure the time and memory usage of standard attention vs FlashAttention for sequence lengths [512, 1024, 2048, 4096]. Plot the results.

Next Chapter

In the next chapter, we'll explore LLM Training and Alignment - how models are pre-trained on massive datasets, and how techniques like RLHF, DPO, and Constitutional AI align them with human values. We'll also cover the revolutionary inference-time scaling that powers reasoning models like o1 and o3.
