## Introduction

The Transformer architecture is the foundation of virtually all modern Large Language Models (LLMs). In this chapter, we'll dive deep into the core mechanisms that make LLMs work, including the Self-Attention mechanism, various position encoding schemes, and the Mixture of Experts (MoE) architecture that powers models like DeepSeek and Mixtral.
## What You'll Learn
- How Self-Attention computes relationships between all tokens
- Multi-Head Attention and why it matters
- Position encoding: from sinusoidal to RoPE and ALiBi
- Decoder-Only vs Encoder-Decoder architectures
- NEW: Mixture of Experts (MoE) - the secret behind efficient giant models
- NEW: FlashAttention and efficient attention mechanisms
## 2.1 Self-Attention Mechanism

### The Core Idea
Self-Attention (also called Scaled Dot-Product Attention) allows each token in a sequence to attend to all other tokens, capturing dependencies regardless of their distance in the sequence.
The key insight is: instead of processing tokens sequentially like RNNs, Self-Attention computes relationships between all pairs of tokens in parallel.
### Mathematical Formulation
Given an input sequence, Self-Attention computes three matrices:
- Query (Q): "What am I looking for?"
- Key (K): "What information do I contain?"
- Value (V): "What information do I provide?"
The attention output is computed as:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Where \(d_k\) is the dimension of the key vectors. The \(\sqrt{d_k}\) scaling prevents the dot products from becoming too large.
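To see why this scaling matters, here is a small standalone sanity check: for random vectors with unit-variance entries, raw dot products have a standard deviation of roughly \(\sqrt{d_k}\), which would push the softmax into a saturated, low-gradient regime; dividing by \(\sqrt{d_k}\) brings it back to about 1.

```python
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(10_000, d_k)
k = torch.randn(10_000, d_k)

raw = (q * k).sum(dim=-1)       # unscaled dot products
scaled = raw / d_k ** 0.5       # scaled as in the attention formula

print(f"std of raw scores:    {raw.std():.1f}")     # ~ sqrt(512) ~ 22.6
print(f"std of scaled scores: {scaled.std():.2f}")  # ~ 1.0
```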
### Implementation in Python

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    """Scaled Dot-Product Self-Attention"""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.embed_dim = embed_dim
        # Linear projections for Q, K, V
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        # Output projection
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        # Scaling factor
        self.scale = math.sqrt(embed_dim)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        """
        Args:
            x: Input tensor of shape (batch, seq_len, embed_dim)
            mask: Optional attention mask
        Returns:
            Output tensor of shape (batch, seq_len, embed_dim)
        """
        # Compute Q, K, V
        Q = self.q_proj(x)  # (batch, seq_len, embed_dim)
        K = self.k_proj(x)
        V = self.v_proj(x)
        # Compute attention scores
        # (batch, seq_len, embed_dim) @ (batch, embed_dim, seq_len)
        # = (batch, seq_len, seq_len)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        # Apply mask if provided (for causal/autoregressive models)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        # Softmax to get attention weights
        attn_weights = F.softmax(scores, dim=-1)
        # Apply attention to values
        output = torch.matmul(attn_weights, V)
        # Final projection
        return self.out_proj(output)

# Example usage
batch_size, seq_len, embed_dim = 2, 10, 64
x = torch.randn(batch_size, seq_len, embed_dim)
attention = SelfAttention(embed_dim)
output = attention(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
# Input shape: torch.Size([2, 10, 64])
# Output shape: torch.Size([2, 10, 64])
```
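The `mask` argument above is how causal (autoregressive) attention is implemented. A minimal standalone sketch of building a lower-triangular causal mask and checking that it zeroes out attention to future positions:

```python
import torch
import torch.nn.functional as F

seq_len = 5
# 1 where attention is allowed: position i may attend to positions j <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(causal_mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)

print(weights.triu(diagonal=1).sum())  # weight on future tokens: exactly 0
print(weights.sum(dim=-1))             # each row still sums to 1
```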
## 2.2 Multi-Head Attention

### Why Multiple Heads?
A single attention head can only focus on one type of relationship at a time. Multi-Head Attention allows the model to jointly attend to information from different representation subspaces.
### Intuition
Think of it like having multiple "experts" looking at the same text:
- Head 1 might focus on syntactic relationships (subject-verb)
- Head 2 might capture semantic similarity
- Head 3 might track coreference (pronouns to nouns)
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$
where \(\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\)
```python
class MultiHeadAttention(nn.Module):
    """Multi-Head Self-Attention as used in the Transformer"""

    def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # Combined projection for Q, K, V (more efficient than three separate ones)
        self.qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)
        self.scale = math.sqrt(self.head_dim)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        batch_size, seq_len, _ = x.shape
        # Project and reshape for multi-head
        qkv = self.qkv_proj(x)  # (batch, seq_len, 3 * embed_dim)
        qkv = qkv.reshape(batch_size, seq_len, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, batch, heads, seq_len, head_dim)
        Q, K, V = qkv[0], qkv[1], qkv[2]
        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        # Apply attention to values
        output = torch.matmul(attn_weights, V)
        # Reshape back to (batch, seq_len, embed_dim) and project
        output = output.transpose(1, 2).reshape(batch_size, seq_len, self.embed_dim)
        return self.out_proj(output)

# Example: 8-head attention
mha = MultiHeadAttention(embed_dim=512, num_heads=8)
x = torch.randn(2, 100, 512)  # batch=2, seq_len=100
output = mha(x)
print(f"Multi-head output shape: {output.shape}")
# Multi-head output shape: torch.Size([2, 100, 512])
```
## 2.3 Position Encoding
Self-Attention is permutation-invariant - it doesn't inherently know the order of tokens. Position encodings inject positional information into the model.
### Sinusoidal Position Encoding (Original Transformer)
The original Transformer used sinusoidal functions:
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
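The two formulas above translate directly into code; a minimal sketch (the function name `sinusoidal_pe` is ours):

```python
import torch

def sinusoidal_pe(max_len: int, d_model: int, base: float = 10000.0):
    """Return a (max_len, d_model) table of sinusoidal position encodings."""
    pos = torch.arange(max_len).float().unsqueeze(1)   # (max_len, 1)
    two_i = torch.arange(0, d_model, 2).float()        # (d_model/2,)
    angles = pos / (base ** (two_i / d_model))         # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)                    # even dimensions: sine
    pe[:, 1::2] = torch.cos(angles)                    # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(128, 64)
print(pe.shape)   # torch.Size([128, 64])
print(pe[0, :4])  # position 0 alternates sin(0)=0 and cos(0)=1
```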
### RoPE (Rotary Position Embedding)
RoPE is the position encoding used by most modern LLMs including LLaMA, Mistral, and Qwen. It encodes position by rotating the query and key vectors.
#### Why RoPE is Superior
- Relative position: Naturally captures relative distances between tokens
- Length extrapolation: Can generalize to sequences longer than those seen in training
- Efficient: Applied only to Q and K, not V
```python
def apply_rope(x: torch.Tensor, freqs_cos: torch.Tensor, freqs_sin: torch.Tensor):
    """Apply Rotary Position Embedding (RoPE).

    Args:
        x: Query or Key tensor of shape (batch, heads, seq_len, head_dim)
        freqs_cos, freqs_sin: Precomputed cosine and sine frequencies
    """
    # Split x into interleaved pairs for rotation
    x1 = x[..., ::2]   # Even indices
    x2 = x[..., 1::2]  # Odd indices
    # Apply rotation:
    # [x1, x2] * [cos, -sin; sin, cos] = [x1*cos - x2*sin, x1*sin + x2*cos]
    rotated = torch.stack([
        x1 * freqs_cos - x2 * freqs_sin,
        x1 * freqs_sin + x2 * freqs_cos
    ], dim=-1).flatten(-2)
    return rotated

def precompute_rope_freqs(dim: int, max_seq_len: int, base: float = 10000.0):
    """Precompute RoPE frequencies"""
    # Compute inverse frequencies
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # Compute position indices
    positions = torch.arange(max_seq_len).float()
    # Outer product to get a (seq_len, dim/2) frequency matrix
    freqs = torch.outer(positions, inv_freq)
    return torch.cos(freqs), torch.sin(freqs)

# Example
head_dim = 64
max_len = 2048
freqs_cos, freqs_sin = precompute_rope_freqs(head_dim, max_len)
print(f"RoPE frequency shapes: cos={freqs_cos.shape}, sin={freqs_sin.shape}")
# RoPE frequency shapes: cos=torch.Size([2048, 32]), sin=torch.Size([2048, 32])
```
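The "relative position" claim can be checked numerically. In this standalone sketch (the helper `rope_rotate` is ours), the score between a rotated query and key is unchanged when both positions shift by the same offset:

```python
import torch

def rope_rotate(x: torch.Tensor, pos: float, dim: int, base: float = 10000.0):
    """Rotate a single (dim,) vector as RoPE would at position `pos`."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = pos * inv_freq                      # one angle per pair of dims
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[::2], x[1::2]
    return torch.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], dim=-1).flatten()

torch.manual_seed(0)
dim = 8
q, k = torch.randn(dim), torch.randn(dim)

# Same relative offset (3) at two different absolute positions
s1 = torch.dot(rope_rotate(q, 5, dim), rope_rotate(k, 2, dim))
s2 = torch.dot(rope_rotate(q, 105, dim), rope_rotate(k, 102, dim))
print(torch.allclose(s1, s2, atol=1e-4))  # True: the score depends only on the offset
```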
### ALiBi (Attention with Linear Biases)
ALiBi adds a linear bias to attention scores based on distance, without modifying embeddings:
$$\text{softmax}(q_i \cdot k_j - m \cdot |i - j|)$$
where \(m\) is a head-specific slope. ALiBi is used by models like BLOOM and MPT.
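A minimal sketch of constructing ALiBi's per-head bias matrix (the slopes follow the geometric sequence from the ALiBi paper; in a causal model only the \(j \le i\) entries matter, since future positions are masked anyway):

```python
import torch

def alibi_bias(num_heads: int, seq_len: int):
    """Return (num_heads, seq_len, seq_len) additive attention biases."""
    # Head-specific slopes: 2^(-8/num_heads), 2^(-16/num_heads), ...
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads)
                           for h in range(num_heads)])
    pos = torch.arange(seq_len)
    rel = pos[None, :] - pos[:, None]   # rel[i, j] = j - i (<= 0 for past tokens)
    return slopes[:, None, None] * rel  # bias[h, i, j] = -m_h * (i - j)

bias = alibi_bias(num_heads=8, seq_len=6)
print(bias.shape)      # torch.Size([8, 6, 6])
print(bias[0, 3, :4])  # distances 3, 2, 1, 0 scaled by slope m_0 = 0.5
```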
## 2.4 Decoder-Only vs Encoder-Decoder

### Architecture Comparison
| Architecture | Models | Use Case | Attention Type |
|---|---|---|---|
| Encoder-Only | BERT, RoBERTa | Understanding, Classification | Bidirectional |
| Decoder-Only | GPT, LLaMA, Claude | Generation | Causal (unidirectional) |
| Encoder-Decoder | T5, BART | Translation, Summarization | Mixed |
### Why Decoder-Only Dominates (2024-2025)
Most modern LLMs use Decoder-Only architecture because:
- Unified training: Same objective (next token prediction) for all tasks
- Simplicity: Fewer components, easier to scale
- Emergence: Scaling decoder-only models shows remarkable emergent abilities
- In-context learning: Natural fit for few-shot prompting
## 2.5 Mixture of Experts (MoE)
**2025-2026 key development:** Mixture of Experts has become a dominant architecture for frontier models. DeepSeek-V3, Mixtral, and Llama 4 all use MoE to achieve better performance per unit of compute.
### What is MoE?
Mixture of Experts (MoE) replaces the dense Feed-Forward Network (FFN) layer with multiple "expert" networks. A router network selects which experts to activate for each token.
### Key Benefits
- Compute Efficiency: Often only a small fraction (roughly 5-40%) of parameters are active per token
- Parameter Scaling: Total parameter counts can scale to the trillions
- Specialization: Different experts learn different skills
### MoE Implementation

```python
class MoELayer(nn.Module):
    """Mixture of Experts Layer (Simplified)"""

    def __init__(
        self,
        embed_dim: int,
        num_experts: int = 8,
        top_k: int = 2,
        expert_dim: int = None
    ):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        expert_dim = expert_dim or embed_dim * 4
        # Router: determines which experts to use
        self.router = nn.Linear(embed_dim, num_experts, bias=False)
        # Expert networks (simplified as MLPs)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(embed_dim, expert_dim),
                nn.GELU(),
                nn.Linear(expert_dim, embed_dim)
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor):
        """
        Args:
            x: Input tensor of shape (batch, seq_len, embed_dim)
        """
        batch_size, seq_len, embed_dim = x.shape
        # Compute router logits
        router_logits = self.router(x)  # (batch, seq_len, num_experts)
        # Get top-k experts for each token
        router_probs = F.softmax(router_logits, dim=-1)
        top_k_probs, top_k_indices = torch.topk(router_probs, self.top_k, dim=-1)
        # Renormalize the top-k probabilities so they sum to 1
        top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)
        # Initialize output
        output = torch.zeros_like(x)
        # Process each expert (simplified - real implementations are more efficient)
        for expert_idx in range(self.num_experts):
            # Find tokens routed to this expert
            expert_mask = (top_k_indices == expert_idx).any(dim=-1)
            if expert_mask.any():
                # Get this expert's routing weight for each token
                weight_mask = (top_k_indices == expert_idx)
                expert_weights = (top_k_probs * weight_mask.float()).sum(dim=-1)
                # Apply expert
                expert_input = x[expert_mask]
                expert_output = self.experts[expert_idx](expert_input)
                # Add weighted expert output
                output[expert_mask] += expert_weights[expert_mask].unsqueeze(-1) * expert_output
        return output

# Example: MoE with 8 experts, top-2 routing
moe = MoELayer(embed_dim=512, num_experts=8, top_k=2)
x = torch.randn(2, 100, 512)
output = moe(x)
print(f"MoE output shape: {output.shape}")
# MoE output shape: torch.Size([2, 100, 512])
```
### DeepSeek's Innovation: Shared + Routed Experts
DeepSeek introduced a novel MoE variant with two types of experts:
- Shared Experts: Always activated, learn common knowledge (grammar, etc.)
- Routed Experts: Dynamically selected, learn specialized knowledge
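A conceptual sketch of the idea (not DeepSeek's actual implementation; top-1 routing and a single shared expert for brevity): the shared expert always runs, and each token adds the output of one routed expert on top.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedRoutedMoE(nn.Module):
    """Simplified shared + routed MoE sketch (class name and sizes are ours)."""

    def __init__(self, embed_dim: int, num_routed: int = 4, hidden: int = 128):
        super().__init__()
        mlp = lambda: nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.GELU(), nn.Linear(hidden, embed_dim))
        self.shared = mlp()                                    # always active
        self.routed = nn.ModuleList(mlp() for _ in range(num_routed))
        self.router = nn.Linear(embed_dim, num_routed, bias=False)

    def forward(self, x: torch.Tensor):
        # Shared path: common knowledge, applied to every token
        out = self.shared(x)
        # Routed path: pick one specialized expert per token (top-1 here)
        probs = F.softmax(self.router(x), dim=-1)
        top_p, top_i = probs.max(dim=-1)
        for e in range(len(self.routed)):
            m = top_i == e
            if m.any():
                out[m] = out[m] + top_p[m].unsqueeze(-1) * self.routed[e](x[m])
        return out

x = torch.randn(2, 10, 32)
out = SharedRoutedMoE(embed_dim=32)(x)
print(out.shape)  # torch.Size([2, 10, 32])
```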
#### DeepSeek MoE Results
DeepSeekMoE 16B achieves comparable performance to DeepSeek 7B with only 40.5% of computations, and outperforms LLaMA2 7B on most benchmarks with 39.6% of computations.
| Model | Total Params | Active Params | Architecture |
|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | 8 experts, top-2 |
| DeepSeek-V2 | 236B | 21B | Shared + Routed |
| DeepSeek-V3 | 685B | ~37B | Fine-grained MoE |
| Llama 4 Scout | 109B | ~17B | 16 experts, top-1 |
## 2.6 Efficient Attention Mechanisms

### The Quadratic Problem
Standard self-attention has O(n^2) time and memory complexity in the sequence length n, making it expensive for long contexts. With context windows of a million tokens or more appearing in 2025-era models, efficient attention is crucial.
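To make the quadratic cost concrete, here is a back-of-the-envelope calculation of the memory needed just to materialize one fp16 attention matrix, per head:

```python
# seq_len x seq_len matrix, 2 bytes per fp16 entry, per head (and per layer)
for n in [1_024, 8_192, 131_072]:
    gib = n * n * 2 / 2**30
    print(f"seq_len={n:>7}: {gib:10.4f} GiB per attention matrix")
# At 131k tokens a single head's matrix is already 32 GiB, which is why
# FlashAttention avoids materializing it at all.
```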
### FlashAttention
FlashAttention (Dao et al., 2022) is an IO-aware algorithm that computes exact attention 2-4x faster with less memory by:
- Tiling: Processing attention in blocks that fit in fast SRAM
- Recomputation: Trading compute for memory by not storing attention matrices
- Kernel fusion: Combining operations to reduce memory bandwidth
#### FlashAttention in Practice
PyTorch 2.0+ exposes FlashAttention through torch.nn.functional.scaled_dot_product_attention, which automatically dispatches to a FlashAttention kernel when the hardware, dtypes, and arguments allow it.
```python
# Using FlashAttention via PyTorch 2.0+
import torch.nn.functional as F

def efficient_attention(Q, K, V, is_causal=True):
    """Use PyTorch's optimized attention (dispatches to FlashAttention when available)"""
    return F.scaled_dot_product_attention(
        Q, K, V,
        is_causal=is_causal,
        dropout_p=0.0,  # No dropout for inference
        scale=None      # Auto-compute scale as 1/sqrt(head_dim)
    )

# Example
Q = torch.randn(2, 8, 1024, 64)  # batch, heads, seq_len, head_dim
K = torch.randn(2, 8, 1024, 64)
V = torch.randn(2, 8, 1024, 64)
# This automatically uses FlashAttention on compatible hardware
output = efficient_attention(Q, K, V)
print(f"Efficient attention output: {output.shape}")
# Efficient attention output: torch.Size([2, 8, 1024, 64])
```
### Other Efficient Attention Variants
| Method | Complexity | Used By |
|---|---|---|
| Standard Attention | O(n^2) | Original Transformer |
| FlashAttention | O(n^2) but IO-efficient | LLaMA, Mistral, etc. |
| Sparse Attention | O(n * sqrt(n)) | GPT-3, Longformer |
| Linear Attention / SSMs | O(n) | Mamba, RWKV |
| Ring Attention | O(n^2 / devices) | Distributed training |
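As an example of the O(n) family, here is a minimal non-causal linear attention sketch in the style of Katharopoulos et al. (2020): replacing the softmax with a kernel feature map lets us reassociate \((QK^T)V\) as \(Q(K^TV)\), so the \(n \times n\) matrix is never materialized.

```python
import torch
import torch.nn.functional as F

def linear_attention(Q, K, V, eps=1e-6):
    """Non-causal linear attention with feature map phi(x) = elu(x) + 1.
    Q, K: (batch, heads, n, d); V: (batch, heads, n, e)."""
    Q, K = F.elu(Q) + 1, F.elu(K) + 1          # nonnegative features
    # (Q K^T) V == Q (K^T V); the right-hand side costs O(n*d*e), not O(n^2*d)
    KV = torch.einsum('bhnd,bhne->bhde', K, V)
    # Per-query normalizer: phi(q_i) . sum_j phi(k_j)
    Z = 1.0 / (torch.einsum('bhnd,bhd->bhn', Q, K.sum(dim=2)) + eps)
    return torch.einsum('bhnd,bhde,bhn->bhne', Q, KV, Z)

Q = torch.randn(2, 8, 1024, 64)
K = torch.randn(2, 8, 1024, 64)
V = torch.randn(2, 8, 1024, 64)
out = linear_attention(Q, K, V)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```

Causal variants require a running prefix sum over `K^T V`, which is what RWKV-style recurrences exploit.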
## 2.7 Chapter Summary

### Key Takeaways
- Self-Attention computes relationships between all token pairs, enabling long-range dependencies
- Multi-Head Attention allows models to attend to multiple aspects simultaneously
- RoPE is the dominant position encoding for modern LLMs, enabling length extrapolation
- Decoder-Only architecture dominates due to simplicity and scaling properties
- MoE enables massive models with sparse activation, used by DeepSeek, Mixtral, Llama 4
- FlashAttention makes long-context models practical through IO-aware computation
### Exercises

#### Exercise 1: Attention Visualization
Implement a function that visualizes attention weights as a heatmap. Use it to see how a simple self-attention layer attends to tokens in the sentence "The cat sat on the mat".
**Solution**
```python
import matplotlib.pyplot as plt
import seaborn as sns

def visualize_attention(attn_weights, tokens):
    """Visualize attention weights as a heatmap"""
    plt.figure(figsize=(8, 6))
    sns.heatmap(
        attn_weights.detach().numpy(),
        xticklabels=tokens,
        yticklabels=tokens,
        cmap='Blues',
        annot=True,
        fmt='.2f'
    )
    plt.xlabel('Key Tokens')
    plt.ylabel('Query Tokens')
    plt.title('Self-Attention Weights')
    plt.tight_layout()
    plt.show()

# Example (would need an actual model to get real weights)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
# attn_weights would come from your attention layer
# visualize_attention(attn_weights[0, 0], tokens)  # First batch, first head
```
#### Exercise 2: RoPE vs Sinusoidal
Compare the positional embeddings from sinusoidal encoding and RoPE. Plot the first 8 dimensions for positions 0-100.
#### Exercise 3: MoE Load Balancing
Implement a load balancing loss for MoE that encourages even distribution of tokens across experts:
$$\mathcal{L}_{balance} = N \cdot \sum_{i=1}^{N} f_i \cdot P_i$$
where \(f_i\) is the fraction of tokens routed to expert \(i\) and \(P_i\) is the average routing probability for expert \(i\).
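For reference, here is a minimal sketch in the style of the Switch Transformer auxiliary loss (the function name is ours); with top-k routing, \(f_i\) counts a token once for each expert in its top-k set, so the loss is minimized when tokens are spread evenly:

```python
import torch
import torch.nn.functional as F

def load_balance_loss(router_logits: torch.Tensor, top_k: int):
    """router_logits: (num_tokens, num_experts)."""
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)                 # (tokens, experts)
    top_k_indices = probs.topk(top_k, dim=-1).indices        # (tokens, top_k)
    # f_i: fraction of routing assignments given to expert i
    one_hot = F.one_hot(top_k_indices, num_experts).float()  # (tokens, top_k, experts)
    f = one_hot.sum(dim=1).mean(dim=0)
    # P_i: average routing probability for expert i
    P = probs.mean(dim=0)
    return num_experts * (f * P).sum()

logits = torch.randn(1000, 8)
loss = load_balance_loss(logits, top_k=2)
print(loss)
```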
#### Exercise 4: Attention Complexity
Measure the time and memory usage of standard attention vs FlashAttention for sequence lengths [512, 1024, 2048, 4096]. Plot the results.
### Next Chapter
In the next chapter, we'll explore LLM Training and Alignment - how models are pre-trained on massive datasets, and how techniques like RLHF, DPO, and Constitutional AI align them with human values. We'll also cover the revolutionary inference-time scaling that powers reasoning models like o1 and o3.