Chapter 3: LLM Training and Alignment

From Pre-training to RLHF, DPO, and Inference-Time Scaling

Reading Time: 35-40 min · Difficulty: Intermediate · Code Examples: 7 · Exercises: 4

Introduction

Training a Large Language Model involves two major phases: pre-training on massive text corpora, and alignment to make the model helpful, harmless, and honest. This chapter covers the evolution of alignment techniques from RLHF to DPO, the revolutionary inference-time scaling that powers reasoning models, and the scaling laws that guide LLM development.

What You'll Learn

  1. How the pre-training objective and training data shape base models
  2. The three-stage RLHF pipeline: SFT, reward modeling, and PPO
  3. How DPO and GRPO simplify preference optimization
  4. Constitutional AI and self-critique-based alignment
  5. Inference-time scaling and reasoning models (o1/o3)
  6. The Chinchilla scaling law and the Densing Law

3.1 Pre-training

The Pre-training Objective

Pre-training teaches the model to predict the next token given all previous tokens (autoregressive language modeling):

$$\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t | x_1, ..., x_{t-1}; \theta)$$

```mermaid
graph LR
    A[Raw Text Data] --> B[Filtering & Cleaning]
    B --> C[Tokenization]
    C --> D[Pre-training]
    D --> E[Base Model]
    E --> F[Alignment]
    F --> G[Final Model]
    style D fill:#e3f2fd
    style F fill:#e8f5e9
```
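In code, this objective is an ordinary cross-entropy between each position's prediction and the next token. A minimal PyTorch sketch with random placeholder logits standing in for model outputs:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Autoregressive LM loss: predict token t from tokens < t.

    logits: (batch, seq_len, vocab_size) model outputs
    tokens: (batch, seq_len) input token ids
    """
    # The logits at position t predict token t+1, so shift by one
    shift_logits = logits[:, :-1, :]   # predictions for positions 1..T-1
    shift_labels = tokens[:, 1:]       # targets: the next tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )

# Toy example: random logits give a loss near log(vocab_size)
vocab_size, batch, seq_len = 100, 2, 16
logits = torch.randn(batch, seq_len, vocab_size)
tokens = torch.randint(0, vocab_size, (batch, seq_len))
print(f"LM loss: {next_token_loss(logits, tokens).item():.3f}")
```

Training drives this loss down by making the model assign higher probability to the actual next token at every position.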

Training Data

Modern LLMs are trained on trillions of tokens from diverse sources:

| Source | Examples | Purpose |
|---|---|---|
| Web Crawls | Common Crawl, C4 | General knowledge |
| Books | BookCorpus, Gutenberg | Long-form reasoning |
| Code | GitHub, Stack Overflow | Programming ability |
| Scientific | arXiv, PubMed | Technical knowledge |
| Conversations | Reddit, Forums | Dialogue patterns |

Data Quality Matters

The Densing Law (2025) shows that capability density (capability per parameter) doubles every ~3.5 months, driven in large part by better data curation and filtering rather than raw scale alone.

3.2 RLHF (Reinforcement Learning from Human Feedback)

The Three-Stage Pipeline

RLHF, popularized by ChatGPT, aligns models with human preferences through three stages:

```mermaid
graph TD
    subgraph Stage1["Stage 1: SFT"]
        A[Base Model] --> B[Supervised Fine-tuning]
        B --> C[SFT Model]
    end
    subgraph Stage2["Stage 2: RM Training"]
        C --> D[Generate Responses]
        D --> E[Human Rankings]
        E --> F[Train Reward Model]
    end
    subgraph Stage3["Stage 3: PPO"]
        C --> G[Generate]
        F --> H[Score]
        G --> H
        H --> I[PPO Update]
        I --> J[Aligned Model]
    end
    style C fill:#fff3e0
    style F fill:#e3f2fd
    style J fill:#e8f5e9
```

Stage 1: Supervised Fine-Tuning (SFT)

Fine-tune the base model on high-quality demonstrations:

```python
# SFT Training Example
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Load base model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# SFT dataset format: high-quality prompt/response demonstrations
sft_examples = [
    {"prompt": "Explain quantum computing", "response": "Quantum computing uses..."},
    {"prompt": "Write a poem about AI", "response": "In circuits deep..."},
]

# In practice, tokenize the examples into a Dataset first
# (concatenate prompt + response; typically mask prompt tokens in the labels)
sft_dataset = ...  # tokenized version of sft_examples

# Training arguments
training_args = TrainingArguments(
    output_dir="./sft_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    warmup_steps=100,
)

# Train
trainer = Trainer(model=model, args=training_args, train_dataset=sft_dataset)
trainer.train()
```

Stage 2: Reward Model Training

Train a model to predict human preferences:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Reward model for RLHF"""

    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # Get hidden states from base model
        outputs = self.base_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True
        )
        # Use last hidden state of last token
        last_hidden = outputs.hidden_states[-1][:, -1, :]
        reward = self.reward_head(last_hidden)
        return reward

def compute_preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry preference loss"""
    return -torch.log(torch.sigmoid(reward_chosen - reward_rejected)).mean()
```

Stage 3: PPO Training

Use Proximal Policy Optimization to maximize reward while staying close to the SFT model:

$$\mathcal{L}_{PPO} = \mathbb{E}\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right] - \beta \cdot D_{KL}(\pi_\theta || \pi_{SFT})$$
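The clipped surrogate with KL penalty can be sketched directly from the formula. A minimal, self-contained version operating on placeholder tensors; real training would compute these log-probabilities and advantages from rollouts of the policy:

```python
import torch

def ppo_loss(
    logp_new: torch.Tensor,    # log pi_theta(a|s) under the current policy
    logp_old: torch.Tensor,    # log probs recorded at rollout time
    logp_sft: torch.Tensor,    # log probs under the frozen SFT model
    advantages: torch.Tensor,  # advantage estimates A_hat
    clip_eps: float = 0.2,
    kl_beta: float = 0.1,
) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()              # clipped objective
    kl_penalty = (logp_new - logp_sft).mean()                     # sample-based KL estimate
    # We maximize (surrogate - beta * KL), so minimize its negative
    return -(surrogate - kl_beta * kl_penalty)

# Toy rollout batch
logp_old = torch.randn(8) - 2
logp_new = logp_old + 0.05 * torch.randn(8)
logp_sft = logp_old.clone()
adv = torch.randn(8)
print(f"PPO loss: {ppo_loss(logp_new, logp_old, logp_sft, adv).item():.4f}")
```

The clipping keeps each policy update small, while the KL term anchors the policy to the SFT model so it cannot drift toward degenerate, reward-hacked outputs.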

Challenges with RLHF

  1. Complexity: three separate training stages, each with its own failure modes
  2. Instability: PPO is sensitive to hyperparameters and can diverge
  3. Reward hacking: the policy can exploit flaws in the learned reward model
  4. Cost: collecting human preference rankings is slow and expensive

3.3 DPO (Direct Preference Optimization)

2024-2025 Breakthrough

DPO simplifies alignment by eliminating the reward model and the RL loop entirely; reported results show roughly a 50% compute reduction while retaining about 95% of RLHF's alignment effectiveness.

The Key Insight

DPO shows that the optimal policy under the RLHF objective has a closed-form solution. Instead of training a reward model, we can directly optimize:

$$\mathcal{L}_{DPO}(\pi_\theta; \pi_{ref}) = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right]$$

where \(y_w\) is the preferred response and \(y_l\) is the rejected response.

```mermaid
graph LR
    subgraph "RLHF (Traditional)"
        A1[SFT Model] --> B1[Reward Model]
        B1 --> C1[PPO Training]
        C1 --> D1[Aligned Model]
    end
    subgraph "DPO (Simplified)"
        A2[SFT Model] --> B2[Direct Preference Optimization]
        B2 --> D2[Aligned Model]
    end
    style C1 fill:#ffcdd2
    style B2 fill:#c8e6c9
```

DPO Implementation

```python
import torch
import torch.nn.functional as F
from typing import Tuple

def dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    reference_chosen_logps: torch.Tensor,
    reference_rejected_logps: torch.Tensor,
    beta: float = 0.1
) -> Tuple[torch.Tensor, dict]:
    """
    Direct Preference Optimization loss

    Args:
        policy_chosen_logps: Log probs of chosen responses under policy
        policy_rejected_logps: Log probs of rejected responses under policy
        reference_chosen_logps: Log probs of chosen responses under reference
        reference_rejected_logps: Log probs of rejected responses under reference
        beta: Temperature parameter (higher = more conservative updates)

    Returns:
        Tuple of (DPO loss, metrics dict for monitoring)
    """
    # Compute log ratios
    chosen_logratios = policy_chosen_logps - reference_chosen_logps
    rejected_logratios = policy_rejected_logps - reference_rejected_logps

    # DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))
    logits = beta * (chosen_logratios - rejected_logratios)
    loss = -F.logsigmoid(logits).mean()

    # Compute metrics for monitoring
    chosen_rewards = beta * chosen_logratios.detach()
    rejected_rewards = beta * rejected_logratios.detach()
    reward_margin = (chosen_rewards - rejected_rewards).mean()

    return loss, {
        'reward_margin': reward_margin.item(),
        'chosen_rewards': chosen_rewards.mean().item(),
        'rejected_rewards': rejected_rewards.mean().item()
    }

# Example usage with simulated log probabilities
batch_size = 4

policy_chosen = torch.randn(batch_size) * 0.1 - 50
policy_rejected = torch.randn(batch_size) * 0.1 - 55
ref_chosen = torch.randn(batch_size) * 0.1 - 52
ref_rejected = torch.randn(batch_size) * 0.1 - 54

loss, metrics = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
print(f"DPO Loss: {loss.item():.4f}")
print(f"Reward Margin: {metrics['reward_margin']:.4f}")
```

GRPO (Group Relative Policy Optimization)

GRPO, used to train DeepSeek-R1, simplifies PPO rather than DPO: it samples a group of responses per prompt and computes each response's advantage relative to the group's mean reward, eliminating the separate value model and improving training efficiency and stability.
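The core of GRPO is how advantages are computed: each sampled response is scored relative to the other responses drawn for the same prompt. A minimal sketch of that standardization step (the reward values are placeholders):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standardize each response's reward within its prompt group.

    rewards: (num_prompts, group_size) scalar rewards for sampled responses
    Returns advantages of the same shape: (reward - group mean) / group std.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# 2 prompts, 4 sampled responses each
rewards = torch.tensor([[1.0, 0.0, 0.5, 0.5],
                        [2.0, 2.0, 1.0, 3.0]])
adv = group_relative_advantages(rewards)
print(adv)  # each row is centered around zero
```

Because the baseline is the group mean rather than a learned value function, no critic network needs to be trained or kept in memory.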

3.4 Constitutional AI

Self-Critique Based Alignment

Constitutional AI (Anthropic, 2022) replaces human preference labeling with AI self-critique based on a set of principles (a "constitution").

```mermaid
graph TD
    A[Prompt] --> B[Initial Response]
    B --> C{Self-Critique}
    C --> |"Violates principles"| D[Revision]
    D --> C
    C --> |"Passes"| E[Final Response]
    F[Constitution] --> C
    style F fill:#e3f2fd
    style E fill:#e8f5e9
```

Example Constitution Principles

```python
CONSTITUTION = [
    "Please choose the response that is most helpful, harmless, and honest.",
    "Please choose the response that does not encourage illegal activities.",
    "Please choose the response that is not racist, sexist, or discriminatory.",
    "Please choose the response that does not contain personal attacks.",
    "Please choose the response that provides accurate information.",
]

def constitutional_critique(response: str, principles: list) -> dict:
    """
    Self-critique a response against constitutional principles

    In practice, this would use an LLM to evaluate
    """
    critique_prompt = f"""
Response to evaluate:
{response}

Evaluate this response against the following principles:
{chr(10).join(f'{i+1}. {p}' for i, p in enumerate(principles))}

For each principle, indicate if the response:
- PASSES: Adheres to the principle
- FAILS: Violates the principle (explain how)
- UNCERTAIN: Cannot determine

Provide a revised response if any principles are violated.
"""
    # In practice: call LLM with critique_prompt
    return {"critique": critique_prompt, "needs_revision": False}
```
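The critique-and-revise loop from the diagram can be sketched as follows. Both `critique` and `revise` are toy stubs here (a keyword check and a string replacement); in practice each would be an LLM call built on the critique prompt above:

```python
principles = [
    "Please choose the response that does not contain personal attacks.",
    "Please choose the response that provides accurate information.",
]

def critique(response: str, principles: list) -> dict:
    """Toy stub: flag a principle if the response contains an obvious violation.
    In practice, an LLM would judge the response against each principle."""
    violations = [p for p in principles
                  if "attack" in p and "insult" in response.lower()]
    return {"needs_revision": bool(violations), "violations": violations}

def revise(response: str, violations: list) -> str:
    """Toy stub: in practice, an LLM rewrites the response to fix violations."""
    return response.replace("insult", "[removed]")

def constitutional_loop(response: str, principles: list, max_rounds: int = 3) -> str:
    """Critique, revise, and re-critique until the response passes or we give up."""
    for _ in range(max_rounds):
        result = critique(response, principles)
        if not result["needs_revision"]:
            break
        response = revise(response, result["violations"])
    return response

print(constitutional_loop("Here is an insult for you.", principles))
```

The `max_rounds` cap matters: without it, a critique model that never fully approves would loop forever.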

Claude's Approach

Claude 3.5 Sonnet layers Constitutional AI penalties on top of RLHF to refuse disallowed requests without losing helpfulness. This hybrid approach combines the best of both methods.

3.5 Inference-Time Scaling

2025 Revolution: Test-Time Compute

OpenAI's o1 and o3 models opened a new frontier: instead of just scaling training compute, we can scale inference-time compute to improve reasoning quality.

The Core Innovation

Traditional scaling law: More training compute = Better model

New insight: More inference compute = Better decisions

```mermaid
graph TB
    subgraph "Traditional LLM"
        A1[Input] --> B1[Single Forward Pass]
        B1 --> C1[Output]
    end
    subgraph "Reasoning Model (o1/o3)"
        A2[Input] --> B2[Chain of Thought]
        B2 --> C2[Self-Reflection]
        C2 --> D2{Good enough?}
        D2 --> |No| B2
        D2 --> |Yes| E2[Output]
    end
    style B2 fill:#e3f2fd
    style C2 fill:#fff3e0
```

How o1 Works

o1 uses reinforcement learning to train the model to:

  1. Generate a long chain of thought before answering
  2. Recognize and correct its mistakes
  3. Break down complex problems into simpler steps
  4. Try different approaches when stuck
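A simple, widely used form of inference-time scaling is self-consistency: sample several independent chains of thought and majority-vote over their final answers. A sketch with a stubbed sampler standing in for a real model call (the 60% per-sample accuracy is an arbitrary placeholder):

```python
import random
from collections import Counter

def sample_answer(prompt: str, rng: random.Random) -> str:
    """Stub for one sampled chain of thought ending in a final answer.
    A real implementation would call the model with temperature > 0."""
    # Simulate a model that answers "42" correctly 60% of the time,
    # and otherwise returns a random wrong digit
    return "42" if rng.random() < 0.6 else str(rng.randint(0, 9))

def self_consistency(prompt: str, n_samples: int = 25, seed: int = 0) -> str:
    """Majority vote over n independently sampled answers."""
    rng = random.Random(seed)
    answers = [sample_answer(prompt, rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))
```

Even with a modest per-sample success rate, the majority vote is far more reliable than any single sample, which is the basic trade of inference compute for accuracy.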

Thinking Levels

| Mode | Compute | Use Case |
|---|---|---|
| o3-mini Low | ~1x | Simple questions, fast responses |
| o3-mini Medium | ~3x | Moderate reasoning tasks |
| o3-mini High | ~10x | Complex math, coding, analysis |
| o3 Full | ~100x | Research-level problems |

TOPS: Thinking-Optimal Scaling

Research shows that optimal thinking time varies by problem difficulty. TOPS allows LLMs to decide how many tokens to generate for each problem:

```python
def adaptive_reasoning(model, prompt: str, max_thinking_tokens: int = 10000):
    """
    Adaptive reasoning that adjusts thinking depth based on confidence

    This is a simplified illustration of the concept
    """
    thinking = ""
    confidence = 0.0

    # Word count is used here as a rough proxy for token count
    while len(thinking.split()) < max_thinking_tokens and confidence < 0.95:
        # Generate next reasoning step
        next_step = model.generate(
            prompt + thinking,
            max_new_tokens=100,
            stop_sequences=["Therefore:", "Answer:"]
        )
        thinking += next_step

        # Estimate confidence (in practice, more sophisticated)
        if "I'm confident" in next_step or "The answer is" in next_step:
            confidence = 0.95
        elif "Let me reconsider" in next_step:
            confidence *= 0.8  # Uncertainty detected

    # Generate final answer
    final_answer = model.generate(
        prompt + thinking + "\nFinal Answer:",
        max_new_tokens=500
    )

    return {
        "thinking": thinking,
        "answer": final_answer,
        "thinking_tokens": len(thinking.split())
    }
```

3.6 Scaling Laws

Chinchilla Scaling Law

The Chinchilla paper (2022) showed that model size N and data size D should be scaled equally:

$$L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$

where \(\alpha \approx 0.34\) and \(\beta \approx 0.28\).
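With the paper's fitted constants (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7) and the common approximation C ≈ 6ND training FLOPs, the compute-optimal split can be found numerically. A plain-Python sketch (the grid bounds are arbitrary; note that this parametric fit can yield a higher tokens-per-parameter ratio than the popular ~20:1 rule of thumb):

```python
# Fitted constants from the Chinchilla paper (Hoffmann et al., 2022)
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss L(N, D) from the parametric scaling law."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def optimal_allocation(compute_flops: float):
    """Grid-search the model size N minimizing L(N, D) subject to C = 6 * N * D."""
    best = None
    for x in range(600, 1300):          # log10(N) from 6.00 to 12.99
        n = 10 ** (x / 100)
        d = compute_flops / (6 * n)     # tokens affordable at this model size
        l = chinchilla_loss(n, d)
        if best is None or l < best[0]:
            best = (l, n, d)
    return best[1], best[2]

n_opt, d_opt = optimal_allocation(1e22)
print(f"N ≈ {n_opt:.2e} params, D ≈ {d_opt:.2e} tokens, D/N ≈ {d_opt / n_opt:.0f}")
```

The same function, pointed at other budgets, is a useful starting point for Exercise 4.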

Densing Law (2025)

The Densing Law observes that capability density (capability per parameter) doubles approximately every 3.5 months.

Implications

| Year | Model | Params | Approx. GPT-4 Equivalent |
|---|---|---|---|
| 2023 | LLaMA 2 70B | 70B | ~50% |
| 2024 | LLaMA 3 70B | 70B | ~80% |
| 2024 | Mistral Small 3 | 24B | ~75% |
| 2025 | DeepSeek-R1 | 671B (37B active) | ~95% |
| 2025 | Qwen3-235B | 235B (22B active) | ~100% |
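The doubling period translates directly into a compounding rule; for example, a 3.5-month doubling period implies roughly an 11x density gain per year:

```python
def density_multiplier(months: float, doubling_period: float = 3.5) -> float:
    """Capability-density growth factor after `months`, per the Densing Law."""
    return 2 ** (months / doubling_period)

# One doubling period doubles density; a year compounds several doublings
print(f"After 12 months: {density_multiplier(12):.1f}x")  # prints "After 12 months: 10.8x"
```

This is the arithmetic behind the table above: a much smaller 2025 model can match a much larger 2023 one.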

3.7 Chapter Summary

Key Takeaways

  1. Pre-training teaches general knowledge; alignment makes models helpful and safe
  2. RLHF pioneered alignment but is complex (3 stages, PPO instability)
  3. DPO simplifies alignment to direct optimization on preferences (50% compute savings)
  4. Constitutional AI uses self-critique against principles for scalable alignment
  5. Inference-Time Scaling (o1/o3) trades compute at inference for better reasoning
  6. Scaling Laws guide efficient training; Densing Law shows capability density doubling every 3.5 months

Exercises

Exercise 1: DPO Implementation

Implement a complete DPO training loop using the loss function provided. Train on a small preference dataset and compare with a baseline SFT model.

Exercise 2: Constitutional Principles

Design a constitution for a customer service chatbot. Include at least 10 principles covering helpfulness, safety, and brand voice.

Exercise 3: Thinking Depth Analysis

Given problems of varying difficulty, analyze how much "thinking" (intermediate reasoning) improves accuracy. Plot accuracy vs. thinking tokens.

Exercise 4: Scaling Prediction

Using the Chinchilla scaling law, calculate the optimal model size for a fixed compute budget of \(10^{23}\) FLOPs.

Next Chapter

In the next chapter, we'll explore LLM Inference and Optimization - how to efficiently deploy models using quantization, vLLM, and TensorRT-LLM, and how to handle million-token context windows.
