Chapter 3: LLM Training and Alignment

From Pre-training to RLHF, DPO, and Inference-Time Scaling

Reading Time: 35-40 min · Difficulty: Intermediate · Code Examples: 7 · Exercises: 4

Introduction

Training a Large Language Model involves two major phases: pre-training on massive text corpora, and alignment to make the model helpful, harmless, and honest. This chapter covers the evolution of alignment techniques from RLHF to DPO, the revolutionary inference-time scaling that powers reasoning models, and the scaling laws that guide LLM development.

What You'll Learn

  1. How the pre-training objective and training data shape base models
  2. The three-stage RLHF pipeline: SFT, reward modeling, and PPO
  3. How DPO and GRPO simplify preference optimization
  4. Constitutional AI and self-critique-based alignment
  5. Inference-time scaling and reasoning models (o1/o3)
  6. The Chinchilla scaling law and the Densing Law

3.1 Pre-training

The Pre-training Objective

Pre-training teaches the model to predict the next token given all previous tokens (autoregressive language modeling):

$$\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t | x_1, ..., x_{t-1}; \theta)$$

```mermaid
graph LR
    A[Raw Text Data] --> B[Filtering & Cleaning]
    B --> C[Tokenization]
    C --> D[Pre-training]
    D --> E[Base Model]
    E --> F[Alignment]
    F --> G[Final Model]
    style D fill:#e3f2fd
    style F fill:#e8f5e9
```
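In code, this objective is an ordinary cross-entropy between each position's prediction and the next token. A minimal PyTorch sketch with random placeholder logits standing in for model outputs:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Autoregressive LM loss: predict token t from tokens < t.

    logits: (batch, seq_len, vocab_size) model outputs
    tokens: (batch, seq_len) input token ids
    """
    # The logits at position t predict token t+1, so shift by one
    shift_logits = logits[:, :-1, :]   # predictions for positions 1..T-1
    shift_labels = tokens[:, 1:]       # targets: the next tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )

# Toy example: random logits give a loss near log(vocab_size)
vocab_size, batch, seq_len = 100, 2, 16
logits = torch.randn(batch, seq_len, vocab_size)
tokens = torch.randint(0, vocab_size, (batch, seq_len))
print(f"LM loss: {next_token_loss(logits, tokens).item():.3f}")
```

Training drives this loss down by making the model assign higher probability to the actual next token at every position.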

Training Data

Modern LLMs are trained on trillions of tokens from diverse sources:

| Source | Examples | Purpose |
|---|---|---|
| Web Crawls | Common Crawl, C4 | General knowledge |
| Books | BookCorpus, Gutenberg | Long-form reasoning |
| Code | GitHub, Stack Overflow | Programming ability |
| Scientific | arXiv, PubMed | Technical knowledge |
| Conversations | Reddit, Forums | Dialogue patterns |

Data Quality Matters

The Densing Law (2025) shows that capability density (capability per parameter) doubles every ~3.5 months, driven in large part by better data curation and filtering rather than raw scale alone.

3.2 RLHF (Reinforcement Learning from Human Feedback)

The Three-Stage Pipeline

RLHF, popularized by ChatGPT, aligns models with human preferences through three stages:

```mermaid
graph TD
    subgraph Stage1["Stage 1: SFT"]
        A[Base Model] --> B[Supervised Fine-tuning]
        B --> C[SFT Model]
    end
    subgraph Stage2["Stage 2: RM Training"]
        C --> D[Generate Responses]
        D --> E[Human Rankings]
        E --> F[Train Reward Model]
    end
    subgraph Stage3["Stage 3: PPO"]
        C --> G[Generate]
        F --> H[Score]
        G --> H
        H --> I[PPO Update]
        I --> J[Aligned Model]
    end
    style C fill:#fff3e0
    style F fill:#e3f2fd
    style J fill:#e8f5e9
```

Stage 1: Supervised Fine-Tuning (SFT)

Fine-tune the base model on high-quality demonstrations:

```python
# SFT Training Example
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Load base model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# SFT dataset format: high-quality prompt/response demonstrations
sft_examples = [
    {"prompt": "Explain quantum computing", "response": "Quantum computing uses..."},
    {"prompt": "Write a poem about AI", "response": "In circuits deep..."},
]

# In practice, tokenize the examples into a Dataset first
# (concatenate prompt + response; typically mask prompt tokens in the labels)
sft_dataset = ...  # tokenized version of sft_examples

# Training arguments
training_args = TrainingArguments(
    output_dir="./sft_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    warmup_steps=100,
)

# Train
trainer = Trainer(model=model, args=training_args, train_dataset=sft_dataset)
trainer.train()
```

Stage 2: Reward Model Training

Train a model to predict human preferences:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Reward model for RLHF"""

    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # Get hidden states from base model
        outputs = self.base_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True
        )
        # Use last hidden state of last token
        last_hidden = outputs.hidden_states[-1][:, -1, :]
        reward = self.reward_head(last_hidden)
        return reward

def compute_preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry preference loss"""
    return -torch.log(torch.sigmoid(reward_chosen - reward_rejected)).mean()
```

Stage 3: PPO Training

Use Proximal Policy Optimization to maximize reward while staying close to the SFT model:

$$\mathcal{L}_{PPO} = \mathbb{E}\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right] - \beta \cdot D_{KL}(\pi_\theta || \pi_{SFT})$$
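The clipped surrogate with KL penalty can be sketched directly from the formula. A minimal, self-contained version operating on placeholder tensors; real training would compute these log-probabilities and advantages from rollouts of the policy:

```python
import torch

def ppo_loss(
    logp_new: torch.Tensor,    # log pi_theta(a|s) under the current policy
    logp_old: torch.Tensor,    # log probs recorded at rollout time
    logp_sft: torch.Tensor,    # log probs under the frozen SFT model
    advantages: torch.Tensor,  # advantage estimates A_hat
    clip_eps: float = 0.2,
    kl_beta: float = 0.1,
) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()              # clipped objective
    kl_penalty = (logp_new - logp_sft).mean()                     # sample-based KL estimate
    # We maximize (surrogate - beta * KL), so minimize its negative
    return -(surrogate - kl_beta * kl_penalty)

# Toy rollout batch
logp_old = torch.randn(8) - 2
logp_new = logp_old + 0.05 * torch.randn(8)
logp_sft = logp_old.clone()
adv = torch.randn(8)
print(f"PPO loss: {ppo_loss(logp_new, logp_old, logp_sft, adv).item():.4f}")
```

The clipping keeps each policy update small, while the KL term anchors the policy to the SFT model so it cannot drift toward degenerate, reward-hacked outputs.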

Challenges with RLHF

  1. Complexity: three separate training stages, each with its own failure modes
  2. Instability: PPO is sensitive to hyperparameters and can diverge
  3. Reward hacking: the policy can exploit flaws in the learned reward model
  4. Cost: collecting human preference rankings is slow and expensive

3.3 DPO (Direct Preference Optimization)

2024-2025 Breakthrough

DPO simplifies alignment by eliminating the reward model and the RL loop entirely; reported results show roughly a 50% compute reduction while retaining about 95% of RLHF's alignment effectiveness.

The Key Insight

DPO shows that the optimal policy under the RLHF objective has a closed-form solution. Instead of training a reward model, we can directly optimize:

$$\mathcal{L}_{DPO}(\pi_\theta; \pi_{ref}) = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right]$$

where \(y_w\) is the preferred response and \(y_l\) is the rejected response.

```mermaid
graph LR
    subgraph "RLHF (Traditional)"
        A1[SFT Model] --> B1[Reward Model]
        B1 --> C1[PPO Training]
        C1 --> D1[Aligned Model]
    end
    subgraph "DPO (Simplified)"
        A2[SFT Model] --> B2[Direct Preference Optimization]
        B2 --> D2[Aligned Model]
    end
    style C1 fill:#ffcdd2
    style B2 fill:#c8e6c9
```

DPO Implementation

```python
import torch
import torch.nn.functional as F
from typing import Tuple

def dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    reference_chosen_logps: torch.Tensor,
    reference_rejected_logps: torch.Tensor,
    beta: float = 0.1
) -> Tuple[torch.Tensor, dict]:
    """
    Direct Preference Optimization loss

    Args:
        policy_chosen_logps: Log probs of chosen responses under policy
        policy_rejected_logps: Log probs of rejected responses under policy
        reference_chosen_logps: Log probs of chosen responses under reference
        reference_rejected_logps: Log probs of rejected responses under reference
        beta: Temperature parameter (higher = more conservative updates)

    Returns:
        Tuple of (DPO loss, metrics dict for monitoring)
    """
    # Compute log ratios
    chosen_logratios = policy_chosen_logps - reference_chosen_logps
    rejected_logratios = policy_rejected_logps - reference_rejected_logps

    # DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))
    logits = beta * (chosen_logratios - rejected_logratios)
    loss = -F.logsigmoid(logits).mean()

    # Compute metrics for monitoring
    chosen_rewards = beta * chosen_logratios.detach()
    rejected_rewards = beta * rejected_logratios.detach()
    reward_margin = (chosen_rewards - rejected_rewards).mean()

    return loss, {
        'reward_margin': reward_margin.item(),
        'chosen_rewards': chosen_rewards.mean().item(),
        'rejected_rewards': rejected_rewards.mean().item()
    }

# Example usage with simulated log probabilities
batch_size = 4

policy_chosen = torch.randn(batch_size) * 0.1 - 50
policy_rejected = torch.randn(batch_size) * 0.1 - 55
ref_chosen = torch.randn(batch_size) * 0.1 - 52
ref_rejected = torch.randn(batch_size) * 0.1 - 54

loss, metrics = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
print(f"DPO Loss: {loss.item():.4f}")
print(f"Reward Margin: {metrics['reward_margin']:.4f}")
```

GRPO (Group Relative Policy Optimization)

GRPO, used to train DeepSeek-R1, simplifies PPO rather than DPO: it samples a group of responses per prompt and computes each response's advantage relative to the group's mean reward, eliminating the separate value model and improving training efficiency and stability.
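The core of GRPO is how advantages are computed: each sampled response is scored relative to the other responses drawn for the same prompt. A minimal sketch of that standardization step (the reward values are placeholders):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standardize each response's reward within its prompt group.

    rewards: (num_prompts, group_size) scalar rewards for sampled responses
    Returns advantages of the same shape: (reward - group mean) / group std.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# 2 prompts, 4 sampled responses each
rewards = torch.tensor([[1.0, 0.0, 0.5, 0.5],
                        [2.0, 2.0, 1.0, 3.0]])
adv = group_relative_advantages(rewards)
print(adv)  # each row is centered around zero
```

Because the baseline is the group mean rather than a learned value function, no critic network needs to be trained or kept in memory.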

3.4 Constitutional AI

Self-Critique Based Alignment

Constitutional AI (Anthropic, 2022) replaces human preference labeling with AI self-critique based on a set of principles (a "constitution").

```mermaid
graph TD
    A[Prompt] --> B[Initial Response]
    B --> C{Self-Critique}
    C --> |"Violates principles"| D[Revision]
    D --> C
    C --> |"Passes"| E[Final Response]
    F[Constitution] --> C
    style F fill:#e3f2fd
    style E fill:#e8f5e9
```

Example Constitution Principles

```python
CONSTITUTION = [
    "Please choose the response that is most helpful, harmless, and honest.",
    "Please choose the response that does not encourage illegal activities.",
    "Please choose the response that is not racist, sexist, or discriminatory.",
    "Please choose the response that does not contain personal attacks.",
    "Please choose the response that provides accurate information.",
]

def constitutional_critique(response: str, principles: list) -> dict:
    """
    Self-critique a response against constitutional principles

    In practice, this would use an LLM to evaluate
    """
    critique_prompt = f"""
Response to evaluate:
{response}

Evaluate this response against the following principles:
{chr(10).join(f'{i+1}. {p}' for i, p in enumerate(principles))}

For each principle, indicate if the response:
- PASSES: Adheres to the principle
- FAILS: Violates the principle (explain how)
- UNCERTAIN: Cannot determine

Provide a revised response if any principles are violated.
"""
    # In practice: call LLM with critique_prompt
    return {"critique": critique_prompt, "needs_revision": False}
```
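The critique-and-revise loop from the diagram can be sketched as follows. Both `critique` and `revise` are toy stubs here (a keyword check and a string replacement); in practice each would be an LLM call built on the critique prompt above:

```python
principles = [
    "Please choose the response that does not contain personal attacks.",
    "Please choose the response that provides accurate information.",
]

def critique(response: str, principles: list) -> dict:
    """Toy stub: flag a principle if the response contains an obvious violation.
    In practice, an LLM would judge the response against each principle."""
    violations = [p for p in principles
                  if "attack" in p and "insult" in response.lower()]
    return {"needs_revision": bool(violations), "violations": violations}

def revise(response: str, violations: list) -> str:
    """Toy stub: in practice, an LLM rewrites the response to fix violations."""
    return response.replace("insult", "[removed]")

def constitutional_loop(response: str, principles: list, max_rounds: int = 3) -> str:
    """Critique, revise, and re-critique until the response passes or we give up."""
    for _ in range(max_rounds):
        result = critique(response, principles)
        if not result["needs_revision"]:
            break
        response = revise(response, result["violations"])
    return response

print(constitutional_loop("Here is an insult for you.", principles))
```

The `max_rounds` cap matters: without it, a critique model that never fully approves would loop forever.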

Claude's Approach

Claude 3.5 Sonnet layers Constitutional AI penalties on top of RLHF to refuse disallowed requests without losing helpfulness. This hybrid approach combines the best of both methods.

3.5 Inference-Time Scaling

2025 Revolution: Test-Time Compute

OpenAI's o1 and o3 models opened a new frontier: instead of just scaling training compute, we can scale inference-time compute to improve reasoning quality.

The Core Innovation

Traditional scaling law: More training compute = Better model

New insight: More inference compute = Better decisions

```mermaid
graph TB
    subgraph "Traditional LLM"
        A1[Input] --> B1[Single Forward Pass]
        B1 --> C1[Output]
    end
    subgraph "Reasoning Model (o1/o3)"
        A2[Input] --> B2[Chain of Thought]
        B2 --> C2[Self-Reflection]
        C2 --> D2{Good enough?}
        D2 --> |No| B2
        D2 --> |Yes| E2[Output]
    end
    style B2 fill:#e3f2fd
    style C2 fill:#fff3e0
```

How o1 Works

o1 uses reinforcement learning to train the model to:

  1. Generate a long chain of thought before answering
  2. Recognize and correct its mistakes
  3. Break down complex problems into simpler steps
  4. Try different approaches when stuck
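A simple, widely used form of inference-time scaling is self-consistency: sample several independent chains of thought and majority-vote over their final answers. A sketch with a stubbed sampler standing in for a real model call (the 60% per-sample accuracy is an arbitrary placeholder):

```python
import random
from collections import Counter

def sample_answer(prompt: str, rng: random.Random) -> str:
    """Stub for one sampled chain of thought ending in a final answer.
    A real implementation would call the model with temperature > 0."""
    # Simulate a model that answers "42" correctly 60% of the time,
    # and otherwise returns a random wrong digit
    return "42" if rng.random() < 0.6 else str(rng.randint(0, 9))

def self_consistency(prompt: str, n_samples: int = 25, seed: int = 0) -> str:
    """Majority vote over n independently sampled answers."""
    rng = random.Random(seed)
    answers = [sample_answer(prompt, rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))
```

Even with a modest per-sample success rate, the majority vote is far more reliable than any single sample, which is the basic trade of inference compute for accuracy.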

Thinking Levels

| Mode | Compute | Use Case |
|---|---|---|
| o3-mini Low | ~1x | Simple questions, fast responses |
| o3-mini Medium | ~3x | Moderate reasoning tasks |
| o3-mini High | ~10x | Complex math, coding, analysis |
| o3 Full | ~100x | Research-level problems |

TOPS: Thinking-Optimal Scaling

Research shows that optimal thinking time varies by problem difficulty. TOPS allows LLMs to decide how many tokens to generate for each problem:

```python
def adaptive_reasoning(model, prompt: str, max_thinking_tokens: int = 10000):
    """
    Adaptive reasoning that adjusts thinking depth based on confidence

    This is a simplified illustration of the concept
    """
    thinking = ""
    confidence = 0.0

    # Word count is used here as a rough proxy for token count
    while len(thinking.split()) < max_thinking_tokens and confidence < 0.95:
        # Generate next reasoning step
        next_step = model.generate(
            prompt + thinking,
            max_new_tokens=100,
            stop_sequences=["Therefore:", "Answer:"]
        )
        thinking += next_step

        # Estimate confidence (in practice, more sophisticated)
        if "I'm confident" in next_step or "The answer is" in next_step:
            confidence = 0.95
        elif "Let me reconsider" in next_step:
            confidence *= 0.8  # Uncertainty detected

    # Generate final answer
    final_answer = model.generate(
        prompt + thinking + "\nFinal Answer:",
        max_new_tokens=500
    )

    return {
        "thinking": thinking,
        "answer": final_answer,
        "thinking_tokens": len(thinking.split())
    }
```

3.6 Scaling Laws

Chinchilla Scaling Law

The Chinchilla paper (2022) showed that model size N and data size D should be scaled equally:

$$L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$

where \(\alpha \approx 0.34\) and \(\beta \approx 0.28\).
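With the paper's fitted constants (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7) and the common approximation C ≈ 6ND training FLOPs, the compute-optimal split can be found numerically. A plain-Python sketch (the grid bounds are arbitrary; note that this parametric fit can yield a higher tokens-per-parameter ratio than the popular ~20:1 rule of thumb):

```python
# Fitted constants from the Chinchilla paper (Hoffmann et al., 2022)
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss L(N, D) from the parametric scaling law."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def optimal_allocation(compute_flops: float):
    """Grid-search the model size N minimizing L(N, D) subject to C = 6 * N * D."""
    best = None
    for x in range(600, 1300):          # log10(N) from 6.00 to 12.99
        n = 10 ** (x / 100)
        d = compute_flops / (6 * n)     # tokens affordable at this model size
        l = chinchilla_loss(n, d)
        if best is None or l < best[0]:
            best = (l, n, d)
    return best[1], best[2]

n_opt, d_opt = optimal_allocation(1e22)
print(f"N ≈ {n_opt:.2e} params, D ≈ {d_opt:.2e} tokens, D/N ≈ {d_opt / n_opt:.0f}")
```

The same function, pointed at other budgets, is a useful starting point for Exercise 4.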

Densing Law (2025)

The Densing Law observes that capability density (capability per parameter) doubles approximately every 3.5 months.

Implications

| Year | Model | Params | Approx. GPT-4 Equivalent |
|---|---|---|---|
| 2023 | LLaMA 2 70B | 70B | ~50% |
| 2024 | LLaMA 3 70B | 70B | ~80% |
| 2024 | Mistral Small 3 | 24B | ~75% |
| 2025 | DeepSeek-R1 | 671B (37B active) | ~95% |
| 2025 | Qwen3-235B | 235B (22B active) | ~100% |
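The doubling period translates directly into a compounding rule; for example, a 3.5-month doubling period implies roughly an 11x density gain per year:

```python
def density_multiplier(months: float, doubling_period: float = 3.5) -> float:
    """Capability-density growth factor after `months`, per the Densing Law."""
    return 2 ** (months / doubling_period)

# One doubling period doubles density; a year compounds several doublings
print(f"After 12 months: {density_multiplier(12):.1f}x")  # prints "After 12 months: 10.8x"
```

This is the arithmetic behind the table above: a much smaller 2025 model can match a much larger 2023 one.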

3.7 Chapter Summary

Key Takeaways

  1. Pre-training teaches general knowledge; alignment makes models helpful and safe
  2. RLHF pioneered alignment but is complex (3 stages, PPO instability)
  3. DPO simplifies alignment to direct optimization on preferences (50% compute savings)
  4. Constitutional AI uses self-critique against principles for scalable alignment
  5. Inference-Time Scaling (o1/o3) trades compute at inference for better reasoning
  6. Scaling Laws guide efficient training; Densing Law shows capability density doubling every 3.5 months

Exercises

Exercise 1: DPO Implementation

Implement a complete DPO training loop using the loss function provided. Train on a small preference dataset and compare with a baseline SFT model.

Exercise 2: Constitutional Principles

Design a constitution for a customer service chatbot. Include at least 10 principles covering helpfulness, safety, and brand voice.

Exercise 3: Thinking Depth Analysis

Given problems of varying difficulty, analyze how much "thinking" (intermediate reasoning) improves accuracy. Plot accuracy vs. thinking tokens.

Exercise 4: Scaling Prediction

Using the Chinchilla scaling law, calculate the optimal model size for a fixed compute budget of \(10^{23}\) FLOPs.

Next Chapter

In the next chapter, we'll explore LLM Inference and Optimization - how to efficiently deploy models using quantization, vLLM, and TensorRT-LLM, and how to handle million-token context windows.
