Introduction
Training a Large Language Model involves two major phases: pre-training on massive text corpora, and alignment to make the model helpful, harmless, and honest. This chapter covers the evolution of alignment techniques from RLHF to DPO, the revolutionary inference-time scaling that powers reasoning models, and the scaling laws that guide LLM development.
What You'll Learn
- Pre-training objectives and data curation
- RLHF (Reinforcement Learning from Human Feedback)
- NEW: DPO and GRPO - simplified alignment without reward models
- NEW: Constitutional AI - self-critique based alignment
- NEW: Inference-Time Scaling and reasoning models (o1, o3)
- Scaling Laws and the Densing Law
3.1 Pre-training
The Pre-training Objective
Pre-training teaches the model to predict the next token given all previous tokens (autoregressive language modeling):
$$\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t | x_1, ..., x_{t-1}; \theta)$$
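This objective is ordinary cross-entropy over shifted token positions. A minimal PyTorch sketch (the vocabulary size and random logits are illustrative placeholders, not real model outputs):

```python
import torch
import torch.nn.functional as F

vocab_size = 32000
batch, seq_len = 2, 16
logits = torch.randn(batch, seq_len, vocab_size)         # stand-in for model outputs
tokens = torch.randint(0, vocab_size, (batch, seq_len))  # input token ids

# Predict token t+1 from positions <= t: shift logits left, targets right
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = tokens[:, 1:].reshape(-1)

# Negative log-likelihood averaged over all predicted positions
loss = F.cross_entropy(shift_logits, shift_labels)
print(f"NLL per token: {loss.item():.3f}")
```

This is exactly the sum in the equation above, divided by the number of predicted positions.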
Training Data
Modern LLMs are trained on trillions of tokens from diverse sources:
| Source | Examples | Purpose |
|---|---|---|
| Web Crawls | Common Crawl, C4 | General knowledge |
| Books | BookCorpus, Gutenberg | Long-form reasoning |
| Code | GitHub, Stack Overflow | Programming ability |
| Scientific | arXiv, PubMed | Technical knowledge |
| Conversations | Reddit, Forums | Dialogue patterns |
Data Quality Matters
The Densing Law (2025) shows that capability density (capability per parameter) doubles every ~3.5 months, primarily driven by:
- Data quality: Better filtering, deduplication, and curation
- Data diversity: Balanced mixture of domains
- Synthetic data: High-quality generated training examples
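Deduplication, the second item above, often starts with exact matching on normalized text. A minimal sketch (production pipelines use fuzzy methods such as MinHash, but this shows the idea):

```python
import hashlib

def dedup_exact(documents):
    """Drop exact duplicates by hashing normalized (stripped, lowercased) text."""
    seen, unique = set(), []
    for doc in documents:
        key = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["Hello world", "hello world ", "Different text"]
print(dedup_exact(docs))  # → ['Hello world', 'Different text']
```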
3.2 RLHF (Reinforcement Learning from Human Feedback)
The Three-Stage Pipeline
RLHF, popularized by ChatGPT, aligns models with human preferences through three stages:
Stage 1: Supervised Fine-Tuning (SFT)
Fine-tune the base model on high-quality demonstrations:
```python
# SFT training example (sketch using the Hugging Face Trainer API)
from datasets import Dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Load base model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# SFT dataset format
sft_examples = [
    {"prompt": "Explain quantum computing", "response": "Quantum computing uses..."},
    {"prompt": "Write a poem about AI", "response": "In circuits deep..."},
]
sft_dataset = Dataset.from_list(sft_examples)

# Training arguments
training_args = TrainingArguments(
    output_dir="./sft_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    warmup_steps=100,
)

# Train
trainer = Trainer(model=model, args=training_args, train_dataset=sft_dataset)
trainer.train()
```
Stage 2: Reward Model Training
Train a model to predict human preferences:
```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Reward model for RLHF: a scalar head on top of a base LM."""

    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # Get hidden states from the base model
        outputs = self.base_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        # Use the last hidden state of the final token
        last_hidden = outputs.hidden_states[-1][:, -1, :]
        reward = self.reward_head(last_hidden)
        return reward

def compute_preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry preference loss"""
    return -torch.log(torch.sigmoid(reward_chosen - reward_rejected)).mean()
```
Stage 3: PPO Training
Use Proximal Policy Optimization to maximize reward while staying close to the SFT model:
$$\mathcal{L}_{PPO} = \mathbb{E}\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right] - \beta \cdot D_{KL}(\pi_\theta || \pi_{SFT})$$
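The clipped surrogate above can be sketched directly in PyTorch. This is a minimal per-sample version; the advantage estimates and KL values are assumed to come from a rollout pipeline not shown here:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, kl_to_sft, eps=0.2, beta=0.02):
    """Clipped PPO surrogate plus a KL penalty toward the SFT policy.

    logp_new/logp_old: per-sample log-probs under current and rollout policies
    advantages: estimated advantages A_t
    kl_to_sft: per-sample KL(pi_theta || pi_SFT) estimates
    """
    ratio = torch.exp(logp_new - logp_old)                     # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Negate: we minimize a loss, but the objective is maximized
    return -torch.min(unclipped, clipped).mean() + beta * kl_to_sft.mean()

# Toy usage with random values
logp_new = torch.randn(8) * 0.1
logp_old = torch.randn(8) * 0.1
adv = torch.randn(8)
kl = torch.rand(8) * 0.05
loss = ppo_clip_loss(logp_new, logp_old, adv, kl)
```

Note the sign convention: the equation states an objective to maximize, so the code returns its negation as a loss.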
Challenges with RLHF
- Complexity: Requires training 3 separate models
- Instability: PPO is notoriously difficult to tune
- Cost: Expensive human annotation for preferences
- Reward hacking: Model may exploit reward model weaknesses
3.3 DPO (Direct Preference Optimization)
2024-2025 Breakthrough
DPO simplifies alignment by eliminating the reward model and RL loop entirely. Reported results suggest it needs roughly half the compute of an RLHF pipeline while preserving most of its alignment quality.
The Key Insight
DPO shows that the optimal policy under the RLHF objective has a closed-form solution. Instead of training a reward model, we can directly optimize:
$$\mathcal{L}_{DPO}(\pi_\theta; \pi_{ref}) = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right]$$
where \(y_w\) is the preferred response and \(y_l\) is the rejected response.
DPO Implementation
```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    reference_chosen_logps: torch.Tensor,
    reference_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> tuple[torch.Tensor, dict]:
    """
    Direct Preference Optimization loss.

    Args:
        policy_chosen_logps: Log probs of chosen responses under the policy
        policy_rejected_logps: Log probs of rejected responses under the policy
        reference_chosen_logps: Log probs of chosen responses under the reference
        reference_rejected_logps: Log probs of rejected responses under the reference
        beta: KL-penalty coefficient (higher = more conservative updates)

    Returns:
        DPO loss value and a dict of monitoring metrics
    """
    # Compute log ratios against the reference model
    chosen_logratios = policy_chosen_logps - reference_chosen_logps
    rejected_logratios = policy_rejected_logps - reference_rejected_logps

    # DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))
    logits = beta * (chosen_logratios - rejected_logratios)
    loss = -F.logsigmoid(logits).mean()

    # Compute metrics for monitoring
    chosen_rewards = beta * chosen_logratios.detach()
    rejected_rewards = beta * rejected_logratios.detach()
    reward_margin = (chosen_rewards - rejected_rewards).mean()

    return loss, {
        "reward_margin": reward_margin.item(),
        "chosen_rewards": chosen_rewards.mean().item(),
        "rejected_rewards": rejected_rewards.mean().item(),
    }

# Example usage with simulated sequence log probabilities
batch_size = 4
policy_chosen = torch.randn(batch_size) * 0.1 - 50
policy_rejected = torch.randn(batch_size) * 0.1 - 55
ref_chosen = torch.randn(batch_size) * 0.1 - 52
ref_rejected = torch.randn(batch_size) * 0.1 - 54

loss, metrics = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
print(f"DPO Loss: {loss.item():.4f}")
print(f"Reward Margin: {metrics['reward_margin']:.4f}")
```
GRPO (Group Relative Policy Optimization)
GRPO, introduced with DeepSeekMath and used to train DeepSeek-R1, is a PPO variant rather than a DPO extension: it samples a group of responses per prompt and uses each response's reward relative to the group mean as its advantage, eliminating the learned value function and improving training efficiency and stability.
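The core of this idea, group-relative advantage estimation, fits in a few lines. The reward values below are illustrative placeholders for scores from a reward model or verifier:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantages: normalize each response's reward by its
    group's mean and std, so no learned value function is needed.

    rewards: (num_prompts, group_size) scalar rewards per sampled response
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

# 2 prompts, 4 sampled responses each
rewards = torch.tensor([[1.0, 0.0, 0.5, 0.5],
                        [2.0, 2.0, 1.0, 3.0]])
adv = group_relative_advantages(rewards)
```

Each row of `adv` is zero-mean, so responses are rewarded only for beating their own group's average.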
3.4 Constitutional AI
Self-Critique Based Alignment
Constitutional AI (Anthropic, 2022) replaces human preference labeling with AI self-critique based on a set of principles (a "constitution").
Example Constitution Principles
```python
CONSTITUTION = [
    "Please choose the response that is most helpful, harmless, and honest.",
    "Please choose the response that does not encourage illegal activities.",
    "Please choose the response that is not racist, sexist, or discriminatory.",
    "Please choose the response that does not contain personal attacks.",
    "Please choose the response that provides accurate information.",
]

def constitutional_critique(response: str, principles: list) -> dict:
    """
    Self-critique a response against constitutional principles.
    In practice, this prompt would be sent to an LLM for evaluation.
    """
    numbered = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(principles))
    critique_prompt = f"""
Response to evaluate:
{response}

Evaluate this response against the following principles:
{numbered}

For each principle, indicate if the response:
- PASSES: Adheres to the principle
- FAILS: Violates the principle (explain how)
- UNCERTAIN: Cannot determine

Provide a revised response if any principles are violated.
"""
    # In practice: call an LLM with critique_prompt
    return {"critique": critique_prompt, "needs_revision": False}
```
Claude's Approach
Anthropic's Claude models combine Constitutional AI with RLHF: AI-generated preference labels based on the constitution supplement human feedback, helping the model refuse disallowed requests without losing helpfulness. This hybrid approach combines the strengths of both methods.
3.5 Inference-Time Scaling
2025 Revolution: Test-Time Compute
OpenAI's o1 and o3 models opened a new frontier: instead of just scaling training compute, we can scale inference-time compute to improve reasoning quality.
The Core Innovation
Traditional scaling law: More training compute = Better model
New insight: More inference compute = Better decisions
How o1 Works
o1 uses reinforcement learning to train the model to:
- Generate a long chain of thought before answering
- Recognize and correct its mistakes
- Break down complex problems into simpler steps
- Try different approaches when stuck
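The simplest way to spend extra inference compute is best-of-N sampling with self-consistency: sample several reasoning paths and majority-vote over final answers. A sketch, where `sample_answer` is a placeholder for a real model call:

```python
from collections import Counter
import random

def sample_answer(prompt: str) -> str:
    """Placeholder for sampling one chain of thought plus a final answer."""
    return random.choice(["42", "42", "42", "41"])  # a noisy model, right 3/4 of the time

def self_consistency(prompt: str, n: int = 16) -> str:
    """Sample n reasoning paths and majority-vote on the final answer."""
    answers = [sample_answer(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

random.seed(0)
print(self_consistency("What is 6 * 7?"))
```

With more samples, the vote concentrates on the model's most consistent answer; o1-style training goes further by teaching a single long reasoning trace to check and correct itself.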
Thinking Levels
| Mode | Compute | Use Case |
|---|---|---|
| o3-mini Low | ~1x | Simple questions, fast responses |
| o3-mini Medium | ~3x | Moderate reasoning tasks |
| o3-mini High | ~10x | Complex math, coding, analysis |
| o3 Full | ~100x | Research-level problems |
TOPS: Thinking-Optimal Scaling
Research shows that optimal thinking time varies by problem difficulty. TOPS allows LLMs to decide how many tokens to generate for each problem:
- Easy problems: Short reasoning is sufficient
- Hard problems: Extended reasoning improves accuracy
- Over-thinking: Can actually hurt performance on simple tasks
```python
def adaptive_reasoning(model, prompt: str, max_thinking_tokens: int = 10000):
    """
    Adaptive reasoning that adjusts thinking depth based on confidence.
    A simplified illustration; `model.generate` stands in for a real
    generation API.
    """
    thinking = ""
    confidence = 0.0

    while len(thinking.split()) < max_thinking_tokens and confidence < 0.95:
        # Generate the next reasoning step
        next_step = model.generate(
            prompt + thinking,
            max_new_tokens=100,
            stop_sequences=["Therefore:", "Answer:"],
        )
        thinking += next_step

        # Estimate confidence (in practice, more sophisticated)
        if "I'm confident" in next_step or "The answer is" in next_step:
            confidence = 0.95
        elif "Let me reconsider" in next_step:
            confidence *= 0.8  # Uncertainty detected

    # Generate the final answer
    final_answer = model.generate(
        prompt + thinking + "\nFinal Answer:",
        max_new_tokens=500,
    )
    return {
        "thinking": thinking,
        "answer": final_answer,
        "thinking_tokens": len(thinking.split()),
    }
```
3.6 Scaling Laws
Chinchilla Scaling Law
The Chinchilla paper (2022) showed that, for compute-optimal training, model size N and data size D should be scaled in roughly equal proportion, about 20 training tokens per parameter:
$$L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$
where \(\alpha \approx 0.34\) and \(\beta \approx 0.28\).
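A rough compute-optimal allocation follows from two common Chinchilla-derived approximations: training compute C ≈ 6·N·D FLOPs, and D/N ≈ 20 tokens per parameter. Both constants are rules of thumb from the paper, not exact fits:

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into model size N and token count D,
    assuming C = 6 * N * D and D = tokens_per_param * N."""
    # Substituting D = k * N into C = 6 * N * D gives N = sqrt(C / (6k))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(1e23)  # the budget from Exercise 4
print(f"~{n / 1e9:.0f}B params, ~{d / 1e12:.1f}T tokens")
```

For the 10^23 FLOP budget this suggests a model of roughly 30B parameters trained on a bit over half a trillion tokens.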
Densing Law (2025)
The Densing Law observes that capability density (capability per parameter) doubles approximately every 3.5 months:
Implications
- Today's 7B model = Last year's 70B model performance
- Efficiency gains come from better data, training, and architecture
- Smaller models become increasingly viable for deployment
| Year | Model | Params | Approx. GPT-4 Equivalent |
|---|---|---|---|
| 2023 | LLaMA 2 70B | 70B | ~50% |
| 2024 | LLaMA 3 70B | 70B | ~80% |
| 2024 | Mistral Small 3 | 24B | ~75% |
| 2025 | DeepSeek-R1 | 671B (37B active) | ~95% |
| 2025 | Qwen3-235B | 235B (22B active) | ~100% |
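The ~3.5-month doubling time implies simple exponential arithmetic for how small a model can get at fixed capability (an illustration of the trend, not a prediction for any specific model):

```python
def equivalent_params(params_today: float, months_ahead: float,
                      doubling_months: float = 3.5) -> float:
    """Parameters needed for the same capability after `months_ahead`,
    assuming capability density doubles every `doubling_months`."""
    return params_today / (2 ** (months_ahead / doubling_months))

# A 70B model's capability, one year later:
print(f"{equivalent_params(70e9, 12) / 1e9:.1f}B params")
```

Twelve months is about 3.4 doublings, roughly a 10x density gain, which is where the "today's 7B ≈ last year's 70B" rule of thumb comes from.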
3.7 Chapter Summary
Key Takeaways
- Pre-training teaches general knowledge; alignment makes models helpful and safe
- RLHF pioneered alignment but is complex (3 stages, PPO instability)
- DPO simplifies alignment to direct optimization on preferences (roughly half the compute of RLHF)
- Constitutional AI uses self-critique against principles for scalable alignment
- Inference-Time Scaling (o1/o3) trades compute at inference for better reasoning
- Scaling Laws guide efficient training; Densing Law shows capability density doubling every 3.5 months
Exercises
Exercise 1: DPO Implementation
Implement a complete DPO training loop using the loss function provided. Train on a small preference dataset and compare with a baseline SFT model.
Exercise 2: Constitutional Principles
Design a constitution for a customer service chatbot. Include at least 10 principles covering helpfulness, safety, and brand voice.
Exercise 3: Thinking Depth Analysis
Given problems of varying difficulty, analyze how much "thinking" (intermediate reasoning) improves accuracy. Plot accuracy vs. thinking tokens.
Exercise 4: Scaling Prediction
Using the Chinchilla scaling law, calculate the optimal model size for a fixed compute budget of \(10^{23}\) FLOPs.
Next Chapter
In the next chapter, we'll explore LLM Inference and Optimization - how to efficiently deploy models using quantization, vLLM, and TensorRT-LLM, and how to handle million-token context windows.