This chapter covers Large Language Models (LLMs). You will learn the GPT family architecture, LLM training methods (pre-training, Instruction Tuning, RLHF), and practical prompt engineering techniques.
Learning Objectives
By reading this chapter, you will be able to:
- Understand GPT family architecture and autoregressive generation
- Understand LLM training methods (pre-training, Instruction Tuning, RLHF)
- Practice Prompt Engineering techniques
- Understand characteristics of open-source LLMs and quantization techniques
- Implement practical applications such as RAG and Function Calling
- Build a complete chatbot system
4.1 The GPT Family
Overview of GPT Architecture
GPT (Generative Pre-trained Transformer) is an autoregressive language model that adopts a Decoder-only Transformer architecture.
Decoder-only: Unlike the original Transformer, which pairs an Encoder with a Decoder (and unlike BERT, which is Encoder-only), GPT consists only of the Decoder stack and specializes in next-token prediction.
Features of GPT Architecture
| Feature | Description | Advantage |
|---|---|---|
| Decoder-only | Uses only self-attention mechanism | Simple and scalable |
| Causal Masking | Hides future tokens | Enables autoregressive generation |
| Autoregressive | Generates sequentially from left to right | Natural text generation |
| Pre-training | Trained on large-scale text | General language understanding |
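The causal masking described in the table can be sketched in a few lines of NumPy; the scores below are random stand-ins for real attention logits:

```python
import numpy as np

np.random.seed(0)
seq_len = 4
scores = np.random.randn(seq_len, seq_len)  # toy attention scores

# Causal mask: position i may only attend to positions j <= i
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
masked_scores = np.where(mask, scores, -np.inf)

# Softmax over each row: masked (future) positions get zero probability
exp = np.exp(masked_scores - masked_scores.max(axis=-1, keepdims=True))
attn = exp / exp.sum(axis=-1, keepdims=True)

print(np.round(attn, 3))
# Row i has nonzero weights only in columns 0..i
```

Because the upper triangle is set to negative infinity before the softmax, each token's attention distribution covers only its own position and earlier ones, which is exactly what enables autoregressive generation.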
Evolution of GPT
- GPT-1 (2018): 117M parameters
- GPT-2 (2019): 1.5B parameters
- GPT-3 (2020): 175B parameters
- GPT-3.5 / ChatGPT (2022)
- GPT-4 (2023): multimodal
- GPT-4o (2024): omni-modal
Comparison of GPT-2/GPT-3/GPT-4/GPT-4o
| Model | Parameters | Context Length | Key Features |
|---|---|---|---|
| GPT-2 | 117M - 1.5B | 1,024 | High-quality text generation |
| GPT-3 | 175B | 2,048 | Few-shot learning, In-context Learning |
| GPT-3.5 | ~175B | 4,096 | Instruction Tuning, improved dialogue |
| GPT-4 | Undisclosed | 8,192 - 32,768 | Multimodal, advanced reasoning |
| GPT-4o | Undisclosed | 128,000 | Omni-modal (text, vision, audio), faster inference, native multimodal |
Autoregressive Generation
GPT generates text autoregressively by predicting the next token from previous tokens.
$$ P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i | x_1, x_2, \ldots, x_{i-1}) $$
The probability of each token is conditioned on all preceding tokens.
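This chain-rule factorization can be made concrete with a toy conditional distribution (the probabilities below are invented for illustration):

```python
# Toy conditional probabilities P(next_token | context) -- made-up values
cond_prob = {
    (): {"the": 0.5, "a": 0.5},
    ("the",): {"cat": 0.4, "dog": 0.6},
    ("the", "cat"): {"sat": 0.7, "ran": 0.3},
}

def sequence_probability(tokens):
    """P(x1..xn) = product over i of P(x_i | x_<i)."""
    p = 1.0
    for i, tok in enumerate(tokens):
        p *= cond_prob[tuple(tokens[:i])][tok]
    return p

print(round(sequence_probability(["the", "cat", "sat"]), 4))  # 0.5 * 0.4 * 0.7 = 0.14
```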
Example: Text Generation with GPT-2
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
# - transformers>=4.30.0
"""
Example: Text Generation with GPT-2
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
# Load GPT-2 model and tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
# Text generation
prompt = "Artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt")
# Generation parameters
generation_config = {
'max_length': 100,
'num_return_sequences': 3,
'temperature': 0.8,
'top_k': 50,
'top_p': 0.95,
'do_sample': True,
'no_repeat_ngram_size': 2
}
# Generate text
with torch.no_grad():
outputs = model.generate(
inputs['input_ids'],
**generation_config
)
print("=== Text Generation with GPT-2 ===")
print(f"Prompt: '{prompt}'\n")
for i, output in enumerate(outputs):
text = tokenizer.decode(output, skip_special_tokens=True)
print(f"Generation {i+1}:")
print(f"{text}\n")
Sample Output:
=== Text Generation with GPT-2 ===
Prompt: 'Artificial intelligence is'
Generation 1:
Artificial intelligence is becoming more and more important in our daily lives. From smartphones to self-driving cars, AI systems are transforming the way we work and live. The technology has advanced rapidly...
Generation 2:
Artificial intelligence is a field of computer science that focuses on creating intelligent machines capable of performing tasks that typically require human intelligence, such as visual perception...
Generation 3:
Artificial intelligence is revolutionizing industries across the globe. Companies are investing billions in AI research to develop systems that can learn from data and make decisions autonomously...
Controlling Generation Parameters
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Controlling Generation Parameters
Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""
import matplotlib.pyplot as plt
import numpy as np
# Generation with different temperatures
prompt = "The future of AI is"
temperatures = [0.3, 0.7, 1.0, 1.5]
print("=== Impact of Temperature ===\n")
for temp in temperatures:
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
outputs = model.generate(
inputs['input_ids'],
max_length=50,
temperature=temp,
do_sample=True,
top_k=50
)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Temperature = {temp}:")
print(f"{text}\n")
# Visualize temperature and probability distribution
def softmax_with_temperature(logits, temperature):
"""Apply softmax with temperature parameter"""
return torch.softmax(logits / temperature, dim=-1)
# Sample logits
logits = torch.tensor([2.0, 1.0, 0.5, 0.2, 0.1])
temps = [0.5, 1.0, 2.0]
plt.figure(figsize=(12, 4))
for i, temp in enumerate(temps):
probs = softmax_with_temperature(logits, temp)
plt.subplot(1, 3, i+1)
plt.bar(range(len(probs)), probs.numpy())
plt.title(f'Temperature = {temp}')
plt.xlabel('Token ID')
plt.ylabel('Probability')
plt.ylim(0, 1)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nLow temperature → deterministic (selects high-probability tokens)")
print("High temperature → diverse (low-probability tokens are also sampled)")
Beam Search and Sampling
# Requirements:
# - Python 3.9+
# - transformers>=4.30.0
"""
Example: Beam Search and Sampling
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""
from transformers import GenerationConfig
prompt = "Machine learning can be used for"
inputs = tokenizer(prompt, return_tensors="pt")
print("=== Comparison of Generation Strategies ===\n")
# 1. Greedy Decoding
print("1. Greedy Decoding:")
with torch.no_grad():
outputs = model.generate(
inputs['input_ids'],
max_length=50,
do_sample=False
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print()
# 2. Beam Search
print("2. Beam Search (num_beams=5):")
with torch.no_grad():
outputs = model.generate(
inputs['input_ids'],
max_length=50,
num_beams=5,
do_sample=False
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print()
# 3. Top-k Sampling
print("3. Top-k Sampling (k=50):")
with torch.no_grad():
outputs = model.generate(
inputs['input_ids'],
max_length=50,
do_sample=True,
top_k=50
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print()
# 4. Top-p (Nucleus) Sampling
print("4. Top-p Sampling (p=0.95):")
with torch.no_grad():
outputs = model.generate(
inputs['input_ids'],
max_length=50,
do_sample=True,
top_p=0.95,
top_k=0
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Important: Greedy/Beam Search are deterministic and consistent, while Sampling provides diversity but lower reproducibility.
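The filtering steps behind Top-k and Top-p sampling can be sketched in NumPy; this mirrors the idea of the filtering that `generate` applies internally, not its exact implementation:

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep the k highest-probability tokens and renormalize."""
    idx = np.argsort(probs)[::-1][:k]
    filtered = np.zeros_like(probs)
    filtered[idx] = probs[idx]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability >= p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # include the token crossing p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

vocab_probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])  # toy next-token distribution
print("top-k (k=2):", np.round(top_k_filter(vocab_probs, 2), 3))
print("top-p (p=0.8):", np.round(top_p_filter(vocab_probs, 0.8), 3))
```

With k=2 only the two most likely tokens survive; with p=0.8 the nucleus is the three top tokens (cumulative 0.85), so the candidate set adapts to how peaked the distribution is.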
4.2 LLM Training Methods
Overall Picture of LLM Training
Pre-training → Base Model (Base LLM) → Instruction Tuning → Instruction-following Model (Instruction-following LLM) → RLHF (Reinforcement Learning from Human Feedback) → Aligned Model (Aligned LLM)
1. Pre-training
Pre-training involves learning next token prediction tasks on large-scale text corpora.
| Element | Content |
|---|---|
| Data | Hundreds of billions to trillions of tokens from web, books, papers, etc. |
| Objective Function | Next token prediction (Language Modeling) |
| Optimization | AdamW, learning rate scheduling |
| Compute Resources | Thousands to tens of thousands of GPUs, weeks to months |
Loss Function:
$$
\mathcal{L}_{\text{LM}} = -\sum_{i=1}^{n} \log P(x_i \mid x_{<i})
$$
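For intuition, the language-modeling loss can be evaluated by hand on a toy sequence (the per-token probabilities below are made up):

```python
import math

# Model's predicted probability for each actual next token (toy values)
token_probs = [0.9, 0.6, 0.8]  # P(x_i | x_<i) for a 3-token sequence

# Negative log-likelihood, summed over positions, as in the formula above
nll = -sum(math.log(p) for p in token_probs)
print(f"Total negative log-likelihood: {nll:.4f}")
print(f"Per-token loss: {nll / len(token_probs):.4f}")
print(f"Perplexity: {math.exp(nll / len(token_probs)):.4f}")
```

Lower per-token loss (equivalently, lower perplexity) means the model assigned higher probability to the tokens that actually occurred.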
2. Instruction Tuning
Instruction Tuning fine-tunes the model to follow user instructions.
Instruction Tuning Data Format
# Requirements:
# - Python 3.9+
# - pandas>=2.0.0, <2.2.0
"""
Example: Instruction Tuning Data Format
Purpose: Demonstrate data manipulation and preprocessing
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: None
"""
instruction_data = [
{
"instruction": "Summarize the following text.",
"input": "Artificial Intelligence (AI) is technology that enables computers to think, learn, and solve problems like humans. It includes fields such as machine learning, deep learning, and natural language processing.",
"output": "AI is technology that gives computers human-like intelligence, including machine learning and natural language processing."
},
{
"instruction": "Answer the following question.",
"input": "How far is the distance from Earth to the Moon?",
"output": "The average distance from Earth to the Moon is approximately 384,400 kilometers (238,855 miles)."
},
{
"instruction": "Fix the bug in the following code.",
"input": "def add(a, b):\n return a - b",
"output": "def add(a, b):\n return a + b"
}
]
# Visualize data format
import pandas as pd
df = pd.DataFrame(instruction_data)
print("=== Instruction Tuning Data Format ===")
print(df.to_string(index=False))
print(f"\nNumber of data points: {len(instruction_data)} items")
Instruction Tuning Implementation Example (Conceptual)
# Requirements:
# - Python 3.9+
# - transformers>=4.30.0
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from datasets import Dataset
# Prepare data
def format_instruction(example):
"""Convert instruction data to prompt format"""
prompt = f"### Instruction:\n{example['instruction']}\n\n"
if example['input']:
prompt += f"### Input:\n{example['input']}\n\n"
prompt += f"### Response:\n{example['output']}"
return {"text": prompt}
# Create dataset
dataset = Dataset.from_list(instruction_data)
dataset = dataset.map(format_instruction)
print("=== Formatted Prompt Example ===")
print(dataset[0]['text'])
print("\n" + "="*50 + "\n")
# Prepare model and tokenizer (in practice, use large-scale models)
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# Tokenization function
def tokenize_function(examples):
return tokenizer(
examples["text"],
padding="max_length",
truncation=True,
max_length=512
)
# Tokenize dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)
# Training configuration (for demo)
training_args = TrainingArguments(
output_dir="./instruction-tuned-model",
num_train_epochs=3,
per_device_train_batch_size=4,
learning_rate=2e-5,
logging_steps=10,
save_steps=100,
evaluation_strategy="no"
)
print("=== Instruction Tuning Configuration ===")
print(f"Number of epochs: {training_args.num_train_epochs}")
print(f"Batch size: {training_args.per_device_train_batch_size}")
print(f"Learning rate: {training_args.learning_rate}")
print("\nInstruction Tuning makes the model follow instructions")
3. RLHF (Reinforcement Learning from Human Feedback)
RLHF is a technique that uses human feedback to align the model with human values.
Three Steps of RLHF
1. Supervised Fine-Tuning (SFT)
2. Reward Model Training
3. PPO Reinforcement Learning (Policy Optimization)
Step 1: Supervised Fine-Tuning (SFT)
Fine-tune with high-quality human-generated dialogue data.
Step 2: Reward Model Training
Train a reward model from human evaluations (rankings).
# Requirements:
# - Python 3.9+
# - pandas>=2.0.0, <2.2.0
"""
Example: Train a reward model from human evaluations (rankings).
Purpose: Demonstrate data manipulation and preprocessing
Target: Beginner to Intermediate
Execution time: 1-5 minutes
Dependencies: None
"""
# Reward model data format (conceptual)
reward_data = [
{
"prompt": "What is artificial intelligence?",
"response_1": "Artificial Intelligence (AI) is technology that enables computers to think like humans.",
"response_2": "I don't know.",
"preference": 1 # response_1 is better
},
{
"prompt": "How to reverse a list in Python?",
"response_1": "Use the list.reverse() method or list[::-1] slicing.",
"response_2": "Cannot do it.",
"preference": 1
}
]
import pandas as pd
df_reward = pd.DataFrame(reward_data)
print("=== Reward Model Training Data ===")
print(df_reward.to_string(index=False))
print("\nLearn a reward function from human preferences")
Step 3: PPO (Proximal Policy Optimization)
Optimize the model using reinforcement learning with the reward model.
Objective Function:
$$ \mathcal{L}^{\text{PPO}} = \mathbb{E}_{x,y \sim \pi_\theta} \left[ r(x, y) - \beta \cdot \text{KL}(\pi_\theta || \pi_{\text{ref}}) \right] $$
- $r(x, y)$: Reward model score
- $\beta$: KL regularization coefficient
- $\pi_{\text{ref}}$: Original model (to prevent large deviation)
Effect of RLHF: ChatGPT's human-like dialogue capabilities are realized through RLHF.
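For intuition, the KL-regularized objective above can be evaluated on toy numbers (a two-response toy setting; all values are invented):

```python
import math

# Policy and reference probabilities over two candidate responses (toy values)
pi_theta = [0.7, 0.3]
pi_ref = [0.5, 0.5]
rewards = [1.0, 0.2]   # reward model scores r(x, y)
beta = 0.1             # KL regularization coefficient

# Expected reward under the current policy
expected_reward = sum(p * r for p, r in zip(pi_theta, rewards))
# KL divergence penalizing deviation from the reference model
kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))
objective = expected_reward - beta * kl

print(f"Expected reward: {expected_reward:.4f}")
print(f"KL(pi_theta || pi_ref): {kl:.4f}")
print(f"Objective: {objective:.4f}")
```

The KL term subtracts from the reward whenever the policy drifts away from the reference model, which is what keeps RLHF from collapsing into reward hacking.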
4. Parameter-Efficient Fine-Tuning (PEFT)
Since fine-tuning entire large-scale models is computationally expensive, parameter-efficient fine-tuning is important.
LoRA (Low-Rank Adaptation)
LoRA applies low-rank decomposition to model weight matrices.
$$ W' = W + \Delta W = W + BA $$
- $W$: Original weight matrix (frozen)
- $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$: Trainable low-rank matrices
- $r \ll \min(d, k)$: Rank (e.g., 8, 16)
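A quick NumPy sketch shows why this factorization is parameter-efficient (dimensions chosen to match GPT-2's 768-dimensional hidden size; the weights are random stand-ins):

```python
import numpy as np

d, k, r = 768, 768, 8  # weight matrix dims and LoRA rank

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))         # frozen pretrained weight
B = np.zeros((d, r))                    # B starts at zero in LoRA
A = rng.standard_normal((r, k)) * 0.01  # small random init

W_adapted = W + B @ A  # W' = W + BA (no change until B is trained)

full_params = d * k
lora_params = r * (d + k)
print(f"Full fine-tuning params: {full_params:,}")
print(f"LoRA trainable params:   {lora_params:,}")
print(f"Ratio: {100 * lora_params / full_params:.2f}%")
```

Because B is initialized to zero, W' equals W at the start of training, and only r·(d+k) numbers are updated instead of d·k.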
# Requirements:
# - Python 3.9+
# - peft>=0.4.0
# - transformers>=4.30.0
"""
Example: LoRA (Low-Rank Adaptation) Configuration with PEFT
Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner to Intermediate
Execution time: 1-5 minutes
Dependencies: None
"""
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM
# Load base model
model_name = "gpt2"
base_model = AutoModelForCausalLM.from_pretrained(model_name)
# LoRA configuration
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=8, # Rank
lora_alpha=32, # Scaling factor
lora_dropout=0.1,
target_modules=["c_attn"] # Modules to apply
)
# Create LoRA model
lora_model = get_peft_model(base_model, lora_config)
# Compare parameter counts
total_params = sum(p.numel() for p in lora_model.parameters())
trainable_params = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
print("=== LoRA Parameter Efficiency ===")
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Trainable ratio: {100 * trainable_params / total_params:.2f}%")
print(f"\nFine-tune with only {100 * trainable_params / total_params:.2f}% of parameters")
lora_model.print_trainable_parameters()  # prints its own summary line
Sample Output:
=== LoRA Parameter Efficiency ===
Total parameters: 124,439,808
Trainable parameters: 294,912
Trainable ratio: 0.24%
Fine-tune with only 0.24% of parameters
trainable params: 294,912 || all params: 124,439,808 || trainable%: 0.24
Adapter Layers
Add small bottleneck networks to each layer of the Transformer.
| Method | Trainable Parameters | Memory | Speed |
|---|---|---|---|
| Full Fine-tuning | 100% | High | Slow |
| LoRA | 0.1% - 1% | Low | Fast |
| Adapter | 2% - 5% | Medium | Medium |
| Prompt Tuning | < 0.1% | Very low | Very fast |
4.3 Prompt Engineering
What is Prompt Engineering
Prompt Engineering is the technique of designing inputs (prompts) to elicit desired outputs from LLMs.
A well-designed prompt can dramatically improve a model's effective performance, with no change to the model itself.
1. Zero-shot Learning
Directly ask questions to a pre-trained model without providing any task examples.
# Requirements:
# - Python 3.9+
# - transformers>=4.30.0
from transformers import pipeline
# GPT-2 pipeline
generator = pipeline('text-generation', model='gpt2')
# Zero-shot prompt
zero_shot_prompt = """
Question: What is the capital of France?
Answer:
"""
result = generator(zero_shot_prompt, max_length=50, num_return_sequences=1)
print("=== Zero-shot Learning ===")
print(result[0]['generated_text'])
2. Few-shot Learning
Provide a few task examples before giving a new input.
# Few-shot prompt (In-Context Learning)
few_shot_prompt = """
Translate English to French:
English: Hello
French: Bonjour
English: Thank you
French: Merci
English: Good morning
French: Bon matin
English: How are you?
French:
"""
result = generator(few_shot_prompt, max_length=100, num_return_sequences=1)
print("\n=== Few-shot Learning ===")
print(result[0]['generated_text'])
print("\nBy showing examples, the model learns patterns and performs translation")
3. Chain-of-Thought (CoT) Prompting
Chain-of-Thought is a technique that prompts the model to generate intermediate reasoning steps.
# Standard prompt (direct answer)
standard_prompt = """
Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
Answer:
"""
# Chain-of-Thought prompt
cot_prompt = """
Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
Let's think step by step:
1. Roger starts with 5 tennis balls.
2. He buys 2 cans, each containing 3 tennis balls.
3. So he gets 2 Ć 3 = 6 new tennis balls.
4. Total tennis balls = 5 + 6 = 11.
Answer: 11 tennis balls.
Question: A restaurant had 23 customers. Then 11 more customers arrived.
Each customer ordered 2 drinks. How many drinks were ordered in total?
Let's think step by step:
"""
result = generator(cot_prompt, max_length=200, num_return_sequences=1)
print("\n=== Chain-of-Thought Prompting ===")
print(result[0]['generated_text'])
print("\nStep-by-step reasoning solves complex problems")
Effectiveness of CoT (Research Results)
| Task | Standard Prompt | CoT Prompt | Improvement |
|---|---|---|---|
| Math problems | 34% | 78% | +44% |
| Common sense reasoning | 61% | 89% | +28% |
| Logic puzzles | 42% | 81% | +39% |
4. Best Practices for Prompt Design
# Bad prompt: vague, no task detail
bad_prompt = "Summarize this."

# Good prompt: clear task, format, and constraints
good_prompt = """
Task: Summarize the following article in 3 bullet points.
Focus on key findings and implications.
Article: [Long article text...]
Summary:
-
"""
# Principles of prompt design
prompt_principles = {
"Clear instructions": "Describe the task specifically",
"Format specification": "Show the desired output format",
"Provide context": "Include necessary background information",
"State constraints": "Specify character limits, style, etc.",
"Show examples": "Demonstrate expected output with few-shot",
"Break down steps": "Divide complex tasks into stages"
}
print("=== Principles of Prompt Design ===")
for principle, description in prompt_principles.items():
print(f"- {principle}: {description}")
# Practical example: Structured prompt
structured_prompt = """
Role: You are an expert Python programmer.
Task: Review the following code and provide feedback.
Code:
```python
def calculate_average(numbers):
return sum(numbers) / len(numbers)
```
Output Format:
1. Code Quality (1-10):
2. Issues Found:
3. Suggestions:
4. Improved Code:
Analysis:
"""
print("\n=== Structured Prompt Example ===")
print(structured_prompt)
5. Prompt Management with LangChain
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.llms import HuggingFacePipeline
# Wrap HuggingFace pipeline with LangChain
llm = HuggingFacePipeline(pipeline=generator)
# Create prompt template
template = """
Question: {question}
Context: {context}
Please provide a detailed answer based on the context above.
Answer:
"""
prompt = PromptTemplate(
input_variables=["question", "context"],
template=template
)
# Create chain
chain = LLMChain(llm=llm, prompt=prompt)
# Execute
question = "What is machine learning?"
context = "Machine learning is a subset of artificial intelligence that enables systems to learn from data."
result = chain.run(question=question, context=context)
print("=== LangChain Prompt Template ===")
print(f"Question: {question}")
print(f"Context: {context}")
print(f"\nGenerated answer:")
print(result)
print("\nManage reusable prompt templates with LangChain")
4.4 Open-Source LLMs
Major Open-Source LLMs
| Model | Developer | Parameters | License | Features |
|---|---|---|---|---|
| LLaMA | Meta AI | 7B - 65B | Research only | High performance, research use |
| LLaMA-2 | Meta AI | 7B - 70B | LLaMA 2 Community | Commercial use, Chat version available |
| LLaMA-3 | Meta AI | 8B - 70B | LLaMA 3 Community | Improved reasoning, 8K context (2024-04) |
| Mistral | Mistral AI | 7B - Large | Apache 2.0 | Sliding window attention, efficient inference |
| Gemma | Google | 2B - 7B | Gemma Terms | Lightweight, responsible AI (2024-02) |
| Qwen-2 | Alibaba | 0.5B - 72B | Apache 2.0 | Multilingual, code-capable |
| Falcon | TII | 7B - 180B | Apache 2.0 | High-quality dataset |
| ELYZA | ELYZA Inc. | 7B - 13B | LLaMA based | Japanese-specialized |
| rinna | rinna | 3.6B - 36B | MIT | Japanese, commercial use allowed |
1. LLaMA / LLaMA-2
LLaMA (Large Language Model Meta AI) is an open-source LLM developed by Meta AI.
Features of LLaMA
- Efficient architecture: Equivalent performance to GPT-3 with fewer parameters
- Pre-Normalization: Uses RMSNorm
- SwiGLU activation function: Improved performance
- Rotary Positional Embeddings (RoPE): Position encoding
# Requirements:
# - Python 3.9+
# - pandas>=2.0.0, <2.2.0
# - torch>=2.0.0, <2.3.0
# - transformers>=4.30.0
"""
Example: Features of LLaMA
Purpose: Demonstrate data manipulation and preprocessing
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load LLaMA-2 model (7B model)
model_name = "meta-llama/Llama-2-7b-hf" # From Hugging Face Hub
# Note: Access request required on Hugging Face to use LLaMA-2
# Using GPT-2 here for demo purposes
model_name = "gpt2" # Replace with LLaMA-2 in actual environment
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
print("=== LLaMA Family Information ===")
print(f"Model: {model_name}")
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Architecture: Decoder-only Transformer")
print(f"Context length: 2048 (LLaMA) / 4096 (LLaMA-2)")
# LLaMA-2 performance (benchmark results)
benchmarks = {
"MMLU": {"LLaMA-7B": 35.1, "LLaMA-2-7B": 45.3, "GPT-3.5": 70.0},
"HellaSwag": {"LLaMA-7B": 76.1, "LLaMA-2-7B": 77.2, "GPT-3.5": 85.5},
"HumanEval": {"LLaMA-7B": 10.5, "LLaMA-2-7B": 12.8, "GPT-3.5": 48.1}
}
import pandas as pd
df_bench = pd.DataFrame(benchmarks)
print("\n=== Benchmark Performance Comparison ===")
print(df_bench.to_string())
2. Japanese LLMs (ELYZA, rinna)
ELYZA-japanese-Llama-2
A model that continues pre-training LLaMA-2 with Japanese text.
# ELYZA-japanese-Llama-2 usage example (conceptual)
# model_name = "elyza/ELYZA-japanese-Llama-2-7b"
japanese_prompt = """
Please answer the following question in Japanese.
Question: What is the difference between machine learning and deep learning?
Answer:
"""
print("=== Japanese LLM (ELYZA) ===")
print("- LLaMA-2 base + Japanese additional training")
print("- Natural dialogue in Japanese possible")
print("- Commercial use allowed (LLaMA-2 license)")
print(f"\nPrompt example:\n{japanese_prompt}")
# rinna GPT-NeoX
print("\n=== Japanese LLM (rinna) ===")
print("- GPT-NeoX architecture")
print("- Trained on Japanese Wikipedia, etc.")
print("- 3.6B, 36B models available")
print("- Commercial use allowed (MIT License)")
3. Model Quantization
Quantization is a technique to reduce memory and computation by lowering model precision.
Types of Quantization
| Precision | Memory Reduction | Performance Loss | Use Case |
|---|---|---|---|
| FP32 (Original) | - | - | Training |
| FP16 | 50% | Almost none | Inference |
| 8-bit | 75% | Small | Inference, fine-tuning |
| 4-bit | 87.5% | Medium | Memory-constrained environments |
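For intuition, the simplest scheme in this family, symmetric absmax int8 quantization, can be sketched directly (a toy sketch, not the exact algorithm bitsandbytes uses):

```python
import numpy as np

rng = np.random.default_rng(42)
weights = rng.standard_normal(1000).astype(np.float32)  # stand-in weight tensor

# Symmetric absmax quantization: map [-max|w|, max|w|] onto [-127, 127]
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize and measure round-trip error
dequantized = q.astype(np.float32) * scale
max_err = np.abs(weights - dequantized).max()

print(f"Scale: {scale:.6f}")
print(f"Max round-trip error: {max_err:.6f} (bound: {scale / 2:.6f})")
print(f"Memory: {weights.nbytes} bytes -> {q.nbytes} bytes")
```

Each weight needs 1 byte instead of 4, and the reconstruction error is bounded by half the quantization step, which is why 8-bit inference loses so little accuracy.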
8-bit Quantization Implementation
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
# - transformers>=4.30.0
"""
Example: 8-bit Quantization Implementation
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
# 8-bit quantization configuration
quantization_config_8bit = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_threshold=6.0
)
# 4-bit quantization configuration (QLoRA)
quantization_config_4bit = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_quant_type="nf4", # NormalFloat4
bnb_4bit_use_double_quant=True
)
# Load model (8-bit)
# model_8bit = AutoModelForCausalLM.from_pretrained(
# "meta-llama/Llama-2-7b-hf",
# quantization_config=quantization_config_8bit,
# device_map="auto"
# )
print("=== Model Quantization ===")
print("\n8-bit quantization:")
print("  - Memory usage: ~7GB (7B model, 75% reduction from FP32)")
print("  - Performance loss: minimal (1-2%)")
print("  - Inference speed: almost the same as FP32")
print("\n4-bit quantization (QLoRA):")
print("  - Memory usage: ~3.5GB (7B model, 87.5% reduction from FP32)")
print("  - Performance loss: small (5-10%)")
print("  - Fine-tuning possible")
# Calculate memory usage
def calculate_model_memory(num_params, precision):
"""Calculate model memory usage"""
bytes_per_param = {
'fp32': 4,
'fp16': 2,
'8bit': 1,
'4bit': 0.5
}
memory_gb = num_params * bytes_per_param[precision] / (1024**3)
return memory_gb
params_7b = 7_000_000_000
print("\n=== Memory Usage for 7B Model ===")
for precision in ['fp32', 'fp16', '8bit', '4bit']:
memory = calculate_model_memory(params_7b, precision)
print(f"{precision.upper():6s}: {memory:.2f} GB")
Output:
=== Memory Usage for 7B Model ===
FP32 : 26.08 GB
FP16 : 13.04 GB
8BIT : 6.52 GB
4BIT : 3.26 GB
QLoRA (Quantized LoRA): Combines 4-bit quantization with LoRA, enabling fine-tuning of large models on consumer GPUs.
4. Multimodal LLMs
Modern LLMs have evolved beyond text to understand and generate multiple modalities including images, audio, and video. These Multimodal Large Language Models (MLLMs) represent a significant advancement in AI capabilities.
Major Multimodal LLMs
| Model | Developer | Modalities | Key Features |
|---|---|---|---|
| GPT-4V | OpenAI | Text, Vision | Advanced image understanding, visual reasoning |
| GPT-4o | OpenAI | Text, Vision, Audio | Native multimodal, real-time audio conversation |
| Claude 3 | Anthropic | Text, Vision | Strong vision analysis, document understanding |
| Gemini | Google | Text, Vision, Audio | Natively multimodal, long context |
| LLaVA | Microsoft/UW | Text, Vision | Open-source, visual instruction tuning |
Vision-Language Model Architecture
Vision-Language Models (VLMs) typically consist of three main components:
Image Input → Vision Encoder (ViT/CLIP) → Projection Layer (Linear/MLP) → LLM Backbone (GPT/LLaMA) → Text Output, with the text input fed directly to the LLM backbone.
- Vision Encoder: Extracts visual features from images (e.g., ViT, CLIP)
- Projection Layer: Maps visual features to the LLM's embedding space
- LLM Backbone: Processes combined visual and text tokens for generation
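The data flow through these three components can be sketched with shapes alone (random arrays stand in for real features; the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

n_patches, d_vision = 196, 1024  # e.g., a ViT with 14x14 patches
d_llm = 4096                     # LLM embedding size (illustrative)
n_text_tokens = 12

# 1. Vision encoder output (stand-in for ViT/CLIP features)
visual_features = rng.standard_normal((n_patches, d_vision))

# 2. Projection layer maps visual features into the LLM embedding space
W_proj = rng.standard_normal((d_vision, d_llm)) * 0.01
visual_tokens = visual_features @ W_proj  # (196, 4096)

# 3. Concatenate with text token embeddings and feed to the LLM
text_embeddings = rng.standard_normal((n_text_tokens, d_llm))
llm_input = np.concatenate([visual_tokens, text_embeddings], axis=0)

print(f"Visual tokens: {visual_tokens.shape}")
print(f"LLM input sequence: {llm_input.shape}")
```

After projection, image patches are just extra tokens in the LLM's input sequence, which is why the backbone can attend across modalities with no architectural change.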
# Requirements:
# - Python 3.9+
# - transformers>=4.35.0
# - pillow>=10.0.0
"""
Example: Vision-Language Model Inference (Conceptual)
Purpose: Demonstrate multimodal LLM usage patterns
Target: Intermediate to Advanced
Execution time: Varies by model size
Dependencies: GPU recommended for larger models
"""
# Conceptual example of using a vision-language model
# In practice, use models like LLaVA, GPT-4V API, or Claude Vision
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import torch
# Load a vision-language model (example: using a smaller open model)
# model_name = "llava-hf/llava-1.5-7b-hf" # Example model
# For demonstration, showing the general pattern:
print("=== Vision-Language Model Usage Pattern ===")
# 1. Load image
# image = Image.open("example.jpg")
# 2. Prepare prompt with image
prompt = """
<image>
User: What objects can you see in this image? Describe them in detail.
Assistant:
"""
# 3. Process inputs (image + text)
# inputs = processor(images=image, text=prompt, return_tensors="pt")
# 4. Generate response
# with torch.no_grad():
# outputs = model.generate(**inputs, max_new_tokens=200)
# response = processor.decode(outputs[0], skip_special_tokens=True)
print("Pattern:")
print("1. Load vision encoder (ViT/CLIP) + LLM backbone")
print("2. Process image through vision encoder")
print("3. Project visual features to text embedding space")
print("4. Concatenate with text tokens")
print("5. Generate response autoregressively")
# Example capabilities of multimodal LLMs
capabilities = {
"Image Understanding": "Describe scenes, identify objects, read text in images",
"Visual QA": "Answer questions about image content",
"Document Analysis": "Extract information from documents, charts, diagrams",
"Code from Screenshots": "Generate code from UI mockups or diagrams",
"Image Generation": "Some models can generate images (e.g., GPT-4o with DALL-E)"
}
print("\n=== Multimodal LLM Capabilities ===")
for capability, description in capabilities.items():
print(f"- {capability}: {description}")
Trend: Multimodal LLMs are becoming the default as models like GPT-4o and Gemini demonstrate that native multimodal training leads to better cross-modal understanding than post-hoc integration.
4.5 Practical Applications of LLMs
1. Retrieval-Augmented Generation (RAG)
RAG is a technique that retrieves relevant information from an external knowledge base and generates answers based on it.
User Question → Retriever → Knowledge Base (Vector DB) → Relevant Documents → LLM (Generator) → Answer, with the question also passed directly to the LLM.
RAG Implementation Example
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
# Knowledge base text
documents = [
"Artificial Intelligence (AI) is technology that enables computers to think and learn like humans.",
"Machine learning is a branch of AI that learns patterns from data.",
"Deep learning is a machine learning method using multi-layer neural networks.",
"Natural Language Processing (NLP) is technology that enables computers to understand human language.",
"Transformer is a revolutionary neural network architecture introduced in 2017."
]
# Text splitting
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=100,
chunk_overlap=20
)
texts = text_splitter.create_documents(documents)
# Embedding model
embeddings = HuggingFaceEmbeddings(
model_name="sentence-transformers/all-MiniLM-L6-v2"
)
# Create vector store
vectorstore = FAISS.from_documents(texts, embeddings)
print("=== RAG System Construction ===")
print(f"Number of documents: {len(documents)}")
print(f"Number of chunks: {len(texts)}")
print(f"Embedding model: sentence-transformers/all-MiniLM-L6-v2")
# Search test
query = "What is Transformer?"
relevant_docs = vectorstore.similarity_search(query, k=2)
print(f"\nQuery: {query}")
print("\nRelevant documents:")
for i, doc in enumerate(relevant_docs, 1):
print(f"{i}. {doc.page_content}")
# Integrate with LLM (conceptual)
# llm = HuggingFacePipeline(...)
# qa_chain = RetrievalQA.from_chain_type(
# llm=llm,
# retriever=vectorstore.as_retriever(),
# return_source_documents=True
# )
# result = qa_chain({"query": query})
print("\n✅ RAG enables LLM to answer with latest/domain-specific knowledge")
2. Function Calling
Function Calling is a technique that enables LLMs to call external tools and APIs.
import json
# Define available functions
available_functions = {
"get_weather": {
"description": "Get current weather for specified city",
"parameters": {
"city": {"type": "string", "description": "City name"}
}
},
"calculate": {
"description": "Perform mathematical calculation",
"parameters": {
"expression": {"type": "string", "description": "Mathematical expression"}
}
},
"search_web": {
"description": "Search the web",
"parameters": {
"query": {"type": "string", "description": "Search query"}
}
}
}
# Function implementations (dummy)
def get_weather(city):
"""Get weather information (dummy)"""
return f"The weather in {city} is sunny, temperature is 25 degrees."
def calculate(expression):
    """Execute calculation (demo only: eval is unsafe on untrusted input)"""
    try:
        result = eval(expression)
        return f"{expression} = {result}"
    except Exception:
        return "Calculation error"
def search_web(query):
"""Web search (dummy)"""
return f"Search results for '{query}': [Related information...]"
# Function Calling prompt
def create_function_calling_prompt(user_query, functions):
"""Generate prompt for Function Calling"""
functions_desc = json.dumps(functions, ensure_ascii=False, indent=2)
prompt = f"""
You are a helpful assistant with access to the following functions:
{functions_desc}
User query: {user_query}
Based on the query, determine which function to call and with what parameters.
Respond in JSON format:
{{
"function": "function_name",
"parameters": {{...}}
}}
Response:
"""
return prompt
# Test
user_query = "What's the weather in Tokyo?"
prompt = create_function_calling_prompt(user_query, available_functions)
print("=== Function Calling ===")
print(f"User query: {user_query}")
print(f"\nPrompt:\n{prompt}")
# Expected response (actually generated by LLM)
function_call = {
"function": "get_weather",
"parameters": {"city": "Tokyo"}
}
print(f"\nLLM's function selection:")
print(json.dumps(function_call, ensure_ascii=False, indent=2))
# Execute function
if function_call["function"] == "get_weather":
result = get_weather(**function_call["parameters"])
print(f"\nExecution result: {result}")
print("\n✅ LLM can use tools to retrieve and process information")
3. Multi-turn Conversation
Maintain conversation history to achieve context-aware dialogue.
from collections import deque
class ConversationManager:
"""Class to manage conversation history"""
def __init__(self, max_history=10):
self.history = deque(maxlen=max_history)
def add_message(self, role, content):
"""Add message"""
self.history.append({"role": role, "content": content})
def get_prompt(self, system_message=""):
"""Generate prompt from conversation history"""
prompt_parts = []
if system_message:
prompt_parts.append(f"System: {system_message}\n")
for msg in self.history:
prompt_parts.append(f"{msg['role'].capitalize()}: {msg['content']}")
prompt_parts.append("Assistant:")
return "\n".join(prompt_parts)
def clear(self):
"""Clear history"""
self.history.clear()
# Usage example
conv_manager = ConversationManager(max_history=6)
system_msg = "You are a helpful AI assistant."
# Conversation simulation
conversation = [
("User", "Hello!"),
("Assistant", "Hello! How can I help you?"),
("User", "Please teach me about Python learning methods."),
("Assistant", "For Python learning, I recommend first learning basic syntax, then working on actual projects."),
("User", "What projects are good for beginners?"),
]
print("=== Multi-turn Conversation ===\n")
for i, (role, content) in enumerate(conversation):
conv_manager.add_message(role, content)
if role == "User":
print(f"{role}: {content}")
prompt = conv_manager.get_prompt(system_msg)
print(f"\n[Generated Prompt]")
print(prompt)
print("\n" + "="*50 + "\n")
print("✅ Conversation history enables context-aware dialogue")
print(f"✅ History length: {len(conv_manager.history)} messages")
4. Complete Chatbot Implementation
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
# - transformers>=4.30.0
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
class SimpleChatbot:
"""Simple chatbot"""
def __init__(self, model_name="gpt2", max_history=5):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForCausalLM.from_pretrained(model_name)
self.conversation = ConversationManager(max_history=max_history)
self.system_message = "You are a helpful AI assistant."
def generate_response(self, user_input, max_length=100):
"""Generate response to user input"""
# Add to conversation history
self.conversation.add_message("User", user_input)
# Generate prompt
prompt = self.conversation.get_prompt(self.system_message)
# Tokenize
inputs = self.tokenizer(prompt, return_tensors="pt")
# Generate
with torch.no_grad():
outputs = self.model.generate(
inputs['input_ids'],
max_length=len(inputs['input_ids'][0]) + max_length,
temperature=0.7,
top_p=0.9,
do_sample=True,
pad_token_id=self.tokenizer.eos_token_id
)
# Decode
response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
# Remove prompt part
response = response[len(prompt):].strip()
# Add to conversation history
self.conversation.add_message("Assistant", response)
return response
def chat(self):
"""Dialogue loop"""
print("=== Chatbot Started ===")
print("Type 'quit' to exit\n")
while True:
user_input = input("You: ")
if user_input.lower() in ['quit', 'exit', 'bye']:
print("Assistant: Goodbye!")
break
response = self.generate_response(user_input)
print(f"Assistant: {response}\n")
# Instantiate chatbot
chatbot = SimpleChatbot(model_name="gpt2", max_history=5)
# Demo dialogue (in practice use chatbot.chat() for dialogue loop)
demo_inputs = [
"Hello!",
"What is AI?",
"Can you explain more?"
]
print("=== Chatbot Demo ===\n")
for user_input in demo_inputs:
print(f"You: {user_input}")
response = chatbot.generate_response(user_input, max_length=50)
print(f"Assistant: {response}\n")
print("✅ Complete chatbot maintaining conversation history")
print("✅ Context-aware response generation")
print("✅ Extensible design (can add RAG, Function Calling, etc.)")
4.5.1 Multimodal LLMs
Multimodal LLMs extend language models to process multiple modalities including text, images, and audio.
Vision-Language Models
| Model | Provider | Capabilities | Released |
|---|---|---|---|
| GPT-4V | OpenAI | Image understanding, visual QA, OCR | 2023-09 |
| GPT-4o | OpenAI | Native multimodal (text, image, audio) | 2024-05 |
| Claude 3 Vision | Anthropic | Document analysis, chart understanding | 2024-03 |
| Gemini Pro Vision | Google | Multimodal reasoning, video understanding | 2023-12 |
| LLaVA | Open Source | Visual instruction following | 2023-04 |
Architecture Overview
Flow: Image → Vision Encoder (ViT/CLIP) → Projection Layer → LLM (Transformer); Text → Tokenizer → LLM; LLM → Response
Key Insight: Multimodal LLMs typically combine a pre-trained vision encoder (like CLIP or ViT) with a language model, using a projection layer to align visual and textual representations.
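The projection step described above can be sketched in a few lines. The dimensions below (a CLIP-style encoder at 768, a LLaMA-sized LLM at 4096) and the single linear layer are illustrative assumptions; models such as LLaVA build on variants of this idea.

```python
# Hypothetical sketch: project vision-encoder patch embeddings into the
# LLM's token-embedding space, then concatenate with text embeddings.
import torch
import torch.nn as nn

vision_dim = 768   # e.g., CLIP ViT-B/32 output dimension (assumption)
llm_dim = 4096     # e.g., LLaMA hidden dimension (assumption)

projection = nn.Linear(vision_dim, llm_dim)

# 196 patch embeddings from a vision encoder (dummy data)
patch_embeddings = torch.randn(1, 196, vision_dim)
visual_tokens = projection(patch_embeddings)   # shape: (1, 196, llm_dim)

# Text token embeddings (dummy) appended after the visual tokens
text_embeddings = torch.randn(1, 10, llm_dim)
llm_input = torch.cat([visual_tokens, text_embeddings], dim=1)
print(llm_input.shape)  # torch.Size([1, 206, 4096])
```

The LLM then processes this mixed sequence exactly as it would an ordinary token sequence.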
4.5.2 NLP Market Trends (2025-2030)
š Market Statistics
| Metric | Value |
|---|---|
| NLP Market Size (2025) | $39 Billion |
| Projected Size (2030) | $115 Billion |
| CAGR | ~24% |
| Key Drivers | LLMs, Enterprise AI, Automation |
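A quick sanity check of the growth figures above, assuming a 5-year horizon from 2025 to 2030:

```python
# Compound annual growth rate implied by $39B (2025) -> $115B (2030)
cagr = (115 / 39) ** (1 / 5) - 1
print(f"CAGR: {cagr:.1%}")  # CAGR: 24.1%
```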
Growth Factors
- Enterprise Adoption: Customer service, document processing, content generation
- LLM APIs: GPT-4, Claude, and open-source models democratizing access
- Vertical Solutions: Healthcare, legal, finance-specific NLP applications
- Multimodal AI: Vision-language models expanding use cases
4.6 Chapter Summary
What We Learned
GPT Family
- Decoder-only architecture
- Autoregressive text generation
- Evolution from GPT-2 to GPT-4
- Control of generation parameters (temperature, top-k, top-p)
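The generation parameters listed above can be illustrated numerically. The logits below are made up, and the sketch covers only temperature and top-k; top-p works analogously by keeping the smallest set of tokens whose cumulative probability exceeds p.

```python
# Illustrative effect of temperature and top-k on next-token probabilities
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([3.0, 2.0, 1.0, 0.5])  # dummy next-token logits

for t in [0.5, 1.0, 2.0]:
    probs = softmax(logits / t)
    print(f"temperature={t}: {np.round(probs, 3)}")
# Lower temperature sharpens the distribution; higher temperature flattens it.

# Top-k (k=2): keep only the 2 highest logits, renormalize the rest to zero
k = 2
top_idx = np.argsort(logits)[-k:]
masked = np.full_like(logits, -np.inf)
masked[top_idx] = logits[top_idx]
print("top-2:", np.round(softmax(masked), 3))
```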
LLM Training Methods
- Pre-training: Next token prediction on large corpus
- Instruction Tuning: Towards instruction-following models
- RLHF: Alignment through human feedback
- PEFT: Efficient fine-tuning with LoRA, Adapter
Prompt Engineering
- Zero-shot / Few-shot Learning
- Chain-of-Thought Prompting
- Best practices for prompt design
- Prompt management with LangChain
Open-Source LLMs
- LLaMA, Falcon, Japanese LLMs
- Model quantization (8-bit, 4-bit)
- Efficient fine-tuning with QLoRA
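The 8-bit quantization recalled above can be sketched as symmetric absmax quantization. This is a simplified illustration; real libraries such as bitsandbytes add per-block scales and outlier handling.

```python
# Symmetric 8-bit (absmax) quantization: map the largest |weight| to 127
import numpy as np

weights = np.array([0.8, -1.2, 0.05, 2.4, -0.3], dtype=np.float32)

scale = np.abs(weights).max() / 127            # quantization scale
q = np.round(weights / scale).astype(np.int8)  # quantize to int8
deq = q.astype(np.float32) * scale             # dequantize back to float

print("int8:", q)
print("dequantized:", np.round(deq, 3))
print("max error:", np.abs(weights - deq).max())
```

Storing `q` plus one float scale uses roughly a quarter of the memory of float32 weights, at the cost of the small rounding error printed above.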
Practical Applications
- RAG: Generation leveraging external knowledge
- Function Calling: Tool/API integration
- Multi-turn Conversation: Context-aware dialogue
- Complete chatbot implementation
Best Practices for LLM Utilization
| Aspect | Recommendation |
|---|---|
| Model Selection | Balance size and performance according to task |
| Prompt Design | Clear instructions, examples, structure |
| Memory Efficiency | Utilize quantization, PEFT |
| Knowledge Update | Integrate latest information with RAG |
| Safety | Output validation, harmful content filtering |
Next Steps
To deepen your understanding of large language models:
- Try larger models (LLaMA-2 13B/70B)
- Fine-tune with custom data (LoRA)
- Deploy RAG systems to production
- Explore multimodal LLMs (GPT-4V, etc.)
- Learn LLMOps (operations and monitoring)
Exercises
Exercise 1 (Difficulty: easy)
Explain the difference between GPT's Decoder-only architecture and BERT's Encoder-only architecture. What tasks is each suited for?
Sample Answer
Answer:
GPT (Decoder-only):
- Structure: Self-attention mechanism + Causal Masking (doesn't see future)
- Training: Next token prediction (autoregressive)
- Strengths: Text generation, dialogue, creative writing
- Use cases: ChatGPT, code generation, text composition
BERT (Encoder-only):
- Structure: Bidirectional self-attention mechanism (references all tokens)
- Training: Masked Language Modeling (fill-in-the-blank)
- Strengths: Text understanding, classification, extraction
- Use cases: Sentiment analysis, named entity recognition, question answering
Usage Guidelines:
| Task | Recommended Model |
|---|---|
| Text generation | GPT |
| Document classification | BERT |
| Question answering (extractive) | BERT |
| Question answering (generative) | GPT |
| Summarization | GPT (or Encoder-Decoder) |
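The masking difference in the answer above can be made concrete with a small tensor example. This is only a sketch; real implementations apply the mask to attention scores before the softmax.

```python
# Causal vs. bidirectional attention masks (1 = may attend, 0 = hidden)
import torch

seq_len = 4

# Causal mask (GPT): position i attends only to positions <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print("Causal (GPT):")
print(causal_mask)

# Bidirectional (BERT): every position attends to every position
bidirectional_mask = torch.ones(seq_len, seq_len)
print("Bidirectional (BERT):")
print(bidirectional_mask)
```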
Exercise 2 (Difficulty: medium)
Explain why Chain-of-Thought (CoT) Prompting is effective for complex reasoning tasks. Provide concrete examples.
Sample Answer
Answer:
Why CoT is Effective:
- Explicit intermediate reasoning: Verbalizing steps helps LLM organize logical thinking
- Early error detection: Easier to notice mistakes at each step
- Complexity decomposition: Divide difficult problems into smaller sub-problems
- Enhanced In-context Learning: Learn reasoning patterns
Concrete Example: Math Problem
# ❌ Standard prompt
prompt_standard = """
Question: A store had 25 apples. They sold 8 apples in the morning
and 12 apples in the afternoon. How many apples are left?
Answer:
"""
# ✅ CoT prompt
prompt_cot = """
Question: A store had 25 apples. They sold 8 apples in the morning
and 12 apples in the afternoon. How many apples are left?
Let's solve this step by step:
1. The store started with 25 apples.
2. They sold 8 apples in the morning: 25 - 8 = 17 apples remaining.
3. They sold 12 apples in the afternoon: 17 - 12 = 5 apples remaining.
Answer: 5 apples are left.
"""
Demonstrated Effectiveness (from research):
| Task | Standard | CoT | Improvement |
|---|---|---|---|
| GSM8K (Math) | 17.9% | 58.1% | +40.2% |
| SVAMP (Math) | 69.4% | 78.7% | +9.3% |
Exercise 3 (Difficulty: medium)
Explain why LoRA (Low-Rank Adaptation) is parameter-efficient using mathematical formulas.
Sample Answer
Answer:
Principle of LoRA:
Instead of updating the original weight matrix $W \in \mathbb{R}^{d \times k}$, use low-rank decomposition:
$$ W' = W + \Delta W = W + BA $$
- $W$: Original weights (frozen, not trained)
- $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$: Trainable matrices
- $r \ll \min(d, k)$: Rank (e.g., $r=8$)
Parameter Reduction Calculation:
Example: $d = 4096$, $k = 4096$, $r = 8$:
- Original weights: $4096 \times 4096 = 16,777,216$ parameters
- LoRA: $4096 \times 8 + 8 \times 4096 = 65,536$ parameters
- Reduction rate: $\frac{65,536}{16,777,216} \approx 0.39\%$ (99.6% reduction)
Implementation Example:
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: Minimal LoRA layer
Purpose: Demonstrate parameter-efficient low-rank adaptation
Target: Advanced
Execution time: < 1 minute
Dependencies: torch
"""
import torch
import torch.nn as nn
class LoRALayer(nn.Module):
def __init__(self, in_features, out_features, rank=8):
super().__init__()
# Original weights (frozen)
self.weight = nn.Parameter(torch.randn(out_features, in_features))
self.weight.requires_grad = False
# LoRA matrices (trainable)
self.lora_A = nn.Parameter(torch.randn(rank, in_features))
self.lora_B = nn.Parameter(torch.randn(out_features, rank))
self.rank = rank
def forward(self, x):
# W*x + B*A*x
return x @ self.weight.T + x @ self.lora_A.T @ self.lora_B.T
# Example
layer = LoRALayer(4096, 4096, rank=8)
total = sum(p.numel() for p in layer.parameters())
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"Total parameters: {total:,}")
print(f"Trainable: {trainable:,} ({100*trainable/total:.2f}%)")
Output:
Total parameters: 16,842,752
Trainable: 65,536 (0.39%)
Exercise 4 (Difficulty: hard)
Implement a RAG (Retrieval-Augmented Generation) system. Meet the following requirements:
- Retrieval from custom knowledge base
- Scoring of relevant documents
- Answer generation using retrieval results
Sample Answer
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - torch>=2.0.0, <2.3.0
# - transformers>=4.30.0
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
class SimpleRAG:
"""Simple RAG system"""
def __init__(self, knowledge_base, embedding_model="sentence-transformers/all-MiniLM-L6-v2"):
self.knowledge_base = knowledge_base
self.tokenizer = AutoTokenizer.from_pretrained(embedding_model)
self.model = AutoModel.from_pretrained(embedding_model)
# Pre-compute embeddings for knowledge base
self.kb_embeddings = self._embed_documents(knowledge_base)
def _mean_pooling(self, model_output, attention_mask):
"""Mean pooling"""
token_embeddings = model_output[0]
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
def _embed_documents(self, documents):
"""Convert documents to embedding vectors"""
encoded = self.tokenizer(documents, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
model_output = self.model(**encoded)
embeddings = self._mean_pooling(model_output, encoded['attention_mask'])
return embeddings.numpy()
def retrieve(self, query, top_k=3):
"""Retrieve documents relevant to query"""
# Query embedding
query_embedding = self._embed_documents([query])
# Calculate cosine similarity
similarities = cosine_similarity(query_embedding, self.kb_embeddings)[0]
# Get top-k documents
top_indices = np.argsort(similarities)[::-1][:top_k]
results = []
for idx in top_indices:
results.append({
'document': self.knowledge_base[idx],
'score': similarities[idx]
})
return results
def generate_answer(self, query, retrieved_docs):
"""Generate answer using retrieval results (create prompt)"""
context = "\n".join([f"- {doc['document']}" for doc in retrieved_docs])
prompt = f"""
Based on the following context, answer the question.
Context:
{context}
Question: {query}
Answer:
"""
return prompt
# Knowledge base
knowledge_base = [
"Machine learning is a field of AI where computers learn from data.",
"Deep learning is a machine learning method using multi-layer neural networks.",
"Transformer is a revolutionary architecture introduced in 2017.",
"BERT is a text understanding model using bidirectional Transformers.",
"GPT is a Decoder-only Transformer excellent at text generation.",
"Fine-tuning is a method to adapt pre-trained models to specific tasks.",
"LoRA is a parameter-efficient fine-tuning method.",
"RAG is a technique combining retrieval and generation."
]
# Create RAG system
rag = SimpleRAG(knowledge_base)
# Test
query = "Tell me about Transformer"
print("=== RAG System ===")
print(f"Knowledge base: {len(knowledge_base)} items\n")
print(f"Query: {query}\n")
# Retrieval
retrieved = rag.retrieve(query, top_k=3)
print("Retrieval results:")
for i, doc in enumerate(retrieved, 1):
print(f"{i}. [Score: {doc['score']:.3f}] {doc['document']}")
# Prompt for answer generation
prompt = rag.generate_answer(query, retrieved)
print(f"\nGenerated prompt:\n{prompt}")
print("\n✅ RAG system retrieves relevant knowledge and uses it for answer generation")
Sample Output:
=== RAG System ===
Knowledge base: 8 items
Query: Tell me about Transformer
Retrieval results:
1. [Score: 0.712] Transformer is a revolutionary architecture introduced in 2017.
2. [Score: 0.623] BERT is a text understanding model using bidirectional Transformers.
3. [Score: 0.589] GPT is a Decoder-only Transformer excellent at text generation.
Generated prompt:
Based on the following context, answer the question.
Context:
- Transformer is a revolutionary architecture introduced in 2017.
- BERT is a text understanding model using bidirectional Transformers.
- GPT is a Decoder-only Transformer excellent at text generation.
Question: Tell me about Transformer
Answer:
Exercise 5 (Difficulty: hard)
Explain the three steps of RLHF (SFT, Reward Model Training, PPO), their roles, and why they need to be performed in this order.
Sample Answer
Answer:
Step 1: Supervised Fine-Tuning (SFT)
- Role: Fine-tune model with high-quality human-generated dialogue data
- Purpose: Adapt base model to dialogue tasks
- Data: Pairs of (prompt, ideal response)
- Necessity: Pre-trained models are not optimized for dialogue
Step 2: Reward Model Training
- Role: Train a reward model that learns human preferences
- Purpose: Enable automatic evaluation of "good response" vs "bad response"
- Data: Rankings of multiple responses to the same prompt
- Necessity: Provide reward signal needed for reinforcement learning
Loss Function:
$$ \mathcal{L}_{\text{reward}} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma(r(x, y_w) - r(x, y_l)) \right] $$
- $y_w$: Preferred response
- $y_l$: Non-preferred response
- $r(x, y)$: Reward score
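The loss function above can be computed directly on dummy reward scores. The numbers are illustrative, not from a trained reward model:

```python
# Pairwise ranking loss: -log sigmoid(r(x, y_w) - r(x, y_l))
import torch
import torch.nn.functional as F

r_chosen = torch.tensor([1.8, 0.5])    # r(x, y_w): preferred responses
r_rejected = torch.tensor([0.2, 0.9])  # r(x, y_l): non-preferred responses

loss = -F.logsigmoid(r_chosen - r_rejected).mean()
print(f"Reward loss: {loss.item():.4f}")
# The per-pair loss is small when the model already ranks y_w above y_l
# (first pair) and large when it ranks them the wrong way (second pair).
```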
Step 3: PPO (Proximal Policy Optimization)
- Role: Optimize model using reward model
- Purpose: Learn to generate responses aligned with human preferences
- Method: Reinforcement learning (PPO algorithm)
- Constraint: KL regularization to prevent large deviation from original model
Objective Function:
$$ \mathcal{L}^{\text{PPO}} = \mathbb{E}_{x,y \sim \pi_\theta} \left[ r(x, y) - \beta \cdot \text{KL}(\pi_\theta || \pi_{\text{SFT}}) \right] $$
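The objective above can be illustrated per sample with dummy values. Real PPO training adds ratio clipping and per-token KL estimates, so this shows only the shape of the computation:

```python
# Per-sample PPO-style objective: reward minus a KL penalty that keeps
# the policy close to the frozen SFT model (all values are dummies).
import torch

beta = 0.1                                # KL penalty coefficient
reward = torch.tensor([2.0, 1.5])         # r(x, y) from the reward model

# Sequence log-probs under the current policy and the frozen SFT policy
logp_policy = torch.tensor([-1.0, -1.2])
logp_sft = torch.tensor([-1.5, -1.3])

# Simple KL estimate per sample: log pi_theta(y|x) - log pi_SFT(y|x)
kl = logp_policy - logp_sft
objective = (reward - beta * kl).mean()
print(f"PPO objective (to maximize): {objective.item():.3f}")
```

Increasing `beta` trades reward for staying closer to the SFT model, which is what prevents reward hacking via degenerate outputs.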
Why This Order is Necessary:
- SFT First:
- Base model is unfamiliar with dialogue
- Serves as initial policy for reinforcement learning
- Also used for generating data for reward model training
- Reward Model Second:
- Provides reward signal needed for PPO
- Learns from human ranking data
- Evaluates responses generated by SFT model
- PPO Last:
- Cannot optimize without reward model
- Uses SFT model as initial policy
- KL regularization prevents large deviation from SFT
Overall Flow:
Base Model → SFT → Reward Model Training → PPO → Aligned Model (e.g., ChatGPT)
Effects:
| Stage | Performance Metric |
|---|---|
| Base Model | Dialogue quality: Low |
| SFT | Dialogue quality: Medium (follows instructions) |
| RLHF (PPO) | Dialogue quality: High (aligned with human preferences) |
References
- Vaswani, A., et al. (2017). Attention is All You Need. NeurIPS.
- Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2). OpenAI.
- Brown, T., et al. (2020). Language Models are Few-Shot Learners (GPT-3). NeurIPS.
- Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.
- Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. ICLR.
- Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS.
- Touvron, H., et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv.
- Touvron, H., et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv.
- Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
- Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv.