This chapter covers Large Language Models (LLMs). You will learn to compare major LLM architectures such as GPT, LLaMA, Claude, and Gemini, and to apply prompt engineering techniques such as Zero-shot, Few-shot, and Chain-of-Thought.
Learning Objectives
By reading this chapter, you will be able to:
- ✅ Understand scaling laws and performance characteristics of Large Language Models (LLMs)
- ✅ Compare major LLM architectures including GPT, LLaMA, Claude, and Gemini
- ✅ Implement prompt engineering techniques such as Zero-shot, Few-shot, and Chain-of-Thought
- ✅ Understand the mechanisms and effective utilization of In-Context Learning
- ✅ Understand model improvement methods using human feedback through RLHF
- ✅ Build practical LLM applications and integrate APIs
5.1 LLM Scaling Laws
What are Large Language Models
Large Language Models (LLMs) are massive Transformer-based models pre-trained on enormous text datasets. With tens to hundreds of billions of parameters, they can execute various natural language processing tasks with high accuracy.
[Figure: Evolution of LLM scale. Early models (~100M params, 2018-2019): BERT Base (110M), GPT-2 (117M-1.5B). Medium models (2019-2020): T5 (11B), GPT-3 (175B). Large models (100B+ params, 2020-present): GPT-4 (~1.7T, estimated), Claude 3 (undisclosed), Gemini Ultra (undisclosed).]
Scaling Laws
The scaling laws published by OpenAI in 2020 quantify how model performance scales with respect to the number of parameters, data amount, and compute.
Basic Scaling Laws
The model loss $L$ is determined by three factors:
$$ L(N, D, C) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + \left(\frac{C_c}{C}\right)^{\alpha_C} $$
Where:
- $N$: Number of model parameters (non-embedding)
- $D$: Number of training data tokens
- $C$: Compute (FLOPs)
- $N_c, D_c, C_c$: Scaling constants
- $\alpha_N \approx 0.076, \alpha_D \approx 0.095, \alpha_C \approx 0.050$
| Scaling Factor | Impact | Practical Meaning |
|---|---|---|
| Model Size (N) | 10× reduces loss to ~0.95× | Larger models perform better |
| Data Amount (D) | 10× reduces loss to ~0.93× | Data importance is highest |
| Compute (C) | 10× reduces loss to ~0.97× | Efficient computation is key |
Chinchilla Paper Discovery (2022): DeepMind's research revealed that many LLMs are "over-parameterized" and that training smaller models on more data with the same compute budget is more efficient. The optimal ratio is data tokens ≈ 20 × parameters.
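To make the 20-to-1 rule concrete, the short sketch below combines it with the common approximation C ≈ 6ND to back out a compute-optimal model size from a training budget. The function name and the example budget (roughly the GPT-3 training compute) are illustrative choices for this sketch, not values from the Chinchilla paper.
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Solve C ~= 6 * N * D together with D ~= 20 * N for the compute-optimal N and D."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a budget of ~3.1e23 FLOPs (roughly the GPT-3 training compute)
N_opt, D_opt = chinchilla_optimal(3.1e23)
print(f"Compute-optimal allocation: {N_opt/1e9:.0f}B parameters, {D_opt/1e9:.0f}B tokens")
Under this simplified rule, the GPT-3 budget would favor a model of roughly 50B parameters trained on about 1T tokens rather than 175B parameters on 300B tokens.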
Emergent Abilities
Once LLMs exceed a certain scale, capabilities that were not explicitly trained suddenly emerge. These are called emergent abilities.
[Figure: Emergent abilities by scale — at smaller scales, abilities such as simple reasoning appear; around ~100B parameters, Chain-of-Thought, complex reasoning, and instruction following emerge.]
| Emergent Ability | Emergence Scale | Description |
|---|---|---|
| In-Context Learning | ~10B+ | Learning tasks from examples |
| Chain-of-Thought | ~100B+ | Step-by-step reasoning capability |
| Instruction Following | ~10B+ (after RLHF) | Understanding natural language instructions |
| Multilingual Capability | ~10B+ | Transfer to unlearned languages |
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
import numpy as np
import matplotlib.pyplot as plt
def scaling_law_loss(N, D, C, N_c=8.8e13, D_c=5.4e13, C_c=1.3e13,
alpha_N=0.076, alpha_D=0.095, alpha_C=0.050):
"""
Calculate loss based on scaling laws
Args:
N: Number of parameters
D: Number of data tokens
C: Compute (FLOPs)
Others: Scaling constants
Returns:
Predicted loss value
"""
loss_N = (N_c / N) ** alpha_N
loss_D = (D_c / D) ** alpha_D
loss_C = (C_c / C) ** alpha_C
return loss_N + loss_D + loss_C
# Visualize the impact of model size
param_counts = np.logspace(6, 12, 50) # 1M to 1T parameters
data_tokens = 1e12 # 1T tokens fixed
compute = 1e21 # fixed
losses = [scaling_law_loss(N, data_tokens, compute) for N in param_counts]
plt.figure(figsize=(12, 5))
# Parameters vs Loss
plt.subplot(1, 2, 1)
plt.loglog(param_counts, losses, linewidth=2, color='#7b2cbf')
plt.xlabel('Number of Parameters', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Scaling Law: Model Size vs Performance', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
# Plot major models
models = {
'GPT-2': (1.5e9, scaling_law_loss(1.5e9, data_tokens, compute)),
'GPT-3': (175e9, scaling_law_loss(175e9, data_tokens, compute)),
'GPT-4 (estimated)': (1.7e12, scaling_law_loss(1.7e12, data_tokens, compute)),
}
for name, (params, loss) in models.items():
plt.scatter(params, loss, s=100, zorder=5)
plt.annotate(name, (params, loss), xytext=(10, 10),
textcoords='offset points', fontsize=9)
# Impact of data amount
plt.subplot(1, 2, 2)
data_amounts = np.logspace(9, 13, 50)
model_size = 100e9 # 100B parameters fixed
losses_data = [scaling_law_loss(model_size, D, compute) for D in data_amounts]
plt.loglog(data_amounts, losses_data, linewidth=2, color='#3182ce')
plt.xlabel('Training Data Tokens', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Scaling Law: Data Amount vs Performance', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
# Mark Chinchilla optimal point
optimal_data = 20 * model_size # 20x rule
optimal_loss = scaling_law_loss(model_size, optimal_data, compute)
plt.scatter(optimal_data, optimal_loss, s=150, c='red', marker='*',
zorder=5, label='Chinchilla Optimal Point')
plt.legend()
plt.tight_layout()
plt.show()
# Practical example: Estimating required computational resources
def estimate_training_cost(params, tokens, efficiency=0.5):
"""
Estimate training cost (FLOPs)
Args:
params: Number of parameters
tokens: Number of training tokens
efficiency: Hardware efficiency (0-1)
Returns:
Required FLOPs, GPU time estimate
"""
# Approximately 6 × params FLOPs per token
total_flops = 6 * params * tokens
# A100 GPU: ~312 TFLOPS (FP16)
a100_flops = 312e12 * efficiency
gpu_hours = total_flops / a100_flops / 3600
return total_flops, gpu_hours
# Cost estimation for GPT-3 class model
params_gpt3 = 175e9
tokens_gpt3 = 300e9
total_flops, gpu_hours = estimate_training_cost(params_gpt3, tokens_gpt3)
print(f"\n=== GPT-3 Scale Model Training Cost Estimate ===")
print(f"Parameters: {params_gpt3/1e9:.1f}B")
print(f"Training Tokens: {tokens_gpt3/1e9:.1f}B")
print(f"Total Compute: {total_flops:.2e} FLOPs")
print(f"A100 GPU Hours: {gpu_hours:,.0f} hours ({gpu_hours/24:,.0f} days)")
print(f"GPUs needed (30-day completion): {int(np.ceil(gpu_hours / (24 * 30)))} units")
5.2 Representative LLMs
GPT Series (OpenAI)
GPT-3 (2020)
GPT-3 (Generative Pre-trained Transformer 3) is an autoregressive language model with 175B parameters that demonstrated the effectiveness of Few-shot Learning.
| Feature | Details |
|---|---|
| Parameters | 175B (largest version) |
| Architecture | Decoder-only Transformer, 96 layers, 12,288 dimensions |
| Training Data | Common Crawl, WebText, Books, Wikipedia, etc. ~300B tokens |
| Context Length | 2,048 tokens |
| Innovation | Demonstrated Few-shot Learning, prompt-based versatility |
GPT-4 (2023)
GPT-4 is a state-of-the-art multimodal (text + image) model. Details are not disclosed, but it's estimated to have 1.7 trillion parameters.
| Capability | GPT-3.5 | GPT-4 |
|---|---|---|
| US Bar Exam | Bottom 10% | Top 10% |
| Math Olympiad | Failed | Top 500 equivalent |
| Coding Ability | Basic implementation | Complex algorithm design |
| Multimodal | Text only | Text + image understanding |
LLaMA Series (Meta)
LLaMA (Large Language Model Meta AI) is an efficient LLM family released as open source by Meta.
LLaMA 2 Features
- Model Sizes: 7B, 13B, 70B in 3 variations
- Training Data: 2 trillion tokens (public data only)
- Context Length: 4,096 tokens
- License: Commercial use allowed (with conditions)
- Optimization: Efficient design based on Chinchilla scaling laws
LLaMA 2 incorporates several architectural improvements (the first of these, RMSNorm, is sketched in code below):
- RMSNorm pre-normalization → improved training stability
- SwiGLU activation (adopted from PaLM) → better performance
- Rotary Positional Embedding (RoPE) → long-context support
- Grouped Query Attention (GQA) → faster inference
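As a concrete illustration of one of these components, the snippet below is a minimal PyTorch sketch of RMSNorm, the pre-normalization LLaMA uses in place of LayerNorm: it rescales by the root-mean-square of the features and applies a learned gain, with no mean subtraction and no bias. This is a simplified sketch, not Meta's reference implementation.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm: y = weight * x / rms(x), with no mean centering or bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Root mean square over the feature (last) dimension
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

# Usage: normalize a batch of token representations
x = torch.randn(2, 16, 512)       # [batch, seq_len, hidden_dim]
print(RMSNorm(dim=512)(x).shape)  # torch.Size([2, 16, 512])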
Claude (Anthropic)
Claude is a safety-focused LLM developed by Anthropic using the "Constitutional AI" approach.
| Model | Features | Context Length |
|---|---|---|
| Claude 3 Opus | Highest performance, complex reasoning | 200K tokens |
| Claude 3 Sonnet | Balanced, fast response | 200K tokens |
| Claude 3 Haiku | Lightweight, fast, cost-efficient | 200K tokens |
Constitutional AI: In addition to human feedback (RLHF), this training method uses a "constitution" where AI performs self-criticism and self-improvement. It reduces harmful outputs and generates safer, more helpful responses.
Gemini (Google)
Gemini is a multimodal-native LLM developed by Google that processes text, images, audio, and video in an integrated manner.
- Gemini Ultra: Highest performance, complex task support
- Gemini Pro: Optimized for general purposes
- Gemini Nano: Lightweight version that runs on-device
# Major LLM comparison implementation example (via API)
# Requirements (assumed for this example):
# - openai<1.0 (uses the legacy openai.ChatCompletion interface)
# - anthropic
# - google-generativeai
import os
from typing import List, Dict
import time
class LLMComparison:
"""
Compare multiple LLM APIs using a unified interface
"""
def __init__(self):
"""Get each API key from environment variables"""
self.openai_key = os.getenv('OPENAI_API_KEY')
self.anthropic_key = os.getenv('ANTHROPIC_API_KEY')
self.google_key = os.getenv('GOOGLE_API_KEY')
def query_gpt4(self, prompt: str, max_tokens: int = 500) -> Dict:
"""Send query to GPT-4"""
try:
import openai
openai.api_key = self.openai_key
start_time = time.time()
response = openai.ChatCompletion.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens,
temperature=0.7
)
latency = time.time() - start_time
return {
'model': 'GPT-4',
'response': response.choices[0].message.content,
'tokens': response.usage.total_tokens,
'latency': latency
}
except Exception as e:
return {'model': 'GPT-4', 'error': str(e)}
def query_claude(self, prompt: str, max_tokens: int = 500) -> Dict:
"""Send query to Claude 3"""
try:
import anthropic
client = anthropic.Anthropic(api_key=self.anthropic_key)
start_time = time.time()
message = client.messages.create(
model="claude-3-opus-20240229",
max_tokens=max_tokens,
messages=[{"role": "user", "content": prompt}]
)
latency = time.time() - start_time
return {
'model': 'Claude 3 Opus',
'response': message.content[0].text,
'tokens': message.usage.input_tokens + message.usage.output_tokens,
'latency': latency
}
except Exception as e:
return {'model': 'Claude 3 Opus', 'error': str(e)}
def query_gemini(self, prompt: str, max_tokens: int = 500) -> Dict:
"""Send query to Gemini Pro"""
try:
import google.generativeai as genai
genai.configure(api_key=self.google_key)
model = genai.GenerativeModel('gemini-pro')
start_time = time.time()
response = model.generate_content(prompt)
latency = time.time() - start_time
return {
'model': 'Gemini Pro',
'response': response.text,
'tokens': 'N/A', # Gemini API doesn't return detailed token count
'latency': latency
}
except Exception as e:
return {'model': 'Gemini Pro', 'error': str(e)}
def compare_all(self, prompt: str, max_tokens: int = 500) -> List[Dict]:
"""
Send the same prompt to all models for comparison
Args:
prompt: Input prompt
max_tokens: Maximum generation tokens
Returns:
List of results from each model
"""
results = []
print(f"Prompt: {prompt}\n")
print("=" * 80)
# GPT-4
print("Querying GPT-4...")
gpt4_result = self.query_gpt4(prompt, max_tokens)
results.append(gpt4_result)
self._print_result(gpt4_result)
# Claude 3
print("\nQuerying Claude 3...")
claude_result = self.query_claude(prompt, max_tokens)
results.append(claude_result)
self._print_result(claude_result)
# Gemini
print("\nQuerying Gemini Pro...")
gemini_result = self.query_gemini(prompt, max_tokens)
results.append(gemini_result)
self._print_result(gemini_result)
return results
def _print_result(self, result: Dict):
"""Format and print result"""
if 'error' in result:
print(f"❌ {result['model']}: Error - {result['error']}")
else:
print(f"✅ {result['model']}:")
print(f" Response: {result['response'][:200]}...")
print(f" Tokens: {result['tokens']}")
print(f" Latency: {result['latency']:.2f}s")
# Usage example
if __name__ == "__main__":
# Note: Requires each API key for execution
comparator = LLMComparison()
# Simple comparison test
test_prompt = """
Solve the following problem step by step:
Problem: A company's revenue increases by 20% each year.
If the current revenue is 100 million yen, what will the revenue be after 5 years?
Show the calculation process.
"""
results = comparator.compare_all(test_prompt, max_tokens=300)
# Performance comparison
print("\n" + "=" * 80)
print("Performance Comparison:")
for result in results:
if 'error' not in result:
print(f"{result['model']:20} - Latency: {result['latency']:.2f}s")
5.3 Prompt Engineering
What is Prompt Engineering
Prompt engineering is the technique of designing inputs to elicit desired outputs from LLMs. With appropriate prompts, model performance can be dramatically improved without retraining.
Zero-shot Learning
Zero-shot Learning is a method of executing tasks with instructions only, without examples. It became possible through emergent abilities of large-scale models.
class ZeroShotPromptEngine:
"""
Design and execute Zero-shot prompts
"""
@staticmethod
def sentiment_analysis(text: str) -> str:
"""Zero-shot prompt for sentiment analysis"""
prompt = f"""
Classify the sentiment of the following text as "Positive", "Negative", or "Neutral".
Return only the classification result.
Text: {text}
Classification:"""
return prompt
@staticmethod
def text_summarization(text: str, max_words: int = 50) -> str:
"""Zero-shot prompt for summarization"""
prompt = f"""
Summarize the following text in {max_words} words or less.
Concisely capture the key points.
Text:
{text}
Summary:"""
return prompt
@staticmethod
def question_answering(context: str, question: str) -> str:
"""Zero-shot prompt for question answering"""
prompt = f"""
Answer the question based on the following context.
If the information is not in the context, answer "Insufficient information".
Context:
{context}
Question: {question}
Answer:"""
return prompt
@staticmethod
def language_translation(text: str, target_lang: str = "English") -> str:
"""Zero-shot prompt for translation"""
prompt = f"""
Translate the following text into {target_lang}.
Aim for natural and accurate translation.
Text: {text}
Translation:"""
return prompt
# Usage example
engine = ZeroShotPromptEngine()
# Sentiment analysis
text1 = "This product exceeded my expectations. I'm really glad I purchased it."
prompt1 = engine.sentiment_analysis(text1)
print("=== Zero-shot Sentiment Analysis ===")
print(prompt1)
print()
# Summarization
text2 = """
The development of artificial intelligence (AI) is bringing innovation to various fields.
Particularly in natural language processing, the emergence of large language models (LLMs)
has achieved near-human performance in tasks such as machine translation, text generation, and question answering.
In the future, AI will be utilized in even more areas including healthcare, education, and business.
"""
prompt2 = engine.text_summarization(text2, max_words=30)
print("=== Zero-shot Summarization ===")
print(prompt2)
Few-shot Learning
Few-shot Learning is a method where the model learns task patterns by presenting a small number of examples (typically 1-10).
class FewShotPromptEngine:
"""
Design and execute Few-shot prompts
"""
@staticmethod
def sentiment_analysis(text: str, num_examples: int = 3) -> str:
"""Few-shot prompt for sentiment analysis"""
# Example data
examples = [
("This movie was wonderful! I was moved.", "Positive"),
("The food was cold and the service was bad.", "Negative"),
("It's an ordinary product. Nothing particularly good or bad.", "Neutral"),
]
# Build prompt
prompt = "Classify the sentiment of the text based on the following examples.\n\n"
for i, (example_text, label) in enumerate(examples[:num_examples], 1):
prompt += f"Example {i}:\nText: {example_text}\nClassification: {label}\n\n"
prompt += f"Text: {text}\nClassification:"
return prompt
@staticmethod
def entity_extraction(text: str) -> str:
"""Few-shot prompt for named entity extraction"""
prompt = """
Extract person names, organization names, and locations from the text based on the following examples.
Example 1:
Text: Taro Tanaka is researching machine learning at the University of Tokyo.
Extraction: Person=Taro Tanaka, Organization=University of Tokyo, Location=None
Example 2:
Text: Apple CEO Tim Cook gave a speech in Silicon Valley.
Extraction: Person=Tim Cook, Organization=Apple, Location=Silicon Valley
Example 3:
Text: Google and Microsoft announced new AI technologies.
Extraction: Person=None, Organization=Google, Microsoft, Location=None
Text: {text}
Extraction:"""
return prompt.format(text=text)
@staticmethod
def code_generation(task_description: str) -> str:
"""Few-shot prompt for code generation"""
prompt = """
Generate a Python function to perform the task based on the following examples.
Example 1:
Task: Double list elements
Code:
def double_elements(lst):
return [x * 2 for x in lst]
Example 2:
Task: Reverse a string
Code:
def reverse_string(s):
return s[::-1]
Example 3:
Task: Calculate average of a list
Code:
def calculate_average(lst):
return sum(lst) / len(lst) if lst else 0
Task: {task}
Code:"""
return prompt.format(task=task_description)
@staticmethod
def analogical_reasoning(question: str) -> str:
"""Few-shot prompt for analogical reasoning"""
prompt = """
Understand the following patterns and answer the question.
Example 1: Tokyo:Japan = Paris:?
Answer: France
Reason: Just as Tokyo is the capital of Japan, Paris is the capital of France.
Example 2: Doctor:Hospital = Teacher:?
Answer: School
Reason: Just as doctors work at hospitals, teachers work at schools.
Example 3: Dog:Mammal = Hawk:?
Answer: Bird
Reason: Just as dogs belong to mammals, hawks belong to birds.
Question: {question}
Answer:"""
return prompt.format(question=question)
# Usage example
few_shot_engine = FewShotPromptEngine()
# Few-shot sentiment analysis
test_text = "It wasn't as good as I expected, but it's okay."
prompt = few_shot_engine.sentiment_analysis(test_text)
print("=== Few-shot Sentiment Analysis ===")
print(prompt)
print("\n" + "="*80 + "\n")
# Few-shot named entity extraction
test_entity = "An NHK reporter interviewed Hanako Yamada in New York."
prompt = few_shot_engine.entity_extraction(test_entity)
print("=== Few-shot Named Entity Extraction ===")
print(prompt)
print("\n" + "="*80 + "\n")
# Few-shot analogical reasoning
test_analogy = "Book:Author = Movie:?"
prompt = few_shot_engine.analogical_reasoning(test_analogy)
print("=== Few-shot Analogical Reasoning ===")
print(prompt)
Chain-of-Thought (CoT) Prompting
Chain-of-Thought is a method that improves accuracy on complex problems by having the model generate step-by-step reasoning processes.
$$ \text{Accuracy}_{\text{CoT}} \approx \text{Accuracy}_{\text{standard}} + \Delta_{\text{reasoning}} $$
Where $\Delta_{\text{reasoning}}$ is the accuracy improvement from reasoning, which becomes larger for more complex problems.
class ChainOfThoughtEngine:
"""
Chain-of-Thought (CoT) prompt engineering
"""
@staticmethod
def math_problem_basic(problem: str) -> str:
"""Basic CoT for math problems"""
prompt = f"""
Solve the following problem step by step.
Explain what you are calculating at each step.
Problem: {problem}
Solution:
Step 1:"""
return prompt
@staticmethod
def math_problem_with_examples(problem: str) -> str:
"""Few-shot CoT example"""
prompt = """
Solve the problem step by step based on the following examples.
Example 1:
Problem: There are 15 apples and 23 oranges. How many fruits are there in total?
Solution:
Step 1: Check number of apples → 15
Step 2: Check number of oranges → 23
Step 3: Calculate total → 15 + 23 = 38
Answer: 38 fruits
Example 2:
Problem: I bought 3 books at 500 yen each and 5 pens at 120 yen each. What is the total amount?
Solution:
Step 1: Calculate total for books → 500 yen × 3 books = 1,500 yen
Step 2: Calculate total for pens → 120 yen × 5 pens = 600 yen
Step 3: Calculate grand total → 1,500 yen + 600 yen = 2,100 yen
Answer: 2,100 yen
Problem: {problem}
Solution:
Step 1:"""
return prompt.format(problem=problem)
@staticmethod
def logical_reasoning(scenario: str, question: str) -> str:
"""CoT for logical reasoning"""
prompt = f"""
Analyze the following situation step by step and reach a logical conclusion.
Situation: {scenario}
Question: {question}
Analysis:
Observation 1:"""
return prompt
@staticmethod
def self_consistency_cot(problem: str, num_paths: int = 3) -> str:
"""
Self-Consistency CoT: Generate multiple reasoning paths
and select the most consistent answer
"""
prompt = f"""
Solve the following problem using {num_paths} different approaches.
Reason step by step for each approach, then choose the most certain answer.
Problem: {problem}
Approach 1:
"""
return prompt
# Practical example: Complex math problem
cot_engine = ChainOfThoughtEngine()
problem1 = """
A store's revenue was 1 million yen in year 1.
Year 2 was 20% increase from previous year, year 3 was 15% decrease from previous year, year 4 was 25% increase from previous year.
What is the revenue in year 4 in ten thousand yen?
"""
print("=== Chain-of-Thought: Math Problem ===")
prompt1 = cot_engine.math_problem_with_examples(problem1)
print(prompt1)
print("\n" + "="*80 + "\n")
# Logical reasoning example
scenario = """
There are conference rooms A, B, and C.
- Mr. Tanaka is not in conference room A
- Mr. Sato is in conference room B
- No one is in conference room C
- Besides Mr. Tanaka and Mr. Sato, there is Mr. Yamada
"""
question = "Which conference room is Mr. Yamada in?"
print("=== Chain-of-Thought: Logical Reasoning ===")
prompt2 = cot_engine.logical_reasoning(scenario, question)
print(prompt2)
print("\n" + "="*80 + "\n")
# Self-Consistency CoT
problem2 = """
There are 5 red balls and 3 blue balls in a bag.
When drawing 2 balls simultaneously, what is the probability that both are red?
"""
print("=== Self-Consistency CoT ===")
prompt3 = cot_engine.self_consistency_cot(problem2, num_paths=3)
print(prompt3)
Effect of CoT: In experiments with Google's PaLM model, the accuracy rate for arithmetic problems improved from 34% with standard prompts to 79% with CoT prompts. Particularly large improvements were seen in problems requiring multi-step reasoning.
Prompt Template Design
from typing import Dict, List, Optional
from dataclasses import dataclass
@dataclass
class PromptTemplate:
"""Structured management of prompt templates"""
name: str
instruction: str
examples: Optional[List[Dict[str, str]]] = None
output_format: Optional[str] = None
def render(self, **kwargs) -> str:
"""Convert template to actual prompt"""
prompt = f"{self.instruction}\n\n"
# Add Few-shot examples
if self.examples:
prompt += "Examples:\n"
for i, example in enumerate(self.examples, 1):
prompt += f"\nExample {i}:\n"
for key, value in example.items():
prompt += f"{key}: {value}\n"
# Add output format
if self.output_format:
prompt += f"\nOutput format:\n{self.output_format}\n"
# Insert variables
prompt += "\nInput:\n"
for key, value in kwargs.items():
prompt += f"{key}: {value}\n"
prompt += "\nOutput:"
return prompt
class PromptLibrary:
"""Library of reusable prompt templates"""
@staticmethod
def get_classification_template() -> PromptTemplate:
"""Template for classification tasks"""
return PromptTemplate(
name="classification",
instruction="Classify the following text into the specified category.",
examples=[
{
"Text": "A new smartphone has been released.",
"Category": "Technology"
},
{
"Text": "Stock prices are soaring.",
"Category": "Business"
}
],
output_format="Return only the category name."
)
@staticmethod
def get_extraction_template() -> PromptTemplate:
"""Template for information extraction"""
return PromptTemplate(
name="extraction",
instruction="Extract the specified information from the text.",
output_format="Return in JSON format: {\"item1\": \"value1\", \"item2\": \"value2\"}"
)
@staticmethod
def get_generation_template() -> PromptTemplate:
"""Template for generation tasks"""
return PromptTemplate(
name="generation",
instruction="Generate creative content based on the following conditions.",
output_format="Output in natural, readable text."
)
@staticmethod
def get_reasoning_template() -> PromptTemplate:
"""Template for reasoning tasks"""
return PromptTemplate(
name="reasoning",
instruction="""
Analyze the following problem step by step and solve it logically.
Clearly show the reasoning process at each step.
""",
output_format="""
Step 1: [Initial analysis]
Step 2: [Next reasoning]
...
Conclusion: [Final answer]
"""
)
# Usage example
library = PromptLibrary()
# Classification task
classification_template = library.get_classification_template()
prompt = classification_template.render(
Text="AI research is accelerating.",
Category="Technology, Business, Politics, Sports, Entertainment"
)
print("=== Classification Prompt ===")
print(prompt)
print("\n" + "="*80 + "\n")
# Information extraction task
extraction_template = library.get_extraction_template()
prompt = extraction_template.render(
Text="Taro Tanaka (35 years old) is the CTO of ABC Corporation and lives in Tokyo.",
Extraction_items="Name, Age, Company, Position, Location"
)
print("=== Extraction Prompt ===")
print(prompt)
print("\n" + "="*80 + "\n")
# Reasoning task
reasoning_template = library.get_reasoning_template()
prompt = reasoning_template.render(
Problem="There are three boxes A, B, C, and only one contains treasure. Box A says 'The treasure is here', Box B says 'The treasure is not in A', Box C says 'The treasure is in B'. If only one is telling the truth, which box has the treasure?"
)
print("=== Reasoning Prompt ===")
print(prompt)
5.4 In-Context Learning
Mechanism of In-Context Learning
In-Context Learning (ICL) is the ability of LLMs to learn directly from examples in the prompt and execute new tasks without parameter updates.
[Figure: Mechanism of In-Context Learning — the few-shot examples and the query are processed jointly by self-attention, the model learns the pattern within the context, and then generates the output.]
Why ICL Works
Recent research (2023) suggests that ICL operates through the following mechanisms:
- Activation of latent concepts: Knowledge acquired during pre-training is activated by examples
- Formation of task vectors: Patterns extracted from examples are retained as internal representations
- Analogy-based reasoning: New inputs are processed by analogy with examples
$$ P(y|x, \text{examples}) \approx \sum_{i=1}^{k} \alpha_i \cdot P(y|x, \text{example}_i) $$
Where $\alpha_i$ is the relevance weight of each example.
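The toy numpy sketch below illustrates this weighted-mixture view: the weights α_i are computed as a softmax over cosine similarities between a query embedding and each example embedding, and the per-example predictions are made-up numbers chosen only to show the mixing. This is a conceptual illustration of the formula, not how a Transformer actually computes attention.
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings for 3 in-context examples and one query (dimension 8)
example_embs = rng.normal(size=(3, 8))
query_emb = rng.normal(size=8)

# Per-example predicted distributions over 2 labels (illustrative values)
per_example_pred = np.array([[0.9, 0.1],   # example 1 suggests label 0
                             [0.2, 0.8],   # example 2 suggests label 1
                             [0.6, 0.4]])  # example 3 is less decisive

# Relevance weights alpha_i: softmax over cosine similarity(query, example_i)
sims = example_embs @ query_emb / (
    np.linalg.norm(example_embs, axis=1) * np.linalg.norm(query_emb))
alphas = np.exp(sims) / np.exp(sims).sum()

# P(y|x, examples) ~= sum_i alpha_i * P(y|x, example_i)
mixture = alphas @ per_example_pred
print("alpha_i:", np.round(alphas, 3))
print("mixture prediction:", np.round(mixture, 3))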
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
import numpy as np
from typing import List, Tuple, Dict
class InContextLearningSimulator:
"""
Simulate the operation of In-Context Learning
(Simplified mechanism reproduction)
"""
def __init__(self, embedding_dim: int = 128):
"""
Args:
embedding_dim: Dimension of embedding vectors
"""
self.embedding_dim = embedding_dim
self.task_vector = None
def create_example_embedding(self, input_text: str, output_text: str) -> np.ndarray:
"""
Create example embedding from input-output pair
(Actually processed by Transformer, but simplified here)
"""
# Simplified: Convert string to embedding via hash
input_hash = hash(input_text) % 10000
output_hash = hash(output_text) % 10000
np.random.seed(input_hash)
input_emb = np.random.randn(self.embedding_dim)
np.random.seed(output_hash)
output_emb = np.random.randn(self.embedding_dim)
# Task vector: difference between output and input
task_emb = output_emb - input_emb
return task_emb / (np.linalg.norm(task_emb) + 1e-8)
def learn_from_examples(self, examples: List[Tuple[str, str]]) -> None:
"""
Learn task vector from Few-shot examples
Args:
examples: [(input1, output1), (input2, output2), ...]
"""
task_vectors = []
for input_text, output_text in examples:
task_vec = self.create_example_embedding(input_text, output_text)
task_vectors.append(task_vec)
# Acquire task representation as average of multiple examples
self.task_vector = np.mean(task_vectors, axis=0)
self.task_vector /= (np.linalg.norm(self.task_vector) + 1e-8)
def predict(self, query: str, candidates: List[str]) -> Dict[str, float]:
"""
Predict based on learned task vector
Args:
query: Input query
candidates: List of candidate outputs
Returns:
Score dictionary for each candidate
"""
if self.task_vector is None:
raise ValueError("Call learn_from_examples() first")
# Query embedding
query_hash = hash(query) % 10000
np.random.seed(query_hash)
query_emb = np.random.randn(self.embedding_dim)
query_emb /= (np.linalg.norm(query_emb) + 1e-8)
# Calculate score for each candidate
scores = {}
for candidate in candidates:
candidate_hash = hash(candidate) % 10000
np.random.seed(candidate_hash)
candidate_emb = np.random.randn(self.embedding_dim)
candidate_emb /= (np.linalg.norm(candidate_emb) + 1e-8)
# Consistency with task vector
predicted_output = query_emb + self.task_vector
similarity = np.dot(predicted_output, candidate_emb)
scores[candidate] = float(similarity)
# Normalize scores
total = sum(np.exp(s) for s in scores.values())
scores = {k: np.exp(v)/total for k, v in scores.items()}
return scores
# Usage example: Sentiment analysis task
print("=== In-Context Learning Simulation ===\n")
simulator = InContextLearningSimulator(embedding_dim=128)
# Few-shot examples
examples = [
("This movie was wonderful!", "Positive"),
("Worst experience ever. Never going back.", "Negative"),
("It's ordinary. Nothing particularly memorable.", "Neutral"),
("Quality exceeded expectations, very satisfied!", "Positive"),
("Disappointed with terrible service.", "Negative"),
]
# Learn task vector
simulator.learn_from_examples(examples)
print("✅ Learning from Few-shot examples complete\n")
# Predict on new inputs
test_queries = [
"Excellent product. Highly recommend.",
"It was disappointing.",
"Mediocre performance."
]
candidates = ["Positive", "Negative", "Neutral"]
for query in test_queries:
print(f"Query: {query}")
scores = simulator.predict(query, candidates)
# Sort by score
sorted_scores = sorted(scores.items(), key=lambda x: x[1], reverse=True)
for label, score in sorted_scores:
print(f" {label}: {score:.4f} {'★' * int(score * 10)}")
print(f" → Prediction: {sorted_scores[0][0]}\n")
Effective Utilization of ICL
Example Selection Strategies
| Strategy | Method | Advantage |
|---|---|---|
| Random Selection | Randomly select from training data | Simple, less bias |
| Similarity-based | Select examples similar to the query (see the sketch below the table) | Higher task performance |
| Diversity-focused | Include diverse examples | Improved generalization |
| Difficulty Adjustment | Order from easy to hard | Better learning efficiency |
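The similarity-based strategy from the table can be sketched as follows. The embed() function here is a hashing-based placeholder (mirroring the simplification used in the simulator above); in practice you would substitute a real sentence-embedding model.
import numpy as np
from typing import List, Tuple

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Placeholder embedding: a pseudo-random unit vector derived from the text hash.
    Swap in a real sentence-embedding model for actual use."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def select_examples(query: str,
                    pool: List[Tuple[str, str]],
                    k: int = 3) -> List[Tuple[str, str]]:
    """Pick the k pool examples whose text is most similar to the query."""
    q = embed(query)
    scored = sorted(pool, key=lambda ex: float(embed(ex[0]) @ q), reverse=True)
    return scored[:k]

pool = [
    ("This movie was wonderful!", "Positive"),
    ("Worst experience ever.", "Negative"),
    ("It's ordinary, nothing special.", "Neutral"),
    ("Quality exceeded expectations!", "Positive"),
    ("Disappointed with the service.", "Negative"),
]
for text, label in select_examples("The service was disappointing.", pool, k=3):
    print(f"[{label}] {text}")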
5.5 RLHF (Reinforcement Learning from Human Feedback)
What is RLHF
RLHF is a method to improve LLMs by leveraging human feedback. It is adopted by almost all commercial LLMs including ChatGPT, Claude, and Gemini.
[Figure: RLHF pipeline — Step 1: pre-trained LLM → Step 2: reward model training (humans evaluate and rank responses to create preferred/non-preferred pairs, from which a reward model is learned) → Step 3: reinforcement learning via PPO (the LLM generates responses, the reward model scores them, and high-scoring responses are reinforced) → optimized LLM.]
3 Steps of RLHF
Step 1: Pre-training
Train a Transformer on a large corpus to learn basic language patterns.
$$ \mathcal{L}_{\text{pretrain}} = -\mathbb{E}_{x \sim \mathcal{D}} \left[\log P_{\theta}(x)\right] $$
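The pre-training objective above is the standard next-token cross-entropy. The sketch below computes it for dummy logits and token ids in PyTorch; the batch size, sequence length, and vocabulary size are arbitrary illustrative values.
import torch
import torch.nn.functional as F

batch_size, seq_len, vocab_size = 2, 8, 1000

# Dummy model outputs and token ids (in practice these come from the LLM and the corpus)
logits = torch.randn(batch_size, seq_len, vocab_size)
tokens = torch.randint(0, vocab_size, (batch_size, seq_len))

# Predict token t+1 from positions up to t: shift logits and targets by one
pred = logits[:, :-1, :].reshape(-1, vocab_size)   # [(B*(T-1)), V]
target = tokens[:, 1:].reshape(-1)                 # [(B*(T-1))]

# L_pretrain = -E[log P_theta(x)]
loss = F.cross_entropy(pred, target)
print(f"Next-token cross-entropy: {loss.item():.4f}")  # roughly log(vocab_size) ~ 6.9 for random logits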
Step 2: Reward Model Training
Human evaluators compare multiple responses and rank preferences. A Reward Model is trained from this data.
$$ \mathcal{L}_{\text{reward}} = -\mathbb{E}_{(x, y_w, y_l)} \left[\log \sigma(r_{\phi}(x, y_w) - r_{\phi}(x, y_l))\right] $$
Where:
- $r_{\phi}$: Reward Model (parameters $\phi$)
- $y_w$: Preferred response (winner)
- $y_l$: Non-preferred response (loser)
- $\sigma$: Sigmoid function
Step 3: Reinforcement Learning via PPO
Optimize the LLM using the PPO (Proximal Policy Optimization) algorithm. KL divergence constraint prevents large deviations from the original pre-trained model.
$$ \mathcal{L}_{\text{RLHF}} = \mathbb{E}_{x, y} \left[r_{\phi}(x, y) - \beta \cdot D_{KL}(\pi_{\theta} || \pi_{\text{ref}})\right] $$
Where:
- $\pi_{\theta}$: Policy being optimized (LLM)
- $\pi_{\text{ref}}$: Reference policy (original model)
- $\beta$: Strength of KL constraint
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical
class RewardModel(nn.Module):
"""
Reward Model for RLHF
"""
def __init__(self, input_dim: int = 768, hidden_dim: int = 256):
"""
Args:
input_dim: Input embedding dimension (e.g., BERT embedding)
hidden_dim: Hidden layer dimension
"""
super().__init__()
self.network = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(0.1),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(0.1),
nn.Linear(hidden_dim, 1) # Output scalar reward
)
def forward(self, embeddings):
"""
Args:
embeddings: [batch_size, seq_len, input_dim]
Returns:
rewards: [batch_size] scalar rewards
"""
# Average pooling
pooled = embeddings.mean(dim=1) # [batch_size, input_dim]
rewards = self.network(pooled).squeeze(-1) # [batch_size]
return rewards
class RLHFTrainer:
"""
RLHF training process simulation
"""
def __init__(self, reward_model: RewardModel, beta: float = 0.1):
"""
Args:
reward_model: Trained Reward Model
beta: Strength of KL divergence constraint
"""
self.reward_model = reward_model
self.beta = beta
def compute_reward(self, response_embeddings: torch.Tensor) -> torch.Tensor:
"""
Calculate reward for responses
Args:
response_embeddings: [batch_size, seq_len, embed_dim]
Returns:
rewards: [batch_size]
"""
with torch.no_grad():
rewards = self.reward_model(response_embeddings)
return rewards
def compute_kl_penalty(self,
current_logprobs: torch.Tensor,
reference_logprobs: torch.Tensor) -> torch.Tensor:
"""
Calculate KL divergence penalty
Args:
current_logprobs: Log probabilities of current model
reference_logprobs: Log probabilities of reference model
Returns:
kl_penalty: KL divergence
"""
kl = current_logprobs - reference_logprobs
return kl.mean()
def ppo_loss(self,
old_logprobs: torch.Tensor,
new_logprobs: torch.Tensor,
advantages: torch.Tensor,
epsilon: float = 0.2) -> torch.Tensor:
"""
Calculate PPO (Proximal Policy Optimization) loss
Args:
old_logprobs: Log probabilities of old policy
new_logprobs: Log probabilities of new policy
advantages: Advantages (rewards - baseline)
epsilon: Clipping range
Returns:
ppo_loss: PPO loss
"""
# Probability ratio
ratio = torch.exp(new_logprobs - old_logprobs)
# Clipped objective function
clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
# PPO loss (to minimize)
loss = -torch.min(
ratio * advantages,
clipped_ratio * advantages
).mean()
return loss
# Usage example
print("=== RLHF Reward Model Demo ===\n")
# Initialize Reward Model
reward_model = RewardModel(input_dim=768, hidden_dim=256)
# Dummy data: embeddings of 2 responses
torch.manual_seed(42)
response1_emb = torch.randn(1, 20, 768) # Preferred response
response2_emb = torch.randn(1, 20, 768) # Non-preferred response
# Calculate rewards
reward1 = reward_model(response1_emb)
reward2 = reward_model(response2_emb)
print(f"Reward for response 1: {reward1.item():.4f}")
print(f"Reward for response 2: {reward2.item():.4f}")
# Calculate pairwise loss (during training)
pairwise_loss = -F.logsigmoid(reward1 - reward2).mean()
print(f"\nPairwise ranking loss: {pairwise_loss.item():.4f}")
# Initialize RLHF trainer
trainer = RLHFTrainer(reward_model, beta=0.1)
# PPO loss calculation example
old_logprobs = torch.randn(16) # [batch_size]
new_logprobs = old_logprobs + torch.randn(16) * 0.1
advantages = torch.randn(16)
ppo_loss = trainer.ppo_loss(old_logprobs, new_logprobs, advantages)
print(f"\nPPO loss: {ppo_loss.item():.4f}")
# Calculate KL penalty
kl_penalty = trainer.compute_kl_penalty(new_logprobs, old_logprobs)
print(f"KL penalty: {kl_penalty.item():.4f}")
# Total objective function
total_loss = ppo_loss + trainer.beta * kl_penalty
print(f"\nTotal loss (PPO + KL constraint): {total_loss.item():.4f}")
Challenges and Improvements in RLHF
| Challenge | Description | Solution |
|---|---|---|
| Reward Hacking | Model finds unintended ways to maximize reward | Diverse evaluators, adjust KL constraints |
| Evaluator Bias | Inconsistency in human evaluations | Multi-evaluator consensus, guideline development |
| Computational Cost | PPO training is computationally expensive | Alternative methods such as DPO (Direct Preference Optimization); see the sketch below |
| Excessive Safety | Responses become overly cautious | Fine-tuning reward model |
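The table mentions DPO (Direct Preference Optimization) as a lighter-weight alternative to PPO. The following is a minimal sketch of the DPO loss on dummy sequence log-probabilities: it directly pushes the policy's preferred-vs-rejected log-ratio above that of the reference model, with no separate reward model or RL loop. The tensor values are arbitrary placeholders.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w: torch.Tensor, policy_logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss: -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))).
    *_w: summed log-probs of the preferred response, *_l: of the non-preferred one."""
    chosen_logratio = policy_logp_w - ref_logp_w
    rejected_logratio = policy_logp_l - ref_logp_l
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Dummy sequence log-probabilities for a batch of 4 preference pairs
torch.manual_seed(0)
policy_w, policy_l = torch.randn(4) - 5.0, torch.randn(4) - 6.0
ref_w, ref_l = torch.randn(4) - 5.5, torch.randn(4) - 5.5

print(f"DPO loss: {dpo_loss(policy_w, policy_l, ref_w, ref_l).item():.4f}")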
5.6 Practical Projects
Project 1: Few-shot Text Classification System
# Requirements (assumed): openai<1.0 (this example uses the legacy openai.ChatCompletion interface)
import openai
import os
from typing import List, Dict, Tuple
from collections import Counter
class FewShotClassifier:
"""
General-purpose text classifier using Few-shot Learning
"""
def __init__(self, api_key: str = None, model: str = "gpt-3.5-turbo"):
"""
Args:
api_key: OpenAI API key (if None, get from environment variable)
model: Model name to use
"""
self.api_key = api_key or os.getenv('OPENAI_API_KEY')
self.model = model
openai.api_key = self.api_key
def create_few_shot_prompt(self,
examples: List[Tuple[str, str]],
query: str,
labels: List[str]) -> str:
"""
Build Few-shot prompt
Args:
examples: [(text1, label1), (text2, label2), ...]
query: Text to classify
labels: List of possible labels
Returns:
Constructed prompt
"""
prompt = "This is a task to classify sentences into categories.\n\n"
prompt += f"Available categories: {', '.join(labels)}\n\n"
# Add Few-shot examples
for i, (text, label) in enumerate(examples, 1):
prompt += f"Example {i}:\n"
prompt += f"Sentence: {text}\n"
prompt += f"Category: {label}\n\n"
# Add query
prompt += f"Classify the following sentence:\n"
prompt += f"Sentence: {query}\n"
prompt += f"Category:"
return prompt
def classify(self,
query: str,
examples: List[Tuple[str, str]],
labels: List[str],
temperature: float = 0.3) -> Dict[str, any]:
"""
Classify text
Args:
query: Text to classify
examples: Few-shot examples
labels: Possible labels
temperature: Generation diversity (0-1)
Returns:
Classification result dictionary
"""
prompt = self.create_few_shot_prompt(examples, query, labels)
try:
response = openai.ChatCompletion.create(
model=self.model,
messages=[{"role": "user", "content": prompt}],
temperature=temperature,
max_tokens=50
)
predicted_label = response.choices[0].message.content.strip()
return {
'query': query,
'predicted_label': predicted_label,
'prompt': prompt,
'success': True
}
except Exception as e:
return {
'query': query,
'error': str(e),
'success': False
}
def batch_classify(self,
queries: List[str],
examples: List[Tuple[str, str]],
labels: List[str]) -> List[Dict]:
"""
Batch classify multiple texts
Args:
queries: List of texts to classify
examples: Few-shot examples
labels: Possible labels
Returns:
List of classification results
"""
results = []
for query in queries:
result = self.classify(query, examples, labels)
results.append(result)
return results
# Usage example: News article classification
print("=== Few-shot Text Classification Demo ===\n")
# Note: Requires OpenAI API key for execution
# classifier = FewShotClassifier()
# Few-shot training examples
examples = [
("New smartphone released, reservations flooding in.", "Technology"),
("Stock market surges, hitting record highs.", "Business"),
("Japan's national soccer team wins at World Cup.", "Sports"),
("New movie released, breaking box office records.", "Entertainment"),
("Government announces new economic policy.", "Politics"),
]
# Labels
labels = ["Technology", "Business", "Sports", "Entertainment", "Politics"]
# Test queries
test_queries = [
"Massive investment being made in AI research and development.",
"Professional baseball championship team has been decided.",
"New game becomes a worldwide hit.",
]
# Execute classification (demo pseudocode)
print("Few-shot examples:")
for text, label in examples:
print(f" [{label}] {text}")
print("\nTexts to classify:")
for query in test_queries:
print(f" - {query}")
# Actual API call is commented out
# results = classifier.batch_classify(test_queries, examples, labels)
#
# print("\nResults:")
# for result in results:
# if result['success']:
# print(f"✅ [{result['predicted_label']}] {result['query']}")
# else:
# print(f"❌ Error: {result['error']}")
Project 2: Chain-of-Thought Reasoning Engine
class ChainOfThoughtReasoner:
"""
Problem-solving engine implementing Chain-of-Thought reasoning
"""
def __init__(self, api_key: str = None, model: str = "gpt-4"):
"""
Args:
api_key: OpenAI API key
model: Model to use (GPT-4 recommended for CoT)
"""
self.api_key = api_key or os.getenv('OPENAI_API_KEY')
self.model = model
openai.api_key = self.api_key
def solve_math_problem(self, problem: str) -> Dict[str, any]:
"""
Solve math problem with CoT
Args:
problem: Problem statement
Returns:
Solution and reasoning process
"""
prompt = f"""
Solve the following math problem step by step.
Clearly explain what you are calculating at each step.
Problem: {problem}
Solution steps:
Step 1:"""
try:
response = openai.ChatCompletion.create(
model=self.model,
messages=[{"role": "user", "content": prompt}],
temperature=0.3,
max_tokens=500
)
reasoning = response.choices[0].message.content
# Extract final answer (simplified version)
lines = reasoning.split('\n')
answer_line = [l for l in lines if 'Answer' in l or 'answer' in l]
final_answer = answer_line[-1] if answer_line else "Extraction failed"
return {
'problem': problem,
'reasoning': reasoning,
'final_answer': final_answer,
'success': True
}
except Exception as e:
return {
'problem': problem,
'error': str(e),
'success': False
}
def solve_logic_puzzle(self, puzzle: str, question: str) -> Dict[str, any]:
"""
Solve logic puzzle with CoT
Args:
puzzle: Puzzle situation description
question: Question to solve
Returns:
Solution and reasoning process
"""
prompt = f"""
Analyze and solve the following logic puzzle step by step.
Situation:
{puzzle}
Question: {question}
Analysis steps:
Observation 1:"""
try:
response = openai.ChatCompletion.create(
model=self.model,
messages=[{"role": "user", "content": prompt}],
temperature=0.3,
max_tokens=600
)
reasoning = response.choices[0].message.content
return {
'puzzle': puzzle,
'question': question,
'reasoning': reasoning,
'success': True
}
except Exception as e:
return {
'puzzle': puzzle,
'error': str(e),
'success': False
}
# Usage example
print("=== Chain-of-Thought Reasoning Engine Demo ===\n")
# reasoner = ChainOfThoughtReasoner()
# Math problem
math_problem = """
An item's regular price is 10,000 yen.
It was discounted 20% off in a sale, then an additional 500 yen off with a coupon.
Calculate the final payment amount.
"""
print("Math Problem:")
print(math_problem)
print("\nExpected reasoning:")
print("Step 1: Calculate 20% off amount → 10,000 × 0.2 = 2,000 yen")
print("Step 2: Price after sale → 10,000 - 2,000 = 8,000 yen")
print("Step 3: Apply coupon → 8,000 - 500 = 7,500 yen")
print("Answer: 7,500 yen")
# result = reasoner.solve_math_problem(math_problem)
# if result['success']:
# print("\nCoT Reasoning Result:")
# print(result['reasoning'])
Project 3: Integrated LLM Chatbot System
from datetime import datetime
from typing import Optional
class LLMChatbot:
"""
Chatbot integrating multiple prompt techniques
"""
def __init__(self, api_key: str = None, model: str = "gpt-3.5-turbo"):
self.api_key = api_key or os.getenv('OPENAI_API_KEY')
self.model = model
self.conversation_history = []
openai.api_key = self.api_key
def set_system_prompt(self, persona: str, capabilities: List[str]):
"""
Set system prompt (bot persona)
Args:
persona: Bot's personality and role
capabilities: List of bot's capabilities
"""
system_prompt = f"""
You are {persona}.
Your capabilities:
{chr(10).join('- ' + cap for cap in capabilities)}
In conversations with users, provide kind and accurate information.
For uncertain information, clearly state that.
"""
self.conversation_history = [
{"role": "system", "content": system_prompt}
]
def chat(self,
user_message: str,
use_cot: bool = False,
temperature: float = 0.7) -> Dict[str, any]:
"""
Respond to user message
Args:
user_message: User's input
use_cot: Whether to use Chain-of-Thought
temperature: Response diversity
Returns:
Response result
"""
# Add CoT prompt
if use_cot:
user_message = f"""
{user_message}
When answering the above question, please think step by step and explain.
"""
# Add to conversation history
self.conversation_history.append({
"role": "user",
"content": user_message
})
try:
response = openai.ChatCompletion.create(
model=self.model,
messages=self.conversation_history,
temperature=temperature,
max_tokens=500
)
assistant_message = response.choices[0].message.content
# Add assistant's response to history
self.conversation_history.append({
"role": "assistant",
"content": assistant_message
})
return {
'user': user_message,
'assistant': assistant_message,
'timestamp': datetime.now().isoformat(),
'success': True
}
except Exception as e:
return {
'user': user_message,
'error': str(e),
'success': False
}
def get_conversation_summary(self) -> str:
"""Get summary of conversation history"""
if len(self.conversation_history) <= 1:
return "Conversation has not started yet."
summary = "Conversation History:\n"
for i, msg in enumerate(self.conversation_history[1:], 1): # Skip system prompt
role = "User" if msg["role"] == "user" else "Assistant"
content = msg["content"][:100] + "..." if len(msg["content"]) > 100 else msg["content"]
summary += f"{i}. [{role}] {content}\n"
return summary
def clear_history(self, keep_system: bool = True):
"""
Clear conversation history
Args:
keep_system: Whether to keep system prompt
"""
if keep_system and self.conversation_history:
self.conversation_history = [self.conversation_history[0]]
else:
self.conversation_history = []
# Usage example
print("=== Integrated LLM Chatbot Demo ===\n")
# Initialize chatbot
# bot = LLMChatbot(model="gpt-3.5-turbo")
# Persona setup
persona = "a kind and knowledgeable AI assistant"
capabilities = [
"Answering general questions",
"Programming support",
"Solving math and logic problems",
"Text summarization and translation",
]
# bot.set_system_prompt(persona, capabilities)
print(f"Bot Persona: {persona}\n")
print("Conversation Simulation:\n")
# Conversation examples (demo)
demo_conversations = [
("Hello! Please tell me how to reverse a list in Python.", False),
("If the list has 1 million elements, what would be the most efficient method?", True), # Use CoT
]
for user_msg, use_cot in demo_conversations:
print(f"👤 User: {user_msg}")
if use_cot:
print(" (Using Chain-of-Thought reasoning)")
# Actual API call is commented out
# result = bot.chat(user_msg, use_cot=use_cot)
# if result['success']:
# print(f"🤖 Assistant: {result['assistant']}")
# else:
# print(f"❌ Error: {result['error']}")
print()
# print(bot.get_conversation_summary())
Exercises
Exercise 5.1: Understanding Scaling Laws
Problem: When training two models under the following conditions, which would be expected to have higher performance? Explain based on the Chinchilla scaling laws.
- Model A: 200B parameters, trained on 1 trillion tokens
- Model B: 70B parameters, trained on 4 trillion tokens
Hint: Chinchilla optimal ratio is "data tokens ≈ 20 × parameters".
Sample Answer
Analysis:
- Model A: Optimal data = 200B × 20 = 4 trillion tokens → Actual: 1 trillion tokens (insufficient)
- Model B: Optimal data = 70B × 20 = 1.4 trillion tokens → Actual: 4 trillion tokens (excessive but acceptable)
Conclusion: Model B is more likely to have higher performance. Model A is "over-parameterized" and its performance plateaus due to data shortage. As the Chinchilla paper shows, with the same compute budget, it's more efficient to train a smaller model on more data.
Exercise 5.2: Few-shot Prompt Design
Problem: Design an effective Few-shot prompt for the following task.
Task: Extract "rating score (1-5)" and "main reason" from product reviews
Requirements:
- Include 3 examples
- Clearly specify output format
- Consider edge cases (ambiguous ratings)
Sample Answer
Extract "rating score (1-5)" and "main reason" from the following product reviews.
Example 1:
Review: This vacuum has strong suction, is lightweight and easy to use. The price is reasonable and I'm very satisfied.
Output: {"score": 5, "reason": "Suction power, lightweight, cost performance"}
Example 2:
Review: The design is good, but battery life is poor and needs frequent charging.
Output: {"score": 2, "reason": "Short battery duration"}
Example 3:
Review: It's an ordinary product. Nothing particularly good or bad.
Output: {"score": 3, "reason": "No notable features"}
Review: {input_review}
Output:
Design Points:
- Include balanced examples: positive (Example 1), negative (Example 2), neutral (Example 3)
- Specify JSON format for structured output that's easy to parse
- "Reason" should be a concise summary, guiding against copying the entire review
Exercise 5.3: Implementing Chain-of-Thought Reasoning
Problem: Create a CoT prompt to solve the following logic puzzle.
Puzzle:
There are 3 suspects: A, B, and C. There is one culprit, and only the culprit is lying. Who is the culprit?
- A says "B is the culprit"
- B says "I am innocent"
- C says "A is the culprit"
Sample Answer
Analyze and solve the following logic puzzle step by step.
Puzzle:
There are 3 suspects A, B, C.
- A says "B is the culprit"
- B says "I am innocent"
- C says "A is the culprit"
There is one culprit, and only that person is lying. Who is the culprit?
Step-by-step analysis:
Hypothesis 1: If A is the culprit
- A's statement "B is the culprit" is a lie → ✓ Culprit lies
- B's statement "I am innocent" is true → ✓ B and C tell truth
- C's statement "A is the culprit" is true → ✓ No contradiction
Conclusion: A could be the culprit
Hypothesis 2: If B is the culprit
- A's statement "B is the culprit" is true → ✓ A is not the culprit, so telling the truth is consistent
- B's statement "I am innocent" is a lie → ✓ Culprit lies
- C's statement "A is the culprit" is a lie → ✗ Contradiction (C is not the culprit but would be lying)
Conclusion: Contradicts conditions
Hypothesis 3: If C is the culprit
- A's statement "B is the culprit" is a lie → ✗ Contradiction (A is not the culprit but would be lying)
- B's statement "I am innocent" is true → ✓
- C's statement "A is the culprit" is a lie → ✓ Culprit lies
Conclusion: Contradicts conditions
Final Conclusion: A is the culprit.
Reason: Only Hypothesis 1 satisfies all conditions.
CoT Design Points:
- Systematically verify all possibilities
- Clearly check for contradictions under each hypothesis
- Use symbols (✓, ✗) for visual clarity
Exercise 5.4: Understanding RLHF
Problem: Explain the role of the KL divergence constraint $\beta \cdot D_{KL}(\pi_{\theta} || \pi_{\text{ref}})$ used in RLHF. Also describe the problems when $\beta$ is too large or too small.
Sample Answer
Role of KL Divergence Constraint:
- Prevent mode collapse: Constrains the optimizing model $\pi_{\theta}$ from deviating too much from the reference model $\pi_{\text{ref}}$ (pre-trained).
- Preserve language ability: Prevents loss of basic language abilities like grammar and coherence during reward maximization.
- Avoid reward hacking: Prevents the model from learning extreme strategies that exploit vulnerabilities in the reward model.
When $\beta$ is too large:
- Problem: Model stays too close to reference model, reducing RLHF effectiveness
- Result: Human feedback is barely reflected, little improvement seen
When $\beta$ is too small:
- Problem: Model deviates significantly from reference model, generating unnatural outputs
- Result: Grammar collapse, nonsensical responses, reward hacking
Practical $\beta$ selection:
- Common range: 0.01~0.1
- Tuning method: Search for optimal value based on human evaluation on validation set
Exercise 5.5: LLM Application Design
Problem: Design an LLM chatbot for customer support. Propose a system architecture and prompt strategy that meets the following requirements.
Requirements:
- Immediately answer frequently asked questions (FAQ)
- Handle complex issues step by step
- Escalate to human operators when uncertain
- Context understanding considering conversation history
Sample Answer
System Architecture:
class CustomerSupportChatbot:
"""
LLM chatbot for customer support
"""
def __init__(self):
self.faq_database = self.load_faq()
self.conversation_history = []
self.escalation_threshold = 0.3 # Confidence threshold
def load_faq(self):
"""Load FAQ database"""
return {
"Shipping days": "Typically delivered in 3-5 business days.",
"Return policy": "Returns accepted within 30 days of purchase.",
"Payment methods": "We accept credit cards, bank transfers, and cash on delivery.",
}
def check_faq(self, query: str) -> Optional[str]:
"""Check for matching FAQ questions"""
# Simplified: Would use embedding-based similarity search in practice
for question, answer in self.faq_database.items():
if question in query:
return answer
return None
def classify_complexity(self, query: str) -> str:
"""Classify inquiry complexity"""
complexity_prompt = f"""
Classify the following inquiry as "Simple", "Medium", or "Complex".
Inquiry: {query}
Classification:"""
# LLM call (pseudocode)
# complexity = call_llm(complexity_prompt)
return "Medium" # For demo
def handle_query(self, query: str) -> Dict:
"""Handle inquiry"""
# Step 1: FAQ check
faq_answer = self.check_faq(query)
if faq_answer:
return {
'type': 'faq',
'answer': faq_answer,
'confidence': 1.0
}
# Step 2: Complexity assessment
complexity = self.classify_complexity(query)
# Step 3: Process based on complexity
if complexity == "Simple":
return self.simple_response(query)
elif complexity == "Medium":
return self.cot_response(query)
else:
return self.escalate_to_human(query)
def simple_response(self, query: str):
"""Simple Zero-shot response"""
prompt = f"""
You are a customer support assistant.
Answer the following question concisely.
Question: {query}
Answer:"""
# response = call_llm(prompt)
return {'type': 'simple', 'answer': "Response content"}
def cot_response(self, query: str):
"""Handle step by step with Chain-of-Thought"""
prompt = f"""
You are a customer support assistant.
Analyze the following problem step by step and propose a solution.
Problem: {query}
Analysis:
Step 1:"""
# response = call_llm(prompt)
return {'type': 'cot', 'answer': "Step-by-step response"}
def escalate_to_human(self, query: str):
"""Escalate to human operator"""
return {
'type': 'escalation',
'message': "This issue is complex. I'll connect you to a specialized operator.",
'query': query
}
Prompt Strategy:
- System Prompt: Clearly define bot's role, tone, and constraints
- Few-shot FAQ: Present similar question examples to improve accuracy
- CoT for Complex Issues: Analyze complex problems step by step
- Confidence Scoring: Evaluate response confidence and escalate when it is low (a minimal sketch appears at the end of this answer)
Evaluation Metrics:
- FAQ match rate: 70% or higher
- Escalation rate: 15% or lower
- Customer satisfaction: 4.0/5.0 or higher
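As a complement to the confidence-scoring point in the prompt strategy above, the following minimal sketch shows one possible way to wire the (currently unused) escalation_threshold attribute into the routing logic; the neutral default confidence of 0.5 for responses that carry no score is an assumption of this sketch.
from typing import Dict

def respond_or_escalate(bot: "CustomerSupportChatbot", query: str) -> Dict:
    """Route low-confidence answers to a human operator."""
    result = bot.handle_query(query)
    # Responses without an explicit confidence get a neutral default (assumption)
    confidence = result.get('confidence', 0.5)
    if confidence < bot.escalation_threshold:
        return bot.escalate_to_human(query)
    return result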
Summary
In this chapter, we learned the essence and practical utilization of Large Language Models (LLMs):
- ✅ Scaling Laws: Understood the relationship between model size, data amount, and compute, and grasped the Chinchilla optimal ratio
- ✅ Major LLMs: Compared architectures and differentiators of GPT, LLaMA, Claude, Gemini, etc.
- ✅ Prompt Engineering: Implemented techniques such as Zero-shot, Few-shot, and Chain-of-Thought
- ✅ In-Context Learning: Understood the mechanism for learning new tasks without parameter updates
- ✅ RLHF: Learned the model improvement process using human feedback
- ✅ Practical Projects: Built Few-shot classification, CoT reasoning, and integrated chatbots
Next Steps: Once you understand LLM fundamentals, move on to more advanced topics such as domain-specific fine-tuning, RAG (Retrieval-Augmented Generation), and multimodal LLMs. Ethical considerations and bias mitigation techniques for responsible AI development are also important.
Disclaimer
- This content is provided solely for educational, research, and informational purposes and does not constitute professional advice (legal, accounting, technical warranty, etc.).
- This content and accompanying code examples are provided "AS IS" without any warranty, express or implied, including but not limited to merchantability, fitness for a particular purpose, non-infringement, accuracy, completeness, operation, or safety.
- The author and Tohoku University assume no responsibility for the content, availability, or safety of external links, third-party data, tools, libraries, etc.
- To the maximum extent permitted by applicable law, the author and Tohoku University shall not be liable for any direct, indirect, incidental, special, consequential, or punitive damages arising from the use, execution, or interpretation of this content.
- The content may be changed, updated, or discontinued without notice.
- The copyright and license of this content are subject to the stated conditions (e.g., CC BY 4.0). Such licenses typically include no-warranty clauses.