Introduction
This final chapter brings together everything you've learned to build real-world applications. We'll cover four practical projects that demonstrate modern prompt engineering techniques, including vision prompting, prompt caching, and production deployment strategies.
By the end of this chapter, you will have:
- Built a document analysis system with structured extraction
- Created a vision-based application using multimodal prompts
- Implemented prompt caching for cost optimization
- Learned production deployment best practices
Project 1: Document Analysis System
Project Overview
Goal: Build a system that analyzes business documents (invoices, contracts, reports) and extracts structured information.
Techniques Used: Structured Output, Prompt Chains, Error Handling
Difficulty: Intermediate
System Architecture
Implementation
Document Classification
from openai import OpenAI
from pydantic import BaseModel
from typing import Literal
import json

client = OpenAI()

class DocumentClassification(BaseModel):
    document_type: Literal["invoice", "contract", "report", "other"]
    confidence: float
    language: str
    page_count_estimate: int

def classify_document(text: str) -> DocumentClassification:
    """Classify a document and return structured metadata."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """Classify the document type and extract metadata.

Document types:
- invoice: Bills, receipts, payment requests
- contract: Agreements, terms of service, NDAs
- report: Analysis, summaries, research documents
- other: Anything else"""
            },
            {"role": "user", "content": text[:4000]}  # First 4K chars
        ],
        response_format=DocumentClassification
    )
    return response.choices[0].message.parsed
Invoice Extraction
from pydantic import BaseModel, Field
from typing import Optional
from datetime import date

class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    total: float

class InvoiceData(BaseModel):
    invoice_number: str
    vendor_name: str
    vendor_address: Optional[str] = None
    customer_name: str
    invoice_date: date
    due_date: Optional[date] = None
    line_items: list[LineItem]
    subtotal: float
    tax_amount: float = 0
    total_amount: float
    currency: str = "USD"
    payment_terms: Optional[str] = None

def extract_invoice(text: str) -> InvoiceData:
    """Extract structured data from invoice text."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """Extract all invoice information into the structured format.

- Parse dates in ISO format (YYYY-MM-DD)
- Calculate totals if not explicitly stated
- Use the document's currency
- Extract all line items with quantities and prices"""
            },
            {"role": "user", "content": text}
        ],
        response_format=InvoiceData
    )
    return response.choices[0].message.parsed
Complete Pipeline
class DocumentProcessor:
    """Complete document processing pipeline."""

    def __init__(self):
        self.client = OpenAI()

    def process(self, document_text: str) -> dict:
        """Process any document type."""
        # Step 1: Classify
        classification = classify_document(document_text)

        # Step 2: Route to the appropriate extractor.
        # extract_contract and summarize_report follow the same pattern as
        # extract_invoice, each with its own Pydantic schema (not shown here).
        if classification.document_type == "invoice":
            data = extract_invoice(document_text)
        elif classification.document_type == "contract":
            data = extract_contract(document_text)
        elif classification.document_type == "report":
            data = summarize_report(document_text)
        else:
            data = {"raw_text": document_text[:1000]}

        # Step 3: Validate and return
        return {
            "classification": classification.model_dump(),
            "extracted_data": data.model_dump() if hasattr(data, 'model_dump') else data,
            "processing_status": "success"
        }

# Usage
processor = DocumentProcessor()
result = processor.process(invoice_text)
print(json.dumps(result, indent=2, default=str))
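Extracted numbers are only as reliable as the model, and a cheap arithmetic cross-check catches many hallucinated values before they reach downstream systems. A minimal sketch: the field names mirror the InvoiceData schema above, but the standalone `Item` dataclass and the 0.01 rounding tolerance are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Item:
    quantity: float
    unit_price: float
    total: float

def validate_invoice_totals(items: list[Item], subtotal: float,
                            tax_amount: float, total_amount: float,
                            tol: float = 0.01) -> list[str]:
    """Return a list of arithmetic inconsistencies found in extracted data."""
    issues = []
    for i, item in enumerate(items):
        # Each line item should satisfy quantity * unit_price == total
        if abs(item.quantity * item.unit_price - item.total) > tol:
            issues.append(f"line {i}: quantity * unit_price != total")
    # Line items should sum to the subtotal
    if abs(sum(item.total for item in items) - subtotal) > tol:
        issues.append("line item totals do not sum to subtotal")
    # Subtotal plus tax should equal the grand total
    if abs(subtotal + tax_amount - total_amount) > tol:
        issues.append("subtotal + tax != total")
    return issues

items = [Item(2, 5.0, 10.0), Item(1, 3.5, 3.5)]
print(validate_invoice_totals(items, 13.5, 1.0, 14.5))  # []
print(validate_invoice_totals(items, 13.5, 1.0, 20.0))  # ['subtotal + tax != total']
```

A non-empty result can route the document into the "review_needed" path rather than failing silently.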
Project 2: Vision Prompting Application
Project Overview
Goal: Build a multimodal application that analyzes images for product quality inspection.
Techniques Used: Vision Prompting, Structured Output, Few-shot with Images
Difficulty: Intermediate
Vision Prompting Basics
Modern LLMs (GPT-4V, Claude 3.5, Gemini) support image inputs alongside text. This enables powerful visual understanding applications.
Basic Image Analysis
import base64
from openai import OpenAI

client = OpenAI()

def encode_image(image_path: str) -> str:
    """Encode image to base64."""
    with open(image_path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

def analyze_image(image_path: str, prompt: str) -> str:
    """Analyze an image with a custom prompt."""
    base64_image = encode_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}",
                            "detail": "high"  # or "low" for faster/cheaper
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )
    return response.choices[0].message.content

# Usage
result = analyze_image(
    "product.jpg",
    "Describe any visible defects or quality issues in this product image."
)
Quality Inspection System
Product Quality Inspector
from pydantic import BaseModel
from typing import Literal
from enum import Enum

class DefectType(str, Enum):
    SCRATCH = "scratch"
    DENT = "dent"
    DISCOLORATION = "discoloration"
    CRACK = "crack"
    MISSING_PART = "missing_part"
    CONTAMINATION = "contamination"
    OTHER = "other"

class Defect(BaseModel):
    type: DefectType
    location: str  # e.g., "top-left corner", "center"
    severity: Literal["minor", "moderate", "severe"]
    description: str

class QualityInspection(BaseModel):
    product_identified: str
    overall_quality: Literal["pass", "fail", "review_needed"]
    confidence: float
    defects_found: list[Defect]
    recommendations: list[str]

def inspect_product(image_path: str, product_type: str) -> QualityInspection:
    """Perform quality inspection on a product image."""
    base64_image = encode_image(image_path)
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": f"""You are a quality control inspector for {product_type} products.
Analyze the image for any defects or quality issues.

Quality Standards:
- Minor defects: Cosmetic issues that don't affect function
- Moderate defects: Visible issues that may affect customer satisfaction
- Severe defects: Functional issues or safety concerns

Pass criteria: No severe defects, max 2 minor defects
Fail criteria: Any severe defect or more than 3 moderate defects
Review needed: Edge cases requiring human judgment"""
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Inspect this product image and provide a detailed quality assessment."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                    }
                ]
            }
        ],
        response_format=QualityInspection
    )
    return response.choices[0].message.parsed

# Usage
inspection = inspect_product("widget.jpg", "electronic component")
print(f"Quality: {inspection.overall_quality}")
print(f"Defects found: {len(inspection.defects_found)}")
for defect in inspection.defects_found:
    print(f"  - {defect.type.value}: {defect.description} ({defect.severity})")
Vision Prompting Best Practices
| Practice | Description |
|---|---|
| Use high detail for fine analysis | Set detail: "high" for detailed inspection tasks |
| Provide reference context | Tell the model what product it's looking at |
| Use structured output | Define schemas for consistent extraction |
| Multiple images for comparison | Send reference images alongside test images |
| Specify location format | Define how to report positions (grid, coordinates, descriptions) |
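The "multiple images for comparison" practice reduces to building a content list with one text part and several image parts. A small helper sketch, reusing the data-URL format from the analyze_image example above (the function name and placeholder base64 strings are illustrative):

```python
def build_vision_content(prompt: str, images_b64: list[str],
                         detail: str = "high") -> list[dict]:
    """Build a mixed text+image content list for one chat message.

    Image order matters: put the reference image(s) before the test image
    and say so in the prompt.
    """
    content = [{"type": "text", "text": prompt}]
    for b64 in images_b64:
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{b64}",
                "detail": detail,
            },
        })
    return content

# One text part plus two image parts
content = build_vision_content(
    "The first image is an approved reference unit. Compare the second image against it.",
    ["REF_B64_PLACEHOLDER", "TEST_B64_PLACEHOLDER"],
)
print(len(content))  # 3
```

The resulting list drops straight into `{"role": "user", "content": content}` in any of the vision calls above.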
Project 3: Prompt Caching for Cost Optimization
Project Overview
Goal: Implement prompt caching strategies that can reduce API costs by 50-90% on repeated context.
Techniques Used: Prompt Caching, Token Optimization, Batch Processing
Difficulty: Intermediate
Understanding Prompt Caching
Prompt Caching (available in Anthropic Claude and OpenAI) allows you to cache the prefix of prompts that are reused frequently. Cached tokens cost significantly less on subsequent requests.
Anthropic Prompt Caching
Claude Prompt Caching
import anthropic

client = anthropic.Anthropic()

# Large system prompt that will be cached
SYSTEM_PROMPT = """You are an expert legal assistant with deep knowledge of:

[Include 5000+ tokens of legal context, case law references,
jurisdiction-specific rules, document templates, etc.]

Your role is to help lawyers draft documents, review contracts,
and provide legal research assistance.
"""  # This could be 10K+ tokens

def query_legal_assistant(user_query: str):
    """Query with prompt caching enabled."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}  # Enable caching
            }
        ],
        messages=[
            {"role": "user", "content": user_query}
        ]
    )

    # Check cache usage in the response
    print(f"Input tokens: {response.usage.input_tokens}")
    print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
    print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")

    return response.content[0].text

# First call: Creates cache (higher cost)
result1 = query_legal_assistant("Review this NDA clause...")

# Subsequent calls: Use cache (90% cost reduction on cached tokens)
result2 = query_legal_assistant("Draft a confidentiality agreement...")
result3 = query_legal_assistant("What are the requirements for...")
OpenAI Automatic Caching
OpenAI Prompt Caching (Automatic)
from openai import OpenAI

client = OpenAI()

# OpenAI automatically caches prompts >= 1024 tokens
# that share the same prefix
LONG_CONTEXT = """
[Large document or context - 5000+ tokens]

This could be:
- A complete codebase for code review
- A lengthy document for analysis
- Extensive product documentation
- Historical conversation context
"""

def analyze_with_context(query: str):
    """Automatically benefits from caching with long prompts."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": LONG_CONTEXT  # Automatically cached if >= 1024 tokens
            },
            {"role": "user", "content": query}
        ]
    )

    # Check cache usage
    if hasattr(response.usage, 'prompt_tokens_details'):
        details = response.usage.prompt_tokens_details
        print(f"Cached tokens: {details.cached_tokens}")

    return response.choices[0].message.content
Cost Optimization Strategies
Caching Best Practices
- Front-load static content: Put cacheable content at the beginning of prompts
- Batch similar requests: Group requests with the same context
- Use appropriate cache TTL: Anthropic's ephemeral cache lives for 5 minutes by default, refreshed on each cache hit
- Monitor cache hit rates: Track to optimize prompt structure
- Minimize variable content: Keep dynamic parts at the end
Prompt Structure for Optimal Caching
# Optimal structure for caching
messages = [
    # CACHED SECTION (put first, don't change)
    {
        "role": "system",
        "content": LARGE_STATIC_CONTEXT  # 5000+ tokens, rarely changes
    },
    # SEMI-CACHED (changes occasionally)
    {
        "role": "user",
        "content": session_context  # Session-specific but stable
    },
    {
        "role": "assistant",
        "content": previous_response
    },
    # NOT CACHED (changes every request)
    {
        "role": "user",
        "content": current_query  # Different each time
    }
]
Cost Comparison
| Scenario | Without Caching | With Caching | Savings |
|---|---|---|---|
| 10K token system prompt, 100 requests | $3.00 | $0.33 | 89% |
| RAG with 8K context, 50 requests | $1.20 | $0.15 | 87% |
| Code review (full codebase), 20 requests | $2.00 | $0.25 | 87% |
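The savings figures above follow from simple arithmetic over the cached prefix. This sketch computes them from assumed multipliers: the 1.25x cache-write premium and 0.10x cache-read discount mirror Anthropic's published pricing at the time of writing, but verify against the current pricing page before relying on the numbers.

```python
def caching_savings(prefix_tokens: int, n_requests: int,
                    price_per_mtok: float,
                    write_multiplier: float = 1.25,
                    read_multiplier: float = 0.10) -> float:
    """Return fractional savings on the cached prefix across n_requests."""
    # Without caching: every request pays full price for the prefix
    base = prefix_tokens * n_requests * price_per_mtok / 1e6
    # With caching: the first request writes the cache at a premium,
    # the remaining n-1 requests read it at a discount
    cached = (prefix_tokens * write_multiplier
              + prefix_tokens * (n_requests - 1) * read_multiplier
              ) * price_per_mtok / 1e6
    return 1 - cached / base

# First table row: 10K-token system prompt reused across 100 requests
print(f"{caching_savings(10_000, 100, 3.00):.0%}")  # 89%
```

The result is independent of the absolute price: savings depend only on the multipliers and the request count, which is why the table rows cluster near 87-89%.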
Project 4: Production Deployment
Project Overview
Goal: Deploy a prompt-based application with proper monitoring, error handling, and scaling.
Techniques Used: Observability, Rate Limiting, Fallback Strategies
Difficulty: Advanced
Production Architecture
Robust LLM Client
Production-Ready Client
import time
from openai import OpenAI
from anthropic import Anthropic
from tenacity import retry, stop_after_attempt, wait_exponential
import logging

logger = logging.getLogger(__name__)

class ProductionLLMClient:
    """Production-ready LLM client with fallback and monitoring."""

    def __init__(self):
        self.openai = OpenAI()
        self.anthropic = Anthropic()
        # MetricsCollector: a simple event/timing recorder; an LLMMetrics-style
        # class like the one defined below fills this role
        self.metrics = MetricsCollector()

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=4, max=60)
    )
    def _call_openai(self, messages: list, **kwargs) -> str:
        """Call OpenAI with retry logic."""
        start = time.time()
        try:
            # pop() so "model" isn't passed twice via **kwargs
            model = kwargs.pop("model", "gpt-4o")
            response = self.openai.chat.completions.create(
                model=model,
                messages=messages,
                **kwargs
            )
            self.metrics.record("openai_success", time.time() - start)
            return response.choices[0].message.content
        except Exception as e:
            self.metrics.record("openai_error", time.time() - start)
            logger.error(f"OpenAI error: {e}")
            raise

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=4, max=60)
    )
    def _call_anthropic(self, messages: list, **kwargs) -> str:
        """Call Anthropic with retry logic."""
        start = time.time()
        try:
            response = self.anthropic.messages.create(
                model=kwargs.get("model", "claude-sonnet-4-20250514"),
                max_tokens=kwargs.get("max_tokens", 1024),
                messages=messages
            )
            self.metrics.record("anthropic_success", time.time() - start)
            return response.content[0].text
        except Exception as e:
            self.metrics.record("anthropic_error", time.time() - start)
            logger.error(f"Anthropic error: {e}")
            raise

    def complete(self, messages: list, primary: str = "openai", **kwargs) -> str:
        """Complete with automatic fallback."""
        try:
            if primary == "openai":
                return self._call_openai(messages, **kwargs)
            else:
                return self._call_anthropic(messages, **kwargs)
        except Exception as e:
            logger.warning(f"Primary provider failed, trying fallback: {e}")
            # Fall back to the other provider
            try:
                if primary == "openai":
                    return self._call_anthropic(messages, **kwargs)
                else:
                    return self._call_openai(messages, **kwargs)
            except Exception as e2:
                logger.error(f"All providers failed: {e2}")
                raise RuntimeError("All LLM providers unavailable")
Monitoring and Observability
Metrics Collection
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class LLMMetrics:
    """Track LLM usage metrics for monitoring."""
    requests_total: int = 0
    requests_success: int = 0
    requests_failed: int = 0
    total_tokens: int = 0
    total_cost: float = 0.0
    latencies: List[float] = field(default_factory=list)
    errors_by_type: Dict[str, int] = field(default_factory=dict)

    def record_request(self, success: bool, latency: float,
                       tokens: int, cost: float, error_type: str = None):
        self.requests_total += 1
        if success:
            self.requests_success += 1
        else:
            self.requests_failed += 1
            key = error_type or "unknown"
            self.errors_by_type[key] = self.errors_by_type.get(key, 0) + 1
        self.latencies.append(latency)
        self.total_tokens += tokens
        self.total_cost += cost

    def get_summary(self) -> dict:
        return {
            "success_rate": self.requests_success / max(self.requests_total, 1),
            "avg_latency": sum(self.latencies) / max(len(self.latencies), 1),
            "p95_latency": sorted(self.latencies)[int(len(self.latencies) * 0.95)]
                           if self.latencies else 0,
            "total_cost": self.total_cost,
            "error_breakdown": self.errors_by_type
        }

    def export_prometheus(self) -> str:
        """Export metrics in Prometheus exposition format."""
        return f"""
# HELP llm_requests_total Total LLM requests
# TYPE llm_requests_total counter
llm_requests_total {self.requests_total}

# HELP llm_requests_success_total Successful LLM requests
# TYPE llm_requests_success_total counter
llm_requests_success_total {self.requests_success}

# HELP llm_latency_seconds LLM request latency
# TYPE llm_latency_seconds histogram
llm_latency_seconds_sum {sum(self.latencies)}
llm_latency_seconds_count {len(self.latencies)}

# HELP llm_cost_dollars Total LLM cost
# TYPE llm_cost_dollars counter
llm_cost_dollars {self.total_cost}
"""
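Cost tracking also needs per-model prices to turn token counts into dollars. A small lookup sketch; the figures below are illustrative placeholders in USD per million tokens, not a substitute for the providers' current pricing pages.

```python
# Illustrative per-million-token prices (check the providers' pricing pages)
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Compute the dollar cost of one request from its token usage."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1e6

# 10K input tokens + 1K output tokens on gpt-4o at the placeholder prices
print(round(request_cost("gpt-4o", 10_000, 1_000), 4))  # 0.035
```

The returned value plugs directly into LLMMetrics.record_request's `cost` argument, giving cost attribution per model for free.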
Production Checklist
Deployment Checklist
- Error Handling: Retry logic, fallback providers, graceful degradation
- Rate Limiting: Per-user limits, global throttling, queue management
- Caching: Response caching, prompt caching, embedding caching
- Monitoring: Latency, success rate, token usage, cost tracking
- Logging: Request/response logging (sanitized), error tracking
- Security: Input validation, output filtering, PII handling
- Cost Controls: Budget alerts, usage caps, cost attribution
- Testing: Prompt regression tests, load testing, chaos testing
Final Exercises
Exercise 1: Document Processor Extension (Difficulty: Medium)
Task: Extend the document analysis system to handle:
- PDF documents (extract text first)
- Multi-language support (detect and translate)
- Confidence scoring for extracted fields
Exercise 2: Vision Application (Difficulty: Medium)
Task: Build a receipt scanner that:
- Accepts photos of receipts
- Extracts merchant, date, items, total
- Categorizes expenses automatically
- Handles poor image quality gracefully
Exercise 3: Caching Implementation (Difficulty: Medium)
Task: Implement a caching layer for a chatbot that:
- Caches the system prompt and knowledge base
- Tracks cache hit rate and cost savings
- Automatically invalidates cache when knowledge updates
Exercise 4: Production API (Difficulty: Advanced)
Task: Build a production-ready API endpoint that:
- Implements the ProductionLLMClient with fallback
- Includes rate limiting (10 requests/minute/user)
- Exports Prometheus metrics
- Logs all requests with correlation IDs
Exercise 5: Complete Project (Difficulty: Advanced)
Task: Combine all techniques to build a "Smart Meeting Notes" application:
- Input: Meeting transcript (text or audio)
- Processing: Extract action items, decisions, participants
- Output: Structured summary with follow-up tasks
- Features: Prompt caching, error handling, monitoring
Series Summary
Key Points from This Series
- Chapter 1: Prompt fundamentals - clarity, specificity, context, constraints
- Chapter 2: Basic techniques - Role Prompting, Structured Output, JSON Mode
- Chapter 3: Advanced techniques - Tree of Thought, ReAct, reasoning model prompts
- Chapter 4: Function Calling - tool definition, MCP, error handling, orchestration
- Chapter 5: Production - vision prompting, caching, deployment best practices
Continuing Your Journey
Congratulations on completing the Introduction to Prompt Engineering series! Here are recommended next steps:
- Practice: Apply these techniques to your own projects
- Explore: Try the AI Agents Introduction series
- Deepen: Study the LLM Basics Introduction for theoretical foundations
- Build: Create your own prompt library and templates
- Share: Contribute to the prompt engineering community
References
- OpenAI. (2024). Vision Guide
- Anthropic. (2024). Prompt Caching Documentation
- Google. (2024). Gemini Vision Capabilities
- LangChain. (2024). Production Deployment Guide
Update History
- 2026-01-12: v2.0 Initial release with Vision and Caching projects