This chapter covers embeddings and similarity search. You will learn how text is represented as vectors, how to measure similarity between those vectors, and how to store and query embeddings with vector databases such as FAISS, Chroma, and Pinecone.
1. Vector Embeddings
1.1 Embedding Concepts
An embedding represents a piece of text as a point in a high-dimensional vector space. Semantically similar texts are positioned close together in that space.
Embedding Properties:
- Semantic Representation: Captures the meaning of words and text as numerical vectors
- Dimensionality Reduction: Compresses the sparse, high-dimensional space of language into a dense vector of fixed size (e.g., 1536 dimensions)
- Comparability: Enables similarity calculation through vector operations
The cosine similarity between two vectors \(\mathbf{u}\) and \(\mathbf{v}\) is:
$$\text{similarity}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}$$
Its range is -1 (opposite direction) to 1 (identical direction).
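As a quick worked example (the vectors here are made-up values for illustration only): for \(\mathbf{u} = (1, 0)\) and \(\mathbf{v} = (1, 1)\), \(\mathbf{u} \cdot \mathbf{v} = 1\), \(\|\mathbf{u}\| = 1\), and \(\|\mathbf{v}\| = \sqrt{2}\), so the similarity is \(1/\sqrt{2} \approx 0.71\).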
Implementation Example 1: Embedding Generation and Similarity Calculation
from openai import OpenAI
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
class EmbeddingGenerator:
"""Embedding generation and similarity calculation"""
def __init__(self, api_key, model="text-embedding-3-small"):
self.client = OpenAI(api_key=api_key)
self.model = model
def get_embedding(self, text):
"""Get embedding for a single text"""
response = self.client.embeddings.create(
input=text,
model=self.model
)
return np.array(response.data[0].embedding)
def get_embeddings_batch(self, texts, batch_size=100):
"""Get embeddings with batch processing"""
embeddings = []
for i in range(0, len(texts), batch_size):
batch = texts[i:i + batch_size]
response = self.client.embeddings.create(
input=batch,
model=self.model
)
batch_embeddings = [
np.array(data.embedding) for data in response.data
]
embeddings.extend(batch_embeddings)
return np.array(embeddings)
def cosine_similarity(self, vec1, vec2):
"""Calculate cosine similarity"""
return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
def find_most_similar(self, query_text, document_texts, top_k=5):
"""Search for most similar documents"""
# Get embeddings
query_emb = self.get_embedding(query_text)
doc_embs = self.get_embeddings_batch(document_texts)
# Calculate similarities
similarities = cosine_similarity([query_emb], doc_embs)[0]
# Get Top-K
top_indices = np.argsort(similarities)[::-1][:top_k]
results = [
{
'text': document_texts[idx],
'score': float(similarities[idx]),
'rank': rank + 1
}
for rank, idx in enumerate(top_indices)
]
return results
# Usage example
generator = EmbeddingGenerator(api_key="your-api-key")
documents = [
"Machine learning is AI technology that learns from data",
"Deep learning uses neural networks",
"Natural language processing is a method for text analysis",
"Computer vision specializes in image recognition"
]
query = "AI-based text analysis"
results = generator.find_most_similar(query, documents, top_k=3)
for result in results:
print(f"Rank {result['rank']}: {result['text']}")
print(f"Similarity: {result['score']:.4f}\n")
1.2 Choosing Embedding Models
A wide range of embedding models is available; the best choice depends on the use case, balancing accuracy, language coverage, speed, and cost.
Implementation Example 2: Comparing Multiple Embedding Models
from sentence_transformers import SentenceTransformer
from langchain.embeddings import (
OpenAIEmbeddings, HuggingFaceEmbeddings
)
import time
class EmbeddingComparison:
"""Comparison of multiple embedding models"""
def __init__(self):
self.models = {}
def load_models(self, openai_api_key=None):
"""Load various models"""
# OpenAI
if openai_api_key:
self.models['openai-small'] = OpenAIEmbeddings(
model="text-embedding-3-small",
openai_api_key=openai_api_key
)
self.models['openai-large'] = OpenAIEmbeddings(
model="text-embedding-3-large",
openai_api_key=openai_api_key
)
# Sentence Transformers (local)
self.models['multilingual'] = SentenceTransformer(
'paraphrase-multilingual-MiniLM-L12-v2'
)
        # Multilingual model (covers Japanese among other languages)
        self.models['distiluse-multilingual'] = SentenceTransformer(
            'sentence-transformers/distiluse-base-multilingual-cased-v1'
        )
def benchmark_model(self, model_name, texts):
"""Benchmark a model"""
model = self.models[model_name]
start = time.time()
if isinstance(model, SentenceTransformer):
embeddings = model.encode(texts)
else:
embeddings = model.embed_documents(texts)
elapsed = time.time() - start
return {
'model': model_name,
'num_texts': len(texts),
'time': elapsed,
'time_per_text': elapsed / len(texts),
'dimension': len(embeddings[0])
}
def compare_all_models(self, test_texts):
"""Compare all models"""
results = []
for model_name in self.models.keys():
try:
result = self.benchmark_model(model_name, test_texts)
results.append(result)
print(f"{model_name}: {result['time']:.2f}s "
f"(dimension: {result['dimension']})")
except Exception as e:
print(f"{model_name}: Error - {e}")
return results
# Usage example
comparator = EmbeddingComparison()
comparator.load_models(openai_api_key="your-api-key")
test_texts = [
"Learning the basics of machine learning",
"Building deep learning models",
"Applications of natural language processing"
] * 10 # 30 texts
results = comparator.compare_all_models(test_texts)
2. Similarity Search
2.1 Search Algorithms
Vector databases retrieve the most similar vectors quickly from large embedding collections. The main search strategies are listed below; a minimal sketch comparing brute-force and HNSW search follows the list.
- Brute Force Search: Compares the query against every stored vector (suitable for small-scale data)
- Approximate Nearest Neighbor (ANN): Uses index structures such as HNSW or IVF to trade a small amount of accuracy for much faster search
- Hybrid Search: Combines vector (dense) search with keyword (sparse) search
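To make the distinction concrete, here is a minimal sketch, assuming faiss-cpu and numpy are installed and using random vectors in place of real embeddings (the dataset size, dimension, and HNSW parameter M=32 are illustrative assumptions, not values from the original text).
import faiss
import numpy as np
# Random vectors stand in for real embeddings (sizes chosen only for illustration)
dimension = 128
np.random.seed(0)
vectors = np.random.random((10000, dimension)).astype('float32')
query = np.random.random((1, dimension)).astype('float32')
# Brute force: exact search that compares the query against every stored vector
flat_index = faiss.IndexFlatL2(dimension)
flat_index.add(vectors)
exact_distances, exact_ids = flat_index.search(query, 5)
# HNSW: graph-based approximate nearest neighbor search (M=32 links per node)
hnsw_index = faiss.IndexHNSWFlat(dimension, 32)
hnsw_index.add(vectors)
approx_distances, approx_ids = hnsw_index.search(query, 5)
print("Exact top-5 ids:", exact_ids[0])
print("HNSW top-5 ids:", approx_ids[0])
At this scale both queries return almost instantly, but as the collection grows the flat index's search time increases linearly with the number of vectors, while the HNSW index remains fast at the cost of occasionally missing a true nearest neighbor.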
3. Vector Databases
3.1 FAISS (Facebook AI Similarity Search)
A high-speed similarity search library developed by Meta that runs in local environments.
Implementation Example 3: FAISS Implementation
import faiss
import numpy as np
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.docstore.document import Document
class FAISSVectorStore:
"""FAISS vector store implementation"""
def __init__(self, embeddings):
self.embeddings = embeddings
self.vectorstore = None
def create_index(self, documents, index_type='flat'):
"""Create index"""
# Use Langchain FAISS
self.vectorstore = FAISS.from_documents(
documents,
self.embeddings
)
# Custom index configuration is also possible
if index_type == 'ivf':
self._create_ivf_index(documents)
print(f"Index creation complete: {len(documents)} documents")
def _create_ivf_index(self, documents):
"""Create IVF (Inverted File) index"""
# Get embeddings
texts = [doc.page_content for doc in documents]
embeddings = self.embeddings.embed_documents(texts)
embeddings_array = np.array(embeddings).astype('float32')
# Number of dimensions
dimension = embeddings_array.shape[1]
# Create IVF index
nlist = 100 # Number of clusters
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
# Train
index.train(embeddings_array)
index.add(embeddings_array)
print(f"IVF index created: {nlist} clusters")
return index
def search(self, query, k=5, score_threshold=None):
"""Search for similar documents"""
        if score_threshold is not None:
results = self.vectorstore.similarity_search_with_relevance_scores(
query, k=k
)
# Filter by score
filtered = [
(doc, score) for doc, score in results
if score >= score_threshold
]
return filtered
else:
return self.vectorstore.similarity_search(query, k=k)
def search_with_metadata_filter(self, query, k=5, filter_dict=None):
"""Search with metadata filter"""
if filter_dict:
return self.vectorstore.similarity_search(
query, k=k, filter=filter_dict
)
return self.search(query, k=k)
def save_local(self, path):
"""Save locally"""
self.vectorstore.save_local(path)
print(f"Saved: {path}")
def load_local(self, path):
"""Load locally"""
self.vectorstore = FAISS.load_local(
path, self.embeddings
)
print(f"Loaded: {path}")
# Usage example
embeddings = OpenAIEmbeddings(openai_api_key="your-api-key")
faiss_store = FAISSVectorStore(embeddings)
# Prepare documents
documents = [
Document(
page_content="Python is a popular programming language",
metadata={"category": "programming", "language": "en"}
),
Document(
page_content="Python is commonly used for machine learning",
metadata={"category": "ml", "language": "en"}
)
]
# Create index
faiss_store.create_index(documents)
# Search
results = faiss_store.search("programming language", k=2)
for doc in results:
print(f"- {doc.page_content}")
# Save
faiss_store.save_local("./faiss_index")
3.2 ChromaDB
An open-source vector database that excels at metadata filtering.
Implementation Example 4: ChromaDB Implementation
import chromadb
from chromadb.config import Settings
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.docstore.document import Document  # needed for the usage example below
class ChromaVectorStore:
"""ChromaDB vector store implementation"""
def __init__(self, embeddings, persist_directory="./chroma_db"):
self.embeddings = embeddings
self.persist_directory = persist_directory
self.vectorstore = None
# Client configuration
self.client = chromadb.Client(Settings(
chroma_db_impl="duckdb+parquet",
persist_directory=persist_directory
))
def create_collection(self, documents, collection_name="default"):
"""Create collection"""
self.vectorstore = Chroma.from_documents(
documents=documents,
embedding=self.embeddings,
collection_name=collection_name,
persist_directory=self.persist_directory
)
# Persist
self.vectorstore.persist()
print(f"Collection created: {collection_name}")
def add_documents(self, documents):
"""Add documents"""
if not self.vectorstore:
raise ValueError("Collection not created")
self.vectorstore.add_documents(documents)
self.vectorstore.persist()
print(f"{len(documents)} documents added")
def search_with_filter(self, query, k=5, where=None, where_document=None):
"""Advanced filtering search"""
# Metadata filter
if where:
results = self.vectorstore.similarity_search(
query, k=k, filter=where
)
# Document content filter
elif where_document:
results = self.vectorstore.similarity_search(
query, k=k, where_document=where_document
)
else:
results = self.vectorstore.similarity_search(query, k=k)
return results
def mmr_search(self, query, k=5, fetch_k=20, lambda_mult=0.5):
"""MMR (Maximal Marginal Relevance) search
Search that balances diversity and relevance
"""
results = self.vectorstore.max_marginal_relevance_search(
query,
k=k,
fetch_k=fetch_k,
lambda_mult=lambda_mult # 0=diversity focused, 1=relevance focused
)
return results
def delete_collection(self, collection_name):
"""Delete collection"""
self.client.delete_collection(collection_name)
print(f"Deleted: {collection_name}")
# Usage example
embeddings = OpenAIEmbeddings(openai_api_key="your-api-key")
chroma_store = ChromaVectorStore(embeddings, persist_directory="./chroma_db")
documents = [
Document(
page_content="Introduction to Python machine learning",
metadata={"type": "tutorial", "level": "beginner", "year": 2024}
),
Document(
page_content="Advanced deep learning techniques",
metadata={"type": "advanced", "level": "expert", "year": 2024}
),
Document(
page_content="Data science fundamentals",
metadata={"type": "tutorial", "level": "beginner", "year": 2023}
)
]
# Create collection
chroma_store.create_collection(documents, collection_name="ml_docs")
# Metadata filter search
results = chroma_store.search_with_filter(
"machine learning",
k=2,
where={"level": "beginner", "year": 2024}
)
for doc in results:
print(f"- {doc.page_content}")
print(f" Metadata: {doc.metadata}")
# MMR search (diversity focused)
diverse_results = chroma_store.mmr_search(
"learning machine learning",
k=3,
lambda_mult=0.3 # Diversity focused
)
print(f"\nMMR search results: {len(diverse_results)} items")
3.3 Pinecone
A cloud-native vector database that excels at scalability.
Implementation Example 5: Pinecone Implementation
import pinecone
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings
from langchain.docstore.document import Document  # needed for the usage example below
import time
class PineconeVectorStore:
"""Pinecone vector store implementation"""
def __init__(self, api_key, environment, embeddings):
self.embeddings = embeddings
# Initialize Pinecone
pinecone.init(
api_key=api_key,
environment=environment
)
def create_index(self, index_name, dimension=1536, metric='cosine'):
"""Create index"""
# Check for existing index
if index_name not in pinecone.list_indexes():
pinecone.create_index(
name=index_name,
dimension=dimension,
metric=metric,
pods=1,
pod_type='p1.x1'
)
# Wait for index to be ready
time.sleep(1)
print(f"Index created: {index_name}")
else:
print(f"Using existing index: {index_name}")
def upsert_documents(self, index_name, documents):
"""Upsert documents"""
vectorstore = Pinecone.from_documents(
documents,
self.embeddings,
index_name=index_name
)
print(f"{len(documents)} documents upserted")
return vectorstore
def search_with_namespace(self, index_name, query, k=5, namespace=None):
"""Search with namespace specification"""
vectorstore = Pinecone.from_existing_index(
index_name=index_name,
embedding=self.embeddings,
namespace=namespace
)
results = vectorstore.similarity_search_with_score(query, k=k)
return results
    def hybrid_search(self, index_name, query, sparse_vector, k=5, alpha=0.5):
        """Hybrid search (dense vector + sparse vector)
        alpha: 0 = keyword (sparse) search only, 1 = vector (dense) search only
        Note: the Pinecone query API itself has no alpha parameter, so the
        weighting is applied here by scaling the dense and sparse query vectors.
        The caller must supply a sparse query vector (e.g., from a BM25 encoder)
        as a dict with 'indices' and 'values'.
        """
        index = pinecone.Index(index_name)
        # Dense query embedding
        query_vector = self.embeddings.embed_query(query)
        # Convex combination: scale the dense vector by alpha, the sparse by (1 - alpha)
        dense = [v * alpha for v in query_vector]
        sparse = {
            'indices': sparse_vector['indices'],
            'values': [v * (1 - alpha) for v in sparse_vector['values']]
        }
        # Execute hybrid search
        results = index.query(
            vector=dense,
            sparse_vector=sparse,
            top_k=k,
            include_metadata=True
        )
        return results
def delete_index(self, index_name):
"""Delete index"""
if index_name in pinecone.list_indexes():
pinecone.delete_index(index_name)
print(f"Index deleted: {index_name}")
def get_index_stats(self, index_name):
"""Get index statistics"""
index = pinecone.Index(index_name)
stats = index.describe_index_stats()
return stats
# Usage example
embeddings = OpenAIEmbeddings(openai_api_key="your-openai-key")
pinecone_store = PineconeVectorStore(
api_key="your-pinecone-key",
environment="us-west1-gcp",
embeddings=embeddings
)
# Create index
index_name = "ml-knowledge-base"
pinecone_store.create_index(index_name, dimension=1536)
# Upsert documents
documents = [
Document(
page_content="Fundamental theories of machine learning",
metadata={"category": "ml", "level": "basic"}
),
Document(
page_content="Deep learning implementation methods",
metadata={"category": "dl", "level": "advanced"}
)
]
vectorstore = pinecone_store.upsert_documents(index_name, documents)
# Search
results = pinecone_store.search_with_namespace(
index_name, "how to learn machine learning", k=3
)
for doc, score in results:
print(f"Score: {score:.4f}")
print(f"Content: {doc.page_content}")
print(f"Metadata: {doc.metadata}\n")
# Statistics
stats = pinecone_store.get_index_stats(index_name)
print(f"Total vector count: {stats['total_vector_count']}")
3.4 Vector DB Comparison and Selection
Implementation Example 6: Vector DB Performance Comparison
import time
from typing import List, Dict
from langchain.schema import Document
class VectorDBBenchmark:
"""Vector database performance comparison"""
def __init__(self):
self.results = []
def benchmark_indexing(self, db_name, vectorstore, documents):
"""Measure index creation time"""
start = time.time()
if db_name == "FAISS":
vectorstore.create_index(documents)
elif db_name == "Chroma":
vectorstore.create_collection(documents)
elif db_name == "Pinecone":
vectorstore.upsert_documents("benchmark", documents)
elapsed = time.time() - start
return {
'db': db_name,
'operation': 'indexing',
'num_docs': len(documents),
'time': elapsed,
'docs_per_sec': len(documents) / elapsed
}
def benchmark_search(self, db_name, vectorstore, queries, k=5):
"""Measure search time"""
start = time.time()
for query in queries:
if db_name == "FAISS":
vectorstore.search(query, k=k)
elif db_name == "Chroma":
vectorstore.search_with_filter(query, k=k)
elif db_name == "Pinecone":
vectorstore.search_with_namespace("benchmark", query, k=k)
elapsed = time.time() - start
return {
'db': db_name,
'operation': 'search',
'num_queries': len(queries),
'time': elapsed,
'queries_per_sec': len(queries) / elapsed,
'avg_latency_ms': (elapsed / len(queries)) * 1000
}
def compare_features(self):
"""Feature comparison table"""
comparison = {
'FAISS': {
'type': 'Local library',
'deployment': 'Self-hosted',
'scalability': 'Medium',
'metadata_filter': 'Limited',
'cost': 'Free (infrastructure costs only)',
'best_for': 'Small to medium scale, offline environments'
},
'Chroma': {
'type': 'Local/Server',
'deployment': 'Self-hosted/Cloud',
'scalability': 'Medium to High',
'metadata_filter': 'Powerful',
'cost': 'Free (open source)',
'best_for': 'Medium scale, development environments'
},
'Pinecone': {
'type': 'Cloud service',
'deployment': 'Managed',
'scalability': 'Very high',
'metadata_filter': 'Powerful',
'cost': 'Paid (usage-based)',
'best_for': 'Large scale, production environments'
}
}
return comparison
def print_comparison(self):
"""Display comparison results"""
features = self.compare_features()
print("=" * 80)
print("Vector Database Feature Comparison")
print("=" * 80)
for db_name, features_dict in features.items():
print(f"\n[{db_name}]")
for key, value in features_dict.items():
print(f" {key:20s}: {value}")
# Usage example
benchmark = VectorDBBenchmark()
# Display feature comparison
benchmark.print_comparison()
# Test data
test_documents = [
Document(page_content=f"Document {i}")
for i in range(1000)
]
test_queries = [f"Query {i}" for i in range(100)]
# Run benchmark for each DB
# faiss_result = benchmark.benchmark_indexing("FAISS", faiss_store, test_documents)
# chroma_result = benchmark.benchmark_indexing("Chroma", chroma_store, test_documents)
print("\nPerformance benchmark complete")
Selection guidelines:
- FAISS: Prototypes, small scale, offline environments
- Chroma: Development environments, medium scale, metadata utilization
- Pinecone: Production environments, large scale, managed service preference
Summary
- Embeddings transform text into vector space, enabling semantic similarity calculation
- Cosine similarity is the most common similarity metric
- FAISS, Chroma, and Pinecone each have different characteristics
- Select the appropriate vector DB based on use case and scale
Disclaimer
- This content is provided solely for educational, research, and informational purposes and does not constitute professional advice (legal, accounting, technical warranty, etc.).
- This content and accompanying code examples are provided "AS IS" without any warranty, express or implied, including but not limited to merchantability, fitness for a particular purpose, non-infringement, accuracy, completeness, operation, or safety.
- The author and Tohoku University assume no responsibility for the content, availability, or safety of external links, third-party data, tools, libraries, etc.
- To the maximum extent permitted by applicable law, the author and Tohoku University shall not be liable for any direct, indirect, incidental, special, consequential, or punitive damages arising from the use, execution, or interpretation of this content.
- The content may be changed, updated, or discontinued without notice.
- The copyright and license of this content are subject to the stated conditions (e.g., CC BY 4.0). Such licenses typically include no-warranty clauses.