This chapter covers Content. You will learn text feature extraction using TF-IDF, Build user profiles, and mechanisms of knowledge-based recommendations.
Learning Objectives
By reading this chapter, you will master the following:
- ✅ Understand the principles and implementation methods of content-based filtering
- ✅ Implement text feature extraction using TF-IDF
- ✅ Build user profiles and generate recommendations
- ✅ Design and implement hybrid recommendation methods
- ✅ Understand the mechanisms of knowledge-based recommendations
- ✅ Build context-aware recommendation systems
3.1 Principles of Content-Based Recommendation
What is Content-Based Filtering?
Content-Based Filtering is a method that makes recommendations by matching item features with user preferences.
"Recommend items with similar features to those the user has liked in the past"
Comparison with Collaborative Filtering
| Aspect | Collaborative Filtering | Content-Based |
|---|---|---|
| Basis for Recommendation | Similarity between users/items | Item features and user preferences |
| Required Data | User-item rating matrix | Item features |
| Cold Start | Difficult for new users/items | Possible for new items |
| Diversity | Can provide serendipitous recommendations | Tends to be biased toward known preferences |
| Scalability | Depends on number of users | Depends on item features |
Content-Based Recommendation Flow
Main Types of Features
| Feature Type | Examples | Extraction Methods |
|---|---|---|
| Text | Descriptions, reviews, tags | TF-IDF, Word2Vec, BERT |
| Categorical | Genre, category | One-Hot, Target Encoding |
| Numerical | Price, rating, popularity | Normalization, binning |
| Image | Product images, thumbnails | CNN, ResNet, CLIP |
| Audio | Music, podcasts | MFCC, spectrogram |
3.2 Text Feature Extraction with TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF is a metric that evaluates the importance of words in a document.
$$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) $$
- TF (Term Frequency): Frequency of word $t$ in document $d$
- IDF (Inverse Document Frequency): Rarity of word $t$
$$ \text{IDF}(t) = \log \frac{N}{\text{df}(t)} $$
- $N$: Total number of documents
- $\text{df}(t)$: Number of documents containing word $t$
Implementation: Movie Recommendation Based on Descriptions
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
"""
Example: Implementation: Movie Recommendation Based on Descriptions
Purpose: Demonstrate machine learning model training and evaluation
Target: Intermediate
Execution time: ~5 seconds
Dependencies: None
"""
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Sample movie data
movies = pd.DataFrame({
'title': [
'The Matrix',
'Inception',
'The Dark Knight',
'Interstellar',
'The Avengers',
'Iron Man',
'Titanic',
'The Notebook'
],
'description': [
'A computer hacker learns about the true nature of reality and his role in the war against its controllers',
'A thief who steals corporate secrets through dream-sharing technology',
'Batman fights crime and chaos in Gotham City with the Joker',
'A team of explorers travel through a wormhole in space in an attempt to ensure humanity survival',
'Earth mightiest heroes must come together to stop an alien invasion',
'A billionaire industrialist builds an armored suit to fight evil',
'A romance develops aboard a ship during its ill-fated maiden voyage',
'A poor yet passionate young man falls in love with a rich young woman'
],
'genre': ['Sci-Fi', 'Sci-Fi', 'Action', 'Sci-Fi', 'Action', 'Action', 'Romance', 'Romance']
})
print("=== Movie Data ===")
print(movies[['title', 'genre']])
# TF-IDF feature extraction
tfidf = TfidfVectorizer(stop_words='english', max_features=50)
tfidf_matrix = tfidf.fit_transform(movies['description'])
print(f"\n=== TF-IDF Matrix ===")
print(f"Shape: {tfidf_matrix.shape}")
print(f"Number of features: {len(tfidf.get_feature_names_out())}")
print(f"Key words: {tfidf.get_feature_names_out()[:15]}")
# Compute cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(f"\n=== Cosine Similarity Matrix ===")
print(f"Shape: {cosine_sim.shape}")
print("\nSimilarity between The Matrix and other movies:")
for i, title in enumerate(movies['title']):
print(f" {title}: {cosine_sim[0, i]:.3f}")
Output:
=== Movie Data ===
title genre
0 The Matrix Sci-Fi
1 Inception Sci-Fi
2 The Dark Knight Action
3 Interstellar Sci-Fi
4 The Avengers Action
5 Iron Man Action
6 Titanic Romance
7 The Notebook Romance
=== TF-IDF Matrix ===
Shape: (8, 50)
Number of features: 50
Key words: ['aboard' 'against' 'alien' 'armored' 'attempt' 'batman' 'billionaire'
'builds' 'chaos' 'city' 'come' 'computer' 'controllers' 'corporate'
'crime']
=== Cosine Similarity Matrix ===
Shape: (8, 8)
Similarity between The Matrix and other movies:
The Matrix: 1.000
Inception: 0.000
The Dark Knight: 0.000
Interstellar: 0.000
The Avengers: 0.000
Iron Man: 0.000
Titanic: 0.000
The Notebook: 0.087
Recommendation Function Implementation
def get_content_based_recommendations(movie_title, movies_df, cosine_sim_matrix, top_n=5):
"""
Generate content-based recommendations
Args:
movie_title: Reference movie title
movies_df: Movie dataframe
cosine_sim_matrix: Cosine similarity matrix
top_n: Number of movies to recommend
Returns:
Dataframe of recommended movies
"""
# Get movie index
idx = movies_df[movies_df['title'] == movie_title].index[0]
# Get similarity scores
sim_scores = list(enumerate(cosine_sim_matrix[idx]))
# Sort by similarity (excluding itself)
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n+1]
# Get movie indices
movie_indices = [i[0] for i in sim_scores]
similarity_scores = [i[1] for i in sim_scores]
# Return recommendation results
recommendations = movies_df.iloc[movie_indices].copy()
recommendations['similarity_score'] = similarity_scores
return recommendations[['title', 'genre', 'similarity_score']]
# Generate recommendations
print("\n=== Movies similar to 'The Matrix' ===")
recommendations = get_content_based_recommendations('The Matrix', movies, cosine_sim, top_n=3)
print(recommendations)
print("\n=== Movies similar to 'Titanic' ===")
recommendations = get_content_based_recommendations('Titanic', movies, cosine_sim, top_n=3)
print(recommendations)
Output:
=== Movies similar to 'The Matrix' ===
title genre similarity_score
1 Inception Sci-Fi 0.186
3 Interstellar Sci-Fi 0.159
2 The Dark Knight Action 0.124
=== Movies similar to 'Titanic' ===
title genre similarity_score
7 The Notebook Romance 0.573
4 The Avengers Action 0.097
5 Iron Man Action 0.000
3.3 Building User Profiles
What is a User Profile?
User Profile is a vector that aggregates features of items the user has rated in the past.
$$ \text{UserProfile}_u = \frac{1}{|I_u|} \sum_{i \in I_u} r_{ui} \cdot \text{ItemFeatures}_i $$
- $I_u$: Set of items rated by user $u$
- $r_{ui}$: User $u$'s rating (weight) for item $i$
Implementation: User Profile-Based Recommendation
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
"""
Example: Implementation: User Profile-Based Recommendation
Purpose: Demonstrate data manipulation and preprocessing
Target: Intermediate
Execution time: 10-30 seconds
Dependencies: None
"""
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# User rating data
user_ratings = pd.DataFrame({
'user_id': [1, 1, 1, 2, 2, 2, 3, 3],
'title': ['The Matrix', 'Inception', 'Interstellar',
'The Avengers', 'Iron Man', 'The Dark Knight',
'Titanic', 'The Notebook'],
'rating': [5, 4, 5, 5, 4, 4, 5, 5]
})
print("=== User Rating Data ===")
print(user_ratings)
# Retrieve TF-IDF matrix
tfidf_matrix = tfidf.transform(movies['description'])
def build_user_profile(user_id, ratings_df, movies_df, tfidf_matrix):
"""Build user profile"""
# Get user's rated items
user_data = ratings_df[ratings_df['user_id'] == user_id]
# Get movie indices and ratings
movie_indices = []
ratings = []
for _, row in user_data.iterrows():
idx = movies_df[movies_df['title'] == row['title']].index[0]
movie_indices.append(idx)
ratings.append(row['rating'])
# Create profile with weighted average
ratings = np.array(ratings)
user_profile = np.zeros(tfidf_matrix.shape[1])
for idx, rating in zip(movie_indices, ratings):
user_profile += rating * tfidf_matrix[idx].toarray().flatten()
user_profile /= np.sum(ratings)
return user_profile.reshape(1, -1)
# Build user profiles
user1_profile = build_user_profile(1, user_ratings, movies, tfidf_matrix)
user2_profile = build_user_profile(2, user_ratings, movies, tfidf_matrix)
user3_profile = build_user_profile(3, user_ratings, movies, tfidf_matrix)
print("\n=== User Profiles ===")
print(f"User 1 profile shape: {user1_profile.shape}")
print(f"User 2 profile shape: {user2_profile.shape}")
print(f"User 3 profile shape: {user3_profile.shape}")
def recommend_for_user(user_profile, movies_df, tfidf_matrix, watched_movies, top_n=3):
"""Recommend based on user profile"""
# Calculate similarity with all movies
similarities = cosine_similarity(user_profile, tfidf_matrix).flatten()
# Exclude watched movies
watched_indices = [movies_df[movies_df['title'] == title].index[0]
for title in watched_movies]
similarities[watched_indices] = -1
# Top-N recommendations
top_indices = similarities.argsort()[::-1][:top_n]
recommendations = movies_df.iloc[top_indices].copy()
recommendations['score'] = similarities[top_indices]
return recommendations[['title', 'genre', 'score']]
# Recommendations for each user
print("\n=== Recommendations for User 1 (Sci-Fi fan) ===")
watched_1 = user_ratings[user_ratings['user_id'] == 1]['title'].tolist()
recs_1 = recommend_for_user(user1_profile, movies, tfidf_matrix, watched_1, top_n=3)
print(recs_1)
print("\n=== Recommendations for User 2 (Action fan) ===")
watched_2 = user_ratings[user_ratings['user_id'] == 2]['title'].tolist()
recs_2 = recommend_for_user(user2_profile, movies, tfidf_matrix, watched_2, top_n=3)
print(recs_2)
print("\n=== Recommendations for User 3 (Romance fan) ===")
watched_3 = user_ratings[user_ratings['user_id'] == 3]['title'].tolist()
recs_3 = recommend_for_user(user3_profile, movies, tfidf_matrix, watched_3, top_n=3)
print(recs_3)
Output:
=== User Rating Data ===
user_id title rating
0 1 The Matrix 5
1 1 Inception 4
2 1 Interstellar 5
3 2 The Avengers 5
4 2 Iron Man 4
5 2 The Dark Knight 4
6 3 Titanic 5
7 3 The Notebook 5
=== User Profiles ===
User 1 profile shape: (1, 50)
User 2 profile shape: (1, 50)
User 3 profile shape: (1, 50)
=== Recommendations for User 1 (Sci-Fi fan) ===
title genre score
2 The Dark Knight Action 0.091847
5 Iron Man Action 0.062134
4 The Avengers Action 0.054621
=== Recommendations for User 2 (Action fan) ===
title genre score
0 The Matrix Sci-Fi 0.103256
1 Inception Sci-Fi 0.082471
3 Interstellar Sci-Fi 0.071829
=== Recommendations for User 3 (Romance fan) ===
title genre score
0 The Matrix Sci-Fi 0.028645
1 Inception Sci-Fi 0.027183
4 The Avengers Action 0.026451
3.4 Hybrid Recommendation Systems
Types of Hybrid Recommendations
| Method | Description | Implementation Difficulty |
|---|---|---|
| Weighted | Weighted linear combination of multiple method scores | Low |
| Switching | Switch methods based on situation | Medium |
| Mixed | Mix and present recommendations from multiple methods | Low |
| Feature Combination | Integrate collaborative and content features | High |
| Cascade | Apply methods in stages | Medium |
| Meta-level | Output of one method becomes input to another | High |
Implementation: Weighted Hybrid
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
"""
Example: Implementation: Weighted Hybrid
Purpose: Demonstrate data manipulation and preprocessing
Target: Intermediate
Execution time: 10-30 seconds
Dependencies: None
"""
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix
# Extended sample data
np.random.seed(42)
n_users = 50
n_movies = len(movies)
# User-movie rating matrix (for collaborative filtering)
ratings_data = []
for user_id in range(n_users):
n_ratings = np.random.randint(2, 6)
movie_indices = np.random.choice(n_movies, n_ratings, replace=False)
for movie_idx in movie_indices:
rating = np.random.randint(1, 6)
ratings_data.append({
'user_id': user_id,
'movie_id': movie_idx,
'rating': rating
})
ratings_df = pd.DataFrame(ratings_data)
# Build user-movie matrix
user_movie_matrix = csr_matrix(
(ratings_df['rating'], (ratings_df['user_id'], ratings_df['movie_id'])),
shape=(n_users, n_movies)
).toarray()
print("=== User-Movie Rating Matrix ===")
print(f"Shape: {user_movie_matrix.shape}")
print(f"Number of ratings: {len(ratings_df)}")
print(f"Sparsity: {(1 - len(ratings_df) / (n_users * n_movies)) * 100:.1f}%")
# Collaborative filtering score calculation (Item-based CF)
def collaborative_filtering_score(user_id, movie_id, user_movie_matrix):
"""Item-based CF score"""
# Movies rated by user
user_ratings = user_movie_matrix[user_id]
# Movie similarity (cosine similarity of rating vectors)
movie_sim = cosine_similarity(user_movie_matrix.T)
# Similarity between user's rated movies and target movie
rated_movies = np.where(user_ratings > 0)[0]
if len(rated_movies) == 0:
return 0.0
# Weighted average score
weighted_sum = 0
similarity_sum = 0
for rated_movie in rated_movies:
if rated_movie != movie_id:
sim = movie_sim[movie_id, rated_movie]
weighted_sum += sim * user_ratings[rated_movie]
similarity_sum += abs(sim)
if similarity_sum == 0:
return 0.0
return weighted_sum / similarity_sum
# Content-based score calculation
def content_based_score(user_id, movie_id, user_movie_matrix, tfidf_matrix):
"""Content-based score"""
# Build user profile
user_ratings = user_movie_matrix[user_id]
rated_movies = np.where(user_ratings > 0)[0]
if len(rated_movies) == 0:
return 0.0
# Weighted average profile
user_profile = np.zeros(tfidf_matrix.shape[1])
for movie_idx in rated_movies:
user_profile += user_ratings[movie_idx] * tfidf_matrix[movie_idx].toarray().flatten()
user_profile /= np.sum(user_ratings[rated_movies])
# Similarity with target movie
movie_features = tfidf_matrix[movie_id].toarray().flatten()
score = cosine_similarity(user_profile.reshape(1, -1),
movie_features.reshape(1, -1))[0, 0]
return score
# Hybrid score calculation
def hybrid_score(user_id, movie_id, user_movie_matrix, tfidf_matrix,
alpha=0.5):
"""
Weighted Hybrid score
Args:
alpha: Weight for collaborative filtering (0-1)
1-alpha is weight for content-based
"""
cf_score = collaborative_filtering_score(user_id, movie_id, user_movie_matrix)
cb_score = content_based_score(user_id, movie_id, user_movie_matrix, tfidf_matrix)
# Normalize (scores to 0-5 range)
cf_score_norm = cf_score / 5.0 if cf_score > 0 else 0
cb_score_norm = cb_score
hybrid = alpha * cf_score_norm + (1 - alpha) * cb_score_norm
return hybrid, cf_score_norm, cb_score_norm
# Generate hybrid recommendations
def get_hybrid_recommendations(user_id, user_movie_matrix, tfidf_matrix,
movies_df, alpha=0.5, top_n=5):
"""Hybrid recommendations"""
# Get unrated movies
user_ratings = user_movie_matrix[user_id]
unrated_movies = np.where(user_ratings == 0)[0]
# Calculate scores
scores = []
for movie_id in unrated_movies:
hybrid, cf, cb = hybrid_score(user_id, movie_id, user_movie_matrix,
tfidf_matrix, alpha)
scores.append({
'movie_id': movie_id,
'hybrid_score': hybrid,
'cf_score': cf,
'cb_score': cb
})
# Sort by score
scores_df = pd.DataFrame(scores).sort_values('hybrid_score', ascending=False)
top_scores = scores_df.head(top_n)
# Join movie information
recommendations = movies_df.iloc[top_scores['movie_id']].copy()
recommendations['hybrid_score'] = top_scores['hybrid_score'].values
recommendations['cf_score'] = top_scores['cf_score'].values
recommendations['cb_score'] = top_scores['cb_score'].values
return recommendations[['title', 'genre', 'hybrid_score', 'cf_score', 'cb_score']]
# Generate recommendations (compare different alpha values)
print("\n=== Recommendations for User 0 ===")
print("\n[α=0.3: Content-based emphasis]")
recs_cb_heavy = get_hybrid_recommendations(0, user_movie_matrix, tfidf_matrix,
movies, alpha=0.3, top_n=5)
print(recs_cb_heavy)
print("\n[α=0.5: Balanced]")
recs_balanced = get_hybrid_recommendations(0, user_movie_matrix, tfidf_matrix,
movies, alpha=0.5, top_n=5)
print(recs_balanced)
print("\n[α=0.7: Collaborative filtering emphasis]")
recs_cf_heavy = get_hybrid_recommendations(0, user_movie_matrix, tfidf_matrix,
movies, alpha=0.7, top_n=5)
print(recs_cf_heavy)
Sample Output:
=== User-Movie Rating Matrix ===
Shape: (50, 8)
Number of ratings: 174
Sparsity: 56.5%
=== Recommendations for User 0 ===
[α=0.3: Content-based emphasis]
title genre hybrid_score cf_score cb_score
1 Inception Sci-Fi 0.142635 0.183421 0.124738
0 The Matrix Sci-Fi 0.138274 0.167283 0.125196
3 Interstellar Sci-Fi 0.129847 0.154923 0.119324
5 Iron Man Action 0.098234 0.112453 0.092847
4 The Avengers Action 0.094512 0.108734 0.089273
[α=0.5: Balanced]
title genre hybrid_score cf_score cb_score
1 Inception Sci-Fi 0.154080 0.183421 0.124738
0 The Matrix Sci-Fi 0.146240 0.167283 0.125196
3 Interstellar Sci-Fi 0.137124 0.154923 0.119324
5 Iron Man Action 0.102650 0.112453 0.092847
4 The Avengers Action 0.099004 0.108734 0.089273
[α=0.7: Collaborative filtering emphasis]
title genre hybrid_score cf_score cb_score
1 Inception Sci-Fi 0.165525 0.183421 0.124738
0 The Matrix Sci-Fi 0.154205 0.167283 0.125196
3 Interstellar Sci-Fi 0.144400 0.154923 0.119324
5 Iron Man Action 0.107066 0.112453 0.092847
4 The Avengers Action 0.103495 0.108734 0.089273
Switching Hybrid: Situation-Adaptive
def switching_hybrid_recommendation(user_id, user_movie_matrix, tfidf_matrix,
movies_df, top_n=5):
"""
Switching Hybrid: Switch methods based on situation
Rules:
- User has few ratings (< 3) → Content-based
- User has sufficient ratings → Collaborative filtering
"""
user_ratings = user_movie_matrix[user_id]
n_ratings = np.sum(user_ratings > 0)
print(f"\nUser {user_id}: Number of ratings = {n_ratings}")
if n_ratings < 3:
print("→ Using content-based recommendation (insufficient ratings)")
method = 'content_based'
alpha = 0.0 # 100% content-based
else:
print("→ Using collaborative filtering (sufficient ratings)")
method = 'collaborative'
alpha = 1.0 # 100% collaborative filtering
recommendations = get_hybrid_recommendations(
user_id, user_movie_matrix, tfidf_matrix, movies_df,
alpha=alpha, top_n=top_n
)
return recommendations, method
# Compare users with few and many ratings
user_sparse = 0 # Few ratings
user_dense = np.argmax(np.sum(user_movie_matrix > 0, axis=1)) # Many ratings
print("=== Switching Hybrid Recommendations ===")
recs_sparse, method_sparse = switching_hybrid_recommendation(
user_sparse, user_movie_matrix, tfidf_matrix, movies, top_n=3
)
print(recs_sparse)
recs_dense, method_dense = switching_hybrid_recommendation(
user_dense, user_movie_matrix, tfidf_matrix, movies, top_n=3
)
print(recs_dense)
3.5 Knowledge-based Recommendation
What is Knowledge-based Recommendation?
Knowledge-based recommendation uses domain knowledge and constraints to make recommendations.
| Type | Description | Use Cases |
|---|---|---|
| Constraint-based | Items that satisfy user constraints | Real estate, travel, job search |
| Case-based | Recommendations from similar past cases | Medical diagnosis, customer support |
| Conversational | Identify needs through dialogue | Chatbots, personal assistants |
Implementation: Constraint-based Recommendation
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
"""
Example: Implementation: Constraint-based Recommendation
Purpose: Demonstrate data manipulation and preprocessing
Target: Intermediate
Execution time: 1-5 minutes
Dependencies: None
"""
import numpy as np
import pandas as pd
# Movie data (with detailed information)
movies_detailed = pd.DataFrame({
'title': ['The Matrix', 'Inception', 'The Dark Knight', 'Interstellar',
'The Avengers', 'Iron Man', 'Titanic', 'The Notebook'],
'genre': ['Sci-Fi', 'Sci-Fi', 'Action', 'Sci-Fi', 'Action', 'Action', 'Romance', 'Romance'],
'year': [1999, 2010, 2008, 2014, 2012, 2008, 1997, 2004],
'duration_min': [136, 148, 152, 169, 143, 126, 195, 123],
'rating': [8.7, 8.8, 9.0, 8.6, 8.0, 7.9, 7.9, 7.8],
'language': ['English', 'English', 'English', 'English',
'English', 'English', 'English', 'English']
})
print("=== Detailed Movie Data ===")
print(movies_detailed)
def constraint_based_recommendation(constraints, movies_df):
"""
Constraint-based recommendation
Args:
constraints: Dictionary of constraints
Example: {'genre': 'Sci-Fi', 'min_rating': 8.5, 'max_duration': 150}
"""
filtered = movies_df.copy()
# Apply each constraint
for key, value in constraints.items():
if key == 'genre':
filtered = filtered[filtered['genre'] == value]
elif key == 'min_rating':
filtered = filtered[filtered['rating'] >= value]
elif key == 'max_rating':
filtered = filtered[filtered['rating'] <= value]
elif key == 'min_year':
filtered = filtered[filtered['year'] >= value]
elif key == 'max_year':
filtered = filtered[filtered['year'] <= value]
elif key == 'max_duration':
filtered = filtered[filtered['duration_min'] <= value]
elif key == 'min_duration':
filtered = filtered[filtered['duration_min'] >= value]
return filtered
# Use case 1: Short, highly-rated sci-fi movies
print("\n=== Use Case 1: Short, highly-rated Sci-Fi movies ===")
constraints_1 = {
'genre': 'Sci-Fi',
'min_rating': 8.5,
'max_duration': 150
}
print(f"Constraints: {constraints_1}")
recommendations_1 = constraint_based_recommendation(constraints_1, movies_detailed)
print(recommendations_1[['title', 'genre', 'rating', 'duration_min']])
# Use case 2: Action movies from 2010 onwards
print("\n=== Use Case 2: Action movies from 2010 onwards ===")
constraints_2 = {
'genre': 'Action',
'min_year': 2010
}
print(f"Constraints: {constraints_2}")
recommendations_2 = constraint_based_recommendation(constraints_2, movies_detailed)
print(recommendations_2[['title', 'genre', 'year', 'rating']])
# Use case 3: Highly-rated long movies
print("\n=== Use Case 3: Highly-rated long movies ===")
constraints_3 = {
'min_rating': 8.5,
'min_duration': 150
}
print(f"Constraints: {constraints_3}")
recommendations_3 = constraint_based_recommendation(constraints_3, movies_detailed)
print(recommendations_3[['title', 'rating', 'duration_min']])
Output:
=== Detailed Movie Data ===
title genre year duration_min rating language
0 The Matrix Sci-Fi 1999 136 8.7 English
1 Inception Sci-Fi 2010 148 8.8 English
2 The Dark Knight Action 2008 152 9.0 English
3 Interstellar Sci-Fi 2014 169 8.6 English
4 The Avengers Action 2012 143 8.0 English
5 Iron Man Action 2008 126 7.9 English
6 Titanic Romance 1997 195 7.9 English
7 The Notebook Romance 2004 123 7.8 English
=== Use Case 1: Short, highly-rated Sci-Fi movies ===
Constraints: {'genre': 'Sci-Fi', 'min_rating': 8.5, 'max_duration': 150}
title genre rating duration_min
0 The Matrix Sci-Fi 8.7 136
1 Inception Sci-Fi 8.8 148
=== Use Case 2: Action movies from 2010 onwards ===
Constraints: {'genre': 'Action', 'min_year': 2010}
title genre year rating
4 The Avengers Action 2012 8.0
=== Use Case 3: Highly-rated long movies ===
Constraints: {'min_rating': 8.5, 'min_duration': 150}
title rating duration_min
2 The Dark Knight 9.0 152
3 Interstellar 8.6 169
3.6 Context-Aware Recommendation
Types of Context Information
| Context | Examples | Usage |
|---|---|---|
| Time | Time of day, day of week, season | Time-based recommendations |
| Location | Geolocation, device | Location-based recommendations |
| Social | Companions, groups | Group recommendations |
| Activity | Working, resting | Situation-adaptive recommendations |
| Mood | Emotional state | Emotion-aware recommendations |
Implementation: Time-Aware Recommendation
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
"""
Example: Implementation: Time-Aware Recommendation
Purpose: Demonstrate data manipulation and preprocessing
Target: Intermediate
Execution time: 5-10 seconds
Dependencies: None
"""
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
# Generate viewing history by time of day
np.random.seed(42)
n_records = 200
# Generate hours
hours = np.random.randint(0, 24, n_records)
users = np.random.randint(0, 20, n_records)
movie_ids = np.random.randint(0, len(movies), n_records)
viewing_history = pd.DataFrame({
'user_id': users,
'movie_id': movie_ids,
'hour': hours,
'rating': np.random.randint(1, 6, n_records)
})
# Join movie titles and genres
viewing_history['title'] = viewing_history['movie_id'].apply(
lambda x: movies.iloc[x]['title']
)
viewing_history['genre'] = viewing_history['movie_id'].apply(
lambda x: movies.iloc[x]['genre']
)
# Categorize time of day
def categorize_time(hour):
if 6 <= hour < 12:
return 'morning'
elif 12 <= hour < 18:
return 'afternoon'
elif 18 <= hour < 22:
return 'evening'
else:
return 'night'
viewing_history['time_of_day'] = viewing_history['hour'].apply(categorize_time)
print("=== Viewing History by Time of Day ===")
print(viewing_history.head(10))
# Analyze genre preferences by time of day
time_genre_prefs = viewing_history.groupby(['time_of_day', 'genre']).agg({
'rating': ['mean', 'count']
}).reset_index()
time_genre_prefs.columns = ['time_of_day', 'genre', 'avg_rating', 'count']
print("\n=== Genre Preferences by Time of Day ===")
print(time_genre_prefs.sort_values(['time_of_day', 'avg_rating'], ascending=[True, False]))
def time_aware_recommendation(user_id, current_hour, viewing_history, movies_df, top_n=3):
"""Time-aware recommendation"""
time_category = categorize_time(current_hour)
# Get genre preferences for this time of day
time_prefs = viewing_history[viewing_history['time_of_day'] == time_category]
genre_scores = time_prefs.groupby('genre')['rating'].mean().to_dict()
# Get unwatched movies
watched_movies = viewing_history[viewing_history['user_id'] == user_id]['movie_id'].unique()
unwatched = movies_df[~movies_df.index.isin(watched_movies)].copy()
# Assign genre scores
unwatched['time_score'] = unwatched['genre'].map(genre_scores).fillna(0)
# Top-N recommendations
recommendations = unwatched.nlargest(top_n, 'time_score')
return recommendations[['title', 'genre', 'time_score']], time_category
# Recommendations at different times of day
print("\n=== Time-Aware Recommendations ===")
print("\n[Recommendations at 8 AM]")
recs_morning, time_cat = time_aware_recommendation(0, 8, viewing_history, movies, top_n=3)
print(f"Time of day: {time_cat}")
print(recs_morning)
print("\n[Recommendations at 8 PM]")
recs_evening, time_cat = time_aware_recommendation(0, 20, viewing_history, movies, top_n=3)
print(f"Time of day: {time_cat}")
print(recs_evening)
print("\n[Recommendations at 1 AM]")
recs_night, time_cat = time_aware_recommendation(0, 1, viewing_history, movies, top_n=3)
print(f"Time of day: {time_cat}")
print(recs_night)
Sample Output:
=== Viewing History by Time of Day ===
user_id movie_id hour rating title genre time_of_day
0 12 6 18 1 Titanic Romance evening
1 11 7 19 1 The Notebook Romance evening
2 18 2 9 5 The Dark Knight Action morning
3 8 6 3 5 Titanic Romance night
4 10 2 14 2 The Dark Knight Action afternoon
...
=== Genre Preferences by Time of Day ===
time_of_day genre avg_rating count
0 afternoon Action 3.181818 22
1 afternoon Romance 2.866667 15
2 afternoon Sci-Fi 2.952381 21
3 evening Action 3.000000 19
4 evening Romance 2.941176 17
5 evening Sci-Fi 3.117647 17
6 morning Action 3.263158 19
7 morning Romance 3.250000 16
8 morning Sci-Fi 3.105263 19
9 night Action 3.055556 18
10 night Romance 3.166667 18
11 night Sci-Fi 2.857143 14
=== Time-Aware Recommendations ===
[Recommendations at 8 AM]
Time of day: morning
title genre time_score
2 The Dark Knight Action 3.263158
0 The Matrix Sci-Fi 3.105263
7 The Notebook Romance 3.250000
[Recommendations at 8 PM]
Time of day: evening
title genre time_score
1 Inception Sci-Fi 3.117647
2 The Dark Knight Action 3.000000
7 The Notebook Romance 2.941176
[Recommendations at 1 AM]
Time of day: night
title genre time_score
7 The Notebook Romance 3.166667
2 The Dark Knight Action 3.055556
1 Inception Sci-Fi 2.857143
3.7 Chapter Summary
What We Learned
Content-Based Filtering
- Text feature extraction with TF-IDF
- User profile construction
- Recommendation generation using cosine similarity
- Addressing cold start problems
Hybrid Recommendation
- Weighted: Weighted linear combination
- Switching: Situation-adaptive switching
- Mixed: Mixing multiple methods
- Complementary use of collaborative and content-based
Knowledge-based Recommendation
- Constraint-based recommendation implementation
- Utilizing domain knowledge
- Responding to explicit user requirements
Context-Aware Recommendation
- Considering time, location, and activity
- Situation-adaptive recommendations
- Multi-criteria recommendation systems
Method Selection Guidelines
| Situation | Recommended Method | Reason |
|---|---|---|
| Many new items | Content-based | Can recommend immediately with features |
| Many new users | Knowledge-based, Hybrid | No rating data required |
| Rich rating data | Collaborative filtering, Hybrid | Leverage collective intelligence |
| Diversity is important | Hybrid | Reduce bias with multiple methods |
| Clear constraints | Knowledge-based | Ensure constraints are met |
| High context dependency | Context-aware | Reflect time and location |
Next Chapter
In Chapter 4, you will learn about Deep Learning-Based Recommendation Systems, covering Neural Collaborative Filtering (NCF), the use of embedding layers, DeepFM which combines Factorization Machines with DNNs, Two-Tower Models for large-scale systems, and Transformer-based recommendation approaches.
Exercises
Exercise 1 (Difficulty: Easy)
List three main differences between content-based filtering and collaborative filtering, and explain the advantages and disadvantages of each.
Sample Answer
Main Differences:
Basis for Recommendation
- Content-based: Item features
- Collaborative filtering: Rating patterns between users/items
Required Data
- Content-based: Item features (metadata)
- Collaborative filtering: User-item rating matrix
Cold Start Problem
- Content-based: Can recommend new items if features exist
- Collaborative filtering: Difficult for new users/items
Advantages and Disadvantages:
| Method | Advantages | Disadvantages |
|---|---|---|
| Content-based | - Handles new items - High explainability - User independent |
- Filter bubble - Requires feature engineering - Low serendipity |
| Collaborative Filtering | - Serendipitous recommendations - No features needed - Leverages collective intelligence |
- Cold start - Sparsity problem - Scalability |
Exercise 2 (Difficulty: Medium)
Explain the TF-IDF calculation formula and manually calculate TF-IDF for the following three documents.
Document 1: "machine learning is great"
Document 2: "deep learning is powerful"
Document 3: "machine learning and deep learning"
Sample Answer
TF-IDF Calculation Formula:
$$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) $$
$$ \text{TF}(t, d) = \frac{\text{Occurrences of term } t \text{ in document } d}{\text{Total words in document } d} $$
$$ \text{IDF}(t) = \log \frac{N}{\text{df}(t)} $$
- $N$: Total number of documents = 3
- $\text{df}(t)$: Number of documents containing word $t$
Manual Calculation Example (TF-IDF for "machine" in Document 1):
TF Calculation
- Document 1: "machine learning is great" (4 words)
- Occurrences of "machine": 1
- TF("machine", Document 1) = 1 / 4 = 0.25
IDF Calculation
- Total documents $N$ = 3
- Documents containing "machine": Document 1, Document 3 → df("machine") = 2
- IDF("machine") = log(3 / 2) = log(1.5) ≈ 0.176
TF-IDF Calculation
- TF-IDF("machine", Document 1) = 0.25 × 0.176 ≈ 0.044
Verification with Python:
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: Verification with Python:
Purpose: Demonstrate machine learning model training and evaluation
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: None
"""
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
docs = [
"machine learning is great",
"deep learning is powerful",
"machine learning and deep learning"
]
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)
print("Features:", tfidf.get_feature_names_out())
print("\nTF-IDF Matrix:")
print(tfidf_matrix.toarray())
Exercise 3 (Difficulty: Medium)
In a weighted hybrid recommendation system, propose three strategies for adjusting the collaborative filtering weight (α) based on different situations.
Sample Answer
α Adjustment Strategies:
Dynamic Adjustment Based on User Rating Count
def adaptive_alpha(user_id, user_movie_matrix, min_ratings=5): """Adjust α based on user's rating count""" n_ratings = np.sum(user_movie_matrix[user_id] > 0) if n_ratings < min_ratings: # Few ratings → Content-based emphasis alpha = 0.2 elif n_ratings < 10: # Medium → Balanced alpha = 0.5 else: # Many ratings → Collaborative filtering emphasis alpha = 0.8 return alphaReason: For new users with little rating data, emphasize content-based; with sufficient data, emphasize collaborative filtering
Adjustment Based on Item Popularity
def popularity_based_alpha(item_id, user_movie_matrix): """Adjust α based on item rating count""" item_ratings = user_movie_matrix[:, item_id] n_ratings = np.sum(item_ratings > 0) if n_ratings < 5: # Niche item → Content-based alpha = 0.3 else: # Popular item → Collaborative filtering alpha = 0.7 return alphaReason: Collaborative filtering doesn't work well for niche items with few ratings
A/B Testing Optimization
def ab_test_alpha_optimization(user_groups, alpha_values, eval_metric='precision@k'): """Optimize α through A/B testing""" results = {} for alpha in alpha_values: metrics = [] for user in user_groups: recs = hybrid_recommend(user, alpha=alpha) metric = evaluate(recs, eval_metric) metrics.append(metric) results[alpha] = np.mean(metrics) optimal_alpha = max(results, key=results.get) return optimal_alphaReason: Measure performance on real data to find the optimal balance
Exercise 4 (Difficulty: Hard)
Implement a time-aware recommendation system. Assume user viewing history includes time information, and genre preferences change depending on the time of day.
Sample Answer
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
class TimeAwareRecommender:
"""Time-aware recommendation system"""
def __init__(self, decay_factor=0.9):
"""
Args:
decay_factor: Time decay coefficient (0-1)
"""
self.decay_factor = decay_factor
self.time_genre_prefs = None
def fit(self, viewing_history):
"""Learn preferences by time of day"""
# Categorize time of day
viewing_history['time_category'] = viewing_history['hour'].apply(
self._categorize_time
)
# Average rating by time × genre
self.time_genre_prefs = viewing_history.groupby(
['time_category', 'genre']
)['rating'].mean().to_dict()
return self
def _categorize_time(self, hour):
"""Categorize time of day"""
if 6 <= hour < 12:
return 'morning'
elif 12 <= hour < 18:
return 'afternoon'
elif 18 <= hour < 22:
return 'evening'
else:
return 'night'
def _time_decay(self, timestamp, current_time):
"""Time decay weight"""
hours_diff = (current_time - timestamp).total_seconds() / 3600
days_diff = hours_diff / 24
return self.decay_factor ** days_diff
def recommend(self, user_id, current_time, viewing_history,
movies_df, top_n=5):
"""Time-aware recommendation"""
current_hour = current_time.hour
time_category = self._categorize_time(current_hour)
# Get user history
user_history = viewing_history[
viewing_history['user_id'] == user_id
].copy()
# Calculate time decay weights
user_history['time_weight'] = user_history['timestamp'].apply(
lambda t: self._time_decay(t, current_time)
)
# Calculate user preferences with weighted average
user_genre_prefs = user_history.groupby('genre').apply(
lambda g: np.average(g['rating'], weights=g['time_weight'])
).to_dict()
# Get unwatched movies
watched = user_history['movie_id'].unique()
unwatched = movies_df[~movies_df.index.isin(watched)].copy()
# Calculate scores
def calculate_score(row):
genre = row['genre']
# User preference score
user_pref = user_genre_prefs.get(genre, 0)
# Time of day preference score
time_pref = self.time_genre_prefs.get(
(time_category, genre), 0
)
# Integrated score
return 0.6 * user_pref + 0.4 * time_pref
unwatched['score'] = unwatched.apply(calculate_score, axis=1)
# Top-N recommendations
recommendations = unwatched.nlargest(top_n, 'score')
return recommendations[['title', 'genre', 'score']]
# Usage example
np.random.seed(42)
# Generate viewing history with timestamps
base_time = datetime.now()
viewing_data = []
for i in range(100):
timestamp = base_time - timedelta(days=np.random.randint(0, 30))
viewing_data.append({
'user_id': np.random.randint(0, 10),
'movie_id': np.random.randint(0, 8),
'rating': np.random.randint(1, 6),
'hour': timestamp.hour,
'timestamp': timestamp
})
viewing_df = pd.DataFrame(viewing_data)
viewing_df['genre'] = viewing_df['movie_id'].apply(
lambda x: movies.iloc[x]['genre']
)
# Train and recommend
recommender = TimeAwareRecommender(decay_factor=0.95)
recommender.fit(viewing_df)
# Recommendations at different times
morning_time = datetime.now().replace(hour=9)
evening_time = datetime.now().replace(hour=20)
print("=== Recommendations at 9 AM ===")
recs_morning = recommender.recommend(
user_id=0,
current_time=morning_time,
viewing_history=viewing_df,
movies_df=movies,
top_n=3
)
print(recs_morning)
print("\n=== Recommendations at 8 PM ===")
recs_evening = recommender.recommend(
user_id=0,
current_time=evening_time,
viewing_history=viewing_df,
movies_df=movies,
top_n=3
)
print(recs_evening)
Exercise 5 (Difficulty: Hard)
Explain the advantages and implementation methods of using embedding vectors from Word2Vec or BERT instead of TF-IDF in content-based recommendation.
Sample Answer
TF-IDF vs Embedding Vectors:
| Aspect | TF-IDF | Word2Vec/BERT |
|---|---|---|
| Representation | Sparse BoW | Dense distributed representation |
| Semantic Understanding | Word occurrence only | Captures semantic similarity |
| Dimensionality | Vocabulary size (high-dimensional) | Fixed (e.g., 768 dimensions) |
| Synonym Handling | Difficult | Possible |
| Computational Cost | Low | High (especially BERT) |
Word2Vec Implementation Example:
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: Word2Vec Implementation Example:
Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner to Intermediate
Execution time: 1-5 minutes
Dependencies: None
"""
from gensim.models import Word2Vec
import numpy as np
# Text preprocessing
texts = [doc.lower().split() for doc in movies['description']]
# Train Word2Vec model
w2v_model = Word2Vec(
sentences=texts,
vector_size=100,
window=5,
min_count=1,
workers=4
)
def text_to_vector(text, model):
"""Convert text to Word2Vec vector"""
words = text.lower().split()
word_vectors = [
model.wv[word] for word in words
if word in model.wv
]
if len(word_vectors) == 0:
return np.zeros(model.vector_size)
# Average vector
return np.mean(word_vectors, axis=0)
# Movie embedding vectors
movie_embeddings = np.array([
text_to_vector(desc, w2v_model)
for desc in movies['description']
])
# Recommend using cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(movie_embeddings)
print("Word2Vec-based similarity:")
for i, title in enumerate(movies['title']):
print(f"{title} vs The Matrix: {similarity_matrix[0, i]:.3f}")
BERT Implementation Example:
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
# - transformers>=4.30.0
from transformers import BertTokenizer, BertModel
import torch
# Load BERT model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
def get_bert_embedding(text):
"""Get text embedding with BERT"""
inputs = tokenizer(
text,
return_tensors='pt',
padding=True,
truncation=True,
max_length=128
)
with torch.no_grad():
outputs = model(**inputs)
# Use [CLS] token embedding
embedding = outputs.last_hidden_state[:, 0, :].numpy()
return embedding.flatten()
# BERT embedding vectors
bert_embeddings = np.array([
get_bert_embedding(desc)
for desc in movies['description']
])
# Cosine similarity
bert_similarity = cosine_similarity(bert_embeddings)
print("\nBERT-based similarity:")
for i, title in enumerate(movies['title']):
print(f"{title} vs The Matrix: {bert_similarity[0, i]:.3f}")
Advantages:
- Semantic Similarity: Recognizes "car" and "automobile" as similar
- Context Understanding: More accurately captures sentence meaning
- Polysemy Handling: Understands word meaning based on context
- Dimensionality Reduction: Low-dimensional dense vector representation
References
- Aggarwal, C. C. (2016). Recommender Systems: The Textbook. Springer.
- Ricci, F., Rokach, L., & Shapira, B. (2015). Recommender Systems Handbook (2nd ed.). Springer.
- Lops, P., de Gemmis, M., & Semeraro, G. (2011). Content-based Recommender Systems: State of the Art and Trends. In Recommender Systems Handbook (pp. 73-105). Springer.
- Burke, R. (2002). Hybrid Recommender Systems: Survey and Experiments. User Modeling and User-Adapted Interaction, 12(4), 331-370.
- Adomavicius, G., & Tuzhilin, A. (2011). Context-Aware Recommender Systems. In Recommender Systems Handbook (pp. 217-253). Springer.