
Chapter 1: Recommendation Systems Fundamentals

Basic Concepts of Recommendation Systems and Data Processing Foundations

📖 Reading Time: 25-30 minutes 📊 Difficulty: Beginner 💻 Code Examples: 9 📝 Exercises: 5

This chapter covers the fundamentals of recommendation systems: what they are, how recommendation tasks are classified, key challenges such as the cold start problem, and how to perform preprocessing on the MovieLens dataset.

Learning Objectives

By reading this chapter, you will be able to:

• Explain what recommendation systems are and where they are applied
• Distinguish explicit from implicit feedback, and the tasks of rating prediction, Top-N recommendation, and ranking
• Evaluate recommendations with Precision, Recall, F1, NDCG, MAP, coverage, and diversity
• Describe key challenges: cold start, data sparsity, scalability, and the filter bubble
• Build a User-Item matrix and apply appropriate preprocessing and train-test splits

1.1 What are Recommendation Systems

Importance of Personalization

Recommendation Systems are technologies that suggest optimal items (products, content, services, etc.) based on user preferences and behavior.

"In an age of information overload, recommendation systems play a crucial role in connecting users with valuable content."

Applications of Recommendation Systems

| Industry | Application Examples | Recommendation Target |
|---|---|---|
| E-commerce | Amazon, Rakuten | Products |
| Video Streaming | Netflix, YouTube | Movies, videos |
| Music Streaming | Spotify, Apple Music | Songs, playlists |
| Social Networks | Facebook, Twitter | Friends, posts |
| News | Google News | Articles |
| Job Search | LinkedIn | Jobs, candidates |

Business Value

Effective recommendations contribute directly to revenue growth (cross-selling and up-selling), user engagement, and customer satisfaction.

Types of Recommendations

graph TD
    A[Recommendation Systems] --> B[Collaborative Filtering]
    A --> C[Content-Based]
    A --> D[Hybrid]
    B --> B1[User-based]
    B --> B2[Item-based]
    B --> B3[Matrix Factorization]
    C --> C1[Feature Extraction]
    C --> C2[Similarity Computation]
    D --> D1[Weighted Hybrid]
    D --> D2[Switching Hybrid]
    D --> D3[Feature Combination]
    style A fill:#e8f5e9
    style B fill:#e3f2fd
    style C fill:#fff3e0
    style D fill:#f3e5f5

1.2 Classification of Recommendation Tasks

Explicit vs Implicit Feedback

User feedback can be explicit or implicit.

| Feedback Type | Description | Examples | Advantages | Disadvantages |
|---|---|---|---|---|
| Explicit | Direct user ratings | Star ratings, likes, reviews | Clear preference information | Limited data |
| Implicit | Inferred from behavior | Clicks, watch time, purchases | Abundant data | Ambiguous interpretation |

Example of Explicit Feedback

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0

"""
Example: Explicit feedback (movie rating data)

Purpose: Demonstrate data manipulation and preprocessing
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: None
"""

import pandas as pd
import numpy as np

# Explicit Feedback: Movie rating data
np.random.seed(42)
n_ratings = 100

explicit_data = pd.DataFrame({
    'user_id': np.random.randint(1, 21, n_ratings),
    'item_id': np.random.randint(1, 51, n_ratings),
    'rating': np.random.randint(1, 6, n_ratings),  # Ratings from 1-5
    'timestamp': pd.date_range('2024-01-01', periods=n_ratings, freq='H')
})

print("=== Explicit Feedback (Rating Data) ===")
print(explicit_data.head(10))
print(f"\nRating Distribution:")
print(explicit_data['rating'].value_counts().sort_index())
print(f"\nAverage Rating: {explicit_data['rating'].mean():.2f}")
print(f"Rating Standard Deviation: {explicit_data['rating'].std():.2f}")

Output:

=== Explicit Feedback (Rating Data) ===
   user_id  item_id  rating           timestamp
0        7       40       4 2024-01-01 00:00:00
1       20       34       3 2024-01-01 01:00:00
2       18       48       1 2024-01-01 02:00:00
3       11       14       5 2024-01-01 03:00:00
4        6       21       1 2024-01-01 04:00:00
5       17       28       4 2024-01-01 05:00:00
6        3        9       1 2024-01-01 06:00:00
7        9       37       4 2024-01-01 07:00:00
8       20       17       5 2024-01-01 08:00:00
9        8       46       2 2024-01-01 09:00:00

Rating Distribution:
1    23
2    19
3    18
4    21
5    19
Name: rating, dtype: int64

Average Rating: 2.94
Rating Standard Deviation: 1.45

Example of Implicit Feedback

# Implicit Feedback: Viewing data
implicit_data = pd.DataFrame({
    'user_id': np.random.randint(1, 21, n_ratings),
    'item_id': np.random.randint(1, 51, n_ratings),
    'watch_time': np.random.randint(1, 120, n_ratings),  # minutes
    'completed': np.random.choice([0, 1], n_ratings, p=[0.3, 0.7]),
    'timestamp': pd.date_range('2024-01-01', periods=n_ratings, freq='H')
})

# Infer preferences from implicit feedback
# Assume "liked" if watch time is long or video was completed
implicit_data['preference'] = (
    (implicit_data['watch_time'] > 60) |
    (implicit_data['completed'] == 1)
).astype(int)

print("\n=== Implicit Feedback (Viewing Data) ===")
print(implicit_data.head(10))
print(f"\nCompletion Rate: {implicit_data['completed'].mean():.1%}")
print(f"Inferred Preference Rate: {implicit_data['preference'].mean():.1%}")

Rating Prediction

Rating Prediction is the task of predicting the rating a user would give to an item they have not yet rated.

$$ \hat{r}_{ui} = f(\text{user}_u, \text{item}_i) $$

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0

"""
Example: Building the User-Item rating matrix used for rating prediction

Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: None
"""

import numpy as np

# Build User-Item matrix (simplified version)
ratings_matrix = explicit_data.pivot_table(
    index='user_id',
    columns='item_id',
    values='rating',
    aggfunc='mean'
)

print("=== User-Item Rating Matrix ===")
print(f"Shape: {ratings_matrix.shape}")
print(f"Missing Rate: {ratings_matrix.isnull().sum().sum() / (ratings_matrix.shape[0] * ratings_matrix.shape[1]):.1%}")
print(f"\nMatrix (sample):")
print(ratings_matrix.iloc[:5, :5])
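
The matrix above organizes the observed ratings, but does not by itself predict anything. A minimal sketch of an actual predictor, reusing the explicit_data DataFrame from earlier, is the classic bias baseline r̂_ui = μ + b_u + b_i, where μ is the global mean rating and b_u, b_i are per-user and per-item offsets (the helper name predict_rating is ours, introduced for illustration):

import numpy as np

# Global mean rating (mu)
mu = explicit_data['rating'].mean()

# User bias b_u and item bias b_i: deviation of each user's / item's
# average rating from the global mean
user_bias = explicit_data.groupby('user_id')['rating'].mean() - mu
item_bias = explicit_data.groupby('item_id')['rating'].mean() - mu

def predict_rating(user_id, item_id):
    """Bias baseline: r_hat_ui = mu + b_u + b_i (0 bias for unseen IDs)."""
    b_u = user_bias.get(user_id, 0.0)
    b_i = item_bias.get(item_id, 0.0)
    return float(np.clip(mu + b_u + b_i, 1, 5))  # clip to the 1-5 scale

print(f"Predicted rating for user 7, item 40: {predict_rating(7, 40):.2f}")

Despite its simplicity, this baseline is hard to beat on rating prediction, which is why bias terms of this form also appear inside matrix factorization models (Koren et al., 2009).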

Top-N Recommendation

Top-N Recommendation is the task of recommending the top N items to each user.

# Simple Top-N recommendation (popularity-based)
def popularity_based_recommendation(data, n=5):
    """Popularity-based Top-N recommendation"""
    item_popularity = data.groupby('item_id')['rating'].agg(['count', 'mean'])
    item_popularity['score'] = (
        item_popularity['count'] * 0.3 +
        item_popularity['mean'] * 0.7
    )
    top_n = item_popularity.nlargest(n, 'score')
    return top_n

top_items = popularity_based_recommendation(explicit_data, n=5)

print("\n=== Top-5 Recommended Items (Popularity-based) ===")
print(top_items)
print(f"\nRationale:")
print("- Score = Rating Count × 0.3 + Average Rating × 0.7")

Ranking Problems

Ranking problems involve ordering candidate items by their predicted relevance to the user.

# Ranking example: Order items by score for each user
def rank_items_for_user(user_id, data):
    """Item ranking for a specific user"""
    # Consider user's past rating tendencies
    user_ratings = data[data['user_id'] == user_id]
    user_avg_rating = user_ratings['rating'].mean()

    # Information about all items
    all_items = data.groupby('item_id')['rating'].agg(['mean', 'count'])

    # Scoring (simplified version; note: the user_avg_rating term is the same
    # for every item, so it shifts scores without changing the ranking order)
    all_items['score'] = (
        all_items['mean'] * 0.5 +
        user_avg_rating * 0.3 +
        np.log1p(all_items['count']) * 0.2
    )

    ranked_items = all_items.sort_values('score', ascending=False)
    return ranked_items

# Ranking for User 7
user_ranking = rank_items_for_user(7, explicit_data)
print("\n=== Item Ranking for User 7 (Top 10) ===")
print(user_ranking.head(10))

1.3 Evaluation Metrics

Precision, Recall, F1

These are basic metrics for measuring the accuracy of recommendation systems.

$$ \text{Precision@K} = \frac{\text{Number of Relevant Recommended Items}}{K} $$

$$ \text{Recall@K} = \frac{\text{Number of Relevant Recommended Items}}{\text{Total Number of Relevant Items}} $$

$$ \text{F1@K} = 2 \cdot \frac{\text{Precision@K} \cdot \text{Recall@K}}{\text{Precision@K} + \text{Recall@K}} $$

def precision_recall_at_k(recommended, relevant, k):
    """Calculate Precision@K and Recall@K"""
    recommended_k = recommended[:k]

    # Items that are both recommended and relevant
    hits = len(set(recommended_k) & set(relevant))

    precision = hits / k if k > 0 else 0
    recall = hits / len(relevant) if len(relevant) > 0 else 0

    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0

    return precision, recall, f1

# Example: Recommendations to a user
recommended_items = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]  # Recommended items
relevant_items = [2, 3, 5, 8, 11, 15]  # Actually relevant items

for k in [5, 10]:
    p, r, f = precision_recall_at_k(recommended_items, relevant_items, k)
    print(f"\n=== Evaluation at K={k} ===")
    print(f"Precision@{k}: {p:.3f}")
    print(f"Recall@{k}: {r:.3f}")
    print(f"F1@{k}: {f:.3f}")

Output:

=== Evaluation at K=5 ===
Precision@5: 0.400
Recall@5: 0.333
F1@5: 0.364

=== Evaluation at K=10 ===
Precision@10: 0.400
Recall@10: 0.667
F1@10: 0.500

NDCG (Normalized Discounted Cumulative Gain)

NDCG is a metric that evaluates ranking quality. It emphasizes placing highly relevant items at the top.

$$ \text{DCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i + 1)} $$

$$ \text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}} $$

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0

import numpy as np

def dcg_at_k(relevances, k):
    """Calculate DCG@K"""
    relevances = np.array(relevances[:k])
    if relevances.size:
        discounts = np.log2(np.arange(2, relevances.size + 2))
        return np.sum((2**relevances - 1) / discounts)
    return 0.0

def ndcg_at_k(relevances, k):
    """Calculate NDCG@K"""
    dcg = dcg_at_k(relevances, k)
    ideal_relevances = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal_relevances, k)

    if idcg == 0:
        return 0.0
    return dcg / idcg

# Example: Relevance scores of recommendation results (5-level scale)
relevances = [3, 2, 5, 0, 1, 4, 2, 0, 3, 1]  # Relevance in recommendation order

print("=== NDCG Evaluation ===")
for k in [3, 5, 10]:
    ndcg = ndcg_at_k(relevances, k)
    print(f"NDCG@{k}: {ndcg:.3f}")

print(f"\nRelevance List (recommendation order): {relevances}")
print(f"Ideal Order: {sorted(relevances, reverse=True)}")

MAP (Mean Average Precision)

MAP is the mean of Average Precision across all users.

$$ \text{AP@K} = \frac{1}{\min(m, K)} \sum_{k=1}^{K} \text{Precision@k} \cdot \text{rel}(k) $$

$$ \text{MAP@K} = \frac{1}{|U|} \sum_{u \in U} \text{AP@K}_u $$

def average_precision_at_k(recommended, relevant, k):
    """Calculate Average Precision@K"""
    recommended_k = recommended[:k]

    score = 0.0
    num_hits = 0.0

    for i, item in enumerate(recommended_k):
        if item in relevant:
            num_hits += 1.0
            score += num_hits / (i + 1.0)

    if len(relevant) == 0:
        return 0.0

    return score / min(len(relevant), k)

# Example: MAP calculation for multiple users
users_recommendations = [
    ([1, 3, 5, 7, 9], [3, 5, 9]),      # User 1
    ([2, 4, 6, 8, 10], [4, 8]),        # User 2
    ([1, 2, 3, 4, 5], [1, 2, 5]),      # User 3
]

aps = []
for recommended, relevant in users_recommendations:
    ap = average_precision_at_k(recommended, relevant, k=5)
    aps.append(ap)
    print(f"Recommended: {recommended}, Relevant: {relevant} -> AP@5: {ap:.3f}")

map_score = np.mean(aps)
print(f"\n=== MAP@5: {map_score:.3f} ===")

Coverage, Diversity, Serendipity

These metrics evaluate the quality of recommendations from multiple perspectives.

| Metric | Description | Purpose |
|---|---|---|
| Coverage | Proportion of all items that get recommended | Long-tail item discovery |
| Diversity | Variety within a recommendation list | Filter bubble avoidance |
| Serendipity | Unexpected yet relevant recommendations | Promoting new discoveries |

def calculate_coverage(all_recommendations, total_items):
    """Calculate coverage"""
    unique_recommended = set()
    for recs in all_recommendations:
        unique_recommended.update(recs)

    coverage = len(unique_recommended) / total_items
    return coverage

def calculate_diversity(recommendations):
    """Calculate diversity of recommendation list (uniqueness rate)"""
    unique_items = len(set(recommendations))
    diversity = unique_items / len(recommendations)
    return diversity

# Example: Calculate coverage and diversity
all_recs = [
    [1, 2, 3, 4, 5],
    [1, 3, 6, 7, 8],
    [2, 4, 9, 10, 11],
    [1, 5, 12, 13, 14]
]

total_items = 50  # Total number of items

coverage = calculate_coverage(all_recs, total_items)
print(f"=== Coverage and Diversity ===")
print(f"Coverage: {coverage:.1%}")
print(f"Unique Recommended Items: {len(set([item for recs in all_recs for item in recs]))}")

for i, recs in enumerate(all_recs):
    diversity = calculate_diversity(recs)
    print(f"User {i+1} Recommendation Diversity: {diversity:.1%}")

1.4 Challenges in Recommendation Systems

Cold Start Problem

The Cold Start Problem refers to insufficient data for new users or new items.

| Type | Description | Solutions |
|---|---|---|
| User Cold Start | Unknown preferences for new users | Recommend popular items, utilize demographic information |
| Item Cold Start | No ratings for new items | Content-based recommendation, utilize metadata |
| System Cold Start | Lack of data for entire system | External data, crowdsourcing |
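
To make the user cold-start fallback concrete, the sketch below reuses the popularity_based_recommendation and rank_items_for_user functions from Section 1.2; the threshold constant is an assumption chosen for illustration:

MIN_RATINGS_FOR_PERSONALIZATION = 5  # assumed threshold for this sketch

def recommend_with_cold_start_fallback(user_id, data, n=5):
    """Fall back to popular items for users with too little history."""
    n_user_ratings = (data['user_id'] == user_id).sum()
    if n_user_ratings < MIN_RATINGS_FOR_PERSONALIZATION:
        # Cold-start path: globally popular, well-rated items
        return popularity_based_recommendation(data, n=n)
    # Warm path: hand off to the personalized ranker from Section 1.2
    return rank_items_for_user(user_id, data).head(n)

# A brand-new user (no history) gets the popularity fallback:
print(recommend_with_cold_start_fallback(user_id=999, data=explicit_data, n=5))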

Data Sparsity

Data Sparsity is the problem where most of the User-Item matrix consists of missing values.

graph LR
    A[User-Item Matrix] --> B[Rated: 1%]
    A --> C[Unrated: 99%]
    B --> D[Collaborative Filtering Possible]
    C --> E[Recommendation Difficult]
    style A fill:#fff3e0
    style B fill:#c8e6c9
    style C fill:#ffcdd2
    style D fill:#e8f5e9
    style E fill:#ffebee

Scalability

Scalability is the problem of computational complexity as the number of users and items grows.

Solutions: Dimensionality reduction (Matrix Factorization), Approximate Nearest Neighbor (ANN), distributed processing
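
As a small illustration of the dimensionality-reduction option, the following sketch (matrix sizes and the component count are arbitrary choices for this example, and scikit-learn is an added dependency) compresses a sparse interaction matrix with TruncatedSVD so that similarity searches can operate on short dense vectors instead of the full item dimension:

# Requirements (assumed): scipy>=1.10, scikit-learn>=1.2

import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# A random sparse "interaction" matrix standing in for real ratings:
# 1,000 users x 500 items at 1% density (~99% sparsity, as above)
interactions = sp.random(1000, 500, density=0.01, random_state=42).tocsr()

# Compress each user into a 10-dimensional latent vector
svd = TruncatedSVD(n_components=10, random_state=42)
user_factors = svd.fit_transform(interactions)

print(f"Original shape: {interactions.shape}")  # (1000, 500)
print(f"Latent shape: {user_factors.shape}")    # (1000, 10)
print(f"Explained variance (sum): {svd.explained_variance_ratio_.sum():.1%}")

Similarity computations now scale with the 10-dimensional factors rather than the 500-item axis; ANN libraries apply the same idea to the neighbor search itself.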

Filter Bubble

Filter Bubble is the problem where only similar items are recommended, reducing diversity.


1.5 Datasets and Preprocessing

MovieLens Dataset

MovieLens is the most widely used dataset in recommendation systems research.

| Version | Ratings | Users | Movies | Use Case |
|---|---|---|---|---|
| 100K | 100,000 | 943 | 1,682 | Learning, prototyping |
| 1M | 1 million | 6,040 | 3,706 | Research, evaluation |
| 10M | 10 million | 71,567 | 10,681 | Scalability testing |
| 25M | 25 million | 162,541 | 62,423 | Large-scale experiments |

User-Item Matrix

The User-Item Matrix is the fundamental data structure for recommendation systems.

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
# - scipy>=1.10.0

"""
Example: Building the User-Item matrix, the fundamental data structure of recommendation systems

Purpose: Demonstrate data manipulation and preprocessing
Target: Intermediate
Execution time: ~5 seconds
Dependencies: None
"""

import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix

# Create sample data (MovieLens-style)
np.random.seed(42)
n_users = 100
n_items = 50
n_ratings = 500

ratings_data = pd.DataFrame({
    'user_id': np.random.randint(1, n_users + 1, n_ratings),
    'item_id': np.random.randint(1, n_items + 1, n_ratings),
    'rating': np.random.randint(1, 6, n_ratings),
    'timestamp': pd.date_range('2024-01-01', periods=n_ratings, freq='H')
})

# Remove duplicates (keep latest rating for same user-item pair)
ratings_data = ratings_data.sort_values('timestamp').drop_duplicates(
    subset=['user_id', 'item_id'],
    keep='last'
)

print("=== Rating Data ===")
print(ratings_data.head(10))
print(f"\nTotal Ratings: {len(ratings_data)}")
print(f"Unique Users: {ratings_data['user_id'].nunique()}")
print(f"Unique Items: {ratings_data['item_id'].nunique()}")
print(f"Rating Distribution:\n{ratings_data['rating'].value_counts().sort_index()}")

# Build User-Item matrix
user_item_matrix = ratings_data.pivot_table(
    index='user_id',
    columns='item_id',
    values='rating',
    fill_value=0
)

print(f"\n=== User-Item Matrix ===")
print(f"Shape: {user_item_matrix.shape}")
print(f"Density: {(user_item_matrix > 0).sum().sum() / (user_item_matrix.shape[0] * user_item_matrix.shape[1]):.1%}")
print(f"\nMatrix Sample (first 5 users × 5 items):")
print(user_item_matrix.iloc[:5, :5])

# Convert to sparse matrix (memory efficiency)
sparse_matrix = csr_matrix(user_item_matrix.values)
# Count all three CSR arrays (data, indices, indptr) for a fair comparison
sparse_bytes = (sparse_matrix.data.nbytes
                + sparse_matrix.indices.nbytes
                + sparse_matrix.indptr.nbytes)
print(f"\nSparse Matrix Size: {sparse_bytes / 1024:.2f} KB")
print(f"Dense Matrix Size: {user_item_matrix.values.nbytes / 1024:.2f} KB")
print(f"Memory Reduction Rate: {(1 - sparse_bytes / user_item_matrix.values.nbytes):.1%}")

Train-Test Split Strategies

For recommendation systems, time-series-aware splitting is important.

from sklearn.model_selection import train_test_split

# 1. Random split (simple but ignores time series)
train_random, test_random = train_test_split(
    ratings_data,
    test_size=0.2,
    random_state=42
)

print("=== 1. Random Split ===")
print(f"Training Data: {len(train_random)} samples")
print(f"Test Data: {len(test_random)} samples")

# 2. Temporal split (more realistic)
ratings_data_sorted = ratings_data.sort_values('timestamp')
split_idx = int(len(ratings_data_sorted) * 0.8)

train_temporal = ratings_data_sorted.iloc[:split_idx]
test_temporal = ratings_data_sorted.iloc[split_idx:]

print("\n=== 2. Temporal Split ===")
print(f"Training Period: {train_temporal['timestamp'].min()} ~ {train_temporal['timestamp'].max()}")
print(f"Test Period: {test_temporal['timestamp'].min()} ~ {test_temporal['timestamp'].max()}")
print(f"Training Data: {len(train_temporal)} samples")
print(f"Test Data: {len(test_temporal)} samples")

# 3. Per-user split (Leave-One-Out)
def leave_one_out_split(data):
    """Put each user's latest rating into test set"""
    train_list = []
    test_list = []

    for user_id, group in data.groupby('user_id'):
        group_sorted = group.sort_values('timestamp')
        if len(group_sorted) > 1:
            train_list.append(group_sorted.iloc[:-1])
            test_list.append(group_sorted.iloc[-1:])
        else:
            train_list.append(group_sorted)

    train = pd.concat(train_list)
    test = pd.concat(test_list) if test_list else pd.DataFrame()

    return train, test

train_loo, test_loo = leave_one_out_split(ratings_data)

print("\n=== 3. Leave-One-Out Split ===")
print(f"Training Data: {len(train_loo)} samples")
print(f"Test Data: {len(test_loo)} samples")
print(f"Test Users: {test_loo['user_id'].nunique()}")

Python Preprocessing

Practical example of data preprocessing for recommendation systems.

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0

import pandas as pd
import numpy as np

class RecommendationDataPreprocessor:
    """Data preprocessing class for recommendation systems"""

    def __init__(self, min_user_ratings=5, min_item_ratings=5):
        self.min_user_ratings = min_user_ratings
        self.min_item_ratings = min_item_ratings
        self.user_mapping = {}
        self.item_mapping = {}

    def filter_rare_users_items(self, data):
        """Filter out users and items with few ratings"""
        print("=== Before Filtering ===")
        print(f"Users: {data['user_id'].nunique()}")
        print(f"Items: {data['item_id'].nunique()}")
        print(f"Ratings: {len(data)}")

        # Filter users
        user_counts = data['user_id'].value_counts()
        valid_users = user_counts[user_counts >= self.min_user_ratings].index
        data = data[data['user_id'].isin(valid_users)]

        # Filter items
        item_counts = data['item_id'].value_counts()
        valid_items = item_counts[item_counts >= self.min_item_ratings].index
        data = data[data['item_id'].isin(valid_items)]

        print("\n=== After Filtering ===")
        print(f"Users: {data['user_id'].nunique()}")
        print(f"Items: {data['item_id'].nunique()}")
        print(f"Ratings: {len(data)}")

        return data

    def create_mappings(self, data):
        """Map user and item IDs to continuous integers"""
        unique_users = sorted(data['user_id'].unique())
        unique_items = sorted(data['item_id'].unique())

        self.user_mapping = {uid: idx for idx, uid in enumerate(unique_users)}
        self.item_mapping = {iid: idx for idx, iid in enumerate(unique_items)}

        data['user_idx'] = data['user_id'].map(self.user_mapping)
        data['item_idx'] = data['item_id'].map(self.item_mapping)

        print("\n=== ID Mapping ===")
        print(f"User ID Range: {data['user_id'].min()} ~ {data['user_id'].max()}")
        print(f"User Index Range: {data['user_idx'].min()} ~ {data['user_idx'].max()}")
        print(f"Item ID Range: {data['item_id'].min()} ~ {data['item_id'].max()}")
        print(f"Item Index Range: {data['item_idx'].min()} ~ {data['item_idx'].max()}")

        return data

    def normalize_ratings(self, data, method='mean'):
        """Normalize rating values"""
        if method == 'mean':
            # Subtract mean
            user_means = data.groupby('user_id')['rating'].transform('mean')
            data['rating_normalized'] = data['rating'] - user_means
        elif method == 'minmax':
            # Scale to [0, 1]
            data['rating_normalized'] = (data['rating'] - data['rating'].min()) / (
                data['rating'].max() - data['rating'].min()
            )

        print(f"\n=== Rating Normalization ({method}) ===")
        print(f"Original Rating Range: [{data['rating'].min()}, {data['rating'].max()}]")
        print(f"Normalized Range: [{data['rating_normalized'].min():.2f}, {data['rating_normalized'].max():.2f}]")

        return data

# Execute preprocessing
preprocessor = RecommendationDataPreprocessor(
    min_user_ratings=3,
    min_item_ratings=3
)

# Filter data
filtered_data = preprocessor.filter_rare_users_items(ratings_data)

# ID mapping
mapped_data = preprocessor.create_mappings(filtered_data)

# Rating normalization
normalized_data = preprocessor.normalize_ratings(mapped_data, method='mean')

print("\n=== Preprocessed Data (sample) ===")
print(normalized_data[['user_id', 'user_idx', 'item_id', 'item_idx',
                        'rating', 'rating_normalized']].head(10))

1.6 Chapter Summary

What We Learned

  1. Role of Recommendation Systems

    • Suggest optimal content in an age of information overload
    • Contribute to revenue growth, engagement, and customer satisfaction
    • Collaborative filtering, content-based, and hybrid approaches
  2. Types of Recommendation Tasks

    • Explicit vs Implicit Feedback
    • Rating Prediction, Top-N Recommendation, Ranking
    • Selecting appropriate methods for each task
  3. Evaluation Metrics

    • Precision, Recall, F1: Basic accuracy metrics
    • NDCG: Evaluating ranking quality
    • MAP: Average precision evaluation
    • Coverage, Diversity, Serendipity: Recommendation quality
  4. Key Challenges

    • Cold Start Problem: Handling new users and items
    • Data Sparsity: Managing sparse data
    • Scalability: Processing large-scale data
    • Filter Bubble: Ensuring diversity
  5. Practical Data Processing

    • Utilizing the MovieLens dataset
    • Building User-Item matrices
    • Appropriate Train-Test splitting
    • Building preprocessing pipelines

Principles of Recommendation System Design

| Principle | Description |
|---|---|
| User-Centric Design | Prioritize user satisfaction and experience |
| Multi-faceted Evaluation | Consider not just accuracy but also diversity and novelty |
| Temporal Awareness | Respect time series in splitting and evaluation |
| Scalability | Design to handle large-scale data |
| Continuous Improvement | Improve through A/B testing and regular evaluation |

Next Chapter

In Chapter 2, we will learn about Collaborative Filtering, covering user-based collaborative filtering, item-based collaborative filtering, similarity computation methods including cosine similarity and Pearson correlation, nearest neighbor search and k-NN algorithms, and practical implementation and evaluation techniques.


Exercises

Problem 1 (Difficulty: easy)

Explain the difference between Explicit Feedback and Implicit Feedback, and describe the advantages and disadvantages of each.

Solution

Answer:

Explicit Feedback:

• Direct ratings given by users, such as star ratings, likes, and reviews
• Advantage: clear, unambiguous preference information
• Disadvantage: limited in volume, since most users rarely rate

Implicit Feedback:

• Preferences inferred from behavior, such as clicks, watch time, and purchases
• Advantage: abundant data collected without extra user effort
• Disadvantage: interpretation is ambiguous (a click does not necessarily mean a "like")

Use Cases:

• Explicit feedback suits rating prediction, where precise preference levels matter
• Implicit feedback suits large-scale Top-N recommendation, where rating data is scarce

Problem 2 (Difficulty: medium)

Calculate Precision@5 and Recall@5 for the following recommendation results.

recommended = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
relevant = [2, 3, 5, 8, 11, 15, 20]
Solution
def precision_recall_at_k(recommended, relevant, k):
    """Calculate Precision@K and Recall@K"""
    recommended_k = recommended[:k]

    # Items that are both recommended and relevant
    hits = len(set(recommended_k) & set(relevant))

    precision = hits / k if k > 0 else 0
    recall = hits / len(relevant) if len(relevant) > 0 else 0

    return precision, recall

recommended = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
relevant = [2, 3, 5, 8, 11, 15, 20]

precision, recall = precision_recall_at_k(recommended, relevant, k=5)

print("=== Calculation Process ===")
print(f"Recommended Items (top 5): {recommended[:5]}")
print(f"Relevant Items: {relevant}")
print(f"Hits: {set(recommended[:5]) & set(relevant)}")
print(f"Hit Count: {len(set(recommended[:5]) & set(relevant))}")
print(f"\nPrecision@5 = {len(set(recommended[:5]) & set(relevant))} / 5 = {precision:.3f}")
print(f"Recall@5 = {len(set(recommended[:5]) & set(relevant))} / {len(relevant)} = {recall:.3f}")

Output:

=== Calculation Process ===
Recommended Items (top 5): [1, 3, 5, 7, 9]
Relevant Items: [2, 3, 5, 8, 11, 15, 20]
Hits: {3, 5}
Hit Count: 2

Precision@5 = 2 / 5 = 0.400
Recall@5 = 2 / 7 = 0.286

Problem 3 (Difficulty: medium)

Explain the three types of Cold Start problems (User, Item, System) and propose solutions for each.

Solution

Answer:

1. User Cold Start (New User Problem):

• New users have no rating history, so their preferences are unknown
• Solutions: recommend popular items, use demographic information (age, region, etc.), ask for a few preferences during onboarding

2. Item Cold Start (New Item Problem):

• New items have no ratings, so collaborative filtering cannot score them
• Solutions: content-based recommendation using metadata such as genre, description, and tags

3. System Cold Start (Overall System Problem):

• A newly launched service lacks data for all users and items
• Solutions: import external data, use crowdsourcing, start with non-personalized (popularity-based or editorial) recommendations

Real-world Examples:

• Video and music streaming services commonly ask new users to pick favorite genres or titles at sign-up, softening user cold start until enough behavior data accumulates

Problem 4 (Difficulty: hard)

For the following data, build a User-Item matrix, implement temporal splitting (80% training, 20% test), and calculate the matrix density.

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0

"""
Example: For the following data, build a User-Item matrix, implement 

Purpose: Demonstrate data manipulation and preprocessing
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: None
"""

import pandas as pd
import numpy as np

np.random.seed(42)
data = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2, 3, 3, 4, 4, 5],
    'item_id': [10, 20, 10, 30, 40, 20, 30, 10, 50, 40],
    'rating': [5, 4, 3, 5, 2, 4, 5, 3, 4, 5],
    'timestamp': pd.date_range('2024-01-01', periods=10, freq='D')
})
Solution
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0

"""
Example: For the following data, build a User-Item matrix, implement 

Purpose: Demonstrate data manipulation and preprocessing
Target: Beginner to Intermediate
Execution time: 1-5 minutes
Dependencies: None
"""

import pandas as pd
import numpy as np

np.random.seed(42)
data = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2, 3, 3, 4, 4, 5],
    'item_id': [10, 20, 10, 30, 40, 20, 30, 10, 50, 40],
    'rating': [5, 4, 3, 5, 2, 4, 5, 3, 4, 5],
    'timestamp': pd.date_range('2024-01-01', periods=10, freq='D')
})

print("=== Original Data ===")
print(data)

# Build User-Item matrix
user_item_matrix = data.pivot_table(
    index='user_id',
    columns='item_id',
    values='rating',
    fill_value=0
)

print("\n=== User-Item Matrix ===")
print(user_item_matrix)

# Calculate density
total_cells = user_item_matrix.shape[0] * user_item_matrix.shape[1]
non_zero_cells = (user_item_matrix > 0).sum().sum()
density = non_zero_cells / total_cells

print(f"\n=== Matrix Statistics ===")
print(f"Shape: {user_item_matrix.shape}")
print(f"Total Cells: {total_cells}")
print(f"Rated Cells: {non_zero_cells}")
print(f"Density: {density:.1%}")
print(f"Sparsity: {(1 - density):.1%}")

# Temporal split
data_sorted = data.sort_values('timestamp')
split_idx = int(len(data_sorted) * 0.8)

train_data = data_sorted.iloc[:split_idx]
test_data = data_sorted.iloc[split_idx:]

print("\n=== Temporal Split ===")
print(f"Training Data Count: {len(train_data)}")
print(f"Test Data Count: {len(test_data)}")
print(f"\nTraining Period: {train_data['timestamp'].min()} ~ {train_data['timestamp'].max()}")
print(f"Test Period: {test_data['timestamp'].min()} ~ {test_data['timestamp'].max()}")

print("\nTraining Data:")
print(train_data)
print("\nTest Data:")
print(test_data)

# User-Item matrix for training and test data
train_matrix = train_data.pivot_table(
    index='user_id',
    columns='item_id',
    values='rating',
    fill_value=0
)

print("\n=== Training Data User-Item Matrix ===")
print(train_matrix)

Output:

=== Original Data ===
   user_id  item_id  rating  timestamp
0        1       10       5 2024-01-01
1        1       20       4 2024-01-02
2        2       10       3 2024-01-03
3        2       30       5 2024-01-04
4        2       40       2 2024-01-05
5        3       20       4 2024-01-06
6        3       30       5 2024-01-07
7        4       10       3 2024-01-08
8        4       50       4 2024-01-09
9        5       40       5 2024-01-10

=== User-Item Matrix ===
item_id  10  20  30  40  50
user_id
1         5   4   0   0   0
2         3   0   5   2   0
3         0   4   5   0   0
4         3   0   0   0   4
5         0   0   0   5   0

=== Matrix Statistics ===
Shape: (5, 5)
Total Cells: 25
Rated Cells: 10
Density: 40.0%
Sparsity: 60.0%

=== Temporal Split ===
Training Data Count: 8
Test Data Count: 2

Training Period: 2024-01-01 ~ 2024-01-08
Test Period: 2024-01-09 ~ 2024-01-10

Training Data:
   user_id  item_id  rating  timestamp
0        1       10       5 2024-01-01
1        1       20       4 2024-01-02
2        2       10       3 2024-01-03
3        2       30       5 2024-01-04
4        2       40       2 2024-01-05
5        3       20       4 2024-01-06
6        3       30       5 2024-01-07
7        4       10       3 2024-01-08

Test Data:
   user_id  item_id  rating  timestamp
8        4       50       4 2024-01-09
9        5       40       5 2024-01-10

=== Training Data User-Item Matrix ===
item_id  10  20  30  40
user_id
1         5   4   0   0
2         3   0   5   2
3         0   4   5   0
4         3   0   0   0

Problem 5 (Difficulty: hard)

Implement a function to calculate NDCG@5 and evaluate the quality of the following recommendation results. Relevance scores are on a 5-level scale (0-4).

relevances = [3, 2, 0, 1, 4, 0, 2, 3, 1, 0]  # Relevance in recommendation order
Solution
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0

import numpy as np

def dcg_at_k(relevances, k):
    """Calculate DCG@K

    DCG@K = Σ (2^rel_i - 1) / log2(i + 1)
    """
    relevances = np.array(relevances[:k])
    if relevances.size:
        # Discount factor for position i: log2(i + 1)
        # Starting from i=1, so log2(2), log2(3), ...
        discounts = np.log2(np.arange(2, relevances.size + 2))
        gains = 2**relevances - 1
        dcg = np.sum(gains / discounts)
        return dcg
    return 0.0

def ndcg_at_k(relevances, k):
    """Calculate NDCG@K

    NDCG@K = DCG@K / IDCG@K
    """
    dcg = dcg_at_k(relevances, k)

    # Ideal DCG: DCG when relevances are sorted in descending order
    ideal_relevances = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal_relevances, k)

    if idcg == 0:
        return 0.0

    ndcg = dcg / idcg
    return ndcg

# Example: Evaluate recommendation results
relevances = [3, 2, 0, 1, 4, 0, 2, 3, 1, 0]

print("=== NDCG Evaluation ===")
print(f"Relevance in Recommendation Order: {relevances}")
print(f"Ideal Order: {sorted(relevances, reverse=True)}")

for k in [3, 5, 10]:
    dcg = dcg_at_k(relevances, k)
    ideal_relevances = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal_relevances, k)
    ndcg = ndcg_at_k(relevances, k)

    print(f"\n=== K={k} ===")
    print(f"DCG@{k}: {dcg:.3f}")
    print(f"IDCG@{k}: {idcg:.3f}")
    print(f"NDCG@{k}: {ndcg:.3f}")

# Detailed calculation example (K=5)
print("\n=== Detailed Calculation (K=5) ===")
k = 5
rels = relevances[:k]
print(f"Top {k} Relevances: {rels}")

for i, rel in enumerate(rels):
    pos = i + 1
    gain = 2**rel - 1
    discount = np.log2(pos + 1)
    contribution = gain / discount
    print(f"Position {pos}: rel={rel}, gain={gain}, discount={discount:.3f}, contribution={contribution:.3f}")

dcg = dcg_at_k(relevances, k)
print(f"\nDCG@5 = {dcg:.3f}")

ideal_rels = sorted(relevances, reverse=True)[:k]
print(f"\nIdeal Top {k}: {ideal_rels}")
idcg = dcg_at_k(ideal_rels, k)
print(f"IDCG@5 = {idcg:.3f}")

ndcg = ndcg_at_k(relevances, k)
print(f"\nNDCG@5 = {dcg:.3f} / {idcg:.3f} = {ndcg:.3f}")

Output:

=== NDCG Evaluation ===
Relevance in Recommendation Order: [3, 2, 0, 1, 4, 0, 2, 3, 1, 0]
Ideal Order: [4, 3, 3, 2, 2, 1, 1, 0, 0, 0]

=== K=3 ===
DCG@3: 8.893
IDCG@3: 22.917
NDCG@3: 0.388

=== K=5 ===
DCG@5: 15.126
IDCG@5: 25.369
NDCG@5: 0.596

=== K=10 ===
DCG@10: 18.636
IDCG@10: 26.059
NDCG@10: 0.715

=== Detailed Calculation (K=5) ===
Top 5 Relevances: [3, 2, 0, 1, 4]
Position 1: rel=3, gain=7, discount=1.000, contribution=7.000
Position 2: rel=2, gain=3, discount=1.585, contribution=1.893
Position 3: rel=0, gain=0, discount=2.000, contribution=0.000
Position 4: rel=1, gain=1, discount=2.322, contribution=0.431
Position 5: rel=4, gain=15, discount=2.585, contribution=5.803

DCG@5 = 15.126

Ideal Top 5: [4, 3, 3, 2, 2]
IDCG@5 = 25.369

NDCG@5 = 15.126 / 25.369 = 0.596

References

  1. Ricci, F., Rokach, L., & Shapira, B. (2015). Recommender Systems Handbook (2nd ed.). Springer.
  2. Aggarwal, C. C. (2016). Recommender Systems: The Textbook. Springer.
  3. Falk, K. (2019). Practical Recommender Systems. Manning Publications.
  4. Harper, F. M., & Konstan, J. A. (2015). The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems, 5(4), 1-19.
  5. Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. Computer, 42(8), 30-37.
