This chapter covers the fundamentals of data preprocessing: why it matters, how to handle missing values and outliers, the difference between scaling and normalization, and how to build preprocessing pipelines with scikit-learn.
Learning Objectives
By reading this chapter, you will master the following:
- ✅ Understand the importance and overall picture of data preprocessing
- ✅ Select appropriate handling methods for different types of missing values
- ✅ Detect and appropriately handle outliers
- ✅ Understand the difference between scaling and normalization, and use them appropriately
- ✅ Build preprocessing pipelines using scikit-learn
- ✅ Execute comprehensive preprocessing on real data
1.1 The Importance of Data Preprocessing
What is Data Preprocessing?
Data Preprocessing is the process of transforming raw data into a format suitable for machine learning models.
"Garbage In, Garbage Out (GIGO)" - Data quality determines model performance.
Why Preprocessing is Necessary
| Problem | Impact | Solution |
|---|---|---|
| Missing Values | Training errors, biased predictions | Imputation, deletion |
| Outliers | Model distortion, overfitting | Detection, transformation, deletion |
| Scale Differences | Training instability | Normalization, standardization |
| Irrelevant Features | Curse of dimensionality, overfitting | Feature selection, dimensionality reduction |
Overall Picture of Data Preprocessing
Example: Effects of Preprocessing
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
"""
Example: Effects of Preprocessing
Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 1-5 minutes
Dependencies: None
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
# Generate sample data (features with different scales)
np.random.seed(42)
n_samples = 1000
# Feature 1: Age (20-60)
X1 = np.random.uniform(20, 60, n_samples)
# Feature 2: Annual income (3-10 million yen)
X2 = np.random.uniform(300, 1000, n_samples)
# Target: Based on age and income
y = ((X1 > 40) & (X2 > 600)).astype(int)
# Create DataFrame
X = pd.DataFrame({'age': X1, 'income': X2})
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Training without preprocessing
model_raw = LogisticRegression(random_state=42, max_iter=1000)
model_raw.fit(X_train, y_train)
acc_raw = accuracy_score(y_test, model_raw.predict(X_test))
# Training with preprocessing
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model_scaled = LogisticRegression(random_state=42, max_iter=1000)
model_scaled.fit(X_train_scaled, y_train)
acc_scaled = accuracy_score(y_test, model_scaled.predict(X_test_scaled))
print("=== Comparison of Preprocessing Effects ===")
print(f"Without preprocessing: Accuracy = {acc_raw:.3f}")
print(f"With preprocessing: Accuracy = {acc_scaled:.3f}")
print(f"Improvement: {(acc_scaled - acc_raw) * 100:.1f}%")
Output:
=== Comparison of Preprocessing Effects ===
Without preprocessing: Accuracy = 0.890
With preprocessing: Accuracy = 0.920
Improvement: 3.0%
Important: For gradient-based models such as logistic regression, scaling speeds up convergence and, as the example shows, can also improve accuracy.
1.2 Missing Value Handling
Types of Missing Values
Missing values are classified into three types based on their generation mechanism:
| Type | Description | Example |
|---|---|---|
| MCAR (Missing Completely At Random) | Completely random missingness | Data loss due to equipment failure |
| MAR (Missing At Random) | Missingness depends on other observed variables | Older people less likely to provide income |
| MNAR (Missing Not At Random) | Missingness itself is meaningful | Low-income individuals not filling in income |
Visualizing Missing Values
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
# - seaborn>=0.12.0
"""
Example: Visualizing Missing Values
Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Generate sample data with missing values
np.random.seed(42)
n = 100
df = pd.DataFrame({
'age': np.random.randint(20, 70, n),
'income': np.random.randint(300, 1200, n),
'score': np.random.uniform(0, 100, n)
})
# Intentionally create missing values
# 10% missing in age
missing_age = np.random.choice(n, size=int(n * 0.1), replace=False)
df.loc[missing_age, 'age'] = np.nan
# Age-dependent missingness in income (MAR): 40% of rows with age > 50
missing_income = df[df['age'] > 50].sample(frac=0.4).index
df.loc[missing_income, 'income'] = np.nan
# 15% missing in score
missing_score = np.random.choice(n, size=int(n * 0.15), replace=False)
df.loc[missing_score, 'score'] = np.nan
# Check missing values
print("=== Missing Value Status ===")
print(df.isnull().sum())
print(f"\nMissing Rates:")
print((df.isnull().sum() / len(df) * 100).round(2))
# Visualize with heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False, cmap='viridis')
plt.title('Missing Value Pattern Visualization (Yellow = Missing)', fontsize=14)
plt.xlabel('Features')
plt.tight_layout()
plt.show()
Output:
=== Missing Value Status ===
age 10
income 16
score 15
dtype: int64
Missing rates (%):
age 10.0
income 16.0
score 15.0
dtype: float64
Deletion Methods
Row Deletion (Listwise Deletion)
# Delete rows containing missing values
df_droprows = df.dropna()
print(f"Original data: {len(df)} rows")
print(f"After deletion: {len(df_droprows)} rows")
print(f"Deleted rows: {len(df) - len(df_droprows)} rows ({(1 - len(df_droprows)/len(df))*100:.1f}%)")
Output:
Original data: 100 rows
After deletion: 64 rows
Deleted rows: 36 rows (36.0%)
Note: Listwise deletion can discard a large fraction of the data (36% here).
Column Deletion
# Delete columns with 30% or more missing rate
threshold = 0.3
df_dropcols = df.loc[:, df.isnull().mean() < threshold]
print(f"Original features: {df.shape[1]}")
print(f"After deletion: {df_dropcols.shape[1]}")
print(f"\nDeleted features: {set(df.columns) - set(df_dropcols.columns)}")
Imputation Methods
Simple Imputation
from sklearn.impute import SimpleImputer
# Mean imputation
imputer_mean = SimpleImputer(strategy='mean')
df_mean = pd.DataFrame(
imputer_mean.fit_transform(df),
columns=df.columns
)
# Median imputation
imputer_median = SimpleImputer(strategy='median')
df_median = pd.DataFrame(
imputer_median.fit_transform(df),
columns=df.columns
)
# Mode imputation
imputer_mode = SimpleImputer(strategy='most_frequent')
df_mode = pd.DataFrame(
imputer_mode.fit_transform(df),
columns=df.columns
)
# Constant imputation
imputer_constant = SimpleImputer(strategy='constant', fill_value=0)
df_constant = pd.DataFrame(
imputer_constant.fit_transform(df),
columns=df.columns
)
print("=== Comparison of Imputation Methods ===\n")
print(f"Original age mean: {df['age'].mean():.2f}")
print(f"After mean imputation: {df_mean['age'].mean():.2f}")
print(f"After median imputation: {df_median['age'].median():.2f}")
print(f"Mode imputation: {df_mode['age'].mode()[0]:.2f}")
KNN Imputation
from sklearn.impute import KNNImputer
# KNN imputation (k=5)
knn_imputer = KNNImputer(n_neighbors=5)
df_knn = pd.DataFrame(
knn_imputer.fit_transform(df),
columns=df.columns
)
print("\n=== KNN Imputation Details ===")
print(f"Age mean before missing: {df['age'].mean():.2f}")
print(f"Age mean after KNN imputation: {df_knn['age'].mean():.2f}")
# Visualization: Comparison of imputation methods
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
methods = [
('Original Data', df),
('Mean Imputation', df_mean),
('Median Imputation', df_median),
('Mode Imputation', df_mode),
('Constant Imputation', df_constant),
('KNN Imputation', df_knn)
]
for ax, (name, data) in zip(axes.flat, methods):
ax.scatter(data['age'], data['income'], alpha=0.6)
ax.set_xlabel('Age')
ax.set_ylabel('Income')
ax.set_title(name)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Multiple Imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Multiple imputation (MICE: Multivariate Imputation by Chained Equations)
mice_imputer = IterativeImputer(random_state=42, max_iter=10)
df_mice = pd.DataFrame(
mice_imputer.fit_transform(df),
columns=df.columns
)
print("=== Multiple Imputation (MICE) ===")
print(f"Missing count before imputation: {df.isnull().sum().sum()}")
print(f"Missing count after imputation: {df_mice.isnull().sum().sum()}")
print(f"\nStatistics of each feature after imputation:")
print(df_mice.describe())
Guidelines for Choosing Imputation Methods
| Situation | Recommended Method | Reason |
|---|---|---|
| Missing rate < 5% | Deletion | Minimal information loss |
| Numerical, normal distribution | Mean imputation | Preserves distribution |
| Numerical, with outliers | Median imputation | Robust |
| Categorical | Mode imputation | Reasonable estimate |
| Correlation between features | KNN, MICE | Utilizes relationships |
| MNAR | Use domain knowledge | Missingness itself is informative |
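As a starting point, these guidelines can be encoded as a simple heuristic. The sketch below is illustrative only: the `choose_imputer` helper and its skewness threshold are assumptions for this example, not a standard scikit-learn API.
import pandas as pd
from sklearn.impute import SimpleImputer
def choose_imputer(series: pd.Series):
    """Illustrative heuristic mapping a column to an imputation strategy."""
    if series.isnull().mean() == 0:
        return None  # nothing to impute
    if series.dtype == object or isinstance(series.dtype, pd.CategoricalDtype):
        return SimpleImputer(strategy='most_frequent')
    if abs(series.skew()) > 1:  # skewed / outlier-prone numeric column
        return SimpleImputer(strategy='median')
    return SimpleImputer(strategy='mean')  # roughly symmetric numeric column
# Usage with the df from the examples above
for col in df.columns:
    print(f"{col}: {choose_imputer(df[col])}")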
1.3 Outlier Handling
What are Outliers?
Outliers are values that differ significantly from other data points, either due to measurement errors or genuine anomalies.
Outlier Detection Methods
1. IQR Method (Interquartile Range)
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
"""
Example: 1. IQR Method (Interquartile Range)
Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Sample data (with outliers)
np.random.seed(42)
data_normal = np.random.normal(50, 10, 95)
outliers = np.array([100, 105, 110, 0, -5]) # Outliers
data = np.concatenate([data_normal, outliers])
# Outlier detection using IQR method
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = (data < lower_bound) | (data > upper_bound)
print("=== Outlier Detection using IQR Method ===")
print(f"Q1 (25th percentile): {Q1:.2f}")
print(f"Q3 (75th percentile): {Q3:.2f}")
print(f"IQR: {IQR:.2f}")
print(f"Lower bound: {lower_bound:.2f}")
print(f"Upper bound: {upper_bound:.2f}")
print(f"Number of outliers: {outliers_iqr.sum()}")
print(f"Outliers: {data[outliers_iqr]}")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Box plot
axes[0].boxplot(data, vert=True)
axes[0].axhline(y=lower_bound, color='r', linestyle='--',
label=f'Lower bound: {lower_bound:.1f}')
axes[0].axhline(y=upper_bound, color='r', linestyle='--',
label=f'Upper bound: {upper_bound:.1f}')
axes[0].set_ylabel('Value')
axes[0].set_title('Box Plot (IQR Method)', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Histogram
axes[1].hist(data, bins=20, alpha=0.7, edgecolor='black')
axes[1].axvline(x=lower_bound, color='r', linestyle='--', linewidth=2)
axes[1].axvline(x=upper_bound, color='r', linestyle='--', linewidth=2)
axes[1].set_xlabel('Value')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Histogram', fontsize=14)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Output:
=== Outlier Detection using IQR Method ===
Q1 (25th percentile): 43.26
Q3 (75th percentile): 56.83
IQR: 13.57
Lower bound: 22.90
Upper bound: 77.19
Number of outliers: 5
Outliers: [100. 105. 110. 0. -5.]
2. Z-score Detection
# Requirements:
# - Python 3.9+
# - scipy>=1.11.0
"""
Example: 2. Z-score Detection
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
from scipy import stats
# Calculate Z-score
z_scores = np.abs(stats.zscore(data))
threshold = 3 # Typically use 3 as threshold
outliers_zscore = z_scores > threshold
print("\n=== Outlier Detection using Z-score ===")
print(f"Threshold: {threshold}")
print(f"Number of outliers: {outliers_zscore.sum()}")
print(f"Z-scores of outliers: {z_scores[outliers_zscore]}")
print(f"Outliers: {data[outliers_zscore]}")
# Visualization
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(range(len(data)), data, c=outliers_zscore,
cmap='coolwarm', s=50, alpha=0.7, edgecolors='black')
plt.axhline(y=data.mean() + 3*data.std(), color='r',
linestyle='--', label='+3σ')
plt.axhline(y=data.mean() - 3*data.std(), color='r',
linestyle='--', label='-3σ')
plt.xlabel('Index')
plt.ylabel('Value')
plt.title('Data Points (Red = Outliers)', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.subplot(1, 2, 2)
plt.scatter(range(len(z_scores)), z_scores,
c=outliers_zscore, cmap='coolwarm', s=50,
alpha=0.7, edgecolors='black')
plt.axhline(y=threshold, color='r', linestyle='--',
label=f'Threshold: {threshold}')
plt.xlabel('Index')
plt.ylabel('|Z-score|')
plt.title('Z-score Distribution', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
3. Isolation Forest Detection
from sklearn.ensemble import IsolationForest
# Generate 2-D data (95 normal points + 5 outliers)
np.random.seed(42)
X = np.random.normal(50, 10, (95, 2))
X_outliers = np.array([[100, 100], [105, 105], [0, 0], [-5, -5], [110, 110]])
X_combined = np.vstack([X, X_outliers])
# Isolation Forest
iso_forest = IsolationForest(contamination=0.05, random_state=42)
outliers_iso = iso_forest.fit_predict(X_combined)
# -1: outlier, 1: normal
print("\n=== Outlier Detection using Isolation Forest ===")
print(f"Number of outliers: {(outliers_iso == -1).sum()}")
print(f"Number of normal points: {(outliers_iso == 1).sum()}")
# Visualization
plt.figure(figsize=(10, 8))
plt.scatter(X_combined[outliers_iso == 1, 0],
X_combined[outliers_iso == 1, 1],
c='blue', label='Normal', alpha=0.6, s=50, edgecolors='black')
plt.scatter(X_combined[outliers_iso == -1, 0],
X_combined[outliers_iso == -1, 1],
c='red', label='Outliers', alpha=0.8, s=100,
edgecolors='black', marker='X')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Outlier Detection using Isolation Forest', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
How to Handle Outliers
1. Deletion
# Delete outliers
data_cleaned = data[~outliers_iqr]
print(f"Original data: {len(data)} points")
print(f"After deletion: {len(data_cleaned)} points")
print(f"Mean (before deletion): {data.mean():.2f}")
print(f"Mean (after deletion): {data_cleaned.mean():.2f}")
2. Transformation (Log Transformation)
# Log transformation (positive values only)
data_positive = data[data > 0]
data_log = np.log1p(data_positive)  # log(1 + x): well-defined at x = 0
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].hist(data_positive, bins=20, alpha=0.7, edgecolor='black')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Original Data', fontsize=14)
axes[0].grid(True, alpha=0.3)
axes[1].hist(data_log, bins=20, alpha=0.7, edgecolor='black', color='orange')
axes[1].set_xlabel('log(Value)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('After Log Transformation', fontsize=14)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
3. Capping (Winsorization)
from scipy.stats import mstats
# Winsorization: Cap outliers at upper and lower bounds
data_winsorized = mstats.winsorize(data, limits=[0.05, 0.05])
print("\n=== Winsorization (Cap top and bottom 5%) ===")
print(f"Original data range: [{data.min():.2f}, {data.max():.2f}]")
print(f"Range after processing: [{data_winsorized.min():.2f}, {data_winsorized.max():.2f}]")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].boxplot(data)
axes[0].set_ylabel('Value')
axes[0].set_title('Original Data', fontsize=14)
axes[0].grid(True, alpha=0.3)
axes[1].boxplot(data_winsorized)
axes[1].set_ylabel('Value')
axes[1].set_title('After Winsorization', fontsize=14)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Guidelines for Outlier Handling
| Situation | Recommended Method | Reason |
|---|---|---|
| Clear error | Deletion | Improves data quality |
| True extreme value | Keep or cap | Preserves information |
| Skewed distribution | Log transformation | Approximates normal distribution |
| Robustness needed | Winsorization | Suppresses impact |
| Multidimensional data | Isolation Forest | Detects complex patterns |
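These guidelines can likewise be wrapped in a small utility. A minimal sketch follows; the `treat_outliers` helper and its `method` argument are assumptions for illustration, and its capping clips at the IQR fences rather than at fixed percentiles like `winsorize`.
import numpy as np
def treat_outliers(values, method='cap', k=1.5):
    """Detect outliers with the IQR rule, then drop or cap them."""
    q1, q3 = np.percentile(values, [25, 75])
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    if method == 'drop':  # clear errors: remove
        return values[(values >= lo) & (values <= hi)]
    return np.clip(values, lo, hi)  # true extremes: cap at the fences
# Usage with the 1-D `data` array from the detection examples
print(f"drop: {len(treat_outliers(data, 'drop'))} points remain")
capped = treat_outliers(data, 'cap')
print(f"cap: range [{capped.min():.2f}, {capped.max():.2f}]")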
1.4 Scaling and Normalization
Why Scaling is Necessary
When features have different scales, the following problems occur:
- Distance-based algorithms (KNN, SVM) are dominated by the feature with the largest scale (see the sketch after this list)
- Gradient descent convergence becomes slow
- Regularization effects become uneven
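To see the first point concretely, here is a minimal sketch with synthetic ages and incomes (analogous to the earlier examples): before scaling, the squared Euclidean distance between two samples is driven almost entirely by the income axis; after standardization both features contribute on comparable scales.
import numpy as np
from sklearn.preprocessing import StandardScaler
rng = np.random.default_rng(42)
X = np.column_stack([
    rng.uniform(20, 60, 200),     # age: spans ~40 units
    rng.uniform(300, 1000, 200),  # income: spans ~700 units
])
def income_share(a, b):
    """Fraction of squared Euclidean distance contributed by income."""
    d = a - b
    return d[1] ** 2 / (d ** 2).sum()
X_std = StandardScaler().fit_transform(X)
print(f"Raw:    income share of distance = {income_share(X[0], X[1]):.3f}")
print(f"Scaled: income share of distance = {income_share(X_std[0], X_std[1]):.3f}")
The exact numbers depend on the random draw, but the raw share is typically close to 1.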
1. StandardScaler (Standardization)
Standardization transforms data to have mean 0 and standard deviation 1.
$$ z = \frac{x - \mu}{\sigma} $$
- $\mu$: mean
- $\sigma$: standard deviation
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
"""
Example: StandardScaler (Standardization)
Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Sample data (different scales)
np.random.seed(42)
data = pd.DataFrame({
'age': np.random.randint(20, 70, 100),
'income': np.random.randint(300, 1500, 100),
'score': np.random.uniform(0, 100, 100)
})
print("=== Original Data Statistics ===")
print(data.describe())
# Apply StandardScaler
scaler = StandardScaler()
data_scaled = pd.DataFrame(
scaler.fit_transform(data),
columns=data.columns
)
print("\n=== Statistics After Standardization ===")
print(data_scaled.describe())
# Visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
for i, col in enumerate(data.columns):
# Original data
axes[0, i].hist(data[col], bins=20, alpha=0.7, edgecolor='black')
axes[0, i].set_xlabel(col)
axes[0, i].set_ylabel('Frequency')
axes[0, i].set_title(f'{col} (Original Data)', fontsize=12)
axes[0, i].grid(True, alpha=0.3)
# After standardization
axes[1, i].hist(data_scaled[col], bins=20, alpha=0.7,
edgecolor='black', color='orange')
axes[1, i].set_xlabel(col)
axes[1, i].set_ylabel('Frequency')
axes[1, i].set_title(f'{col} (Standardized)', fontsize=12)
axes[1, i].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
2. MinMaxScaler (Normalization)
Normalization scales values to a specified range (typically [0, 1]).
$$ x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} $$
from sklearn.preprocessing import MinMaxScaler
# Apply MinMaxScaler
minmax_scaler = MinMaxScaler(feature_range=(0, 1))
data_minmax = pd.DataFrame(
minmax_scaler.fit_transform(data),
columns=data.columns
)
print("=== Statistics After MinMaxScaler (Normalization) ===")
print(data_minmax.describe())
# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, col in enumerate(data.columns):
axes[i].hist(data_minmax[col], bins=20, alpha=0.7,
edgecolor='black', color='green')
axes[i].set_xlabel(col)
axes[i].set_ylabel('Frequency')
axes[i].set_title(f'{col} (MinMax: [0,1])', fontsize=12)
axes[i].set_xlim(-0.1, 1.1)
axes[i].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
3. RobustScaler (Robust Scaling)
RobustScaler uses median and IQR, making it robust to outliers.
$$ x_{\text{robust}} = \frac{x - \text{median}}{\text{IQR}} $$
from sklearn.preprocessing import RobustScaler
# Data with outliers
data_with_outliers = data.copy()
data_with_outliers.loc[0:5, 'income'] = [5000, 5500, 6000, 100, 50, 10000]
# Apply RobustScaler
robust_scaler = RobustScaler()
data_robust = pd.DataFrame(
robust_scaler.fit_transform(data_with_outliers),
columns=data.columns
)
# Comparison: StandardScaler vs RobustScaler
standard_scaler = StandardScaler()
data_standard = pd.DataFrame(
standard_scaler.fit_transform(data_with_outliers),
columns=data.columns
)
print("=== Comparison with Data Containing Outliers ===")
print("\nStandardScaler:")
print(data_standard['income'].describe())
print("\nRobustScaler:")
print(data_robust['income'].describe())
# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].boxplot(data_with_outliers['income'])
axes[0].set_ylabel('income')
axes[0].set_title('Original Data (with outliers)', fontsize=12)
axes[0].grid(True, alpha=0.3)
axes[1].boxplot(data_standard['income'])
axes[1].set_ylabel('income (scaled)')
axes[1].set_title('StandardScaler', fontsize=12)
axes[1].grid(True, alpha=0.3)
axes[2].boxplot(data_robust['income'])
axes[2].set_ylabel('income (scaled)')
axes[2].set_title('RobustScaler (Robust to outliers)', fontsize=12)
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Guidelines for Scaler Selection
| Scaler | Use Case | Advantages | Disadvantages |
|---|---|---|---|
| StandardScaler | Normal distribution, no outliers | Standard for many algorithms | Sensitive to outliers |
| MinMaxScaler | Range matters, neural networks | Interpretable [0,1] | Large impact from outliers |
| RobustScaler | With outliers | Robust to outliers | Undefined range |
Recommendations by Algorithm
| Algorithm | Scaling Needed? | Recommended Scaler |
|---|---|---|
| Linear Regression | Recommended | StandardScaler |
| Logistic Regression | Required | StandardScaler |
| SVM | Required | StandardScaler |
| KNN | Required | StandardScaler, MinMaxScaler |
| Neural Networks | Required | MinMaxScaler, StandardScaler |
| Decision Trees | Not needed | - |
| Random Forest | Not needed | - |
| XGBoost | Not needed | - |
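The "Not needed" rows can be checked directly: tree-based models split on per-feature thresholds, so a monotonic transformation such as standardization leaves the learned splits, and hence the accuracy, essentially unchanged. A quick sketch reusing the `data` DataFrame from this section, with a synthetic label added purely for illustration:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Synthetic target, assumed here only so there is something to predict
y = ((data['age'] > 45) & (data['score'] > 50)).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(data, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_tr)
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
tree_std = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_tr), y_tr)
print(f"Accuracy (raw):    {tree_raw.score(X_te, y_te):.3f}")
print(f"Accuracy (scaled): {tree_std.score(scaler.transform(X_te), y_te):.3f}")
# The two scores should match up to floating-point tie-breaking in the splits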
1.5 Practical Example: Complete Preprocessing Pipeline
Data Preparation
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
"""
Example: Data Preparation
Purpose: Demonstrate data manipulation and preprocessing
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Generate realistic data
np.random.seed(42)
n = 1000
# Generate features
df = pd.DataFrame({
'age': np.random.randint(18, 80, n),
'income': np.random.normal(500, 200, n),
'credit_score': np.random.uniform(300, 850, n),
'loan_amount': np.random.uniform(1000, 50000, n),
'employment_years': np.random.randint(0, 40, n)
})
# Target variable (loan approval)
df['approved'] = (
(df['credit_score'] > 600) &
(df['income'] > 400) &
(df['age'] > 25)
).astype(int)
# Intentionally add data quality issues
# 1. Missing values
missing_idx = np.random.choice(n, size=100, replace=False)
df.loc[missing_idx[:50], 'income'] = np.nan
df.loc[missing_idx[50:], 'credit_score'] = np.nan
# 2. Outliers
outlier_idx = np.random.choice(n, size=20, replace=False)
df.loc[outlier_idx, 'loan_amount'] = df.loc[outlier_idx, 'loan_amount'] * 10
print("=== Data Overview ===")
print(df.head(10))
print(f"\nShape: {df.shape}")
print(f"\nMissing values:")
print(df.isnull().sum())
print(f"\nBasic statistics:")
print(df.describe())
Building the Preprocessing Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, RobustScaler
# Separate features and target
X = df.drop('approved', axis=1)
y = df['approved']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Define pipeline
# Split numerical features into two groups
sensitive_features = ['loan_amount'] # Sensitive to outliers
regular_features = ['age', 'income', 'credit_score', 'employment_years']
# Build pipeline
preprocessor = ColumnTransformer(
transformers=[
# Regular features: imputation β standardization
('regular', Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
]), regular_features),
# Outlier-sensitive features: imputation β robust scaling
('sensitive', Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', RobustScaler())
]), sensitive_features)
]
)
# Complete pipeline (preprocessing + model)
full_pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
print("=== Pipeline Structure ===")
print(full_pipeline)
Model Training and Evaluation
# Execute pipeline
full_pipeline.fit(X_train, y_train)
# Prediction
y_pred = full_pipeline.predict(X_test)
# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print("\n=== Model Performance ===")
print(f"Accuracy: {accuracy:.3f}")
print(f"\nDetailed Report:")
print(classification_report(y_test, y_pred,
target_names=['Rejected', 'Approved']))
# Comparison with no preprocessing
from sklearn.ensemble import RandomForestClassifier
# Without preprocessing (simply drop rows with missing values)
X_train_raw = X_train.dropna()
y_train_raw = y_train[X_train_raw.index]
X_test_raw = X_test.fillna(X_test.median())  # simple fill so prediction can run
model_raw = RandomForestClassifier(n_estimators=100, random_state=42)
model_raw.fit(X_train_raw, y_train_raw)
y_pred_raw = model_raw.predict(X_test_raw)
accuracy_raw = accuracy_score(y_test, y_pred_raw)
print(f"\n=== Pipeline vs No Preprocessing ===")
print(f"With pipeline: {accuracy:.3f}")
print(f"Without preprocessing: {accuracy_raw:.3f}")
print(f"Improvement: {(accuracy - accuracy_raw) * 100:.1f}%")
print(f"\nTraining data size:")
print(f" Pipeline: {len(X_train)} rows")
print(f" Without preprocessing: {len(X_train_raw)} rows ({len(X_train) - len(X_train_raw)} rows deleted)")
Detailed Analysis of Preprocessing
# Get preprocessed data
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
# Feature names (order matches the transformer order in the ColumnTransformer)
feature_names = regular_features + sensitive_features
print("\n=== Data After Preprocessing ===")
print(f"Shape: {X_train_processed.shape}")
print(f"\nStatistics after preprocessing (training data):")
df_processed = pd.DataFrame(X_train_processed, columns=feature_names)
print(df_processed.describe())
# Visualization: Effects of preprocessing
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
features_to_plot = ['age', 'income', 'loan_amount']
for i, feature in enumerate(features_to_plot):
# Before preprocessing
axes[0, i].hist(X_train[feature].dropna(), bins=30,
alpha=0.7, edgecolor='black')
axes[0, i].set_xlabel(feature)
axes[0, i].set_ylabel('Frequency')
axes[0, i].set_title(f'{feature} (Before Preprocessing)', fontsize=12)
axes[0, i].grid(True, alpha=0.3)
# After preprocessing
feature_idx = feature_names.index(feature)
axes[1, i].hist(X_train_processed[:, feature_idx], bins=30,
alpha=0.7, edgecolor='black', color='orange')
axes[1, i].set_xlabel(feature)
axes[1, i].set_ylabel('Frequency')
axes[1, i].set_title(f'{feature} (After Preprocessing)', fontsize=12)
axes[1, i].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Saving and Reusing the Pipeline
# Requirements:
# - Python 3.9+
# - joblib>=1.3.0
"""
Example: Saving and Reusing the Pipeline
Purpose: Demonstrate data manipulation and preprocessing
Target: Beginner
Execution time: ~5 seconds
Dependencies: None
"""
import joblib
# Save the pipeline
joblib.dump(full_pipeline, 'loan_approval_pipeline.pkl')
print("Pipeline saved: loan_approval_pipeline.pkl")
# Load and use
loaded_pipeline = joblib.load('loan_approval_pipeline.pkl')
# Prediction on new data
new_data = pd.DataFrame({
'age': [35, 22, 50],
'income': [700, 300, np.nan], # Contains missing values
'credit_score': [750, 550, 800],
'loan_amount': [25000, 5000, 100000], # Contains outliers
'employment_years': [10, 1, 25]
})
predictions = loaded_pipeline.predict(new_data)
probabilities = loaded_pipeline.predict_proba(new_data)
print("\n=== Predictions on New Data ===")
for i, (pred, prob) in enumerate(zip(predictions, probabilities)):
print(f"\nSample {i+1}:")
print(f" Prediction: {'Approved' if pred == 1 else 'Rejected'}")
print(f" Probability: Rejected={prob[0]:.2%}, Approved={prob[1]:.2%}")
1.6 Chapter Summary
What We Learned
The Importance of Data Preprocessing
- Data quality determines model performance
- Preprocessing improves accuracy and robustness
Missing Value Handling
- Three types: MCAR, MAR, MNAR
- Deletion methods, simple imputation, KNN imputation, multiple imputation
- Selecting appropriate methods based on the situation
Outlier Handling
- Detection using IQR method, Z-score, Isolation Forest
- Handling via deletion, transformation, capping
- Combining with domain knowledge
Scaling and Normalization
- StandardScaler: mean 0, standard deviation 1
- MinMaxScaler: scaling to specified range
- RobustScaler: robust to outliers
- Appropriate use by algorithm
Pipeline Construction
- Reproducible preprocessing flow
- Consistency between training and testing
- Easy deployment to production environments
Principles of Preprocessing
| Principle | Description |
|---|---|
| Data Understanding First | Understand problems through visualization and statistics before processing |
| Leverage Domain Knowledge | Reflect business knowledge in preprocessing decisions |
| Prevent Data Leakage | Fit on training data and transform on test data |
| Ensure Reproducibility | Standardize processing with pipelines |
| Incremental Approach | Don't change too much at once, verify effects |
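"Prevent Data Leakage" and "Ensure Reproducibility" reinforce each other when preprocessing lives inside the pipeline: scikit-learn then refits the imputer and scaler on each training fold during cross-validation. A minimal sketch, assuming the `X` and `y` from the practical example in Section 1.5:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000, random_state=42)),
])
# cross_val_score clones the pipeline per fold, so preprocessing statistics
# are computed on each fold's training portion only -- no leakage
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")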
Next Chapter
In Chapter 2, we will learn about Categorical Variable Encoding:
- One-Hot Encoding
- Label Encoding
- Target Encoding
- Frequency Encoding
- Handling high cardinality
Practice Problems
Problem 1 (Difficulty: Easy)
Explain the three types of missing values (MCAR, MAR, MNAR) and provide specific examples for each.
Sample Answer
Answer:
MCAR (Missing Completely At Random)
- Description: Missingness is completely random and unrelated to other variables
- Example: Some data is not recorded due to sensor failure
MAR (Missing At Random)
- Description: Missingness depends on other observed variables
- Example: Elderly people are less likely to fill in health data (age is observed)
MNAR (Missing Not At Random)
- Description: Missingness depends on the missing value itself
- Example: Low-income individuals avoid filling in income (income itself causes the missingness)
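These mechanisms are easy to simulate. In the sketch below (variable names are illustrative), the same income column is masked three ways; because income is generated independently of age here, the observed mean stays close to the true mean under MCAR and MAR, but is biased upward under MNAR:
import numpy as np
rng = np.random.default_rng(42)
n = 10_000
age = rng.integers(20, 70, n)
income = rng.normal(500, 150, n)
# MCAR: each value missing with the same probability, independent of everything
mcar = np.where(rng.random(n) < 0.1, np.nan, income)
# MAR: missingness depends only on the observed age variable
mar = np.where((age > 50) & (rng.random(n) < 0.4), np.nan, income)
# MNAR: missingness depends on the (unrecorded) income value itself
mnar = np.where((income < 400) & (rng.random(n) < 0.6), np.nan, income)
print(f"True mean: {income.mean():.1f}")
for name, col in [('MCAR', mcar), ('MAR', mar), ('MNAR', mnar)]:
    print(f"{name}: missing {np.isnan(col).mean():.1%}, observed mean {np.nanmean(col):.1f}")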
Problem 2 (Difficulty: Medium)
For the following data, detect outliers using the IQR method and report their count.
data = np.array([12, 15, 14, 10, 8, 12, 15, 14, 100, 13, 12, 14, 15, -5, 11])
Sample Answer
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: For the following data, detect outliers using the IQR method
Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: None
"""
import numpy as np
data = np.array([12, 15, 14, 10, 8, 12, 15, 14, 100, 13, 12, 14, 15, -5, 11])
# IQR method
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = (data < lower_bound) | (data > upper_bound)
print("=== Outlier Detection using IQR Method ===")
print(f"Q1: {Q1}")
print(f"Q3: {Q3}")
print(f"IQR: {IQR}")
print(f"Lower bound: {lower_bound}")
print(f"Upper bound: {upper_bound}")
print(f"\nNumber of outliers: {outliers.sum()}")
print(f"Outliers: {data[outliers]}")
print(f"Normal values: {data[~outliers]}")
Output:
=== Outlier Detection using IQR Method ===
Q1: 11.5
Q3: 14.5
IQR: 3.0
Lower bound: 7.0
Upper bound: 19.0
Number of outliers: 2
Outliers: [100 -5]
Normal values: [12 15 14 10 8 12 15 14 13 12 14 15 11]
Problem 3 (Difficulty: Medium)
Explain the differences between StandardScaler and MinMaxScaler, and describe when each should be used.
Sample Answer
Answer:
StandardScaler (Standardization):
- Transformation formula: $z = \frac{x - \mu}{\sigma}$
- Result: mean 0, standard deviation 1
- Characteristics: Preserves distribution shape, range is undefined
MinMaxScaler (Normalization):
- Transformation formula: $x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
- Result: specified range (typically [0, 1])
- Characteristics: Fixed range, large impact from outliers
Usage Guidelines:
| Situation | Recommendation |
|---|---|
| Data close to normal distribution | StandardScaler |
| Range is important (e.g., [0, 1] required) | MinMaxScaler |
| Few outliers | Either is acceptable |
| Many outliers | RobustScaler (or standardization) |
| Neural networks | MinMaxScaler (match activation function range) |
| Linear models, SVM | StandardScaler |
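A three-line check makes the contrast concrete. With one extreme value in the column, MinMaxScaler squeezes the remaining points into a narrow band because the outlier alone defines the top of the range:
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
print(StandardScaler().fit_transform(x).ravel().round(2))
# -> roughly [-0.54 -0.51 -0.49 -0.46  2.  ]: outlier stretched out, spacing preserved
print(MinMaxScaler().fit_transform(x).ravel().round(2))
# -> [0.   0.01 0.02 0.03 1.  ]: the four typical points are crushed near 0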
Problem 4 (Difficulty: Hard)
For the following data, build a complete preprocessing pipeline that includes missing value handling and outlier handling.
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
"""
Example: For the following data, build a complete preprocessing pipeline
Purpose: Demonstrate data manipulation and preprocessing
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: None
"""
import numpy as np
import pandas as pd
np.random.seed(42)
data = pd.DataFrame({
'feature1': np.random.normal(50, 10, 100),
'feature2': np.random.normal(100, 20, 100),
'feature3': np.random.uniform(0, 1, 100)
})
# Add missing values
data.loc[0:10, 'feature1'] = np.nan
data.loc[20:25, 'feature2'] = np.nan
# Add outliers
data.loc[50, 'feature1'] = 200
data.loc[60, 'feature2'] = 500
Sample Answer
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
"""
Example: For the following data, build a complete preprocessing pipeline
Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.impute import KNNImputer
from sklearn.compose import ColumnTransformer
np.random.seed(42)
data = pd.DataFrame({
'feature1': np.random.normal(50, 10, 100),
'feature2': np.random.normal(100, 20, 100),
'feature3': np.random.uniform(0, 1, 100)
})
# Add missing values
data.loc[0:10, 'feature1'] = np.nan
data.loc[20:25, 'feature2'] = np.nan
# Add outliers
data.loc[50, 'feature1'] = 200
data.loc[60, 'feature2'] = 500
print("=== Data Before Preprocessing ===")
print(data.describe())
print(f"\nMissing values:\n{data.isnull().sum()}")
# Build pipeline
# feature1, feature2: with missing values and outliers β KNN imputation + RobustScaler
# feature3: no issues β StandardScaler
preprocessor = ColumnTransformer(
transformers=[
('features_with_issues', Pipeline([
('imputer', KNNImputer(n_neighbors=5)),
('scaler', RobustScaler())
]), ['feature1', 'feature2']),
('clean_features', Pipeline([
('scaler', StandardScaler())
]), ['feature3'])
]
)
# Execute preprocessing
data_processed = preprocessor.fit_transform(data)
print("\n=== Data After Preprocessing ===")
df_processed = pd.DataFrame(
data_processed,
columns=['feature1', 'feature2', 'feature3']
)
print(df_processed.describe())
print(f"\nMissing values: {df_processed.isnull().sum().sum()}")
# Visualization
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
for i, col in enumerate(['feature1', 'feature2', 'feature3']):
# Before preprocessing
axes[0, i].boxplot(data[col].dropna())
axes[0, i].set_ylabel(col)
axes[0, i].set_title(f'{col} (Before Preprocessing)', fontsize=12)
axes[0, i].grid(True, alpha=0.3)
# After preprocessing
axes[1, i].boxplot(df_processed[col])
axes[1, i].set_ylabel(col)
axes[1, i].set_title(f'{col} (After Preprocessing)', fontsize=12)
axes[1, i].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nβ Pipeline construction completed")
print("β Missing value imputation completed (KNN, k=5)")
print("β Robust scaling to outliers completed (RobustScaler)")
Problem 5 (Difficulty: Hard)
Explain why data leakage occurs when scaling training and test data separately. Also demonstrate the correct method.
Sample Answer
Answer:
Why Data Leakage Occurs:
When scaling test data separately, you use the statistical information (mean, standard deviation, etc.) of the test data. This causes the following problems:
- Using future information: In production, the statistics of new data are unknown in advance
- Biased evaluation: Using test data information leads to overestimated performance
- Lack of reproducibility: Cannot perform the same transformation during actual deployment
Incorrect Method (with data leakage):
# ❌ Wrong
scaler_train = StandardScaler()
X_train_scaled = scaler_train.fit_transform(X_train)
scaler_test = StandardScaler()
X_test_scaled = scaler_test.fit_transform(X_test) # Fitting on test data
Correct Method:
# ✅ Correct
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit on training data
X_test_scaled = scaler.transform(X_test) # Transform using training data statistics
Verification with Example:
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: Verifying correct vs. incorrect scaling
Purpose: Demonstrate machine learning model training and evaluation
Target: Beginner to Intermediate
Execution time: 1-5 minutes
Dependencies: None
"""
import numpy as np
from sklearn.preprocessing import StandardScaler
# Sample data
X_train = np.array([[1], [2], [3], [4], [5]])
X_test = np.array([[100], [200], [300]])
# Incorrect method
scaler_train = StandardScaler()
scaler_test = StandardScaler()
X_train_wrong = scaler_train.fit_transform(X_train)
X_test_wrong = scaler_test.fit_transform(X_test)
# Correct method
scaler = StandardScaler()
X_train_correct = scaler.fit_transform(X_train)
X_test_correct = scaler.transform(X_test)
print("=== Incorrect Method (with data leakage) ===")
print(f"Training data mean: {X_train_wrong.mean():.3f}")
print(f"Test data mean: {X_test_wrong.mean():.3f}")
print("β Both close to 0 (scaled independently)")
print("\n=== Correct Method ===")
print(f"Training data mean: {X_train_correct.mean():.3f}")
print(f"Test data mean: {X_test_correct.mean():.3f}")
print("β Test data transformed with training data statistics")
print(f"\nTest data values: {X_test_correct.flatten()}")
print("β Extremely large values compared to training distribution (correctly detected)")
Output:
=== Incorrect Method (with data leakage) ===
Training data mean: 0.000
Test data mean: 0.000
→ Both close to 0 (scaled independently)
=== Correct Method ===
Training data mean: 0.000
Test data mean: 139.300
→ Test data transformed with training data statistics
Test data values: [ 68.58935778 139.30003589 210.01071401]
→ Extremely large values compared to training distribution (correctly detected)