Introduction
Statistics is the discipline of extracting meaningful information from data and making rational decisions under uncertainty. In machine learning, statistical knowledge is essential for understanding data characteristics, evaluating model performance, and quantifying prediction uncertainty.
In this chapter, we will learn about descriptive statistics and probability theory, which form the foundation of statistics. In descriptive statistics, we will learn how to express the central tendency (mean, median) and spread (variance, standard deviation) of data numerically. In probability theory, we will master how to mathematically handle uncertain events. Specifically, we will cover:
- Summarizing data characteristics with numerical indicators (mean, variance, quartiles, etc.)
- Visualizing data with appropriate graphs (histograms, box plots, etc.)
- Basic probability calculations and conditional probability
- Understanding and applying Bayes' theorem
- Mathematical definitions of expected value and variance
1. Fundamentals of Descriptive Statistics
Descriptive Statistics is a method for summarizing and expressing data characteristics in an easily understandable form. By representing large amounts of data with a few numerical indicators, we can grasp the overall picture of the data.
1.1 Measures of Central Tendency
These are indicators that show where the "center" of the data is.
Mean
The mean is the sum of all data values divided by the number of data points; it is the most basic measure of central tendency.
Mathematical expression:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$$
Where $n$ is the number of data points and $x_i$ is the $i$-th data value.
Median
The middle value when data is arranged in ascending order. A robust measure that is less affected by outliers.
Mode
The value that appears most frequently in the data. Can also be applied to categorical data.
Example: test scores of 5 students are 65, 70, 75, 80, and 95 points.
- Mean: $(65+70+75+80+95)/5 = 77$ points
- Median: 75 points (middle value)
If an extreme value is included (e.g., replacing the 95 with 10 points), the mean changes significantly (from 77 to 60), but the median remains relatively stable (moving only from 75 to 70).
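This is easy to verify numerically. Below is a minimal sketch (the variable names are our own) that recomputes both measures after replacing the 95 with an extreme score of 10:
import numpy as np

scores = np.array([65, 70, 75, 80, 95])
with_outlier = np.array([65, 70, 75, 80, 10])  # 95 replaced by an extreme low score

# The mean drops from 77 to 60, while the median only moves from 75 to 70
print(f"Mean:   {np.mean(scores):.0f} -> {np.mean(with_outlier):.0f}")
print(f"Median: {np.median(scores):.0f} -> {np.median(with_outlier):.0f}")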
1.2 Measures of Spread
These are indicators that show how much the data is scattered.
Variance
The average of the squared differences between each data value and the mean.
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
When estimating population variance from a sample, divide by $n-1$ instead of $n$ (unbiased estimator):
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
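Why $n-1$? The sample mean is computed from the same data, so deviations around it are on average slightly smaller than deviations around the true mean; dividing by $n$ therefore underestimates the population variance. A small simulation sketch (the distribution and sample size are arbitrary choices for illustration) makes the bias visible:
import numpy as np

rng = np.random.default_rng(0)
true_variance = 4.0  # population: normal with standard deviation 2

# Draw many small samples and average the two variance estimates
biased, unbiased = [], []
for _ in range(100_000):
    sample = rng.normal(loc=0, scale=2, size=5)
    biased.append(np.var(sample))            # divides by n
    unbiased.append(np.var(sample, ddof=1))  # divides by n-1

print(f"True variance:               {true_variance}")
print(f"Average of /n estimates:     {np.mean(biased):.3f}")    # ~3.2, systematically low
print(f"Average of /(n-1) estimates: {np.mean(unbiased):.3f}")  # ~4.0, unbiased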
Standard Deviation
The square root of variance. Can express spread in the same units as the original data.
$$\sigma = \sqrt{\sigma^2}$$
Quartiles and Percentiles
Indicators that show position when data is ordered.
- First Quartile (Q1): the value below which 25% of the data falls
- Second Quartile (Q2): the median (the 50% point)
- Third Quartile (Q3): the value below which 75% of the data falls
- Interquartile Range (IQR): $Q3 - Q1$ (range of the middle 50% of data)
1.3 Python Implementation
Let's calculate descriptive statistics using NumPy and SciPy.
import numpy as np
from scipy import stats
# Sample data: Student test scores
scores = np.array([65, 70, 72, 75, 78, 80, 82, 85, 88, 95, 98])
# Measures of central tendency
mean = np.mean(scores)
median = np.median(scores)
mode_result = stats.mode(scores, keepdims=True)
# All scores here are unique, so stats.mode returns the smallest value (65)
mode = mode_result.mode[0] if len(mode_result.mode) > 0 else None
print(f"Mean: {mean:.2f}")
print(f"Median: {median:.2f}")
print(f"Mode: {mode}")
# Measures of spread
variance = np.var(scores) # Population variance
std_dev = np.std(scores) # Population standard deviation
sample_variance = np.var(scores, ddof=1) # Sample variance (unbiased estimator)
sample_std = np.std(scores, ddof=1) # Sample standard deviation
print(f"\nPopulation Variance: {variance:.2f}")
print(f"Population Standard Deviation: {std_dev:.2f}")
print(f"Sample Variance: {sample_variance:.2f}")
print(f"Sample Standard Deviation: {sample_std:.2f}")
# Quartiles
q1 = np.percentile(scores, 25)
q2 = np.percentile(scores, 50) # Median
q3 = np.percentile(scores, 75)
iqr = q3 - q1
print(f"\nFirst Quartile (Q1): {q1:.2f}")
print(f"Second Quartile (Q2): {q2:.2f}")
print(f"Third Quartile (Q3): {q3:.2f}")
print(f"Interquartile Range (IQR): {iqr:.2f}")
Execution Result:
Mean: 80.73
Median: 80.00
Mode: 65
Population Variance: 95.83
Population Standard Deviation: 9.79
Sample Variance: 105.42
Sample Standard Deviation: 10.27
First Quartile (Q1): 73.50
Second Quartile (Q2): 80.00
Third Quartile (Q3): 86.50
Interquartile Range (IQR): 13.00
2. Data Visualization
Not only numerical indicators but also visualization through graphs is essential for understanding data.
2.1 Histogram
A graph that visually represents the distribution of data. Data is divided into classes (bins), and the frequency of each class is displayed as a bar graph.
import matplotlib.pyplot as plt
import numpy as np
# Generate sample data following a normal distribution
np.random.seed(42)
data = np.random.normal(loc=70, scale=10, size=1000)
# Draw histogram
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, edgecolor='black', alpha=0.7, color='skyblue')
plt.axvline(np.mean(data), color='red', linestyle='--', linewidth=2, label=f'Mean: {np.mean(data):.2f}')
plt.axvline(np.median(data), color='green', linestyle='--', linewidth=2, label=f'Median: {np.median(data):.2f}')
plt.xlabel('Value', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Histogram: Data Distribution', fontsize=14)
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.show()
2.2 Box Plot
A graph that visually represents quartiles and can also identify outliers.
import matplotlib.pyplot as plt
import numpy as np
# Multiple group data
np.random.seed(42)
group_a = np.random.normal(70, 10, 100)
group_b = np.random.normal(75, 8, 100)
group_c = np.random.normal(65, 12, 100)
data_groups = [group_a, group_b, group_c]
# Draw box plot
plt.figure(figsize=(10, 6))
bp = plt.boxplot(data_groups, labels=['Group A', 'Group B', 'Group C'],
                 patch_artist=True, notch=True)
# Customize colors
colors = ['lightblue', 'lightgreen', 'lightcoral']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
plt.ylabel('Score', fontsize=12)
plt.title('Box Plot: Comparison Between Groups', fontsize=14)
plt.grid(axis='y', alpha=0.3)
plt.show()
How to read a box plot:
- Bottom of box: First quartile (Q1)
- Line in box: Median (Q2)
- Top of box: Third quartile (Q3)
- Whiskers: extend to the most extreme data points within $1.5 \times \text{IQR}$ of the box edges (Matplotlib's default)
- Dots: Outliers
2.3 Scatter Plot
A graph that visualizes the relationship between two variables.
import matplotlib.pyplot as plt
import numpy as np
# Generate correlated data
np.random.seed(42)
x = np.random.normal(50, 10, 100)
y = 2 * x + np.random.normal(0, 10, 100) # y has positive correlation with x
# Draw scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(x, y, alpha=0.6, edgecolors='black', s=50)
plt.xlabel('Variable X', fontsize=12)
plt.ylabel('Variable Y', fontsize=12)
plt.title('Scatter Plot: Relationship Between Two Variables', fontsize=14)
plt.grid(alpha=0.3)
# Calculate and display correlation coefficient
correlation = np.corrcoef(x, y)[0, 1]
plt.text(0.05, 0.95, f'Correlation: {correlation:.3f}',
         transform=plt.gca().transAxes, fontsize=12,
         verticalalignment='top', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
plt.show()
3. Probability Basics
Probability theory is a framework for mathematically handling uncertain events. In machine learning, probability theory is used to model data generation processes and quantify prediction uncertainty.
3.1 Definition and Axioms of Probability
Probability is a numerical value from 0 to 1 that represents the likelihood of an event occurring.
Kolmogorov's Axioms (basic properties of probability):
- Non-negativity: For all events $A$, $P(A) \geq 0$
- Total Probability: The probability of the entire event space is 1, $P(\Omega) = 1$
- Additivity: For mutually exclusive events $A$ and $B$, $P(A \cup B) = P(A) + P(B)$
Basic Probability Calculations
- Complement: $P(A^c) = 1 - P(A)$
- Union: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
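These rules can be checked by brute-force enumeration. The sketch below (a fair six-sided die; the event definitions are our own) confirms the complement rule and the inclusion-exclusion formula for the union:
from fractions import Fraction

omega = set(range(1, 7))  # sample space of a fair die
A = {2, 4, 6}             # event: roll an even number
B = {5, 6}                # event: roll at least 5

def P(event):
    """Probability under equally likely outcomes."""
    return Fraction(len(event), len(omega))

# Complement: P(A^c) = 1 - P(A)
assert P(omega - A) == 1 - P(A)

# Union (inclusion-exclusion): P(A or B) = P(A) + P(B) - P(A and B)
assert P(A | B) == P(A) + P(B) - P(A & B)

print(f"P(A) = {P(A)}, P(B) = {P(B)}, P(A or B) = {P(A | B)}")  # 1/2, 1/3, 2/3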
3.2 Conditional Probability
The probability that event $A$ occurs given that event $B$ has occurred is called conditional probability, denoted as $P(A|B)$.
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
Where $P(B) > 0$.
When drawing one card from a 52-card deck:
- Probability that the drawn card is a spade: $P(\text{Spade}) = 13/52 = 1/4$
- Probability that the drawn card is a face card (J, Q, K): $P(\text{Face card}) = 12/52 = 3/13$
- Probability that it is a face card given that it is a spade: $P(\text{Face card}|\text{Spade}) = \frac{P(\text{Face card} \cap \text{Spade})}{P(\text{Spade})} = \frac{3/52}{13/52} = 3/13$
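The same result can be obtained by enumerating the deck directly. A minimal sketch (the card representation is our own choice):
from fractions import Fraction
from itertools import product

ranks = ['A'] + [str(n) for n in range(2, 11)] + ['J', 'Q', 'K']
suits = ['spade', 'heart', 'diamond', 'club']
deck = list(product(ranks, suits))  # all 52 cards

n_spades = sum(1 for rank, suit in deck if suit == 'spade')            # 13
n_face_and_spade = sum(1 for rank, suit in deck
                       if suit == 'spade' and rank in ('J', 'Q', 'K'))  # 3

# P(Face | Spade) = P(Face and Spade) / P(Spade) = (3/52) / (13/52) = 3/13
p_cond = Fraction(n_face_and_spade, 52) / Fraction(n_spades, 52)
print(f"P(Face card | Spade) = {p_cond}")  # 3/13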
3.3 Bayes' Theorem
Bayes' theorem is an important formula for reversing conditional probability. It plays a central role in machine learning, especially in Bayesian statistics.
$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$
Meaning of each term:
- $P(A|B)$: Posterior probability - Probability of $A$ after observing event $B$
- $P(B|A)$: Likelihood - Probability of observing $B$ when $A$ is true
- $P(A)$: Prior probability - Probability of $A$ before observation
- $P(B)$: Marginal probability (Evidence) - Overall probability of observing $B$
Application of Bayes' Theorem: Medical Diagnosis
For a rare disease:
- 1% of the population has this disease: $P(\text{Disease}) = 0.01$
- Test sensitivity (correctly identifies diseased as positive): $P(\text{Positive}|\text{Disease}) = 0.99$
- Test false positive rate: $P(\text{Positive}|\text{Healthy}) = 0.05$
If the test comes back positive, what is the probability of actually having the disease?
Applying Bayes' theorem:
$$P(\text{Disease}|\text{Positive}) = \frac{P(\text{Positive}|\text{Disease}) \cdot P(\text{Disease})}{P(\text{Positive})}$$
First, calculate $P(\text{Positive})$ (law of total probability):
$$P(\text{Positive}) = P(\text{Positive}|\text{Disease})P(\text{Disease}) + P(\text{Positive}|\text{Healthy})P(\text{Healthy})$$
$$= 0.99 \times 0.01 + 0.05 \times 0.99 = 0.0099 + 0.0495 = 0.0594$$
Therefore:
$$P(\text{Disease}|\text{Positive}) = \frac{0.99 \times 0.01}{0.0594} \approx 0.167$$
In other words, even if the test is positive, the probability of actually having the disease is only about 16.7%. This is due to the low prevalence of the disease.
Python Implementation
import numpy as np
def bayes_theorem(p_a, p_b_given_a, p_b_given_not_a):
    """
    Calculate the posterior probability using Bayes' theorem.

    Parameters:
    -----------
    p_a : float
        Prior probability P(A)
    p_b_given_a : float
        Likelihood P(B|A)
    p_b_given_not_a : float
        P(B|not A)

    Returns:
    --------
    tuple of (float, float)
        Posterior probability P(A|B) and marginal probability P(B)
    """
    # Calculate P(B) using the law of total probability
    p_not_a = 1 - p_a
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    # Calculate P(A|B) using Bayes' theorem
    p_a_given_b = (p_b_given_a * p_a) / p_b
    return p_a_given_b, p_b
# Medical diagnosis example
p_disease = 0.01 # Prior probability of disease (prevalence)
p_positive_given_disease = 0.99 # Sensitivity (true positive rate)
p_positive_given_healthy = 0.05 # False positive rate
p_disease_given_positive, p_positive = bayes_theorem(
    p_disease,
    p_positive_given_disease,
    p_positive_given_healthy
)
print("=== Bayes' Theorem in Medical Diagnosis ===")
print(f"Disease prevalence: {p_disease * 100:.1f}%")
print(f"Test sensitivity: {p_positive_given_disease * 100:.1f}%")
print(f"False positive rate: {p_positive_given_healthy * 100:.1f}%")
print(f"\nProbability of positive result: {p_positive * 100:.2f}%")
print(f"Probability of actually having disease when positive: {p_disease_given_positive * 100:.2f}%")
# Compare with varying sensitivity
print("\n=== Comparison with Varying Sensitivity ===")
sensitivities = [0.90, 0.95, 0.99, 0.999]
for sens in sensitivities:
    prob, _ = bayes_theorem(p_disease, sens, p_positive_given_healthy)
    print(f"Sensitivity {sens*100:.1f}%: Disease probability when positive = {prob*100:.2f}%")
Execution Result:
=== Bayes' Theorem in Medical Diagnosis ===
Disease prevalence: 1.0%
Test sensitivity: 99.0%
False positive rate: 5.0%
Probability of positive result: 5.94%
Probability of actually having disease when positive: 16.67%
=== Comparison with Varying Sensitivity ===
Sensitivity 90.0%: Disease probability when positive = 15.38%
Sensitivity 95.0%: Disease probability when positive = 16.10%
Sensitivity 99.0%: Disease probability when positive = 16.67%
Sensitivity 99.9%: Disease probability when positive = 16.79%
4. Expected Value and Variance
These are important concepts for expressing characteristics of random variables numerically.
4.1 Expected Value
The expected value represents the average value of a random variable.
For discrete random variables:
$$E[X] = \sum_{i} x_i \cdot P(X = x_i)$$
For continuous random variables:
$$E[X] = \int_{-\infty}^{\infty} x \cdot f(x) dx$$
Where $f(x)$ is the probability density function.
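As a quick check, this integral can be evaluated numerically. The sketch below (distribution parameters chosen arbitrarily for illustration) uses SciPy to recover the mean of a normal distribution from its density:
import numpy as np
from scipy import stats
from scipy.integrate import quad

dist = stats.norm(loc=70, scale=10)  # normal distribution with mean 70

# E[X] = integral of x * f(x) dx over the real line
expectation, _ = quad(lambda x: x * dist.pdf(x), -np.inf, np.inf)
print(f"Numerical E[X]: {expectation:.4f}")  # ~70, matching the loc parameter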
Properties of Expected Value
- Linearity: $E[aX + b] = aE[X] + b$
- Additivity: $E[X + Y] = E[X] + E[Y]$
4.2 Variance and Standard Deviation
Variance represents how much a random variable is scattered from its expected value.
$$\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$$
Standard deviation is the square root of variance:
$$\sigma = \sqrt{\text{Var}(X)}$$
Properties of Variance
- $\text{Var}(aX + b) = a^2 \text{Var}(X)$
- If $X$ and $Y$ are independent: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$
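Both properties are easy to sanity-check by simulation; here is a minimal sketch (the distributions, coefficients, and sample size are arbitrary choices):
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000
X = rng.normal(0, 2, n)    # Var(X) = 4
Y = rng.exponential(3, n)  # independent of X, Var(Y) = 3^2 = 9

a, b = 3, 5
# Var(aX + b) = a^2 * Var(X) = 9 * 4 = 36; the shift b has no effect
print(f"Var(aX + b): {np.var(a * X + b):.2f} (theory: {a**2 * 4})")

# For independent X and Y: Var(X + Y) = Var(X) + Var(Y) = 4 + 9 = 13
print(f"Var(X + Y):  {np.var(X + Y):.2f} (theory: {4 + 9})")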
4.3 Python Implementation
import numpy as np
import matplotlib.pyplot as plt
# Example of dice roll experiment
def dice_expectation():
    """Calculate expected value of dice roll"""
    outcomes = np.array([1, 2, 3, 4, 5, 6])
    probabilities = np.array([1/6] * 6)

    # Calculate expected value
    expectation = np.sum(outcomes * probabilities)

    # Calculate variance
    variance = np.sum((outcomes - expectation)**2 * probabilities)
    std_dev = np.sqrt(variance)

    print("=== Expected Value and Variance of Dice Roll ===")
    print(f"Expected value E[X]: {expectation:.4f}")
    print(f"Variance Var(X): {variance:.4f}")
    print(f"Standard deviation σ: {std_dev:.4f}")

    # Visualization
    plt.figure(figsize=(10, 6))
    plt.bar(outcomes, probabilities, edgecolor='black', alpha=0.7, color='skyblue')
    plt.axvline(expectation, color='red', linestyle='--', linewidth=2,
                label=f'Expected value: {expectation:.2f}')
    plt.xlabel('Outcome', fontsize=12)
    plt.ylabel('Probability', fontsize=12)
    plt.title('Probability Distribution of Dice Roll', fontsize=14)
    plt.legend()
    plt.grid(axis='y', alpha=0.3)
    plt.xticks(outcomes)
    plt.show()
dice_expectation()
# Verification by simulation
print("\n=== Verification by Simulation ===")
n_trials = 10000
dice_rolls = np.random.randint(1, 7, n_trials)
empirical_mean = np.mean(dice_rolls)
empirical_variance = np.var(dice_rolls)
empirical_std = np.std(dice_rolls)
print(f"Number of simulations: {n_trials}")
print(f"Empirical mean: {empirical_mean:.4f}")
print(f"Empirical variance: {empirical_variance:.4f}")
print(f"Empirical standard deviation: {empirical_std:.4f}")
print(f"\nDifference from theoretical values:")
print(f"Difference in mean: {abs(empirical_mean - 3.5):.4f}")
print(f"Difference in variance: {abs(empirical_variance - 35/12):.4f}")
5. Summary and Next Steps
In this chapter, we learned the basics of descriptive statistics and probability theory, which form the foundation of statistics. The main topics were:
- Measures of central tendency (mean, median, mode) and spread (variance, standard deviation)
- Data visualization with histograms, box plots, and scatter plots
- Basic axioms of probability and conditional probability
- Theory and applications of Bayes' theorem
- Mathematical definitions and calculations of expected value and variance
- Implementation of statistical analysis using NumPy/SciPy/Matplotlib
Key points to keep in mind:
- The mean is susceptible to outliers, but the median is robust
- When calculating sample variance, divide by $n-1$ (unbiased estimator)
- In Bayes' theorem, the prior probability greatly influences the posterior probability
- Expected value is linear, while variance scales with the square of the coefficient in a linear transformation: $\text{Var}(aX + b) = a^2\text{Var}(X)$
Next Steps
In the next chapter, we will learn about probability distributions. We will master the properties and applications of probability distributions frequently used in machine learning, such as normal distribution, binomial distribution, and Poisson distribution.
Practice Problems
Problem 1: Calculating Descriptive Statistics
Calculate the mean, median, variance, and standard deviation for the following dataset.
Data: 12, 15, 18, 20, 22, 25, 28, 30, 35, 40
import numpy as np
data = np.array([12, 15, 18, 20, 22, 25, 28, 30, 35, 40])
mean = np.mean(data)
median = np.median(data)
variance = np.var(data, ddof=1)
std_dev = np.std(data, ddof=1)
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Variance: {variance:.2f}")
print(f"Standard deviation: {std_dev:.2f}")
Problem 2: Applying Bayes' Theorem
Consider a spam email filter. 10% of emails are spam, and the probability that an email contains the word "free" is 80% for spam emails and 5% for normal emails. Calculate the probability that an email containing the word "free" is spam.
def spam_filter_bayes(p_spam, p_free_given_spam, p_free_given_normal):
    p_normal = 1 - p_spam
    # P(free) by the law of total probability
    p_free = p_free_given_spam * p_spam + p_free_given_normal * p_normal
    # P(spam | free) by Bayes' theorem
    p_spam_given_free = (p_free_given_spam * p_spam) / p_free
    return p_spam_given_free
# Parameters
p_spam = 0.10
p_free_given_spam = 0.80
p_free_given_normal = 0.05
result = spam_filter_bayes(p_spam, p_free_given_spam, p_free_given_normal)
print(f"Probability that email containing 'free' is spam: {result * 100:.2f}%")
Problem 3: Calculating Expected Value
In a lottery game, buying a 1000 yen ticket gives a 10% chance of winning 5000 yen, a 5% chance of winning 10000 yen, and the rest is 0 yen. Calculate the expected value of this game and determine whether it is worth playing.
import numpy as np
# Outcomes and probabilities
outcomes = np.array([5000, 10000, 0])
probabilities = np.array([0.10, 0.05, 0.85])
# Calculate expected value
expected_value = np.sum(outcomes * probabilities)
net_expected_value = expected_value - 1000 # Subtract ticket cost
print(f"Expected value: {expected_value:.2f} yen")
print(f"Net expected value (after ticket cost): {net_expected_value:.2f} yen")
if net_expected_value > 0:
    print("It is worth playing in terms of expected value")
elif net_expected_value == 0:
    # With these parameters the expected winnings exactly equal the ticket price
    print("The game is exactly fair in terms of expected value (break-even)")
else:
    print("It is not worth playing in terms of expected value")