This chapter covers traditional speech recognition. You will learn the principles of Hidden Markov Models (HMMs) and build a complete HMM-GMM based ASR pipeline.
Learning Objectives
By reading this chapter, you will be able to:
- ✅ Understand the definition and evaluation metrics of Automatic Speech Recognition (ASR) tasks
- ✅ Understand and implement Hidden Markov Model (HMM) principles and algorithms
- ✅ Understand acoustic modeling using Gaussian Mixture Models (GMM)
- ✅ Build and evaluate language models (N-grams)
- ✅ Construct a complete HMM-GMM based ASR pipeline
- ✅ Evaluate systems using Word Error Rate (WER)
2.1 Fundamentals of Speech Recognition
Definition of ASR Task
Automatic Speech Recognition (ASR) is the task of converting a speech signal into text.
Goal of speech recognition: given an observed acoustic signal $X$, find the most likely word sequence $W$.
This is formulated using Bayes' theorem as follows:
$$ \hat{W} = \arg\max_{W} P(W|X) = \arg\max_{W} \frac{P(X|W) P(W)}{P(X)} = \arg\max_{W} P(X|W) P(W) $$
The denominator $P(X)$ does not depend on $W$, so it can be dropped from the maximization.
- $P(X|W)$: Acoustic Model - probability of the acoustic signal given a word sequence
- $P(W)$: Language Model - prior probability of the word sequence
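To make the decision rule concrete, here is a minimal sketch that scores two candidate transcriptions in the log domain. The candidate sentences and all score values are made-up illustrations, not outputs of real acoustic or language models:
# Hypothetical scores for two candidate transcriptions (illustrative numbers only)
log_acoustic = {"recognize speech": -120.5, "wreck a nice beach": -118.9}  # log P(X|W)
log_lm = {"recognize speech": -8.2, "wreck a nice beach": -14.7}           # log P(W)
# argmax over candidates of log P(X|W) + log P(W)
best = max(log_acoustic, key=lambda w: log_acoustic[w] + log_lm[w])
print(f"Best hypothesis: {best}")  # -128.7 beats -133.6 → 'recognize speech'
Even though the second candidate scores slightly better acoustically, the language model prior tips the decision toward the more plausible word sequence.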
Components of ASR Systems
Audio Input → Feature Extraction (MFCC) → Acoustic Model (HMM-GMM) → Decoding (Viterbi) → Language Model (N-gram) → Recognition Result (Text)
Evaluation Metrics
Word Error Rate (WER)
WER is the standard evaluation metric for speech recognition:
$$ \text{WER} = \frac{S + D + I}{N} \times 100\% $$
- $S$: Substitutions
- $D$: Deletions
- $I$: Insertions
- $N$: Total number of words in the reference text
Character Error Rate (CER)
For languages like Japanese and Chinese, character-level CER is also used:
$$ \text{CER} = \frac{S_c + D_c + I_c}{N_c} \times 100\% $$
Implementation: WER Calculation
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
import numpy as np
from typing import List, Tuple
def levenshtein_distance(ref: List[str], hyp: List[str]) -> Tuple[int, int, int, int]:
"""
Calculate Levenshtein distance and error statistics
Returns:
(distance, substitutions, deletions, insertions)
"""
m, n = len(ref), len(hyp)
# DP table
dp = np.zeros((m + 1, n + 1), dtype=int)
# Initialization
for i in range(m + 1):
dp[i][0] = i
for j in range(n + 1):
dp[0][j] = j
# DP
for i in range(1, m + 1):
for j in range(1, n + 1):
if ref[i-1] == hyp[j-1]:
dp[i][j] = dp[i-1][j-1]
else:
dp[i][j] = min(
dp[i-1][j-1] + 1, # substitution
dp[i-1][j] + 1, # deletion
dp[i][j-1] + 1 # insertion
)
# Backtrack: Calculate error statistics
i, j = m, n
subs, dels, ins = 0, 0, 0
while i > 0 or j > 0:
if i > 0 and j > 0 and ref[i-1] == hyp[j-1]:
i -= 1
j -= 1
elif i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + 1:
subs += 1
i -= 1
j -= 1
elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
dels += 1
i -= 1
else:
ins += 1
j -= 1
return dp[m][n], subs, dels, ins
def calculate_wer(reference: str, hypothesis: str) -> dict:
"""
Calculate WER (Word Error Rate)
Args:
reference: Reference text
hypothesis: Recognition result
Returns:
Dictionary containing error statistics
"""
ref_words = reference.split()
hyp_words = hypothesis.split()
dist, subs, dels, ins = levenshtein_distance(ref_words, hyp_words)
n_words = len(ref_words)
wer = (dist / n_words * 100) if n_words > 0 else 0
return {
'WER': wer,
'substitutions': subs,
'deletions': dels,
'insertions': ins,
'total_errors': dist,
'total_words': n_words
}
# Test
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"
result = calculate_wer(reference, hypothesis)
print("=== WER Calculation Example ===")
print(f"Reference: {reference}")
print(f"Hypothesis: {hypothesis}")
print(f"\nWER: {result['WER']:.2f}%")
print(f"Substitutions: {result['substitutions']}")
print(f"Deletions: {result['deletions']}")
print(f"Insertions: {result['insertions']}")
print(f"Total Errors: {result['total_errors']}")
print(f"Total Words: {result['total_words']}")
Output:
=== WER Calculation Example ===
Reference: the quick brown fox jumps over the lazy dog
Hypothesis: the quick brown fox jumped over a lazy dog
WER: 22.22%
Substitutions: 2
Deletions: 0
Insertions: 0
Total Errors: 2
Total Words: 9
Important: WER can exceed 100% (when there are many insertion errors).
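CER follows the same recipe at the character level. The following is a minimal sketch reusing the levenshtein_distance() defined above (spaces are stripped before comparison, as is common for Japanese and Chinese):
def calculate_cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate, reusing the word-level edit distance on characters"""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    dist, _, _, _ = levenshtein_distance(ref_chars, hyp_chars)
    return dist / len(ref_chars) * 100 if ref_chars else 0.0

print(f"CER: {calculate_cer('speech recognition', 'speech recognytion'):.2f}%")  # 1 of 17 chars → 5.88%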
2.2 Hidden Markov Models (HMM)
HMM Fundamentals
Hidden Markov Models (HMM) are probabilistic models consisting of unobservable hidden states and observable outputs.
Components of HMM
- $N$: Number of states
- $M$: Number of observation symbols
- $A = \{a_{ij}\}$: State transition probability matrix $a_{ij} = P(q_{t+1}=j | q_t=i)$
- $B = \{b_j(k)\}$: Output probability distribution $b_j(k) = P(o_t=k | q_t=j)$
- $\pi = \{\pi_i\}$: Initial state probability $\pi_i = P(q_1=i)$
Three Fundamental Problems of HMM
| Problem | Description | Algorithm |
|---|---|---|
| Evaluation | Calculate probability of observation sequence | Forward-Backward |
| Decoding | Estimate most likely state sequence | Viterbi |
| Learning | Estimate parameters | Baum-Welch (EM) |
Forward-Backward Algorithm
Forward Algorithm calculates the probability of the observation sequence $P(O|\lambda)$:
$$ \alpha_t(i) = P(o_1, o_2, \ldots, o_t, q_t=i | \lambda) $$
Recursion formula:
$$ \alpha_t(j) = \left[\sum_{i=1}^N \alpha_{t-1}(i) a_{ij}\right] b_j(o_t) $$
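A minimal vectorized sketch of this recursion (parameter names follow the notation above; pi, A, and B are NumPy arrays and obs is a list of observation indices):
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: returns P(O|lambda) and the full alpha table"""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                    # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]  # recursion step
    return alpha[-1].sum(), alpha
In practice the recursion is computed in log space (or with per-step scaling) to avoid numerical underflow on long sequences.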
Viterbi Algorithm
Viterbi Algorithm finds the most likely state sequence:
$$ \delta_t(i) = \max_{q_1, \ldots, q_{t-1}} P(q_1, \ldots, q_{t-1}, q_t=i, o_1, \ldots, o_t | \lambda) $$
Recursion formula:
$$ \delta_t(j) = \left[\max_i \delta_{t-1}(i) a_{ij}\right] b_j(o_t) $$
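And a matching sketch of the Viterbi recursion with backpointers, following the same conventions as the forward() sketch above:
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi algorithm: returns the most likely state path and its probability"""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)        # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t-1][:, None] * A     # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)       # best predecessor for each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]         # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()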
Implementation: Basic HMM Operations
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - hmmlearn>=0.3.0
"""
Example: Basic HMM Operations
Purpose: Demonstrate HMM evaluation (Forward) and decoding (Viterbi) on a toy weather model
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: numpy, hmmlearn
"""
import numpy as np
from hmmlearn import hmm
# Example: Weather model
# States: 0=Sunny, 1=Rainy
# Observations: 0=Walk, 1=Shopping, 2=Cleaning
# HMM parameter definition
n_states = 2
n_observations = 3
# Build the model
# Note: in hmmlearn >= 0.3 the discrete-observation HMM is CategoricalHMM;
# older releases (< 0.2.8) called this model MultinomialHMM
model = hmm.CategoricalHMM(n_components=n_states, random_state=42)
# Initial state probabilities
model.startprob_ = np.array([0.6, 0.4])
# State transition probabilities
model.transmat_ = np.array([
[0.7, 0.3], # Sunny → [Sunny, Rainy]
[0.4, 0.6] # Rainy → [Sunny, Rainy]
])
# Emission probabilities
model.emissionprob_ = np.array([
[0.6, 0.3, 0.1], # Sunny day: [Walk, Shopping, Cleaning]
[0.1, 0.4, 0.5] # Rainy day: [Walk, Shopping, Cleaning]
])
print("=== HMM Parameters ===")
print("\nInitial state probabilities:")
print(f" Sunny: {model.startprob_[0]:.2f}")
print(f" Rainy: {model.startprob_[1]:.2f}")
print("\nState transition probabilities:")
print(model.transmat_)
print("\nEmission probabilities:")
print(" Walk Shopping Cleaning")
print(f"Sunny: {model.emissionprob_[0]}")
print(f"Rainy: {model.emissionprob_[1]}")
# Observation sequence
observations = np.array([[0], [1], [2], [1], [0]]) # Walk, Shopping, Cleaning, Shopping, Walk
# Forward algorithm: Probability of observation sequence
log_prob = model.score(observations)
print(f"\nLog-likelihood of observation sequence: {log_prob:.4f}")
print(f"Probability of observation sequence: {np.exp(log_prob):.6f}")
# Viterbi algorithm: Most likely state sequence
log_prob, states = model.decode(observations)
state_names = ['Sunny', 'Rainy']
print(f"\nMost likely state sequence:")
for i, (obs, state) in enumerate(zip(observations.flatten(), states)):
obs_names = ['Walk', 'Shopping', 'Cleaning']
print(f" Day{i+1}: Observation={obs_names[obs]}, State={state_names[state]}")
Output:
=== HMM Parameters ===
Initial state probabilities:
Sunny: 0.60
Rainy: 0.40
State transition probabilities:
[[0.7 0.3]
[0.4 0.6]]
Emission probabilities:
Walk Shopping Cleaning
Sunny: [0.6 0.3 0.1]
Rainy: [0.1 0.4 0.5]
Log-likelihood of observation sequence: -5.3980
Probability of observation sequence: 0.004525
Most likely state sequence:
Day1: Observation=Walk, State=Sunny
Day2: Observation=Shopping, State=Rainy
Day3: Observation=Cleaning, State=Rainy
Day4: Observation=Shopping, State=Rainy
Day5: Observation=Walk, State=Sunny
Phoneme Modeling with HMM
In speech recognition, each phoneme is modeled with a 3-state Left-to-Right HMM:
Start → State 1 (phoneme begin) → State 2 (phoneme middle) → State 3 (phoneme end) → End, where each state also has a self-loop that absorbs variable phoneme duration.
# Left-to-Right HMM (Phoneme model)
n_states = 3
# Left-to-Right structure transition matrix
# States can only advance or self-loop
lr_model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", random_state=42)
# Transition probabilities (left-to-right only)
lr_model.transmat_ = np.array([
[0.5, 0.5, 0.0], # State 1: self-loop or to State 2
[0.0, 0.5, 0.5], # State 2: self-loop or to State 3
[0.0, 0.0, 1.0] # State 3: self-loop only
])
lr_model.startprob_ = np.array([1.0, 0.0, 0.0]) # Always start from State 1
print("=== Left-to-Right HMM ===")
print("Transition probability matrix:")
print(lr_model.transmat_)
print("\nStarts from State 1 and can only move left-to-right")
2.3 Gaussian Mixture Models (GMM)
GMM Fundamentals
Gaussian Mixture Models (GMM) represent a probability distribution as a linear combination of multiple Gaussian distributions:
$$ p(x) = \sum_{k=1}^K w_k \mathcal{N}(x | \mu_k, \Sigma_k) $$
- $K$: Number of mixture components
- $w_k$: Mixture weights ($\sum_k w_k = 1$)
- $\mu_k$: Mean vector
- $\Sigma_k$: Covariance matrix
Learning with EM Algorithm
GMM parameters are estimated using the EM (Expectation-Maximization) algorithm:
E-step: Calculate Responsibilities
$$ \gamma_{nk} = \frac{w_k \mathcal{N}(x_n | \mu_k, \Sigma_k)}{\sum_{j=1}^K w_j \mathcal{N}(x_n | \mu_j, \Sigma_j)} $$
M-step: Update Parameters
$$ \mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk} x_n $$
$$ \Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk} (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^T $$
$$ w_k^{\text{new}} = \frac{N_k}{N}, \quad N_k = \sum_{n=1}^N \gamma_{nk} $$
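The following is a minimal 1-D sketch of one EM iteration that implements these update equations directly in NumPy (K components with scalar variances; variable names mirror the symbols above):
import numpy as np

def em_step_1d(x, w, mu, sigma2):
    """One EM iteration for a 1-D GMM: returns updated (weights, means, variances)"""
    # E-step: responsibilities gamma[n, k] = w_k N(x_n | mu_k, sigma2_k) / normalizer
    dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    gamma = (w * dens) / (w * dens).sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibilities
    Nk = gamma.sum(axis=0)
    mu_new = (gamma * x[:, None]).sum(axis=0) / Nk
    sigma2_new = (gamma * (x[:, None] - mu_new) ** 2).sum(axis=0) / Nk
    w_new = Nk / len(x)
    return w_new, mu_new, sigma2_new
Iterating this step to convergence is essentially what sklearn's GaussianMixture.fit() does, with additional safeguards such as covariance regularization.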
Implementation: Clustering with GMM
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - scikit-learn>=1.3.0
# - matplotlib>=3.7.0
"""
Example: Clustering with GMM
Purpose: Fit a Gaussian Mixture Model and visualize clusters and densities
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: numpy, scikit-learn, matplotlib
"""
import numpy as np
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
# Data generated from 3 Gaussian distributions
np.random.seed(42)
# Data generation
n_samples = 300
X1 = np.random.randn(n_samples // 3, 2) * 0.5 + np.array([0, 0])
X2 = np.random.randn(n_samples // 3, 2) * 0.7 + np.array([3, 3])
X3 = np.random.randn(n_samples // 3, 2) * 0.6 + np.array([0, 3])
X = np.vstack([X1, X2, X3])
# GMM clustering
n_components = 3
gmm = GaussianMixture(n_components=n_components, covariance_type='full', random_state=42)
gmm.fit(X)
# Prediction
labels = gmm.predict(X)
proba = gmm.predict_proba(X)
print("=== GMM Parameters ===")
print(f"Number of components: {n_components}")
print(f"Number of iterations to converge: {gmm.n_iter_}")
print(f"Log-likelihood: {gmm.score(X) * len(X):.2f}")
print("\nMixture weights:")
for i, weight in enumerate(gmm.weights_):
print(f" Component{i+1}: {weight:.3f}")
print("\nMean vectors:")
for i, mean in enumerate(gmm.means_):
print(f" Component{i+1}: [{mean[0]:.2f}, {mean[1]:.2f}]")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Clustering results
axes[0].scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.6, edgecolors='black')
axes[0].scatter(gmm.means_[:, 0], gmm.means_[:, 1], c='red', s=200, marker='X',
edgecolors='black', linewidths=2, label='Centers')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
axes[0].set_title('GMM Clustering', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Probability density visualization
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
np.linspace(y_min, y_max, 100))
Z = -gmm.score_samples(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
axes[1].contourf(xx, yy, Z, levels=20, cmap='viridis', alpha=0.6)
axes[1].scatter(X[:, 0], X[:, 1], c='white', s=10, alpha=0.5, edgecolors='black')
axes[1].scatter(gmm.means_[:, 0], gmm.means_[:, 1], c='red', s=200, marker='X',
edgecolors='black', linewidths=2)
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
axes[1].set_title('GMM Probability Density', fontsize=14)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
GMM-HMM System
In traditional ASR, the output probabilities of each HMM state are modeled with GMMs:
$$ b_j(o_t) = \sum_{m=1}^M c_{jm} \mathcal{N}(o_t | \mu_{jm}, \Sigma_{jm}) $$
- $j$: HMM state
- $m$: GMM component
- $c_{jm}$: Weight of component $m$ in state $j$
from hmmlearn import hmm
# HMM with GMM emission distributions
n_states = 3
n_mix = 4  # Number of GMM components per state
# GMMHMM: each state's output distribution is a mixture of n_mix Gaussians
# (GaussianHMM is the special case n_mix = 1)
gmm_hmm = hmm.GMMHMM(n_components=n_states, n_mix=n_mix,
covariance_type='diag', n_iter=100, random_state=42)
# Training data generation (2D features)
np.random.seed(42)
n_samples = 200
train_data = np.random.randn(n_samples, 2) * 0.5
# Model training
gmm_hmm.fit(train_data)
print("\n=== GMM-HMM System ===")
print(f"Number of HMM states: {n_states}")
print(f"Number of GMM components per state: {n_mix}")
print(f"Number of iterations to converge: {gmm_hmm.monitor_.iter}")
print(f"Log-likelihood: {gmm_hmm.score(train_data) * len(train_data):.2f}")
# Decoding
test_data = np.random.randn(10, 2) * 0.5
log_prob, states = gmm_hmm.decode(test_data)
print(f"\nState sequence for test data:")
print(f" {states}")
print(f" Log probability: {log_prob:.4f}")
2.4 Language Models
N-gram Models
N-gram models approximate the probability of a word sequence using only the previous $n-1$ words:
$$ P(w_1, w_2, \ldots, w_m) \approx \prod_{i=1}^m P(w_i | w_{i-n+1}, \ldots, w_{i-1}) $$
Main N-gram Models
| Model | Definition | Example |
|---|---|---|
| Unigram | $P(w_i)$ | Independent word probability |
| Bigram | $P(w_i \mid w_{i-1})$ | Depends on previous word |
| Trigram | $P(w_i \mid w_{i-2}, w_{i-1})$ | Depends on previous 2 words |
Maximum Likelihood Estimation
N-gram probabilities are estimated from counts in the training corpus:
$$ P(w_i | w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})} $$
- $C(w_{i-1}, w_i)$: Count of bigram $(w_{i-1}, w_i)$
- $C(w_{i-1})$: Count of word $w_{i-1}$
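For example, counts from a toy corpus translate directly into MLE probabilities:
from collections import Counter

tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
# P(cat|the) = C(the, cat) / C(the) = 1 / 2
print(bigrams[("the", "cat")] / unigrams["the"])  # 0.5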
Perplexity
Perplexity is an evaluation metric for language models:
$$ \text{PPL} = P(w_1, \ldots, w_N)^{-1/N} = \sqrt[N]{\prod_{i=1}^N \frac{1}{P(w_i | w_1, \ldots, w_{i-1})}} $$
Lower is better.
Smoothing Techniques
Techniques for assigning probabilities to unobserved N-grams:
1. Add-k Smoothing (Laplace)
$$ P(w_i | w_{i-1}) = \frac{C(w_{i-1}, w_i) + k}{C(w_{i-1}) + k|V|} $$
2. Kneser-Ney Smoothing
Considers context diversity:
$$ P_{\text{KN}}(w_i | w_{i-1}) = \frac{\max(C(w_{i-1}, w_i) - \delta, 0)}{C(w_{i-1})} + \lambda(w_{i-1}) P_{\text{continuation}}(w_i) $$
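A minimal sketch of an interpolated Kneser-Ney bigram estimator built from raw counts (delta is the discount; this omits the refinements of modified Kneser-Ney used in production toolkits):
from collections import Counter, defaultdict

def build_kn_bigram(tokens, delta=0.75):
    """Returns p(w2 given w1) under interpolated Kneser-Ney smoothing"""
    bigrams = Counter(zip(tokens, tokens[1:]))
    history_counts = Counter(tokens[:-1])  # C(w1) as a history
    followers = defaultdict(set)           # word types seen after w1
    preceders = defaultdict(set)           # word types seen before w2
    for (w1, w2) in bigrams:
        followers[w1].add(w2)
        preceders[w2].add(w1)
    n_bigram_types = len(bigrams)

    def p_kn(w2, w1):
        p_cont = len(preceders[w2]) / n_bigram_types      # continuation probability
        c1 = history_counts[w1]
        if c1 == 0:
            return p_cont                                  # unseen history: back off fully
        discounted = max(bigrams[(w1, w2)] - delta, 0) / c1
        lam = delta * len(followers[w1]) / c1              # mass freed by discounting
        return discounted + lam * p_cont
    return p_kn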
Implementation: N-gram Language Model
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
import numpy as np
from collections import defaultdict, Counter
from typing import List, Tuple
class BigramLanguageModel:
"""
Bigram language model (with Add-k smoothing)
"""
def __init__(self, k: float = 1.0):
self.k = k
self.unigram_counts = Counter()
self.bigram_counts = defaultdict(Counter)
self.vocab = set()
def train(self, corpus: List[List[str]]):
"""
Train the model from corpus
Args:
corpus: List of sentences (each sentence is a list of words)
"""
for sentence in corpus:
# Add start/end tags
words = ['<s>'] + sentence + ['</s>']
for word in words:
self.vocab.add(word)
self.unigram_counts[word] += 1
for w1, w2 in zip(words[:-1], words[1:]):
self.bigram_counts[w1][w2] += 1
print(f"Vocabulary size: {len(self.vocab)}")
print(f"Total words: {sum(self.unigram_counts.values())}")
print(f"Unique bigrams: {sum(len(counts) for counts in self.bigram_counts.values())}")
def probability(self, w1: str, w2: str) -> float:
"""
Calculate bigram probability P(w2|w1) (with Add-k smoothing)
"""
numerator = self.bigram_counts[w1][w2] + self.k
denominator = self.unigram_counts[w1] + self.k * len(self.vocab)
return numerator / denominator
def sentence_probability(self, sentence: List[str]) -> float:
"""
Calculate sentence probability
"""
words = ['<s>'] + sentence + ['</s>']
prob = 1.0
for w1, w2 in zip(words[:-1], words[1:]):
prob *= self.probability(w1, w2)
return prob
def perplexity(self, test_corpus: List[List[str]]) -> float:
"""
Calculate perplexity of test corpus
"""
log_prob = 0
n_words = 0
for sentence in test_corpus:
words = ['<s>'] + sentence + ['</s>']
n_words += len(words) - 1
for w1, w2 in zip(words[:-1], words[1:]):
prob = self.probability(w1, w2)
log_prob += np.log2(prob)
return 2 ** (-log_prob / n_words)
# Sample corpus
train_corpus = [
['I', 'love', 'machine', 'learning'],
['machine', 'learning', 'is', 'fun'],
['I', 'love', 'deep', 'learning'],
['deep', 'learning', 'is', 'powerful'],
['I', 'study', 'machine', 'learning'],
]
test_corpus = [
['I', 'love', 'learning'],
['machine', 'learning', 'is', 'interesting']
]
# Model training
print("=== Bigram Language Model ===\n")
lm = BigramLanguageModel(k=0.1)
lm.train(train_corpus)
# Probability calculation
print("\nBigram probability examples:")
bigrams = [('I', 'love'), ('love', 'learning'), ('machine', 'learning'), ('learning', 'is')]
for w1, w2 in bigrams:
prob = lm.probability(w1, w2)
print(f" P({w2}|{w1}) = {prob:.4f}")
# Sentence probability
print("\nSentence probabilities:")
for sentence in test_corpus:
prob = lm.sentence_probability(sentence)
print(f" '{' '.join(sentence)}': {prob:.6e}")
# Perplexity
ppl = lm.perplexity(test_corpus)
print(f"\nTest corpus perplexity: {ppl:.2f}")
Output:
=== Bigram Language Model ===
Vocabulary size: 11
Total words: 30
Unique bigrams: 16
Bigram probability examples:
P(love|I) = 0.5122
P(learning|love) = 0.0323
P(learning|machine) = 0.7561
P(is|learning) = 0.3443
Sentence probabilities:
'I love learning': 4.267146e-03
'machine learning is interesting': 1.376499e-04
Test corpus perplexity: 4.92
KenLM Library
For practical N-gram language models, use KenLM:
# Advanced language model using KenLM
# Note: Installation required: pip install https://github.com/kpu/kenlm/archive/master.zip
import kenlm
# Load from ARPA format language model file
# model = kenlm.Model('path/to/model.arpa')
# Calculate sentence score
# score = model.score('this is a test sentence', bos=True, eos=True)
# perplexity = model.perplexity('this is a test sentence')
print("KenLM is an efficient N-gram language model implementation")
print("Optimized for training and querying on large corpora")
2.5 Traditional ASR Pipeline
Complete Pipeline Architecture
Waveform → Pre-processing (pre-emphasis) → Framing → Windowing → MFCC Extraction → Delta/Delta-Delta Features → Acoustic Model (GMM-HMM) → Viterbi Decoding + Language Model → Recognition Result (Text)
Implementation: Simple ASR System
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - librosa>=0.10.0
# - hmmlearn>=0.3.0
import numpy as np
import librosa
from hmmlearn import hmm
from typing import List, Tuple
class SimpleASR:
"""
Simple speech recognition system (for demonstration)
"""
def __init__(self, n_mfcc: int = 13, n_states: int = 3):
self.n_mfcc = n_mfcc
self.n_states = n_states
self.models = {} # HMM models for each word
def extract_features(self, audio_path: str, sr: int = 16000) -> np.ndarray:
"""
Extract MFCC features from audio file
"""
# Load audio
y, sr = librosa.load(audio_path, sr=sr)
# MFCC
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=self.n_mfcc)
# Delta features
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
# Concatenate
features = np.vstack([mfcc, delta, delta2])
return features.T # (time, features)
def train_word_model(self, word: str, audio_files: List[str]):
"""
Train HMM model for a specific word
"""
# Extract features from all training data
all_features = []
lengths = []
for audio_file in audio_files:
features = self.extract_features(audio_file)
all_features.append(features)
lengths.append(len(features))
# Concatenate
X = np.vstack(all_features)
# Left-to-Right HMM
model = hmm.GaussianHMM(
n_components=self.n_states,
covariance_type='diag',
n_iter=100,
random_state=42,
init_params='mc',  # do not overwrite the manually set startprob_/transmat_ below
params='tmc'       # EM keeps zero transitions at zero, preserving the L-R structure
)
# Constrain transition probabilities (Left-to-Right)
model.transmat_ = np.zeros((self.n_states, self.n_states))
for i in range(self.n_states):
if i < self.n_states - 1:
model.transmat_[i, i] = 0.5
model.transmat_[i, i+1] = 0.5
else:
model.transmat_[i, i] = 1.0
model.startprob_ = np.zeros(self.n_states)
model.startprob_[0] = 1.0
# Training
model.fit(X, lengths)
self.models[word] = model
print(f"Trained model for word '{word}'")
print(f" Number of training samples: {len(audio_files)}")
print(f" Total frames: {len(X)}")
def recognize(self, audio_path: str) -> Tuple[str, float]:
"""
Recognize audio file
Returns:
(recognized word, score)
"""
# Feature extraction
features = self.extract_features(audio_path)
# Calculate score for each word model
scores = {}
for word, model in self.models.items():
try:
score = model.score(features)
scores[word] = score
except Exception:
scores[word] = -np.inf
# Select word with highest score
best_word = max(scores, key=scores.get)
best_score = scores[best_word]
return best_word, best_score
# Demo usage example (requires actual audio files)
print("=== Simple ASR System ===\n")
print("This system operates as follows:")
print("1. Extract MFCC features (+ delta) from speech")
print("2. Model each word with Left-to-Right HMM")
print("3. Select most likely word using Viterbi algorithm")
print("\nActual usage requires audio files")
# asr = SimpleASR(n_mfcc=13, n_states=3)
#
# # Training (multiple audio samples per word)
# asr.train_word_model('hello', ['hello1.wav', 'hello2.wav', 'hello3.wav'])
# asr.train_word_model('world', ['world1.wav', 'world2.wav', 'world3.wav'])
#
# # Recognition
# word, score = asr.recognize('test.wav')
# print(f"Recognition result: {word} (Score: {score:.2f})")
Integration with Language Model
In actual ASR, the acoustic model and language model are integrated:
$$ \hat{W} = \arg\max_W \left[\log P(X|W) + \lambda \log P(W)\right] $$
- $\lambda$: Language model weight
class ASRWithLanguageModel:
"""
ASR with integrated language model
"""
def __init__(self, acoustic_model, language_model, lm_weight: float = 1.0):
self.acoustic_model = acoustic_model
self.language_model = language_model
self.lm_weight = lm_weight
def recognize_with_lm(self, audio_features: np.ndarray,
previous_words: List[str] = None) -> str:
"""
Recognition using language model
"""
# Acoustic score (for each word candidate)
acoustic_scores = {}
for word in self.acoustic_model.models.keys():
acoustic_scores[word] = self.acoustic_model.models[word].score(audio_features)
# Language model score
if previous_words:
lm_scores = {}
for word in acoustic_scores.keys():
# Bigram probability
prev_word = previous_words[-1] if previous_words else ''
lm_scores[word] = np.log(self.language_model.probability(prev_word, word))
else:
lm_scores = {word: 0 for word in acoustic_scores.keys()}
# Combined score
total_scores = {
word: acoustic_scores[word] + self.lm_weight * lm_scores[word]
for word in acoustic_scores.keys()
}
# Select best word
best_word = max(total_scores, key=total_scores.get)
return best_word
print("\n=== Language Model Integration ===")
print("By combining acoustic and language scores,")
print("recognition accuracy can be improved with context consideration")
2.6 Chapter Summary
What We Learned
Fundamentals of Speech Recognition
- ASR is a combination of acoustic and language models
- Evaluation using WER (Word Error Rate)
- Error calculation using Levenshtein distance
Hidden Markov Models
- Modeling state transitions and output probabilities
- Forward-Backward algorithm (Evaluation)
- Viterbi algorithm (Decoding)
- Baum-Welch algorithm (Learning)
Gaussian Mixture Models
- Density estimation with multiple Gaussian distributions
- Parameter estimation using EM algorithm
- Acoustic modeling with GMM-HMM
Language Models
- Probability modeling of word sequences with N-grams
- Smoothing techniques (handling unseen events)
- Evaluation using perplexity
ASR Pipeline
- Feature extraction (MFCC + delta)
- Acoustic model (GMM-HMM)
- Decoding (Viterbi)
- Language model integration
Advantages and Disadvantages of Traditional ASR
| Advantages | Disadvantages |
|---|---|
| Theoretically clear | Requires large amounts of labeled data |
| Independent components | Difficult to optimize entire pipeline |
| Interpretability at phoneme level | Weak modeling of long-term dependencies |
| Works with limited data | Dependent on feature engineering |
Next Chapter
In Chapter 3, we will learn about modern End-to-End speech recognition:
- Deep Speech (CTC)
- Listen, Attend and Spell
- Transformer-based ASR
- Wav2Vec 2.0
- Whisper
Exercises
Problem 1 (Difficulty: easy)
Calculate the WER by hand for the following reference and hypothesis sentences.
- Reference: "the cat sat on the mat"
- Hypothesis: "the cat sit on mat"
Solution
Answer:
Align reference and hypothesis:
Reference: the cat sat on the mat
Hypothesis: the cat sit on --- mat
Count errors:
- Substitutions (S): sat → sit (1)
- Deletions (D): the (1)
- Insertions (I): 0
Calculate WER:
$$ \text{WER} = \frac{S + D + I}{N} = \frac{1 + 1 + 0}{6} = \frac{2}{6} = 0.333 = 33.3\% $$
Answer: 33.3%
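This hand calculation can be checked against the calculate_wer() implementation from Section 2.1:
result = calculate_wer("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {result['WER']:.1f}%  (S={result['substitutions']}, D={result['deletions']}, I={result['insertions']})")
# WER: 33.3%  (S=1, D=1, I=0)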
Problem 2 (Difficulty: medium)
For a 3-state HMM with the following parameters, calculate the probability of the observation sequence [0, 1, 0] using the Forward algorithm.
# Initial probabilities
pi = [0.6, 0.3, 0.1]
# Transition probabilities
A = [[0.7, 0.2, 0.1],
[0.3, 0.5, 0.2],
[0.2, 0.3, 0.5]]
# Emission probabilities (observations 0 and 1)
B = [[0.8, 0.2],
[0.4, 0.6],
[0.3, 0.7]]
Solution
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: For a 3-state HMM with the following parameters, calculate t
Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner to Intermediate
Execution time: 5-10 seconds
Dependencies: None
"""
import numpy as np
# Parameters
pi = np.array([0.6, 0.3, 0.1])
A = np.array([[0.7, 0.2, 0.1],
[0.3, 0.5, 0.2],
[0.2, 0.3, 0.5]])
B = np.array([[0.8, 0.2],
[0.4, 0.6],
[0.3, 0.7]])
observations = [0, 1, 0]
T = len(observations)
N = len(pi)
# Forward variables
alpha = np.zeros((T, N))
# Initialization (t=0)
alpha[0] = pi * B[:, observations[0]]
print(f"t=0: α = {alpha[0]}")
# Recursion (t=1, 2, ...)
for t in range(1, T):
for j in range(N):
alpha[t, j] = np.sum(alpha[t-1] * A[:, j]) * B[j, observations[t]]
print(f"t={t}: α = {alpha[t]}")
# Probability of observation sequence
prob = np.sum(alpha[T-1])
print(f"\nProbability of observation sequence {observations}:")
print(f"P(O|λ) = {prob:.6f}")
Output:
t=0: α = [0.48 0.12 0.03]
t=1: α = [0.0756 0.099  0.0609]
t=2: α = [0.07584  0.033156 0.017343]
Probability of observation sequence [0, 1, 0]:
P(O|λ) = 0.126339
Problem 3 (Difficulty: medium)
In a bigram language model, estimate P("learning"|"machine") using maximum likelihood estimation from the following corpus.
I love machine learning
machine learning is fun
I study machine learning
deep learning is great
Solution
Answer:
Counts:
- C(machine, learning) = 3
- C(machine) = 3
Maximum likelihood estimation:
$$ P(\text{learning} | \text{machine}) = \frac{C(\text{machine}, \text{learning})}{C(\text{machine})} = \frac{3}{3} = 1.0 $$
Answer: 1.0 (100%)
In this corpus, "machine" is always followed by "learning".
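The counts are easy to verify programmatically:
from collections import Counter

corpus = [
    "I love machine learning",
    "machine learning is fun",
    "I study machine learning",
    "deep learning is great",
]
sentences = [s.split() for s in corpus]
bigrams = Counter(b for s in sentences for b in zip(s, s[1:]))
unigrams = Counter(w for s in sentences for w in s)
print(bigrams[("machine", "learning")] / unigrams["machine"])  # 3 / 3 = 1.0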
Problem 4 (Difficulty: hard)
Using a GMM with 2 components (K=2), cluster the following 1D data and find the parameters (mean, variance, weight) of each component.
data = np.array([1.2, 1.5, 1.8, 2.0, 2.1, 8.5, 9.0, 9.2, 9.5, 10.0])
Solution
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - scikit-learn>=1.3.0
# - matplotlib>=3.7.0
"""
Example: Clustering 1-D data with a 2-component GMM
Purpose: Estimate the mean, variance, and weight of each component
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: numpy, scikit-learn, matplotlib
"""
import numpy as np
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
data = np.array([1.2, 1.5, 1.8, 2.0, 2.1, 8.5, 9.0, 9.2, 9.5, 10.0])
X = data.reshape(-1, 1)
# GMM clustering
gmm = GaussianMixture(n_components=2, random_state=42)
gmm.fit(X)
# Parameters
print("=== GMM Parameters ===")
print(f"\nComponent 1:")
print(f" Mean: {gmm.means_[0][0]:.3f}")
print(f" Variance: {gmm.covariances_[0][0][0]:.3f}")
print(f" Weight: {gmm.weights_[0]:.3f}")
print(f"\nComponent 2:")
print(f" Mean: {gmm.means_[1][0]:.3f}")
print(f" Variance: {gmm.covariances_[1][0][0]:.3f}")
print(f" Weight: {gmm.weights_[1]:.3f}")
# Cluster labels
labels = gmm.predict(X)
print(f"\nCluster labels:")
for i, (val, label) in enumerate(zip(data, labels)):
print(f" Data{i+1}: {val:.1f} → Cluster{label}")
# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(data, np.zeros_like(data), c=labels, cmap='viridis',
s=100, alpha=0.6, edgecolors='black')
plt.scatter(gmm.means_, [0, 0], c='red', s=200, marker='X',
edgecolors='black', linewidths=2, label='Centers')
plt.xlabel('Value')
plt.title('1D Data Clustering with GMM', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Example output:
=== GMM Parameters ===
Component 1:
Mean: 1.720
Variance: 0.110
Weight: 0.500
Component 2:
Mean: 9.240
Variance: 0.250
Weight: 0.500
Cluster labels:
Data1: 1.2 → Cluster0
Data2: 1.5 → Cluster0
Data3: 1.8 → Cluster0
Data4: 2.0 → Cluster0
Data5: 2.1 → Cluster0
Data6: 8.5 → Cluster1
Data7: 9.0 → Cluster1
Data8: 9.2 → Cluster1
Data9: 9.5 → Cluster1
Data10: 10.0 → Cluster1
Problem 5 (Difficulty: hard)
In an ASR system that integrates acoustic and language models, explain the impact of varying the language model weight (LM weight) on recognition results. Also, describe how the optimal weight should be determined.
Solution
Answer:
Impact of LM weight:
Combined score:
$$ \text{Score}(W) = \log P(X|W) + \lambda \log P(W) $$
| LM weight $\lambda$ | Impact | Recognition tendency |
|---|---|---|
| Small (close to 0) | Acoustic model priority | Selects acoustically similar words, grammatically unnatural |
| Appropriate | Balanced | Considers both acoustic and grammar, best recognition accuracy |
| Large | Language model priority | Grammatically correct but acoustically incorrect, biased toward frequent words |
Methods for determining optimal weight:
1. Grid search on development set (sketch; evaluate_asr is a hypothetical helper that returns WER on a development set):
lambda_values = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
best_lambda = None
best_wer = float('inf')
for lam in lambda_values:
    wer = evaluate_asr(dev_set, lm_weight=lam)
    if wer < best_wer:
        best_wer = wer
        best_lambda = lam
print(f"Optimal LM weight: {best_lambda}")
2. Considering domain dependency
- Read speech: High acoustic quality → smaller $\lambda$
- Spontaneous speech: Noisy → larger $\lambda$
- Many technical terms: Low language model reliability → smaller $\lambda$
3. Dynamic adjustment
- Dynamically adjust based on confidence scores
- Adjust according to acoustic quality (SNR)
Example:
Speech: "I scream" (ice cream)
λ = 0.1 (acoustic priority):
→ "I scream" (acoustically accurate)
λ = 5.0 (language priority):
→ "ice cream" (grammatically natural, high frequency in language model)
λ = 1.0 (balanced):
→ Appropriate selection depending on context
Conclusion: The optimal LM weight depends on domain, acoustic conditions, and language model quality, and requires experimental tuning on a development set.
References
- Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
- Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing (3rd ed.). Draft.
- Gales, M., & Young, S. (2008). The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing, 1(3), 195-304.
- Heafield, K. (2011). KenLM: Faster and smaller language model queries. Proceedings of the Sixth Workshop on Statistical Machine Translation, 187-197.