This chapter covers traditional speech recognition. You will learn the principles of Hidden Markov Models (HMMs) and build a complete HMM-GMM based ASR pipeline.
Learning Objectives
By reading this chapter, you will be able to:
- ✅ Understand the definition and evaluation metrics of Automatic Speech Recognition (ASR) tasks
- ✅ Understand and implement Hidden Markov Model (HMM) principles and algorithms
- ✅ Understand acoustic modeling using Gaussian Mixture Models (GMM)
- ✅ Build and evaluate language models (N-grams)
- ✅ Construct a complete HMM-GMM based ASR pipeline
- ✅ Evaluate systems using Word Error Rate (WER)
2.1 Fundamentals of Speech Recognition
Definition of ASR Task
Automatic Speech Recognition (ASR) is the task of converting a speech signal into text.
Goal of speech recognition: given an observed acoustic signal $X$, find the most likely word sequence $W$.
This is formulated using Bayes' theorem as follows:
$$ \hat{W} = \arg\max_{W} P(W|X) = \arg\max_{W} \frac{P(X|W) P(W)}{P(X)} = \arg\max_{W} P(X|W) P(W) $$
The denominator $P(X)$ does not depend on $W$, so it can be dropped from the maximization.
- $P(X|W)$: Acoustic Model - probability of the acoustic signal given a word sequence
- $P(W)$: Language Model - prior probability of the word sequence
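To make the decision rule concrete, here is a minimal sketch that scores two candidate transcriptions in the log domain. The candidate sentences and all score values are made-up illustrations, not outputs of real acoustic or language models:
# Hypothetical scores for two candidate transcriptions (illustrative numbers only)
log_acoustic = {"recognize speech": -120.5, "wreck a nice beach": -118.9}  # log P(X|W)
log_lm = {"recognize speech": -8.2, "wreck a nice beach": -14.7}           # log P(W)
# argmax over candidates of log P(X|W) + log P(W)
best = max(log_acoustic, key=lambda w: log_acoustic[w] + log_lm[w])
print(f"Best hypothesis: {best}")  # -128.7 beats -133.6 → 'recognize speech'
Even though the second candidate scores slightly better acoustically, the language model prior tips the decision toward the more plausible word sequence.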
Components of ASR Systems
Audio Input → Feature Extraction (MFCC) → Acoustic Model (HMM-GMM) → Decoding (Viterbi) → Language Model (N-gram) → Recognition Result (Text)
Evaluation Metrics
Word Error Rate (WER)
WER is the standard evaluation metric for speech recognition:
$$ \text{WER} = \frac{S + D + I}{N} \times 100\% $$
- $S$: Substitutions
- $D$: Deletions
- $I$: Insertions
- $N$: Total number of words in the reference text
Character Error Rate (CER)
For languages like Japanese and Chinese, character-level CER is also used:
$$ \text{CER} = \frac{S_c + D_c + I_c}{N_c} \times 100\% $$
Implementation: WER Calculation
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
import numpy as np
from typing import List, Tuple
def levenshtein_distance(ref: List[str], hyp: List[str]) -> Tuple[int, int, int, int]:
"""
Calculate Levenshtein distance and error statistics
Returns:
(distance, substitutions, deletions, insertions)
"""
m, n = len(ref), len(hyp)
# DP table
dp = np.zeros((m + 1, n + 1), dtype=int)
# Initialization
for i in range(m + 1):
dp[i][0] = i
for j in range(n + 1):
dp[0][j] = j
# DP
for i in range(1, m + 1):
for j in range(1, n + 1):
if ref[i-1] == hyp[j-1]:
dp[i][j] = dp[i-1][j-1]
else:
dp[i][j] = min(
dp[i-1][j-1] + 1, # substitution
dp[i-1][j] + 1, # deletion
dp[i][j-1] + 1 # insertion
)
# Backtrack: Calculate error statistics
i, j = m, n
subs, dels, ins = 0, 0, 0
while i > 0 or j > 0:
if i > 0 and j > 0 and ref[i-1] == hyp[j-1]:
i -= 1
j -= 1
elif i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + 1:
subs += 1
i -= 1
j -= 1
elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
dels += 1
i -= 1
else:
ins += 1
j -= 1
return dp[m][n], subs, dels, ins
def calculate_wer(reference: str, hypothesis: str) -> dict:
"""
Calculate WER (Word Error Rate)
Args:
reference: Reference text
hypothesis: Recognition result
Returns:
Dictionary containing error statistics
"""
ref_words = reference.split()
hyp_words = hypothesis.split()
dist, subs, dels, ins = levenshtein_distance(ref_words, hyp_words)
n_words = len(ref_words)
wer = (dist / n_words * 100) if n_words > 0 else 0
return {
'WER': wer,
'substitutions': subs,
'deletions': dels,
'insertions': ins,
'total_errors': dist,
'total_words': n_words
}
# Test
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"
result = calculate_wer(reference, hypothesis)
print("=== WER Calculation Example ===")
print(f"Reference: {reference}")
print(f"Hypothesis: {hypothesis}")
print(f"\nWER: {result['WER']:.2f}%")
print(f"Substitutions: {result['substitutions']}")
print(f"Deletions: {result['deletions']}")
print(f"Insertions: {result['insertions']}")
print(f"Total Errors: {result['total_errors']}")
print(f"Total Words: {result['total_words']}")
Output:
=== WER Calculation Example ===
Reference: the quick brown fox jumps over the lazy dog
Hypothesis: the quick brown fox jumped over a lazy dog
WER: 22.22%
Substitutions: 2
Deletions: 0
Insertions: 0
Total Errors: 2
Total Words: 9
Important: WER can exceed 100% (when there are many insertion errors).
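CER follows the same recipe at the character level. The following is a minimal sketch reusing the levenshtein_distance() defined above (spaces are stripped before comparison, as is common for Japanese and Chinese):
def calculate_cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate, reusing the word-level edit distance on characters"""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    dist, _, _, _ = levenshtein_distance(ref_chars, hyp_chars)
    return dist / len(ref_chars) * 100 if ref_chars else 0.0

print(f"CER: {calculate_cer('speech recognition', 'speech recognytion'):.2f}%")  # 1 of 17 chars → 5.88%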
2.2 Hidden Markov Models (HMM)
HMM Fundamentals
Hidden Markov Models (HMM) are probabilistic models consisting of unobservable hidden states and observable outputs.
Components of HMM
- $N$: Number of states
- $M$: Number of observation symbols
- $A = \{a_{ij}\}$: State transition probability matrix $a_{ij} = P(q_{t+1}=j | q_t=i)$
- $B = \{b_j(k)\}$: Output probability distribution $b_j(k) = P(o_t=k | q_t=j)$
- $\pi = \{\pi_i\}$: Initial state probability $\pi_i = P(q_1=i)$
Three Fundamental Problems of HMM
| Problem | Description | Algorithm |
|---|---|---|
| Evaluation | Calculate probability of observation sequence | Forward-Backward |
| Decoding | Estimate most likely state sequence | Viterbi |
| Learning | Estimate parameters | Baum-Welch (EM) |
Forward-Backward Algorithm
Forward Algorithm calculates the probability of the observation sequence $P(O|\lambda)$:
$$ \alpha_t(i) = P(o_1, o_2, \ldots, o_t, q_t=i | \lambda) $$
Recursion formula:
$$ \alpha_t(j) = \left[\sum_{i=1}^N \alpha_{t-1}(i) a_{ij}\right] b_j(o_t) $$
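A minimal vectorized sketch of this recursion (parameter names follow the notation above; pi, A, and B are NumPy arrays and obs is a list of observation indices):
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: returns P(O|lambda) and the full alpha table"""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                    # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]  # recursion step
    return alpha[-1].sum(), alpha
In practice the recursion is computed in log space (or with per-step scaling) to avoid numerical underflow on long sequences.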
Viterbi Algorithm
Viterbi Algorithm finds the most likely state sequence:
$$ \delta_t(i) = \max_{q_1, \ldots, q_{t-1}} P(q_1, \ldots, q_{t-1}, q_t=i, o_1, \ldots, o_t | \lambda) $$
Recursion formula:
$$ \delta_t(j) = \left[\max_i \delta_{t-1}(i) a_{ij}\right] b_j(o_t) $$
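And a matching sketch of the Viterbi recursion with backpointers, following the same conventions as the forward() sketch above:
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi algorithm: returns the most likely state path and its probability"""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)        # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t-1][:, None] * A     # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)       # best predecessor for each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]         # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()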
Implementation: Basic HMM Operations
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - hmmlearn>=0.3.0
"""
Example: Basic HMM Operations
Purpose: Demonstrate HMM evaluation (Forward) and decoding (Viterbi) on a toy weather model
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: numpy, hmmlearn
"""
import numpy as np
from hmmlearn import hmm
# Example: Weather model
# States: 0=Sunny, 1=Rainy
# Observations: 0=Walk, 1=Shopping, 2=Cleaning
# HMM parameter definition
n_states = 2
n_observations = 3
# Build the model
# Note: in hmmlearn >= 0.3 the discrete-observation HMM is CategoricalHMM;
# older releases (< 0.2.8) called this model MultinomialHMM
model = hmm.CategoricalHMM(n_components=n_states, random_state=42)
# Initial state probabilities
model.startprob_ = np.array([0.6, 0.4])
# State transition probabilities
model.transmat_ = np.array([
[0.7, 0.3], # Sunny → [Sunny, Rainy]
[0.4, 0.6] # Rainy → [Sunny, Rainy]
])
# Emission probabilities
model.emissionprob_ = np.array([
[0.6, 0.3, 0.1], # Sunny day: [Walk, Shopping, Cleaning]
[0.1, 0.4, 0.5] # Rainy day: [Walk, Shopping, Cleaning]
])
print("=== HMM Parameters ===")
print("\nInitial state probabilities:")
print(f" Sunny: {model.startprob_[0]:.2f}")
print(f" Rainy: {model.startprob_[1]:.2f}")
print("\nState transition probabilities:")
print(model.transmat_)
print("\nEmission probabilities:")
print(" Walk Shopping Cleaning")
print(f"Sunny: {model.emissionprob_[0]}")
print(f"Rainy: {model.emissionprob_[1]}")
# Observation sequence
observations = np.array([[0], [1], [2], [1], [0]]) # Walk, Shopping, Cleaning, Shopping, Walk
# Forward algorithm: Probability of observation sequence
log_prob = model.score(observations)
print(f"\nLog-likelihood of observation sequence: {log_prob:.4f}")
print(f"Probability of observation sequence: {np.exp(log_prob):.6f}")
# Viterbi algorithm: Most likely state sequence
log_prob, states = model.decode(observations)
state_names = ['Sunny', 'Rainy']
print(f"\nMost likely state sequence:")
for i, (obs, state) in enumerate(zip(observations.flatten(), states)):
obs_names = ['Walk', 'Shopping', 'Cleaning']
print(f" Day{i+1}: Observation={obs_names[obs]}, State={state_names[state]}")
Output:
=== HMM Parameters ===
Initial state probabilities:
Sunny: 0.60
Rainy: 0.40
State transition probabilities:
[[0.7 0.3]
[0.4 0.6]]
Emission probabilities:
Walk Shopping Cleaning
Sunny: [0.6 0.3 0.1]
Rainy: [0.1 0.4 0.5]
Log-likelihood of observation sequence: -5.3980
Probability of observation sequence: 0.004525
Most likely state sequence:
Day1: Observation=Walk, State=Sunny
Day2: Observation=Shopping, State=Rainy
Day3: Observation=Cleaning, State=Rainy
Day4: Observation=Shopping, State=Rainy
Day5: Observation=Walk, State=Sunny
Phoneme Modeling with HMM
In speech recognition, each phoneme is modeled with a 3-state Left-to-Right HMM:
Start → State 1 (phoneme begin) → State 2 (phoneme middle) → State 3 (phoneme end) → End, where each state also has a self-loop that absorbs variable phoneme duration.
# Left-to-Right HMM (Phoneme model)
n_states = 3
# Left-to-Right structure transition matrix
# States can only advance or self-loop
lr_model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", random_state=42)
# Transition probabilities (left-to-right only)
lr_model.transmat_ = np.array([
[0.5, 0.5, 0.0], # State 1: self-loop or to State 2
[0.0, 0.5, 0.5], # State 2: self-loop or to State 3
[0.0, 0.0, 1.0] # State 3: self-loop only
])
lr_model.startprob_ = np.array([1.0, 0.0, 0.0]) # Always start from State 1
print("=== Left-to-Right HMM ===")
print("Transition probability matrix:")
print(lr_model.transmat_)
print("\nStarts from State 1 and can only move left-to-right")
2.3 Gaussian Mixture Models (GMM)
GMM Fundamentals
Gaussian Mixture Models (GMM) represent a probability distribution as a linear combination of multiple Gaussian distributions:
$$ p(x) = \sum_{k=1}^K w_k \mathcal{N}(x | \mu_k, \Sigma_k) $$
- $K$: Number of mixture components
- $w_k$: Mixture weights ($\sum_k w_k = 1$)
- $\mu_k$: Mean vector
- $\Sigma_k$: Covariance matrix
Learning with EM Algorithm
GMM parameters are estimated using the EM (Expectation-Maximization) algorithm:
E-step: Calculate Responsibilities
$$ \gamma_{nk} = \frac{w_k \mathcal{N}(x_n | \mu_k, \Sigma_k)}{\sum_{j=1}^K w_j \mathcal{N}(x_n | \mu_j, \Sigma_j)} $$
M-step: Update Parameters
$$ \mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk} x_n $$
$$ \Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk} (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^T $$
$$ w_k^{\text{new}} = \frac{N_k}{N}, \quad N_k = \sum_{n=1}^N \gamma_{nk} $$
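The following is a minimal 1-D sketch of one EM iteration that implements these update equations directly in NumPy (K components with scalar variances; variable names mirror the symbols above):
import numpy as np

def em_step_1d(x, w, mu, sigma2):
    """One EM iteration for a 1-D GMM: returns updated (weights, means, variances)"""
    # E-step: responsibilities gamma[n, k] = w_k N(x_n | mu_k, sigma2_k) / normalizer
    dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    gamma = (w * dens) / (w * dens).sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibilities
    Nk = gamma.sum(axis=0)
    mu_new = (gamma * x[:, None]).sum(axis=0) / Nk
    sigma2_new = (gamma * (x[:, None] - mu_new) ** 2).sum(axis=0) / Nk
    w_new = Nk / len(x)
    return w_new, mu_new, sigma2_new
Iterating this step to convergence is essentially what sklearn's GaussianMixture.fit() does, with additional safeguards such as covariance regularization.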
Implementation: Clustering with GMM
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - scikit-learn>=1.3.0
# - matplotlib>=3.7.0
"""
Example: Clustering with GMM
Purpose: Fit a Gaussian Mixture Model and visualize clusters and densities
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: numpy, scikit-learn, matplotlib
"""
import numpy as np
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
# Data generated from 3 Gaussian distributions
np.random.seed(42)
# Data generation
n_samples = 300
X1 = np.random.randn(n_samples // 3, 2) * 0.5 + np.array([0, 0])
X2 = np.random.randn(n_samples // 3, 2) * 0.7 + np.array([3, 3])
X3 = np.random.randn(n_samples // 3, 2) * 0.6 + np.array([0, 3])
X = np.vstack([X1, X2, X3])
# GMM clustering
n_components = 3
gmm = GaussianMixture(n_components=n_components, covariance_type='full', random_state=42)
gmm.fit(X)
# Prediction
labels = gmm.predict(X)
proba = gmm.predict_proba(X)
print("=== GMM Parameters ===")
print(f"Number of components: {n_components}")
print(f"Number of iterations to converge: {gmm.n_iter_}")
print(f"Log-likelihood: {gmm.score(X) * len(X):.2f}")
print("\nMixture weights:")
for i, weight in enumerate(gmm.weights_):
print(f" Component{i+1}: {weight:.3f}")
print("\nMean vectors:")
for i, mean in enumerate(gmm.means_):
print(f" Component{i+1}: [{mean[0]:.2f}, {mean[1]:.2f}]")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Clustering results
axes[0].scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.6, edgecolors='black')
axes[0].scatter(gmm.means_[:, 0], gmm.means_[:, 1], c='red', s=200, marker='X',
edgecolors='black', linewidths=2, label='Centers')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
axes[0].set_title('GMM Clustering', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Probability density visualization
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
np.linspace(y_min, y_max, 100))
Z = -gmm.score_samples(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
axes[1].contourf(xx, yy, Z, levels=20, cmap='viridis', alpha=0.6)
axes[1].scatter(X[:, 0], X[:, 1], c='white', s=10, alpha=0.5, edgecolors='black')
axes[1].scatter(gmm.means_[:, 0], gmm.means_[:, 1], c='red', s=200, marker='X',
edgecolors='black', linewidths=2)
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
axes[1].set_title('GMM Probability Density', fontsize=14)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
GMM-HMM System
In traditional ASR, the output probabilities of each HMM state are modeled with GMMs:
$$ b_j(o_t) = \sum_{m=1}^M c_{jm} \mathcal{N}(o_t | \mu_{jm}, \Sigma_{jm}) $$
- $j$: HMM state
- $m$: GMM component
- $c_{jm}$: Weight of component $m$ in state $j$
from hmmlearn import hmm
# HMM with GMM emission distributions
n_states = 3
n_mix = 4  # Number of GMM components per state
# GMMHMM: each state's output distribution is a mixture of n_mix Gaussians
# (GaussianHMM is the special case n_mix = 1)
gmm_hmm = hmm.GMMHMM(n_components=n_states, n_mix=n_mix,
covariance_type='diag', n_iter=100, random_state=42)
# Training data generation (2D features)
np.random.seed(42)
n_samples = 200
train_data = np.random.randn(n_samples, 2) * 0.5
# Model training
gmm_hmm.fit(train_data)
print("\n=== GMM-HMM System ===")
print(f"Number of HMM states: {n_states}")
print(f"Number of GMM components per state: {n_mix}")
print(f"Number of iterations to converge: {gmm_hmm.monitor_.iter}")
print(f"Log-likelihood: {gmm_hmm.score(train_data) * len(train_data):.2f}")
# Decoding
test_data = np.random.randn(10, 2) * 0.5
log_prob, states = gmm_hmm.decode(test_data)
print(f"\nState sequence for test data:")
print(f" {states}")
print(f" Log probability: {log_prob:.4f}")
2.4 Language Models
N-gram Models
N-gram models approximate the probability of a word sequence using only the previous $n-1$ words:
$$ P(w_1, w_2, \ldots, w_m) \approx \prod_{i=1}^m P(w_i | w_{i-n+1}, \ldots, w_{i-1}) $$
Main N-gram Models
| Model | Definition | Example |
|---|---|---|
| Unigram | $P(w_i)$ | Independent word probability |
| Bigram | $P(w_i \mid w_{i-1})$ | Depends on previous word |
| Trigram | $P(w_i \mid w_{i-2}, w_{i-1})$ | Depends on previous 2 words |
Maximum Likelihood Estimation
N-gram probabilities are estimated from counts in the training corpus:
$$ P(w_i | w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})} $$
- $C(w_{i-1}, w_i)$: Count of bigram $(w_{i-1}, w_i)$
- $C(w_{i-1})$: Count of word $w_{i-1}$
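For example, counts from a toy corpus translate directly into MLE probabilities:
from collections import Counter

tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
# P(cat|the) = C(the, cat) / C(the) = 1 / 2
print(bigrams[("the", "cat")] / unigrams["the"])  # 0.5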
Perplexity
Perplexity is an evaluation metric for language models:
$$ \text{PPL} = P(w_1, \ldots, w_N)^{-1/N} = \sqrt[N]{\prod_{i=1}^N \frac{1}{P(w_i | w_1, \ldots, w_{i-1})}} $$
Lower is better.
Smoothing Techniques
Techniques for assigning probabilities to unobserved N-grams:
1. Add-k Smoothing (Laplace)
$$ P(w_i | w_{i-1}) = \frac{C(w_{i-1}, w_i) + k}{C(w_{i-1}) + k|V|} $$
2. Kneser-Ney Smoothing
Considers context diversity:
$$ P_{\text{KN}}(w_i | w_{i-1}) = \frac{\max(C(w_{i-1}, w_i) - \delta, 0)}{C(w_{i-1})} + \lambda(w_{i-1}) P_{\text{continuation}}(w_i) $$
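A minimal sketch of an interpolated Kneser-Ney bigram estimator built from raw counts (delta is the discount; this omits the refinements of modified Kneser-Ney used in production toolkits):
from collections import Counter, defaultdict

def build_kn_bigram(tokens, delta=0.75):
    """Returns p(w2 given w1) under interpolated Kneser-Ney smoothing"""
    bigrams = Counter(zip(tokens, tokens[1:]))
    history_counts = Counter(tokens[:-1])  # C(w1) as a history
    followers = defaultdict(set)           # word types seen after w1
    preceders = defaultdict(set)           # word types seen before w2
    for (w1, w2) in bigrams:
        followers[w1].add(w2)
        preceders[w2].add(w1)
    n_bigram_types = len(bigrams)

    def p_kn(w2, w1):
        p_cont = len(preceders[w2]) / n_bigram_types      # continuation probability
        c1 = history_counts[w1]
        if c1 == 0:
            return p_cont                                  # unseen history: back off fully
        discounted = max(bigrams[(w1, w2)] - delta, 0) / c1
        lam = delta * len(followers[w1]) / c1              # mass freed by discounting
        return discounted + lam * p_cont
    return p_kn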
Implementation: N-gram Language Model
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
import numpy as np
from collections import defaultdict, Counter
from typing import List, Tuple
class BigramLanguageModel:
"""
Bigram language model (with Add-k smoothing)
"""
def __init__(self, k: float = 1.0):
self.k = k
self.unigram_counts = Counter()
self.bigram_counts = defaultdict(Counter)
self.vocab = set()
def train(self, corpus: List[List[str]]):
"""
Train the model from corpus
Args:
corpus: List of sentences (each sentence is a list of words)
"""
for sentence in corpus:
# Add start/end tags
words = ['<s>'] + sentence + ['</s>']
for word in words:
self.vocab.add(word)
self.unigram_counts[word] += 1
for w1, w2 in zip(words[:-1], words[1:]):
self.bigram_counts[w1][w2] += 1
print(f"Vocabulary size: {len(self.vocab)}")
print(f"Total words: {sum(self.unigram_counts.values())}")
print(f"Unique bigrams: {sum(len(counts) for counts in self.bigram_counts.values())}")
def probability(self, w1: str, w2: str) -> float:
"""
Calculate bigram probability P(w2|w1) (with Add-k smoothing)
"""
numerator = self.bigram_counts[w1][w2] + self.k
denominator = self.unigram_counts[w1] + self.k * len(self.vocab)
return numerator / denominator
def sentence_probability(self, sentence: List[str]) -> float:
"""
Calculate sentence probability
"""
words = ['<s>'] + sentence + ['</s>']
prob = 1.0
for w1, w2 in zip(words[:-1], words[1:]):
prob *= self.probability(w1, w2)
return prob
def perplexity(self, test_corpus: List[List[str]]) -> float:
"""
Calculate perplexity of test corpus
"""
log_prob = 0
n_words = 0
for sentence in test_corpus:
words = ['<s>'] + sentence + ['</s>']
n_words += len(words) - 1
for w1, w2 in zip(words[:-1], words[1:]):
prob = self.probability(w1, w2)
log_prob += np.log2(prob)
return 2 ** (-log_prob / n_words)
# Sample corpus
train_corpus = [
['I', 'love', 'machine', 'learning'],
['machine', 'learning', 'is', 'fun'],
['I', 'love', 'deep', 'learning'],
['deep', 'learning', 'is', 'powerful'],
['I', 'study', 'machine', 'learning'],
]
test_corpus = [
['I', 'love', 'learning'],
['machine', 'learning', 'is', 'interesting']
]
# Model training
print("=== Bigram Language Model ===\n")
lm = BigramLanguageModel(k=0.1)
lm.train(train_corpus)
# Probability calculation
print("\nBigram probability examples:")
bigrams = [('I', 'love'), ('love', 'learning'), ('machine', 'learning'), ('learning', 'is')]
for w1, w2 in bigrams:
prob = lm.probability(w1, w2)
print(f" P({w2}|{w1}) = {prob:.4f}")
# Sentence probability
print("\nSentence probabilities:")
for sentence in test_corpus:
prob = lm.sentence_probability(sentence)
print(f" '{' '.join(sentence)}': {prob:.6e}")
# Perplexity
ppl = lm.perplexity(test_corpus)
print(f"\nTest corpus perplexity: {ppl:.2f}")
Output:
=== Bigram Language Model ===
Vocabulary size: 11
Total words: 30
Unique bigrams: 16
Bigram probability examples:
P(love|I) = 0.5122
P(learning|love) = 0.0323
P(learning|machine) = 0.7561
P(is|learning) = 0.3443
Sentence probabilities:
'I love learning': 4.267146e-03
'machine learning is interesting': 1.376499e-04
Test corpus perplexity: 4.92
KenLM Library
For practical N-gram language models, use KenLM:
# Advanced language model using KenLM
# Note: Installation required: pip install https://github.com/kpu/kenlm/archive/master.zip
import kenlm
# Load from ARPA format language model file
# model = kenlm.Model('path/to/model.arpa')
# Calculate sentence score
# score = model.score('this is a test sentence', bos=True, eos=True)
# perplexity = model.perplexity('this is a test sentence')
print("KenLM is an efficient N-gram language model implementation")
print("Optimized for training and querying on large corpora")
2.5 Traditional ASR Pipeline
Complete Pipeline Architecture
Waveform → Pre-processing (pre-emphasis) → Framing → Windowing → MFCC Extraction → Delta/Delta-Delta Features → Acoustic Model (GMM-HMM) → Viterbi Decoding + Language Model → Recognition Result (Text)
Implementation: Simple ASR System
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - librosa>=0.10.0
# - hmmlearn>=0.3.0
import numpy as np
import librosa
from hmmlearn import hmm
from typing import List, Tuple
class SimpleASR:
"""
Simple speech recognition system (for demonstration)
"""
def __init__(self, n_mfcc: int = 13, n_states: int = 3):
self.n_mfcc = n_mfcc
self.n_states = n_states
self.models = {} # HMM models for each word
def extract_features(self, audio_path: str, sr: int = 16000) -> np.ndarray:
"""
Extract MFCC features from audio file
"""
# Load audio
y, sr = librosa.load(audio_path, sr=sr)
# MFCC
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=self.n_mfcc)
# Delta features
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
# Concatenate
features = np.vstack([mfcc, delta, delta2])
return features.T # (time, features)
def train_word_model(self, word: str, audio_files: List[str]):
"""
Train HMM model for a specific word
"""
# Extract features from all training data
all_features = []
lengths = []
for audio_file in audio_files:
features = self.extract_features(audio_file)
all_features.append(features)
lengths.append(len(features))
# Concatenate
X = np.vstack(all_features)
# Left-to-Right HMM
model = hmm.GaussianHMM(
n_components=self.n_states,
covariance_type='diag',
n_iter=100,
random_state=42,
init_params='mc',  # do not overwrite the manually set startprob_/transmat_ below
params='tmc'       # EM keeps zero transitions at zero, preserving the L-R structure
)
# Constrain transition probabilities (Left-to-Right)
model.transmat_ = np.zeros((self.n_states, self.n_states))
for i in range(self.n_states):
if i < self.n_states - 1:
model.transmat_[i, i] = 0.5
model.transmat_[i, i+1] = 0.5
else:
model.transmat_[i, i] = 1.0
model.startprob_ = np.zeros(self.n_states)
model.startprob_[0] = 1.0
# Training
model.fit(X, lengths)
self.models[word] = model
print(f"Trained model for word '{word}'")
print(f" Number of training samples: {len(audio_files)}")
print(f" Total frames: {len(X)}")
def recognize(self, audio_path: str) -> Tuple[str, float]:
"""
Recognize audio file
Returns:
(recognized word, score)
"""
# Feature extraction
features = self.extract_features(audio_path)
# Calculate score for each word model
scores = {}
for word, model in self.models.items():
try:
score = model.score(features)
scores[word] = score
except Exception:
scores[word] = -np.inf
# Select word with highest score
best_word = max(scores, key=scores.get)
best_score = scores[best_word]
return best_word, best_score
# Demo usage example (requires actual audio files)
print("=== Simple ASR System ===\n")
print("This system operates as follows:")
print("1. Extract MFCC features (+ delta) from speech")
print("2. Model each word with Left-to-Right HMM")
print("3. Select most likely word using Viterbi algorithm")
print("\nActual usage requires audio files")
# asr = SimpleASR(n_mfcc=13, n_states=3)
#
# # Training (multiple audio samples per word)
# asr.train_word_model('hello', ['hello1.wav', 'hello2.wav', 'hello3.wav'])
# asr.train_word_model('world', ['world1.wav', 'world2.wav', 'world3.wav'])
#
# # Recognition
# word, score = asr.recognize('test.wav')
# print(f"Recognition result: {word} (Score: {score:.2f})")
Integration with Language Model
In actual ASR, the acoustic model and language model are integrated:
$$ \hat{W} = \arg\max_W \left[\log P(X|W) + \lambda \log P(W)\right] $$
- $\lambda$: Language model weight
class ASRWithLanguageModel:
"""
ASR with integrated language model
"""
def __init__(self, acoustic_model, language_model, lm_weight: float = 1.0):
self.acoustic_model = acoustic_model
self.language_model = language_model
self.lm_weight = lm_weight
def recognize_with_lm(self, audio_features: np.ndarray,
previous_words: List[str] = None) -> str:
"""
Recognition using language model
"""
# Acoustic score (for each word candidate)
acoustic_scores = {}
for word in self.acoustic_model.models.keys():
acoustic_scores[word] = self.acoustic_model.models[word].score(audio_features)
# Language model score
if previous_words:
lm_scores = {}
for word in acoustic_scores.keys():
# Bigram probability
prev_word = previous_words[-1] if previous_words else ''
lm_scores[word] = np.log(self.language_model.probability(prev_word, word))
else:
lm_scores = {word: 0 for word in acoustic_scores.keys()}
# Combined score
total_scores = {
word: acoustic_scores[word] + self.lm_weight * lm_scores[word]
for word in acoustic_scores.keys()
}
# Select best word
best_word = max(total_scores, key=total_scores.get)
return best_word
print("\n=== Language Model Integration ===")
print("By combining acoustic and language scores,")
print("recognition accuracy can be improved with context consideration")
2.6 Chapter Summary
What We Learned
Fundamentals of Speech Recognition
- ASR is a combination of acoustic and language models
- Evaluation using WER (Word Error Rate)
- Error calculation using Levenshtein distance
Hidden Markov Models
- Modeling state transitions and output probabilities
- Forward-Backward algorithm (Evaluation)
- Viterbi algorithm (Decoding)
- Baum-Welch algorithm (Learning)
Gaussian Mixture Models
- Density estimation with multiple Gaussian distributions
- Parameter estimation using EM algorithm
- Acoustic modeling with GMM-HMM
Language Models
- Probability modeling of word sequences with N-grams
- Smoothing techniques (handling unseen events)
- Evaluation using perplexity
ASR Pipeline
- Feature extraction (MFCC + delta)
- Acoustic model (GMM-HMM)
- Decoding (Viterbi)
- Language model integration
Advantages and Disadvantages of Traditional ASR
| Advantages | Disadvantages |
|---|---|
| Theoretically clear | Requires large amounts of labeled data |
| Independent components | Difficult to optimize entire pipeline |
| Interpretability at phoneme level | Weak modeling of long-term dependencies |
| Works with limited data | Dependent on feature engineering |
Next Chapter
In Chapter 3, we will learn about modern End-to-End speech recognition:
- Deep Speech (CTC)
- Listen, Attend and Spell
- Transformer-based ASR
- Wav2Vec 2.0
- Whisper
Exercises
Problem 1 (Difficulty: easy)
Calculate the WER by hand for the following reference and hypothesis sentences.
- Reference: "the cat sat on the mat"
- Hypothesis: "the cat sit on mat"
Solution
Answer:
Align reference and hypothesis:
Reference: the cat sat on the mat
Hypothesis: the cat sit on --- mat
Count errors:
- Substitutions (S): sat → sit (1)
- Deletions (D): the (1)
- Insertions (I): 0
Calculate WER:
$$ \text{WER} = \frac{S + D + I}{N} = \frac{1 + 1 + 0}{6} = \frac{2}{6} = 0.333 = 33.3\% $$
Answer: 33.3%
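This hand calculation can be checked against the calculate_wer() implementation from Section 2.1:
result = calculate_wer("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {result['WER']:.1f}%  (S={result['substitutions']}, D={result['deletions']}, I={result['insertions']})")
# WER: 33.3%  (S=1, D=1, I=0)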
Problem 2 (Difficulty: medium)
For a 3-state HMM with the following parameters, calculate the probability of the observation sequence [0, 1, 0] using the Forward algorithm.
# Initial probabilities
pi = [0.6, 0.3, 0.1]
# Transition probabilities
A = [[0.7, 0.2, 0.1],
[0.3, 0.5, 0.2],
[0.2, 0.3, 0.5]]
# Emission probabilities (observations 0 and 1)
B = [[0.8, 0.2],
[0.4, 0.6],
[0.3, 0.7]]
Solution
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: For a 3-state HMM with the following parameters, calculate t
Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner to Intermediate
Execution time: 5-10 seconds
Dependencies: None
"""
import numpy as np
# Parameters
pi = np.array([0.6, 0.3, 0.1])
A = np.array([[0.7, 0.2, 0.1],
[0.3, 0.5, 0.2],
[0.2, 0.3, 0.5]])
B = np.array([[0.8, 0.2],
[0.4, 0.6],
[0.3, 0.7]])
observations = [0, 1, 0]
T = len(observations)
N = len(pi)
# Forward variables
alpha = np.zeros((T, N))
# Initialization (t=0)
alpha[0] = pi * B[:, observations[0]]
print(f"t=0: α = {alpha[0]}")
# Recursion (t=1, 2, ...)
for t in range(1, T):
for j in range(N):
alpha[t, j] = np.sum(alpha[t-1] * A[:, j]) * B[j, observations[t]]
print(f"t={t}: α = {alpha[t]}")
# Probability of observation sequence
prob = np.sum(alpha[T-1])
print(f"\nProbability of observation sequence {observations}:")
print(f"P(O|λ) = {prob:.6f}")
Output:
t=0: α = [0.48 0.12 0.03]
t=1: α = [0.0756 0.099  0.0609]
t=2: α = [0.07584  0.033156 0.017343]
Probability of observation sequence [0, 1, 0]:
P(O|λ) = 0.126339
Problem 3 (Difficulty: medium)
In a bigram language model, estimate P("learning"|"machine") using maximum likelihood estimation from the following corpus.
I love machine learning
machine learning is fun
I study machine learning
deep learning is great
Solution
Answer:
Counts:
- C(machine, learning) = 3
- C(machine) = 3
Maximum likelihood estimation:
$$ P(\text{learning} | \text{machine}) = \frac{C(\text{machine}, \text{learning})}{C(\text{machine})} = \frac{3}{3} = 1.0 $$
Answer: 1.0 (100%)
In this corpus, "machine" is always followed by "learning".
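The counts are easy to verify programmatically:
from collections import Counter

corpus = [
    "I love machine learning",
    "machine learning is fun",
    "I study machine learning",
    "deep learning is great",
]
sentences = [s.split() for s in corpus]
bigrams = Counter(b for s in sentences for b in zip(s, s[1:]))
unigrams = Counter(w for s in sentences for w in s)
print(bigrams[("machine", "learning")] / unigrams["machine"])  # 3 / 3 = 1.0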
Problem 4 (Difficulty: hard)
Using a GMM with 2 components (K=2), cluster the following 1D data and find the parameters (mean, variance, weight) of each component.
data = np.array([1.2, 1.5, 1.8, 2.0, 2.1, 8.5, 9.0, 9.2, 9.5, 10.0])
Solution
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - scikit-learn>=1.3.0
# - matplotlib>=3.7.0
"""
Example: Clustering 1-D data with a 2-component GMM
Purpose: Estimate the mean, variance, and weight of each component
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: numpy, scikit-learn, matplotlib
"""
import numpy as np
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
data = np.array([1.2, 1.5, 1.8, 2.0, 2.1, 8.5, 9.0, 9.2, 9.5, 10.0])
X = data.reshape(-1, 1)
# GMM clustering
gmm = GaussianMixture(n_components=2, random_state=42)
gmm.fit(X)
# Parameters
print("=== GMM Parameters ===")
print(f"\nComponent 1:")
print(f" Mean: {gmm.means_[0][0]:.3f}")
print(f" Variance: {gmm.covariances_[0][0][0]:.3f}")
print(f" Weight: {gmm.weights_[0]:.3f}")
print(f"\nComponent 2:")
print(f" Mean: {gmm.means_[1][0]:.3f}")
print(f" Variance: {gmm.covariances_[1][0][0]:.3f}")
print(f" Weight: {gmm.weights_[1]:.3f}")
# Cluster labels
labels = gmm.predict(X)
print(f"\nCluster labels:")
for i, (val, label) in enumerate(zip(data, labels)):
print(f" Data{i+1}: {val:.1f} → Cluster{label}")
# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(data, np.zeros_like(data), c=labels, cmap='viridis',
s=100, alpha=0.6, edgecolors='black')
plt.scatter(gmm.means_, [0, 0], c='red', s=200, marker='X',
edgecolors='black', linewidths=2, label='Centers')
plt.xlabel('Value')
plt.title('1D Data Clustering with GMM', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Example output:
=== GMM Parameters ===
Component 1:
Mean: 1.720
Variance: 0.110
Weight: 0.500
Component 2:
Mean: 9.240
Variance: 0.250
Weight: 0.500
Cluster labels:
Data1: 1.2 → Cluster0
Data2: 1.5 → Cluster0
Data3: 1.8 → Cluster0
Data4: 2.0 → Cluster0
Data5: 2.1 → Cluster0
Data6: 8.5 → Cluster1
Data7: 9.0 → Cluster1
Data8: 9.2 → Cluster1
Data9: 9.5 → Cluster1
Data10: 10.0 → Cluster1
Problem 5 (Difficulty: hard)
In an ASR system that integrates acoustic and language models, explain the impact of varying the language model weight (LM weight) on recognition results. Also, describe how the optimal weight should be determined.
Solution
Answer:
Impact of LM weight:
Combined score:
$$ \text{Score}(W) = \log P(X|W) + \lambda \log P(W) $$
| LM weight $\lambda$ | Impact | Recognition tendency |
|---|---|---|
| Small (close to 0) | Acoustic model priority | Selects acoustically similar words, grammatically unnatural |
| Appropriate | Balanced | Considers both acoustic and grammar, best recognition accuracy |
| Large | Language model priority | Grammatically correct but acoustically incorrect, biased toward frequent words |
Methods for determining optimal weight:
1. Grid search on development set (sketch; evaluate_asr is a hypothetical helper that returns WER on a development set):
lambda_values = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
best_lambda = None
best_wer = float('inf')
for lam in lambda_values:
    wer = evaluate_asr(dev_set, lm_weight=lam)
    if wer < best_wer:
        best_wer = wer
        best_lambda = lam
print(f"Optimal LM weight: {best_lambda}")
2. Considering domain dependency
- Read speech: High acoustic quality → smaller $\lambda$
- Spontaneous speech: Noisy → larger $\lambda$
- Many technical terms: Low language model reliability → smaller $\lambda$
3. Dynamic adjustment
- Dynamically adjust based on confidence scores
- Adjust according to acoustic quality (SNR)
Example:
Speech: "I scream" (ice cream)
λ = 0.1 (acoustic priority):
→ "I scream" (acoustically accurate)
λ = 5.0 (language priority):
→ "ice cream" (grammatically natural, high frequency in language model)
λ = 1.0 (balanced):
→ Appropriate selection depending on context
Conclusion: The optimal LM weight depends on domain, acoustic conditions, and language model quality, and requires experimental tuning on a development set.
References
- Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
- Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing (3rd ed.). Draft.
- Gales, M., & Young, S. (2008). The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing, 1(3), 195-304.
- Heafield, K. (2011). KenLM: Faster and smaller language model queries. Proceedings of the Sixth Workshop on Statistical Machine Translation, 187-197.