
Chapter 2: Traditional Speech Recognition

HMM-GMM Era Speech Recognition Technology - Statistical Model-Based Approaches

📖 Reading Time: 35-40 minutes 📊 Difficulty: Intermediate 💻 Code Examples: 8 📝 Exercises: 5

This chapter covers traditional speech recognition. You will learn the principles of Hidden Markov Models (HMMs) and how to construct a complete HMM-GMM based ASR pipeline.

Learning Objectives

By reading this chapter, you will be able to:

• Explain the components of an ASR system and evaluate it with WER and CER
• Describe the three fundamental HMM problems and the Forward and Viterbi algorithms
• Explain how GMMs model HMM emission probabilities and how they are trained with the EM algorithm
• Build N-gram language models with smoothing and evaluate them with perplexity
• Assemble a simple HMM-GMM recognition pipeline and integrate a language model


2.1 Fundamentals of Speech Recognition

Definition of ASR Task

Automatic Speech Recognition (ASR) is a task that converts speech signals into text.

Goal of speech recognition: Given an observed acoustic signal $X$, find the most likely word sequence $W$

This is formulated using Bayes' theorem as follows:

$$ \hat{W} = \arg\max_{W} P(W|X) = \arg\max_{W} \frac{P(X|W) P(W)}{P(X)} = \arg\max_{W} P(X|W) P(W) $$
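As a toy illustration of the decision rule above, the sketch below scores two made-up candidate transcriptions with hypothetical acoustic and language-model log-probabilities; a real decoder searches an enormous hypothesis space rather than a hard-coded dictionary.

# Minimal sketch of the noisy-channel decision rule (all scores are made up)
candidates = {
    "recognize speech":   {"log_p_x_given_w": -120.5, "log_p_w": -8.2},
    "wreck a nice beach": {"log_p_x_given_w": -119.8, "log_p_w": -14.6},
}

# argmax_W  log P(X|W) + log P(W)   (P(X) is constant and can be dropped)
best = max(candidates,
           key=lambda w: candidates[w]["log_p_x_given_w"] + candidates[w]["log_p_w"])
print(best)  # -> recognize speech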

Components of ASR Systems

graph LR
    A[Speech Signal] --> B[Feature Extraction<br/>MFCC]
    B --> C[Acoustic Model<br/>HMM-GMM]
    C --> D[Decoding<br/>Viterbi]
    D --> E[Language Model<br/>N-gram]
    E --> F[Recognition Result<br/>Text]
    style A fill:#ffebee
    style B fill:#fff3e0
    style C fill:#e3f2fd
    style D fill:#f3e5f5
    style E fill:#e8f5e9
    style F fill:#c8e6c9

Evaluation Metrics

Word Error Rate (WER)

WER is the standard evaluation metric for speech recognition:

$$ \text{WER} = \frac{S + D + I}{N} \times 100\% $$

where $S$, $D$, and $I$ are the numbers of substituted, deleted, and inserted words, and $N$ is the number of words in the reference.

Character Error Rate (CER)

For languages like Japanese and Chinese, character-level CER is also used:

$$ \text{CER} = \frac{S_c + D_c + I_c}{N_c} \times 100\% $$

where the counts are taken at the character level and $N_c$ is the number of characters in the reference.

Implementation: WER Calculation

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0

import numpy as np
from typing import List, Tuple

def levenshtein_distance(ref: List[str], hyp: List[str]) -> Tuple[int, int, int, int]:
    """
    Calculate Levenshtein distance and error statistics

    Returns:
        (distance, substitutions, deletions, insertions)
    """
    m, n = len(ref), len(hyp)

    # DP table
    dp = np.zeros((m + 1, n + 1), dtype=int)

    # Initialization
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j

    # DP
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i-1] == hyp[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = min(
                    dp[i-1][j-1] + 1,  # substitution
                    dp[i-1][j] + 1,    # deletion
                    dp[i][j-1] + 1     # insertion
                )

    # Backtrack: Calculate error statistics
    i, j = m, n
    subs, dels, ins = 0, 0, 0

    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i-1] == hyp[j-1]:
            i -= 1
            j -= 1
        elif i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + 1:
            subs += 1
            i -= 1
            j -= 1
        elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1

    return dp[m][n], subs, dels, ins


def calculate_wer(reference: str, hypothesis: str) -> dict:
    """
    Calculate WER (Word Error Rate)

    Args:
        reference: Reference text
        hypothesis: Recognition result

    Returns:
        Dictionary containing error statistics
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()

    dist, subs, dels, ins = levenshtein_distance(ref_words, hyp_words)

    n_words = len(ref_words)
    wer = (dist / n_words * 100) if n_words > 0 else 0

    return {
        'WER': wer,
        'substitutions': subs,
        'deletions': dels,
        'insertions': ins,
        'total_errors': dist,
        'total_words': n_words
    }


# Test
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

result = calculate_wer(reference, hypothesis)

print("=== WER Calculation Example ===")
print(f"Reference: {reference}")
print(f"Hypothesis: {hypothesis}")
print(f"\nWER: {result['WER']:.2f}%")
print(f"Substitutions: {result['substitutions']}")
print(f"Deletions: {result['deletions']}")
print(f"Insertions: {result['insertions']}")
print(f"Total Errors: {result['total_errors']}")
print(f"Total Words: {result['total_words']}")

Output:

=== WER Calculation Example ===
Reference: the quick brown fox jumps over the lazy dog
Hypothesis: the quick brown fox jumped over a lazy dog

WER: 22.22%
Substitutions: 2
Deletions: 0
Insertions: 0
Total Errors: 2
Total Words: 9

Important: WER can exceed 100% (when there are many insertion errors).
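Both points can be checked with the functions defined above: levenshtein_distance accepts any token list, so passing character lists yields CER, and a hypothesis full of insertions pushes WER past 100%. The strings below are made-up examples.

# Character-level CER: reuse levenshtein_distance on lists of characters
ref_chars = list("speech recognition".replace(" ", ""))
hyp_chars = list("speech recogniton".replace(" ", ""))   # one character missing
dist, subs, dels, ins = levenshtein_distance(ref_chars, hyp_chars)
print(f"CER: {dist / len(ref_chars) * 100:.2f}% (S={subs}, D={dels}, I={ins})")

# WER above 100%: two insertions against a one-word reference
print(f"WER: {calculate_wer('yes', 'yes yes yes')['WER']:.2f}%")  # 200.00%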


2.2 Hidden Markov Models (HMM)

HMM Fundamentals

Hidden Markov Models (HMM) are probabilistic models consisting of unobservable hidden states and observable outputs.

Components of HMM

An HMM $\lambda = (A, B, \pi)$ is specified by:

• A set of $N$ hidden states
• State transition probabilities $a_{ij}$ (matrix $A$)
• Emission (output) probabilities $b_j(o_t)$ (matrix $B$)
• Initial state probabilities $\pi_i$

Three Fundamental Problems of HMM

| Problem | Description | Algorithm |
|---|---|---|
| Evaluation | Calculate the probability of an observation sequence | Forward-Backward |
| Decoding | Estimate the most likely state sequence | Viterbi |
| Learning | Estimate the model parameters | Baum-Welch (EM) |

Forward-Backward Algorithm

Forward Algorithm calculates the probability of the observation sequence $P(O|\lambda)$:

$$ \alpha_t(i) = P(o_1, o_2, \ldots, o_t, q_t=i | \lambda) $$

Recursion formula:

$$ \alpha_t(j) = \left[\sum_{i=1}^N \alpha_{t-1}(i) a_{ij}\right] b_j(o_t) $$

with initialization $\alpha_1(i) = \pi_i b_i(o_1)$ and termination $P(O|\lambda) = \sum_{i=1}^N \alpha_T(i)$.

Viterbi Algorithm

Viterbi Algorithm finds the most likely state sequence:

$$ \delta_t(i) = \max_{q_1, \ldots, q_{t-1}} P(q_1, \ldots, q_{t-1}, q_t=i, o_1, \ldots, o_t | \lambda) $$

Recursion formula:

$$ \delta_t(j) = \left[\max_{i} \delta_{t-1}(i) a_{ij}\right] b_j(o_t) $$

with initialization $\delta_1(i) = \pi_i b_i(o_1)$; the most likely state sequence is recovered by backtracking the argmax at each step.
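To make the recursion concrete, here is a compact NumPy sketch of Viterbi decoding for a discrete HMM. It reuses the toy weather parameters from the hmmlearn example in the next subsection, so the decoded path can be compared against model.decode.

import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state path of a discrete HMM via the Viterbi recursion."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))             # delta_t(j): best score ending in state j
    psi = np.zeros((T, N), dtype=int)    # backpointers
    delta[0] = pi * B[:, obs[0]]         # initialization
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] * B[j, obs[t]]
    # Backtracking
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta[-1].max())

# Toy weather parameters (same as the hmmlearn example below)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.4, 0.5]])
path, best_prob = viterbi(pi, A, B, [0, 1, 2, 1, 0])
print(path, best_prob)   # expected: [0, 1, 1, 1, 0] and the best-path probability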

Implementation: Basic HMM Operations

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - hmmlearn>=0.3.0

"""
Example: Basic HMM Operations

Purpose: Demonstrate HMM evaluation (Forward) and decoding (Viterbi) with hmmlearn
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: numpy, hmmlearn
"""

import numpy as np
from hmmlearn import hmm

# Example: Weather model
# States: 0=Sunny, 1=Rainy
# Observations: 0=Walk, 1=Shopping, 2=Cleaning

# HMM parameter definition
n_states = 2
n_observations = 3

# Build the model
# (hmmlearn >= 0.3: CategoricalHMM is the discrete-observation HMM;
#  older releases called it MultinomialHMM)
model = hmm.CategoricalHMM(n_components=n_states, random_state=42)

# State transition probabilities
model.startprob_ = np.array([0.6, 0.4])  # Initial probabilities
model.transmat_ = np.array([
    [0.7, 0.3],  # Sunny → [Sunny, Rainy]
    [0.4, 0.6]   # Rainy → [Sunny, Rainy]
])

# Emission probabilities
model.emissionprob_ = np.array([
    [0.6, 0.3, 0.1],  # Sunny day: [Walk, Shopping, Cleaning]
    [0.1, 0.4, 0.5]   # Rainy day: [Walk, Shopping, Cleaning]
])

print("=== HMM Parameters ===")
print("\nInitial state probabilities:")
print(f"  Sunny: {model.startprob_[0]:.2f}")
print(f"  Rainy: {model.startprob_[1]:.2f}")

print("\nState transition probabilities:")
print(model.transmat_)

print("\nEmission probabilities:")
print("       Walk   Shopping  Cleaning")
print(f"Sunny: {model.emissionprob_[0]}")
print(f"Rainy: {model.emissionprob_[1]}")

# Observation sequence
observations = np.array([[0], [1], [2], [1], [0]])  # Walk, Shopping, Cleaning, Shopping, Walk

# Forward algorithm: Probability of observation sequence
log_prob = model.score(observations)
print(f"\nLog-likelihood of observation sequence: {log_prob:.4f}")
print(f"Probability of observation sequence: {np.exp(log_prob):.6f}")

# Viterbi algorithm: Most likely state sequence
log_prob, states = model.decode(observations)
state_names = ['Sunny', 'Rainy']
print(f"\nMost likely state sequence:")
for i, (obs, state) in enumerate(zip(observations.flatten(), states)):
    obs_names = ['Walk', 'Shopping', 'Cleaning']
    print(f"  Day{i+1}: Observation={obs_names[obs]}, State={state_names[state]}")

Output:

=== HMM Parameters ===

Initial state probabilities:
  Sunny: 0.60
  Rainy: 0.40

State transition probabilities:
[[0.7 0.3]
 [0.4 0.6]]

Emission probabilities:
       Walk   Shopping  Cleaning
Sunny: [0.6 0.3 0.1]
Rainy: [0.1 0.4 0.5]

Log-likelihood of observation sequence: -5.3980
Probability of observation sequence: 0.004525

Most likely state sequence:
  Day1: Observation=Walk, State=Sunny
  Day2: Observation=Shopping, State=Rainy
  Day3: Observation=Cleaning, State=Rainy
  Day4: Observation=Shopping, State=Rainy
  Day5: Observation=Walk, State=Sunny
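The third fundamental problem (Learning) corresponds to fit, which runs Baum-Welch (EM). Continuing the weather example, the minimal sketch below samples a sequence from the known model and re-estimates the parameters from the observations alone; up to a relabeling of the states, they should come out close to the true values. It assumes hmmlearn >= 0.3 (CategoricalHMM).

# Baum-Welch (EM): estimate the parameters from observations alone
X_sampled, _ = model.sample(500, random_state=0)    # 500 observed symbols

learned = hmm.CategoricalHMM(n_components=2, n_iter=200, random_state=0)
learned.fit(X_sampled)                              # EM re-estimation

print("Estimated transition matrix:")
print(np.round(learned.transmat_, 2))
print("Estimated emission probabilities:")
print(np.round(learned.emissionprob_, 2))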

Phoneme Modeling with HMM

In speech recognition, each phoneme is modeled with a 3-state Left-to-Right HMM:

graph LR
    Start((Start)) --> S1[State 1<br/>Phoneme Begin]
    S1 --> S1
    S1 --> S2[State 2<br/>Phoneme Middle]
    S2 --> S2
    S2 --> S3[State 3<br/>Phoneme End]
    S3 --> S3
    S3 --> End((End))
    style Start fill:#c8e6c9
    style S1 fill:#e3f2fd
    style S2 fill:#fff3e0
    style S3 fill:#f3e5f5
    style End fill:#ffcdd2

# Left-to-Right HMM (Phoneme model)
n_states = 3

# Left-to-Right structure transition matrix
# States can only advance or self-loop
lr_model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", random_state=42)

# Transition probabilities (left-to-right only)
lr_model.transmat_ = np.array([
    [0.5, 0.5, 0.0],  # State 1: self-loop or to State 2
    [0.0, 0.5, 0.5],  # State 2: self-loop or to State 3
    [0.0, 0.0, 1.0]   # State 3: self-loop only
])

lr_model.startprob_ = np.array([1.0, 0.0, 0.0])  # Always start from State 1

print("=== Left-to-Right HMM ===")
print("Transition probability matrix:")
print(lr_model.transmat_)
print("\nStarts from State 1 and can only move left-to-right")

2.3 Gaussian Mixture Models (GMM)

GMM Fundamentals

Gaussian Mixture Models (GMM) represent a probability distribution as a linear combination of multiple Gaussian distributions:

$$ p(x) = \sum_{k=1}^K w_k \mathcal{N}(x | \mu_k, \Sigma_k) $$

where the mixture weights satisfy $w_k \geq 0$ and $\sum_{k=1}^K w_k = 1$.

Learning with EM Algorithm

GMM parameters are estimated using the EM (Expectation-Maximization) algorithm:

E-step: Calculate Responsibilities

$$ \gamma_{nk} = \frac{w_k \mathcal{N}(x_n | \mu_k, \Sigma_k)}{\sum_{j=1}^K w_j \mathcal{N}(x_n | \mu_j, \Sigma_j)} $$

M-step: Update Parameters

$$ \mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk} x_n $$

$$ \Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk} (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^T $$

$$ w_k^{\text{new}} = \frac{N_k}{N}, \quad N_k = \sum_{n=1}^N \gamma_{nk} $$
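The E-step and M-step above can be written in a few lines of NumPy. The sketch below runs EM for a two-component GMM on synthetic 1D data (all values are illustrative); scikit-learn's GaussianMixture, used in the next example, performs the same procedure with better initialization and numerical safeguards.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1D data drawn from two Gaussians (illustrative)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 0.8, 300)])

K = 2
w = np.full(K, 1.0 / K)                  # mixture weights w_k
mu = np.array([x.min(), x.max()])        # crude initialization of the means
var = np.full(K, x.var())                # initial variances

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    # E-step: responsibilities gamma_{nk}
    dens = w * gauss(x[:, None], mu, var)            # shape (N, K)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update weights, means, variances
    Nk = gamma.sum(axis=0)
    w = Nk / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print("weights:", np.round(w, 3))
print("means:  ", np.round(mu, 3))
print("vars:   ", np.round(var, 3))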

Implementation: Clustering with GMM

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - scikit-learn>=1.3.0

"""
Example: Clustering with GMM

Purpose: Demonstrate GMM clustering and density estimation with scikit-learn
Target: Advanced
Execution time: 2-5 seconds
Dependencies: numpy, scikit-learn, matplotlib
"""

import numpy as np
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt

# Data generated from 3 Gaussian distributions
np.random.seed(42)

# Data generation
n_samples = 300
X1 = np.random.randn(n_samples // 3, 2) * 0.5 + np.array([0, 0])
X2 = np.random.randn(n_samples // 3, 2) * 0.7 + np.array([3, 3])
X3 = np.random.randn(n_samples // 3, 2) * 0.6 + np.array([0, 3])

X = np.vstack([X1, X2, X3])

# GMM clustering
n_components = 3
gmm = GaussianMixture(n_components=n_components, covariance_type='full', random_state=42)
gmm.fit(X)

# Prediction
labels = gmm.predict(X)
proba = gmm.predict_proba(X)

print("=== GMM Parameters ===")
print(f"Number of components: {n_components}")
print(f"Number of iterations to converge: {gmm.n_iter_}")
print(f"Log-likelihood: {gmm.score(X) * len(X):.2f}")

print("\nMixture weights:")
for i, weight in enumerate(gmm.weights_):
    print(f"  Component{i+1}: {weight:.3f}")

print("\nMean vectors:")
for i, mean in enumerate(gmm.means_):
    print(f"  Component{i+1}: [{mean[0]:.2f}, {mean[1]:.2f}]")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Clustering results
axes[0].scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.6, edgecolors='black')
axes[0].scatter(gmm.means_[:, 0], gmm.means_[:, 1], c='red', s=200, marker='X',
                edgecolors='black', linewidths=2, label='Centers')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
axes[0].set_title('GMM Clustering', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Probability density visualization
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))
Z = -gmm.score_samples(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

axes[1].contourf(xx, yy, Z, levels=20, cmap='viridis', alpha=0.6)
axes[1].scatter(X[:, 0], X[:, 1], c='white', s=10, alpha=0.5, edgecolors='black')
axes[1].scatter(gmm.means_[:, 0], gmm.means_[:, 1], c='red', s=200, marker='X',
                edgecolors='black', linewidths=2)
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
axes[1].set_title('GMM Negative Log-Likelihood', fontsize=14)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

GMM-HMM System

In traditional ASR, the output probabilities of each HMM state are modeled with GMMs:

$$ b_j(o_t) = \sum_{m=1}^M c_{jm} \mathcal{N}(o_t | \mu_{jm}, \Sigma_{jm}) $$

import numpy as np
from hmmlearn import hmm

# HMM with GMM as emission distribution
n_states = 3
n_mix = 4  # Number of GMM components per state

# GMMHMM: the emission distribution of each state is a GMM
# with n_mix Gaussian components
gmm_hmm = hmm.GMMHMM(n_components=n_states, n_mix=n_mix,
                      covariance_type='diag', n_iter=100, random_state=42)

# Training data generation (2D features)
np.random.seed(42)
n_samples = 200
train_data = np.random.randn(n_samples, 2) * 0.5

# Model training
gmm_hmm.fit(train_data)

print("\n=== GMM-HMM System ===")
print(f"Number of HMM states: {n_states}")
print(f"Number of GMM components per state: {n_mix}")
print(f"Number of iterations to converge: {gmm_hmm.monitor_.iter}")
print(f"Log-likelihood: {gmm_hmm.score(train_data):.2f}")

# Decoding
test_data = np.random.randn(10, 2) * 0.5
log_prob, states = gmm_hmm.decode(test_data)

print(f"\nState sequence for test data:")
print(f"  {states}")
print(f"  Log probability: {log_prob:.4f}")

2.4 Language Models

N-gram Models

N-gram models predict the probability of a word sequence based on the previous $n-1$ words:

$$ P(w_1, w_2, \ldots, w_T) \approx \prod_{i=1}^{T} P(w_i | w_{i-n+1}, \ldots, w_{i-1}) $$

Main N-gram Models

| Model | Definition | Description |
|---|---|---|
| Unigram | $P(w_i)$ | Independent word probability |
| Bigram | $P(w_i \mid w_{i-1})$ | Depends on the previous word |
| Trigram | $P(w_i \mid w_{i-2}, w_{i-1})$ | Depends on the previous 2 words |

Maximum Likelihood Estimation

N-gram probabilities are estimated from counts in the training corpus:

$$ P(w_i | w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})} $$

Perplexity

Perplexity is an evaluation metric for language models:

$$ \text{PPL} = P(w_1, \ldots, w_N)^{-1/N} = \sqrt[N]{\prod_{i=1}^N \frac{1}{P(w_i | w_1, \ldots, w_{i-1})}} $$

Lower is better.

Smoothing Techniques

Techniques for assigning probabilities to unobserved N-grams:

1. Add-k Smoothing (Laplace)

$$ P(w_i | w_{i-1}) = \frac{C(w_{i-1}, w_i) + k}{C(w_{i-1}) + k|V|} $$

2. Kneser-Ney Smoothing

Considers context diversity:

$$ P_{\text{KN}}(w_i | w_{i-1}) = \frac{\max(C(w_{i-1}, w_i) - \delta, 0)}{C(w_{i-1})} + \lambda(w_{i-1}) P_{\text{continuation}}(w_i) $$
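As a rough illustration of this formula, the sketch below estimates interpolated Kneser-Ney bigram probabilities on a toy corpus with a fixed discount $\delta = 0.75$. It is a simplified sketch (unseen histories simply fall back to the continuation distribution), not a production implementation like KenLM, which is introduced below.

from collections import Counter, defaultdict

# Toy corpus (illustrative); a real LM is trained on a large corpus
corpus = [["I", "love", "machine", "learning"],
          ["machine", "learning", "is", "fun"],
          ["I", "love", "deep", "learning"]]

delta = 0.75
unigrams = Counter()          # history counts C(w_{i-1})
bigrams = Counter()           # bigram counts C(w_{i-1}, w_i)
for sent in corpus:
    words = ["<s>"] + sent + ["</s>"]
    unigrams.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))

# Continuation counts: in how many distinct contexts does each word appear?
continuation = Counter(w2 for (_, w2) in bigrams)
n_bigram_types = len(bigrams)
followers = defaultdict(set)
for (w1, w2) in bigrams:
    followers[w1].add(w2)

def p_kn(w2, w1):
    c_hist = unigrams[w1]
    if c_hist == 0:                                  # unseen history
        return continuation[w2] / n_bigram_types
    discounted = max(bigrams[(w1, w2)] - delta, 0) / c_hist
    lam = delta * len(followers[w1]) / c_hist        # redistributed mass
    return discounted + lam * continuation[w2] / n_bigram_types

print(f"P_KN(learning | machine) = {p_kn('learning', 'machine'):.3f}")
print(f"P_KN(fun | deep)         = {p_kn('fun', 'deep'):.3f}")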

Implementation: N-gram Language Model

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0

import numpy as np
from collections import defaultdict, Counter
from typing import List, Tuple

class BigramLanguageModel:
    """
    Bigram language model (with Add-k smoothing)
    """
    def __init__(self, k: float = 1.0):
        self.k = k
        self.unigram_counts = Counter()
        self.bigram_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, corpus: List[List[str]]):
        """
        Train the model from corpus

        Args:
            corpus: List of sentences (each sentence is a list of words)
        """
        for sentence in corpus:
            # Add start/end tags
            words = ['<s>'] + sentence + ['</s>']

            for word in words:
                self.vocab.add(word)
                self.unigram_counts[word] += 1

            for w1, w2 in zip(words[:-1], words[1:]):
                self.bigram_counts[w1][w2] += 1

        print(f"Vocabulary size: {len(self.vocab)}")
        print(f"Total words: {sum(self.unigram_counts.values())}")
        print(f"Unique bigrams: {sum(len(counts) for counts in self.bigram_counts.values())}")

    def probability(self, w1: str, w2: str) -> float:
        """
        Calculate bigram probability P(w2|w1) (with Add-k smoothing)
        """
        numerator = self.bigram_counts[w1][w2] + self.k
        denominator = self.unigram_counts[w1] + self.k * len(self.vocab)
        return numerator / denominator

    def sentence_probability(self, sentence: List[str]) -> float:
        """
        Calculate sentence probability
        """
        words = ['<s>'] + sentence + ['</s>']
        prob = 1.0

        for w1, w2 in zip(words[:-1], words[1:]):
            prob *= self.probability(w1, w2)

        return prob

    def perplexity(self, test_corpus: List[List[str]]) -> float:
        """
        Calculate perplexity of test corpus
        """
        log_prob = 0
        n_words = 0

        for sentence in test_corpus:
            words = ['<s>'] + sentence + ['</s>']
            n_words += len(words) - 1

            for w1, w2 in zip(words[:-1], words[1:]):
                prob = self.probability(w1, w2)
                log_prob += np.log2(prob)

        return 2 ** (-log_prob / n_words)


# Sample corpus
train_corpus = [
    ['I', 'love', 'machine', 'learning'],
    ['machine', 'learning', 'is', 'fun'],
    ['I', 'love', 'deep', 'learning'],
    ['deep', 'learning', 'is', 'powerful'],
    ['I', 'study', 'machine', 'learning'],
]

test_corpus = [
    ['I', 'love', 'learning'],
    ['machine', 'learning', 'is', 'interesting']
]

# Model training
print("=== Bigram Language Model ===\n")
lm = BigramLanguageModel(k=0.1)
lm.train(train_corpus)

# Probability calculation
print("\nBigram probability examples:")
bigrams = [('I', 'love'), ('love', 'learning'), ('machine', 'learning'), ('learning', 'is')]
for w1, w2 in bigrams:
    prob = lm.probability(w1, w2)
    print(f"  P({w2}|{w1}) = {prob:.4f}")

# Sentence probability
print("\nSentence probabilities:")
for sentence in test_corpus:
    prob = lm.sentence_probability(sentence)
    print(f"  '{' '.join(sentence)}': {prob:.6e}")

# Perplexity
ppl = lm.perplexity(test_corpus)
print(f"\nTest corpus perplexity: {ppl:.2f}")

Output:

=== Bigram Language Model ===

Vocabulary size: 11
Total words: 30
Unique bigrams: 16

Bigram probability examples:
  P(love|I) = 0.5122
  P(learning|love) = 0.0323
  P(learning|machine) = 0.7561
  P(is|learning) = 0.3443

Sentence probabilities:
  'I love learning': 4.267146e-03
  'machine learning is interesting': 1.376499e-04

Test corpus perplexity: 4.92

KenLM Library

For practical N-gram language models, use KenLM:

# Advanced language model using KenLM
# Note: Installation required: pip install https://github.com/kpu/kenlm/archive/master.zip

import kenlm

# Load from ARPA format language model file
# model = kenlm.Model('path/to/model.arpa')

# Calculate sentence score
# score = model.score('this is a test sentence', bos=True, eos=True)
# perplexity = model.perplexity('this is a test sentence')

print("KenLM is an efficient N-gram language model implementation")
print("Optimized for training and querying on large corpora")

2.5 Traditional ASR Pipeline

Complete Pipeline Architecture

graph TD
    A[Speech Signal<br/>Waveform] --> B[Pre-processing<br/>Pre-emphasis]
    B --> C[Framing]
    C --> D[Windowing]
    D --> E[MFCC Extraction<br/>Feature Extraction]
    E --> F[Delta Features<br/>Delta/Delta-Delta]
    F --> G[Acoustic Model<br/>GMM-HMM]
    G --> H[Viterbi Decoding<br/>+ Language Model]
    H --> I[Recognition Result<br/>Text]
    style A fill:#ffebee
    style E fill:#e3f2fd
    style G fill:#fff3e0
    style H fill:#f3e5f5
    style I fill:#c8e6c9

Implementation: Simple ASR System

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - librosa>=0.10.0
# - hmmlearn>=0.3.0

import numpy as np
import librosa
from hmmlearn import hmm
from typing import List, Tuple

class SimpleASR:
    """
    Simple speech recognition system (for demonstration)
    """
    def __init__(self, n_mfcc: int = 13, n_states: int = 3):
        self.n_mfcc = n_mfcc
        self.n_states = n_states
        self.models = {}  # HMM models for each word

    def extract_features(self, audio_path: str, sr: int = 16000) -> np.ndarray:
        """
        Extract MFCC features from audio file
        """
        # Load audio
        y, sr = librosa.load(audio_path, sr=sr)

        # MFCC
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=self.n_mfcc)

        # Delta features
        delta = librosa.feature.delta(mfcc)
        delta2 = librosa.feature.delta(mfcc, order=2)

        # Concatenate
        features = np.vstack([mfcc, delta, delta2])

        return features.T  # (time, features)

    def train_word_model(self, word: str, audio_files: List[str]):
        """
        Train HMM model for a specific word
        """
        # Extract features from all training data
        all_features = []
        lengths = []

        for audio_file in audio_files:
            features = self.extract_features(audio_file)
            all_features.append(features)
            lengths.append(len(features))

        # Concatenate
        X = np.vstack(all_features)

        # Left-to-Right HMM
        # init_params='mc' so that fit() does not overwrite the manually set
        # left-to-right transmat_/startprob_ below (zero transitions stay zero)
        model = hmm.GaussianHMM(
            n_components=self.n_states,
            covariance_type='diag',
            n_iter=100,
            random_state=42,
            init_params='mc'
        )

        # Constrain transition probabilities (Left-to-Right)
        model.transmat_ = np.zeros((self.n_states, self.n_states))
        for i in range(self.n_states):
            if i < self.n_states - 1:
                model.transmat_[i, i] = 0.5
                model.transmat_[i, i+1] = 0.5
            else:
                model.transmat_[i, i] = 1.0

        model.startprob_ = np.zeros(self.n_states)
        model.startprob_[0] = 1.0

        # Training
        model.fit(X, lengths)

        self.models[word] = model

        print(f"Trained model for word '{word}'")
        print(f"  Number of training samples: {len(audio_files)}")
        print(f"  Total frames: {len(X)}")

    def recognize(self, audio_path: str) -> Tuple[str, float]:
        """
        Recognize audio file

        Returns:
            (recognized word, score)
        """
        # Feature extraction
        features = self.extract_features(audio_path)

        # Calculate score for each word model
        scores = {}
        for word, model in self.models.items():
            try:
                score = model.score(features)
                scores[word] = score
            except Exception:
                scores[word] = -np.inf

        # Select word with highest score
        best_word = max(scores, key=scores.get)
        best_score = scores[best_word]

        return best_word, best_score


# Demo usage example (requires actual audio files)
print("=== Simple ASR System ===\n")
print("This system operates as follows:")
print("1. Extract MFCC features (+ delta) from speech")
print("2. Model each word with Left-to-Right HMM")
print("3. Select most likely word using Viterbi algorithm")
print("\nActual usage requires audio files")

# asr = SimpleASR(n_mfcc=13, n_states=3)
#
# # Training (multiple audio samples per word)
# asr.train_word_model('hello', ['hello1.wav', 'hello2.wav', 'hello3.wav'])
# asr.train_word_model('world', ['world1.wav', 'world2.wav', 'world3.wav'])
#
# # Recognition
# word, score = asr.recognize('test.wav')
# print(f"Recognition result: {word} (Score: {score:.2f})")

Integration with Language Model

In actual ASR, the acoustic model and language model are integrated:

$$ \hat{W} = \arg\max_W \left[\log P(X|W) + \lambda \log P(W)\right] $$

class ASRWithLanguageModel:
    """
    ASR with integrated language model
    """
    def __init__(self, acoustic_model, language_model, lm_weight: float = 1.0):
        self.acoustic_model = acoustic_model
        self.language_model = language_model
        self.lm_weight = lm_weight

    def recognize_with_lm(self, audio_features: np.ndarray,
                          previous_words: List[str] = None) -> str:
        """
        Recognition using language model
        """
        # Acoustic score (for each word candidate)
        acoustic_scores = {}
        for word in self.acoustic_model.models.keys():
            acoustic_scores[word] = self.acoustic_model.models[word].score(audio_features)

        # Language model score
        if previous_words:
            lm_scores = {}
            for word in acoustic_scores.keys():
                # Bigram probability
                prev_word = previous_words[-1] if previous_words else ''
                lm_scores[word] = np.log(self.language_model.probability(prev_word, word))
        else:
            lm_scores = {word: 0 for word in acoustic_scores.keys()}

        # Combined score
        total_scores = {
            word: acoustic_scores[word] + self.lm_weight * lm_scores[word]
            for word in acoustic_scores.keys()
        }

        # Select best word
        best_word = max(total_scores, key=total_scores.get)

        return best_word

print("\n=== Language Model Integration ===")
print("By combining acoustic and language scores,")
print("recognition accuracy can be improved with context consideration")

2.6 Chapter Summary

What We Learned

  1. Fundamentals of Speech Recognition

    • ASR is a combination of acoustic and language models
    • Evaluation using WER (Word Error Rate)
    • Error calculation using Levenshtein distance
  2. Hidden Markov Models

    • Modeling state transitions and output probabilities
    • Forward-Backward algorithm (Evaluation)
    • Viterbi algorithm (Decoding)
    • Baum-Welch algorithm (Learning)
  3. Gaussian Mixture Models

    • Density estimation with multiple Gaussian distributions
    • Parameter estimation using EM algorithm
    • Acoustic modeling with GMM-HMM
  4. Language Models

    • Probability modeling of word sequences with N-grams
    • Smoothing techniques (handling unseen events)
    • Evaluation using perplexity
  5. ASR Pipeline

    • Feature extraction (MFCC + delta)
    • Acoustic model (GMM-HMM)
    • Decoding (Viterbi)
    • Language model integration

Advantages and Disadvantages of Traditional ASR

| Advantages | Disadvantages |
|---|---|
| Theoretically clear | Requires large amounts of labeled data |
| Independent components | Difficult to optimize the pipeline as a whole |
| Interpretable at the phoneme level | Weak modeling of long-term dependencies |
| Works with limited data | Dependent on feature engineering |

Next Chapter

In Chapter 3, we will learn about modern End-to-End speech recognition.


Exercises

Problem 1 (Difficulty: easy)

Calculate the WER by hand for the following reference and hypothesis sentences.

Reference:  the cat sat on the mat
Hypothesis: the cat sit on mat

Solution

Answer:

Align reference and hypothesis:

Reference: the cat sat on the mat
Hypothesis: the cat sit on --- mat

Count errors:

• Substitutions: S = 1 (sat → sit)
• Deletions: D = 1 (the)
• Insertions: I = 0
• Reference word count: N = 6

Calculate WER:

$$ \text{WER} = \frac{S + D + I}{N} = \frac{1 + 1 + 0}{6} = \frac{2}{6} = 0.333 = 33.3\% $$

Answer: 33.3%

Problem 2 (Difficulty: medium)

For a 3-state HMM with the following parameters, calculate the probability of the observation sequence [0, 1, 0] using the Forward algorithm.

# Initial probabilities
pi = [0.6, 0.3, 0.1]

# Transition probabilities
A = [[0.7, 0.2, 0.1],
     [0.3, 0.5, 0.2],
     [0.2, 0.3, 0.5]]

# Emission probabilities (observations 0 and 1)
B = [[0.8, 0.2],
     [0.4, 0.6],
     [0.3, 0.7]]
Solution
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0

"""
Example: For a 3-state HMM with the following parameters, calculate t

Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner to Intermediate
Execution time: 5-10 seconds
Dependencies: None
"""

import numpy as np

# Parameters
pi = np.array([0.6, 0.3, 0.1])
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
B = np.array([[0.8, 0.2],
              [0.4, 0.6],
              [0.3, 0.7]])

observations = [0, 1, 0]
T = len(observations)
N = len(pi)

# Forward variables
alpha = np.zeros((T, N))

# Initialization (t=0)
alpha[0] = pi * B[:, observations[0]]
print(f"t=0: α = {alpha[0]}")

# Recursion (t=1, 2, ...)
for t in range(1, T):
    for j in range(N):
        alpha[t, j] = np.sum(alpha[t-1] * A[:, j]) * B[j, observations[t]]
    print(f"t={t}: α = {alpha[t]}")

# Probability of observation sequence
prob = np.sum(alpha[T-1])

print(f"\nProbability of observation sequence {observations}:")
print(f"P(O|λ) = {prob:.6f}")

Output:

t=0: α = [0.48 0.12 0.03]
t=1: α = [0.0756 0.099  0.0609]
t=2: α = [0.07584  0.033156 0.017343]

Probability of observation sequence [0, 1, 0]:
P(O|λ) = 0.126339

Problem 3 (Difficulty: medium)

In a bigram language model, estimate P("learning"|"machine") using maximum likelihood estimation from the following corpus.

I love machine learning
machine learning is fun
I study machine learning
deep learning is great
Solution

Answer:

Counts:

• C(machine, learning) = 3 (sentences 1, 2, and 3)
• C(machine) = 3

Maximum likelihood estimation:

$$ P(\text{learning} | \text{machine}) = \frac{C(\text{machine}, \text{learning})}{C(\text{machine})} = \frac{3}{3} = 1.0 $$

Answer: 1.0 (100%)

In this corpus, "machine" is always followed by "learning".

Problem 4 (Difficulty: hard)

Using a GMM with 2 components (K=2), cluster the following 1D data and find the parameters (mean, variance, weight) of each component.

data = np.array([1.2, 1.5, 1.8, 2.0, 2.1, 8.5, 9.0, 9.2, 9.5, 10.0])
Solution
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - scikit-learn>=1.3.0

"""
Example: Cluster 1D data with a 2-component GMM (K=2)

Purpose: Estimate the mean, variance, and weight of each component
Target: Advanced
Execution time: 2-5 seconds
Dependencies: numpy, scikit-learn, matplotlib
"""

import numpy as np
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt

data = np.array([1.2, 1.5, 1.8, 2.0, 2.1, 8.5, 9.0, 9.2, 9.5, 10.0])
X = data.reshape(-1, 1)

# GMM clustering
gmm = GaussianMixture(n_components=2, random_state=42)
gmm.fit(X)

# Parameters
print("=== GMM Parameters ===")
print(f"\nComponent 1:")
print(f"  Mean: {gmm.means_[0][0]:.3f}")
print(f"  Variance: {gmm.covariances_[0][0][0]:.3f}")
print(f"  Weight: {gmm.weights_[0]:.3f}")

print(f"\nComponent 2:")
print(f"  Mean: {gmm.means_[1][0]:.3f}")
print(f"  Variance: {gmm.covariances_[1][0][0]:.3f}")
print(f"  Weight: {gmm.weights_[1]:.3f}")

# Cluster labels
labels = gmm.predict(X)
print(f"\nCluster labels:")
for i, (val, label) in enumerate(zip(data, labels)):
    print(f"  Data{i+1}: {val:.1f} → Cluster{label}")

# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(data, np.zeros_like(data), c=labels, cmap='viridis',
            s=100, alpha=0.6, edgecolors='black')
plt.scatter(gmm.means_, [0, 0], c='red', s=200, marker='X',
            edgecolors='black', linewidths=2, label='Centers')
plt.xlabel('Value')
plt.title('1D Data Clustering with GMM', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Example output:

=== GMM Parameters ===

Component 1:
  Mean: 1.720
  Variance: 0.124
  Weight: 0.500

Component 2:
  Mean: 9.240
  Variance: 0.294
  Weight: 0.500

Cluster labels:
  Data1: 1.2 → Cluster0
  Data2: 1.5 → Cluster0
  Data3: 1.8 → Cluster0
  Data4: 2.0 → Cluster0
  Data5: 2.1 → Cluster0
  Data6: 8.5 → Cluster1
  Data7: 9.0 → Cluster1
  Data8: 9.2 → Cluster1
  Data9: 9.5 → Cluster1
  Data10: 10.0 → Cluster1

Problem 5 (Difficulty: hard)

In an ASR system that integrates acoustic and language models, explain the impact of varying the language model weight (LM weight) on recognition results. Also, describe how the optimal weight should be determined.

Solution

Answer:

Impact of LM weight:

Combined score:

$$ \text{Score}(W) = \log P(X|W) + \lambda \log P(W) $$

| LM weight $\lambda$ | Impact | Recognition tendency |
|---|---|---|
| Small (close to 0) | Acoustic model dominates | Selects acoustically similar words; grammatically unnatural |
| Appropriate | Balanced | Considers both acoustics and grammar; best recognition accuracy |
| Large | Language model dominates | Grammatically correct but acoustically wrong; biased toward frequent words |

Methods for determining optimal weight:

  1. Grid search on development set

    lambda_values = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
    best_lambda = None
    best_wer = float('inf')
    
    for lam in lambda_values:
        wer = evaluate_asr(dev_set, lm_weight=lam)
        if wer < best_wer:
            best_wer = wer
            best_lambda = lam
    
    print(f"Optimal LM weight: {best_lambda}")
    
  2. Considering domain dependency

    • Read speech: High acoustic quality → smaller $\lambda$
    • Spontaneous speech: Noisy → larger $\lambda$
    • Many technical terms: Low language model reliability → smaller $\lambda$
  3. Dynamic adjustment

    • Dynamically adjust based on confidence scores
    • Adjust according to acoustic quality (SNR)

Example:

Speech: "I scream" (ice cream)

λ = 0.1 (acoustic priority):
  → "I scream" (acoustically accurate)

λ = 5.0 (language priority):
  → "ice cream" (grammatically natural, high frequency in language model)

λ = 1.0 (balanced):
  → Appropriate selection depending on context

Conclusion: The optimal LM weight depends on domain, acoustic conditions, and language model quality, and requires experimental tuning on a development set.


References

  1. Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.
  2. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  3. Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing (3rd ed.). Draft.
  4. Gales, M., & Young, S. (2008). The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing, 1(3), 195-304.
  5. Heafield, K. (2011). KenLM: Faster and smaller language model queries. Proceedings of the Sixth Workshop on Statistical Machine Translation, 187-197.
