This chapter focuses on practical speech and audio applications. You will learn the difference between speaker identification and verification, how to use speaker embeddings such as i-vector and x-vector, and how to build speech emotion recognition systems, among other topics.
Learning Objectives
By reading this chapter, you will be able to:
- ✅ Understand the differences between speaker identification and verification, and implement them
- ✅ Use speaker embeddings with i-vector and x-vector
- ✅ Build speech emotion recognition systems
- ✅ Implement speech enhancement and noise reduction techniques
- ✅ Understand fundamental technologies in music information processing
- ✅ Develop end-to-end speech AI applications
5.1 Speaker Recognition and Verification
Overview of Speaker Recognition
Speaker Recognition is a technology that identifies speakers from their voice. It is mainly classified into two types:
| Task | Description | Example |
|---|---|---|
| Speaker Identification | Identifies a speaker from multiple candidates | "Whose voice is this?" |
| Speaker Verification | Verifies whether the speaker is the claimed person | "Is this Mr. Yamada's voice?" |
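The distinction is easiest to see in code: identification picks the best match among all enrolled speakers, while verification compares a single claimed identity against a threshold. Below is a minimal sketch using cosine similarity on toy embedding vectors; the enrolled dictionary, threshold value, and random vectors are illustrative placeholders, not part of any specific system.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def identify(test_emb, enrolled):
    """Identification: return the enrolled ID with the highest similarity (N-way choice)."""
    scores = {sid: cosine_similarity(test_emb, emb) for sid, emb in enrolled.items()}
    return max(scores, key=scores.get), scores

def verify(test_emb, claimed_id, enrolled, threshold=0.7):
    """Verification: accept or reject a single claimed identity against a threshold."""
    score = cosine_similarity(test_emb, enrolled[claimed_id])
    return score > threshold, score

# Toy example with random "embeddings" (illustrative only)
rng = np.random.default_rng(0)
enrolled = {"alice": rng.normal(size=64), "bob": rng.normal(size=64)}
test = enrolled["alice"] + 0.1 * rng.normal(size=64)
print(identify(test, enrolled)[0])       # -> 'alice'
print(verify(test, "bob", enrolled)[0])  # -> likely False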
Approaches to Speaker Recognition
Implementation Example: Basic Speaker Recognition
# Requirements:
# - Python 3.9+
# - librosa>=0.10.0
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - scikit-learn>=1.2.0
# - seaborn>=0.12.0
import numpy as np
import librosa
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')
# Function to extract speaker features
def extract_speaker_features(audio_path, n_mfcc=20):
"""
Extract features for speaker recognition
Parameters:
-----------
audio_path : str
Path to audio file
n_mfcc : int
Number of MFCC dimensions
Returns:
--------
features : np.ndarray
Statistical feature vector
"""
# Load audio
y, sr = librosa.load(audio_path, sr=16000)
# Extract MFCC
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
# Delta MFCC (first derivative)
mfcc_delta = librosa.feature.delta(mfcc)
# Delta-Delta MFCC (second derivative)
mfcc_delta2 = librosa.feature.delta(mfcc, order=2)
# Calculate statistics (mean and standard deviation)
features = np.concatenate([
np.mean(mfcc, axis=1),
np.std(mfcc, axis=1),
np.mean(mfcc_delta, axis=1),
np.std(mfcc_delta, axis=1),
np.mean(mfcc_delta2, axis=1),
np.std(mfcc_delta2, axis=1)
])
return features
# Generate sample data (use actual dataset in practice)
def generate_sample_speaker_data(n_speakers=5, n_samples_per_speaker=20):
"""
Generate demo speaker data
"""
np.random.seed(42)
X = []
y = []
for speaker_id in range(n_speakers):
# Generate data with speaker-specific features
speaker_mean = np.random.randn(120) * 0.5 + speaker_id
for _ in range(n_samples_per_speaker):
# Add noise to create variation
sample = speaker_mean + np.random.randn(120) * 0.3
X.append(sample)
y.append(speaker_id)
return np.array(X), np.array(y)
# Generate data
X, y = generate_sample_speaker_data(n_speakers=5, n_samples_per_speaker=20)
# Split training and test data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train speaker identification model with SVM
model = SVC(kernel='rbf', C=1.0, gamma='scale', probability=True)
model.fit(X_train_scaled, y_train)
# Evaluation
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print("=== Speaker Identification System ===")
print(f"Number of speakers: {len(np.unique(y))}")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Feature dimensions: {X.shape[1]}")
print(f"\nIdentification accuracy: {accuracy:.3f}")
print(f"\nDetailed report:")
print(classification_report(y_test, y_pred,
target_names=[f'Speaker {i}' for i in range(5)]))
# Visualize confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=[f'S{i}' for i in range(5)],
yticklabels=[f'S{i}' for i in range(5)])
plt.xlabel('Predicted Speaker')
plt.ylabel('True Speaker')
plt.title('Speaker Identification Confusion Matrix', fontsize=14)
plt.tight_layout()
plt.show()
Output:
=== Speaker Identification System ===
Number of speakers: 5
Training samples: 70
Test samples: 30
Feature dimensions: 120
Identification accuracy: 0.967
Detailed report:
precision recall f1-score support
Speaker 0 1.00 1.00 1.00 6
Speaker 1 1.00 0.83 0.91 6
Speaker 2 0.86 1.00 0.92 6
Speaker 3 1.00 1.00 1.00 6
Speaker 4 1.00 1.00 1.00 6
Speaker Embedding with x-vector
x-vector is a method that embeds speaker characteristics into fixed-length vectors using deep neural networks.
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
import torch.nn.functional as F
class XVectorNetwork(nn.Module):
"""
x-vector extraction network
Architecture:
- TDNN (Time Delay Neural Network) layers
- Statistics pooling
- Embedding layers
"""
def __init__(self, input_dim=40, embedding_dim=512):
super(XVectorNetwork, self).__init__()
# TDNN layers
self.tdnn1 = nn.Conv1d(input_dim, 512, kernel_size=5, dilation=1)
self.tdnn2 = nn.Conv1d(512, 512, kernel_size=3, dilation=2)
self.tdnn3 = nn.Conv1d(512, 512, kernel_size=3, dilation=3)
self.tdnn4 = nn.Conv1d(512, 512, kernel_size=1, dilation=1)
self.tdnn5 = nn.Conv1d(512, 1500, kernel_size=1, dilation=1)
# After statistics pooling: 1500 * 2 = 3000 dimensions
# Segment-level layers
self.segment1 = nn.Linear(3000, embedding_dim)
self.segment2 = nn.Linear(embedding_dim, embedding_dim)
# Batch normalization
self.bn1 = nn.BatchNorm1d(512)
self.bn2 = nn.BatchNorm1d(512)
self.bn3 = nn.BatchNorm1d(512)
self.bn4 = nn.BatchNorm1d(512)
self.bn5 = nn.BatchNorm1d(1500)
def forward(self, x):
"""
Forward pass
Parameters:
-----------
x : torch.Tensor
Input features (batch, features, time)
Returns:
--------
embedding : torch.Tensor
Speaker embedding vector (batch, embedding_dim)
"""
# TDNN layers
x = F.relu(self.bn1(self.tdnn1(x)))
x = F.relu(self.bn2(self.tdnn2(x)))
x = F.relu(self.bn3(self.tdnn3(x)))
x = F.relu(self.bn4(self.tdnn4(x)))
x = F.relu(self.bn5(self.tdnn5(x)))
# Statistics pooling: mean + std
mean = torch.mean(x, dim=2)
std = torch.std(x, dim=2)
stats = torch.cat([mean, std], dim=1)
# Segment-level layers
x = F.relu(self.segment1(stats))
embedding = self.segment2(x)
return embedding
# Initialize model
model = XVectorNetwork(input_dim=40, embedding_dim=512)
print("=== x-vector Network ===")
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
# Test with sample input
batch_size = 4
n_features = 40
n_frames = 100
sample_input = torch.randn(batch_size, n_features, n_frames)
with torch.no_grad():
embeddings = model(sample_input)
print(f"\nInput shape: {sample_input.shape}")
print(f"Embedding shape: {embeddings.shape}")
print(f"Sample embedding vector:")
print(embeddings[0, :10])
Output:
=== x-vector Network ===
Total parameters: 4,515,220
Input shape: torch.Size([4, 40, 100])
Embedding shape: torch.Size([4, 512])
Sample embedding vector:
tensor([-0.2156, 0.1834, -0.0923, 0.3421, -0.1567, 0.2891, -0.0456, 0.1234,
-0.3012, 0.0789])
Speaker Verification System
from scipy.spatial.distance import cosine
class SpeakerVerification:
"""
Speaker verification system
Verifies identity by calculating similarity between embedding vectors
"""
def __init__(self, threshold=0.5):
self.threshold = threshold
self.enrolled_speakers = {}
def enroll_speaker(self, speaker_id, embedding):
"""
Enroll a speaker
Parameters:
-----------
speaker_id : str
Speaker ID
embedding : np.ndarray
Speaker's embedding vector
"""
self.enrolled_speakers[speaker_id] = embedding
def verify(self, speaker_id, test_embedding):
"""
Verify a speaker
Parameters:
-----------
speaker_id : str
Speaker ID to verify
test_embedding : np.ndarray
Test audio's embedding vector
Returns:
--------
is_verified : bool
Whether the speaker is verified
similarity : float
Similarity score
"""
if speaker_id not in self.enrolled_speakers:
raise ValueError(f"Speaker {speaker_id} is not enrolled")
enrolled_embedding = self.enrolled_speakers[speaker_id]
# Calculate cosine similarity (complement of distance)
similarity = 1 - cosine(enrolled_embedding, test_embedding)
is_verified = similarity > self.threshold
return is_verified, similarity
# Demonstration
np.random.seed(42)
# Initialize speaker verification system
verifier = SpeakerVerification(threshold=0.7)
# Enroll speakers
speaker_a_embedding = np.random.randn(512)
speaker_b_embedding = np.random.randn(512)
verifier.enroll_speaker("Alice", speaker_a_embedding)
verifier.enroll_speaker("Bob", speaker_b_embedding)
print("=== Speaker Verification System ===")
print(f"Enrolled speakers: {list(verifier.enrolled_speakers.keys())}")
print(f"Threshold: {verifier.threshold}")
# Test case 1: Alice's genuine voice (high similarity)
test_alice_genuine = speaker_a_embedding + np.random.randn(512) * 0.1
is_verified, similarity = verifier.verify("Alice", test_alice_genuine)
print(f"\nTest 1 - Alice (genuine):")
print(f" Verification result: {'✓ Accepted' if is_verified else '✗ Rejected'}")
print(f" Similarity: {similarity:.3f}")
# Test case 2: Alice impersonation (Bob's voice)
is_verified, similarity = verifier.verify("Alice", speaker_b_embedding)
print(f"\nTest 2 - Alice (impersonation):")
print(f" Verification result: {'✓ Accepted' if is_verified else '✗ Rejected'}")
print(f" Similarity: {similarity:.3f}")
# Test case 3: Bob's genuine voice
test_bob_genuine = speaker_b_embedding + np.random.randn(512) * 0.1
is_verified, similarity = verifier.verify("Bob", test_bob_genuine)
print(f"\nTest 3 - Bob (genuine):")
print(f" Verification result: {'✓ Accepted' if is_verified else '✗ Rejected'}")
print(f" Similarity: {similarity:.3f}")
Important: In actual systems, multiple enrollment utterances are averaged, or more advanced similarity calculations such as PLDA (Probabilistic Linear Discriminant Analysis) are used.
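To illustrate the first point, several enrollment utterances can be collapsed into one more stable speaker model by averaging and re-normalizing their embeddings before verification. This is only a sketch under the assumption that embeddings are plain NumPy vectors like those used by the SpeakerVerification class above; PLDA scoring itself is not shown.

import numpy as np

def build_enrollment_model(embeddings):
    """Average several enrollment embeddings into one length-normalized speaker model."""
    stacked = np.vstack(embeddings)
    mean_emb = stacked.mean(axis=0)
    return mean_emb / (np.linalg.norm(mean_emb) + 1e-8)

# Hypothetical example: three enrollment utterances from the same speaker
rng = np.random.default_rng(42)
true_voice = rng.normal(size=512)
utterances = [true_voice + 0.2 * rng.normal(size=512) for _ in range(3)]
speaker_model = build_enrollment_model(utterances)
print(speaker_model.shape)  # (512,)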
5.2 Speech Emotion Recognition
What is Speech Emotion Recognition
Speech Emotion Recognition (SER) is a technology that estimates a speaker's emotional state from their voice.
Features for Emotion Recognition
| Feature | Description | Relationship to Emotion |
|---|---|---|
| Prosodic Features | Pitch, energy, speaking rate | Anger→high pitch, Sadness→low energy |
| Acoustic Features | MFCC, spectrum | Capture voice quality changes |
| Temporal Features | Utterance duration, pauses | Tension→fast speech, Sadness→long pauses |
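These prosodic cues can be approximated with librosa: pitch with librosa.yin, energy with the RMS feature, and speaking rate (very roughly) with onset density. The sketch below assumes a mono signal y at sampling rate sr; the synthetic tone at the end is only there so the snippet runs standalone.

import numpy as np
import librosa

def prosodic_summary(y, sr):
    """Rough prosodic statistics: pitch, energy, and an onset-based speaking-rate proxy."""
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)      # frame-wise F0 estimate
    f0 = f0[np.isfinite(f0)]
    rms = librosa.feature.rms(y=y)[0]                   # frame-wise energy
    onsets = librosa.onset.onset_detect(y=y, sr=sr)     # onset events
    duration = len(y) / sr
    return {
        "pitch_mean_hz": float(np.mean(f0)) if len(f0) else 0.0,
        "pitch_std_hz": float(np.std(f0)) if len(f0) else 0.0,
        "energy_mean": float(np.mean(rms)),
        "onsets_per_sec": len(onsets) / duration,        # crude speaking-rate proxy
    }

# Standalone example on a synthetic tone (real speech would come from librosa.load)
sr = 16000
t = np.linspace(0, 2.0, 2 * sr)
y = np.sin(2 * np.pi * 200 * t).astype(np.float32)
print(prosodic_summary(y, sr))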
Major Emotion Datasets
| Dataset | Description | Emotion Categories |
|---|---|---|
| RAVDESS | Acted emotional speech | 8 emotions (joy, sadness, anger, fear, etc.) |
| IEMOCAP | Conversational emotional speech | 5 emotions + dimensional model (arousal, valence) |
| EMO-DB | German emotional speech | 7 emotions |
Implementation Example: Emotion Recognition System
# Requirements:
# - Python 3.9+
# - librosa>=0.10.0
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - scikit-learn>=1.2.0
# - seaborn>=0.12.0
import numpy as np
import librosa
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
def extract_emotion_features(audio_path):
"""
Extract comprehensive features for emotion recognition
Returns:
--------
features : np.ndarray
Feature vector
"""
y, sr = librosa.load(audio_path, sr=22050)
features = []
# 1. MFCC (acoustic features)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
features.extend(np.mean(mfcc, axis=1))
features.extend(np.std(mfcc, axis=1))
# 2. Chroma features (pitch)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
features.extend(np.mean(chroma, axis=1))
features.extend(np.std(chroma, axis=1))
# 3. Mel spectrogram
mel = librosa.feature.melspectrogram(y=y, sr=sr)
features.extend(np.mean(mel, axis=1))
features.extend(np.std(mel, axis=1))
# 4. Spectral contrast
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
features.extend(np.mean(contrast, axis=1))
features.extend(np.std(contrast, axis=1))
# 5. Tonal centroid (Tonnetz)
tonnetz = librosa.feature.tonnetz(y=y, sr=sr)
features.extend(np.mean(tonnetz, axis=1))
features.extend(np.std(tonnetz, axis=1))
# 6. Zero crossing rate
zcr = librosa.feature.zero_crossing_rate(y)
features.append(np.mean(zcr))
features.append(np.std(zcr))
# 7. RMS energy
rms = librosa.feature.rms(y=y)
features.append(np.mean(rms))
features.append(np.std(rms))
# 8. Pitch (fundamental frequency)
pitches, magnitudes = librosa.piptrack(y=y, sr=sr)
pitch_values = []
for t in range(pitches.shape[1]):
index = magnitudes[:, t].argmax()
pitch = pitches[index, t]
if pitch > 0:
pitch_values.append(pitch)
if len(pitch_values) > 0:
features.append(np.mean(pitch_values))
features.append(np.std(pitch_values))
else:
features.extend([0, 0])
return np.array(features)
# Generate sample data (use RAVDESS etc. in practice)
def generate_emotion_dataset(n_samples_per_emotion=50):
"""
Generate demo emotion data
"""
np.random.seed(42)
emotions = ['neutral', 'happy', 'sad', 'angry', 'fear']
n_features = 194  # Demo dimensionality (the extractor above yields more dimensions with default settings)
X = []
y = []
for emotion_id, emotion in enumerate(emotions):
# Generate data with emotion-specific patterns
base_features = np.random.randn(n_features) + emotion_id * 2
for _ in range(n_samples_per_emotion):
# Add variation
sample = base_features + np.random.randn(n_features) * 0.5
X.append(sample)
y.append(emotion_id)
return np.array(X), np.array(y), emotions
# Generate data
X, y, emotion_labels = generate_emotion_dataset(n_samples_per_emotion=50)
# Split training and test data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Emotion classification with Random Forest
model = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=20)
model.fit(X_train_scaled, y_train)
# Evaluation
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print("=== Speech Emotion Recognition System ===")
print(f"Emotion categories: {emotion_labels}")
print(f"Feature dimensions: {X.shape[1]}")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"\nClassification accuracy: {accuracy:.3f}")
print(f"\nDetailed report:")
print(classification_report(y_test, y_pred, target_names=emotion_labels))
# Visualize confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='YlOrRd',
xticklabels=emotion_labels,
yticklabels=emotion_labels)
plt.xlabel('Predicted Emotion')
plt.ylabel('True Emotion')
plt.title('Emotion Recognition Confusion Matrix', fontsize=14)
plt.tight_layout()
plt.show()
# Feature importance
feature_importance = model.feature_importances_
plt.figure(figsize=(12, 6))
plt.bar(range(len(feature_importance)), feature_importance, alpha=0.7)
plt.xlabel('Feature Index')
plt.ylabel('Importance')
plt.title('Feature Importance', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Emotion Recognition with Deep Learning
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
class EmotionCNN(nn.Module):
"""
CNN model for emotion recognition
Takes spectrogram as input
"""
def __init__(self, n_emotions=5):
super(EmotionCNN, self).__init__()
# Convolutional layers
self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
self.pool = nn.MaxPool2d(2, 2)
self.dropout = nn.Dropout(0.3)
# Fully connected layers
self.fc1 = nn.Linear(128 * 16 * 16, 256)
self.fc2 = nn.Linear(256, 128)
self.fc3 = nn.Linear(128, n_emotions)
self.bn1 = nn.BatchNorm2d(32)
self.bn2 = nn.BatchNorm2d(64)
self.bn3 = nn.BatchNorm2d(128)
def forward(self, x):
# Conv block 1
x = self.pool(F.relu(self.bn1(self.conv1(x))))
x = self.dropout(x)
# Conv block 2
x = self.pool(F.relu(self.bn2(self.conv2(x))))
x = self.dropout(x)
# Conv block 3
x = self.pool(F.relu(self.bn3(self.conv3(x))))
x = self.dropout(x)
# Flatten
x = x.view(x.size(0), -1)
# FC layers
x = F.relu(self.fc1(x))
x = self.dropout(x)
x = F.relu(self.fc2(x))
x = self.dropout(x)
x = self.fc3(x)
return x
# Initialize model
model = EmotionCNN(n_emotions=5)
print("=== Emotion Recognition CNN Model ===")
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
# Test with sample input (spectrogram: 128x128)
sample_input = torch.randn(4, 1, 128, 128)
with torch.no_grad():
output = model(sample_input)
print(f"\nInput shape: {sample_input.shape}")
print(f"Output shape: {output.shape}")
print(f"Output logits (sample):")
print(output[0])
# Simple training demo
def train_emotion_model(model, X_train, y_train, epochs=10, batch_size=32):
"""
Train emotion recognition model
"""
# Convert data to Tensors
X_tensor = torch.FloatTensor(X_train).unsqueeze(1).unsqueeze(2)
y_tensor = torch.LongTensor(y_train)
# Create DataLoader
dataset = TensorDataset(X_tensor, y_tensor)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Training loop
model.train()
for epoch in range(epochs):
total_loss = 0
for batch_X, batch_y in dataloader:
# Preprocess: resize data to appropriate shape
batch_X_resized = F.interpolate(batch_X, size=(128, 128))
optimizer.zero_grad()
outputs = model(batch_X_resized)
loss = criterion(outputs, batch_y)
loss.backward()
optimizer.step()
total_loss += loss.item()
avg_loss = total_loss / len(dataloader)
if (epoch + 1) % 2 == 0:
print(f"Epoch [{epoch+1}/{epochs}], Loss: {avg_loss:.4f}")
return model
print("\n=== Model Training (Demo) ===")
trained_model = train_emotion_model(model, X_train_scaled, y_train, epochs=5)
print("✓ Training complete")
5.3 Speech Enhancement and Noise Reduction
Purpose of Speech Enhancement
Speech Enhancement is a technology that recovers the target speech from noisy audio and improves its quality and intelligibility.
Major Techniques
| Technique | Principle | Characteristics |
|---|---|---|
| Spectral Subtraction | Estimate and subtract noise spectrum | Simple, real-time capable |
| Wiener Filter | Minimum mean square error filter | Statistically optimal |
| Deep Learning | Estimate mask with DNN | High performance, requires training data |
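Before the spectral subtraction implementation below, the Wiener filter row can be tried in one call with scipy.signal.wiener, which applies an adaptive Wiener filter directly to the time-domain samples. This is a rough baseline sketch on a synthetic signal; the window size mysize is an illustrative choice, not a recommended setting.

import numpy as np
from scipy.signal import wiener

# Synthetic noisy signal for illustration
sr = 16000
t = np.linspace(0, 1.0, sr)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.3 * np.random.randn(len(t))

# Adaptive Wiener filtering over a short sliding window (mysize = window length)
wiener_enhanced = wiener(noisy, mysize=29)

print(f"Residual power before: {np.mean((noisy - clean) ** 2):.4f}")
print(f"Residual power after:  {np.mean((wiener_enhanced - clean) ** 2):.4f}")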
Implementation Example: Spectral Subtraction
# Requirements:
# - Python 3.9+
# - librosa>=0.10.0
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - scipy>=1.10.0
import numpy as np
import librosa
import matplotlib.pyplot as plt
from scipy.signal import wiener
def spectral_subtraction(noisy_signal, sr, noise_estimate_duration=0.5):
"""
Noise reduction using spectral subtraction
Parameters:
-----------
noisy_signal : np.ndarray
Noisy speech signal
sr : int
Sampling rate
noise_estimate_duration : float
Duration of initial segment for noise estimation (seconds)
Returns:
--------
enhanced_signal : np.ndarray
Enhanced speech signal
"""
# STFT
n_fft = 2048
hop_length = 512
D = librosa.stft(noisy_signal, n_fft=n_fft, hop_length=hop_length)
magnitude = np.abs(D)
phase = np.angle(D)
# Estimate noise spectrum (using initial segment)
noise_frames = int(noise_estimate_duration * sr / hop_length)
noise_spectrum = np.mean(magnitude[:, :noise_frames], axis=1, keepdims=True)
# Spectral subtraction
alpha = 2.0 # Subtraction coefficient
enhanced_magnitude = magnitude - alpha * noise_spectrum
# Clip negative values to 0
enhanced_magnitude = np.maximum(enhanced_magnitude, 0)
# Restore phase and inverse STFT
enhanced_D = enhanced_magnitude * np.exp(1j * phase)
enhanced_signal = librosa.istft(enhanced_D, hop_length=hop_length, length=len(noisy_signal))  # keep original length
return enhanced_signal
# Generate sample audio
sr = 22050
duration = 3.0
t = np.linspace(0, duration, int(sr * duration))
# Clean speech signal (combination of sine waves)
clean_signal = (
np.sin(2 * np.pi * 440 * t) + # A4 note
0.5 * np.sin(2 * np.pi * 880 * t) # A5 note
)
# Add noise
noise = np.random.randn(len(clean_signal)) * 0.3
noisy_signal = clean_signal + noise
# Apply spectral subtraction
enhanced_signal = spectral_subtraction(noisy_signal, sr)
# Calculate SNR
def calculate_snr(signal, noise):
signal_power = np.mean(signal ** 2)
noise_power = np.mean(noise ** 2)
snr = 10 * np.log10(signal_power / noise_power)
return snr
snr_before = calculate_snr(clean_signal, noisy_signal - clean_signal)
snr_after = calculate_snr(clean_signal, enhanced_signal[:len(clean_signal)] - clean_signal)
print("=== Noise Reduction with Spectral Subtraction ===")
print(f"SNR (before): {snr_before:.2f} dB")
print(f"SNR (after): {snr_after:.2f} dB")
print(f"Improvement: {snr_after - snr_before:.2f} dB")
# Visualization
fig, axes = plt.subplots(3, 2, figsize=(15, 12))
# Time domain waveforms
axes[0, 0].plot(t[:1000], clean_signal[:1000], alpha=0.7)
axes[0, 0].set_title('Clean Signal', fontsize=12)
axes[0, 0].set_xlabel('Time (seconds)')
axes[0, 0].set_ylabel('Amplitude')
axes[0, 0].grid(True, alpha=0.3)
axes[1, 0].plot(t[:1000], noisy_signal[:1000], alpha=0.7, color='orange')
axes[1, 0].set_title('Noisy Signal', fontsize=12)
axes[1, 0].set_xlabel('Time (seconds)')
axes[1, 0].set_ylabel('Amplitude')
axes[1, 0].grid(True, alpha=0.3)
axes[2, 0].plot(t[:len(enhanced_signal)][:1000], enhanced_signal[:1000],
alpha=0.7, color='green')
axes[2, 0].set_title('Enhanced Signal (After Spectral Subtraction)', fontsize=12)
axes[2, 0].set_xlabel('Time (seconds)')
axes[2, 0].set_ylabel('Amplitude')
axes[2, 0].grid(True, alpha=0.3)
# Spectrograms
D_clean = librosa.stft(clean_signal)
D_noisy = librosa.stft(noisy_signal)
D_enhanced = librosa.stft(enhanced_signal)
axes[0, 1].imshow(librosa.amplitude_to_db(np.abs(D_clean), ref=np.max),
aspect='auto', origin='lower', cmap='viridis')
axes[0, 1].set_title('Clean (Spectrogram)', fontsize=12)
axes[0, 1].set_ylabel('Frequency')
axes[1, 1].imshow(librosa.amplitude_to_db(np.abs(D_noisy), ref=np.max),
aspect='auto', origin='lower', cmap='viridis')
axes[1, 1].set_title('Noisy (Spectrogram)', fontsize=12)
axes[1, 1].set_ylabel('Frequency')
axes[2, 1].imshow(librosa.amplitude_to_db(np.abs(D_enhanced), ref=np.max),
aspect='auto', origin='lower', cmap='viridis')
axes[2, 1].set_title('Enhanced (Spectrogram)', fontsize=12)
axes[2, 1].set_xlabel('Time Frame')
axes[2, 1].set_ylabel('Frequency')
plt.tight_layout()
plt.show()
Using noisereduce Library
import noisereduce as nr
# Noise reduction using noisereduce
reduced_noise_signal = nr.reduce_noise(
y=noisy_signal,
sr=sr,
stationary=True,
prop_decrease=1.0
)
# Calculate SNR
snr_noisereduce = calculate_snr(clean_signal,
reduced_noise_signal[:len(clean_signal)] - clean_signal)
print("\n=== noisereduce Library ===")
print(f"SNR (after): {snr_noisereduce:.2f} dB")
print(f"Improvement: {snr_noisereduce - snr_before:.2f} dB")
# Comparison visualization
plt.figure(figsize=(15, 8))
plt.subplot(4, 1, 1)
plt.plot(t[:1000], clean_signal[:1000])
plt.title('Clean Signal', fontsize=12)
plt.ylabel('Amplitude')
plt.grid(True, alpha=0.3)
plt.subplot(4, 1, 2)
plt.plot(t[:1000], noisy_signal[:1000], color='orange')
plt.title(f'Noisy Signal (SNR: {snr_before:.1f} dB)', fontsize=12)
plt.ylabel('Amplitude')
plt.grid(True, alpha=0.3)
plt.subplot(4, 1, 3)
plt.plot(t[:len(enhanced_signal)][:1000], enhanced_signal[:1000], color='green')
plt.title(f'Spectral Subtraction (SNR: {snr_after:.1f} dB)', fontsize=12)
plt.ylabel('Amplitude')
plt.grid(True, alpha=0.3)
plt.subplot(4, 1, 4)
plt.plot(t[:len(reduced_noise_signal)][:1000], reduced_noise_signal[:1000],
color='red')
plt.title(f'noisereduce (SNR: {snr_noisereduce:.1f} dB)', fontsize=12)
plt.xlabel('Time (seconds)')
plt.ylabel('Amplitude')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Note: The noisereduce library can be installed with pip install noisereduce.
5.4 Music Information Processing
Overview of Music Information Retrieval (MIR)
Music Information Retrieval (MIR) is a technology that extracts and analyzes information from music signals.
Major Tasks
| Task | Description | Application Example |
|---|---|---|
| Beat Tracking | Detect rhythm beats | Auto DJ, dance games |
| Chord Recognition | Estimate chord progressions | Auto transcription, music theory analysis |
| Genre Classification | Identify music genres | Music recommendation, playlist generation |
| Source Separation | Separate by instrument | Remixing, karaoke |
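As a quick look at the source separation row, librosa provides harmonic-percussive source separation (HPSS), which splits a signal into a tonal component and a drum-like component. The sketch below builds a synthetic mix so it runs standalone; real music would be loaded with librosa.load.

import numpy as np
import librosa

# Build a synthetic mix: a steady tone plus short percussive bursts every 0.5 s
sr = 22050
t = np.linspace(0, 2.0, 2 * sr)
harmonic_part = 0.5 * np.sin(2 * np.pi * 440 * t)
percussive_part = np.zeros_like(t)
for start in np.arange(0, 2.0, 0.5):
    idx = int(start * sr)
    burst = np.random.randn(1000) * np.exp(-np.linspace(0, 8, 1000))
    percussive_part[idx:idx + 1000] += burst
mix = harmonic_part + percussive_part

# Harmonic-percussive source separation
y_harmonic, y_percussive = librosa.effects.hpss(mix)
print(f"Harmonic energy:   {np.sum(y_harmonic ** 2):.2f}")
print(f"Percussive energy: {np.sum(y_percussive ** 2):.2f}")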
Implementation Example: Beat Tracking
# Requirements:
# - Python 3.9+
# - librosa>=0.10.0
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
def beat_tracking_demo():
"""
Beat tracking demonstration
"""
# Generate sample music signal (drum beat style)
sr = 22050
duration = 8.0
t = np.linspace(0, duration, int(sr * duration))
# 120 BPM (2 beats per second)
bpm = 120
beat_interval = 60.0 / bpm
# Generate kick drum-like sound at beat positions
signal = np.zeros(len(t))
for beat_time in np.arange(0, duration, beat_interval):
beat_sample = int(beat_time * sr)
if beat_sample < len(signal):
# Simulate kick drum (decaying low frequency)
kick_duration = int(0.1 * sr)
kick_t = np.linspace(0, 0.1, kick_duration)
kick = np.sin(2 * np.pi * 80 * kick_t) * np.exp(-kick_t * 30)
end_idx = min(beat_sample + kick_duration, len(signal))
signal[beat_sample:end_idx] += kick[:end_idx - beat_sample]
# Add slight noise
signal += np.random.randn(len(signal)) * 0.05
# Beat detection
tempo, beat_frames = librosa.beat.beat_track(y=signal, sr=sr)
tempo = float(np.atleast_1d(tempo)[0])  # newer librosa versions may return an array
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
print("=== Beat Tracking ===")
print(f"Estimated tempo: {tempo:.1f} BPM")
print(f"Detected beats: {len(beat_times)}")
print(f"Beat interval: {np.mean(np.diff(beat_times)):.3f} seconds")
# Calculate onset strength
onset_env = librosa.onset.onset_strength(y=signal, sr=sr)
times = librosa.frames_to_time(np.arange(len(onset_env)), sr=sr)
# Visualization
fig, axes = plt.subplots(3, 1, figsize=(14, 10))
# Waveform and beat positions
axes[0].plot(t, signal, alpha=0.6)
axes[0].vlines(beat_times, -1, 1, color='r', alpha=0.8,
linestyle='--', label='Detected Beats')
axes[0].set_xlabel('Time (seconds)')
axes[0].set_ylabel('Amplitude')
axes[0].set_title(f'Audio Waveform and Beat Detection (Estimated Tempo: {tempo:.1f} BPM)', fontsize=12)
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Onset strength
axes[1].plot(times, onset_env, alpha=0.7, color='green')
axes[1].vlines(beat_times, 0, onset_env.max(), color='r',
alpha=0.8, linestyle='--')
axes[1].set_xlabel('Time (seconds)')
axes[1].set_ylabel('Strength')
axes[1].set_title('Onset Strength and Beat Positions', fontsize=12)
axes[1].grid(True, alpha=0.3)
# Tempogram
tempogram = librosa.feature.tempogram(y=signal, sr=sr)
axes[2].imshow(tempogram, aspect='auto', origin='lower', cmap='magma')
axes[2].set_xlabel('Time Frame')
axes[2].set_ylabel('Tempo (BPM)')
axes[2].set_title('Tempogram', fontsize=12)
plt.tight_layout()
plt.show()
return signal, sr, tempo, beat_times
# Execute
signal, sr, tempo, beat_times = beat_tracking_demo()
Implementation Example: Music Genre Classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
def extract_music_features(audio, sr):
"""
Extract features for music genre classification
"""
features = []
# 1. MFCC statistics
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
features.extend(np.mean(mfcc, axis=1))
features.extend(np.std(mfcc, axis=1))
# 2. Chroma features
chroma = librosa.feature.chroma_stft(y=audio, sr=sr)
features.extend(np.mean(chroma, axis=1))
features.extend(np.std(chroma, axis=1))
# 3. Spectral features
spectral_centroids = librosa.feature.spectral_centroid(y=audio, sr=sr)[0]
features.append(np.mean(spectral_centroids))
features.append(np.std(spectral_centroids))
spectral_rolloff = librosa.feature.spectral_rolloff(y=audio, sr=sr)[0]
features.append(np.mean(spectral_rolloff))
features.append(np.std(spectral_rolloff))
# 4. Zero crossing rate
zcr = librosa.feature.zero_crossing_rate(audio)[0]
features.append(np.mean(zcr))
features.append(np.std(zcr))
# 5. Tempo
tempo, _ = librosa.beat.beat_track(y=audio, sr=sr)
features.append(float(np.atleast_1d(tempo)[0]))  # ensure a scalar even if an array is returned
# 6. Harmonic-percussive components
y_harmonic, y_percussive = librosa.effects.hpss(audio)
harmonic_ratio = np.sum(y_harmonic**2) / (np.sum(audio**2) + 1e-6)
features.append(harmonic_ratio)
return np.array(features)
# Genre classification demo
def music_genre_classification():
"""
Music genre classification demonstration
"""
np.random.seed(42)
# Generate virtual genre data
genres = ['Classical', 'Jazz', 'Rock', 'Electronic', 'Hip-Hop']
n_samples_per_genre = 30
X = []
y = []
for genre_id, genre in enumerate(genres):
# Generate data with genre-specific patterns
base_features = np.random.randn(51) + genre_id * 1.5
for _ in range(n_samples_per_genre):
sample = base_features + np.random.randn(51) * 0.4
X.append(sample)
y.append(genre_id)
X = np.array(X)
y = np.array(y)
# Train and evaluate model (cross-validation)
model = GradientBoostingClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(model, X, y, cv=5)
print("\n=== Music Genre Classification ===")
print(f"Genres: {genres}")
print(f"Number of samples: {len(X)}")
print(f"Feature dimensions: {X.shape[1]}")
print(f"\nCross-validation accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
# Train model on all data
model.fit(X, y)
# Feature importance (top 10)
feature_importance = model.feature_importances_
top_10_idx = np.argsort(feature_importance)[-10:]
plt.figure(figsize=(10, 6))
plt.barh(range(10), feature_importance[top_10_idx], alpha=0.7)
plt.xlabel('Importance')
plt.ylabel('Feature Index')
plt.title('Important Features (Top 10)', fontsize=14)
plt.yticks(range(10), top_10_idx)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return model, genres
model, genres = music_genre_classification()
5.5 End-to-End Speech AI Applications
Integrated Speech Processing System
Real-world applications combine multiple speech processing technologies.
Implementation Example: Integrated Audio Processing Pipeline
# Requirements:
# - Python 3.9+
# - librosa>=0.10.0
# - numpy>=1.24.0, <2.0.0
# - scipy>=1.10.0
import numpy as np
import librosa
from dataclasses import dataclass
from typing import Tuple, Optional
@dataclass
class AudioProcessingResult:
"""Stores audio processing results"""
is_verified: bool
speaker_similarity: float
emotion: Optional[str]
emotion_confidence: float
enhanced_audio: np.ndarray
processing_time: float
class IntegratedAudioPipeline:
"""
Integrated audio processing pipeline
Features:
1. Noise reduction
2. Speaker verification
3. Emotion recognition
"""
def __init__(self, verification_threshold=0.7):
self.verification_threshold = verification_threshold
self.enrolled_speakers = {}
# Initialize models (load in practice)
self.emotion_labels = ['neutral', 'happy', 'sad', 'angry', 'fear']
def preprocess_audio(self, audio, sr):
"""
Audio preprocessing
1. Resampling
2. Noise reduction
"""
# Resample to 16kHz
if sr != 16000:
audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
sr = 16000
# Noise reduction (simplified version)
try:
import noisereduce as nr
audio_enhanced = nr.reduce_noise(y=audio, sr=sr, stationary=True)
except ImportError:
# If noisereduce is not available, use the audio as is
audio_enhanced = audio
return audio_enhanced, sr
def extract_embedding(self, audio, sr):
"""
Extract speaker embedding vector
"""
# MFCC-based simple embedding
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
mfcc_delta = librosa.feature.delta(mfcc)
embedding = np.concatenate([
np.mean(mfcc, axis=1),
np.std(mfcc, axis=1),
np.mean(mfcc_delta, axis=1),
np.std(mfcc_delta, axis=1)
])
return embedding
def verify_speaker(self, audio, sr, speaker_id):
"""
Speaker verification
"""
if speaker_id not in self.enrolled_speakers:
return False, 0.0
# Extract embedding
test_embedding = self.extract_embedding(audio, sr)
enrolled_embedding = self.enrolled_speakers[speaker_id]
# Cosine similarity
from scipy.spatial.distance import cosine
similarity = 1 - cosine(test_embedding, enrolled_embedding)
is_verified = similarity > self.verification_threshold
return is_verified, similarity
def recognize_emotion(self, audio, sr):
"""
Emotion recognition
"""
# Feature extraction (simplified version)
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
chroma = librosa.feature.chroma_stft(y=audio, sr=sr)
features = np.concatenate([
np.mean(mfcc, axis=1),
np.std(mfcc, axis=1),
np.mean(chroma, axis=1)
])
# Simple emotion classification (use model in practice)
# Here we randomly select
emotion_idx = np.random.randint(0, len(self.emotion_labels))
confidence = np.random.uniform(0.7, 0.95)
return self.emotion_labels[emotion_idx], confidence
def process(self, audio, sr, speaker_id=None):
"""
Integrated processing pipeline
Parameters:
-----------
audio : np.ndarray
Input audio
sr : int
Sampling rate
speaker_id : str, optional
Speaker ID to verify
Returns:
--------
result : AudioProcessingResult
Processing result
"""
import time
start_time = time.time()
# 1. Preprocessing (noise reduction)
enhanced_audio, sr = self.preprocess_audio(audio, sr)
# 2. Speaker verification
is_verified = True
similarity = 1.0
if speaker_id is not None:
is_verified, similarity = self.verify_speaker(enhanced_audio, sr, speaker_id)
# 3. Emotion recognition (only if verification passed)
emotion = None
emotion_confidence = 0.0
if is_verified:
emotion, emotion_confidence = self.recognize_emotion(enhanced_audio, sr)
processing_time = time.time() - start_time
result = AudioProcessingResult(
is_verified=is_verified,
speaker_similarity=similarity,
emotion=emotion,
emotion_confidence=emotion_confidence,
enhanced_audio=enhanced_audio,
processing_time=processing_time
)
return result
def enroll_speaker(self, speaker_id, audio, sr):
"""
Enroll speaker
"""
audio_enhanced, sr = self.preprocess_audio(audio, sr)
embedding = self.extract_embedding(audio_enhanced, sr)
self.enrolled_speakers[speaker_id] = embedding
print(f"✓ Enrolled speaker '{speaker_id}'")
# Pipeline demonstration
print("=== Integrated Audio Processing Pipeline ===\n")
# Initialize pipeline
pipeline = IntegratedAudioPipeline(verification_threshold=0.7)
# Generate sample audio
sr = 16000
duration = 3.0
t = np.linspace(0, duration, int(sr * duration))
# Speaker A's voice
audio_speaker_a = np.sin(2 * np.pi * 300 * t) + 0.3 * np.random.randn(len(t))
# Speaker B's voice
audio_speaker_b = np.sin(2 * np.pi * 500 * t) + 0.3 * np.random.randn(len(t))
# Enroll speakers
pipeline.enroll_speaker("Alice", audio_speaker_a, sr)
pipeline.enroll_speaker("Bob", audio_speaker_b, sr)
print(f"\nEnrolled speakers: {list(pipeline.enrolled_speakers.keys())}\n")
# Test 1: Alice's genuine voice
print("【Test 1】Alice (genuine) voice")
test_audio_alice = audio_speaker_a + 0.1 * np.random.randn(len(audio_speaker_a))
result = pipeline.process(test_audio_alice, sr, speaker_id="Alice")
print(f" Speaker verification: {'✓ Accepted' if result.is_verified else '✗ Rejected'}")
print(f" Similarity: {result.speaker_similarity:.3f}")
print(f" Emotion: {result.emotion} (confidence: {result.emotion_confidence:.2%})")
print(f" Processing time: {result.processing_time*1000:.1f} ms")
# Test 2: Alice impersonation (Bob's voice)
print("\n【Test 2】Alice (impersonation: Bob) voice")
result = pipeline.process(audio_speaker_b, sr, speaker_id="Alice")
print(f" Speaker verification: {'✓ Accepted' if result.is_verified else '✗ Rejected'}")
print(f" Similarity: {result.speaker_similarity:.3f}")
print(f" Emotion: {result.emotion if result.emotion else 'N/A'}")
print(f" Processing time: {result.processing_time*1000:.1f} ms")
# Test 3: Bob's genuine voice
print("\n【Test 3】Bob (genuine) voice")
test_audio_bob = audio_speaker_b + 0.1 * np.random.randn(len(audio_speaker_b))
result = pipeline.process(test_audio_bob, sr, speaker_id="Bob")
print(f" Speaker verification: {'✓ Accepted' if result.is_verified else '✗ Rejected'}")
print(f" Similarity: {result.speaker_similarity:.3f}")
print(f" Emotion: {result.emotion} (confidence: {result.emotion_confidence:.2%})")
print(f" Processing time: {result.processing_time*1000:.1f} ms")
print("\n" + "="*50)
print("Integrated pipeline processing complete")
print("="*50)
Considerations for Real-Time Processing
| Element | Challenge | Countermeasure |
|---|---|---|
| Latency | Processing delay affects user experience | Lightweight models, frame-wise processing |
| Memory | Constraints on embedded devices | Quantization, pruning |
| Accuracy | Trade-off between real-time and accuracy | Adaptive processing, staged analysis |
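A simple way to make the latency trade-off concrete is to process audio in short frames and time each one: as long as the per-frame processing time stays below the frame duration, the pipeline keeps up in real time. In this sketch, process_frame is a hypothetical placeholder for whatever denoising, embedding, or classification step the pipeline runs.

import time
import numpy as np

sr = 16000
frame_len = 1600                      # 100 ms frames
audio = np.random.randn(sr * 3)       # 3 seconds of dummy audio

def process_frame(frame):
    """Hypothetical placeholder for per-frame denoising / embedding / classification."""
    return float(np.sqrt(np.mean(frame ** 2)))  # e.g. frame energy

latencies_ms = []
for start in range(0, len(audio) - frame_len + 1, frame_len):
    t0 = time.perf_counter()
    _ = process_frame(audio[start:start + frame_len])
    latencies_ms.append((time.perf_counter() - t0) * 1000)

frame_budget_ms = frame_len / sr * 1000
print(f"Frame budget: {frame_budget_ms:.1f} ms, "
      f"mean: {np.mean(latencies_ms):.3f} ms, max: {np.max(latencies_ms):.3f} ms")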
5.6 Chapter Summary
What We Learned
Speaker Recognition and Verification
- Differences between speaker identification and verification
- Speaker embeddings with i-vector and x-vector
- Verification systems based on similarity calculation
Speech Emotion Recognition
- Emotion estimation from prosodic and acoustic features
- Datasets like RAVDESS and IEMOCAP
- Deep learning approaches with CNN/LSTM
Speech Enhancement and Noise Reduction
- Spectral subtraction and Wiener filter
- Enhancement with deep learning
- Using the noisereduce library
Music Information Processing
- Beat tracking and tempo estimation
- Chord recognition and genre classification
- Musical feature extraction
Integrated Systems
- Combining multiple technologies
- End-to-end pipelines
- Real-time processing optimization
Real-World Applications
| Domain | Applications |
|---|---|
| Security | Voice authentication, fraud detection |
| Healthcare | Emotion monitoring, diagnostic support |
| Call Centers | Customer emotion analysis, quality improvement |
| Entertainment | Music recommendation, auto DJ, karaoke |
| Call Quality | Noise cancellation, speech enhancement |
For Further Learning
- Datasets: VoxCeleb, LibriSpeech, GTZAN, MusicNet
- Libraries: pyannote.audio, speechbrain, essentia
- Latest Methods: WavLM, Conformer, U-Net for audio
- Evaluation Metrics: EER (Equal Error Rate), DER (Diarization Error Rate); a short EER computation sketch follows below
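EER, listed above, is the operating point where the false acceptance rate equals the false rejection rate. It can be approximated from verification scores with scikit-learn's ROC utilities; the genuine and impostor scores below are synthetic, purely for illustration.

import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
# Synthetic verification scores: genuine trials score higher than impostor trials
genuine = rng.normal(0.8, 0.1, 500)
impostor = rng.normal(0.4, 0.1, 500)
scores = np.concatenate([genuine, impostor])
labels = np.concatenate([np.ones(500), np.zeros(500)])

fpr, tpr, thresholds = roc_curve(labels, scores)
fnr = 1 - tpr
idx = np.nanargmin(np.abs(fpr - fnr))   # point where FPR ~= FNR
eer = (fpr[idx] + fnr[idx]) / 2
print(f"EER: {eer:.3%} at threshold {thresholds[idx]:.3f}")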
Exercises
Problem 1 (Difficulty: easy)
Explain the differences between Speaker Identification and Speaker Verification, and provide application examples for each.
Sample Answer
Answer:
Speaker Identification:
- Definition: Task of determining which speaker from multiple enrolled speakers an input voice belongs to
- Question: "Whose voice is this?"
- Classification: N-way classification problem to select one person from N people
- Application examples:
- Speaker recognition in meetings (automatic minute taking)
- Speaker labeling in TV programs
- User identification in voice assistants
Speaker Verification:
- Definition: Task of determining whether an input voice belongs to a specific claimed speaker
- Question: "Is this voice from Mr. Yamada?"
- Classification: Binary classification problem (Yes/No)
- Application examples:
- Voice-based authentication (smartphone unlocking)
- Identity verification in telephone banking
- Access control for security systems
Key Differences:
| Item | Speaker Identification | Speaker Verification |
|---|---|---|
| Problem Setting | N-way classification | Binary classification |
| Output | Speaker ID | Genuine/Impostor |
| Enrolled Speakers | Multiple required | Can work with only one |
| Difficulty | Depends on number of speakers | Threshold setting is crucial |
Problem 2 (Difficulty: medium)
In speech emotion recognition, explain the relationship between prosodic features (pitch, energy, speaking rate) and each emotion (joy, sadness, anger, fear).
Sample Answer
Answer:
Relationship Between Emotions and Prosodic Features:
| Emotion | Pitch | Energy | Speaking Rate | Other Features |
|---|---|---|---|---|
| Joy | High, high variability | High | Fast | Clear articulation, wide pitch range |
| Sadness | Low, monotonous | Low | Slow | Long pauses, low energy variability |
| Anger | High, emphasized | High | Fast or slow | Strong stress, wide spectral bandwidth |
| Fear | High, unstable | Medium to high | Fast | Voice tremor, high pitch variability |
| Neutral | Medium, stable | Medium | Normal | No characteristic patterns |
Detailed Explanation:
Pitch (Fundamental Frequency):
- High arousal emotions (joy, anger, fear) → Higher pitch
- Low arousal emotions (sadness) → Lower pitch
- Emotion intensity correlates with pitch variability
Energy (Volume):
- Positive emotions (joy), aggressive emotions (anger) → High energy
- Negative passive emotions (sadness) → Low energy
- Measured by RMS (root mean square)
Speaking Rate:
- Excited states (joy, fear) → Fast
- Depressed state (sadness) → Slow
- Anger has high individual variability (both fast and slow)
Implementation Considerations:
- Speaker normalization is important due to large individual differences
- Consider differences in expression due to cultural background
- Use combinations of multiple features
- Context (conversation flow) is also an important cue
Problem 3 (Difficulty: medium)
Explain the principle of spectral subtraction for noise reduction, and describe its advantages and disadvantages.
Sample Answer
Answer:
Principle of Spectral Subtraction:
Basic Concept:
- Noisy speech = Clean speech + Noise
- Estimate and subtract noise spectrum in frequency domain
Processing Steps:
- Apply STFT (Short-Time Fourier Transform) to noisy speech
- Estimate noise spectrum from silent portions
- Subtract noise spectrum in each frequency bin
- Clip negative values to 0 (half-wave rectification)
- Restore phase and apply inverse STFT
Mathematical Expression:
$$ |\hat{S}(\omega, t)| = \max(|Y(\omega, t)| - \alpha |\hat{N}(\omega)|, \beta |Y(\omega, t)|) $$
- $Y(\omega, t)$: Noisy speech spectrum
- $\hat{N}(\omega)$: Estimated noise spectrum
- $\alpha$: Subtraction coefficient (typically 1-3)
- $\beta$: Spectral floor (typically 0.01-0.1)
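A direct transcription of this formula, with the subtraction coefficient α and spectral floor β exposed as parameters (the magnitude spectrogram is assumed to come from an STFT as in Section 5.3; the tiny matrices at the end are only a numeric check):

import numpy as np

def spectral_subtraction_floor(magnitude, noise_spectrum, alpha=2.0, beta=0.05):
    """Spectral subtraction with a spectral floor instead of hard clipping to zero.

    magnitude      : (freq_bins, frames) noisy magnitude spectrogram |Y|
    noise_spectrum : (freq_bins, 1) estimated noise magnitude |N_hat|
    """
    subtracted = magnitude - alpha * noise_spectrum
    floor = beta * magnitude
    return np.maximum(subtracted, floor)

# Tiny numeric check
Y = np.array([[1.0, 0.8], [0.3, 0.2]])
N = np.array([[0.2], [0.25]])
print(spectral_subtraction_floor(Y, N, alpha=2.0, beta=0.1))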
Advantages:
- ✓ Simple implementation
- ✓ Low computational cost
- ✓ Real-time processing capable
- ✓ Effective for stationary noise
- ✓ Easy parameter tuning
Disadvantages:
- ✗ Musical noise generation
- Subtraction leaves isolated spectral peaks that are heard as warbling, tone-like artifacts
- Can be perceptually unpleasant
- ✗ Weak against non-stationary noise
- Difficult to estimate time-varying noise
- Limited effectiveness for impulsive noise
- ✗ Speech component distortion
- Excessive subtraction degrades speech quality
- Particularly noticeable in low SNR environments
- ✗ Dependent on noise estimation accuracy
- Difficult to estimate without silent portions
- Performance degrades when noise characteristics change
Improvement Techniques:
- Multi-band spectral subtraction: Adjust subtraction coefficient per frequency band
- Nonlinear spectral subtraction: Prevent over-subtraction
- Post-processing filter: Reduce musical noise
- Adaptive noise estimation: Update avoiding speech segments
Problem 4 (Difficulty: hard)
Explain the architecture of the x-vector network and describe its advantages compared to traditional i-vector. Also explain the role of the Statistics Pooling layer.
Sample Answer
Answer:
x-vector Network Architecture:
Overall Structure:
- Input: Speech feature sequence (MFCC, filterbank, etc.)
- TDNN (Time Delay Neural Network) layers
- Statistics Pooling layer
- Segment-level fully connected layers
- Output: Fixed-length embedding vector (typically 512 dimensions)
TDNN Layers:
- 1D convolutions with different delays (dilation) in time axis
- Capture contexts at different time scales
- Typical configuration:
- Layer 1: kernel=5, dilation=1
- Layer 2: kernel=3, dilation=2
- Layer 3: kernel=3, dilation=3
- Layer 4-5: kernel=1, dilation=1
Statistics Pooling Layer:
- Important layer that converts variable-length input to fixed-length output
- Computes statistics along time axis:
$$
\text{output} = [\mu, \sigma]
$$
- $\mu = \frac{1}{T}\sum_{t=1}^{T} h_t$ (mean)
- $\sigma = \sqrt{\frac{1}{T}\sum_{t=1}^{T} (h_t - \mu)^2}$ (standard deviation)
- Input: (batch, features, time)
- Output: (batch, features * 2)
Segment-level Layers:
- Fully connected layers after Statistics Pooling
- Generate speaker embeddings
- Trained on classification task, embeddings extracted
Comparison of i-vector vs x-vector:
| Item | i-vector | x-vector |
|---|---|---|
| Approach | Statistical (GMM-UBM) | Deep Learning (DNN) |
| Feature Extraction | Baum-Welch statistics | TDNN (convolution) |
| Training Data Amount | Works with small amount | Requires large amount |
| Computational Cost | Low | High (during training) |
| Performance | Medium | High |
| Short Duration Speech | Somewhat weak | Robust |
| Noise Robustness | Medium | High |
| Implementation Difficulty | High (UBM training) | Medium (framework usage) |
Advantages of x-vector:
High Discrimination Performance:
- Deep learning learns complex speaker characteristics
- Significant performance improvement with large-scale data training
Robustness to Short Duration Speech:
- High accuracy even with 2-3 seconds of speech
- i-vector prefers long duration speech (30+ seconds)
Noise Robustness:
- Improved robustness through training data augmentation
- Statistics Pooling absorbs temporal variations
End-to-End Training:
- Simultaneous optimization from feature extraction to classification
- i-vector requires separate UBM training
Easy Transfer Learning:
- Fine-tune pre-trained models
- Can adapt with small amount of data
Role of Statistics Pooling:
Variable to Fixed-Length Conversion:
- Converts different length speech to same dimensional embedding
- Allows classifier to receive consistent input
Acquiring Time Invariance:
- Mean and standard deviation are independent of temporal order
- Summarizes speaker characteristics along time axis
Utilizing Second-Order Statistics:
- Uses not only mean (first-order) but also standard deviation (second-order)
- Enables richer speaker representation
Similarity to i-vector:
- i-vector also uses zeroth and first-order statistics
- x-vector computes statistics of deep features
Implementation Example (Statistics Pooling):
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: Implementation Example (Statistics Pooling):
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""
import torch
import torch.nn as nn
class StatisticsPooling(nn.Module):
def forward(self, x):
# x: (batch, features, time)
mean = torch.mean(x, dim=2) # (batch, features)
std = torch.std(x, dim=2) # (batch, features)
stats = torch.cat([mean, std], dim=1) # (batch, features*2)
return stats
Problem 5 (Difficulty: hard)
List the main considerations when designing an integrated speech processing pipeline, and explain optimization techniques to achieve real-time processing.
Sample Answer
Answer:
1. Main Considerations:
A. Functional Requirements
- Processing Tasks:
- Noise reduction, speaker recognition, emotion recognition, speech recognition, etc.
- Task priorities and dependencies
- Accuracy Requirements:
- Acceptable error rates for each application
- Balance between security and usability
- Target Scenarios:
- Quiet environment vs noisy environment
- Clear audio vs quality degradation
B. Non-Functional Requirements
- Latency:
- Real-time: < 100ms (telephony)
- Near real-time: < 500ms (assistant)
- Batch: > 1s (analysis)
- Throughput:
- Number of streams that can be processed simultaneously
- CPU/GPU/memory resource constraints
- Scalability:
- Handling increased user numbers
- Horizontal/vertical scaling
- Reliability:
- Error handling
- Fallback mechanisms
C. System Design
- Modularization:
- Each process as independent module
- Improved reusability and maintainability
- Pipeline Configuration:
- Serial vs parallel processing
- Conditional branching (e.g., skip subsequent steps if speaker verification fails)
- Data Flow:
- Buffer management
- Streaming vs batch
2. Real-Time Processing Optimization Techniques:
A. Model-Level Optimization
Model Compression:
- Quantization:
import torch

# FP32 → INT8 dynamic quantization (model: a trained nn.Module)
model_int8 = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
# Memory: 1/4, Speed: 2-4x
- Pruning:

import torch.nn.utils.prune as prune

# Remove 50% of weights
prune.l1_unstructured(module, name='weight', amount=0.5)
- Knowledge Distillation:
- Transfer knowledge from large model to small model
- Reduce size while maintaining accuracy (see the distillation-loss sketch below)
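A minimal sketch of the distillation loss behind those bullets: the student is trained on a blend of hard-label cross-entropy and a temperature-softened KL term against the teacher's logits. The tensors, temperature, and mixing weight here are illustrative values, not a prescribed recipe.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with a temperature-softened teacher/student KL term."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1 - alpha) * soft

# Toy example with random logits for a 5-class task
student = torch.randn(8, 5, requires_grad=True)
teacher = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
print(distillation_loss(student, teacher, labels))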
Choosing Lightweight Architectures:
- MobileNet family: Depthwise Separable Convolution
- SqueezeNet: Compression with Fire Module
- EfficientNet: Balance between accuracy and size
Efficient Operations:
- Convolution optimization (Winograd, FFT)
- Batching matrix operations
- Utilizing SIMD instructions
B. System-Level Optimization
Frame-wise Processing:
frame_length = 512  # About 23ms @ 22kHz
hop_length = 256    # About 12ms @ 22kHz

# Streaming processing
buffer = []
for frame in audio_stream:
    buffer.append(frame)
    if len(buffer) >= frame_length:
        process_frame(buffer[:frame_length])
        buffer = buffer[hop_length:]

Parallel Processing:
- Multi-threading:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [
        executor.submit(noise_reduction, audio),
        executor.submit(feature_extraction, audio)
    ]
    results = [f.result() for f in futures]

- GPU Utilization:

# Maximize GPU efficiency with batch processing
batch_audio = torch.stack(audio_list).cuda()
with torch.no_grad():
    embeddings = model(batch_audio)
Caching:
- Cache speaker embeddings
- Reuse intermediate features
- Pre-load models
Adaptive Processing:
- Confidence-based skipping:
if speaker_confidence > 0.95:
    # Skip detailed processing if high confidence
    return quick_result
else:
    # Detailed analysis if low confidence
    return detailed_analysis()

- Staged processing (Early Exit)
Memory Management:
- Using circular buffers
- Object pool pattern
- Explicit memory deallocation
C. Algorithm-Level Optimization
Online Processing:
- Streaming MFCC computation
- Online normalization
- Incremental statistics update (see the sketch below)
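One standard way to do the incremental update mentioned above is Welford's algorithm, which maintains a running mean and variance (for example, for online feature normalization) without storing past frames. A compact sketch:

import numpy as np

class RunningStats:
    """Welford's online mean/variance for streaming feature normalization."""
    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return np.sqrt(self.m2 / max(self.n - 1, 1))

# Example: accumulate statistics over streaming 20-dimensional feature frames
rng = np.random.default_rng(0)
stats = RunningStats(dim=20)
for _ in range(100):
    stats.update(rng.normal(loc=3.0, scale=2.0, size=20))
print(stats.mean[:3], stats.std()[:3])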
Approximate Algorithms:
- FFT approximation (NFFT)
- Approximate nearest neighbor search (ANN)
- Low-rank approximation
Feature Selection:
- Prioritize low computational cost features
- Remove redundant features
- Dimensionality reduction with PCA/LDA
3. Implementation Example: Optimized Pipeline:
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - torch>=2.0.0, <2.3.0
"""
Example: 3. Implementation Example: Optimized Pipeline:
Purpose: Demonstrate optimization techniques
Target: Advanced
Execution time: 10-30 seconds
Dependencies: None
"""
import torch
import numpy as np
from queue import Queue
from threading import Thread
class OptimizedAudioPipeline:
def __init__(self):
# Model quantization
self.model = torch.quantization.quantize_dynamic(
load_model(), {torch.nn.Linear}, dtype=torch.qint8
)
self.model.eval()
# Cache
self.speaker_cache = {}
# Stream processing buffer
self.audio_buffer = Queue(maxsize=100)
# Worker threads
self.workers = [
Thread(target=self._process_worker)
for _ in range(4)
]
for w in self.workers:
w.start()
def process_stream(self, audio_chunk):
"""Streaming processing"""
# Add non-blocking
if not self.audio_buffer.full():
self.audio_buffer.put(audio_chunk)
def _process_worker(self):
"""Worker thread processing"""
while True:
chunk = self.audio_buffer.get()
# 1. Fast noise reduction
clean_chunk = self._fast_denoise(chunk)
# 2. Feature extraction (GPU)
with torch.no_grad():
features = self._extract_features(clean_chunk)
# 3. Cache check
speaker_id = self._identify_speaker_cached(features)
# 4. Return results
self._emit_result(speaker_id, features)
def _fast_denoise(self, audio):
"""Lightweight noise reduction"""
# Spectral subtraction (minimal FFT)
return spectral_subtract_fast(audio)
def _identify_speaker_cached(self, features):
"""Speaker identification with cache"""
# Feature hash
feat_hash = hash(features.tobytes())
if feat_hash in self.speaker_cache:
return self.speaker_cache[feat_hash]
# New computation
speaker_id = self.model(features)
self.speaker_cache[feat_hash] = speaker_id
return speaker_id
# Usage example
pipeline = OptimizedAudioPipeline()
# Real-time processing
for chunk in audio_stream:
pipeline.process_stream(chunk)
4. Performance Metrics and Monitoring:
- Latency: Time from input to output
- Throughput: Number of processes per unit time
- CPU/GPU Usage: Resource efficiency
- Memory Usage: Peak and baseline
- Accuracy: Measuring degradation from optimization
Summary:
Achieving real-time processing requires optimization at model, system, and algorithm levels. Particularly important are:
- Compression (quantization, pruning)
- Parallel processing (multi-threading, GPU)
- Streaming processing (frame-wise)
- Caching (computation reuse)
- Adaptive processing (context-aware optimization)
References
- Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018). X-vectors: Robust DNN embeddings for speaker recognition. ICASSP 2018.
- Livingstone, S. R., & Russo, F. A. (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). PLOS ONE.
- Loizou, P. C. (2013). Speech Enhancement: Theory and Practice (2nd ed.). CRC Press.
- Müller, M. (2015). Fundamentals of Music Processing. Springer.
- Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., & Ouellet, P. (2011). Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing.
- Schuller, B., Steidl, S., & Batliner, A. (2009). The INTERSPEECH 2009 emotion challenge. INTERSPEECH 2009.
- Boll, S. F. (1979). Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing.
- Tzanetakis, G., & Cook, P. (2002). Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing.