
Chapter 1: Fundamentals of Audio Signal Processing

From Digital Audio to Spectrograms - Understanding and Processing Acoustic Signals

πŸ“– Reading Time: 35-40 minutes πŸ“Š Difficulty: Beginner to Intermediate πŸ’» Code Examples: 10 πŸ“ Exercises: 5

This chapter covers the fundamentals of audio signal processing. You will learn the core concepts of digital audio (sampling and quantization), audio analysis methods in the time and frequency domains, and the differences between spectrograms and mel-spectrograms.

Learning Objectives

By completing this chapter, you will be able to:

  • Explain sampling rate, quantization, and bit depth, and load audio files with librosa
  • Analyze audio in the time domain using amplitude, energy, RMS, and the zero crossing rate
  • Convert audio to the frequency domain and interpret spectrograms and mel-spectrograms
  • Extract acoustic features such as MFCC, chroma, and spectral features
  • Preprocess audio data with noise reduction, normalization, trimming/padding, and data augmentation

1.1 Fundamentals of Digital Audio

What is an Audio Signal?

An Audio Signal is the conversion of air vibrations into electrical signals or numerical data. To handle audio in machine learning, we need to digitize continuous analog signals.

Understanding that "audio is a function of time" is the first step in audio processing.

The Digitization Process

graph LR
    A[Analog Audio<br>Continuous Signal] --> B[Sampling<br>Discretization]
    B --> C[Quantization<br>Numerical Conversion]
    C --> D[Digital Audio<br>Numerical Array]
    D --> E[Audio File<br>WAV/MP3/FLAC]
    style A fill:#ffebee
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style D fill:#e3f2fd
    style E fill:#e8f5e9

1.1.1 Sampling Rate

The Sampling Rate indicates how many times per second audio is recorded, measured in Hz (Hertz).

$$ f_s = \frac{\text{Number of Samples}}{\text{Seconds}} $$

Nyquist Theorem: To accurately reproduce a signal, the sampling rate must be at least twice the highest frequency.

$$ f_s \geq 2 \times f_{\max} $$

Common Sampling Rates

| Sampling Rate | Application | Frequency Range |
|---------------|-------------|-----------------|
| 8,000 Hz | Telephone audio | 0-4 kHz |
| 16,000 Hz | Speech recognition | 0-8 kHz |
| 22,050 Hz | Low-quality music | 0-11 kHz |
| 44,100 Hz | CD-quality music | 0-22 kHz |
| 48,000 Hz | Professional audio | 0-24 kHz |
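
To make the Nyquist condition concrete, the short sketch below (an illustrative addition) samples a 5 kHz tone at 44.1 kHz and at 8 kHz. At 8 kHz the Nyquist limit is 4 kHz, so the 5 kHz tone shows up as an aliased component near 3 kHz.

# Minimal sketch: the same 5 kHz tone at two sampling rates.
import numpy as np

f_tone = 5000  # tone frequency in Hz
for fs in (44100, 8000):
    n = np.arange(fs)                          # one second of samples
    x = np.sin(2 * np.pi * f_tone * n / fs)
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    peak = freqs[np.argmax(spectrum)]
    print(f"fs = {fs:>5} Hz -> strongest component at {peak:.0f} Hz")

# Expected: 5000 Hz at fs=44100, but ~3000 Hz (aliased) at fs=8000,
# because 5000 Hz exceeds the Nyquist limit of 8000/2 = 4000 Hz.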

1.1.2 Quantization and Bit Depth

Quantization is the process of converting continuous amplitude values into discrete numerical values. Bit Depth indicates how many bits are used to represent each sample.

$$ \text{Number of Representable Levels} = 2^{\text{Bit Depth}} $$

| Bit Depth | Number of Levels | Dynamic Range | Application |
|-----------|------------------|---------------|-------------|
| 8-bit | 256 | 48 dB | Low quality |
| 16-bit | 65,536 | 96 dB | CD quality |
| 24-bit | 16,777,216 | 144 dB | Professional recording |
| 32-bit float | - | 1,528 dB | Processing |
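
The table values follow directly from the definitions: every additional bit doubles the number of levels and adds roughly 6.02 dB of dynamic range (20 x log10(2) per bit). A quick check (illustrative sketch):

# Sketch: levels and approximate dynamic range per bit depth.
import numpy as np

for bits in (8, 16, 24):
    levels = 2 ** bits
    dynamic_range_db = 20 * np.log10(levels)   # about 6.02 dB per bit
    print(f"{bits:>2}-bit: {levels:>10,} levels, ~{dynamic_range_db:.0f} dB")

# 8-bit ~48 dB, 16-bit ~96 dB, 24-bit ~144 dB, matching the table above.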

1.1.3 Audio File Formats

| Format | Compression | Quality | Application |
|--------|-------------|---------|-------------|
| WAV | Uncompressed | Highest | Recording/Processing |
| FLAC | Lossless | Highest | Archival |
| MP3 | Lossy | Medium-High | Distribution/Playback |
| AAC | Lossy | High | Streaming |
| OGG | Lossy | Medium-High | Open format |
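
In practice the format details are rarely handled by hand: librosa reads most common formats, and writing is a one-liner per file with the soundfile package (an assumption here; it is not listed in this chapter's requirement blocks). Lossy formats such as MP3 and AAC usually require an external encoder (e.g., ffmpeg) instead.

# Sketch, assuming the `soundfile` package is installed:
# write the same audio out as uncompressed WAV and as lossless FLAC.
import librosa
import soundfile as sf

y, sr = librosa.load("sample.wav", sr=None)    # keep the original sampling rate
sf.write("sample_copy.wav", y, sr)             # uncompressed PCM WAV
sf.write("sample_copy.flac", y, sr)            # lossless FLAC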

1.1.4 Basics of librosa

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: 1.1.4 Basics of librosa

Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

# Load audio file
# librosa defaults to mono and resamples to 22050Hz
audio_path = "sample.wav"
y, sr = librosa.load(audio_path)

print("=== Audio File Information ===")
print(f"Sampling Rate: {sr} Hz")
print(f"Number of Samples: {len(y)}")
print(f"Duration: {len(y) / sr:.2f} seconds")
print(f"Data Type: {y.dtype}")
print(f"Amplitude Range: [{y.min():.3f}, {y.max():.3f}]")

# Load with original sampling rate
y_orig, sr_orig = librosa.load(audio_path, sr=None)
print(f"\nOriginal Sampling Rate: {sr_orig} Hz")

# Load with specific sampling rate
y_16k, sr_16k = librosa.load(audio_path, sr=16000)
print(f"16kHz Resampled: {len(y_16k)} samples")

# Load as stereo
y_stereo, sr = librosa.load(audio_path, mono=False)
print(f"Stereo: shape = {y_stereo.shape}")

Output:

=== Audio File Information ===
Sampling Rate: 22050 Hz
Number of Samples: 66150
Duration: 3.00 seconds
Data Type: float32
Amplitude Range: [-0.523, 0.487]

Original Sampling Rate: 44100 Hz
16kHz Resampled: 48000 samples
Stereo: shape = (2, 66150)

1.2 Time Domain Processing

What is Time Domain?

The Time Domain treats audio signals as a function of time. The horizontal axis represents time, and the vertical axis represents amplitude.

1.2.1 Amplitude and Energy

Amplitude represents the loudness of sound. Energy is the sum of squared amplitudes.

$$ E = \sum_{n=1}^{N} |x[n]|^2 $$

RMS (Root Mean Square) represents average amplitude:

$$ \text{RMS} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} x[n]^2} $$

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: Amplitude, energy, and RMS calculation

Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import librosa
import numpy as np
import matplotlib.pyplot as plt

# Generate sample audio (440Hz sine wave + noise)
sr = 22050
duration = 2.0
t = np.linspace(0, duration, int(sr * duration))
y = 0.5 * np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(len(t))

print("=== Amplitude and Energy ===")

# Amplitude statistics
print(f"Maximum Amplitude: {np.max(np.abs(y)):.3f}")
print(f"Mean Amplitude: {np.mean(np.abs(y)):.3f}")

# Energy
energy = np.sum(y**2)
print(f"Total Energy: {energy:.2f}")

# RMS
rms = np.sqrt(np.mean(y**2))
print(f"RMS: {rms:.3f}")

# Frame-wise RMS calculation
frame_length = 2048
hop_length = 512
rms_frames = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]

print(f"\nNumber of Frames: {len(rms_frames)}")
print(f"RMS Range: [{rms_frames.min():.3f}, {rms_frames.max():.3f}]")

# Visualization
fig, axes = plt.subplots(2, 1, figsize=(12, 6))

# Waveform
axes[0].plot(t, y, linewidth=0.5)
axes[0].set_xlabel('Time (seconds)')
axes[0].set_ylabel('Amplitude')
axes[0].set_title('Audio Waveform')
axes[0].grid(True, alpha=0.3)

# RMS
times = librosa.frames_to_time(np.arange(len(rms_frames)), sr=sr, hop_length=hop_length)
axes[1].plot(times, rms_frames)
axes[1].set_xlabel('Time (seconds)')
axes[1].set_ylabel('RMS')
axes[1].set_title('RMS Energy')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('amplitude_rms.png', dpi=150, bbox_inches='tight')
print("\nFigure saved: amplitude_rms.png")

1.2.2 Zero Crossing Rate

Zero Crossing Rate (ZCR) represents the frequency at which a signal crosses zero. It's useful for distinguishing voice types (voiced/unvoiced) and musical instruments.

$$ \text{ZCR} = \frac{1}{T-1} \sum_{t=1}^{T-1} \mathbb{1}_{x[t] \cdot x[t-1] < 0} $$
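
The indicator sum above can be written directly in NumPy by counting sign changes between adjacent samples; the sketch below (illustrative, using a synthetic tone) compares the manual count with the value expected for a pure sine.

# Sketch: ZCR computed straight from its definition.
import numpy as np

sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)          # 440 Hz sine, one second

zcr = np.mean(x[:-1] * x[1:] < 0)        # fraction of samples where x[t]*x[t-1] < 0
print(f"Manual ZCR: {zcr:.4f}")          # a 440 Hz sine crosses zero ~880 times per second
print(f"Expected:   {2 * 440 / sr:.4f}")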

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: Zero crossing rate (ZCR) calculation

Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

# Load sample audio
y, sr = librosa.load('sample.wav', duration=5.0)

# Calculate Zero Crossing Rate
zcr = librosa.feature.zero_crossing_rate(y, frame_length=2048, hop_length=512)[0]

print("=== Zero Crossing Rate ===")
print(f"ZCR Mean: {np.mean(zcr):.4f}")
print(f"ZCR Range: [{zcr.min():.4f}, {zcr.max():.4f}]")

# Time axis
times = librosa.frames_to_time(np.arange(len(zcr)), sr=sr, hop_length=512)

# Visualization
fig, axes = plt.subplots(2, 1, figsize=(12, 6))

# Waveform
librosa.display.waveshow(y, sr=sr, ax=axes[0])
axes[0].set_title('Audio Waveform')
axes[0].set_ylabel('Amplitude')

# ZCR
axes[1].plot(times, zcr)
axes[1].set_xlabel('Time (seconds)')
axes[1].set_ylabel('ZCR')
axes[1].set_title('Zero Crossing Rate')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('zcr_analysis.png', dpi=150, bbox_inches='tight')
print("Figure saved: zcr_analysis.png")

# Interpretation: High ZCR = unvoiced sound/noise, Low ZCR = voiced sound/music
print("\nInterpretation:")
print("- High ZCR (>0.1): Unvoiced sounds, white noise, cymbals, etc.")
print("- Low ZCR (<0.05): Voiced sounds, music, low-frequency components")

1.2.3 Audio Loading and Visualization

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: 1.2.3 Audio Loading and Visualization

Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load audio file
audio_path = 'sample.wav'
y, sr = librosa.load(audio_path, sr=None)

print("=== Audio Data Visualization ===")
print(f"Sampling Rate: {sr} Hz")
print(f"Duration: {len(y)/sr:.2f} seconds")

# Visualization
fig, axes = plt.subplots(3, 1, figsize=(14, 10))

# 1. Waveform display
librosa.display.waveshow(y, sr=sr, ax=axes[0])
axes[0].set_title('Audio Waveform (Full)', fontsize=14)
axes[0].set_ylabel('Amplitude')
axes[0].grid(True, alpha=0.3)

# 2. Zoomed waveform (first 0.1 seconds)
n_samples = int(0.1 * sr)
time_zoom = np.arange(n_samples) / sr
axes[1].plot(time_zoom, y[:n_samples])
axes[1].set_title('Audio Waveform (Zoomed: First 0.1 seconds)', fontsize=14)
axes[1].set_xlabel('Time (seconds)')
axes[1].set_ylabel('Amplitude')
axes[1].grid(True, alpha=0.3)

# 3. Envelope
hop_length = 512
rms = librosa.feature.rms(y=y, hop_length=hop_length)[0]
times = librosa.frames_to_time(np.arange(len(rms)), sr=sr, hop_length=hop_length)

axes[2].plot(times, rms, label='RMS Energy', color='red', linewidth=2)
axes[2].set_title('Audio Envelope (RMS)', fontsize=14)
axes[2].set_xlabel('Time (seconds)')
axes[2].set_ylabel('RMS Amplitude')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('audio_visualization.png', dpi=150, bbox_inches='tight')
print("Figure saved: audio_visualization.png")

# Statistics
print("\nAudio Statistics:")
print(f"  Maximum Amplitude: {np.max(np.abs(y)):.3f}")
print(f"  RMS: {np.sqrt(np.mean(y**2)):.3f}")
print(f"  Dynamic Range: {20*np.log10(np.max(np.abs(y))/np.min(np.abs(y[y!=0]))):.1f} dB")

1.3 Frequency Domain Processing

What is Frequency Domain?

The Frequency Domain represents audio signals in terms of their frequency components. The Fourier Transform converts from time domain to frequency domain.

1.3.1 Fourier Transform (FFT)

The Discrete Fourier Transform (DFT) converts time domain signals to frequency domain:

$$ X[k] = \sum_{n=0}^{N-1} x[n] \cdot e^{-i 2\pi k n / N} $$

The Fast Fourier Transform (FFT) is an algorithm for computing DFT efficiently.
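
To connect the formula with the library call used below, here is a direct O(N^2) implementation of the summation; it should agree with np.fft.fft up to floating-point error (illustrative sketch).

# Sketch: naive DFT straight from the definition, checked against np.fft.fft.
import numpy as np

def naive_dft(x):
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    # X[k] = sum_n x[n] * exp(-i 2 pi k n / N)
    return np.exp(-2j * np.pi * k * n / N) @ x

x = np.random.randn(256)
print(np.allclose(naive_dft(x), np.fft.fft(x)))   # True (up to rounding error)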

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: Fourier transform (FFT) of a signal with multiple sine components

Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import librosa
import numpy as np
import matplotlib.pyplot as plt

# Sample audio (multiple frequency sine waves)
sr = 22050
duration = 1.0
t = np.linspace(0, duration, int(sr * duration))

# 440Hz (A4) + 880Hz (A5) + 1320Hz (E6)
y = (0.5 * np.sin(2 * np.pi * 440 * t) +
     0.3 * np.sin(2 * np.pi * 880 * t) +
     0.2 * np.sin(2 * np.pi * 1320 * t))

print("=== Fourier Transform (FFT) ===")

# FFT computation
n_fft = 2048
fft_result = np.fft.fft(y[:n_fft])
magnitude = np.abs(fft_result)
phase = np.angle(fft_result)

# Frequency axis
frequencies = np.fft.fftfreq(n_fft, 1/sr)

# Positive frequencies only (due to symmetry)
positive_frequencies = frequencies[:n_fft//2]
positive_magnitude = magnitude[:n_fft//2]

print(f"FFT Size: {n_fft}")
print(f"Frequency Resolution: {sr/n_fft:.2f} Hz")

# Peak frequency detection
peak_indices = np.argsort(positive_magnitude)[-3:][::-1]
print("\nDetected Peak Frequencies:")
for i, idx in enumerate(peak_indices, 1):
    print(f"  {i}. {positive_frequencies[idx]:.1f} Hz (Amplitude: {positive_magnitude[idx]:.1f})")

# Visualization
fig, axes = plt.subplots(3, 1, figsize=(12, 10))

# Time domain
axes[0].plot(t[:1000], y[:1000])
axes[0].set_xlabel('Time (seconds)')
axes[0].set_ylabel('Amplitude')
axes[0].set_title('Time Domain Signal')
axes[0].grid(True, alpha=0.3)

# Frequency domain (magnitude spectrum)
axes[1].plot(positive_frequencies, positive_magnitude)
axes[1].set_xlabel('Frequency (Hz)')
axes[1].set_ylabel('Amplitude')
axes[1].set_title('Magnitude Spectrum')
axes[1].set_xlim([0, 2000])
axes[1].grid(True, alpha=0.3)

# Logarithmic scale
axes[2].plot(positive_frequencies, 20*np.log10(positive_magnitude + 1e-10))
axes[2].set_xlabel('Frequency (Hz)')
axes[2].set_ylabel('Amplitude (dB)')
axes[2].set_title('Magnitude Spectrum (Logarithmic Scale)')
axes[2].set_xlim([0, 2000])
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('fft_analysis.png', dpi=150, bbox_inches='tight')
print("\nFigure saved: fft_analysis.png")

1.3.2 Spectrogram

A Spectrogram is a time-frequency representation of audio. It uses the Short-Time Fourier Transform (STFT).

$$ \text{STFT}(t, f) = \sum_{n=-\infty}^{\infty} x[n] \cdot w[n-t] \cdot e^{-i 2\pi f n} $$

where $w[n]$ is a window function (Hamming window, Hanning window, etc.).
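
Conceptually the STFT is just the FFT applied to short, windowed frames. The sketch below builds the frames by hand with a Hann window and compares the result's shape with librosa.stft (illustrative; librosa pads and centers frames by default, and its periodic Hann window differs slightly from np.hanning, so individual values will not match exactly).

# Sketch: STFT by hand: slice into frames, apply a window, FFT each frame.
import numpy as np
import librosa

y, sr = librosa.load('sample.wav', duration=5.0)
n_fft, hop_length = 2048, 512

n_frames = 1 + (len(y) - n_fft) // hop_length
window = np.hanning(n_fft)                                   # window function w[n]
frames = np.stack([y[i * hop_length:i * hop_length + n_fft] * window
                   for i in range(n_frames)], axis=1)
manual_stft = np.fft.rfft(frames, axis=0)                    # one spectrum per frame

# center=False makes librosa use the same framing as above
D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, center=False)
print(manual_stft.shape, D.shape)                            # both (1025, n_frames)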

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: Spectrogram computation with the short-time Fourier transform (STFT)

Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load audio file
y, sr = librosa.load('sample.wav', duration=5.0)

print("=== Spectrogram ===")

# STFT computation
n_fft = 2048
hop_length = 512
D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)

# Amplitude spectrogram
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)

print(f"Spectrogram Shape: {S_db.shape}")
print(f"  Frequency Bins: {S_db.shape[0]}")
print(f"  Time Frames: {S_db.shape[1]}")
print(f"  Frequency Resolution: {sr/n_fft:.2f} Hz")
print(f"  Time Resolution: {hop_length/sr*1000:.2f} ms")

# Visualization
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Waveform
librosa.display.waveshow(y, sr=sr, ax=axes[0])
axes[0].set_title('Audio Waveform', fontsize=14)
axes[0].set_ylabel('Amplitude')

# Spectrogram
img = librosa.display.specshow(S_db, sr=sr, hop_length=hop_length,
                                x_axis='time', y_axis='hz', ax=axes[1], cmap='viridis')
axes[1].set_title('Spectrogram', fontsize=14)
axes[1].set_ylabel('Frequency (Hz)')
axes[1].set_xlabel('Time (seconds)')
fig.colorbar(img, ax=axes[1], format='%+2.0f dB')

plt.tight_layout()
plt.savefig('spectrogram.png', dpi=150, bbox_inches='tight')
print("Figure saved: spectrogram.png")

# Power spectrogram
S_power = np.abs(D)**2
print(f"\nPower Range: [{S_power.min():.2e}, {S_power.max():.2e}]")

1.3.3 Mel-Spectrogram

A Mel-Spectrogram uses a frequency scale that considers human auditory characteristics.

The Mel Scale is a measure of perceived pitch:

$$ m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) $$

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: Mel-spectrogram computation and comparison with a linear-frequency spectrogram

Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load audio
y, sr = librosa.load('sample.wav', duration=5.0)

print("=== Mel-Spectrogram ===")

# Compute mel-spectrogram
n_fft = 2048
hop_length = 512
n_mels = 128

S_mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                        hop_length=hop_length, n_mels=n_mels)
S_mel_db = librosa.power_to_db(S_mel, ref=np.max)

print(f"Mel-Spectrogram Shape: {S_mel_db.shape}")
print(f"  Mel Bands: {n_mels}")
print(f"  Time Frames: {S_mel_db.shape[1]}")

# Regular spectrogram
D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)

# Comparison visualization
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Regular spectrogram
img1 = librosa.display.specshow(S_db, sr=sr, hop_length=hop_length,
                                  x_axis='time', y_axis='hz', ax=axes[0], cmap='viridis')
axes[0].set_title('Regular Spectrogram (Linear Frequency)', fontsize=14)
axes[0].set_ylabel('Frequency (Hz)')
fig.colorbar(img1, ax=axes[0], format='%+2.0f dB')

# Mel-spectrogram
img2 = librosa.display.specshow(S_mel_db, sr=sr, hop_length=hop_length,
                                  x_axis='time', y_axis='mel', ax=axes[1], cmap='viridis')
axes[1].set_title('Mel-Spectrogram (Mel Scale)', fontsize=14)
axes[1].set_ylabel('Frequency (Mel)')
axes[1].set_xlabel('Time (seconds)')
fig.colorbar(img2, ax=axes[1], format='%+2.0f dB')

plt.tight_layout()
plt.savefig('mel_spectrogram.png', dpi=150, bbox_inches='tight')
print("Figure saved: mel_spectrogram.png")

# Mel scale characteristics
print("\nMel Scale Characteristics:")
print("- Low frequencies: High frequency resolution")
print("- High frequencies: Low frequency resolution")
print("- Closer to human auditory characteristics")

1.4 Acoustic Features

What are Acoustic Features?

Acoustic Features are characteristic numerical representations extracted from audio signals. They are used as inputs to machine learning models.

1.4.1 MFCC (Mel-Frequency Cepstral Coefficients)

MFCC is the most widely used feature in speech recognition. It applies logarithm and Discrete Cosine Transform (DCT) to mel-spectrograms.

graph LR
    A[Audio Signal] --> B[Pre-emphasis]
    B --> C[Framing]
    C --> D[Windowing]
    D --> E[FFT]
    E --> F[Mel Filter Bank]
    F --> G[Logarithm]
    G --> H[DCT]
    H --> I[MFCC Coefficients]
    style A fill:#ffebee
    style I fill:#e8f5e9
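
The diagram above can be reproduced almost step by step: take a mel-spectrogram, apply logarithmic (dB) compression, then a DCT along the mel axis and keep the first coefficients. The sketch below roughly mirrors what librosa.feature.mfcc does internally and should match it with default settings (illustrative check).

# Sketch: MFCC "by hand" = mel filter bank -> logarithm -> DCT, keep 13 coefficients.
import numpy as np
import librosa
import scipy.fftpack

y, sr = librosa.load('sample.wav', duration=5.0)

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)   # mel filter bank
log_mel = librosa.power_to_db(mel)                             # logarithm (dB)
mfcc_manual = scipy.fftpack.dct(log_mel, axis=0, type=2, norm='ortho')[:13]

mfcc_librosa = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(np.allclose(mfcc_manual, mfcc_librosa))                  # True with librosa defaults
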
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: MFCC extraction with delta and delta-delta features

Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load audio
y, sr = librosa.load('sample.wav', duration=5.0)

print("=== MFCC (Mel-Frequency Cepstral Coefficients) ===")

# Compute MFCC
n_mfcc = 13  # Typically 13 or 20
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

print(f"MFCC Shape: {mfccs.shape}")
print(f"  MFCC Coefficients: {n_mfcc}")
print(f"  Time Frames: {mfccs.shape[1]}")

# Statistics
print(f"\nMFCC Statistics:")
print(f"  Mean: {np.mean(mfccs, axis=1)[:5]}")  # First 5 coefficients
print(f"  Standard Deviation: {np.std(mfccs, axis=1)[:5]}")

# Delta features (first derivative)
mfccs_delta = librosa.feature.delta(mfccs)
print(f"\nDelta MFCC Shape: {mfccs_delta.shape}")

# Delta-delta features (second derivative)
mfccs_delta2 = librosa.feature.delta(mfccs, order=2)
print(f"Delta-Delta MFCC Shape: {mfccs_delta2.shape}")

# Visualization
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# MFCC
img1 = librosa.display.specshow(mfccs, sr=sr, x_axis='time', ax=axes[0], cmap='coolwarm')
axes[0].set_title('MFCC', fontsize=14)
axes[0].set_ylabel('MFCC Coefficients')
fig.colorbar(img1, ax=axes[0])

# Delta MFCC
img2 = librosa.display.specshow(mfccs_delta, sr=sr, x_axis='time', ax=axes[1], cmap='coolwarm')
axes[1].set_title('Delta MFCC (First Derivative)', fontsize=14)
axes[1].set_ylabel('Delta MFCC Coefficients')
fig.colorbar(img2, ax=axes[1])

# Delta-Delta MFCC
img3 = librosa.display.specshow(mfccs_delta2, sr=sr, x_axis='time', ax=axes[2], cmap='coolwarm')
axes[2].set_title('Delta-Delta MFCC (Second Derivative)', fontsize=14)
axes[2].set_ylabel('Delta-Delta MFCC Coefficients')
axes[2].set_xlabel('Time (seconds)')
fig.colorbar(img3, ax=axes[2])

plt.tight_layout()
plt.savefig('mfcc_features.png', dpi=150, bbox_inches='tight')
print("\nFigure saved: mfcc_features.png")

# Combine feature vectors (common in speech recognition)
combined_features = np.vstack([mfccs, mfccs_delta, mfccs_delta2])
print(f"\nCombined Features Shape: {combined_features.shape}")
print(f"  (13 MFCC + 13 Delta + 13 Delta-Delta = 39 dimensions)")

1.4.2 Chroma Features

Chroma Features represent tonality and chords in music. They calculate energy for each of the 12 pitch classes (C, C#, D, ..., B).

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: Chroma feature (chromagram) extraction

Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load music file
y, sr = librosa.load('music.wav', duration=10.0)

print("=== Chroma Features ===")

# Compute chromagram
chromagram = librosa.feature.chroma_stft(y=y, sr=sr)

print(f"Chromagram Shape: {chromagram.shape}")
print(f"  Pitch Classes: 12 (C, C#, D, D#, E, F, F#, G, G#, A, A#, B)")
print(f"  Time Frames: {chromagram.shape[1]}")

# CQT-based chroma
chromagram_cqt = librosa.feature.chroma_cqt(y=y, sr=sr)
print(f"\nCQT Chromagram Shape: {chromagram_cqt.shape}")

# Visualization
fig, axes = plt.subplots(3, 1, figsize=(14, 10))

# Waveform
librosa.display.waveshow(y, sr=sr, ax=axes[0])
axes[0].set_title('Audio Waveform', fontsize=14)
axes[0].set_ylabel('Amplitude')

# STFT-based chroma
img1 = librosa.display.specshow(chromagram, sr=sr, x_axis='time',
                                  y_axis='chroma', ax=axes[1], cmap='coolwarm')
axes[1].set_title('Chromagram (STFT)', fontsize=14)
axes[1].set_ylabel('Pitch Class')
fig.colorbar(img1, ax=axes[1])

# CQT-based chroma
img2 = librosa.display.specshow(chromagram_cqt, sr=sr, x_axis='time',
                                  y_axis='chroma', ax=axes[2], cmap='coolwarm')
axes[2].set_title('Chromagram (CQT)', fontsize=14)
axes[2].set_ylabel('Pitch Class')
axes[2].set_xlabel('Time (seconds)')
fig.colorbar(img2, ax=axes[2])

plt.tight_layout()
plt.savefig('chroma_features.png', dpi=150, bbox_inches='tight')
print("Figure saved: chroma_features.png")

# Mean energy for each pitch class
pitch_classes = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
mean_energy = np.mean(chromagram, axis=1)

print("\nMean Energy for Each Pitch Class:")
for pc, energy in zip(pitch_classes, mean_energy):
    print(f"  {pc}: {energy:.3f}")

1.4.3 Spectral Features

Spectral Features capture characteristics of frequency distribution.

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: Spectral feature extraction (centroid, rolloff, bandwidth, contrast, flatness)

Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load audio
y, sr = librosa.load('sample.wav', duration=5.0)

print("=== Spectral Features ===")

# 1. Spectral Centroid
spectral_centroids = librosa.feature.spectral_centroid(y=y, sr=sr)[0]
print(f"Spectral Centroid: Mean {np.mean(spectral_centroids):.2f} Hz")

# 2. Spectral Rolloff
spectral_rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)[0]
print(f"Spectral Rolloff: Mean {np.mean(spectral_rolloff):.2f} Hz")

# 3. Spectral Bandwidth
spectral_bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)[0]
print(f"Spectral Bandwidth: Mean {np.mean(spectral_bandwidth):.2f} Hz")

# 4. Spectral Contrast
spectral_contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
print(f"Spectral Contrast Shape: {spectral_contrast.shape}")

# 5. Spectral Flatness
spectral_flatness = librosa.feature.spectral_flatness(y=y)[0]
print(f"Spectral Flatness: Mean {np.mean(spectral_flatness):.4f}")

# Time axis
frames = range(len(spectral_centroids))
t = librosa.frames_to_time(frames, sr=sr)

# Visualization
fig, axes = plt.subplots(5, 1, figsize=(14, 14))

# Spectrogram (background)
D = librosa.stft(y)
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)

# 1. Spectral Centroid
librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='hz', ax=axes[0], cmap='gray_r', alpha=0.5)
axes[0].plot(t, spectral_centroids, color='red', linewidth=2, label='Spectral Centroid')
axes[0].set_title('Spectral Centroid', fontsize=14)
axes[0].set_ylabel('Frequency (Hz)')
axes[0].legend()

# 2. Spectral Rolloff
librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='hz', ax=axes[1], cmap='gray_r', alpha=0.5)
axes[1].plot(t, spectral_rolloff, color='blue', linewidth=2, label='Spectral Rolloff')
axes[1].set_title('Spectral Rolloff', fontsize=14)
axes[1].set_ylabel('Frequency (Hz)')
axes[1].legend()

# 3. Spectral Bandwidth
axes[2].plot(t, spectral_bandwidth, color='green', linewidth=2)
axes[2].set_title('Spectral Bandwidth', fontsize=14)
axes[2].set_ylabel('Bandwidth (Hz)')
axes[2].grid(True, alpha=0.3)

# 4. Spectral Contrast
img = librosa.display.specshow(spectral_contrast, sr=sr, x_axis='time', ax=axes[3], cmap='coolwarm')
axes[3].set_title('Spectral Contrast', fontsize=14)
axes[3].set_ylabel('Frequency Band')
fig.colorbar(img, ax=axes[3])

# 5. Spectral Flatness
axes[4].plot(t, spectral_flatness, color='purple', linewidth=2)
axes[4].set_title('Spectral Flatness (0=Tonal, 1=Noisy)', fontsize=14)
axes[4].set_ylabel('Flatness')
axes[4].set_xlabel('Time (seconds)')
axes[4].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('spectral_features.png', dpi=150, bbox_inches='tight')
print("\nFigure saved: spectral_features.png")

print("\nInterpretation:")
print("- Centroid: \"Brightness\" of sound")
print("- Rolloff: Energy concentration")
print("- Bandwidth: Frequency spread")
print("- Contrast: Contrast between frequency bands")
print("- Flatness: Tonal (0) or noisy (1)")

1.4.4 Feature Extraction Pipeline

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0

import librosa
import numpy as np

def extract_audio_features(audio_path, sr=22050):
    """
    Extract comprehensive features from audio file
    """
    # Load audio
    y, sr = librosa.load(audio_path, sr=sr)

    # Feature dictionary
    features = {}

    # 1. Basic statistics
    features['duration'] = len(y) / sr
    features['rms_mean'] = float(np.mean(librosa.feature.rms(y=y)))
    features['zcr_mean'] = float(np.mean(librosa.feature.zero_crossing_rate(y)))

    # 2. MFCC (statistics)
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    for i in range(13):
        features[f'mfcc_{i}_mean'] = float(np.mean(mfccs[i]))
        features[f'mfcc_{i}_std'] = float(np.std(mfccs[i]))

    # 3. Spectral features
    features['spectral_centroid_mean'] = float(np.mean(librosa.feature.spectral_centroid(y=y, sr=sr)))
    features['spectral_rolloff_mean'] = float(np.mean(librosa.feature.spectral_rolloff(y=y, sr=sr)))
    features['spectral_bandwidth_mean'] = float(np.mean(librosa.feature.spectral_bandwidth(y=y, sr=sr)))
    features['spectral_flatness_mean'] = float(np.mean(librosa.feature.spectral_flatness(y=y)))

    # 4. Chroma features
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    features['chroma_mean'] = float(np.mean(chroma))
    features['chroma_std'] = float(np.std(chroma))

    # 5. Tempo
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    features['tempo'] = float(tempo)

    return features

# Usage example
print("=== Audio Feature Extraction Pipeline ===")
audio_file = 'sample.wav'
features = extract_audio_features(audio_file)

print(f"\nNumber of Extracted Features: {len(features)}")
print("\nKey Features:")
for key in list(features.keys())[:10]:
    print(f"  {key}: {features[key]:.4f}")

# Get as feature vector
feature_vector = np.array(list(features.values()))
print(f"\nFeature Vector Dimensionality: {len(feature_vector)}")
print(f"Feature Vector (first 10 dimensions): {feature_vector[:10]}")

# Extract features from multiple files
audio_files = ['sample1.wav', 'sample2.wav', 'sample3.wav']
all_features = []

print("\nFeature Extraction from Multiple Files:")
for audio_file in audio_files:
    try:
        feats = extract_audio_features(audio_file)
        all_features.append(list(feats.values()))
        print(f"  βœ“ {audio_file}: {len(feats)} features")
    except Exception as e:
        print(f"  βœ— {audio_file}: Error ({e})")

if all_features:
    # Convert to NumPy array (use as input to machine learning model)
    X = np.array(all_features)
    print(f"\nFeature Matrix Shape: {X.shape}")
    print(f"  (Number of Samples, Number of Features)")

1.5 Audio Data Preprocessing

Importance of Preprocessing

Audio data preprocessing significantly impacts model performance. It includes noise reduction, normalization, data augmentation, etc.

1.5.1 Noise Reduction

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: 1.5.1 Noise Reduction

Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import wiener

# Load audio
y, sr = librosa.load('sample.wav')

print("=== Noise Reduction ===")

# 1. Wiener filter
y_wiener = wiener(y)
print(f"Wiener filter applied")

# 2. Spectral gating (simplified version)
# Estimate noise profile (assume first 0.5 seconds is noise)
noise_sample = y[:int(0.5 * sr)]
noise_spec = np.abs(librosa.stft(noise_sample))
noise_profile = np.mean(noise_spec, axis=1, keepdims=True)

# Full spectrogram
D = librosa.stft(y)
magnitude = np.abs(D)
phase = np.angle(D)

# Noise reduction (spectral subtraction)
magnitude_clean = np.maximum(magnitude - 2.0 * noise_profile, 0.0)
D_clean = magnitude_clean * np.exp(1j * phase)
y_clean = librosa.istft(D_clean)

print(f"Spectral subtraction applied")

# 3. High-pass filter (remove low-frequency noise)
from scipy.signal import butter, filtfilt

def highpass_filter(data, cutoff, fs, order=5):
    nyquist = 0.5 * fs
    normal_cutoff = cutoff / nyquist
    b, a = butter(order, normal_cutoff, btype='high', analog=False)
    return filtfilt(b, a, data)

y_highpass = highpass_filter(y, cutoff=80, fs=sr)
print(f"High-pass filter applied (cut below 80Hz)")

# Visualization
fig, axes = plt.subplots(4, 1, figsize=(14, 12))

# Original
librosa.display.waveshow(y, sr=sr, ax=axes[0], color='blue')
axes[0].set_title('Original Audio', fontsize=14)
axes[0].set_ylabel('Amplitude')

# Wiener filter
librosa.display.waveshow(y_wiener, sr=sr, ax=axes[1], color='green')
axes[1].set_title('After Wiener Filter', fontsize=14)
axes[1].set_ylabel('Amplitude')

# Spectral subtraction
librosa.display.waveshow(y_clean, sr=sr, ax=axes[2], color='red')
axes[2].set_title('After Spectral Subtraction', fontsize=14)
axes[2].set_ylabel('Amplitude')

# High-pass filter
librosa.display.waveshow(y_highpass, sr=sr, ax=axes[3], color='purple')
axes[3].set_title('After High-pass Filter', fontsize=14)
axes[3].set_ylabel('Amplitude')
axes[3].set_xlabel('Time (seconds)')

plt.tight_layout()
plt.savefig('noise_reduction.png', dpi=150, bbox_inches='tight')
print("\nFigure saved: noise_reduction.png")

# SNR calculation (simplified)
def calculate_snr(signal, noise):
    signal_power = np.mean(signal**2)
    noise_power = np.mean(noise**2)
    return 10 * np.log10(signal_power / noise_power)

print(f"\nSNR Improvement:")
print(f"  Original: Baseline")
print(f"  Wiener: +{calculate_snr(y_wiener, y-y_wiener):.2f} dB")
print(f"  Spectral Subtraction: +{calculate_snr(y_clean, y-y_clean):.2f} dB")

1.5.2 Normalization

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: 1.5.2 Normalization

Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

# Load audio
y, sr = librosa.load('sample.wav')

print("=== Normalization ===")

# 1. Peak normalization (set maximum amplitude to 1.0)
y_peak_norm = y / np.max(np.abs(y))
print(f"Peak normalization: Max value = {np.max(np.abs(y_peak_norm)):.3f}")

# 2. RMS normalization (adjust to target RMS level)
target_rms = 0.1
current_rms = np.sqrt(np.mean(y**2))
y_rms_norm = y * (target_rms / current_rms)
new_rms = np.sqrt(np.mean(y_rms_norm**2))
print(f"RMS normalization: {current_rms:.4f} β†’ {new_rms:.4f}")

# 3. Z-score normalization (mean 0, standard deviation 1)
y_zscore = (y - np.mean(y)) / np.std(y)
print(f"Z-score normalization: Mean = {np.mean(y_zscore):.6f}, Std = {np.std(y_zscore):.3f}")

# 4. Min-Max normalization (-1 to 1)
y_minmax = 2 * (y - np.min(y)) / (np.max(y) - np.min(y)) - 1
print(f"Min-Max normalization: Range = [{np.min(y_minmax):.3f}, {np.max(y_minmax):.3f}]")

# Visualization
fig, axes = plt.subplots(4, 2, figsize=(14, 12))

normalizations = [
    ('Peak Normalization', y_peak_norm),
    ('RMS Normalization', y_rms_norm),
    ('Z-score Normalization', y_zscore),
    ('Min-Max Normalization', y_minmax)
]

for idx, (title, y_norm) in enumerate(normalizations):
    # Waveform
    librosa.display.waveshow(y_norm, sr=sr, ax=axes[idx, 0])
    axes[idx, 0].set_title(f'{title} - Waveform', fontsize=12)
    axes[idx, 0].set_ylabel('Amplitude')

    # Histogram
    axes[idx, 1].hist(y_norm, bins=50, alpha=0.7, color='blue')
    axes[idx, 1].set_title(f'{title} - Amplitude Distribution', fontsize=12)
    axes[idx, 1].set_xlabel('Amplitude')
    axes[idx, 1].set_ylabel('Frequency')
    axes[idx, 1].grid(True, alpha=0.3)

axes[3, 0].set_xlabel('Time (seconds)')

plt.tight_layout()
plt.savefig('normalization.png', dpi=150, bbox_inches='tight')
print("\nFigure saved: normalization.png")

# Statistics after normalization
print("\nPost-Normalization Statistics:")
for title, y_norm in normalizations:
    print(f"{title}:")
    print(f"  Mean: {np.mean(y_norm):.6f}")
    print(f"  Standard Deviation: {np.std(y_norm):.6f}")
    print(f"  Range: [{np.min(y_norm):.3f}, {np.max(y_norm):.3f}]")

1.5.3 Trimming and Padding

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: 1.5.3 Trimming and Padding

Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

# Load audio
y, sr = librosa.load('sample.wav')

print("=== Trimming and Padding ===")

# 1. Trim silence
y_trimmed, index = librosa.effects.trim(y, top_db=20)
print(f"Trimming:")
print(f"  Original length: {len(y)/sr:.2f} seconds")
print(f"  After trimming: {len(y_trimmed)/sr:.2f} seconds")
print(f"  Reduction: {(1 - len(y_trimmed)/len(y))*100:.1f}%")

# 2. Pad or trim to fixed length
target_length = sr * 5  # 5 seconds

def pad_or_trim(audio, target_length):
    """Pad or trim to fixed length"""
    if len(audio) < target_length:
        # Padding
        pad_length = target_length - len(audio)
        audio_padded = np.pad(audio, (0, pad_length), mode='constant')
        return audio_padded, 'padded'
    else:
        # Trimming
        return audio[:target_length], 'trimmed'

y_fixed, action = pad_or_trim(y, target_length)
print(f"\nFixed Length Processing (5 seconds):")
print(f"  Action: {action}")
print(f"  Result length: {len(y_fixed)/sr:.2f} seconds")

# 3. Center padding
def center_pad(audio, target_length):
    """Center padding"""
    if len(audio) >= target_length:
        return audio[:target_length]

    pad_length = target_length - len(audio)
    pad_left = pad_length // 2
    pad_right = pad_length - pad_left
    return np.pad(audio, (pad_left, pad_right), mode='constant')

y_center_padded = center_pad(y_trimmed, target_length)
print(f"\nCenter Padding:")
print(f"  Result length: {len(y_center_padded)/sr:.2f} seconds")

# Visualization
fig, axes = plt.subplots(4, 1, figsize=(14, 10))

# Original
librosa.display.waveshow(y, sr=sr, ax=axes[0], color='blue')
axes[0].set_title('Original Audio', fontsize=14)
axes[0].set_ylabel('Amplitude')
axes[0].axvline(x=index[0]/sr, color='red', linestyle='--', alpha=0.5)
axes[0].axvline(x=index[1]/sr, color='red', linestyle='--', alpha=0.5)

# After trimming
librosa.display.waveshow(y_trimmed, sr=sr, ax=axes[1], color='green')
axes[1].set_title('After Silence Trimming', fontsize=14)
axes[1].set_ylabel('Amplitude')

# Fixed length
librosa.display.waveshow(y_fixed, sr=sr, ax=axes[2], color='red')
axes[2].set_title(f'After Fixed Length Processing (5 seconds, {action})', fontsize=14)
axes[2].set_ylabel('Amplitude')

# Center padding
librosa.display.waveshow(y_center_padded, sr=sr, ax=axes[3], color='purple')
axes[3].set_title('Center Padding', fontsize=14)
axes[3].set_ylabel('Amplitude')
axes[3].set_xlabel('Time (seconds)')

plt.tight_layout()
plt.savefig('trim_pad.png', dpi=150, bbox_inches='tight')
print("\nFigure saved: trim_pad.png")

1.5.4 Data Augmentation

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: 1.5.4 Data Augmentation

Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

# Load audio
y, sr = librosa.load('sample.wav', duration=3.0)

print("=== Data Augmentation ===")

# 1. Pitch Shifting
y_pitch_up = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)  # Shift up 4 semitones
y_pitch_down = librosa.effects.pitch_shift(y, sr=sr, n_steps=-4)  # Shift down 4 semitones
print("Pitch shift: Β±4 semitones")

# 2. Time Stretching
y_fast = librosa.effects.time_stretch(y, rate=1.5)  # 1.5x speed
y_slow = librosa.effects.time_stretch(y, rate=0.8)  # 0.8x speed
print(f"Time stretch: 1.5x speed({len(y_fast)/sr:.2f}s), 0.8x speed({len(y_slow)/sr:.2f}s)")

# 3. Add noise
noise_factor = 0.005
noise = np.random.randn(len(y))
y_noise = y + noise_factor * noise
print(f"White noise added: Noise level = {noise_factor}")

# 4. Gain adjustment (volume change)
y_louder = y * 1.5
y_quieter = y * 0.5
print("Gain adjustment: 1.5x, 0.5x")

# 5. Time Shift
shift_samples = int(0.2 * sr)  # 0.2 second shift
y_shift = np.roll(y, shift_samples)
print(f"Time shift: {shift_samples/sr:.2f} seconds")

# Visualization
fig, axes = plt.subplots(6, 1, figsize=(14, 16))

augmentations = [
    ('Original', y),
    ('Pitch Shift (+4 semitones)', y_pitch_up),
    ('Time Stretch (1.5x)', y_fast),
    ('White Noise Added', y_noise),
    ('Gain Adjustment (1.5x)', y_louder),
    ('Time Shift (0.2s)', y_shift)
]

for idx, (title, y_aug) in enumerate(augmentations):
    librosa.display.waveshow(y_aug, sr=sr, ax=axes[idx])
    axes[idx].set_title(title, fontsize=14)
    axes[idx].set_ylabel('Amplitude')

axes[5].set_xlabel('Time (seconds)')

plt.tight_layout()
plt.savefig('data_augmentation.png', dpi=150, bbox_inches='tight')
print("\nFigure saved: data_augmentation.png")

# Data augmentation pipeline
def augment_audio(y, sr):
    """Apply data augmentation randomly"""
    augmentations = []

    # Pitch shift (50% probability)
    if np.random.rand() > 0.5:
        n_steps = np.random.randint(-3, 4)
        y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
        augmentations.append(f'pitch_shift({n_steps})')

    # Time stretch (50% probability)
    if np.random.rand() > 0.5:
        rate = np.random.uniform(0.8, 1.2)
        y = librosa.effects.time_stretch(y, rate=rate)
        augmentations.append(f'time_stretch({rate:.2f})')

    # Add noise (30% probability)
    if np.random.rand() > 0.7:
        noise = np.random.randn(len(y))
        y = y + 0.005 * noise
        augmentations.append('add_noise')

    # Gain adjustment (50% probability)
    if np.random.rand() > 0.5:
        gain = np.random.uniform(0.7, 1.3)
        y = y * gain
        augmentations.append(f'gain({gain:.2f})')

    return y, augmentations

# Usage example
print("\nRandom Augmentation Examples:")
for i in range(3):
    y_aug, augs = augment_audio(y, sr)
    print(f"  Trial {i+1}: {', '.join(augs) if augs else 'none'}")

1.6 Chapter Summary

What We Learned

  1. Fundamentals of Digital Audio

    • Sampling rate: Temporal discretization
    • Quantization and bit depth: Amplitude discretization
    • Audio file formats: WAV, MP3, FLAC
    • Audio loading and basic operations with librosa
  2. Time Domain Processing

    • Amplitude, energy, RMS
    • Zero Crossing Rate (ZCR)
    • Audio waveform visualization
  3. Frequency Domain Processing

    • Fourier Transform (FFT): Time β†’ Frequency
    • Spectrogram: Time-frequency representation
    • Mel-spectrogram: Considering human auditory characteristics
  4. Acoustic Features

    • MFCC: Standard feature for speech recognition
    • Chroma: Tonality representation in music
    • Spectral Features: Frequency distribution characteristics
    • Feature extraction pipeline
  5. Audio Data Preprocessing

    • Noise reduction: Wiener filter, spectral subtraction
    • Normalization: Peak, RMS, Z-score
    • Trimming and padding: Fixed-length conversion
    • Data augmentation: Pitch shift, time stretch

Example Processing Pipeline

graph LR
    A[Audio File] --> B[Load<br>librosa.load]
    B --> C[Preprocessing<br>Trim/Normalize]
    C --> D[Feature Extraction<br>MFCC/Mel-Spec]
    D --> E[Data Augmentation<br>Pitch/Time]
    E --> F[Model Input<br>NumPy Array]
    style A fill:#ffebee
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style D fill:#e3f2fd
    style E fill:#e8f5e9
    style F fill:#c8e6c9

Feature Selection Guidelines

| Task | Recommended Features | Reason |
|------|----------------------|--------|
| Speech Recognition | MFCC + Delta | Captures phoneme characteristics |
| Speaker Identification | MFCC + Pitch | Speaker-specific characteristics |
| Music Genre Classification | Chroma + Spectral | Tonality and texture |
| Environmental Sound Recognition | Mel-Spectrogram | Time-frequency patterns |
| Emotion Recognition | MFCC + Prosodic Features | Speech expressiveness |

Next Chapter

In Chapter 2, we will learn the Fundamentals of Speech Recognition.


Exercises

Problem 1 (Difficulty: Easy)

How many samples does a 3-second audio file recorded at a sampling rate of 44100Hz consist of? Also, based on the Nyquist theorem, what is the highest frequency that this audio file can accurately represent?

Solution

Answer:

Calculating Number of Samples:

$$ \text{Number of Samples} = \text{Sampling Rate} \times \text{Duration} $$

$$ = 44100 \text{ Hz} \times 3 \text{ seconds} = 132300 \text{ samples} $$

Calculating Maximum Frequency (Nyquist Theorem):

$$ f_{\max} = \frac{f_s}{2} = \frac{44100}{2} = 22050 \text{ Hz} $$

Answer: 132,300 samples; the highest frequency that can be accurately represented is 22,050 Hz.

Note: The human audible range is approximately 20-20,000 Hz, so a sampling rate of 44,100 Hz provides sufficient quality for music CDs.

Problem 2 (Difficulty: Medium)

Complete the following code to extract MFCCs from an audio file and calculate their statistics (mean and standard deviation).

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0

"""
Example: MFCC extraction and statistics (exercise template)

Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner to Intermediate
Execution time: 10-30 seconds
Dependencies: None
"""

import librosa
import numpy as np

# Load audio
y, sr = librosa.load('sample.wav')

# Extract MFCC (13 coefficients)
# Implementation exercise for students

# Calculate mean and standard deviation for each MFCC coefficient
# Implementation exercise for students

# Display results
# Implementation exercise for students
Solution
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: MFCC extraction and statistics (solution)

Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import librosa
import librosa.display
import numpy as np

# Load audio
y, sr = librosa.load('sample.wav')

# Extract MFCC (13 coefficients)
n_mfcc = 13
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

print(f"=== MFCC Statistics ===")
print(f"MFCC Shape: {mfccs.shape}")
print(f"  Coefficients: {n_mfcc}")
print(f"  Time Frames: {mfccs.shape[1]}")

# Calculate mean and standard deviation for each MFCC coefficient
mfcc_mean = np.mean(mfccs, axis=1)
mfcc_std = np.std(mfccs, axis=1)

# Display results
print("\nStatistics for Each MFCC Coefficient:")
print(f"{'Coefficient':<10} {'Mean':<15} {'Std Dev':<15}")
print("-" * 40)
for i in range(n_mfcc):
    print(f"MFCC-{i:<4} {mfcc_mean[i]:<15.4f} {mfcc_std[i]:<15.4f}")

# Combine as feature vector (used in machine learning)
feature_vector = np.concatenate([mfcc_mean, mfcc_std])
print(f"\nFeature Vector Dimensionality: {len(feature_vector)}")
print(f"  (13 means + 13 std devs = 26 dimensions)")

# Visualization
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 1, figsize=(12, 8))

# MFCC heatmap
img = librosa.display.specshow(mfccs, sr=sr, x_axis='time', ax=axes[0], cmap='coolwarm')
axes[0].set_title('MFCC', fontsize=14)
axes[0].set_ylabel('MFCC Coefficients')
fig.colorbar(img, ax=axes[0])

# Bar plot of statistics
x = np.arange(n_mfcc)
width = 0.35
axes[1].bar(x - width/2, mfcc_mean, width, label='Mean', alpha=0.8)
axes[1].bar(x + width/2, mfcc_std, width, label='Std Dev', alpha=0.8)
axes[1].set_xlabel('MFCC Coefficients')
axes[1].set_ylabel('Value')
axes[1].set_title('MFCC Statistics', fontsize=14)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('mfcc_statistics.png', dpi=150)
print("\nFigure saved: mfcc_statistics.png")

Sample Output:

=== MFCC Statistics ===
MFCC Shape: (13, 130)
  Coefficients: 13
  Time Frames: 130

Statistics for Each MFCC Coefficient:
Coefficient  Mean             Std Dev
----------------------------------------
MFCC-0     -123.4567        45.2341
MFCC-1      32.1234         12.5678
MFCC-2      -8.9012          7.3456
MFCC-3       5.6789          6.1234
...

Feature Vector Dimensionality: 26
  (13 means + 13 std devs = 26 dimensions)

Problem 3 (Difficulty: Medium)

Explain the differences between spectrograms and mel-spectrograms, and discuss why the mel scale is useful in audio processing.

Solution

Answer:

Spectrogram vs Mel-Spectrogram:

| Characteristic | Spectrogram | Mel-Spectrogram |
|----------------|-------------|-----------------|
| Frequency Scale | Linear (Hz) | Mel scale (logarithmic) |
| Frequency Resolution | Uniform across all frequencies | High at low frequencies, low at high frequencies |
| Dimensionality | High (n_fft/2 + 1) | Low (typically 40-128) |
| Computation Method | STFT | STFT + Mel filter bank |

Usefulness of Mel Scale:

  1. Matches Human Auditory Characteristics

    • Human ears are sensitive to changes in low frequencies, less sensitive at high frequencies
    • Mel scale mathematically models this characteristic
  2. Dimensionality Reduction

    • Linear spectrogram: 1025 dimensions (for n_fft=2048)
    • Mel-spectrogram: 40-128 dimensions
    • Improved computational and learning efficiency
  3. Suitability for Speech Recognition

    • Focuses on important information in speech (low to mid frequencies)
    • Reduces impact of high-frequency noise
  4. Conversion Formula

    $$ m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) $$

    • Low frequencies (e.g., 100Hzβ†’150Hz): Large mel change
    • High frequencies (e.g., 5000Hzβ†’5100Hz): Small mel change

Practical Examples:
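
As a quick numerical check of point 4, an illustrative sketch applying the conversion formula at low and high frequencies:

# The same conversion formula applied at low and high frequencies.
import numpy as np

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

print(f"100 Hz -> 150 Hz:   mel change = {hz_to_mel(150) - hz_to_mel(100):.1f}")
print(f"5000 Hz -> 5100 Hz: mel change = {hz_to_mel(5100) - hz_to_mel(5000):.1f}")

# A 50 Hz step near 100 Hz spans roughly 68 mel, while a 100 Hz step near 5 kHz
# spans only about 20 mel: mel bands are narrow at low frequencies and wide at high ones.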

Problem 4 (Difficulty: Hard)

Apply the following data augmentations to an audio file and compare the MFCCs of the original and augmented audio:

  1. Pitch shift (+3 semitones)
  2. Time stretch (1.2x speed)
  3. White noise addition (SNR=20dB)
Solution
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: Data augmentation (pitch shift, time stretch, noise addition) and MFCC comparison

Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

# Load audio
y, sr = librosa.load('sample.wav', duration=3.0)

print("=== Data Augmentation and MFCC Comparison ===")

# Original MFCC
mfcc_original = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# 1. Pitch shift (+3 semitones)
y_pitch = librosa.effects.pitch_shift(y, sr=sr, n_steps=3)
mfcc_pitch = librosa.feature.mfcc(y=y_pitch, sr=sr, n_mfcc=13)
print("1. Pitch shift: +3 semitones")

# 2. Time stretch (1.2x speed)
y_stretch = librosa.effects.time_stretch(y, rate=1.2)
mfcc_stretch = librosa.feature.mfcc(y=y_stretch, sr=sr, n_mfcc=13)
print("2. Time stretch: 1.2x speed")

# 3. White noise addition (SNR=20dB)
def add_noise(signal, snr_db):
    """Add white noise at specified SNR"""
    signal_power = np.mean(signal ** 2)
    snr_linear = 10 ** (snr_db / 10)
    noise_power = signal_power / snr_linear
    noise = np.sqrt(noise_power) * np.random.randn(len(signal))
    return signal + noise

y_noise = add_noise(y, snr_db=20)
mfcc_noise = librosa.feature.mfcc(y=y_noise, sr=sr, n_mfcc=13)
print("3. White noise added: SNR=20dB")

# Compare MFCC statistics
def mfcc_statistics(mfcc):
    return {
        'mean': np.mean(mfcc, axis=1),
        'std': np.std(mfcc, axis=1)
    }

stats_original = mfcc_statistics(mfcc_original)
stats_pitch = mfcc_statistics(mfcc_pitch)
stats_stretch = mfcc_statistics(mfcc_stretch)
stats_noise = mfcc_statistics(mfcc_noise)

print("\nMFCC Statistics Comparison (First 5 Coefficients):")
print(f"{'Coef':<10} {'Orig Mean':<15} {'Pitch Mean':<15} {'Stretch Mean':<15} {'Noise Mean':<15}")
print("-" * 70)
for i in range(5):
    print(f"MFCC-{i:<4} {stats_original['mean'][i]:<15.3f} {stats_pitch['mean'][i]:<15.3f} "
          f"{stats_stretch['mean'][i]:<15.3f} {stats_noise['mean'][i]:<15.3f}")

# Calculate Euclidean distance
def euclidean_distance(mfcc1, mfcc2):
    """Euclidean distance between two MFCCs"""
    # Average over time
    mean1 = np.mean(mfcc1, axis=1)
    mean2 = np.mean(mfcc2, axis=1)
    return np.linalg.norm(mean1 - mean2)

dist_pitch = euclidean_distance(mfcc_original, mfcc_pitch)
dist_stretch = euclidean_distance(mfcc_original, mfcc_stretch)
dist_noise = euclidean_distance(mfcc_original, mfcc_noise)

print("\nDistance from Original MFCC:")
print(f"  Pitch shift: {dist_pitch:.3f}")
print(f"  Time stretch: {dist_stretch:.3f}")
print(f"  Noise addition: {dist_noise:.3f}")

# Visualization
fig, axes = plt.subplots(4, 2, figsize=(16, 14))

augmentations = [
    ('Original', y, mfcc_original),
    ('Pitch Shift (+3 semitones)', y_pitch, mfcc_pitch),
    ('Time Stretch (1.2x)', y_stretch, mfcc_stretch),
    ('White Noise (SNR=20dB)', y_noise, mfcc_noise)
]

for idx, (title, audio, mfcc) in enumerate(augmentations):
    # Waveform
    librosa.display.waveshow(audio, sr=sr, ax=axes[idx, 0])
    axes[idx, 0].set_title(f'{title} - Waveform', fontsize=12)
    axes[idx, 0].set_ylabel('Amplitude')

    # MFCC
    img = librosa.display.specshow(mfcc, sr=sr, x_axis='time', ax=axes[idx, 1], cmap='coolwarm')
    axes[idx, 1].set_title(f'{title} - MFCC', fontsize=12)
    axes[idx, 1].set_ylabel('MFCC Coefficients')
    fig.colorbar(img, ax=axes[idx, 1])

axes[3, 0].set_xlabel('Time (seconds)')
axes[3, 1].set_xlabel('Time (seconds)')

plt.tight_layout()
plt.savefig('augmentation_mfcc_comparison.png', dpi=150, bbox_inches='tight')
print("\nFigure saved: augmentation_mfcc_comparison.png")

# Discussion
print("\n=== Discussion ===")
print("1. Pitch Shift:")
print("   - MFCC values change (especially lower-order coefficients)")
print("   - Speaker and pitch information changes")
print("\n2. Time Stretch:")
print("   - Number of frames changes")
print("   - MFCC statistics remain relatively similar")
print("\n3. Noise Addition:")
print("   - Higher-order MFCC coefficients are affected")
print("   - Lower-order coefficients are relatively robust")

Sample Output:

=== Data Augmentation and MFCC Comparison ===
1. Pitch shift: +3 semitones
2. Time stretch: 1.2x speed
3. White noise added: SNR=20dB

MFCC Statistics Comparison (First 5 Coefficients):
Coef       Orig Mean       Pitch Mean      Stretch Mean    Noise Mean
----------------------------------------------------------------------
MFCC-0     -123.456        -125.234        -123.789        -122.891
MFCC-1      32.123          35.678          32.456          31.234
MFCC-2      -8.901         -10.234          -8.756          -9.123
MFCC-3       5.678           6.234           5.789           5.456
MFCC-4      -3.456          -4.123          -3.567          -3.891

Distance from Original MFCC:
  Pitch shift: 12.345
  Time stretch: 2.567
  Noise addition: 4.891

=== Discussion ===
1. Pitch Shift:
   - MFCC values change (especially lower-order coefficients)
   - Speaker and pitch information changes

2. Time Stretch:
   - Number of frames changes
   - MFCC statistics remain relatively similar

3. Noise Addition:
   - Higher-order MFCC coefficients are affected
   - Lower-order coefficients are relatively robust

Problem 5 (Difficulty: Hard)

Implement an audio preprocessing pipeline. The pipeline should include:

  1. Trimming silence
  2. RMS normalization (target RMS=0.1)
  3. Fixed-length adjustment (5 seconds)
  4. Mel-spectrogram extraction
  5. Normalization (mean 0, standard deviation 1)
Solution
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

class AudioPreprocessor:
    """Audio preprocessing pipeline"""

    def __init__(self, sr=22050, target_length=5.0, target_rms=0.1,
                 n_mels=128, n_fft=2048, hop_length=512):
        """
        Parameters:
        -----------
        sr : int
            Sampling rate
        target_length : float
            Target duration (seconds)
        target_rms : float
            Target RMS level
        n_mels : int
            Number of mel bands
        n_fft : int
            FFT size
        hop_length : int
            Hop length
        """
        self.sr = sr
        self.target_length = target_length
        self.target_samples = int(sr * target_length)
        self.target_rms = target_rms
        self.n_mels = n_mels
        self.n_fft = n_fft
        self.hop_length = hop_length

    def trim_silence(self, y, top_db=20):
        """1. Trim silence"""
        y_trimmed, _ = librosa.effects.trim(y, top_db=top_db)
        return y_trimmed

    def rms_normalize(self, y):
        """2. RMS normalization"""
        current_rms = np.sqrt(np.mean(y**2))
        if current_rms > 0:
            y_normalized = y * (self.target_rms / current_rms)
        else:
            y_normalized = y
        return y_normalized

    def fix_length(self, y):
        """3. Fixed-length adjustment"""
        if len(y) < self.target_samples:
            # Padding
            pad_length = self.target_samples - len(y)
            y_fixed = np.pad(y, (0, pad_length), mode='constant')
        else:
            # Trimming
            y_fixed = y[:self.target_samples]
        return y_fixed

    def extract_mel_spectrogram(self, y):
        """4. Extract mel-spectrogram"""
        mel_spec = librosa.feature.melspectrogram(
            y=y, sr=self.sr, n_mels=self.n_mels,
            n_fft=self.n_fft, hop_length=self.hop_length
        )
        # Logarithmic scale
        mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)
        return mel_spec_db

    def standardize(self, mel_spec):
        """5. Normalization (mean 0, std 1)"""
        mean = np.mean(mel_spec)
        std = np.std(mel_spec)
        if std > 0:
            mel_spec_normalized = (mel_spec - mean) / std
        else:
            mel_spec_normalized = mel_spec - mean
        return mel_spec_normalized

    def process(self, audio_path, verbose=True):
        """Complete preprocessing pipeline"""
        if verbose:
            print(f"=== Audio Preprocessing Pipeline ===")
            print(f"Input: {audio_path}")

        # Load audio
        y, sr = librosa.load(audio_path, sr=self.sr)
        if verbose:
            print(f"0. Loaded: {len(y)} samples ({len(y)/sr:.2f}s)")

        # 1. Trim silence
        y = self.trim_silence(y)
        if verbose:
            print(f"1. After trimming: {len(y)} samples ({len(y)/sr:.2f}s)")

        # 2. RMS normalization
        y = self.rms_normalize(y)
        current_rms = np.sqrt(np.mean(y**2))
        if verbose:
            print(f"2. RMS normalized: {current_rms:.4f} (target: {self.target_rms})")

        # 3. Fixed-length adjustment
        y = self.fix_length(y)
        if verbose:
            print(f"3. Fixed length: {len(y)} samples ({len(y)/sr:.2f}s)")

        # 4. Extract mel-spectrogram
        mel_spec = self.extract_mel_spectrogram(y)
        if verbose:
            print(f"4. Mel-spectrogram: {mel_spec.shape}")

        # 5. Standardization
        mel_spec_normalized = self.standardize(mel_spec)
        if verbose:
            mean = np.mean(mel_spec_normalized)
            std = np.std(mel_spec_normalized)
            print(f"5. Standardized: mean={mean:.6f}, std={std:.6f}")

        return {
            'audio': y,
            'mel_spectrogram_raw': mel_spec,
            'mel_spectrogram': mel_spec_normalized,
            'shape': mel_spec_normalized.shape
        }

# Usage example
preprocessor = AudioPreprocessor(
    sr=22050,
    target_length=5.0,
    target_rms=0.1,
    n_mels=128
)

# Execute preprocessing
result = preprocessor.process('sample.wav')

# Visualization
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# 1. Waveform
librosa.display.waveshow(result['audio'], sr=preprocessor.sr, ax=axes[0])
axes[0].set_title('Preprocessed Waveform', fontsize=14)
axes[0].set_ylabel('Amplitude')

# 2. Raw mel-spectrogram
img1 = librosa.display.specshow(result['mel_spectrogram_raw'],
                                  sr=preprocessor.sr,
                                  hop_length=preprocessor.hop_length,
                                  x_axis='time', y_axis='mel',
                                  ax=axes[1], cmap='viridis')
axes[1].set_title('Mel-Spectrogram (dB scale)', fontsize=14)
axes[1].set_ylabel('Frequency (Mel)')
fig.colorbar(img1, ax=axes[1], format='%+2.0f dB')

# 3. Normalized mel-spectrogram
img2 = librosa.display.specshow(result['mel_spectrogram'],
                                  sr=preprocessor.sr,
                                  hop_length=preprocessor.hop_length,
                                  x_axis='time', y_axis='mel',
                                  ax=axes[2], cmap='viridis')
axes[2].set_title('Normalized Mel-Spectrogram', fontsize=14)
axes[2].set_ylabel('Frequency (Mel)')
axes[2].set_xlabel('Time (seconds)')
fig.colorbar(img2, ax=axes[2])

plt.tight_layout()
plt.savefig('preprocessing_pipeline.png', dpi=150, bbox_inches='tight')
print("\nFigure saved: preprocessing_pipeline.png")

# Batch processing
print("\n=== Batch Processing ===")
audio_files = ['sample1.wav', 'sample2.wav', 'sample3.wav']
batch_results = []

for audio_file in audio_files:
    try:
        result = preprocessor.process(audio_file, verbose=False)
        batch_results.append(result['mel_spectrogram'])
        print(f"βœ“ {audio_file}: {result['shape']}")
    except Exception as e:
        print(f"βœ— {audio_file}: Error ({e})")

# Convert to NumPy array (use as model input)
if batch_results:
    X = np.array(batch_results)
    print(f"\nBatch Shape: {X.shape}")
    print(f"  (Batch Size, Mel Bands, Time Frames)")
    print(f"\nThis data can be used as input to machine learning models.")

Sample Output:

=== Audio Preprocessing Pipeline ===
Input: sample.wav
0. Loaded: 66150 samples (3.00s)
1. After trimming: 55125 samples (2.50s)
2. RMS normalized: 0.1000 (target: 0.1)
3. Fixed length: 110250 samples (5.00s)
4. Mel-spectrogram: (128, 216)
5. Standardized: mean=0.000000, std=1.000000

Figure saved: preprocessing_pipeline.png

=== Batch Processing ===
βœ“ sample1.wav: (128, 216)
βœ“ sample2.wav: (128, 216)
βœ“ sample3.wav: (128, 216)

Batch Shape: (3, 128, 216)
  (Batch Size, Mel Bands, Time Frames)

This data can be used as input to machine learning models.

