Aeroacoustic air flow 562, 564
computational model of 570
fine structure in speech 566
for fricatives 570
for vowels 569
glottal flow waveform from 569
mechanical model of 567
mechanical model with separated jet 567
mechanical model with vortex 567, 569
multiple sources from 569, 729
Affricates 94
voiced and unvoiced 94
All-pass system 34
All-pole spectral modeling 177
in linear prediction analysis 177
for sinewave model residual 475
Analog-to-digital (A/D) converter 12
Analysis-by-synthesis 641, 648
Analytic signal 22
Anatomy of speech production 57
glottis 59
vocal folds (or vocal cords) 59
vocal tract 66
Articulation rate 97
voicing dependence of 97
Aspiration 64
Auditory masking
calculation of frequency-domain threshold 687
in frequency domain 685
in sinusoidal coding 652
in spectral subtraction 687
in subband coding 622
in Wiener filtering 675
relation to critical band 685
using time-domain dynamics 675
Auditory masking threshold, frequency-domain calculation of 687
Auditory modeling 401
adaptation in 408
AM-FM sinewave cochlear filter outputs in 404
auditory spectrum in 407
cochlear filters in 402
FM-to-AM transduction in 582
lateral inhibition in 406
phase synchrony in 405
phasic/tonic hypothesis in 408
place theory in 403
temporal theory in 405
time-frequency resolution in 402, 407
using wavelet transform 402
Auditory scene analysis, for speech enhancement 699
Auditory wavelet transform 397
Autocorrelation function 188
in linear prediction analysis 188
properties of 189
Toeplitz property of autocorrelation matrix 190
Autocorrelation method 184
in linear prediction analysis 184, 185
Autoregressive (AR) model 178
Bandwidth (of a discrete-time signal) 20
in terms of instantaneous amplitude and frequency 23
Bark scale, in auditory perception 686
Bilinear time-frequency distributions 549
Choi-Williams distribution 558
Cohen’s class 558
conditional average frequency of 550
conditional average variance of 550
group delay from 551
instantaneous frequency and bandwidth from 551
marginals of 549
multi-components of 552
proper distribution 549
spectrogram 553
speech analysis with 560
Wigner distribution 554
Binaural representations 682
for speech enhancement 682
Bit rate 600
Cepstral mean subtraction (CMS) 671
for mel-cepstrum 734
in speaker recognition 734, 736
Channel invariance
by formant AM-FM 747
by source onset times 746
feature property of 747
Coarticulation 94
Code-excited linear prediction (CELP) 649
Coding 598
statistical models in 598
Comb filtering, for speech enhancement 679
Companding, by μ-law transformation 610
Complex cepstrum 261
aliasing with DFT computation 269
aliasing with periodic sequences 278
linear phase contribution in 264
of minimum- and maximum-phase sequences 265
of pulse trains 266
of rational z-transforms 262
of short-time periodic sequences 276
window effects in frequency domain 280
window effects in quefrency domain 279
Concatenated tube model 137
acoustic transfer function of 147
discrete-time realization of 143
for vocal tract acoustic model 137
forward-backward traveling waves in 139
lossless vs. lossy 142
reflection coefficients of 139
Constant-Q analysis/synthesis 386
in wavelet transform 393
Continuous-time signal 11
sampling of 11
Continuous-to-discrete (C/D) converter 11
Convolution Theorem 29
Covariance method 184
for high-pitched speakers 227
in linear prediction analysis 184
Creaky voice 65
Critical band 685
relation to auditory masking 685
relation to bark scale 686
relation to mel scale 686
Delta cepstrum 735
in speaker recognition 736
Deterministic plus stochastic signal model 474
time-scale modification 478
Difference equation 33
z-transform representation of 33
Digital signal 12
relation to discrete-time signal 12
Diphthong 93
spectrogram of 93
Discrete-time Fourier transform 15
of discrete-time signals 19
properties of 16
Discrete-time signal 12
exponential sequence 13
frequency-domain representation of 15
impulse (or unit sample) sequence 13
sinusoidal sequence 13
Discrete-time system 14
causal 14
finite-impulse response (FIR) 37
frequency-domain representation of 28
infinite-impulse response (IIR) 37
linear 14
stable 14
time-invariant 14
Discrete Fourier transform (DFT) 41
Duration (of a discrete-time signal) 20
Dynamic range compression 473
of the spectrum 268
using sinewave analysis/synthesis 473
from harmonic oscillation 572
product of amplitude and frequency 572
Energy separation algorithm 577
application to speech resonance 580
constant amplitude and frequency estimation from 577
continuous-time 577
discrete-time 578
time-varying amplitude and frequency estimation from 577
Excitation waveform 55, 56, 71
sinewave model of 460
sinewave onset time of 461
sinewave phase of 460
Expectation-maximization (EM) algorithm 721, 752
for Gaussian mixture model 721, 752
Extrapolation, sinewave-based 506
Filter bank summation (FBS) method 321, 323
applied to temporally-filtered subbands 692
FBS constraint for 323
generalized FBS 324
phase adjustment in 365
reverberation in 366
sinewave output interpretation of 368
time-frequency sampling with 328
with multiplicative modification 351
FM-to-AM transduction 582
in auditory signal processing 582
Formant
frequency and bandwidth 67, 158
Frequency matching 443
birth-death process in 443
in sinusoidal analysis/synthesis 443
Fricative 71
source of 85
spectrogram of 88
Gammachirp 544
minimum uncertainty for time-scale representation 544
Gaussian function 544
minimum uncertainty for time-frequency representation 544
Gaussian mixture model 720
estimation by EM algorithm 721
for speaker identification 721
for speaker recognition 720
for speaker verification 724
Glottal closed-phase estimation 225
by formant modulation 225
by inverse filtering 225
Glottal flow waveform
for speaker recognition 725
Liljencrants-Fant model of 228
phases of 60
ripple in 224
Glottal flow waveform estimation 225
for high-pitch speakers 227
ripple in 225
with aspiration and ripple 228, 231
with autocorrelation method 225
with covariance method 225
with decomposition 228
with secondary glottal pulses 226
onset estimation 525
Glottis 59
aspiration at 64
whispering at 64
Group delay 551
from bilinear time-frequency distributions 551
Handset mapper 741
for speaker recognition 744, 745
Handset normalization (hnorm), for speaker recognition 744, 745
Hilbert transformer 22
Homomorphic deconvolution, see homomorphic filtering 257
Homomorphic filtering 257
in contrast to linear prediction analysis 293
lifter in 267
minimum- and zero-phase synthesis 289
of unvoiced speech 287
of voiced speech 282
spectral dynamic range compression in 268
spectral root 272
with mixed-phase synthesis 290
Homomorphic prediction 293
pole-zero estimation by 294
Homomorphic systems 256
canonic representation of 256
for convolution 257
generalized principle of superposition in 256
Instantaneous bandwidth 547
Instantaneous frequency 22, 368, 546
estimation by ridge tracking 547
from bilinear time-frequency distributions 551
from energy separation algorithm 577
from phase derivative 370
from Teager energy operator 573
of an analytic signal 22
Inverse filtering 224
for glottal flow waveform estimation 224
KING database 724
speaker recognition with 724
of vocal tract filter 200
Levinson recursion 194
association with concatenated tube 147
backward recursion 198
Liljencrants-Fant glottal flow model 228
estimation of 230
Linear channel distortion 733
effect in speaker recognition 733
Linear prediction analysis 177
autocorrelation function in 188
autocorrelation method 184, 185
covariance method 184
frequency-domain interpretation of 205, 213
frequency-domain spectral matching of 213
gain computation in 199
inverse filter in 180
Levinson recursion 194
linear predictor in 178
minimum-phase property 196
normal equations in 183
of a periodic sequence 192
pitch synchronous 224
prediction error in 179
Projection Theorem in 185
relation to lossless tube 203
speech coding by 636
stochastic formulation of 207
time-domain waveform matching of 210
with glottal flow contribution 197
with synthesis 216
Linear prediction coding (LPC) 636
line spectral frequencies in 640
multi-pulse coding 641
residual coding 640
vector quantization in 637
Line spectral frequencies (LSFs) 640
in speech coding 632, 637, 640
Magnitude-only estimation 342
from STFT magnitude 342
in time-scale modification 347
of nonlinear distortion 741
Marginals, of bilinear time-frequency distributions 549
Maximum a posteriori (MAP) estimation 682, 683, 699
for speech enhancement 683
Maximum-likelihood (ML) estimation 669, 682, 699
of STFT magnitude in additive noise 669
Mel-cepstrum 714
temporal resolution of 715
Mel scale 686
in auditory perception 686
in speaker recognition 713
Mel-scale filters 713
output energy of 713
relation to mel-cepstrum 714
Minimum-distance classifier 717
for speaker recognition 717
Minimum-mean-squared error (MMSE) estimation 682, 699
Minimum phase 34
minimum-phase sequence 34
Minimum-phase sequence, energy concentration of 35
Mismatched condition, in speaker recognition 733
Missing feature theory 747
in speaker recognition 747
Modulation 158
in formant frequency and bandwidth 158
Modulation spectrum, of filter-bank outputs 691
Morphing 456
using sinewave analysis/synthesis 456
Motor theory of speech perception 81, 100
Multiple sources 65, 66, 563, 569, 570
for speaker recognition 729
Multi-band excitation (MBE) speech representation 531
pitch and voicing estimation in 531
Multi-band excitation (MBE) vocoder 633
Multi-pulse linear prediction 641
parameter coding in 644
perceptual weighting filter in 644
Multi-resolution 388
for speech enhancement 699
in sinewave analysis 458
Nasal 84
source of 84
spectrogram of 84
Navier-Stokes equation 563
NIST evaluation databases 729
speaker recognition with 730, 744, 745
Noise reduction
by optimal spectral magnitude estimation 680
by Wiener filtering 672
musicality in spectral subtraction 350
optimal filtering for 349
phase from STFTM estimation 350
spectral subtraction method for 349
STFTM synthesis in 350
STFT synthesis in 349
Nonacoustic fluid motion, see Aeroacoustic air flow 562
Nonlinear channel distortion 737
effect in speaker recognition 737
effect of handset 740
magnitude-only estimation of 741
phantom formants 737
polynomial models 738
Nonlinear filtering, for speech enhancement 699
Nonlinearity 134
in vocal fold/vocal tract coupling 134
in vocal tract 134
NTIMIT database 723
speaker recognition with 724
Numerical simulation 47
of differential equations 47
of vocal tract acoustics 134
Onset time
in sinusoidal model 461, 523, 525
Overlap-add (OLA) method 325
time-frequency sampling with 328
with sinewaves 450
Peak continuation algorithm, in sinewave analysis/synthesis 475
Peak-to-rms ratio 471
minimum 471
Peak-to-rms reduction, using sinusoidal analysis/synthesis 471
Perception 99
acoustic cues 99
articulatory features in 101
models of 100
of vowels and consonants 99
Phantom formants 737
effect in speaker recognition 737
Phase coherence 363
by sinewave-based onset times 462
filter-bank-based 382
filter-bank-based onset times 383
for quasi-periodic waveforms 385
for transients 383
in sinewave analysis/synthesis 461
in sinewave-based time-scale modification 462
in sinusoidal analysis/synthesis 433, 454
shape invariance with sinewaves 461
temporal envelope with 381
Phase derivative 260
of STFT 370
Phase dispersion, optimal 471
Phase locking, see Phase synchrony
Phase interpolation, by cubic polynomial 446, 482
in sinusoidal analysis/synthesis 446, 482
Phase synchrony 383
from auditory neural discharge 405
in sinewave-based modification 467
Phase unwrapping 269
ambiguity in 259
for complex cepstrum 269
in sinusoidal analysis/synthesis 446
using the phase derivative 270
Phase vocoder 367
instantaneous invariance with 382
periodic input for 371
phase coherence in 363, 380, 381
quasi-periodic input for 372
reverberation in 363
speech coding with 375
time-scale modification with 377
Phoneme 79
relation to phone 79
Phonemics 57
Phonetics 57
acoustics 79
articulatory 79
articulatory features in 79
Piano signal, component separation of 476
Pitch 61
harmonics of 63
jitter and shimmer of 64
relation to fundamental frequency 61
Pitch estimation 504
autocorrelation-based 504
by comb filtering 508
evaluation by synthesis 522
for two voices 479
likelihood function in 507
maximum-likelihood 684
pitch-period doubling in 509, 514
time-frequency resolution in 519
using harmonic sinewave model 510, 511
using multi-band excitation speech representation 531
using waveform extrapolation 506
using wavelet transform 400
Pitch-synchronized overlap-add (PSOLA), see Synchronized overlap-add 346
Pitch synchronous analysis 224
for glottal flow waveform estimation 224
Plosive
source of 88
spectrogram of 90
voice bar with 89
Pole representation of a zero 177
Pole-zero estimation 220
applied to glottal flow waveform estimation 223
applied to speech 223
Shanks method of 222
Steiglitz method of 222
Pole-zero representations 27
of discrete-time signals 27
Power density spectrum 237
of linear time-invariant system output 238
of stochastic speech signals 207
relation to autocorrelation 237
Projection Theorem 184
in linear prediction analysis 184
Prosody 95
breath groups in 96
loudness in 95
pitch variation in 95
Quadrature signal representation 22
Quality 596
articulation index of 596
diagnostic acceptability measure of 596
diagnostic rhyme test of 596
intelligibility attribute of 596
speaker identifiability attribute of 596
subjective and objective testing of 596
Quantization, see Scalar and vector quantization 599
Random process 233
ergodic 236
mean and variance of 235
sample sequence and ensemble of 234
stationary 234
statistical independence of sample values 234
white 234
with uncorrelated sample values 236
Real cepstrum 261
of minimum- and maximum-phase sequences 265
quefrency for 261
RelAtive SpecTrAl (RASTA) processing 695
auditory motivation for 695
compared to cepstral mean subtraction (CMS) 695
for additive noise 697
for convolutional distortion 695
for mel-cepstrum 734
in speaker recognition 734, 736
Sampling Theorem 43
downsampling with 45
upsampling with 45
Scalar quantization 599
adaptive quantization in 610
bit rate in 600
differential quantization in 613
optimal (Max) quantizer 606
quantization noise in 602
signal-to-noise ratio (SNR) in 604
using companding 609
using μ-law transformation 610
Semi-vowel 93
glide 93
liquid 93
Sequence, see discrete-time signal 12
Short-time Fourier transform 310, 543
application to noise reduction 349
application to time-scale modification 345
basis representation 543
discrete form of 310
FBS synthesis with multiplicative modification 338
FBS synthesis from 321
filtering view of 314
Fourier transform view of 310
invertibility of 320
least-squared-error (LSE) synthesis with modification 340
OLA synthesis from 325
OLA synthesis with multiplicative modification 338
signal estimation from 337, 741
synthesis equation for 320
time-frequency resolution of 318
Short-time Fourier transform magnitude 330, 741
invertibility of 332
iterative least-squared-error (LSE) synthesis from 342
least-squared-error (LSE) synthesis with modification 342
sequential extrapolation from 334
time-frequency sampling with 334
Sinewave analysis/synthesis
adaptive phase smoothing in 473
birth-death process in 443, 451
compared with phase vocoder 380
constant-Q 458
cubic phase interpolation in 446, 482
frequency matching in 443
harmonic reconstruction 522
linear amplitude interpolation in 445
magnitude-only reconstruction 454
minimum-phase synthesis 530
morphing with 456
of unvoiced speech 439
of voiced speech 436
overlap-add synthesis 450
peak continuation algorithm in 475
peak-picking in 438
peak-to-rms reduction by 471
phase coherence in 433, 454, 461
phase dispersion by 472
phase unwrapping in 446
pitch estimation in 510
pitch modification by 729
relative onset times for time-scale modification 466
sound splicing with 456
spectral warping by 729
speech coding by 625
time-frequency resolution of 457
time-scale modification by 456, 729
time-scale modification with phase coherence 464, 465
time-varying time-scale modification 468
wavelet transform 458
whispering with 729
window shift requirement in 441
Sinusoidal analysis/synthesis, see Sinewave analysis/synthesis
Sinewave-based time-scale modification, with phase coherence 469
Sinusoidal coding 625
all-pole amplitude modeling in 632
cepstral modeling of sinewave amplitudes 630
cepstral transform coding 631
minimum-phase harmonic model for 625
multi-band excitation vocoder 633
postfiltering in 627
spectral warping in 630
Sinusoidal model 430
basic 430
derivation of 479
deterministic plus stochastic 474
harmonicity 464
phase dithering 464
residual 475
source/filter phase 461
speech-dependent 430
Sinusoidal representation, see Sinusoidal model 430
Sound splicing 456
using sinewave analysis/synthesis 456
Source (of vocal tract) 70
impulsive 70
noise 70
unvoiced 71
voiced 71
Speaker identification 709
by Gaussian mixture modeling 721
with minimum-distance classifier 724
with vector quantization 724
Speaker recognition 709
features used for 711
from coded speech 748
from speech coder parameters 749
from synthesized coded speech 748
speaker identification 709
speaker verification 709
under mismatched condition 733, 744, 752
with Gaussian mixture modeling 720
with minimum-distance classifier 717
with missing feature detection 747
with pitch modified speech 730
with spectral warping 733
with speech modification 729
with time-scale modified speech 730
with vector quantization 718
with whispered speech 729, 732
Speaker verification 709
by Gaussian mixture modeling 724
Spectral envelope estimation vocoder (SEEVOC) 527
spectral envelope estimate in 527
Spectral root deconvolution, see Spectral root homomorphic filtering 272
Spectral root homomorphic filtering 272
for rational z-transforms 273
synthesis with 292
Spectral subtraction
for speech enhancement 668
generalized 687
musicality artifact in 350, 671
suppression curve of 669
with auditory masking 687
Spectrogram 73
cross terms of 555
of diphthongs 93
of fricatives 88
of nasals 84
of plosives 90
of vowels 81
reading of 330
Speech coding 595
adaptive transform coding 624
by channel vocoder 625
by multi-pulse linear prediction 641
code-excited linear prediction (CELP) 649
linear prediction coding (LPC) 636
see also quantization 621
sinusoidal coding 625
subband coding 621
using analysis-by-synthesis 641, 648
using the phase vocoder 375
using wavelet transform 400
with speaker recognition 748
Speech enhancement 665
by advanced nonlinear filtering methods 699
by auditory scene analysis 699
by cepstral mean subtraction (CMS) 671
by comb filtering 679
by least-squared-error STFT magnitude estimation 680
by maximum a posteriori (MAP) estimation 683
by multi-resolution analysis 699
by RelAtive SpecTrAl (RASTA) 695
by spectral subtraction 668
by temporal processing of filter-bank outputs 691
by Wiener filtering 672
in additive noise 666
maximum-likelihood estimate of STFT magnitude 669
theoretical limit using noisy phase 667
using all-pole models 682
using binaural representations 682
using filter banks 667
using the STFT 667
with convolutional distortion 667
Speech modification 343
effect on speaker recognition 729
using STFT 345
using STFT magnitude 347
using synchronized overlap-add (SOLA) 346
with sinewave analysis/synthesis 729
Speech synthesis
alignment in mixed-phase synthesis 290
with homomorphic filtering 289
with linear prediction 218
with sinewave analysis/synthesis 442
with spectral root homomorphic filtering 292
Statistical models 598
histogram for 598
Subband coding 621
Sub-cepstrum 715
based on Teager energy 717
relation to mel-cepstrum 715
robustness of 747
temporal resolution of 715
Synchronized overlap-add (SOLA) 346
for time-scale modification 346
Teager energy operator 571
continuous-time 572
cross terms with multiple components 574
discrete-time 573
time-varying amplitude and frequency from 573
with sub-cepstrum 717
Telephone handset
carbon-button 741
effect in speaker recognition 744
electret 741
magnitude-only estimation of 741
nonlinear distortion in 740
Temporal processing 690
by RelAtive SpecTrAl (RASTA) 695
by Wiener filtering along filter-bank outputs 697
modulation spectrum along subbands 691
of filter-bank outputs 691
of nonlinearly transformed filter-bank outputs 693, 697
Time and frequency resolution 176
in auditory processing 402
in speech analysis 176
Time-bandwidth product 555
for conditional uncertainty principle 555
Time-scale modification 343
non-uniform rate change with 344
pitch-synchronized OLA (PSOLA) for 346
source/filter model for 344
STFTM synthesis for 347
STFT synthesis for 345
synchronized overlap-add (SOLA) 346
using sinewave analysis/synthesis 456, 461
using the phase vocoder 377
using wavelet transform 397
Time-varying system 38, 337, 338
frequency response of 40
Green’s function 39
in speech analysis 176
sample response of 39, 337, 352
vocal tract as 134
TIMIT database 723
speaker recognition with 724
Traveling wave 112
forward-backward components of 120
rarefaction and compression in 112
wavelength and frequency of 114
Truncation effect
from vocal fold/vocal tract interaction 155
in speech waveform 135
relation to formant frequency and bandwidth modulation 158
TSID database 736
speaker recognition with 736
Two-voice model 494
least-squared error sinewave solution 494
Two-voice separation, pitch estimation for 479
Uncertainty principle
conditional 555
discrete-time signal bandwidth 20
discrete-time signal duration 20
for time-frequency representation 20, 543
for time-scale representation 544
minimum uncertainty 543
time-frequency versus time-scale 544
Uniform tube 119
acoustic-electric analogy for 121
acoustic energy losses in 127
acoustic frequency response of 124
acoustic kinetic and potential energies in 124
acoustics of 119
with vocal tract boundary conditions 133
Vector quantization 616
comparison with scalar quantization 618
for speaker recognition 718
Vibrato 457
sinewave analysis/synthesis of 457
Vocal folds 59
pitch (or fundamental frequency) of 61
two-mass model of 60
voicing of 59
Vocal fold/vocal tract interaction 154
in frequency domain 158
nonlinear velocity/pressure relation in 154
truncation effect from 155
Vocal fry 65
Vocal tract 66
anti-resonances of 69
concatenated tube model of 137
discrete-time model of 151
lossless uniform-tube approximation of 127
lossy uniform-tube approximation of 127, 133, 134
minimum- and maximum-phase components of 153
nasal passage of 69
numerical simulation of acoustics in 134
oral tract of 66
pole-zero transfer function model of 153
resonances (formants) of 67
singing voice 69
velum of 66
Voice bar 89
Voice style 65
creaky 65
diplophonic 65
falsetto 66
stressed 96
vibrato 66
vocal fry 65
Voicing 59
vocal fold vibration in 59
Voicing detection 516
multi-band 533
sinewave-based 516
Vortex
in mechanical model 567
jet flow from 71
shedding 566
Vowel 81
nasalized 69
source of 81
spectrogram of 81
Wave equation 115
approximations leading to 118, 119
derivation of 115
Wavelet transform
auditory wavelets in 397
basis representation 543
constant-Q filter bank form 393
continuous 388
discrete 392
dyadic sampling of 393
energy density 543
invertibility from continuous 390, 391
invertibility from discrete 392
pitch estimation with 400
reconstruction from magnitude 398
reconstruction from maxima 395
scalogram for 390
sinewave analysis/synthesis 458
speech coding with 400
time-scale modification with 397, 398
Whisper 88
Wiener filter 672
adapted to spectral change 676
along filter-bank outputs 697
applied to speech 678
as a comb filter 679
constrained 683
iterative refinement 677
musicality artifact in 673
suppression curve of 672
with smoothing 674
with temporal auditory masking 675
Wigner distribution 554
cross terms of 555
properties of 554
relation to spectrogram 556
uncertainty principle for 555
Windowing (Modulation) Theorem 30
z-transform 23
of discrete-time signals 25
of discrete-time systems 33
properties of 24
rational 28
region of convergence 24