Chapter 14
Speaker Recognition

14.1 Introduction

One objective in automatic speaker recognition is to decide which voice model from a known set of voice models best characterizes a speaker; this task is referred to as speaker identification [10]. In the different task of speaker verification, the goal is to decide whether a speaker corresponds to a particular known voice or to some other unknown voice. A speaker known to a speaker recognition system who is correctly claiming his/her identity is labeled a claimant and a speaker unknown to the system who is posing as a known speaker is labeled an imposter. A known speaker is also referred to as a target speaker, while an imposter is alternately called a background speaker. There are two types of errors in speaker recognition systems: false acceptances, where an imposter is accepted as a claimant, and false rejections, where claimants are rejected as imposters.

This chapter focuses on the signal processing components of speaker identification and verification algorithms, illustrating how signal processing principles developed in the text are applied in speech-related recognition problems. In addition, we apply methods of speech modification, coding, and enhancement in a variety of speaker recognition scenarios, and introduce some new signal processing tools such as methods to compensate for nonlinear, as well as linear, signal distortion. Such compensation is particularly important when differences arise over channels from which data is collected and is then used for speaker recognition. Other recognition tasks such as speech and language recognition could also have been used to illustrate the use of signal processing in recognition, but these require a more statistical framework and would take us far beyond the signal processing theme of this book; numerous tutorials exist in these areas [29],[32],[59],[85].

The first step in a speaker recognition system, whether for identification or verification, is to build a model of the voice of each target speaker, as well as a model of a collection of background speakers, using speaker-dependent features extracted from the speech waveform. For example, the oral and nasal tract length and cross-section during different sounds, the vocal fold mass and shape, and the location and size of the false vocal folds, if accurately measured from the speech waveform, could be used as features in an anatomical speaker model. We call this the training stage of the recognition system, and the associated speech data used in building a speaker model is called the training data. During the recognition or testing stage, we then match (in some sense) the features measured from the waveform of a test utterance, i.e., the test data of a speaker, against speaker models obtained during training. The particular speaker models we match against, i.e., from target and background, depend on the recognition task. An overview of these components of a speaker recognition system for the verification task is given in Figure 14.1.

In practice, it is difficult to derive speech anatomy from the speech waveform.1 Rather, it is typical to use features derived from the waveform based on the various speech production and perception models that we have introduced in the text. The most common features characterize the magnitude of the vocal tract frequency response as viewed by the front-end stage of the human auditory system, assumed to consist of a linear (nearly) constant-Q filter bank. In Section 14.2, we give examples of these spectral-based features and discuss time- and frequency-resolution considerations in their extraction from the speech waveform. In Section 14.3, we next describe particular approaches to training and testing in speaker recognition systems, first minimum-distance and vector-quantization (VQ) methods, and then a more statistical pattern-recognition approach based on maximum-likelihood classification. Also in this section, we illustrate the loss in recognition accuracy that can occur when training and test data suffer from various forms of degradation. In Section 14.4, we investigate a feature class based on the source to the vocal tract, rather than its spectrum, using the glottal flow derivative estimates and parameterization of Chapter 5. In this section, we also present examples of speaker recognition performance that give insight into the relative importance of the vocal tract spectrum, source characteristics, and prosody (i.e., pitch intonation and articulation rate) in speaker recognition, both by machine and by human. In this development, we alter speech characteristics using the sinewave-based modification of Chapter 9, and evaluate recognition performance from the modified speech.

1 Nevertheless, there has been considerable effort in deriving physiological models of speech production, as with application to speech coding [74], which also holds promise for speaker recognition.

Figure 14.1 Overview of speaker recognition system for speaker verification. In this task, features extracted from test speech are matched against both target and background models.

Image

In Section 14.5, we introduce the channel mismatch problem in speaker recognition that occurs when training data is collected under a different condition (or “channel”) from the test data. For example, the training data may be essentially undistorted, being recorded with a high-quality microphone, while the test data may be severely degraded, being recorded in a harsh cellular environment. Section 14.5 provides different approaches to dealing with this challenging problem, specifically, signal processing for channel compensation, calling upon enhancement methods of Chapter 13 that address additive and convolutional distortion, including spectral subtraction, cepstral mean subtraction (CMS), and RelAtive SpecTrAl processing (RASTA). Also in this section, we introduce a compensation approach for nonlinear distortion, an algorithm for removing badly corrupted features, i.e., “missing features,” and the development of features that hold promise in being invariant under different degrading conditions. Finally, Section 14.6 investigates the impact of the various speech coding algorithms of Chapter 12 on speaker recognition performance, an increasingly important issue due to the broad range of digital communications from which recognition will be performed.

14.2 Spectral Features for Speaker Recognition

We have seen in Chapter 3 a variety of voice attributes that characterize a speaker. In viewing these attributes, both from the perspective of the human and the machine for recognition, we categorize speaker-dependent voice characteristics as “high-level” and “low-level.” High-level voice attributes include “clarity,” “roughness,” “magnitude,” and “animation” [60],[82]. Other high-level attributes are prosody, i.e., pitch intonation and articulation rate, and dialect. Voiers found that such high-level characteristics are perceptual cues in determining speaker identifiability [82]. On the other hand, these attributes can be difficult to extract by machine for automatic speaker recognition. In contrast, low-level attributes, being of an acoustic nature, are more measurable. In this chapter, we are interested in low-level attributes that contain speaker identifiability for the machine and, perhaps, as well as for the human. These attributes include primarily the vocal tract spectrum and, to a lesser extent, instantaneous pitch and glottal flow excitation, as well as temporal properties such as source event onset times and modulations in formant trajectories. In this section, we focus on features for speaker recognition derived from spectral measurements, and then later in the chapter branch out to a non-spectral characterization of a speaker.

14.2.1 Formulation

In selecting acoustic spectral features, we want our feature set to reflect the unique characteristics of a speaker. The short-time Fourier transform (STFT) is one basis for such features. The STFT can be written in polar form as

X(n, ω) = |X(n, ω)| e^{j∠X(n, ω)}

In speaker recognition, only the magnitude component |X (n, ω)| has been used because features corresponding to the phase component are difficult to measure and are susceptible to channel distortion2. We have seen in Chapter 5 that the envelope of the speech STFT magnitude is characterized in part by the vocal tract resonances, i.e., the poles, that are speaker-dependent (as well as phoneme-dependent), determined by the vocal tract length and spatially-varying cross-section. In addition to the vocal tract resonances, its anti-resonances, i.e., its zeros, also contribute to the STFT magnitude envelope. Consider, as an example, nasalized vowels for which anti-resonances arise when energy is absorbed by a closed nasal cavity and open velum. Because the nasal cavity is fixed, the zeros during nasalized vowels introduced by this cavity may be a particularly important contribution to the envelope for speaker identifiability. A similar argument holds for nasal sounds for which the fixed nasal cavity is open and the oral cavity closed, together coupled by an open velum. Indeed, the resonances of the nasal passage, being fixed, were found by Sambur to possess considerable speaker identifiability, relative to other phoneme spectral resonances, for a certain group of speakers and utterances [69]. (A complete study of the relative importance of resonances and anti-resonances of all sounds has yet to be made.) In addition to the resonant and anti-resonant envelope contributions, the general trend of the envelope of the STFT magnitude (referred to in Example 3.2 of Chapter 3 as the “spectral tilt”) is influenced by the coarse component of the glottal flow derivative. Finally, the STFT magnitude is also characterized by speaker-dependent fine structure including pitch, glottal flow components, and the distributed aeroacoustic effects of Chapter 11.

2 We saw in Chapter 13 that speech enhancement techniques typically do not reduce phase distortion. Nonetheless, phase may have a role in speaker identifiability. Indeed, we have seen in earlier chapters that phase reflects the temporal nature of the glottal flow and the vocal tract impulse response shape. These temporal properties may be associated with unique, speaker-dependent anatomical characteristics.

Most speaker recognition systems derive their features from a smooth representation of the STFT magnitude, assumed to reflect primarily vocal tract resonances and spectral tilt. We have seen a number of smooth spectral representations in this text, including ones obtained from linear prediction all-pole modeling (Chapter 5), homomorphic filtering (Chapter 6), and the SEEVOC representation (Chapter 10). For the linear prediction method, we can use the linear prediction coefficients or some transformation thereof as a feature vector. For the homomorphic filtering estimate, a liftered real cepstrum can serve as a feature vector. Another possibility is to combine approaches. For example, the first p values of the real cepstrum of a pth-order linear predictor uniquely characterize the predictor (Exercise 6.13) and serve as a feature vector. Conversely, one can obtain an all-pole spectrum of the homomorphically deconvolved vocal tract spectrum, a method referred to in Chapter 6 as “homomorphic prediction.” Similarly, the SEEVOC representation serves as a basis for an all-pole feature set. Although such features have had some success in speaker recognition algorithms [4], they tend to be inferior to those that are derived using auditory-based principles of speech processing.

In this section, we study two different spectral-based feature sets for speaker recognition that exploit front-end auditory filter-bank models. These features are referred to as the mel-cepstrum and the sub-cepstrum and provide not only a means to study useful speech features for speaker recognition, but also an illustrative comparison of time-frequency tradeoffs in feature selection.

14.2.2 Mel-Cepstrum

The mel-cepstrum, introduced by Davis and Mermelstein [7], exploits auditory principles, as well as the decorrelating property of the cepstrum. In addition, the mel-cepstrum is amenable to compensation for convolutional channel distortion. As such, the mel-cepstrum has proven to be one of the most successful feature representations in speech-related recognition tasks. In this section, and others to follow, we study these important properties of the mel-cepstrum.

Figure 14.2 Comparison of computations required by (a) the mel-cepstrum and an equivalent structure with (b) the sub-cepstrum.

Image

The mel-cepstrum computation is illustrated in Figure 14.2. The speech waveform is first windowed with analysis window w[n] and the discrete STFT, X(n, ωk), is computed:

X(n, ω_k) = Σ_{m=−∞}^{∞} x[m] w[n − m] e^{−jω_k m}

where ω_k = 2πk/N with N the DFT length. The magnitude of X(n, ωk) is then weighted by a series of filter frequency responses whose center frequencies and bandwidths roughly match those of the auditory critical band filters. We saw in Chapter 13 that these filters follow the mel-scale whereby band edges and center frequencies of the filters are linear for low frequency (< 1000 Hz) and logarithmically increase with increasing frequency. We thus call these filters mel-scale filters and collectively a mel-scale filter bank. An example of a mel-scale filter bank used by Davis and Mermelstein [7] (in a speech recognition task) is illustrated in Figure 14.3; this filter bank, with 24 triangularly-shaped frequency responses, is a rough approximation to actual auditory critical-band filters covering a 4000 Hz range.

Figure 14.3 Triangular mel-scale filter bank used by Davis and Mermelstein [7] in determining spectral log-energy features for speech recognition. The 24 filters follow the mel-scale whereby band edges and center frequencies of these filters are linear for low frequency and logarithmically increase with increasing frequency, mimicking characteristics of the auditory critical bands. Filters are normalized according to their varying bandwidth.

Image
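The construction of such a filter bank is simple enough to sketch in code. The following numpy fragment builds triangular filters with mel-spaced center frequencies; the explicit mel warping formula 2595·log10(1 + f/700), the 8000-Hz sampling rate, and the DFT length are conventional choices assumed here for illustration rather than values specified in the text.

import numpy as np

def mel_filterbank(n_filters=24, n_fft=512, fs=8000, f_lo=0.0, f_hi=4000.0):
    """Triangular mel-scale filter bank V_l(w_k) sampled on the DFT grid."""
    # Conventional mel warping (an assumption; the text only specifies linear
    # spacing below about 1000 Hz and logarithmic spacing above).
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Band edges equally spaced on the mel scale; adjacent filters overlap.
    edges_hz = mel2hz(np.linspace(hz2mel(f_lo), hz2mel(f_hi), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges_hz / fs).astype(int)

    V = np.zeros((n_filters, n_fft // 2 + 1))
    for l in range(n_filters):
        lo, ctr, hi = bins[l], bins[l + 1], bins[l + 2]
        V[l, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)  # rising edge
        V[l, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)  # falling edge
    return V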

The next step in determining the mel-cepstrum is to compute the energy in the STFT weighted by each mel-scale filter frequency response. Denote the frequency response of the lth mel-scale filter as Vl(ω). The resulting energies are given for each speech frame at time n and for the lth mel-scale filter as

(14.1)

E_mel(n, l) = (1/A_l) Σ_{k=L_l}^{U_l} |V_l(ω_k) X(n, ω_k)|^2

where Ll and Ul denote the lower and upper frequency indices over which each filter is nonzero and where

A_l = Σ_{k=L_l}^{U_l} |V_l(ω_k)|^2

which normalizes the filters according to their varying bandwidths so as to give equal energy for a flat input spectrum [7],[60]. For simplicity of notation, we have (as we will throughout this chapter unless needed) removed reference to the time decimation factor L.

The real cepstrum associated with Emel(n, l) is referred to as the mel-cepstrum and is computed for the speech frame at time n as

C_mel[n, m] = (1/R) Σ_{l=0}^{R−1} log{E_mel(n, l)} cos(2πlm/R),   m = 0, 1, …, R − 1

where R is the number of filters and where we have used the even property of the real cepstrum to write the inverse transform in terms of the cosine basis, sometimes referred to as the discrete cosine transform.3 In the context of recognition algorithms, an advantage of the discrete cosine transform is that it is close to the Karhunen-Loeve transform [84] and thus it tends to decorrelate the original mel-scale filter log-energies. That is, the cosine basis has a close resemblance to the basis of the Karhunen-Loeve transform that results in decorrelated coefficients. Decorrelated coefficients are often more amenable to probabilistic modeling than are correlated coefficients, particularly to Gaussian mixture models that we will study shortly; indeed, speaker recognition with the Gaussian mixture model approach typically gives better performance with the mel-cepstrum than with the companion mel-scale filter log-energies [60].

3 In the original work of Davis and Mermelstein [7], the discrete cosine transform was written as C_mel[n, m] = Σ_{l=1}^{R} log{E_mel(n, l)} cos[m(l − 1/2)π/R].
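As a concrete illustration of Equation (14.1) followed by the cosine transform, the sketch below computes mel-cepstral coefficients for one windowed frame, assuming the mel_filterbank helper given earlier; the DCT-II indexing and the small flooring constant eps are common implementation choices rather than the exact cosine basis written above, and the 0th coefficient is dropped as is done later in this chapter.

import numpy as np

def mel_cepstrum(frame, V, n_ceps=19, eps=1e-10):
    """Mel-cepstrum of one windowed frame: Eq. (14.1) energies, then a DCT."""
    n_fft = 2 * (V.shape[1] - 1)
    X = np.fft.rfft(frame, n_fft)                      # discrete STFT (one frame)
    A = np.sum(V ** 2, axis=1)                         # bandwidth normalization A_l
    E_mel = np.sum((V * np.abs(X)) ** 2, axis=1) / A   # filter energies E_mel(n, l)
    log_E = np.log(E_mel + eps)
    R = len(log_E)                                     # number of filters
    m = np.arange(1, n_ceps + 1)                       # drop the 0th coefficient
    dct_basis = np.cos(np.pi * np.outer(m, np.arange(R) + 0.5) / R)
    return dct_basis @ log_E                           # mel-cepstral coefficients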

Observe that we are not accurate in calling the STFT magnitude weighting by |Vl(ωk)|, as given in Equation (14.1), a “filtering” operation on the signal x[n]. This perspective has an important implication for time resolution in our analysis. To understand this point, recall from Chapter 7 that the STFT can be viewed as a filter bank:

(14.2)

X(n, ω_k) = e^{−jω_k n} (x[n] * h_k[n])

where hk[n] = w[n]e^{jωkn}, i.e., for each frequency ωk the STFT is the output of a filter whose impulse response is the analysis window modulated to frequency ωk (Figure 14.2). It follows from the filtering view of the STFT in Equation (14.2) that the speech signal x[n] is first filtered by hk[n], which has a fixed, narrow bandwidth. The length of the filter hk[n] equals the length of the analysis window w[n], typically about 20 ms in duration. The temporal resolution of the mel-scale filter energy computation in Equation (14.1), therefore, is limited by the STFT filters hk[n], together with the STFT magnitude operation, in spite of the logarithmically increasing bandwidths of the mel-scale filter frequency responses Vl(ω).4

4 Viewing the mel-scale filter weighting as a multiplicative modification to the STFT, i.e., Vl(ωk)X(n, ωk), and using the filter bank summation method of Chapter 7, we obtain a sequence x[n] * (w[n]vl[n]). In synthesis, then, assuming vl[n] is causal with length less than the window length, we recover temporal resolution through the product w[n]vl[n] (corresponding to summing over multiple filters hk[n] = w[n]e^{jωkn}). Nevertheless, this resolution is not realized in the mel-scale filter energies because the STFT phase is discarded in the magnitude operation.
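The filtering view in Equation (14.2) can be checked numerically. The short script below, written under the STFT convention given earlier in this section, verifies at one time instant that the sliding-window transform value equals the demodulated output of the filter hk[n] = w[n]e^{jωkn}; the signal, window, and frequency bin are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(400)                  # arbitrary test signal
w = np.hamming(161)                           # analysis window w[n]
N, k = 512, 37                                # DFT length and an arbitrary bin
wk = 2.0 * np.pi * k / N

def stft_direct(n):
    """X(n, w_k) = sum_m x[m] w[n - m] exp(-j w_k m), computed by brute force."""
    total = 0.0 + 0.0j
    for m in range(len(x)):
        if 0 <= n - m < len(w):
            total += x[m] * w[n - m] * np.exp(-1j * wk * m)
    return total

# Filtering view: pass x[n] through h_k[n] = w[n] exp(j w_k n), then demodulate.
h = w * np.exp(1j * wk * np.arange(len(w)))
y = np.convolve(x, h)                         # y[n] = (x * h_k)[n]
n0 = 200
assert np.allclose(stft_direct(n0), np.exp(-1j * wk * n0) * y[n0])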

14.2.3 Sub-Cepstrum

An alternate method to compute spectral features, addressing the limited temporal resolution of the mel-scale filter energies and better exploiting auditory principles, is to convolve the mel-scale filter impulse responses directly with the waveform x[n] [14], rather than applying the mel-scale frequency response as a weighting to the STFT magnitude. The result of this convolution can be expressed as

(14.3)

x_l[n] = x[n] * υl[n] = Σ_m υl[m] x[n − m]

where υl[n] denotes the impulse response corresponding to the frequency response,5 Vl(ω), of the lth mel-scale filter centered at frequency ωl. The mel-scale filter impulse responses are constructed in analytic signal form [Chapter 2 (Section 2.5)] by requiring Vl(ω) = 0 for –π ≤ ω < 0 so that |x_l[n]| provides a temporal envelope for each mel-scale filter output. When invoking the convolution in Equation (14.3), we refer to υl[n] as a subband filter. The energy of the output of the lth subband filter can be taken as simply |x_l[n]|^2 or as a smoothed version of |x_l[n]|^2 over time [14], i.e.,

5The phase is selected to mimic that of auditory filters, either physiological or psychoacoustic, that are typically considered to be minimum-phase [47].

E_sub(n, l) = Σ_m p[n − m] |x_l[m]|^2

using a smoothing filter p[n], as illustrated in Figure 14.2, and where time is typically downsampled at the analysis frame rate, i.e., n = pL. The real cepstrum of the energies Esub(n, l) for l = 0, 1, …, R − 1, with R equal to the number of filters, is referred to as the subband cepstrum [14] and is written as

C_sub[n, m] = (1/R) Σ_{l=0}^{R−1} log{E_sub(n, l)} cos(2πlm/R),   m = 0, 1, …, R − 1

where, as with the mel-cepstrum, we have exploited the symmetry of the real cepstrum.
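A rough end-to-end sketch of the sub-cepstrum computation is given below. The analytic subband filters used here are gammatone-like complex-exponential filters on a mel-spaced grid, standing in for the minimum-phase filters derived from |Vl(ω)| in the text; the bandwidths, filter lengths, smoothing window, and frame interval are illustrative assumptions.

import numpy as np

def sub_cepstrum(x, fs=8000, n_filters=24, n_ceps=19, frame_interval=0.010,
                 smooth_ms=20.0):
    """Sub-cepstrum sketch: complex (analytic) subband filters applied to x[n]."""
    x = np.asarray(x, dtype=float)
    L = int(round(frame_interval * fs))          # analysis frame interval (10 ms)
    p = np.hanning(int(round(smooth_ms * 1e-3 * fs)))
    p /= p.sum()                                 # smoothing filter p[n]

    # Mel-spaced center frequencies (conventional warping formula assumed).
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    centers = mel2hz(np.linspace(hz2mel(100.0), hz2mel(3800.0), n_filters))

    E = []
    for fc in centers:
        bw = 0.15 * fc + 50.0                    # bandwidth grows with frequency
        t = np.arange(int(4.0 * fs / bw)) / fs   # shorter filters at high frequency
        v = (t ** 3) * np.exp(-2 * np.pi * bw * t) * np.exp(2j * np.pi * fc * t)
        y = np.convolve(x, v, mode="same")       # subband output x_l[n]
        env = np.convolve(np.abs(y) ** 2, p, mode="same")   # smoothed energy
        E.append(env[::L])                       # downsample to the frame rate
    log_E = np.log(np.array(E) + 1e-10)          # shape: (n_filters, n_frames)

    m = np.arange(1, n_ceps + 1)                 # drop the 0th coefficient
    basis = np.cos(np.pi * np.outer(m, np.arange(n_filters) + 0.5) / n_filters)
    return basis @ log_E                         # shape: (n_ceps, n_frames)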

The energy trajectories of the subband filters, Esub(n, l), can capture more temporal characteristics of the signal than the mel-scale filter energies, particularly for high frequencies. This is because, unlike in the mel-scale filtering operation, the short-duration, high-frequency subband filters are applied directly to the signal x[n]. Therefore, subband filters operate closer to front-end auditory filters than do mel-scale filters, and likewise give more auditory-like temporal characteristics. Although the smoothing filter p[n] introduces a loss of temporal resolution, its duration can be chosen such that, together with the duration of υl[n], more auditory-like temporal resolution results than can be achieved with a typical analysis window required in the first stage of computing the mel-scale filter energies. The difference in temporal resolution of the mel-scale filter and subband filter energies is illustrated in the following example:

Example 14.1       Figure 14.4 shows the output energies from the mel-scale and subband filters for a speech segment that includes a voiced region and a voiced/unvoiced transition. For each filter bank, the energy functions are shown for two filters, one centered at about 200 Hz and the other at about 3200 Hz. The time-domain subband filter impulse responses are minimum phase and are derived from |Vl(ω)| through the minimum-phase construction method introduced in Chapter 6 (Section 6.7.1). The Hamming window duration used in the STFT of the mel-scale filter bank is 20 ms and is about equal to the length of the subband filters in the band below 1000 Hz. The length of the subband filters above 1000 Hz decreases with increasing filter center frequency. In comparing the time-frequency resolution properties of the mel-scale and subband filter bank energy trajectories [Figure 14.4(b-c) versus Figure 14.4(d-e)], we see that the subband filter energies more clearly show speech transitions, periodicity, and short-time events, particularly in the high-frequency region. In this example, the subband filter energies have not been smoothed in time. Exercise 14.14 further explores the relative limits of the temporal resolution for the two filter-bank configurations.

Figure 14.4 Energies from mel-scale and subband filter banks: (a) speech waveform; (b)–(c) mel-scale filter energy from filter number 22 (≈ 3200 Hz) and filter number 2 (≈ 200 Hz); (d)–(e) subband filter energy from filter number 22 (≈ 3200 Hz) and filter number 2 (≈ 200 Hz).

Image

Other variations of the mel-cepstrum and sub-cepstrum based on the mel-scale filter and subband filter energy representations are possible. For example, a cepstral feature set is based on the Teager energy operator [30],[36],[37] [see Chapter 11 (Section 11.5)] applied to the subband filter outputs [25]. The primary difference between the sub-cepstrum and the Teager-energy-based sub-cepstrum is the replacement of the conventional mean-squared amplitude energy with the Teager energy computed with the 3-point operator Ψ{x[n]} = x^2[n] − x[n−1]x[n+1]. The Teager energy, being proportional to the squared product of amplitude and frequency for an AM-FM sinewave input, may provide a more “information-rich” feature representation than that of conventional energy that involves only amplitude modulation [26],[27].
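The 3-point operator quoted above takes one line of code. The fragment below also checks the closed-form value Ψ{A cos(ω0 n)} = A^2 sin^2(ω0), which for small ω0 is approximately (Aω0)^2, i.e., squared amplitude times squared frequency.

import numpy as np

def teager(x):
    """3-point Teager energy operator: Psi{x[n]} = x^2[n] - x[n-1] * x[n+1]."""
    x = np.asarray(x, dtype=float)
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]            # simple end-point handling
    return psi

# For a pure sinewave A cos(w0 n), the operator returns A^2 sin^2(w0) exactly.
n = np.arange(200)
x = 0.5 * np.cos(0.2 * np.pi * n)
print(teager(x)[100], 0.5 ** 2 * np.sin(0.2 * np.pi) ** 2)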

14.3 Speaker Recognition Algorithms

In this section, we describe three approaches to speaker recognition of increasing complexity and accuracy [5],[60].

14.3.1 Minimum-Distance Classifier

Suppose in a speaker recognizer, we obtain a set of features on each analysis frame from training and test data. We refer to the feature set on each frame as a feature vector. One of the simplest approaches to speaker recognition is to compute the average of feature vectors over multiple analysis frames for speakers from testing and training data and then find the distance (in some sense) between these average test and training vectors [5],[44],[60]. In speaker verification, we set a distance threshold below which we “detect” the claimant speaker; in identification, we pick the target speaker with the smallest distance from that of the test speaker.

As an example of a feature vector, consider the average of mel-cepstral features for the test and training data:

C̄_mel^{ts}[m] = (1/M) Σ_{p=0}^{M−1} C_mel^{ts}[pL, m]

and

C̄_mel^{tr}[m] = (1/M) Σ_{p=0}^{M−1} C_mel^{tr}[pL, m]

where the superscripts ts and tr denote test and training data, respectively, M is the number of analysis frames (which generally differs between training and testing), and L is the frame interval. We can then form, as a distance measure, the mean-squared difference between the average testing and training feature vectors expressed as

D = [1/(R − 1)] Σ_{m=1}^{R−1} (C̄_mel^{ts}[m] − C̄_mel^{tr}[m])^2

where R is the number of mel-cepstral coefficients, i.e., the length of our feature vector, and where the 0th value of the cepstral difference is not included due to its sensitivity to scale changes. In the speaker verification task, the speaker is detected when D falls below some threshold, i.e.,

if D < T, then speaker present.

We call this recognition algorithm the minimum-distance classifier, a nomenclature that more appropriately applies to the alternate speaker identification task in which a speaker is chosen from a set of target speakers to have minimum distance to the average feature vector of the test speaker.
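A minimal sketch of the minimum-distance classifier follows, assuming the mel-cepstra for an utterance are already stacked into an array of shape (number of frames, R); it mirrors the averaging, the exclusion of the 0th coefficient, and the below-threshold detection rule described above.

import numpy as np

def average_cepstrum(C):
    """Average feature vector; C holds one utterance's mel-cepstra, (n_frames, R)."""
    return C.mean(axis=0)

def cepstral_distance(c_test, c_train):
    """Mean-squared difference, excluding the scale-sensitive 0th coefficient."""
    d = c_test[1:] - c_train[1:]
    return np.mean(d ** 2)

def verify(c_test, c_train, threshold):
    """Speaker verification: accept the claimant when the distance is below T."""
    return cepstral_distance(c_test, c_train) < threshold

def identify(c_test, train_averages):
    """Speaker identification: pick the target with minimum distance to the test."""
    return int(np.argmin([cepstral_distance(c_test, c) for c in train_averages]))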

14.3.2 Vector Quantization

A problem with the minimum-distance classifier is that it does not distinguish between acoustic speech classes, i.e., it uses an average of feature vectors per speaker computed over all sound classes. Individual speech events are blurred. It seems reasonable then that we could do better if we average feature vectors over distinct sound classes, e.g., quasi-periodic, noise-like, and impulse-like sounds, or even finer phonetic categorization within these sound classes, compute a distance with respect to each sound class, and then average the distances over classes. This would reduce, for example, the phonetic differences in the feature vectors, to which we alluded earlier, and help focus on speaker differences. We illustrate this categorization, again using the example of mel-cepstral features.

Suppose we know a priori the speech segments corresponding to the sound classes in both the training and test data. We then form averages for the ith class as

C̄_i^{ts}[m] = (1/M_i^{ts}) Σ_{n ∈ class i} C^{ts}[n, m]

and

C̄_i^{tr}[m] = (1/M_i^{tr}) Σ_{n ∈ class i} C^{tr}[n, m]

where for convenience we have removed the “mel” notation and where M_i^{ts} and M_i^{tr} denote the number of test and training frames, respectively, in the ith class. We then compute a Euclidean distance with respect to each class as

D_i = Σ_{m=1}^{R−1} (C̄_i^{ts}[m] − C̄_i^{tr}[m])^2

Finally, we average over all classes as

D = (1/I) Σ_{i=1}^{I} D_i

where I is the number of classes. To form this distance measure, we must identify class distinctions for the training and test data.

One approach to segmenting a speech signal in terms of sound classes is through speech recognition on a phoneme or word level. A Hidden Markov Model (HMM) speech recognizer,6 for example, yields a segmentation of the training and testing utterances in terms of desired acoustic phonetic or word classes [8],[22],[29],[59], and can achieve good speaker recognition performance [71]. An alternative approach is to obtain acoustic classes without segmentation of an utterance and without connecting sound classes to specific phoneme or word categories; there is no identifying or labeling of acoustic classes. One technique to achieve such a sound categorization is through the vector-quantization (VQ) method described in Chapter 12, using the k-nearest neighbor clustering algorithm [5],[21],[60],[77]. Each centroid in the clustering is derived from training data and represents an acoustic class, but without identification or labeling. The distance measure used in the clustering is given by the Euclidean distance (as above) between a feature vector and a centroid. In recognition, i.e., in testing, we pick a class for each feature vector by finding the minimum distance with respect to the various centroids from the training stage. We then compute the average of the minimum distances over all test feature vectors. In speaker identification, for example, we do this for each known speaker and then pick a speaker with the smallest average minimum distance. The training (clustering) and testing (recognition) stages of the VQ approach are illustrated in Figure 14.5.

6 The highly popular and successful HMM speech recognizer is based on probabilistic modeling of “hidden” acoustic states (classes) and transitions between states through Markovian class constraints [8],[22],[29],[59]. The states are hidden because only measured feature vectors associated with underlying states are observed. Unlike a Gaussian Mixture Model (GMM) to be described in Section 14.3.3, the HMM imposes a temporal order on acoustic classes.
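The VQ training and testing stages can be sketched with a simple k-means style codebook, one common way of implementing the clustering described above; the codebook size, iteration count, and initialization are illustrative choices, and the training data is assumed to contain at least as many frames as centroids.

import numpy as np

def train_codebook(features, n_centroids=64, n_iter=20, seed=0):
    """k-means (VQ) codebook for one speaker; features: (n_frames, R)."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), n_centroids, replace=False)].copy()
    for _ in range(n_iter):
        d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                    # nearest-centroid assignment
        for k in range(n_centroids):
            if np.any(labels == k):
                centroids[k] = features[labels == k].mean(axis=0)
    return centroids

def vq_distortion(test_features, centroids):
    """Average minimum distance of test vectors to a speaker's codebook."""
    d = np.linalg.norm(test_features[:, None, :] - centroids[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify_vq(test_features, codebooks):
    """Choose the speaker whose codebook gives the smallest average distortion."""
    return int(np.argmin([vq_distortion(test_features, cb) for cb in codebooks]))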

We can think of this VQ approach as making “hard” decisions in that a single class is selected for each feature vector in testing. An alternative is to make “soft” decisions by introducing probabilistic models using a multi-dimensional probability density function (pdf) of feature vectors. The components of the pdf, together sometimes referred to as a pdf mixture model, represent the possible classes. Each feature vector is given a “soft” decision with respect to each mixture component. In the next section, we look at the Gaussian pdf mixture model commonly used in a maximum-likelihood approach to recognition.

14.3.3 Gaussian Mixture Model (GMM)

Speech production is not “deterministic” in that a particular sound (e.g., a phone) is never produced by a speaker with exactly the same vocal tract shape and glottal flow, due to context, coarticulation, and anatomical and fluid dynamical variations. One way to represent this variability is probabilistically through a multi-dimensional Gaussian pdf [60],[61]. The Gaussian pdf is state-dependent in that a different Gaussian pdf is assigned to each acoustic sound class. We can think of these states at a very broad level such as quasi-periodic, noise-like, and impulse-like sounds or on a very fine level such as individual phonemes. The Gaussian pdf of a feature vector x for the ith state is written as

b_i(x) = 1 / ((2π)^{R/2} |Σ_i|^{1/2}) exp{ −(1/2) (x − μ_i)^T Σ_i^{−1} (x − μ_i) }

Figure 14.5 Pictorial representation of (a) training and (b) testing in speaker recognition by VQ. In this speaker identification example, in testing, the average minimum distance of test vectors to centroids is lower for speaker A than for speaker B.

Image

where μi is the state mean vector, Σi is the state covariance matrix, and R is the dimension of the feature vector. The superscript T denotes the transpose of the vector (x − μi), and |Σi| and Σi^{−1} indicate the determinant and inverse, respectively, of the matrix Σi. The mean vector μi is the expected value of the elements of the feature vector x, while the covariance matrix Σi represents the cross-correlations (off-diagonal terms) and the variances (diagonal terms) of the elements of the feature vector.

The probability of a feature vector being in any one of I states (or acoustic classes), for a particular speaker model, denoted by λ, is represented by the union, or mixture, of different Gaussian pdfs:

(14.4)

p(x|λ) = Σ_{i=1}^{I} p_i b_i(x)

where the bi(x) are the component mixture densities and the pi are the mixture weights, as illustrated in Figure 14.6. Given that each individual Gaussian pdf integrates to unity, we constrain the weights to satisfy Σ_{i=1}^{I} p_i = 1 to ensure that the mixture density represents a true pdf (Exercise 14.1). The speaker model λ then represents the set of GMM mean, covariance, and weight parameters, i.e.,

λ = {p_i, μ_i, Σ_i},   i = 1, 2, …, I

We can interpret the GMM as a “soft” representation of the various acoustic classes that make up the sounds of the speaker; each component density can be thought of as the distribution of possible feature vectors associated with each of the I acoustic classes, each class representing possibly one speech sound (e.g., a particular phoneme) or a set of speech sounds (e.g., voiced or unvoiced). Because only the measured feature vectors are available, we can think of the acoustic classes as being “hidden” processes,7 each feature vector being generated from a particular class i with probability pi on each analysis frame [60]. However, for some specified number of mixtures in the pdf model, generally one cannot make a strict relation between component densities and specific acoustic classes [60].

7Therefore, conceptually, the GMM is similar to an HMM, except that there is no constrained temporal relation of acoustic classes in time through a Markovian model of the process [59]. In other words, the GMM does not account for temporal ordering of feature vectors.

Figure 14.6 The Gaussian mixture model is a union of Gaussian pdfs assigned to each acoustic state.

Image

Regardless of an acoustic class association, however, a second interpretation of the GMM is a functional representation of a pdf. The GMM, being a linear combination of Gaussian pdfs, has the capability to form an approximation to an arbitrary pdf for a large enough number of mixture components. For speech features, typically having smooth pdfs, a finite number of Gaussians (e.g., 8–64) is sufficient to form a smooth approximation to the pdf. The modeling of a pdf takes place through appropriate selection of the means, (co)variances, and probability weights of the GMM. Diagonal covariance matrices Σi are often sufficient for good approximations and significantly reduce the number of unknown variables to be estimated.

Now that we have a probabilistic model for the feature vectors, we must train, i.e., estimate, parameters for each speaker model, and then classify the test utterances (for speaker verification or identification). As we did with VQ classification, we form clusters within the training data, but now we go a step further and represent each cluster with a Gaussian pdf, the union of Gaussian pdfs being the GMM. One approach to estimating the GMM model parameters is maximum-likelihood estimation (Appendix 13.A), i.e., maximizing, with respect to λ, the conditional probability p(X|λ), where X denotes the collection of all training feature vectors for a particular speaker. An important property of maximum-likelihood estimation is that for a large enough set of training feature vectors, the model estimate converges (as the data length grows) to the true model parameters; however, there is no closed-form solution for the GMM representation, thus requiring an iterative approach to solution [40],[60].

This solution is performed using the expectation-maximization (EM) algorithm [9] (Appendix 14.A). The EM algorithm iteratively improves on the GMM parameter estimates by increasing (on each iteration) the probability that the model estimate λ matches the observed feature vectors, i.e., on each iteration p(X|λk+1) > p(X|λk), k being the iteration number. The EM algorithm is similar, at least in spirit, to other iterative estimation algorithms we have seen throughout the text, such as spectral magnitude-only signal reconstruction [Chapter 7 (Section 7.5.3)], where the estimate is iteratively refined to best match the observed data. Having trained the GMM model, we can now perform speaker recognition.
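A compact numpy sketch of diagonal-covariance GMM training with the EM algorithm, together with frame-level log-likelihood scoring, is given below; the initialization from randomly chosen frames, the variance floor, and the fixed iteration count are implementation assumptions rather than the text's exact recipe.

import numpy as np

def train_gmm(X, n_mix=8, n_iter=10, seed=0, floor=1e-3):
    """EM training of a diagonal-covariance GMM; X: (n_frames, R) feature vectors."""
    rng = np.random.default_rng(seed)
    n, R = X.shape
    mu = X[rng.choice(n, n_mix, replace=False)].copy()          # means
    var = np.tile(X.var(axis=0), (n_mix, 1))                    # diagonal variances
    w = np.full(n_mix, 1.0 / n_mix)                             # mixture weights
    for _ in range(n_iter):
        # E-step: responsibilities gamma[t, i] = P(component i | x_t)
        logp = (-0.5 * (np.log(2 * np.pi * var).sum(axis=1)
                        + ((X[:, None, :] - mu) ** 2 / var).sum(axis=2))
                + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        Nk = gamma.sum(axis=0)
        w = np.maximum(Nk / n, 1e-10)
        mu = (gamma.T @ X) / np.maximum(Nk, 1e-10)[:, None]
        var = (gamma.T @ X ** 2) / np.maximum(Nk, 1e-10)[:, None] - mu ** 2
        var = np.maximum(var, floor)                            # variance flooring
    return w, mu, var

def gmm_loglike(X, w, mu, var):
    """Per-frame log-likelihoods log p(x_t | lambda) under the diagonal GMM."""
    logp = (-0.5 * (np.log(2 * np.pi * var).sum(axis=1)
                    + ((X[:, None, :] - mu) ** 2 / var).sum(axis=2))
            + np.log(w))
    mx = logp.max(axis=1, keepdims=True)
    return (mx + np.log(np.exp(logp - mx).sum(axis=1, keepdims=True))).ravel()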

Speaker Identification — GMM-based speaker verification and identification systems are shown in Figure 14.7. We begin with the identification task. Suppose we have estimated S target speaker models λj for j = 1, 2 … S. Then for each test utterance, features at frame time n, denoted by the vector x_n, are calculated. One common approach to identification is to compute the probability of each speaker model given the features, i.e., P(λj|x_n), and then choose the speaker with the highest probability. This technique is called maximum a posteriori probability (MAP) classification (Appendix 13.A). To express P(λj|x_n) in terms of a known pdf, for example, the GMM model, by Bayes’ rule [11] we write

P(λj|x_n) = p(x_n|λj) P(λj) / p(x_n)

Because p(x_n) is the same for all speaker models, it is sufficient to maximize the quantity p(x_n|λj)P(λj), where P(λj) is the a priori probability, also referred to as a prior probability, of speaker λj being the source of x_n. If the prior probabilities P(λj) are assumed equal, which is customary, the problem becomes finding the λj that maximizes p(x_n|λj), which is simply the GMM speaker model derived in the training procedure, evaluated at the input feature vector x_n.

It is important to note that, in practice, there is not one feature vector x_n for a given testing utterance, but instead a stream of vectors x_1, x_2, …, x_M generated at the front end of the recognizer at a frame interval L. Thus we must maximize p(X|λj), where X = {x_1, x_2, …, x_M} and M is the number of feature vectors for the utterance. (Observe that the vector index here and to follow denotes the frame number rather than absolute time.) This calculation is typically made by assuming that frames are independent; therefore, the likelihood for an utterance is simply the product of likelihoods for each frame:

p(X|λj) = Π_{n=1}^{M} p(x_n|λj)

Figure 14.7 GMM-based speaker recognition systems: (a) speaker identification in which a speaker target model is selected with maximum probability; (b) speaker verification in which the speaker claimant is detected when the log-likelihood ratio between the target and the background pdf exceeds a threshold.

SOURCE: D.A. Reynolds, “Automatic Speaker Recognition Using Gaussian Mixture Speaker Models” [68]. ©1995, MIT Lincoln Laboratory Journal. Used by permission.

Image

The assumption of frame independence is a very strong one; an implication is that both the speaker models λj and the likelihoods calculated above do not depend on the order of the feature vectors. Dynamics of feature vectors over time are thus not considered in the GMM, as we alluded to earlier. By applying the logarithm, we can write the speaker identification solution as

ĵ = argmax_{1 ≤ j ≤ S} Σ_{n=1}^{M} log p(x_n|λj)

which is illustrated in Figure 14.7.
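Using the gmm_loglike helper from the EM sketch above, the identification rule amounts to a few lines: sum the frame log-likelihoods under each speaker model (frame independence and equal priors assumed, as in the text) and take the maximizing model.

import numpy as np

def identify_speaker(X_test, speaker_models):
    """MAP identification with equal priors: argmax_j sum_n log p(x_n | lambda_j).

    speaker_models is a list of (w, mu, var) tuples returned by train_gmm()."""
    scores = [gmm_loglike(X_test, *model).sum() for model in speaker_models]
    return int(np.argmax(scores)), scores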

Reynolds has evaluated the performance of GMM models for speaker identification, using several speech databases [63],[64]. Speaker identification results with the TIMIT and NTIMIT databases are particularly interesting because they show differences in recognition performance in the face of channel distortion. The TIMIT speech database, a standard in recognition experiments, consists of 8-kHz bandwidth read (not conversational) speech recorded in a quiet environment without channel distortion using a high-quality Sennheiser microphone [17]. TIMIT has 630 speakers (438 males and 192 females) with 10 utterances per speaker, each 3 seconds long on average. It has balanced coverage of speech phonemes. The NTIMIT database [28] has the same properties of TIMIT except that NTIMIT was created by transmitting all TIMIT utterances over actual telephone channels. A primary difference in NTIMIT from TIMIT is that the recording microphone is of the carbon-button type that introduces a variety of distortions, most notably nonlinear distortion. Additional distortions in the NTIMIT channel include bandlimiting (3300 Hz), linear (convolutional) channel transmission, and additive noise. Because NTIMIT utterances are identical to TIMIT utterances except for the effects of the telephone channel (and because the two databases are time-aligned), we can compare speech processing algorithms between “clean” speech and “dirty” telephone speech, as illustrated in the following example:

Example 14.2       In this example, a GMM recognition system is developed using the TIMIT and NTIMIT databases. For each database, 8 of the 10 available utterances are used for training (about 24 s) and 2 for testing (about 6 s). Non-speech regions were removed (as in all other recognition experiments of this chapter) using an energy-based speech activity detector [61]. The GMM speaker recognition system is trained with a 19-element mel-cepstral feature vector (the 0th value being removed) obtained at a 10-ms frame interval. Because telephone speech is typically bandlimited to 300–3300 Hz, we remove the first 2 and last 2 mel-scale filter energies that are derived from the 24-element filter bank in Figure 14.3 and that fall outside this telephone passband. Each Gaussian pdf in the mixture is characterized by a diagonal covariance matrix, and parameters λ of an 8-component GMM are obtained using the EM algorithm for each target speaker. In this and other GMM experiments of this chapter, ten iterations of the EM algorithm are sufficient for parameter convergence. Figure 14.8 shows the performance of the GMM speaker identification system on both TIMIT and NTIMIT as a function of the number of speakers in the evaluation [63],[68]. Two interesting results appear. First, the speaker identification performance for clean speech (i.e., TIMIT) is near 100%, up to a population size of 630 speakers. This result suggests that the speakers are not inherently confusable with the use of mel-cepstral features; given features generated from speech produced in the best possible acoustic environment, a GMM classifier can perform almost error-free with these features. Second, speaker identification performance with telephone speech drops off considerably as population size increases. Recall that NTIMIT is the same speech as TIMIT, except that it is transmitted over a telephone network. Thus there is significant performance loss due to the sensitivity of the mel-cepstral features to telephone channel characteristics introduced by NTIMIT, in spite of both training and test data being from the same database.

Figure 14.8 Performance of GMM speaker identification system on TIMIT and NTIMIT databases, as a function of number of speakers. Performance is measured as percent correct.

SOURCE: D.A. Reynolds, “Automatic Speaker Recognition Using Gaussian Mixture Speaker Models” [68]. ©1995, MIT Lincoln Laboratory Journal. Used by permission.

Image

Reynolds has also compared the GMM method against VQ and the minimum-distance classifiers in a different speaker identification task [60]. The experiments were run with the KING database, which consists of clean conversational utterances recorded locally with a high-quality Sennheiser microphone, as well as conversational utterances recorded remotely after being transmitted over a telephone channel, with recording microphones that include the carbon-button type as well as higher-quality electret microphones [33]. With the clean KING database, it was found that the GMM approach outperforms the VQ method, especially for short test utterances, with the minimum-distance classifier being much inferior to both methods. In addition, the results of a second experiment using KING telephone speech only, as with the NTIMIT database, show a general decline in performance of all three methods while maintaining their relative performance levels.

Speaker Verification — Recall that the speaker verification problem requires that we make a binary decision (detection) based on two hypotheses, i.e., the test utterance belongs to the target speaker, hypothesis H0, or comes from an imposter, hypothesis H1. As noted in this chapter’s introduction, a speaker known to the system who is correctly claiming his or her identity is labeled as a claimant and a speaker unknown to the system who is posing as a known speaker is labeled an imposter. Suppose that we have a GMM of the target speaker and a GMM for a collection of imposters that we call the background model; we then formulate a likelihood ratio test that decides between H0 and H1. This ratio is the quotient between the probability that the collection of feature vectors X is from the claimed speaker, P(λC|X), and the probability that X is not from the claimed speaker, P(λC̄|X), i.e., is from the background. Using Bayes’ rule, we can write this ratio as

(14.5)

P(λC|X) / P(λC̄|X) = [p(X|λC) P(λC) / P(X)] / [p(X|λC̄) P(λC̄) / P(X)]

where P(X) denotes the probability of the vector stream X. Discarding the constant probability terms and applying the logarithm, we have the log-likelihood ratio (Exercise 14.3)

(14.6)

Λ(X) = log p(X|λC) − log p(X|λC̄)

that we compare with a threshold to accept or reject whether the utterance belongs to the claimed speaker, i.e.,

(14.7)

Λ(X) ≥ θ   accept the claimed speaker,   Λ(X) < θ   reject the claimed speaker

One approach to generating a background pdf p(X|λC̄) is through models of a variety of background (imposter) speakers (Exercise 14.3). Other methods also exist [61]. We will see numerous examples of speaker verification later in this chapter in different feature vector and application contexts.
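With the same GMM scoring helper, the verification test of Equations (14.6) and (14.7) is a log-likelihood difference compared with a threshold; dividing by the number of frames, noted in the comment below, is a common duration normalization that is not part of the equations themselves.

def verify_speaker(X_test, target_model, background_model, threshold):
    """Log-likelihood ratio test: accept the claimant when Lambda(X) >= threshold."""
    llr = (gmm_loglike(X_test, *target_model).sum()
           - gmm_loglike(X_test, *background_model).sum())
    # Dividing llr by the number of frames (llr / len(X_test)) is a common
    # normalization when comparing scores across utterances of different length.
    return llr >= threshold, llr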

14.4 Non-Spectral Features in Speaker Recognition

Up to now we have focused on spectral-based vocal tract features for speaker recognition, specifically, the mel-cepstrum derived from a smoothed STFT magnitude. Although these spectral features are likely to contain some source information, e.g., the spectral tilt of the STFT magnitude influenced by the glottal flow derivative, we have assumed that the primary contribution to the mel-cepstrum is from the vocal tract system function. In this section, in contrast, we explore the explicit use of various aspects of the speech source in speaker recognition. First, we investigate the voiced glottal flow derivative (introduced in Chapters 3 through 5) and, then, briefly source event onset times (introduced in Chapters 10 and 11). In this section, we also provide some insight into the relative importance of source, system, and prosodic (i.e., pitch intonation and articulation rate) features by controlled speech modification using the sinusoidal speech transformation system presented in Chapter 9.

14.4.1 Glottal Flow Derivative

We saw in Chapters 3 through 5 that the glottal flow derivative appears to be speaker-dependent. For example, the flow can be smooth, as when the folds never close completely, corresponding perhaps to a “soft” voice, or discontinuous, as when they close rapidly, giving perhaps a “hard” voice. The flow at the glottis may be turbulent, as when air passes near a small portion of the folds that remains partly open. When this turbulence, referred to as aspiration, occurs often during vocal cord vibration, it results in a “breathy” voice. In order to determine quantitatively whether such glottal characteristics contain speaker dependence, we want to extract features such as the general shape of the glottal flow derivative, the timing of vocal fold opening and closing, and the extent and timing of turbulence at the vocal folds.

Feature Extraction — In Chapter 5 (Sections 5.7 and 5.8), we introduced a method for extracting and characterizing the glottal flow derivative during voicing by pitch-synchronous inverse filtering and temporal parameterization of the flow under the assumption that the time interval of glottal closure is known. The coarse structure of the flow derivative was represented by seven parameters, describing the shape and timing of the components of the piecewise-functional Liljencrants-Fant (LF) model [16]. A descriptive summary of the seven parameters of the LF model was given in Table 5.1. These parameters are obtained by a nonlinear estimation method of the Newton-Gauss type. The coarse structure was then subtracted from the glottal flow derivative estimate to give its fine-structure component; this component has characteristics not captured by the general flow shape such as aspiration and a perturbation in the flow referred to (in Chapter 5) as “ripple.” Ripple is associated with first-formant modulation and is due to the time-varying and nonlinear coupling of the source and vocal tract cavity [2].

In determining fine-structure features, we defined five time intervals within a glottal cycle, introduced in Section 5.7.2 and illustrated in Figure 5.23, over which we make energy measurements on fine structure. The first three intervals correspond to the timing of the open, closed, and return glottal phase based on the LF model of coarse structure, while the last two intervals come from open and closed phase glottal timings, but timings based on formant modulation. The latter is motivated by the observation that when the vocal folds are not fully shut during the closed phase, as determined by the LF model, ripple can begin prior to the end of this closed phase estimate. The open- and closed-phase estimates using formant frequency modulation thus allow additional temporal resolution in the characterization of fine structure. Time-domain energy measures are calculated over these five time intervals for each glottal cycle and normalized by the total energy in the estimated glottal flow derivative waveform. The coarse- and fine-structure features can then be applied to speaker recognition.

Speaker Identification Experiments — In this section, we illustrate the speaker dependence of the glottal source features with a Gaussian mixture model (GMM) speaker identification system.8 As we have done previously with the GMM, each Gaussian mixture component is assumed to be characterized by a diagonal covariance matrix, and maximum-likelihood speaker model parameters are estimated using the iterative Expectation-Maximization (EM) algorithm. Although for recognition, the GMM has been used most typically with the mel-cepstrum, as noted earlier, if the number of component densities in the mixture model is not limited, we can approximate virtually any smooth density. We select 16 Gaussians in the mixture model for LF and energy source features based on experiments showing that increasing the number of Gaussians beyond 16 does not improve performance and decreasing the number below 16 hurts performance of the classifier.

8 Source information has also been shown to possess speaker identifiability in other speaker recognition systems [35],[81]. These source features, however, are not based on an explicit temporal model of the glottal flow derivative, but rather a linear prediction residual. Furthermore, the residual estimation was not pitch synchronous, i.e., it did not use glottal open- and closed-phase timing and voiced speech identification. As such, in these recognition systems, the residual is a representation of the source, primarily in the form of pitch and voicing information, and not of a glottal flow derivative measure.

Before describing the speaker identification experiments, to obtain further insight into the nature of the glottal source parameters for this application, we give an example of an enlightening statistical analysis [48].

Example 14.3       Figure 14.9 compares the histograms of parameters from the two classes of glottal features: (1) the coarse structure of the glottal flow derivative using the glottal flow shape parameter α and the open quotient equal to the ratio of Te − To to the pitch period as determined from the LF model (Table 5.1), and (2) the fine-structure energy over the closed-phase interval [0, To) also as determined from the LF model. The experiment used about 20 s of TIMIT utterances for each of two male speakers, and feature values were divided across 40 histogram bins. We see in Figure 14.9 that there is a separation of distributions of glottal features across the two speakers, particularly with the shape parameter α and the open quotient parameter. We also see in Figure 14.9 generally smooth distributions with specific energy concentrations, indicating their amenability to a GMM model. On the other hand, we see a strong asymmetry in certain histograms such as with the open quotient of Speaker B. This strong asymmetry has also been observed in the histograms of other glottal flow features (not shown), particularly with the return phase as determined from the LF model and the open quotient as determined from formant modulation. Consequently, these features may be more efficiently modeled by a non-GMM pdf such as sums of Rayleigh or Maxwell functions [46], being characterized by a sharp attack and slow decay. This sharp asymmetry may explain the need for a 16-Gaussian mixture, in contrast to the 8-Gaussian mixture used with mel-cepstra in Example 14.2.

Figure 14.9 A comparison of the histograms of two coarse glottal flow features, the glottal flow shape parameter α and the open quotient (ratio of Te − To to the pitch period) determined from the LF model, and one fine-structure feature, the energy over the closed-phase interval (0, To) determined from the LF model. The histograms are shown for two different male TIMIT speakers using about 20 s of data for each speaker and feature values are divided across 40 histogram bins.

SOURCE: M. Plumpe, T.F. Quatieri, and D.A. Reynolds, “Modeling of the Glottal Flow Derivative Waveform with Application to Speaker Identification” [48]. ©1999, IEEE. Used by permission.

Image

Speaker identification experiments are performed with a subset of the TIMIT database [48],[49]. The male subset contains 112 speakers, while the female subset contains 56 speakers. As before, for each speaker, eight of the available ten sentences are used for training and two are used for testing. Male and female speaker recognitions are performed separately. Speaker identification results are obtained for coarse, fine, and combined coarse and fine glottal flow derivative features. The results in Table 14.1 show that the two categories of source parameters, although giving a marked decrease from performance with mel-cepstra, contain significant speaker-dependent information, the combined feature set giving about 70% accuracy.

As an alternative to explicit use of the glottal features, one can derive features from glottal flow derivative waveforms. Specifically, as in Example 14.2, we consider a length 19 (the 0th value being removed) mel-cepstrum as a feature vector for speaker identification over a 4000-Hz bandwidth from both the glottal flow derivative waveform estimate, obtained from all-pole (pitch-synchronous) inverse filtering, and its counterpart modeled waveform, i.e., the waveform synthesized using the LF-modeled glottal flow derivative during voicing. The results are shown in Table 14.2. We observe that the seven LF parameters shown in the first row of Table 14.1 better represent the modeled glottal flow derivative than the order–19 mel-cepstral parameter representation of the corresponding waveform. This is probably because the signal processing to obtain the mel-cepstrum smears the spectrum of the glottal flow derivative and discards its phase. On the other hand, Table 14.2 also indicates that the mel-cepstra of the seven-parameter LF-modeled glottal flow derivative waveform contain significantly less information than those of the estimated glottal flow derivative waveform; this is not surprising given that the synthesized modeled glottal flow derivative is a much reduced version of the glottal flow derivative estimate that contains additional fine structure such as aspiration and ripple. Further study of these results is given in [48],[49].

Table 14.1 Speaker identification performance (percent correct) for various combinations of the source parameters for a subset of the TIMIT database [48].

Image

Although the above results are fundamental for speech science, their potential practical importance lies in part with speaker identification in degrading environments, such as telephone speech. To test performance of the source features on degraded speech, we use the counterpart 112 male speakers and 56 female speakers from the telephone-channel NTIMIT database [48],[49]. The order–19 mel-cepstral representation of the synthesized LF-modeled source waveform, rather than the LF-model parameters themselves, is used in order to provide frame synchrony and similar feature sets for spectrum and source. While the LF-modeled source performs poorly on its own, i.e., 12.5% on males and 27.5% on females, when combined with the mel-cepstra of the speech waveform (with mel-cepstra of the speech and source concatenated into one feature vector), an improvement of roughly 3.0% in speaker identification accuracy for both males and females is obtained, representing a 5% error reduction relative to using only the mel-cepstra of the speech waveform. Nevertheless, the usefulness of this approach will require further confirmation and understanding, with larger databases and more general degrading conditions.

Table 14.2 Speaker identification performance (percent correct) for mel-cepstral representations of the glottal flow derivative (GFD) waveform and the modeled GFD waveform for a subset of the TIMIT database [48].

Image

14.4.2 Source Onset Timing

The glottal flow features used in the previous section were chosen based on common properties of the glottal source. Chapters 5 and 11, on the other hand, described numerous atypical cases. One example is multiple points of excitation within a glottal cycle (e.g., Figure 5.21). Such secondary pulses were observed in Chapter 11 (Section 11.4) to occasionally “excite” formants different from those corresponding to the primary excitation. Different hypotheses exist for explaining this phenomenon, including non-acoustic energy sources originating from a series of traveling vortices along the oral tract. The presence of such secondary source pulses may in part explain the improved speaker identification scores achieved by measuring pulse onset times in different formant bands and appending these onset times as features to mel-cepstral features [52]. This improvement was noted specifically for a limited (but confusable) subset of the telephone-channel NTIMIT database, the premise being that source pulse timings are robust to such degradations. These examples, as well as other atypical cases of glottal flow [49], point to the need for a more complete understanding of such features and their importance under adverse environmental conditions in automatic (and human) speaker recognition.

14.4.3 Relative Influence of Source, Spectrum, and Prosody

In speaker recognition, both by machine and human, we are interested in understanding the relative importance of the source, the vocal tract, and prosody (i.e., pitch intonation and articulation rate). One approach to gain insight into this relation is to modify these speech characteristics in the speech waveform and measure recognition performance from the modified speech. For example, little performance change was found in an HMM-based speaker recognizer when using synthetic speech from an all-pole noise-driven synthesis to produce the effect of a whispered speech source [39]. The sinusoidal modification system [51], introduced in Chapter 9, has also been used to study the relative importance of source, vocal tract, and prosody, both for a GMM recognizer and for the human listener [53]. In this section, we investigate the effect on speaker verification of sinusoidally modified speech, including change in articulation rate, pitch, and vocal tract spectrum, as well as a transformation to whispering.

Preliminaries — We saw in Chapter 9 that with sinewave analysis/synthesis as a basis, we can modify the articulation rate and change the pitch (see Exercise 9.11 for a development of pitch modification) of a speaker by stretching and compressing sinewave frequency trajectories in time and frequency, respectively. Other transformations can also be simulated through sinewave analysis/synthesis. The effect of whispering, i.e., an aspirated excitation, can be created by replacing the excitation phases in Equation (9.5), along the kth sinewave frequency trajectory Ωk(t), by uniformly distributed random variables on the interval [−π, π]. The effect of lengthening or shortening the vocal tract is obtained by stretching or compressing, along the frequency axis, the amplitude and phase M(t, Ω) and Φ(t, Ω), respectively, of the vocal tract system function in Equation (9.5).
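As an illustration of how the whispering effect can be simulated, the sketch below synthesizes speech from per-frame sinewave amplitudes and frequencies while drawing each excitation phase uniformly from [−π, π]. It is only a minimal stand-in for the sinewave system of Chapter 9: the arrays amps and freqs are hypothetical outputs of a separate sinewave analysis, and simple windowed overlap-add replaces the trajectory matching and phase interpolation used in practice.

```python
import numpy as np

def whisper_synthesis(amps, freqs, fs, hop):
    """Minimal sketch of an aspirated (whispered) effect via sinewave synthesis.

    amps, freqs : lists (one entry per frame) of matched sinewave amplitudes
                  and frequencies (Hz) from a hypothetical sinewave analysis.
    fs          : sampling rate in Hz.
    hop         : frame hop in samples.

    Each sinewave is given a uniformly distributed random phase on [-pi, pi]
    instead of its measured excitation phase, which removes pitch (harmonic
    phase coherence) while preserving the spectral envelope.
    """
    rng = np.random.default_rng(0)
    n_frames = len(amps)
    y = np.zeros(n_frames * hop + 2 * hop)
    win = np.hanning(2 * hop)                      # overlap-add synthesis window
    t = np.arange(2 * hop) / fs
    for m in range(n_frames):
        frame = np.zeros(2 * hop)
        for a, f in zip(amps[m], freqs[m]):
            phi = rng.uniform(-np.pi, np.pi)       # randomized excitation phase
            frame += a * np.cos(2 * np.pi * f * t + phi)
        y[m * hop : m * hop + 2 * hop] += win * frame
    return y
```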

The consequence of these transformations on GMM speaker verification is studied with the use of the 2000 National Institute of Standards and Technology (NIST) evaluation database [42]. The series of NIST evaluation databases consists of large amounts of spontaneous 4000-Hz telephone speech from hundreds of speakers, collected under home, office, and college campus acoustic conditions with a wide range of telephone handsets from high-quality electret to low-quality carbon-button. The evaluation data comes from the Switchboard Corpus collected by the Linguistic Data Consortium (LDC) [33]. For the female gender, 504 target speakers are used with 3135 target test trials and 30127 imposter test trials; for the male gender, 422 target speakers are used with 2961 target test trials and 30349 imposter test trials. The target training utterances are 30 s in duration and the test utterances 1 s in duration. Training and testing use the 19 mel-cepstral coefficients at a 10-ms frame interval as described in the previous identification experiments (along with the companion delta-cepstrum and CMS and RASTA channel compensation to be described in the following Section 14.5.1).

Because we are performing speaker verification, rather than speaker identification, a background (i.e., imposter) model is needed for the background pdf in Equation (14.6). Using the EM algorithm, this background model is trained using a large collection of utterances from the 2000 NIST evaluation database. The background GMM consists of 2048 mixtures to account for a very large array of possible imposters, hence the terminology “universal background model” (UBM) [61],[62]. The target model training again uses the EM algorithm but, in addition, each target speaker’s model is derived by adapting its 2048 mixtures from the UBM. This adaptation approach has been found to provide better performance than the more standard application of the EM algorithm for the speaker verification task. Details and motivation for the GMM recognizer using the adaptation-based target training, henceforth referred to as the GMM-UBM recognizer, can be found in [61],[62].
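The adaptation of target models from the UBM can be sketched as follows for the mixture means only (weights and variances can be adapted similarly). This is a minimal illustration of relevance-MAP adaptation in the spirit of [61],[62], not the exact recipe of the evaluation system; the relevance factor r = 16 is a commonly used value assumed here, and a diagonal-covariance GMM is assumed.

```python
import numpy as np

def map_adapt_means(ubm_w, ubm_mu, ubm_var, X, r=16.0):
    """Sketch of relevance-MAP adaptation of UBM means to a target speaker.

    ubm_w   : (M,)   UBM mixture weights
    ubm_mu  : (M, D) UBM mixture means
    ubm_var : (M, D) UBM diagonal covariances
    X       : (N, D) target training feature vectors (e.g., mel-cepstra)
    r       : relevance factor (an assumed, commonly used value)
    """
    M, D = ubm_mu.shape
    # Per-mixture Gaussian log-likelihoods of each frame (diagonal covariance).
    diff = X[:, None, :] - ubm_mu[None, :, :]                  # (N, M, D)
    log_g = -0.5 * (np.sum(diff**2 / ubm_var, axis=2)
                    + np.sum(np.log(2 * np.pi * ubm_var), axis=1))
    log_p = np.log(ubm_w) + log_g                              # (N, M)
    log_p -= log_p.max(axis=1, keepdims=True)
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)                    # responsibilities
    # Sufficient statistics and data-dependent adaptation coefficient.
    n_m = post.sum(axis=0)                                     # (M,)
    Ex = post.T @ X / np.maximum(n_m[:, None], 1e-10)          # (M, D)
    alpha = (n_m / (n_m + r))[:, None]
    return alpha * Ex + (1.0 - alpha) * ubm_mu                 # adapted means
```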

In evaluating recognition performance with modified speech, target and background training data are left intact and the test data (target and imposter) are modified with the sinusoidal modification system. To obtain a performance reference for these experiments, however, we first obtain the recognition performance with sinewave analysis/synthesis without modification. Performance for the speaker verification task is reported as Detection Error Tradeoff (DET) curves [38], produced by sweeping out the speaker-independent threshold of Equation (14.7) over all verification test scores and plotting estimated miss and false alarm probabilities at each threshold level. More specifically, performance is computed by separately pooling all target trial scores (cases where the speaker of the model and the test file are the same) and all nontarget (imposter) trial scores (cases where the speaker of the model and the test file are different) and sweeping out a speaker-independent threshold over each set to compute the system’s miss and false alarm probabilities. The tradeoff in these errors is then plotted as a DET curve. Figure 14.10 shows that for male speakers, sinewave analysis/synthesis (thick dashed) does not change DET performance of the recognizer relative to the original data (thick solid). A similar characteristic is also seen for female speakers, but with a slight decrease in performance.
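A minimal sketch of this scoring procedure, assuming the target and imposter trial scores have already been pooled into two arrays, is given below; it sweeps a single speaker-independent threshold and returns the miss and false-alarm probabilities that would be plotted (typically on normal-deviate axes) as a DET curve.

```python
import numpy as np

def det_points(target_scores, imposter_scores):
    """Sketch: sweep one speaker-independent threshold over the pooled scores
    and return (false-alarm probability, miss probability) pairs for a DET plot.
    Decision rule assumed: accept the claim when score >= threshold."""
    target_scores = np.sort(np.asarray(target_scores, dtype=float))
    imposter_scores = np.sort(np.asarray(imposter_scores, dtype=float))
    thresholds = np.unique(np.concatenate([target_scores, imposter_scores]))
    # Miss: target score falls below the threshold (claimant rejected).
    p_miss = np.searchsorted(target_scores, thresholds, side='left') / len(target_scores)
    # False alarm: imposter score reaches the threshold (imposter accepted).
    p_fa = 1.0 - np.searchsorted(imposter_scores, thresholds, side='left') / len(imposter_scores)
    return p_fa, p_miss
```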

Transformation Experiments —
Articulation Rate:
In modifying articulation rate, we expect little change in recognition performance because the stretching or compressing of the sinewave amplitude and phase functions affects only the duration of sinewave amplitude trajectories, and not their value. Thus, the mel-cepstrum features, derived from the STFT magnitude, should be negligibly affected. Indeed, Figure 14.10 (thin solid) shows that articulation rate change gives a small loss in performance for both male and female speakers for the case of time-scale expansion by a factor of two. A similar small loss occurs for an equivalent compression (not shown). Nevertheless, the loss, although small, indicates that articulation rate is reflected somewhat in recognition performance. This may be due to time-scale modification imparting greater or lesser stationarity to the synthetic speech than in the original time scale so that speech features are altered, particularly during event transitions and formant movement. In informal listening, however, the loss in automatic recognition performance does not appear to correspond to a loss in aural speaker identifiability.

Figure 14.10 Speaker recognition performance on female (left panel) and male (right panel) speakers using various sinewave-based modifications on all test trial utterances [53]. Each panel shows the Detection Error Tradeoff (DET) curves (miss probability vs. false alarm probability) for the original speech waveform (thick solid), sinewave analysis/synthesis without modification (thick dashed), 60-Hz monotone pitch (thick dashed-dotted), unvoiced (aspirated) excitation (thin dashed), articulation rate slowed down by a factor of two (thin solid), 60-Hz monotone pitch and articulation rate slowed down by a factor of two (thin dashed-dotted), and spectral expansion by 20% (thick dotted). TSM denotes time-scale modification.

Image

Pitch: Common knowledge has been that a change in pitch should not affect automatic recognition performance using the mel-cepstrum because pitch does not measurably alter the mel-scale filter energies. To test this hypothesis, we first perform a pitch change by modifying the time-varying pitch contour to take on a fixed, monotone pitch. For the case of 60-Hz monotone pitch, Figure 14.10 shows the unimportance of pitch for male speakers, but for female speakers, a significant loss in performance occurs (thick dashed-dotted). A similar result is found when the original pitch is replaced by a monotone 100-Hz pitch, as well as when pitch contours of the test data are shifted and scaled by numerous values to impart a change in the pitch mean and variance (not shown). In all cases of pitch modification, the male recognition performance changes negligibly, while female performance changes significantly. In informal listening, aural speaker identifiability is essentially lost with these modifications for both genders; for male speakers, this loss by human listeners is not consistent with the negligible performance loss by automatic recognition. The result of combining the transformations of rate and pitch change (time-scale expansion by two and a 60-Hz monotone pitch) is also shown in Figure 14.10 (thin dashed-dotted). Here we see that performance is essentially equivalent to that of the monotone pitch transformation, the change in mel-cepstral coefficients being dominated by the pitch change.

To understand the gender-dependent performance loss with pitch modification, we need to understand how the mel-cepstral features change with a change in pitch. Toward this end, we compare mel-scale filter energies (from which the mel-cepstrum is derived) from a high-pitched female and a low-pitched male. Figure 14.11 shows the speech log-spectrum and mel-scale filter log-energies for one frame from each speaker. It is seen that there is clear harmonic structure, most notably in the low frequencies of the mel-scale filter energies of the female speech, due to alternating (narrow) mel-scale filters weighting peaks and nulls of the harmonic structure. In practice, we see a continuum in the degree of harmonic structure in going from high-pitched to low-pitched speakers. Mel-scale filter energies do, therefore, contain pitch information. For high-pitched speakers, a change in pitch of the test data modifies the mel-cepstrum and thus creates a “mismatch” between the training and test features.

Figure 14.11 Comparison of log-spectrum and mel-scale filter log-energies for female (a) and male (b) speakers.

SOURCE: T.F. Quatieri, R.B. Dunn, D.A. Reynolds, J.P. Campbell, and E. Singer, “Speaker Recognition Using G.729 Speech Codec Parameters” [57]. ©2000, IEEE. Used by permission.

Image
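The gender-dependent sensitivity to pitch can be explored with a simple mel filter-bank computation such as the sketch below; the triangular filters and mel warping used here are a common textbook construction and only stand in for the filters of Section 14.2. For a high-pitched voice, the narrow low-frequency filters straddle individual harmonics, so the resulting log-energies carry the ripple visible in Figure 14.11.

```python
import numpy as np

def mel_filter_bank(fs, n_filters=24, n_fft=512):
    """Triangular mel-scale filters as a (n_filters, n_fft//2 + 1) weight matrix."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0**(m / 2595.0) - 1.0)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    fb = np.zeros((n_filters, len(freqs)))
    for l in range(n_filters):
        lo, ctr, hi = edges[l], edges[l + 1], edges[l + 2]
        rising = np.clip((freqs - lo) / (ctr - lo), 0.0, 1.0)
        falling = np.clip((hi - freqs) / (hi - ctr), 0.0, 1.0)
        fb[l] = np.minimum(rising, falling)
    return fb

def mel_log_energies(frame, fs, n_filters=24, n_fft=512):
    """Mel-scale filter log-energies of one windowed frame. With narrow
    low-frequency filters these can resolve individual harmonics of a
    high-pitched voice, producing the ripple seen in Figure 14.11."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft))**2
    return np.log(np.maximum(mel_filter_bank(fs, n_filters, n_fft) @ spec, 1e-12))
```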

Whispering: The next transformation is that of converting the excitation to white noise to produce the effect of whispered speech. It was stated earlier that this effect can be created with sinewave analysis/synthesis by replacing the excitation sinewave phases by uniformly distributed random phases on the interval [−π, π]. Figure 14.12 illustrates that formant information remains essentially intact with this sinewave phase randomization. The recognition result for this transformation is given in Figure 14.10 (thin dashed). We see that the performance change is very similar to that with pitch modification: a large loss for female speakers and a negligible loss for male speakers; indeed, for each gender, the DET curves are close for the two very different transformations. We speculate that, as with pitch modification, this gender-dependent performance loss is due to the increasing importance of fine structure in the mel-scale filter energies with increasing pitch; as seen in the narrowband spectrograms of Figure 14.12, pitch is completely removed with phase randomization, while gross (formant) spectral structure is preserved. For this transformation, there is a large loss in aural speaker identifiability, as there occurs with naturally whispered speech; yet there is little performance loss for male speakers in automatic speaker recognition.

Figure 14.12 Comparison of narrowband spectrograms of whispered synthetic speech (a) and its original counterpart (b) [53].

Image

Spectral Warping: The final transformation invokes a warping of the vocal tract spectral magnitude M(t, ω) and phase Φ(t, ω) along the frequency axis, which corresponds to a change in length of the vocal tract. With this transformation, we expect a large loss in recognition performance because we are displacing the speech formants and modifying their bandwidths, resulting in a severe mismatch between the training and test features. Figure 14.10 (thick dotted) confirms this expectation, showing a very large loss in performance for both males and females using a 20% expansion; in fact, the performance is nearly that of a system with random likelihood ratio scores. For this transformation, there also occurs a significant loss in aural speaker identifiability. A similar result was obtained with a 20% spectral compression.

14.5 Signal Enhancement for the Mismatched Condition

We saw in Section 14.3.3 examples using the NTIMIT and KING databases of how speaker recognition performance can degrade in adverse environments. In these examples, the training and test data were matched in the sense of experiencing similar telephone channel distortions. Training and test data, however, may experience different distortions, a scenario we refer to as the mismatched condition, and this can lead to far greater performance loss than for the matched condition in degrading environments. In this section, we address the mismatch problem by calling upon enhancement methods of Chapter 13 that reduce additive and convolutional distortion, including spectral subtraction, cepstral mean subtraction (CMS), and RelAtive SpecTrAl processing (RASTA). Also in this section, we introduce a compensation approach for nonlinear distortion, an algorithm for removing badly corrupted features, i.e., “missing features,” and the development of features that hold promise in being invariant under different degrading conditions.

14.5.1 Linear Channel Distortion

We saw in Chapter 13 that the methods of CMS and RASTA can reduce the effects of linear distortion on the STFT magnitude. In this section, we look at how these methods can improve speaker recognition performance. We begin, however, by first extending the use of CMS and RASTA to operate on features, in particular the mel-scale filter energies, rather than on the STFT magnitude.

Feature Enhancement — Recall from Chapter 13 (Section 13.2) that when a sequence x[n] is passed through a linear time-invariant channel distortion g[n], resulting in a sequence y[n] = x[n] * g[n], the logarithm of the STFT magnitude can be approximated as

(14.8)

log |Y(n, ω)| ≈ log |X(n, ω)| + log |G(ω)|

where we assume the STFT analysis window w[n] is long and smooth relative to the channel distortion g[n]. As a function of time at each frequency ω, the channel frequency response G(ω) is seen as a constant disturbance. In CMS, assuming that the speech component log |X(n, ω)| has zero mean in the time dimension, we can estimate and remove the channel disturbance while keeping the speech contribution intact. We recall from Chapter 13 that RASTA is similar to CMS in removing this constant disturbance, but it also attenuates a small band of low modulation frequencies. In addition, RASTA, having a passband between 1 and 16 Hz, attenuates high modulation frequencies.

For speaker recognition, however, we are interested in features derived from the STFT magnitude. We focus here specifically on the mel-scale filter energies from which the mel-cepstral feature vector is derived. After being modified by CMS or the RASTA filter, denoted by p[n] and applied along the time dimension at each mel channel, the mel-scale filter log-energies become

Êmel(n, l) = p[n] * log Emel(n, l)

where

Emel(n, l) = Σk |Vl(ωk)|2 |Y(n, ωk)|2

and where [Ll, Ul] is the discrete frequency range for the lth mel-scale filter. Analogous to Equation (14.8), we would like to state that the logarithm of the mel-scale filter energies of the speech and channel distortion are additive so that the CMS and RASTA operations are meaningful. This additivity is approximately valid under the assumption that the channel frequency response is “smooth” over each mel-scale filter. Specifically, we assume

G(ωk) ≈ Gl,     for Ll ≤ k ≤ Ul

where Gl represents a constant level for the lth mel-scale filter over its bandwidth. Then we can write

Emel(n, l) ≈ |Gl|2 Σk |Vl(ωk)|2 |X(n, ωk)|2

and so

(14.9)

log Emel(n, l) ≈ log |Gl|2 + log [ Σk |Vl(ωk)|2 |X(n, ωk)|2 ]

where the second term on the right of the above expression is the desired log-energy. CMS and RASTA can now view the channel contribution to the mel-scale filter log-energies as an additive disturbance to the mel-scale filter log-energies of the speech component. We must keep in mind, however, that this additivity property holds only under the smoothness assumption of the channel frequency response, thus barring, for example, channels with sharp resonances. Before describing the use of the mel-cepstrum features, derived from the enhanced mel-scale filter log-energies,9 in speaker recognition, we introduce another feature representation that is widely used in improving recognition systems in the face of channel corruption, namely the delta cepstrum.

9 Observe that we can also apply CMS and RASTA in time (across frames) to the mel-cepstral coefficients directly. That is, due to the linearity of the Fourier transform operator used to obtain the mel-cepstrum from the mel-scale filter log-energies, enhancing the mel-cepstrum with CMS or RASTA linear filtering is equivalent to the same operation on the mel-scale filter log-energies.
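Both compensations act along the frame (time) axis of a feature matrix, as the footnote notes. The sketch below applies CMS exactly and a RASTA-like band-pass IIR filter whose coefficients only approximate the published RASTA filter; it assumes a (frames × channels) array of mel-scale filter log-energies or mel-cepstra.

```python
import numpy as np
from scipy.signal import lfilter

def cms(features):
    """Cepstral mean subtraction: remove the per-dimension time average,
    i.e., the constant channel term in Equations (14.8) and (14.9)."""
    return features - features.mean(axis=0, keepdims=True)

def rasta_like(features, b=(0.2, 0.1, 0.0, -0.1, -0.2), a=(1.0, -0.94)):
    """Band-pass filtering of each feature trajectory across frames.
    The coefficients are only a stand-in with a RASTA-like shape
    (a differentiator followed by a leaky integrator), not necessarily
    the exact published RASTA filter."""
    return lfilter(b, a, features, axis=0)
```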

An important objective in recognition systems is to find features that are channel invariant. Features that are instantaneous, in the sense of being measured at a single time instant or frame, such as the mel-cepstrum, often do not have this channel invariance property. Dynamic features that reflect change over time, however, can have this property. The delta cepstrum is one such feature set when linear stationary channel distortion is present [19],[78]. The delta cepstrum also serves as a means to obtain dynamic, temporal speech characteristics which can possess speaker identifiability. One formulation of the delta cepstrum invokes the difference between two mel-cepstral feature vectors over two consecutive analysis frames. For the mel-cepstrum, due to the linearity of the Fourier transform, this difference corresponds to the difference of mel-scale filter log-energies over two consecutive frames, i.e.,

(14.10)

Δ log Emel(n, l) = log Emel(n, l) − log Emel(n − 1, l)

which can be considered as a first-backward difference approximation to the continuous-time derivative of the log-energy. For a time-invariant channel, the channel contribution is removed by the difference operation in Equation (14.10) (Exercise 14.4). The delta cepstrum has been found to contain speaker identifiability, with or without channel distortion, but does not perform as well as the mel-cepstrum in speaker recognition [78]. Nevertheless, the mel-cepstrum and delta cepstrum are often combined as one feature set in order to improve recognition performance. In GMM recognizers, for example, the two feature sets have been concatenated into one long feature vector [60], while in minimum-distance and VQ recognizers, distance measures from each feature vector have been combined with a weighted average [78].

A number of variations of the delta cepstrum, which have proven even more useful than a simple difference over two consecutive frames, fit a polynomial to the mel-scale filter log-energies (or mel-cepstrum) over multiple successive frames and use the parameters of the derivative of the polynomial as features. It is common to use a first-order polynomial, which results in just the slope of a linear fit across this time span. For example, a first-order polynomial typically used in speaker recognition fits a straight line over five 10-ms frames (i.e., 50 ms); this time span represents the tradeoff between good temporal resolution and a reliable representation of temporal dynamics. Experiments in both speaker and speech recognition have shown that a smoothed but more reliable estimate of the trend of spectral dynamics gives better recognition performance than a higher resolution but less reliable estimate [78].
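A minimal sketch of the regression-based delta computation follows: the least-squares slope of a straight-line fit over 2K + 1 frames, with K = 2 giving the five-frame (50-ms) span mentioned above.

```python
import numpy as np

def delta_features(c, K=2):
    """Sketch: slope of a first-order (straight-line) fit to each feature
    trajectory over 2K+1 frames; K=2 gives the five-frame (50-ms) span."""
    c = np.asarray(c, dtype=float)                   # (n_frames, n_coeffs)
    pad = np.pad(c, ((K, K), (0, 0)), mode='edge')   # replicate end frames
    norm = 2.0 * sum(k * k for k in range(1, K + 1))
    d = np.zeros_like(c)
    for k in range(1, K + 1):
        d += k * (pad[K + k : K + k + len(c)] - pad[K - k : K - k + len(c)])
    return d / norm
```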

Application to Speaker Recognition — We now look at an example of speaker recognition that illustrates use of the above channel compensation techniques in a degrading environment and under a mismatched condition. In this example we investigate speaker verification using the Tactical Speaker Identification Database (TSID) [33]. TSID consists of 35 speakers reading sentences, digits, and directions over a variety of very degraded and low-bandwidth wireless radio channels, including cellular and push-to-talk. These channels are characterized by additive noise and linear and nonlinear distortions. For each such “dirty” recording, a low-noise and high-bandwidth “clean” reference recording was made simultaneously at the location of the transmitter. In this next example, we give performance in terms of the equal error rate (EER). The EER is the point along the DET curve where the false acceptance and false rejection error rates are equal, and thus provides a concise means of performance comparison.
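Given the miss/false-alarm pairs from the DET sketch shown earlier, the EER can be read off approximately as follows (a linear interpolation between neighboring points would refine the estimate).

```python
import numpy as np

def equal_error_rate(p_fa, p_miss):
    """Sketch: take the EER as the average of the two error rates at the
    threshold where |P_fa - P_miss| is smallest."""
    p_fa, p_miss = np.asarray(p_fa), np.asarray(p_miss)
    i = np.argmin(np.abs(p_fa - p_miss))
    return 0.5 * (p_fa[i] + p_miss[i])
```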

Example 14.4       The purpose of this example is to illustrate the effect of the mismatched condition, training on clean data and testing on dirty data, using TSID, and to investigate ways of improving performance by channel compensation [45]. The speaker recognition system uses the previously described 8-mixture GMM and 19 mel-cepstral coefficients over a 4000-Hz bandwidth. Details of the training and testing scenario are given in [45].

As a means of comparison to what is possible in a matched scenario of training on clean and testing on clean data, referred to as the “baseline” case, an EER of approximately 4% is obtained (Table 14.3). In contrast, the matched case of training on dirty and testing on dirty data gives an EER of about 9%. On the other hand, the mismatched case of training on clean and testing on dirty data has an EER of approximately 50%, i.e., performance equivalent to flipping a coin. To improve this result, a variety of compensation techniques are employed, individually and in combination. In particular, the result of applying delta cepstral coefficients (DCC) based on the above 1st-order polynomial, cepstral mean subtraction (CMS), and their combination (CMS + DCC) is shown in Table 14.3 for the mismatched case. CMS alone gives better performance than DCC alone, while the combination of CMS and DCC yields better performance than either compensation method alone. An analogous set of experiments was performed with RASTA. While maintaining the same relative performance pattern, RASTA gave somewhat inferior performance to CMS. The combination of CMS and RASTA gained little over CMS alone. Image

Table 14.3 Speaker verification results with various channel compensation techniques using the TSID database [45]. Results using CMS, DCC, or combined CMS and DCC show EER performance improvements in the Clean/Dirty mismatched case.

SOURCE: M. Padilla, Applications of Missing Feature Theory to Speaker Recognition [45]. ©2000, M. Padilla and Massachusetts Institute of Technology. Used by permission.

Image

We see in this example that RASTA performs somewhat worse than CMS and offers little gain when combined with CMS. This is because the degrading channels are roughly stationary over the recording duration. The relative importance of these compensation techniques, however, can change with the nature of the channel. Extensive studies of the relative importance of the CMS and RASTA techniques for speaker recognition have been performed by van Vuuren and Hermansky [83] for a variety of both matched and mismatched conditions.

14.5.2 Nonlinear Channel Distortion

Although linear compensation techniques can improve recognition performance, such methods address only part of the problem, not accounting for a nonlinear contribution to distortion [65]. Nonlinear distortion cannot be removed by the linear channel compensation techniques of CMS and RASTA and can be a major source of performance loss in speaker recognition systems, particularly when it is responsible for a mismatch between training and test data. For example, in telephone transmission, we have seen that carbon-button microphones induce nonlinear distortion, while electret microphones are more linear in their response. In this section, we develop a nonlinear channel compensation technique in the context of this particular telephone handset mismatch problem. A method is described for estimating a system nonlinearity by matching the spectral magnitude of the distorted signal to the output of a nonlinear channel model, driven by an undistorted reference [55]. This magnitude-only approach allows the model to directly match unwanted speech resonances that arise over nonlinear channels.

Theory of Polynomial Nonlinear Distortion on Speech — Telephone handset nonlinearity often introduces resonances that are not present in the original speech spectrum. An example showing such added resonances, which we refer to as “phantom formants,” is given in Figure 14.13, where a comparison of all-pole spectra from a TIMIT waveform and its counterpart carbon-button microphone version from HTIMIT [33],[66] is shown. HTIMIT is a handset-dependent corpus derived by playing a subset of TIMIT through known carbon-button and electret handsets.

Figure 14.13 Illustration of phantom formants, comparing all-pole spectra from wideband TIMIT (dashed) and carbon-button HTIMIT (solid) recordings. The location of the first phantom formant (ω1 + ω2) is roughly equal to the sum of the locations of the first two original formants (ω1 and ω2). The (14th-order) all-pole spectra are derived using the autocorrelation method of linear prediction.

SOURCE: T.F. Quatieri, D.A. Reynolds, and G.C. O’Leary, “Estimation of Handset Nonlinearity with Application to Speaker Recognition” [55]. ©2000, IEEE. Used by permission.

Image

The phantom formants occur at peaks in the carbon-button spectral envelope that are not seen in that of the high-quality microphone, and, as illustrated by the all-pole spectra in Figure 14.13, can appear at multiples, sums, and differences of original formants (Exercise 14.8). Phantom formants, as well as two other spectral distortions of bandwidth widening and spectral flattening, also seen in Figure 14.13, have been consistently observed not only in HTIMIT, but also in other databases with dual wideband/telephone recordings such as the wideband/narrowband KING and the TIMIT/NTIMIT databases that were introduced in earlier examples [26],[27].

To further understand the generation of phantom formants, we investigate the effect of polynomial nonlinearities first on a single vocal tract impulse response, and then on a quasi-periodic speech waveform. The motivation is to show that finite-order polynomials can generate phantom formants that occur at multiples, sums, and differences of the original formants, consistent with measurement of distorted handset spectra.

Example 14.5       Consider a single-resonant impulse response of the vocal tract, x[n] = rn cos(ωn)u[n], with u[n] the unit step, passed through a cubic nonlinearity, i.e., y[n] = x3[n]. The output of the distortion element consists of two terms and is given by

y[n] = (3/4) r3n cos(ωn)u[n] + (1/4) r3n cos(3ωn)u[n]

The first term is a resonance at the original formant frequency ω, and the second term is a phantom formant at three times the original frequency. We observe that the resulting resonances have larger bandwidth than the original resonance, corresponding to a faster decay. The bandwidth widening and additional term contribute to the appearance of smaller spectral tilt. More generally, for a multi-formant vocal tract, the impulse response consists of a sum of damped complex sinewaves. In this case, a polynomial nonlinearity applied in the time domain maps to convolutions of the original vocal tract spectrum with itself, thus introducing sums and differences of the original formant frequencies with wider bandwidths and with decreased spectral tilt. The presence of additional resonances and the change in bandwidth can alternatively be derived using trigonometric expansions. Image
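The algebra of Example 14.5 is easy to check numerically; the short script below cubes a single decaying resonance and locates the spectral peaks, which should appear near the original 500-Hz resonance and its 1500-Hz phantom. The sampling rate and pole parameters are arbitrary illustrative choices.

```python
import numpy as np

# Numerical check of Example 14.5: cube a single decaying resonance and
# look for the phantom resonance at three times the original frequency.
fs = 8000.0
n = np.arange(4096)
r, f0 = 0.99, 500.0                        # pole radius and resonance frequency
x = r**n * np.cos(2 * np.pi * f0 / fs * n)
y = x**3                                   # memoryless cubic nonlinearity

freqs = np.fft.rfftfreq(len(n), 1.0 / fs)
Y = np.abs(np.fft.rfft(y))
band = (freqs > 1000) & (freqs < 2000)     # search around 3*f0 = 1500 Hz
print("main peak near    %.0f Hz" % freqs[np.argmax(Y)])
print("phantom peak near %.0f Hz" % freqs[band][np.argmax(Y[band])])
```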

In selecting a polynomial distortion to represent measured phantom formants, it is important to realize that both odd and even polynomial powers are necessary to match a spectrum of typical observed measurements. For polynomial distortion, odd-power terms give (anti-) symmetric nonlinearities and alone are not capable of matching even-harmonic spectral distortion; even-power terms introduce asymmetry into the nonlinearity and the necessary even harmonics. A comparison of this (anti-) symmetric/asymmetric distinction is given in Figure 14.14. The (anti-) symmetric nonlinearity of Figure 14.14a gives phantom formants beyond the second (highest) original formant, but never between the first and second formants, as is the case for any odd-powered polynomial such as the cubic x + x3. The polynomial of Figure 14.14b is given by y = x3 + 1.7x2 + x, with the even power responsible for the asymmetry and for a phantom formant between the first and second formants. Asymmetric nonlinearities are important because in dual wideband/telephone recordings, a phantom formant is often observed between the first and second formants. Moreover, asymmetry is characteristic of input/output relations of carbon-button microphones derived from physical principles (Exercise 4.15) [1],[43].

We see then that phantom formants are introduced by polynomial nonlinearities on a single impulse response. Because a speech waveform is quasi-periodic, however, an analysis based on a single vocal tract impulse response holds only approximately for its periodic counterpart; in the periodic case, tails of previous responses add to the present impulse response. Nevertheless, it has been found empirically for a range of typical pitch and multi-resonant synthetic signals that the spectral envelope of the harmonics differs negligibly from the spectrum of a single impulse response distorted by a variety of polynomials [55]. Therefore, for periodic signals of interest, phantom formants can, in practice, be predicted from the original formants of the underlying impulse response.

Figure 14.14 Illustration of phantom formants for different nonlinearities: (a) piecewise-linear (anti-) symmetric; (b) polynomial asymmetric; (c) no distortion, corresponding to the original two-resonant signal.

SOURCE: T.F. Quatieri, D.A. Reynolds, and G.C. O’Leary, “Estimation of Handset Nonlinearity with Application to Speaker Recognition” [55]. ©2000, IEEE. Used by permission.

Image

Handset Models and Estimation — In this section, we study a nonlinear handset model based on the three-component system model10 consisting of a memoryless nonlinearity sandwiched between two linear filters, illustrated in Figure 14.15. The specific selections in this model are a finite-order polynomial and finite-length linear filters.

10 The primary purpose of the pre-filter is to provide scaling, spectral shaping, and dispersion. The dispersion provides memory to the nonlinearity. This is in lieu of an actual handset model that might introduce a more complex process such as hysteresis. The post-filter provides some additional spectral shaping.

The output of the nonlinear handset model is given by

(14.11)

y[n] = h[n] * Q{g[n] * x[n]}

where Q is a Pth-order polynomial nonlinear operator which for an input value u has an output

(14.12)

Q(u) = q0 + q1u + q2u2 + … + qPuP

The sequence g[n] denotes a Jth-order FIR pre-filter and h[n] a Kth-order FIR post-filter. To formulate the handset model parameter estimation problem, we define a vector of model parameters a = [g, q, h], where g = [g[0], g[1], … g[J − 1]], q = [q0, q1, … qP], and h = [h[0], h[1], … h[K − 1]] represent the linear pre-filter, nonlinearity, and linear post-filter, respectively, so that the goal is to estimate the vector a.
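A direct implementation of the model of Equations (14.11) and (14.12) is only a few lines; this sketch assumes g, q, and h are given as arrays and applies the FIR pre-filter, the memoryless polynomial, and the FIR post-filter in cascade.

```python
import numpy as np

def handset_model(x, g, q, h):
    """Sketch of the handset model of Equations (14.11)-(14.12):
    FIR pre-filter g, memoryless polynomial Q with coefficients
    q = [q0, q1, ..., qP], and FIR post-filter h."""
    u = np.convolve(x, g)[:len(x)]          # pre-filter (scaling/dispersion)
    v = np.polyval(np.asarray(q)[::-1], u)  # Q(u) = q0 + q1*u + ... + qP*u**P
    return np.convolve(v, h)[:len(x)]       # post-filter (spectral shaping)
```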

A time-domain estimation approach is to minimize an error criterion based on waveform matching,11 such as E(a) = Σn (s[n] − y[n; a])2, where s[n] is the measurement signal, x[n] in Equation (14.11) is called the reference signal (assumed to be the input to the actual nonlinear device, as well as to the model), and where we have included the parameter vector a as an argument in the model output y[n; a]. However, because of the sensitivity of waveform matching to phase dispersion and delay, e.g., typical alignment error between the model output and measurement, an alternate frequency-domain approach is considered based on the spectral magnitude. As with other speech processing problems we have encountered, an additional advantage of this approach is that speech phase estimation is not required.

11 An alternate technique for parameter estimation that exploits waveform matching is through the Volterra series representation of the polynomial nonlinearity; with this series, we can expand the model representation and create a linear estimation problem [72]. The properties and limitations of this approach for our particular problem are explored in Exercise 14.9.

Figure 14.15 Nonlinear handset model.

SOURCE: T.F. Quatieri, D.A. Reynolds, and G.C. O’Leary, “Estimation of Handset Nonlinearity with Application to Speaker Recognition” [55]. ©2000, IEEE. Used by permission.

Image

We begin by defining an error between the spectral magnitude of the measurement and nonlinearly distorted model output. Because a speech signal is nonstationary, the error function uses the spectral magnitude over multiple frames and is given by

(14.13)

E(a) = Σp Σk ( |S(pL, ωk)| − |Y(pL, ωk; a)| )2

where S(pL, ωk) and Y(pL, ωk; a) are the discrete STFTs of s[n] and y[n; a], respectively, over an observation interval consisting of M frames, L is the frame length, and ωk = 2πk/N where N is the DFT length. Our goal is to minimize E(a) with respect to the unknown model coefficients a. This is a nonlinear problem with no closed-form solution. An approach to parameter estimation is solution by iteration, one in particular being the generalized Newton method [18], which is similar in style to other iterative algorithms we have encountered throughout the text, such as the iterative spectral magnitude-only estimation of Chapter 7 (Section 7.5.3) and the EM algorithm described earlier in this chapter (and in Appendix 14.A).

To formulate an iterative solution, we first define the residual vector f(a), whose components are given by

(14.14)

fi(a) = |S(pL, ωk)| − |Y(pL, ωk; a)|

where the index i runs over all frame-frequency pairs (p, ωk). The error in Equation (14.13) can be interpreted as the sum of squared residuals (i.e., components of the residual vector) over all frames, i.e.,

(14.15)

E(a) = fT(a)f(a)

where T denotes matrix transpose. The gradient of E(a) is given by ∇E(a) = 2JT f(a), where J is the Jacobian matrix of first derivatives of the components of the residual vector, i.e., the elements of J are given by

Jij = ∂fi(a)/∂aj

where fi is the ith element of f(a) (Exercise 14.10). The generalized Newton iteration, motivated by the first term of a Taylor series expansion of the residual vector f(a), is formulated by adding a correction term to the current approximation of a at each iteration, i.e.,

a ← a + μ Δa

where the correction Δa solves the normal equations (JTJ) Δa = −JT f(a), with J and f(a) evaluated at the current iterate, and where the factor μ scales the correction term to control convergence.12 For the residual vector definition of Equation (14.14), there is no closed-form expression for the gradient. Nevertheless, an approximate gradient can be calculated by finite-difference approximations, invoking, for example, a first-backward difference approximation to the partial derivatives needed in the Jacobian matrix [55].

12 When the number of equations equals the number of unknowns, the generalized Newton method reduces to solving J Δa = −f(a) for the correction, which is the standard Newton method.
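The following sketch puts the pieces together under simplifying assumptions: it reuses the handset_model sketch given above, assumes the reference x and measurement s are time-aligned and of equal length, uses illustrative STFT parameters, and approximates the Jacobian with simple finite differences rather than exact partial derivatives. The correction is computed through a least-squares solve, which is equivalent to the Gauss-Newton normal-equations solution.

```python
import numpy as np

def stft_mag(x, frame_len=160, hop=80, n_fft=256):
    """Magnitude STFT over successive Hamming-windowed frames (illustrative sizes)."""
    frames = [x[m:m + frame_len] * np.hamming(frame_len)
              for m in range(0, len(x) - frame_len, hop)]
    return np.abs(np.fft.rfft(np.array(frames), n_fft, axis=1))

def residual(a, x, s, J_len, P):
    """Residual of Equation (14.14): |S| - |Y(a)|, stacked over frames and
    frequencies. Assumes x (reference) and s (measurement) are time-aligned
    and equal in length; handset_model is the sketch given earlier."""
    g, q, h = a[:J_len], a[J_len:J_len + P + 1], a[J_len + P + 1:]
    y = handset_model(x, g, q, h)
    return (stft_mag(s) - stft_mag(y)).ravel()

def generalized_newton(a0, x, s, J_len, P, mu=0.5, n_iter=50, eps=1e-4):
    """Generalized (Gauss-) Newton iteration with a finite-difference Jacobian."""
    a = np.asarray(a0, dtype=float)
    for _ in range(n_iter):
        f0 = residual(a, x, s, J_len, P)
        J = np.empty((len(f0), len(a)))
        for j in range(len(a)):                   # finite-difference partials
            ap = a.copy()
            ap[j] += eps
            J[:, j] = (residual(ap, x, s, J_len, P) - f0) / eps
        delta = -np.linalg.lstsq(J, f0, rcond=None)[0]   # solves (J^T J) d = -J^T f
        a = a + mu * delta
    return a
```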

Although this estimation approach is useful in obtaining handset characteristics, the ultimate goal in addressing the speaker recognition mismatch problem is not a handset model but rather a handset mapper for the purpose of reducing handset mismatch between high- and low-quality handsets. Consider a reference signal that is from a high-quality electret handset and a measurement from a low-quality carbon-button handset. Because we assume distortion introduced by the high-quality electret handset is linear, the model of a handset, i.e., a nonlinearity sandwiched between two linear filters, is also used for the handset mapper. We refer to this transformation as a “forward mapper,” having occasion to also invoke an “inverse mapper” from the low-quality carbon-button to the high-quality electret. We can design an inverse carbon-to-electret mapper simply by interchanging the reference and measurement waveforms. The following example illustrates the method:

Example 14.6       An example of mapping an electret handset to a carbon-button handset output is shown in Figure 14.16. The data used in estimation consist of 1.5 s of a male speaker from HTIMIT, analyzed with a 20-ms Hamming window at a 5-ms frame interval. The pre-filter and post-filter are both of length 5 and the polynomial nonlinearity is of order 7. In addition, three boundary constraints are imposed on the output of the nonlinearity, given by Q(0) = 0, Q(1) = 1, and Q(−1) = −1.

Figure 14.16 Example of electret-to-carbon-button mapping: (a) electret waveform output; (b) carbon-button waveform output; (c) electret-to-carbon mapped waveform; (d) comparison of all-pole spectra from (a) (dashed) and (b) (solid); (e) comparison of all-pole spectra from (b) (solid) and (c) (dashed). All-pole 14th order spectra are estimated by the autocorrelation method of linear prediction.

SOURCE: T.F. Quatieri, D.A. Reynolds, and G.C. O’Leary, “Estimation of Handset Nonlinearity with Application to Speaker Recognition” [55]. ©2000, IEEE. Used by permission.

Image

Applying these conditions to Equation (14.12) yields the constraint equations

q0 = 0,     q1 + q2 + … + qP = 1,     −q1 + q2 − q3 + … + (−1)PqP = −1

thus reducing the number of free variables by three.

Figures 14.16a and 14.16b show particular time slices of the electret and carbon-button handset outputs, while Figure 14.16d shows the disparity in their all-pole spectra, manifested in phantom formants, bandwidth widening, and spectral flattening. Figure 14.16c gives the corresponding waveform resulting from applying the estimated mapper to the same electret output. (The time interval was selected from a region outside of the 1.5 s interval used for estimation.) Figure 14.16e compares the carbon-button all-pole spectrum to that of mapping the electret to the carbon-button output, illustrating a close spectral match. The characteristics of the mapper estimate are shown in Figure 14.17 obtained with about 500 iterations for convergence. The post-filter takes on a bandpass characteristic, while the pre-filter (not shown) is nearly flat. The nonlinearity is convex, which is consistent with the observation (compare Figures 14.16a and 14.16b) that the carbon-button handset tends to “squelch” low-level values relative to high-level values.13

13 The mapper is designed using a particular input energy level; with a nonlinear operator, changing this level would significantly alter the character of the output. Therefore, test signals are normalized to the energy level of the data used for estimation; this single normalization, however, may over- or under-distort the data and thus further consideration of input level is needed.

The inverse mapper design is superimposed on the forward design in Figures 14.17a and 14.17b. The post-filter of the forward mapper is shown with the pre-filter of the inverse mapper because we expect these filters to be inverses of each other. One notable observation is that the inverse nonlinearity is twisting in the opposite (concave) direction to that of the (convex) forward mapper, consistent with undoing the squelching imparted by the carbon-button handset. Image

Figure 14.17 Characteristics of forward and inverse handset mappings: (a) nonlinearity of forward mapper (solid) superimposed on nonlinearity of inverse mapper (dashed); (b) post-filter of forward mapper (solid) superimposed on pre-filter of inverse mapper (dashed).

SOURCE: T.F. Quatieri, D.A. Reynolds, and G.C. O’Leary, “Estimation of Handset Nonlinearity with Application to Speaker Recognition” [55]. ©2000, IEEE. Used by permission.

Image

The previous example illustrates both a forward and an inverse mapper design. Although an inverse mapper is sometimes able to reduce the spectral distortion of phantom formants, bandwidth widening, and spectral flattening, on average the match is inferior to that achieved by the forward design. It is generally more difficult to undo nonlinear distortion than it is to introduce it.

Application to Speaker Recognition — One goal of handset mapper estimation is to eliminate handset mismatch, due to nonlinear distortion, between training and test data to improve speaker recognition over telephone channels. The strategy is to assume two handset classes: a high-quality electret and a low-quality carbon-button handset. We then design a forward electret-to-carbon-button mapper and apply the mapper according to handset detection [66] on target training and test data. In this section, we look at an example of a handset mapper designed on the HTIMIT database, and also compare the mapper approach to an alternative scheme that manipulates the log-likelihood scores of Equation (14.6) in dealing with the handset mismatch problem.

In speaker recognition, linear channel compensation is performed with both CMS and RASTA processing to reduce linear channel effects on both the training and test data. This compensation, however, does not remove a nonlinear distortion contribution nor a linear prefiltering that comes before a nonlinearity in the distortion model of Figure 14.15, i.e., only the linear postfiltering distortion can be removed. In exploiting the above handset mapper design, we apply an electret-to-carbon-button nonlinear mapper to make electret speech utterances appear to come from carbon-button handsets when a handset mismatch occurs between the target training and test data. The specific mapper design was developed in Example 14.6 using the HTIMIT database and illustrated in Figure 14.17. For each recognition test trial, the mapper is applied to either the test or target training utterances, or neither, according to the specific mismatched (or matched) condition between training and test utterances, as summarized in Table 14.4.

The second approach to account for differences in handset type across training and test data modifies the likelihood scores, rather than operating on the waveform, as does the handset mapper approach. This approach introduced by Reynolds [67] is referred to as hnorm, denoting handset normalization. Hnorm was motivated by the observation that, for each target model, likelihood ratio scores have different means and ranges for utterances from different handset types. The approach of hnorm is to remove these mean and range differences from the likelihood ratio scores. For each target model, we estimate the mean and standard deviation of the scores of same-sex non-target (imposter) utterances from different handset types, and then normalize out the mean and variance to obtain a zero-mean and unity-variance score distribution. For example, for the carbon-button-handset type, during testing, we use the modified score

Λ̃(x) = [Λ(x) − μcarb] / σcarb

where the target’s likelihood score Λ(x) is normalized by the mean μcarb and standard deviation σcarb estimates, based on the handset label assigned to the feature vector x. In effect, we normalize out the effect of the model (target) on the imposter score distribution for each handset type. This normalization results in improved speaker verification performance when using a single threshold for detection evaluation, as in the generation of Detection Error Tradeoff (DET) curves. More details and motivation for hnorm can be found in [62],[67].
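In code, hnorm is a per-target, per-handset z-normalization of the likelihood ratio score; the sketch below assumes the handset-dependent imposter-score statistics have already been estimated for the target model in question.

```python
def hnorm(score, handset_label, stats):
    """Sketch of handset normalization (hnorm): z-normalize a target model's
    likelihood ratio score using imposter-score mean/standard deviation
    estimated separately for each handset type.

    stats : e.g., {'carbon': (mu_carb, sigma_carb), 'electret': (mu_elec, sigma_elec)},
            estimated per target model from same-sex imposter utterances.
    """
    mu, sigma = stats[handset_label]
    return (score - mu) / sigma
```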

Table 14.4 Mapping on testing and target training data.

Image

For either hnorm or the mapper approach, we require a handset detector, i.e., we need to determine whether the utterance comes from electret or carbon-button handset type. The automatic handset detector is based on a maximum-likelihood classifier using Gaussian mixture models built to discriminate between speech originating from a carbon-button handset and speech originating from an electret handset [66] (Exercise 14.5).

Example 14.7       We use a GMM recognizer applied to the 1997 NIST evaluation database [42] to illustrate the handset mapper and hnorm concepts [55]. Specifically, speaker verification is based on the 2048-mixture GMM-UBM classifier described earlier in Section 14.4.3 and 19 mel-cepstral coefficients (with companion delta-cepstra) over a 4000-Hz bandwidth. Training the UBM uses the 1996 NIST evaluation database [42]. Both a matched and mismatched condition were evaluated. In the matched condition, the target training and test data come from the same telephone number and the same handset type for target test trials, but different telephone numbers (and possibly different handset types) for the non-target (imposter) test trials. In the mismatched condition, training and test data come from different telephone numbers and thus possibly different handset types for both target and non-target (imposter) test trials. A performance comparison of the mapper, hnorm, and baseline (neither mapped nor score-normalized) is given in Figure 14.18 for 250 male target speakers and 2 min training and 30 s test utterances. In obtaining means and standard deviations for hnorm, 1996 NIST evaluation data is passed through the target models derived using the 1997 NIST evaluation database; 300 carbon-button and 300 electret test utterances were used.

For the mismatched condition, the handset mapper approach resulted in DET performance roughly equivalent to that using score normalization, i.e., hnorm. It is interesting that, for this example, two very different strategies, one operating directly on the waveform, as does the mapper, and the other operating on log-likelihood ratio scores, as does hnorm, provide essentially the same performance gain under the mismatched condition. On the other hand, as one might expect, using the handset mapper for the matched condition slightly hurt performance because it is applied only to non-target (imposter) test trials (for which there can be handset mismatch with target models), thus minimizing the channel mismatch in these trials and increasing the false alarm rate. Image

The same strategy used in the previous example has also been applied to design an optimal linear mapper, i.e., a handset mapper was designed without a nonlinear component using spectral magnitude-only estimation. The linear mapper was designed under the same conditions as the nonlinear mapper used in our previous speaker verification experiments, and provides a highpass filter characteristic. Because a linear distortion contribution is removed by CMS and RASTA, specifically the linear post-filter in our distortion model, a mapper designed with only a linear filter should not be expected to give an additional performance gain, performing inferior to the nonlinear mapper. Indeed, when this linear mapper was applied in training and testing according to the above matched and mismatched conditions, the DET performance is essentially identical to that with no mapper, and thus is inferior to that with the corresponding nonlinear mapper [55]. Finally, we note that when score normalization (hnorm) is combined with the handset mapping strategy, a small overall performance gain is obtained using the databases of Example 14.7, with some improvement over either method alone under the mismatched condition, and performance closer to that with hnorm alone under the matched condition [55]. Nevertheless, over a wide range of databases, hnorm has proven more robust than the mapping technique, indicating the need for further development of the mapper approach [55].

Figure 14.18 Comparison of DET performance using the nonlinear handset mapper and handset normalization of likelihood scores (hnorm) in speaker verification: baseline (dashed), hnorm (solid), and nonlinear mapper (dashed-dotted). Upper curves represent the mismatched condition, while lower curves represent the matched condition.

SOURCE: T.F. Quatieri, D.A. Reynolds, and G.C. O’Leary, “Estimation of Handset Nonlinearity with Application to Speaker Recognition” [55]. ©2000, IEEE. Used by permission.

Image

14.5.3 Other Approaches

The signal processing methods (in contrast to statistical approaches such as hnorm) of CMS, RASTA, and the nonlinear handset mapper all attempt to remove or “equalize” channel distortion on the speech waveform to provide enhanced features for recognition. An alternate approach is to extract features that are insensitive to channel distortion, referred to as channel invariant features. The source onset times of Section 14.4.2 provide one example of features that, to some extent, are channel invariant. These approaches to channel equalization and feature invariance represent just a few of the many possibilities for dealing with adverse and mismatched channel conditions. Below we briefly describe a small subset of the emerging techniques in these areas, based on speech signal processing and modeling concepts.

Channel Equalization —
Noise reduction:
With regard to additive noise, various methods of noise suppression preprocessing show evidence of helping recognition. For example, the generalized spectral subtraction method of Section 13.5.3 applied to the mel-scale filter energies is found to enhance performance of a GMM-based recognizer under the mismatched condition of clean training data and white-noise-corrupted test data [12]. Further performance improvement is obtained by applying noise suppression algorithms to the noise-corrupted STFT magnitude, prior to determining the mel-scale filter energies [45]. The noise suppression algorithms include generalized spectral subtraction and the adaptive Wiener filtering described in Chapter 13.
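A minimal sketch of the kind of generalized spectral subtraction referred to here, applied to magnitude-squared spectra or mel-scale filter energies, follows; the over-subtraction and flooring factors are illustrative choices, not values from [12] or [45].

```python
import numpy as np

def spectral_subtraction(energy, noise_energy, alpha=2.0, beta=0.02):
    """Sketch of generalized spectral subtraction applied to magnitude-squared
    spectra or, equivalently, to mel-scale filter energies: subtract an
    over-estimated noise level and floor the result to avoid negative values.
    alpha (over-subtraction) and beta (flooring) are illustrative factors."""
    cleaned = energy - alpha * noise_energy
    return np.maximum(cleaned, beta * noise_energy)
```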

Missing feature removal: The idea in this approach is to remove badly corrupted elements of a feature vector during testing, e.g., under the mismatched condition of training on “clean” data and testing on “dirty” data, to improve recognition performance. Motivation for this approach is the observation by Lippman [34] that speech recognition performance can improve when appropriately selecting features with known corruption by filtering and noise. A challenging aspect of this approach is feature selection when the corrupting channel is unknown. Drygajlo and El-Maliki [12] developed a missing feature detector based on the generalized spectral subtraction noise threshold rules applied to the mel-scale filter energies, and then dynamically modified the probability computations performed in the GMM recognizer based on these missing features (Exercise 14.12). This entails dividing the GMM densities into “good feature” and “bad feature” contributions based on the missing feature detector. Padilla [45] provided an alternative to this approach by developing a missing feature detector based on the STFT magnitude, rather than the mel-scale filter energies, and then mapping a frequency-based decision to the mel-scale. Because the STFT magnitude provides for improved detection resolution (over frequency) for determining good and bad spectral features, significant speaker recognition performance improvement was obtained over a detector based on mel-scale filter energies.
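For a diagonal-covariance GMM, ignoring the corrupted dimensions amounts to marginalizing them out of each mixture component. The sketch below scores a single feature vector using only the dimensions flagged reliable by a hypothetical missing-feature detector; it illustrates the idea rather than the specific detectors of [12] or [45].

```python
import numpy as np

def gmm_loglike_reliable(x, mask, w, mu, var):
    """Sketch: GMM log-likelihood of a feature vector x using only the
    dimensions flagged reliable by `mask` (True = keep). For diagonal
    covariances this amounts to marginalizing out the missing dimensions.

    w : (M,) mixture weights; mu, var : (M, D); x, mask : (D,)
    """
    xr, mur, varr = x[mask], mu[:, mask], var[:, mask]
    log_g = -0.5 * (np.sum((xr - mur)**2 / varr, axis=1)
                    + np.sum(np.log(2 * np.pi * varr), axis=1))
    a = np.log(w) + log_g
    m = a.max()
    return m + np.log(np.sum(np.exp(a - m)))    # log-sum-exp over mixtures
```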

Channel Invariance —
Formant AM-FM: We have seen in the frequency domain that both amplitude and frequency modulation of a sinusoid add energy around the center frequency of the sinusoid as, for example, in Figure 7.8; as more modulation is added, the energy spreads more away from the center frequency. Therefore, formant AM and FM are expected to contribute only locally to the power spectrum of the waveform, i.e., only around the formant frequency. Motivated by this observation, Jankowski [26] has shown that as long as the frequency response of a channel is not varying too rapidly as a function of frequency, this local property of modulation is preserved, being robust not only to linear filtering but also to typical telephone-channel nonlinear distortions. This property was determined by subjecting a synthetic speech resonance with frequency and bandwidth modulation to estimated linear and nonlinear (cubic) distortion of nine different telephone channels from the NTIMIT database. The modulations considered include the increase in bandwidth associated with the truncation effect and the variation of the first-formant center frequency during the glottal open phase, both described in Chapters 4 and 5. Using the Teager energy-based AM-FM estimation algorithm of Chapter 11 (Section 11.5.3), Jankowski determined both analytically and empirically the change in formant AM and FM as test signals passed through each channel. The resulting AM and FM were observed to be essentially unchanged when passed through the simulated NTIMIT channels. In other words, typical formant AM and FM barely change with linear and nonlinear distortions of the NTIMIT telephone channel. Jankowski also showed that a change in first-formant bandwidth and frequency multipliers14 for several real vowels was negligible when the vowels were passed through the simulated channels. As such, features derived from formant modulations indicate promise for channel invariance in speaker recognition [26],[27].

14 Recall from Section 4.5.2 that an effect of glottal/vocal tract interaction is to modulate formant bandwidth and frequency, represented by time-dependent multipliers.

Sub-cepstrum: In a series of experiments by Erzin, Cetin, and Yardimci [14] using an HMM speech recognizer, it was shown that the sub-cepstrum could outperform the mel-cepstrum features in the presence of car noise over a large range of SNR. In a variation of this approach, Jabloun and Cetin [25] showed that replacing the standard energy measure by the Teager energy in deriving the sub-cepstrum can further enhance robustness to additive car noise with certain autocorrelation characteristics. This property is developed in Exercise 14.13. In yet another variation of the sub-cepstrum, changes in subband energies are accentuated using auditory-like nonlinear processing on the output of a gammachirp-based filter bank (Chapter 11, Section 11.2.2) [75]. A cepstrum derived from these enhanced energies again outperforms the mel-cepstrum in a speech recognition task in noise. To date, however, the demonstrated robustness of the sub-cepstrum and its extensions has not been related to the temporal resolution capability that we described in Section 14.2.3. Moreover, the sub-cepstrum has yet to be investigated for speaker recognition.

14.6 Speaker Recognition from Coded Speech

Due to the widespread use of digital speech communication systems, there has been increasing interest in the performance of automatic recognition systems on quantized speech produced by the waveform, model-based, and hybrid speech coders described in Chapter 12. The question arises as to whether the speech quantization and encoding can affect the resulting mel-cepstral representations, which are the basis for most recognition systems, including speech, speaker, and language recognition [41],[56]. There is also interest in performing recognition directly, using model-based or hybrid coder parameters rather than the synthesized coded speech [23],[57]. Possible advantages include reduced computational complexity and improved recognition performance by bypassing degradation incurred in the synthesis stage of speech coding. In this section, we investigate the effect of speech coding on the speaker verification task.

14.6.1 Synthesized Coded Speech

Three speech coders that are international standards and that were introduced in Chapter 12 are: G.729 (8 kbps), GSM (12.2 kbps), and G.723.1 (5.3 and 6.3 kbps). All three coders are based on the concept of short-time prediction corresponding to an all-pole vocal tract transfer function, followed by long-term prediction resulting in pitch (prediction delay and gain) and residual spectral contributions. The short-term prediction is typically low-order, e.g., order 10, and the residual represents the speech source as well as modeling deficiencies. The primary difference among the three coders is the manner of coding the residual.

Speaker verification has been performed on the speech output of these three coders using a GMM-UBM system [56]. The features consist of 19 appended mel-cepstra and delta cepstra (1st-order polynomial) derived from bandlimited (300–3300 Hz) mel-scale filter energies. Both CMS and RASTA filtering are performed on the mel-cepstrum features prior to training and testing. The training and testing utterances are taken from a subset of the 1998 NIST evaluation database [42]; in these experiments, both are limited to electret handsets, but possibly different phone numbers. Fifty target speakers are used for each gender with 262 test utterances for males and 363 for females. Target training utterances are 30 s in duration and test utterances 1 s in duration. Background models are trained using the 1996 NIST evaluation database and 1996 NIST evaluation development database [42]. As with other recognition scenarios, we can consider either a matched condition in which the target training, test, and background data are all coded, or a mismatched condition in which one subset of the three data sets is coded and another uncoded.

It was shown that, using speech synthesized from the three coders, GMM-UBM-based speaker verification generally degrades as the coder bit rate decreases, i.e., from GSM to G.729 to G.723.1, relative to an uncoded baseline, whether in a matched or mismatched condition, with the mismatched condition tending to perform somewhat worse [56].

14.6.2 Experiments with Coder Parameters

We mentioned above that in certain applications, we may want to avoid reconstructing the coded speech prior to recognition. In this section, we select one coder, G.729, and investigate speaker verification performance with its underlying parameters, exploiting both vocal tract and source modeling.

Vocal Tract Parameters — We begin with speaker verification using the mel-cepstrum corresponding to the mel-scale filter energies derived from the all-pole G.729 spectrum. Because G.729 transmits ten line spectral frequency (LSF) parameters [Chapter 12 (Section 12.6.3) introduces the LSF concept], conversion of the G.729 LSFs to mel-cepstra is first performed and is given by the following steps: (1) Convert the ten LSFs to ten all-pole predictor coefficients (Exercise 12.18) from which an all-pole frequency response is derived; (2) Sample the all-pole frequency response at DFT frequencies; (3) Apply the mel-scale filters and compute the mel-scale filter energies:

Emel(n, l) = Σω |Vl(ω)|2 |Ĥ(n, ω)|2

where the sum is over the DFT frequencies ω, Ĥ(n, ω) is the all-pole transfer function estimate at frame time n, and Vl(ω) is the frequency response of the lth mel-scale filter; (4) Convert the mel-scale filter energies to the mel-cepstrum. In speaker verification experiments, an order 19 mel-cepstrum is used with CMS and RASTA compensation, along with the corresponding delta cepstrum.
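
To make steps (2)–(4) concrete, the following Python sketch (added for illustration; the FFT length, the filterbank matrix V of squared mel-filter magnitudes, and the DCT-based cepstrum computation are assumptions rather than the exact choices used in these experiments) computes mel-cepstra from a set of all-pole predictor coefficients, i.e., after step (1) has been carried out:

import numpy as np

def allpole_to_melcep(a, V, n_fft=512, n_cep=19):
    # a: predictor coefficients a[1..p] of the all-pole model 1/(1 - sum_k a[k] z^-k).
    # V: (n_filters, n_fft//2 + 1) matrix of squared mel-filter magnitudes |Vl(w)|^2
    #    sampled at the DFT frequencies.
    A = np.concatenate(([1.0], -np.asarray(a, dtype=float)))  # denominator polynomial
    H = 1.0 / np.fft.rfft(A, n_fft)            # step (2): all-pole response at DFT bins
    E = V @ (np.abs(H) ** 2)                   # step (3): mel-scale filter energies
    log_E = np.log(E + 1e-12)
    L = V.shape[0]                             # step (4): cosine transform of log energies
    m = np.arange(1, n_cep + 1)[:, None]
    basis = np.cos(np.pi * m * (np.arange(L) + 0.5) / L)
    return basis @ log_E                       # order n_cep mel-cepstrum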

Compared to previous speaker verification results with G.729 coded speech, performance with the above mel-cepstrum degrades significantly for both males and females. The result for the matched training and testing scenario is shown in Figure 14.19. Conditions for training and testing are identical to those used in the previous recognition experiments with synthesized coded speech. One possible explanation for the performance loss is that the mel-cepstrum derived from an all-pole spectrum is fundamentally different from the conventional mel-cepstrum: we are smoothing an already smooth all-pole envelope, rather than a high-resolution short-time Fourier transform as in the conventional scheme, thus losing spectral resolution.

An alternative is to seek a feature set that represents a more direct parameterization of the all-pole spectrum—specifically, either LSF coefficients, available within G.729, or a transformation of the LSF coefficients. In addition, we want these features to possess the property that linear channel effects are additive, as assumed in the channel compensation techniques of CMS and RASTA. It has been observed that the LSFs have this property only in the weak sense that a linear channel results in both a linear and nonlinear distortion contribution to the LSFs [57]. Cepstral coefficients, on the other hand, being a Fourier transform of a log-spectrum, have this additivity property. Moreover, it is possible to (reversibly) obtain a cepstral representation from the all-pole predictor coefficients derived from the LSFs. This is because there exists a one-to-one relation between the first p values (excluding the 0th value) of the cepstrum of the minimum-phase, all-pole spectrum, and the corresponding p all-pole predictor coefficients. This cepstral representation is distinctly different from the mel-cepstrum obtained from the mel-scale filter energies of the all-pole spectral envelope and can be computed recursively by (Exercise 6.13)

c[n] = a[n] + (1/n) Σ k c[k] a[n − k],   n > 0

where the sum is over k = 1, 2, …, n − 1, and where a[n] represents the linear predictor coefficients in the predictor polynomial P(z) = a[1]z^−1 + a[2]z^−2 + … + a[p]z^−p (with a[n] = 0 for n > p), the all-pole transfer function being of the form 1/[1 − P(z)]. We refer to this recursively computed cepstrum as the rec-cepstrum, in contrast to the mel-cepstrum derived from the mel-scale filter energies of the all-pole spectrum. Although the rec-cepstrum is, in theory, infinite in length, only the first p coefficients are needed to represent a[n] (with p = 10 for G.729) because there exists a one-to-one relation between a[n] and the first p values of c[n] for n > 0 (Exercise 6.13).
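
A minimal Python sketch of this recursion (added for illustration, assuming the predictor-coefficient convention above; it is not code from the text) is:

import numpy as np

def rec_cepstrum(a, n_cep=10):
    # a: predictor coefficients a[1..p] of the all-pole model 1/[1 - P(z)].
    # Returns c[1..n_cep] from c[n] = a[n] + (1/n) * sum_{k=1}^{n-1} k*c[k]*a[n-k],
    # with a[n] taken as zero for n > p.
    a = np.asarray(a, dtype=float)
    p = len(a)
    c = np.zeros(n_cep + 1)
    for n in range(1, n_cep + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if 1 <= n - k <= p:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]

For G.729, a holds the ten predictor coefficients obtained from the transmitted LSFs, and the first ten cepstral values form the rec-cepstrum feature vector.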

The rec-cepstrum is seen in Figure 14.19 to give a performance improvement over the mel-cepstrum. Speaker verification is performed using a feature vector consisting of the first 10 rec-cepstral coefficients (dashed) in comparison to 19 mel-cepstral coefficients (thin-solid), along with companion delta cepstra in each case. Figure 14.19 shows some improvement for males and a significant improvement for females. The figure also shows that, in spite of improvements, we have not reached the performance of the standard mel-cepstrum of the synthesized (coded) speech, referred to as the baseline (dashed-dotted) in the caption of Figure 14.19. If we attempt to improve performance by increasing the number of rec-cepstral coefficients beyond ten, we find a decrease in performance. This is probably because the additional coefficients provide no new information while an increase in feature vector length typically requires more training data for the same recognition performance.

Source (Residual) Parameters — One approach to further recover the performance obtained with G.729 synthesized speech is to append parameters that represent the G.729 residual to the G.729 spectral feature vector. These include the pitch15 (i.e., the long-term-prediction delay), gain (i.e., long-term-prediction gain), residual codebooks, and energy [24]. To first determine the incremental importance of pitch,16 we append pitch to the feature vector consisting of the rec-cepstrum and delta rec-cepstrum, yielding a 21-element vector. The resulting recognition performance is shown in Figure 14.19. We see that there is a performance improvement from appending G.729 pitch (solid) relative to the use of the rec-cepstrum only (dashed), resulting in performance close to the baseline. It is interesting that the extent of the relative improvement seen for female speakers is not seen for male speakers, reflecting the greater contribution of pitch to recognition for female speakers, consistent with its influence on the mel-cepstrum as was illustrated in Figure 14.11. We also see in Figure 14.19 that, although we have made additional performance gains with G.729 pitch, we still have not reached the performance obtained with G.729 synthesized speech.17 Nevertheless, other residual characteristics have not yet been fully exploited [57].

15 For a fundamental frequency ωo, we use log(ωo + 1) rather than pitch, itself, because the logarithm operation reduces dynamic range and makes the pitch probability density function (as estimated by a histogram) more Gaussian, thus more amenable to the GMM.

16 Pitch alone has been used with some success in automatic speaker identification systems [3].

17 The LSFs and pitch in these experiments are unquantized. With quantization, the relative performance is preserved, with a negligible overall decline.

Figure 14.19 G.729 DET performance under the matched condition for baseline (dashed-dotted), rec-cepstrum + pitch (thick-solid), rec-cepstrum (dashed), and mel-cepstrum from the all-pole envelope (thin-solid). The baseline system used the mel-cepstrum from the STFT magnitude. The rec-cepstrum is of order 10 and the mel-cepstrum is of order 19. The left panel gives the performance for female speakers and the right panel for male speakers.

Image

14.7 Summary

In this chapter we applied principles of discrete-time speech signal processing to automatic speaker recognition. Our main focus was on signal processing, rather than recognition, leaving the multitude of statistical considerations in speaker recognition to other expositions [76]. Signal processing techniques were illustrated, using a variety of real-world databases, under the many different conditions in which speaker recognition is performed.

We began this chapter with the first step in speaker recognition: estimating features that characterize a speaker’s voice. We introduced the spectral-based mel-cepstrum and the related sub-cepstrum, both of which use an auditory-like filter bank in deriving spectral energy measures. A comparison of the mel-cepstrum and sub-cepstrum revealed issues in representing the temporal dynamics of speech in feature selection, using the filtering interpretation of the STFT. We then introduced three different approaches to speaker recognition, with the Gaussian mixture model (GMM) approach being the most widely used to date. We saw that the advantages of the GMM lie in its “soft” assigning of features to speech sound classes and its ability to represent any feature probability density function. We next described the use of a variety of non-spectral features in speaker recognition, specifically based on source characterization, including parameters of the glottal flow derivative and onset times of a generalized distributed source. This led to insight into the relative importance of source, spectrum, and prosodic features, gained by modifying speech characteristics through the sinusoidal modification techniques that we developed in Chapter 9. This study gave a fascinating glimpse of the difference in how automatic (machine) recognizers and the human listener view speaker identifiability, and of the need for automatic recognizers, currently spectral-based, to exploit non-spectral features. Non-spectral features will likely involve not only new low-level acoustic speech characteristics of a temporal nature, such as features derived from temporal envelopes of auditory-like subband filter outputs,18 but also high-level properties that were discussed in this chapter’s introduction.

18 In Section 14.2, we refer to the mel- and sub-cepstrum as “spectral features.” However, when we realize from Chapter 7 (Section 7.4.2) that a sequence can be recovered from its STFT magnitude (indeed, even possibly from two samples of the STFT magnitude at each time instant), the distinction between “spectral” and “non-spectral” becomes nebulous.

In the second half of the chapter, we addressed speaker recognition under a variety of mismatched conditions in which training and test data are subjected to different channel distortions. We first applied the CMS and RASTA linear channel compensation methods of Chapter 13. We saw, however, that this compensation, although able to improve speaker recognition performance, addressed only part of the problem. In the typical mismatched condition, such as with wireline and cellular telephony or tactical communication environments, training and test data may arise not only from high-quality linear channels but also from low-quality nonlinear channels. This problem leads to the need for nonlinear channel compensation. Two such approaches were described: one that modifies the speech waveform with a nonlinearity derived from a magnitude-only spectral matching criterion, and the second that normalizes the log-likelihood scores of the classifier. Application of the noise reduction approaches of Chapter 13 and the removal of missing features, i.e., features that are irrevocably degraded, were also described as means of addressing the mismatch problem. Alternatively, one can deal with this problem through channel-invariant features, with source onset times, the sub-cepstrum, and parameters of AM-FM in formants as possibilities. In the final topic of this chapter, we addressed a different class of degrading conditions in which we performed speaker recognition from coded speech. This is an increasingly important problem as the use of digital communication systems becomes more widespread.

The problems of robustness and channel mismatch in speech-related recognition remain important signal processing challenges. For example, although signal processing has led to large performance gains under the mismatched condition for telephone communications, with the onset of Internet telephony come different and even more difficult problems of mismatch in training and testing. Indeed, new solutions are required for such environmental conditions. Such solutions will likely rely on an understanding of the human’s ability to outperform the machine under these adverse conditions [73], and thus call upon new nonlinear auditory models [20],[75], as well as models that are matched to nonlinearities in the speech production mechanism [80].

Appendix 14.A: Expectation-Maximization (EM) Estimation

In this appendix, we outline the steps required by the Expectation-Maximization (EM) algorithm for estimating GMM parameters. Further details can be found in [9],[60].

Consider the probability of occurrence of a collection of observed feature vectors, X = {x1, x2, …, xM}, from a particular speaker, as the union of probabilities of all possible states (e.g., acoustic classes):

p(X|λ) = p(x1|λ) p(x2|λ) · · · p(xM|λ),   with   p(xn|λ) = Σi pi bi(xn)

where Σi denotes the sum over all possible (hidden) acoustic classes i = 1, 2, …, I, where pi are the weights for each class, and where the probability density function (pdf) for each state is

bi(x) = [1/((2π)^(R/2) |Σi|^(1/2))] exp{−(1/2)(x − μi)^T Σi^(−1)(x − μi)}

with R the dimension of the feature vector, and where we have assumed the independence of feature vectors across the M observations (analysis frames in the speech context). The function bi(x) is an R-dimensional Gaussian pdf with a state-dependent mean vector μi and covariance matrix Σi. We call p(X|λ) the Gaussian mixture model (GMM) for the feature vectors. The symbol λ represents the set of GMM mean, covariance, and weight parameters over all classes for a particular speaker, i.e.,

λ = {pi, μi, Σi},   i = 1, 2, …, I.

Suppose now that we have an estimate of λ denoted by λk. Our objective is to find a new estimate λk+1 such that

p(X|λk+1) ≥ p(X|λk).

The EM algorithm maximizes, with respect to the unknown λk+1, the expectation of the log-likelihood function log[p(X|λk+1)], given the observed feature vectors X and the current estimate (iterate) λk. This expectation over all possible acoustic classes is given by [9],[13],[60]

E{log[p(X|λk+1)]} = Σn Σi p(i|xn, λk) log[pi bi(xn)]

where the sums are over the frames n = 1, 2, …, M and the classes i = 1, 2, …, I, where pi and bi(x) denote the weight and pdf of the ith mixture component under λk+1, and where p(i|xn, λk) is the class posterior probability given below.

Forming this sum is considered the expectation step of the EM algorithm. It can be shown, using the above formulation, that maximizing E{log[p(X|λk+1)]} over λk+1 does not decrease the likelihood, i.e., p(X|λk+1) ≥ p(X|λk). The solution to this maximization problem is obtained by differentiating E{log[p(X|λk+1)]} with respect to the unknown GMM mean, covariance, and weight parameters, i.e., μi, Σi, and pi, and is given by

pi(k+1) = (1/M) Σn p(i|xn, λk)

μi(k+1) = [Σn p(i|xn, λk) xn] / [Σn p(i|xn, λk)]

Σi(k+1) = [Σn p(i|xn, λk)(xn − μi(k+1))(xn − μi(k+1))^T] / [Σn p(i|xn, λk)]

with the sums taken over n = 1, 2, …, M,

where

p(i|xn, λk) = pi(k) bi(k)(xn) / [Σj pj(k) bj(k)(xn)]

where bi(k)(x) is the ith pdf mixture component on the kth iteration, where the sum in the denominator is over the I mixture components, and where T in the covariance expression denotes matrix transpose. This is referred to as the maximization step of the EM algorithm. The new model parameters then replace the old parameters in the GMM to give the next estimate of the mixture components bi(k+1)(x) (and weights), and the procedure is repeated.

An interesting interpretation of the EM algorithm is that it provides a “soft” clustering of the training feature vectors X into class components. In fact, we can think of the EM algorithm as a generalization of the k-means algorithm described in Chapter 12 (Section 12.4.2), where, rather than assigning each xn to one centroid, each xn is assigned probabilistically to all centroids. At each iteration of the EM algorithm, there arises a probability p(in = i|xn, λk) that an observed individual feature vector xn comes from each of the assumed I classes. Thus we can “assign” each feature vector to the class from which it comes with the largest probability. On each iteration, the mixture component parameter estimation of the EM algorithm uses these probabilities. For example, the estimation equation for the mean vector of the ith mixture component is a weighted sum of the observed feature vectors, where the weight for each feature vector is the (normalized) probability that the feature vector came from class i. Similar interpretations hold for the estimation equations for the covariance matrix and the class weight associated with each Gaussian mixture component.
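
The soft-clustering view above maps directly onto code. The following Python sketch (added for illustration and assuming diagonal covariance matrices, as is common in speaker recognition; it is not the implementation of [60]) carries out one EM iteration, with the E-step computing the posteriors p(i|xn, λk) and the M-step re-estimating the weights, means, and variances:

import numpy as np

def em_iteration(X, weights, means, variances):
    # One EM iteration for a diagonal-covariance GMM.
    # X: (M, R) feature vectors; weights: (I,) mixture weights pi;
    # means: (I, R) mean vectors; variances: (I, R) diagonal covariance entries.
    M, R = X.shape
    I = weights.shape[0]
    # E-step: log of each weighted Gaussian component, then normalized posteriors.
    log_b = np.empty((M, I))
    for i in range(I):
        diff = X - means[i]
        log_b[:, i] = (-0.5 * np.sum(diff ** 2 / variances[i], axis=1)
                       - 0.5 * np.sum(np.log(2.0 * np.pi * variances[i])))
    log_post = np.log(weights) + log_b
    log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)           # p(i | x_n, lambda_k)
    # M-step: re-estimate weights, means, and diagonal covariances from soft counts.
    counts = post.sum(axis=0)
    new_weights = counts / M
    new_means = (post.T @ X) / counts[:, None]
    new_variances = (post.T @ (X ** 2)) / counts[:, None] - new_means ** 2
    new_variances = np.maximum(new_variances, 1e-6)   # variance floor, common in practice
    return new_weights, new_means, new_variances

Iterating em_iteration to (approximate) convergence, starting for example from a k-means initialization, yields the GMM parameter set λ for a given speaker’s training feature vectors.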

Exercises

14.1 Show that when each individual Gaussian mixture component of the GMM in Equation (14.4) integrates to unity, then the constraint Image ensures that the mixture density represents a true pdf, i.e., the mixture density itself integrates to unity. The scalars pi are the mixture weights.

14.2 We saw in Section 14.3.3 that GMM modeling of a pdf takes place through appropriate selection of the means Image, covariance matrices Σi, and probability weights pi of the GMM.

(a) Assuming that diagonal covariance matrices Σi are sufficient for good pdf approximations in GMM modeling, show that this simplification results in significantly reducing the number of unknown variables to be estimated. Assume a fixed number of mixtures in the GMM.

(b) Argue that a diagonal covariance matrix Σi does not imply statistical independence of the underlying feature vectors. Under what condition on the GMM are the feature vectors statistically independent?

14.3 In this problem, we investigate the log-likelihood test associated with the GMM-based speaker verification of Section 14.3.3.

(a) Discarding the constant probability terms and applying Bayes’ rule and the logarithm to Equation (14.5), show that

Image

where Image, is a collection of M feature vectors obtained from a test utterance, and λC and Image denote the claimant and background speaker models, respectively.

(b) In one scenario, the probability that a test utterance does not come from the claimant speaker is determined from a set of imposter (background) speaker models Image. In this scenario, the single background model Image is created from these individual models. Write an expression for the single (composite) background speakers’ log-probability, first in terms of Image for j = 1, 2, …, B, and then as a function of the individual densities Image. Assume equally likely background speakers. Hint: Write Image as a joint probability density that the test utterance comes from one of the B background speakers.

(c) An alternate method for background “normalization” of the likelihood function is to first compute the claimant log-probability score and then subtract the maximum of the background scores over individual background (imposter) models, i.e.,

Image

where Image denotes the individual background GMM models. Justify why this might be a reasonable approach to normalization with a variety of imposters, some close to the target and some far with respect to voice character.

(d) Observe that we could use simply the log-probability score of the claimed speaker, log[p(XC)], for speaker verification, without any form of background normalization. Argue why it is difficult to set a decision threshold in using this score for speaker detection (i.e., verification). Argue why, in contrast, the speaker identification task does not require a background normalization.

14.4 In this problem you consider the delta cepstrum in speaker recognition for removing a linear time-invariant channel degradation with frequency response G(ω), slowly varying in frequency, as given in Equation (14.9).

(a) Show that the difference of the mel-scale filter energy over two consecutive frames, i.e.,

Δ log[Emel(pL, l)] = log[Emel(pL, l)] − log[Emel((p − 1)L, l)]

has no channel contribution G(ω) under the slowly varying assumption leading to Equation (14.9).

(b) Show that differencing the mel-cepstrum over two consecutive frames to form a delta cepstrum is equivalent to the log-energy difference of part (a). Hint: Determine the Fourier transform of Δ log[Emel(pL, l)].

14.5 Design a GMM-based handset recognition system that detects when an utterance is generated from a high-quality electret handset or a low-quality carbon-button handset. Assume you are given a training set consisting of a large set of electret and carbon-button handset utterances. Hint: Use the same methodology applied in the design of a GMM-based speaker recognition system.

14.6 An important problem in speaker verification is detecting the presence of a target speaker within a multi-speaker conversation. In this problem, you are asked to develop a method to determine where the target speaker is present within such an utterance. Assume that a GMM target model λC and a GMM background model Image have been estimated using clean, undistorted training data.

(a) Suppose you estimate a feature vector Image at some frame rate over a multi-speaker utterance where n here denotes the frame index. You can then compute a likelihood score for each frame as

Image

and compare this against a threshold. Explain why this approach to speaker tracking is unreliable. Assume that there is no channel mismatch between the training and test data so that channel compensation is not needed.

(b) An alternative approach to speaker tracking is to form a running string of feature vectors over M consecutive frames and sum all frame likelihoods above a threshold, i.e.,

Image

where n represents the index of the feature vectors within the M -frame sliding “window” (denoted by I) and where Λ(xn) denotes the likelihood ratio based on target and background speaker scores from part (a). When Λ(I) falls above a fixed threshold, we say that the target speaker is present at the window center. (Note that there are many variations of this sliding-window approach.) Discuss the advantages and disadvantages of this log-likelihood statistic for tracking the presence of the target speaker. Consider the tradeoff between temporal resolution and speaker verification accuracy for various choices of the window size M, accounting for boundaries between different speakers and silence gaps.

(c) Consider again a test utterance consisting of multi-speakers, but where the speech from each talker has traversed a different linear channel. How would you apply cepstral mean subtraction (CMS) in a meaningful way on the test utterance, knowing that more than one linear channel had affected the utterance? If you were given a priori the talker segmentation, i.e., the boundaries of each talker’s speech, how might you modify your approach to cepstral mean subtraction?

(d) Repeat part (c), but where RASTA is applied to the test utterance. Does RASTA provide an advantage over CMS in the scenarios of part (c)? Explain your reasoning.

(e) You are now faced with an added complexity; not only are there present different speakers within an utterance whose speech is distorted by different linear channels, but also each speaker may talk on either an electret or a carbon-button telephone handset, and this may occur in both the training and test data. Assuming you have a handset detector that can be applied over a string of feature vectors (e.g., Exercise 14.5), design a method to account for handset differences between training and test data, as well as within the test utterance. You can use any of the approaches described in this chapter, including handset mapping and handset normalization (hnorm).

14.7 In speaker recognition, it is revealing to compare the importance of measured features for human and machine speaker recognition. A perception of breathiness, for example, may correspond to certain coarse- and fine-structure characteristics in the glottal flow derivative such as a large open quotient and aspiration, respectively, and its importance for the human may imply importance for machine recognition. On the other hand, certain features useful to machines may not be useful to humans. Propose some possible source and vocal tract system features for this latter case.

14.8 In Section 14.6.2, in illustrating the generation of phantom formants due to nonlinear handset distortion, we used all-pole spectral estimates of both undistorted and distorted waveforms. In such analysis, it is important, however, to ensure that the phantom formants are not artifactual effects from the analysis window length and position. This is because all-pole spectral estimates have been shown to be dependent on the window position and length [58].

Panels (a)-(c) of Figure 14.20 show the STFT magnitude and its all-pole (14th order) envelope of a time segment from a carbon-button waveform for different window positions and lengths. Pole estimation was performed using the autocorrelation method of linear prediction. Because the speech is from a low-pitched male, pole estimation suffers little from aliasing in the autocorrelation, as described in Chapter 5. Window specifications are (a) length of 25 ms and time zero position (i.e., the center of the time segment), (b) length of 25 ms and displaced half a pitch period from time zero, and (c) length of 15 ms and at time zero position. Panels (d) through (f) illustrate the spectra from the corresponding electret-to-carbon mapped waveform segment for the same window characteristics. The forward handset mapping of Figure 14.17 was applied. Explain the implication of the analysis in Figure 14.20 for the validity of the presence of phantom formants.

Figure 14.20 Insensitivity of phantom formants to window position and length: Panels (a) through (c) illustrate the STFT magnitude and its all-pole (14th order) envelope of a time segment of a carbon-button waveform taken from the HTIMIT database. Window specifications are (a) length of 25 ms and time zero position, i.e., center of the short-time segment, (b) length of 25 ms and displaced half a pitch period from time zero, and (c) length of 15 ms and at time zero position. Panels (d) through (f) illustrate the spectra from the corresponding electret-to-carbon-button mapped waveform segment for the same window characteristics.

SOURCE: T.F. Quatieri, D.A. Reynolds, and G.C. O’Leary, “Estimation of Handset Nonlinearity with Application to Speaker Recognition” [55]. ©2000, IEEE. Used by permission.

Image

14.9 Let x[n] denote an undistorted signal, to be referred to as the reference signal, and y[n] the output of a nonlinear system. The pth order Volterra series truncated at N terms is given by [72]

(14.16)

Image

which can be thought of as a generalized Taylor series with memory.

(a) Show that when the nonlinear terms are eliminated in Equation (14.16), the series reduces to the standard convolutional form of a linear system.

(b) Using a least-squared-error estimation approach, based on a quadratic error between a measurement s[n] and the Volterra series model output y[n], solve for the Volterra series parameters in Equation (14.16). You might consider a matrix formulation in your solution.

(c) In nonlinear system modeling and estimation, the Volterra series has the advantage of not requiring a specific form and introduces arbitrary memory into the model. It also has the advantage of converting an underlying nonlinear estimation problem into a linear one because the series in Equation (14.16) is linear in the unknown parameters. Nevertheless, although an arbitrary-order Volterra series in theory represents any nonlinearity, it has a number of limitations. Argue why the Volterra series has convergence problems when the nonlinearity contains discontinuities, as with a saturation operation. (Note that a Taylor series has a similar convergence problem.) Also argue that simple concatenated systems, such as a low-order FIR filter followed by a low-order polynomial nonlinearity, require a much larger-order Volterra series than the original system and thus the Volterra series can be an inefficient representation in the sense that the parameters represent a redundant expansion of the original set possibly derived from physical principles.

14.10 Show that the gradient of Image in Equation (14.15) is given by Image, where J is the Jacobian matrix of first derivatives of the components of the residual vector in Equation (14.14), i.e., the elements of J are given by

Image

where ƒi is the ith element of Image.

14.11 In general, there will not be a unique solution in fitting the nonlinear handset representation in Equation (14.11) to a measurement, even if the measurement fits the model exactly. One problem is that the reference input x[n] may be a scaled version of the actual input. Show that although the coefficients of the polynomial nonlinearity in Equation (14.11) can account for an arbitrary input scaling, estimated polynomial coefficients need not equal the underlying coefficients in spite of providing a match to the measured waveform. Hint: Consider Image, where c is an arbitrary scale factor.

14.12 In this problem, you develop a missing (mel-scale filter energy) feature detector based on the generalized spectral subtraction noise threshold rules. You then dynamically modify the probability computations performed in a GMM recognizer by dividing the GMM into “good feature” and “bad feature” contributions based on the missing feature detector.

(a) Let y[n] be a discrete-time noisy sequence

y[n] = x[n] + b[n]

where x[n] is the desired sequence and b[n] is uncorrelated background noise, with power spectra given by Sx(ω) and Sb(ω), respectively. Recall the generalized spectral subtraction suppression filter of Chapter 13 expressed in terms of relative signal level Q(ω) as

Image

where the relative signal level

Image

with |Y(n, ω)|2 the squared STFT magnitude of the measurement y[n] and with Image the background power spectrum estimate. Given the above threshold condition, Image, decide on “good” and “bad” spectral regions along the frequency axis. Then map your decisions in frequency to the mel-scale.

(b) Suppose now that we compute mel-scale filter energies from |Y(n, ω)|2. Repeat part (a) by first designing a generalized spectral subtraction algorithm based on the mel-scale filter energies. Then design a missing feature detector based on thresholding the mel-scale energy features. Discuss the relative merits of your methods in parts (a) and (b) for detecting “good” and “bad” mel-scale filter energies.

(c) Propose a method to dynamically modify the probability computations performed in the GMM recognizer, removing missing features determined from your detector of part (a) or (b). Hint: Recall that for a speaker model Image, the GMM pdf is given by

Image

where

Image

and where Image is the state mean vector and Σi is the state covariance matrix. Then write each Image as a product of pdf’s of feature vector elements xj, j = 0, 1, … R − 1, under the assumption that each covariance matrix Σi is diagonal.

(d) Discuss the relation of the above missing feature strategy to the possible use of redundancy in the human auditory system, helping to enable listeners to recognize speakers in adverse conditions.

14.13 In this problem, you develop a feature set based on the Teager energy operator of Chapter 11 applied to subband-filter outputs in place of the conventional energy operation [25]. The discrete-time three-point Teager energy operator is given by

Ψ(x[n]) = x2[n] − x[n − 1]x[n + 1]

and for a discrete-time AM-FM sinewave gives approximately the squared product of sinewave amplitude and frequency:

Ψ(x[n]) ≈ A2[n]ω2[n].

In this problem, you analyze the robustness of this energy measure to additive background noise with certain properties.
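
As an aside, the approximation above is easy to check numerically. The following Python sketch (added for illustration; the amplitude and frequency values are arbitrary) applies the operator to a constant-amplitude, constant-frequency sinewave, for which Ψ(x[n]) = A2 sin2(ωo) ≈ A2ωo2:

import numpy as np

def teager(x):
    # Three-point discrete Teager energy operator: x^2[n] - x[n-1]*x[n+1].
    return x[1:-1] ** 2 - x[:-2] * x[2:]

A, w0, phi = 2.0, 0.2, 0.3
n = np.arange(1000)
x = A * np.cos(w0 * n + phi)
print(teager(x).mean(), (A * np.sin(w0)) ** 2, (A * w0) ** 2)  # ~0.158, 0.158, 0.16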

(a) Consider the sum of a desired speech signal x[n] and noise background b[n]

y[n] = x[n] + b[n]

Show that the output of the Teager energy operator can be expressed as

Ψ[y] = Ψ[x] + Ψ[b] + 2Ψc[x, b]

where, for simplicity, we have removed the time index and where Ψc[x, b], the “cross Teager energy” of x[n] and b[n], is given by

Ψc[x, b] = x[n]b[n] − (1/2){x[n − 1]b[n + 1] + x[n + 1]b[n − 1]}

(b) Show that, in a stochastic framework, with x[n] and b[n] uncorrelated, we have

E(Ψ[y]) = E(Ψ[x]) + E(Ψ[b])

where E denotes expected value. Then show that

E(Ψ[b]) = rb[0] − rb[2]

where rb[0] and rb[2] denote, respectively, the 0th and 2nd autocorrelation coefficients of the noise background, b[n]. Using this result, give a condition on the autocorrelation of the noise such that

E(Ψ[y]) = E(Ψ[x])

i.e., that in a statistical sense, the Teager energy is immune to the noise. Explain how the instantaneous nature of actual signal measurements, in contrast to ensemble signal averages, will affect this noise immunity in practice. How might you exploit signal ergodicity?

(c) Show that with the conventional (squared amplitude) definition of energy, we have

E(y2[n]) = rx[0] + rb[0]

where x[n] and b[n] are uncorrelated and where rx[0] and rb[0] denote the 0th autocorrelation coefficients of x[n] and b[n], respectively. Then argue that, unlike with the Teager energy, one cannot find a constraint on the noise autocorrelation such that the standard energy measure is immune to noise, i.e., the “noise bias” persists. How is this difference in noise immunity of the standard and Teager energy affected when the noise is white?

(d) To obtain a feature vector, suppose the Teager energy is derived from subband signals. Therefore, we want to estimate the autocorrelation coefficients of

Ψ[yl] = Ψ[xl] + Ψ[bl] + 2Ψc[xl, bl]

where yl[n], xl[n], and bl[n] are signals associated with the lth mel-scale filter υl[n] of Figure 14.2b. Repeat parts (a)-(c) for this subband signal analysis.

14.14 (MATLAB) In this problem, you investigate the time resolution properties of the mel-scale and sub-band filter output energy representations. You will use the speech signal speech1_10k in workspace ex14M1.mat and functions make_mel_filters.m and make_sub_filters.m, all located in companion web-site directory Chap_exercises/chapter14.

(a) Argue that the subband filters, particularly for high frequencies, are capable of greater temporal resolution of speech energy fluctuations within auditory critical bands than are the mel-scale filters. Consider the ability of the energy functions to reflect speech transitions, periodicity, and short-time events such as plosives in different spectral regions. Assume that the analysis window duration used in the STFT of the mel-scale filter bank configuration is 20 ms and is about equal to the length of the filters in the low 1000-Hz band of the subband filter bank. What constrains the temporal resolution for each filter bank?

(b) Which filter bank structure functions more similar to the wavelet transform and why?

(c) Write a MATLAB routine to compute the mel-scale filter and subband filter energies. In computing the mel-scale filter energies, use a 20-ms Hamming analysis window and the 24-component mel-scale filter bank from function make_mel_filters.m, assuming a 4000-Hz bandwidth. In computing the subband-filter energies, use complex zero-phase subband filters from function make_sub_filters.m. For each filter bank, plot different low- and high-frequency filter-bank energies in time for the voiced speech signal speech1_10k in workspace ex14M1.mat. For the subband filter bank, investigate different energy smoothing filters p[n] and discuss the resulting temporal resolution differences with respect to the mel-scale filter analysis.

14.15 (MATLAB) In this problem you investigate a model for the nonlinearity of the carbon-button handset, derived from physical principles, given by [1],[43]

(14.17)

Image

where typical values of α fall in the range 0 < α < 1.

(a) Show that, when we impose the constraints that the input values of 0 and 1 map to themselves and that Q(u) = 1 for u > 1, this results in the simplified model

(14.18)

Image

thus containing one free variable and a fixed upper saturation level of +1.

(b) The nonlinearity in Equation (14.18) can take on a variety of asymmetric shapes, reflecting the physical phenomenon that the resistance of carbon-button microphone carbon granules reacts differently to positive pressure variation than to negative variation. Write a MATLAB routine to plot the nonlinearity of Equation (14.18) for a variety of values of α. Then apply your nonlinear operation to a synthetic voiced speech waveform of your choice, illustrating the ability of the handset nonlinearity model to create phantom formants. Explain why, in spite of the nonlinearity of Equation (14.17) being based on physical principles and able to create phantom formants, a polynomial model may be more effective when considering differences and age of handsets.

Bibliography

[1] M.H. Abuelma’atti, “Harmonic and Intermodulation Distortion of Carbon Microphones,” Applied Acoustics, vol. 31, pp. 233–243, 1990.

[2] T.V. Ananthapadmanabha and G. Fant, “Calculation of True Glottal Flow and its Components,” Speech Communications, vol. 1, pp. 167–184, 1982.

[3] B.S. Atal, “Automatic Recognition of Speakers from Their Voices,” IEEE Proc., vol. 64, no. 4, pp. 460–475, 1976.

[4] B.S. Atal, “Effectiveness of Linear Prediction Characteristics of the Speech Wave for Automatic Speaker Identification and Verification,” J. Acoustical Society of America, vol. 55, no. 6, pp. 1304–1312, June 1974.

[5] J.P. Campbell, “Speaker Recognition: A Tutorial,” Proc. IEEE, vol. 85, no. 9, pp. 1437–1462, Sept. 1997.

[6] F.R. Clarke and R.W. Becker, “Comparison of Techniques for Discriminating Talkers,” J. Speech and Hearing Research, vol. 12, pp. 747–761, 1969.

[7] S.B. Davis and P. Mermelstein, “Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–28, no. 4, pp. 357–366, Aug. 1980.

[8] J.R. Deller, J.G. Proakis, and J.H.L. Hansen, Discrete-Time Processing of Speech, Macmillan Publishing Co., New York, NY, 1993.

[9] A. Dempster, N. Laird, and D. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,” J. Royal Statistical Society, vol. 39, pp. 1–38, 1977.

[10] G.R. Doddington, “Speaker Recognition—Identifying People by Their Voices,” Proc. IEEE, vol. 73, pp. 1651–1664, 1985.

[11] A.W. Drake, Fundamentals of Applied Probability Theory, McGraw-Hill, New York, NY, 1967.

[12] A. Drygajlo and M. El-Maliki, “Use of Generalized Spectral Subtraction and Missing Feature Compensation for Robust Speaker Verification,” Speaker Recognition and Its Commercial and Forensic Applications (RLA2C), Avignon, France, April 1998.

[13] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, 2nd Edition, John Wiley and Sons, New York, NY, 2001.

[14] E. Erzin, A. Cetin, and Y. Yardimci, “Subband Analysis for Robust Speech Recognition in the Presence of Car Noise,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 1, pp. 417–420, Detroit, MI, May 1995.

[15] European Telecommunication Standards Institute, “European Digital Telecommunications System (Phase 2); Full Rate Speech Processing Functions (GSM 06.01),” ETSI, 1994.

[16] G. Fant, “Glottal Flow: Models and Interaction,” J. Phonetics, vol. 14, pp. 393–399, 1986.

[17] W.M. Fisher, G.R. Doddington, and K.M. Goudie-Marshall, “The DARPA Speech Recognition Research Database: Specifications and Status,” Proc. DARPA Speech Recognition Workshop, pp. 93–99, Palo Alto, CA, 1986.

[18] R. Fletcher, “Generalized Inverses for Nonlinear Equations and Optimization,” in Numerical Methods for Nonlinear Algebraic Equations, P. Rabinowitz, ed., Gordon and Breach Science Publishers, New York, NY, 1970.

[19] S. Furui, “Cepstral Analysis Technique for Automatic Speaker Verification,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 29, no. 2, pp. 254–272, April 1981.

[20] O. Ghitza, “Auditory Models and Human Performance in Tasks Related to Speech Coding and Speech Recognition,” IEEE Trans. Speech and Audio Process, vol. 2, no. 1, pp. 115–132, Jan. 1994.

[21] H. Gish and M. Schmidt, “Text-Independent Speaker Identification,” IEEE Signal Processing Magazine, vol. 11, no. 4, pp. 18–32, Oct. 1994.

[22] B. Gold and N. Morgan, Speech and Audio Signal Processing, John Wiley and Sons, New York, NY, 2000.

[23] J.M. Huerta and R.M. Stern, “Speech Recognition from GSM Coder Parameters,” Proc. 5th Int. Conf. on Spoken Language Processing, vol. 4, pp. 1463–1466, Sydney, Australia, 1998.

[24] ITU-T Recommendation G.729, “Coding of Speech at 8 kbps Using Conjugate-Structure Algebraic-Code-Excited Linear Prediction,” June 1995.

[25] F. Jabloun and A.E. Cetin, “The Teager Energy Based Feature Parameters for Robust Speech Recognition in Noise,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 1, pp. 273–276, Phoenix, AZ, March 1999.

[26] C.R. Jankowski, Fine Structure Features for Speaker Identification, Ph.D. Thesis, Massachusetts Institute of Technology, Dept. Electrical Engineering and Computer Science, June 1996.

[27] C.R. Jankowski, T.F. Quatieri, and D.A. Reynolds, “Measuring Fine Structure in Speech: Application to Speaker Identification,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 1, pp. 325–328, Detroit, MI, May 1995.

[28] C.R. Jankowski, A. Kalyanswamy, S. Basson, and J. Spitz, “NTIMIT: A Phonetically Balanced, Continuous Speech, Telephone Bandwidth Speech Database,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 1, pp. 109–112, Albuquerque, NM, 1990.

[29] F. Jelinek, Statistical Methods for Speech Recognition, The MIT Press, Cambridge, MA, 1998.

[30] J.F. Kaiser, “On a Simple Algorithm to Calculate the ‘Energy’ of a Signal,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 1, Albuquerque, NM, pp. 381–384, April 1990.

[31] D.H. Klatt and L.C. Klatt, “Analysis, Synthesis, and Perception of Voice Quality Variations among Female and Male Talkers,” J. Acoustical Society of America, vol. 87, no. 2, pp. 820–857, 1990.

[32] W.B. Kleijn and K.K. Paliwal, eds., Speech Coding and Synthesis, Elsevier, Amsterdam, the Netherlands, 1995.

[33] Linguistic Data Consortium, http://www.ldc.upenn.edu.

[34] R.P. Lippman and B.A. Carlson, “Using Missing Feature Theory to Actively Select Features for Robust Speech Recognition with Interruptions, Filtering, and Noise,” Proc. Eurospeech97, vol. 1, pp. KN 37–40, Rhodes, Greece, Sept. 1997.

[35] J. He, L. Liu, and G. Palm, “On the Use of Features from Prediction Residual Signals in Speaker Identification,” Proc. Eurospeech95, vol. 1, pp. 313–316, Madrid, Spain, 1995.

[36] P. Maragos, J. F. Kaiser, and T. F. Quatieri, “On Amplitude and Frequency Demodulation Using Energy Operators,” IEEE Trans. Signal Processing, vol. 41, no. 4, pp. 1532–1550, April 1993.

[37] P. Maragos, J.F. Kaiser, and T.F. Quatieri, “Energy Separation in Signal Modulations with Application to Speech Analysis,” IEEE Trans. Signal Processing, vol. 41, no. 10, pp. 3025–3051, Oct. 1993.

[38] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, “The DET Curve in Assessment of Detection Task Performance,” Proc. Eurospeech97, vol. 4, pp. 1895–1898, Rhodes, Greece, 1997.

[39] T. Masuko, T. Hitotsumatsu, K. Tokuda, and T. Kobayashi, “On the Security of HMM-Based Speaker Verification Systems against Imposture Using Synthetic Speech,” Proc. Eurospeech99, vol. 3, pp. 1223–1226, Budapest, Hungary, Sept. 1999.

[40] G. McLachlan, Mixture Models, Macmillan Publishing Co., New York, NY, 1971.

[41] C. Mokbel, L. Mauuary, D. Jouvet, J. Monne, C. Sorin, J. Simonin, and K. Bartkova, “Towards Improving ASR Robustness for PSN and GSM Telephone Applications,” 2nd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications, vol. 1, pp. 73–76, 1996.

[42] National Institute of Standards and Technology (NIST), “NIST Speaker Recognition Workshop Notebook,” NIST Administered Speaker Recognition Evaluation on the Switchboard Corpus, Spring 1996–2001.

[43] H.F. Olson, Elements of Acoustical Engineering, Chapman and Hall, London, England, 1940.

[44] D. O’Shaughnessy, Speech Communication: Human and Machine, Addison-Wesley, Reading, MA, 1987.

[45] M.T. Padilla, Applications of Missing Feature Theory to Speaker Recognition, Masters Thesis, Massachusetts Institute of Technology, Dept. Electrical Engineering and Computer Science, Feb. 2000.

[46] A. Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw-Hill, New York, NY, 1965.

[47] A. Pickles, An Introduction to Auditory Physiology, Academic Press, 2nd Edition, New York, NY, 1988.

[48] M.D. Plumpe, T.F. Quatieri, and D.A. Reynolds, “Modeling of the Glottal Flow Derivative Waveform with Application to Speaker Identification,” IEEE Trans. Speech and Audio Processing, vol. 1, no. 5, pp. 569–586, Sept. 1999.

[49] M.D. Plumpe, Modeling of the Glottal Flow Derivative Waveform with Application to Speaker Identification, Masters Thesis, Massachusetts Institute of Technology, Dept. Electrical Engineering and Computer Science, Feb. 1997.

[50] M. Przybocki and A. Martin, “NIST Speaker Recognition Evaluation—1997,” Speaker Recognition and Its Commercial and Forensic Applications (RLA2C), Avignon, France, pp. 120–123, April 1998.

[51] T.F. Quatieri and R.J. McAulay, “Shape-Invariant Time-Scale and Pitch Modification of Speech,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 40, no. 3, pp. 497–510, March 1992.

[52] T.F. Quatieri, C.R. Jankowski, and D.A. Reynolds, “Energy Onset Times for Speaker Identification,” IEEE Signal Processing Letters, vol. 1, no. 11, pp. 160–162, Nov. 1994.

[53] T.F. Quatieri, R.B. Dunn, and D.A. Reynolds, “On the Influence of Rate, Pitch, and Spectrum on Automatic Speaker Recognition Performance,” Proc. Int. Conf. on Spoken Language Processing, Beijing, China, Oct. 2000.

[54] T.F. Quatieri, D.A. Reynolds, and G.C. O’Leary, “Handset Nonlinearity Estimation with Application to Speaker Recognition,” in NIST Speaker Recognition Notebook, NIST Administered Speaker Recognition Evaluation on the Switchboard Corpus, June 1997.

[55] T.F. Quatieri, D.A. Reynolds, and G.C. O’Leary, “Estimation of Handset Nonlinearity with Application to Speaker Recognition,” IEEE Trans. Speech and Audio Processing, vol. 8, no. 5, pp. 567–584, Sept. 2000.

[56] T.F. Quatieri, E. Singer, R.B. Dunn, D.A. Reynolds, and J.P. Campbell, “Speaker and Language Recognition using Speech Codec Parameters,” Proc. Eurospeech99, vol. 2, pp. 787–790, Budapest, Hungary, Sept. 1999.

[57] T.F. Quatieri, R.B. Dunn, D.A. Reynolds, J.P. Campbell, and E. Singer, “Speaker Recognition using G.729 Speech Codec Parameters,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 2, pp. 1089–1093, Istanbul, Turkey, June 2000.

[58] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, NJ, 1978.

[59] L.R. Rabiner and B. Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, NJ, 1993.

[60] D.A. Reynolds, A Gaussian Mixture Modeling Approach to Text-Independent Speaker Identification, Ph.D. Thesis, Georgia Institute of Technology, Atlanta, GA, 1992.

[61] D.A. Reynolds, “Speaker Identification and Verification Using Gaussian Mixture Speaker Models,” Speech Communication, vol. 17, pp. 91–108, Aug. 1995.

[62] D.A. Reynolds, T.F. Quatieri, and R.B. Dunn, “Speaker Verification Using Adapted Gaussian Mixture Models,” Digital Signal Processing, Special Issue: NIST 1999 Speaker Recognition Workshop, J. Schroeder and J.P. Campbell, eds., Academic Press, vol. 10, no. 1–3, pp. 19–41, Jan./April/July 2000.

[63] D.A. Reynolds, “Effects of Population Size and Telephone Degradations on Speaker Identification Performance,” Proc. SPIE Conference on Automatic Systems for the Identification and Inspection of Humans, 1994.

[64] D.A. Reynolds, “Speaker Identification and Verification Using Gaussian Mixture Speaker Models,” Proc. ESCA Workshop on Automatic Speaker Recognition, pp. 27–30, Martigny, Switzerland, 1994.

[65] D.A. Reynolds, M.A. Zissman, T.F. Quatieri, G.C. O’Leary, and B.A. Carlson, “The Effects of Telephone Transmission Degradations on Speaker Recognition Performance,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Detroit, MI, May 1995.

[66] D.A. Reynolds, “HTIMIT and LLHDB: Speech Corpora for the Study of Handset Transducer Effects,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 2, pp. 1535–1538, Munich, Germany, April 1997.

[67] D.A. Reynolds, “Comparison of Background Normalization Methods for Text-Independent Speaker Verification,” Proc. Eurospeech97, vol. 1, pp. 963–967, Rhodes, Greece, Sept. 1997.

[68] D.A. Reynolds, “Automatic Speaker Recognition Using Gaussian Mixture Speaker Models,” MIT Lincoln Laboratory Journal, vol. 8, no. 2, pp. 173–191, Fall 1995.

[69] M. Sambur, “Selection of Acoustical Features for Speaker Identification,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–23, no. 2, pp. 176–182, April 1975.

[70] R. Salami, C. Laflamme, J.P. Adoul, A. Kataoka, S. Hayashi, T. Moriya, C. Lamblin, D. Massaloux, S. Proust, P. Kroon, and Y. Shoham, “Design and Description of CS-ACELP: A Toll Quality 8-kbps Speech Coder,” IEEE Trans. Speech and Audio Processing, vol. 6, no. 2, pp. 116–130, March 1998.

[71] M. Savic and S.K. Gupta, “Variable Parameter Speaker Verification System Based on Hidden Markov Modeling,” Proc. Int. Conf. Acoustics, Speech, and Signal Processing, vol. 1, pp. 281–284, Albuquerque, NM, 1990.

[72] M. Schetzen, The Volterra and Wiener Theories of Nonlinear Systems, John Wiley and Sons, New York, NY, 1980.

[73] A. Schmidt-Nielsen and T.H. Crystal, “Speaker Verification by Human Listeners: Experiments Comparing Human and Machine Performance Using the NIST 1998 Speaker Evaluation Data,” Digital Signal Processing, Special Issue: NIST 1999 Speaker Recognition Workshop, J. Schroeder and J.P. Campbell, eds., Academic Press, vol. 10, no. 1–3, pp. 249–266, Jan./April/July 2000.

[74] J. Schroeter and M.M. Sondhi, “Speech Coding Based on Physiological Models of Speech Production,” chapter in Advances in Speech Signal Processing, S. Furui and M.M. Sondhi, eds., Marcel Dekker, New York, NY, 1992.

[75] S. Seneff, “A Joint Synchrony/Mean-Rate Model of Auditory Speech Processing,” J. Phonetics, vol. 16, no. 1, pp. 55–76, Jan. 1988.

[76] J. Schroeder and J.P. Campbell, eds., Digital Signal Processing, Special Issue: NIST 1999 Speaker Recognition Workshop, Academic Press, vol. 10, no. 1–3, Jan./April/July 2000.

[77] F. Soong, A. Rosenberg, L. Rabiner, and B. Juang, “A Vector Quantization Approach to Speaker Recognition,” Proc. Int. Conf. Acoustics, Speech, and Signal Processing, vol. 1, pp. 387–390, Tampa, FL, 1985.

[78] F. Soong and A. Rosenberg, “On the Use of Instantaneous and Transitional Spectral Information in Speaker Recognition,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–36, no. 6, pp. 871–879, June 1988.

[79] J. Tchorz and B. Kollmeier, “A Model of Auditory Perception as Front End for Automatic Speech Recognition,” J. Acoustical Society of America, vol. 106, no. 4, Oct. 1999.

[80] H.M. Teager and S.M. Teager, “A Phenomenological Model for Vowel Production in the Vocal Tract,” chapter in Speech Science: Recent Advances, R.G. Daniloff, ed., College-Hill Press, pp. 73–109, San Diego, CA, 1985.

[81] P. Thévenaz and H. Hügli, “Usefulness of the LPC-Residue in Text-Independent Speaker Verification,” Speech Communication, vol. 17, no. 1–2, pp. 145–157, Aug. 1995.

[82] W.D. Voiers, “Perceptual Bases of Speaker Identity,” J. Acoustical Society of America, vol. 36, pp. 1065–1073, 1964.

[83] S. van Vuuren and H. Hermansky, “On the Importance of Components of the Modulation Spectrum for Speaker Verification,” Proc. Int. Conf. on Spoken Language Processing, Sydney, Australia, Nov. 1998.

[84] R. Zelinski and P. Noll, “Adaptive Transform Coding of Speech Signals,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–25, no. 4, pp. 299–309, Aug. 1977.

[85] M.A. Zissman, “Comparison of Four Approaches to Automatic Language Identification of Telephone Speech,” IEEE Trans. on Speech and Audio Processing, vol. 4, no. 1, pp. 31–44, Jan. 1996.
