Chapter 10
Frequency-Domain Pitch Estimation

10.1 Introduction

We have seen throughout the text that different approaches to speech analysis/synthesis naturally lead to different methods of pitch and voicing estimation. For example, in homomorphic analysis, the location of quefrency peaks (or lack thereof) in the cepstrum provides a pitch and voicing estimate. Likewise, the distance between primary peaks in the linear prediction error yields an estimate of the pitch period. We also saw that the wavelet transform lends itself to pitch period estimation by way of the correlation of maxima in filter-bank outputs across different scales; this “parallel processing” approach is similar in style to that of an early successful pitch estimation method conceived by Gold and Rabiner that looks across a set of impulse trains generated by different peaks and valleys in the signal [1]. These latter methods, based on linear prediction, the wavelet transform, and temporal peaks and valleys, provide not only a pitch-period estimate, but also an estimate of the time occurrence of the glottal pulse, an important parameter in its own right for a variety of applications. We can think of all the above methods as time-domain approaches to pitch, voicing, and glottal pulse time estimation. In this chapter, we take an alternative view of these estimation problems in the frequency domain, motivated by the sinewave representation of Chapter 9.1

1 We cannot hope to cover the vast variety of all pitch, voicing, and glottal pulse time estimators in the time and frequency domain. By focusing on a few specific classes of estimators, however, we illustrate the goals and problems common to the many approaches.

In Chapter 9, it was shown that it is possible to generate synthetic speech of very high quality using an analysis/synthesis system based on a sinusoidal speech model which, except in updating the average pitch to adjust the width of the analysis window, made no explicit use of pitch and voicing. Pitch and voicing, however, played an important role in accomplishing sinewave-based modification in Chapter 9 and will also play an important role in reducing the bit rate in sinewave-based speech coding in Chapter 12, much as they do in the speech analysis/synthesis and coding based on linear prediction. In this chapter, we will see that the sinewave representation brings new insights to the problems of pitch estimation, voicing detection, and glottal pulse time estimation. Specifically, pitch estimation can be thought of as fitting a harmonic set of sinewaves to the measured set of sinewaves, and the accuracy of the harmonic fit is an indicator of the voicing state. It is the purpose of this chapter to explore this idea in detail. The result is a powerful pitch and voicing algorithm which has become a basic component in all of the applications of the sinewave system.

We begin with a simple pitch estimator based on the autocorrelation function, which becomes the basis for transitioning into the frequency domain through a sinewave representation. A variety of sinewave-based pitch estimators of increasing complexity and accuracy are then derived. We end this chapter with an application of the sinewave model to glottal pulse time estimation, and finally a generalization of the sinewave model to multi-band pitch and voicing estimation [2].

10.2 A Correlation-Based Pitch Estimator

Consider a discrete-time short-time sequence given by

$$s_n[m] = s[m]\, w[n - m]$$

where w[n] is an analysis window of duration Nw. The short-time autocorrelation function rn[τ] is defined by

$$r_n[\tau] = \sum_{m=-\infty}^{\infty} s_n[m]\, s_n[m + \tau]$$

When s[m] is periodic with period P, rn[τ] contains peaks at or near the pitch period, P. For unvoiced speech, no clear peak occurs near an expected pitch period. Typical sequences rn[τ] for different window lengths were shown in Figure 5.6 of Chapter 5. We see that the location of a peak (or lack thereof) in the pitch period range provides a pitch estimate and voicing decision. This is similar to the strategy used in determining pitch and voicing from the cepstrum in homomorphic analysis.

It is interesting to observe that the above correlation pitch estimator can be obtained more formally by minimizing, over possible pitch periods (P > 0), the error criterion given by

(10.1)

$$E[P] = \sum_{m=-\infty}^{\infty} \left( s_n[m] - s_n[m + P] \right)^2$$

Minimizing E[P] with respect to P yields

(10.2)

$$\hat{P} = \arg\max_{P}\; r_n[P]$$

where P > ε, i.e., P is sufficiently far from zero (Exercise 10.1). This alternate view of autocorrelation pitch estimation is used in the following section.
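To make the criterion concrete, the following minimal sketch (Python with NumPy) implements the autocorrelation pitch estimator of Equations (10.1) and (10.2). The function name, the period search range, and the synthetic test waveform are illustrative assumptions, not values from the text.

```python
import numpy as np

def autocorr_pitch_period(frame, p_min, p_max):
    """Return the period P (in samples) maximizing r_n[P], as in Equation (10.2)."""
    x = frame - frame.mean()                # remove DC so peaks reflect periodicity
    r = np.correlate(x, x, mode="full")     # lags -(N-1) ... (N-1)
    r = r[len(x) - 1:]                      # keep non-negative lags only
    return p_min + int(np.argmax(r[p_min:p_max + 1]))   # search only P > epsilon

# Example: a waveform with an 80-sample period (100 Hz at fs = 8000 Hz).
fs = 8000
n = np.arange(720)
frame = np.cos(2 * np.pi * 100 / fs * n) + 0.5 * np.cos(2 * np.pi * 200 / fs * n)
print(autocorr_pitch_period(frame, p_min=20, p_max=160))   # prints 80
```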

The autocorrelation function is a measure of “self-similarity,” so we expect that it peaks near P for a periodic sequence. Partly due to the presence of the window, however, the peak at the pitch period does not always have the greatest amplitude. We saw in Chapter 5 that the envelope of the short-time autocorrelation function of a periodic waveform decreases roughly linearly with increasing lag (Exercise 5.2). A longer window helps in assuring that the peak at τ = P is largest, but a long window causes other problems when the sequence s[m] is not exactly periodic. Peaks in the autocorrelation of the vocal tract impulse response, as well as peaks at multiple pitch periods,2 may become larger than the peak at τ = P due to time variations in the vocal tract and pitch. Another problem arises from the interaction between the pitch and the first formant. If the formant bandwidth is narrow relative to the harmonic spacing (so that the vocal tract impulse response decays very slowly within a pitch period), the correlation function may reflect the formant frequency rather than the underlying pitch. Nonlinear time-domain processing techniques using various types of waveform center-clipping algorithms have been developed to alleviate this problem [11],[12],[13].

2 Two common problems in pitch estimation are referred to as pitch-halving and pitch-doubling, whereby the pitch estimate is half or double the true pitch.

In the autocorrelation method of pitch estimation, Equation (10.2), the effective limits on the sum get smaller as the candidate pitch period P increases, i.e., as the windowed data segment shifts relative to itself, causing the autocorrelation to have the roughly linear envelope alluded to above. If we could extrapolate the periodic data segment so as to pretend that a longer periodic segment exists, we could avoid the effect of the window. Extrapolation has two advantages. First, we can use the full interval of duration Nw without assuming the data is zero outside the window duration and, second, we can make the interval long while maintaining stationarity (Figure 10.1). In the next section, we exploit this extrapolation concept together with the error criterion of Equation (10.1).

10.3 Pitch Estimation Based on a “Comb Filter”

One can imagine designing a pitch estimator in the frequency domain by running the waveform through a bank of comb filters, each with peaks at multiples of a hypothesized fundamental frequency (pitch), and selecting the comb whose pitch matches the waveform harmonics and thus, hopefully, gives the largest output energy. In this section, we derive a type of comb filter for pitch estimation, not by straightforward filtering, but based on the extrapolation and error criterion concepts of the previous section.

Figure 10.1 Extrapolation of a short-time segment from a periodic sequence.


One approach to extrapolating a segment of speech beyond its analysis window as in Figure 10.1 is through the sinewave model of Chapter 9. Suppose that, rather than using the short-time segment sn[m] (where the subscript n refers to the window location), we use its sinewave representation. Then we form a sequence s̃[m] expressed as

(10.3)

$$\tilde{s}[m] = \sum_{k=1}^{K} A_k \exp[\, j(\omega_k m + \theta_k) \,]$$

where the sinewave amplitudes, frequencies, and phases Ak, ωk, and θk are obtained from the short-time segment3 sn[m], but where s̃[m] is thought of as infinite in duration. The sinewave representation, therefore, can extrapolate the signal beyond the analysis window duration. In this chapter, as in Equation (10.3), it is particularly convenient to use only the complex sinewave representation; hence the real part notation introduced in Chapter 9 has been omitted.

3 The sinewave amplitudes, frequencies, and phases Ak, ωk , and θk are fixed parameters from one analysis window on one frame and not the interpolation functions used for synthesis in Chapter 9. Recall that θk is a phase offset relative to the analysis frame center, as described in Section 9.3.

Consider now the error criterion in Equation (10.1) over an extrapolated interval of length N. Substituting Equation (10.3) into Equation (10.1), we have

(10.4)

$$E[P] = \sum_{m=-(N-1)/2}^{(N-1)/2} \left| \tilde{s}[m] - \tilde{s}[m + P] \right|^2$$

where the extrapolation interval N is assumed odd and centered about the origin. Note that we could have begun with Equation (10.2), i.e., the autocorrelation perspective, rather than Equation (10.1). Observe also that, in Equation (10.4), the limits on the sum are [−(N − 1)/2, (N −1)/2] for all P and that data truncation does not occur. Rearranging terms in Equation (10.4), we obtain

$$E[P] = N \sum_{k=1}^{K} \sum_{l=1}^{K} A_k A_l\, e^{j(\theta_k - \theta_l)} \left( 1 - e^{j\omega_k P} \right) \left( 1 - e^{-j\omega_l P} \right) q(\omega_k - \omega_l)$$

for which we write

$$q(\omega) = \frac{1}{N} \sum_{m=-(N-1)/2}^{(N-1)/2} e^{j\omega m} = \frac{\sin(N\omega/2)}{N \sin(\omega/2)}$$

If we let the extrapolation interval N go to infinity, the function q(ω) approaches zero except at the origin so that

$$q(\omega_k - \omega_l) \rightarrow \begin{cases} 1, & k = l \\ 0, & k \neq l \end{cases}$$

then

$$E[P] = 2N \sum_{k=1}^{K} A_k^2 \left[\, 1 - \cos(\omega_k P) \,\right]$$

Let P = 2π/ωo, where ωo is the fundamental frequency. Then

(10.5)

$$E[P] = \sum_{k=1}^{K} A_k^2 \left[\, 1 - \cos(2\pi\, \omega_k / \omega_o) \,\right]$$

where for convenience we have deleted the scale factor of two (and the constant factor N) in Equation (10.5). We now want to minimize E[P], as expressed in Equation (10.5), with respect to ωo. To give this minimization an intuitive meaning, we rewrite Equation (10.5) as

(10.6)

$$E[P] = \sum_{k=1}^{K} A_k^2 - \sum_{k=1}^{K} A_k^2 \cos(2\pi\, \omega_k / \omega_o)$$

Minimization of E[P] is then equivalent to maximizing with respect to ωo the term

(10.7)

$$Q(\omega_o) = \sum_{k=1}^{K} A_k^2 \cos(2\pi\, \omega_k / \omega_o)$$

We refer to Q(ωo) as a likelihood function because, as its value increases, so does the likelihood that the hypothesized fundamental frequency is the true value. One way to view Q(ωo) is to first replace ωk in Equation (10.7) by a continuous ω. For each ωo, the continuous function

$$F(\omega) = \cos(2\pi\, \omega / \omega_o)$$

is sampled at ωk, k = 1, 2, …, K, each F(ωk) is weighted by Ak², and these weighted values are summed to form Equation (10.7). We can, loosely speaking, think of this as “comb-filtering” for each ωo, and we want the ωo whose comb filter has the maximum output. Figure 10.2a shows an example sampling of the function F(ω).

Figure 10.2 “Comb-filtering” interpretation of the likelihood function, Equation (10.7): (a) sampling by harmonic frequencies with the candidate pitch equal to the true pitch; (b) sampling by harmonic frequencies with the candidate pitch equal to twice the true pitch. In the latter case, the cancellation effect shown in panel (b) reduces the likelihood function. A candidate of half the true pitch, however, does not change the likelihood.

If the ωk’s are multiples of a fundamental frequency ω̄o, and if ωo = ω̄o, we have F(ωk) = 1 (as in Figure 10.2a) and E[P] = 0, and minimization is achieved. Specifically,

$$Q(\bar{\omega}_o) = \sum_{k=1}^{K} A_k^2 \cos(2\pi\, k\bar{\omega}_o / \bar{\omega}_o) = \sum_{k=1}^{K} A_k^2$$

and thus, from Equation (10.6), E[P] = 0. In Figure 10.2b we see that the estimator is insensitive to pitch doubling due to a cancellation effect. Notice, however, a disturbing feature of this pitch estimator. A fundamental frequency estimate of ω̄o/2, i.e., half the true pitch, will also yield zero error. Thus, the solution is ambiguous. These properties are illustrated in the following example:

Example 10.1       Figure 10.3a,b,d shows the result of Equation (10.7) for true pitch values of 50, 100, and 200 Hz and where Ak = 1. A frequency increment of 0.5 Hz was used for the hypothesized pitch candidates (Exercise 10.12). One sees multiple peaks in the pitch likelihood function and thus the pitch estimate ambiguity. Nevertheless, the example also shows that the correct pitch is given by the last large peak, i.e., the peak of greatest frequency. There are no peaks beyond the true pitch because there is a cancellation effect for multiples of the true pitch (as in Figure 10.2b), and thus the likelihood function falls rapidly. Finally, Figure 10.3c shows the effect of white noise (added to the measured frequencies) on the likelihood function for a true pitch of 50 Hz. In this case, although the last large peak of Q(ωo) falls at roughly 50 Hz, there is little confidence in the estimate due to the multiple peaks and, in particular, a larger peak at about 25 Hz than at 50 Hz.
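The likelihood function of Equation (10.7) is simple to compute directly. The sketch below, assuming a true pitch of 100 Hz, unit amplitudes as in Example 10.1, and a 0.5-Hz candidate grid, reproduces the submultiple ambiguity just described.

```python
import numpy as np

f_true = 100.0                               # assumed true pitch in Hz
fk = f_true * np.arange(1, 11)               # measured frequencies: 10 harmonics
Ak = np.ones_like(fk)                        # unit amplitudes, as in Example 10.1

f_cand = np.arange(20.0, 400.0, 0.5)         # hypothesized pitch candidates
Q = np.array([np.sum(Ak**2 * np.cos(2.0 * np.pi * fk / fo)) for fo in f_cand])

# Every submultiple of 100 Hz ties the maximum value of Q, so a bare argmax
# lands on the lowest divisor in the grid -- the ambiguity discussed above.
print(f_cand[np.argmax(Q)])                  # prints 20.0, not 100.0
```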

Figure 10.3 Pitch likelihood function Q(ωo) for true pitch values of (a) 50 Hz, (b) 100 Hz, and (d) 200 Hz. The effect of noise on (a) is shown in (c).

This problem of pitch-halving (i.e., underestimating the pitch by a factor of two) is typical of many pitch estimators such as the homomorphic and autocorrelation pitch estimators. In fact, explicit comb filtering approaches have also been applied, but the tendency of these methods to suffer from pitch-halving has limited their use. Our goal is to design a pitch likelihood function that is characterized by a distinct peak at the true pitch. In the following Section 10.4, we propose an alternative sinewave-based pitch estimator which utilizes additional information. This new estimator is capable of resolving the above ambiguity.

10.4 Pitch Estimation Based on a Harmonic Sinewave Model

The previous sinewave-based correlation pitch estimator was derived by minimizing the mean-squared error between an estimated sinewave model and itself shifted by P samples. In this section, we fit a sinewave model, with unknown amplitudes, phases, and harmonic frequencies, to a waveform measurement [5],[8]. Although the resulting pitch estimator is prone to pitch doubling ambiguity, we show that, with a priori knowledge of the vocal tract spectral envelope, the pitch doubling problem is alleviated. This approach to pitch estimation leads naturally to a measure of the degree of voicing within a speech segment. Methods to evaluate the pitch estimator are also described.

10.4.1 Parameter Estimation for the Harmonic Sinewave Model

Consider a sinusoidal waveform model with unknown amplitudes and phases, and with harmonic frequencies:

$$\hat{s}[n; \omega_o, \mathbf{B}, \boldsymbol{\phi}] = \sum_{k=1}^{K(\omega_o)} B_k \exp[\, j(k\omega_o n + \phi_k) \,]$$

where ωo is an unknown fundamental frequency, where B and φ represent vectors of unknown amplitudes and phases, {Bk} and {φk}, respectively, and where K(ωo) is the number of harmonics in the speech bandwidth. (For clarity, we here write the complex exponential as exp[·].) A reasonable estimation criterion is to seek the minimum mean-squared error between the harmonic model and the measured speech waveform:

(10.8)

$$E(\omega_o, \mathbf{B}, \boldsymbol{\phi}) = \sum_{n=-(N_w-1)/2}^{(N_w-1)/2} \left| s[n] - \hat{s}[n; \omega_o, \mathbf{B}, \boldsymbol{\phi}] \right|^2$$

where we assume that the analysis window duration Nw is odd and the window is centered about the time origin. Then we can show that

(10.9)

$$E(\omega_o, \mathbf{B}, \boldsymbol{\phi}) = E_s - 2 N_w \sum_{k=1}^{K(\omega_o)} B_k\, |S(k\omega_o)| \cos[\, \phi_k - \angle S(k\omega_o) \,] + N_w \sum_{k=1}^{K(\omega_o)} B_k^2$$

where S(ω) represents one slice of the STFT of the speech waveform over the interval [−(Nw − 1)/2, (Nw − 1)/2], here normalized by the window length as S(ω) = (1/Nw) Σn s[n] exp(−jωn), and where the total signal energy is

(10.10)

$$E_s = \sum_{n=-(N_w-1)/2}^{(N_w-1)/2} \left| s[n] \right|^2$$

Minimizing Equation (10.9) with respect to the phases φk, we see that the phase estimates are

(10.11)

$$\hat{\phi}_k = \angle S(k\omega_o)$$

so that

(10.12)

$$E(\omega_o, \mathbf{B}) = E_s - 2 N_w \sum_{k=1}^{K(\omega_o)} B_k\, |S(k\omega_o)| + N_w \sum_{k=1}^{K(\omega_o)} B_k^2$$

and minimizing Equation (10.12) with respect to Bk:

(10.13)

$$\frac{\partial E}{\partial B_k} = -2 N_w\, |S(k\omega_o)| + 2 N_w\, B_k = 0$$

so that

Bk = |S(kωo)|.

Therefore,

(10.14)

$$E(\omega_o) = E_s - N_w \sum_{k=1}^{K(\omega_o)} \left| S(k\omega_o) \right|^2$$

The reader is asked to work through the algebraic steps of this derivation in Exercise 10.2. Thus, the optimal ωo is given by

(10.15)

$$\hat{\omega}_o = \arg\max_{\omega_o > \epsilon}\; \sum_{k=1}^{K(\omega_o)} \left| S(k\omega_o) \right|^2$$

where we assume that ωo > ε (a small positive value close to zero) to avoid a bias toward a low fundamental frequency. As in the previous estimator, we see that this estimator acts like a comb filter, but here the measured spectrum is sampled at the candidate harmonics. For a perfectly periodic waveform, ωo would be chosen to correspond to the harmonic peaks in |S(kωo)|. This criterion could lead, however, to a pitch-halving error similar to what we found in the sinewave-based correlation pitch estimator (Exercise 10.2).
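A direct way to realize Equation (10.15) is to sample an FFT-based STFT slice at the candidate harmonics and accumulate |S(kωo)|². The sketch below is a minimal illustration; the Hamming window, FFT size, 1000-Hz evaluation band, and 0.5-Hz candidate grid are assumptions, and, as just noted, the bare criterion remains prone to pitch halving.

```python
import numpy as np

def harmonic_pitch(s, fs, f_min=50.0, f_max=400.0, nfft=4096, band=1000.0):
    """Pick the candidate fo maximizing sum_k |S(k*fo)|^2, as in Equation (10.15)."""
    S = np.fft.rfft(s * np.hamming(len(s)), nfft)   # one slice of the STFT
    df = fs / nfft                                  # FFT bin spacing in Hz
    best_fo, best_score = f_min, -np.inf
    for fo in np.arange(f_min, f_max, 0.5):
        k = np.arange(1, int(band / fo) + 1)        # the K(fo) harmonics in the band
        score = np.sum(np.abs(S[np.round(k * fo / df).astype(int)]) ** 2)
        if score > best_score:                      # note: still subject to halving,
            best_fo, best_score = fo, score         # as discussed in the text
    return best_fo
```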

Consider now giving ourselves some additional information; in particular, we assume a known vocal-tract spectral envelope,4 |H(ω)|. With this a priori information, we will show that the resulting error criterion can resolve the pitch-halving ambiguity [5],[8]. The following Section 10.4.2 derives the pitch estimator, while Section 10.4.4 gives a more intuitive view of the approach with respect to time-frequency resolution considerations.

4 We do not utilize the full capacity of the sinewave pitch estimator since we could have also supplied an a priori phase envelope, thus giving temporal alignment information (Exercise 10.2).

10.4.2 Parameter Estimation for the Harmonic Sinewave Model with a priori Amplitude

The goal again is to represent the speech waveform by another for which all of the frequencies are harmonic, but now let us assume a priori knowledge of the vocal tract spectral envelope, |H(ω)|. Under the assumption that the excitation amplitudes are unity, i.e., ak(t) = 1 in Equation (9.5), |H(ω)| also provides an envelope for the sinewave amplitudes Ak. The harmonic sinewave model then becomes

(10.16)

$$\hat{s}[n; \omega_o, \boldsymbol{\phi}] = \sum_{k=1}^{K(\omega_o)} \bar{M}(k\omega_o) \exp[\, j(k\omega_o n + \phi_k) \,]$$

where ωo is the fundamental frequency (pitch), K(ωo) is the number of harmonics in the speech bandwidth, M̄(ω) = |H(ω)| is the vocal tract envelope, and φk represents the phases of the harmonics. We would like to estimate the pitch ωo and the phases φk so that ŝ[n; ωo, φ] is as close as possible to the speech measurement s[n], according to some meaningful criterion. While a number of methods can be used for estimating the envelope M̄(ω), for example, linear prediction or homomorphic estimation techniques, it is desirable to use a method that yields an envelope that passes through the measured sinewave amplitudes. Such a technique has been developed in the Spectral Envelope Estimation Vocoder (SEEVOC) [10]. This estimator will also be used later in this chapter as the basis of the minimum-phase analysis for estimating the source excitation onset times and so we postpone its description (Section 10.5.3).5

5 We will see that, in the application of source onset time estimation, it is appropriate to linearly interpolate between the successive sinewave amplitudes. In the application to mean-squared-error pitch estimation in this section, however, the main purpose of the envelope is to eliminate pitch ambiguities. Since the linearly interpolated envelope could affect the fine structure of the mean-squared-error criterion through its interaction with the measured peaks in the correlation operation (to follow later in this section), better performance is obtained by using piecewise-constant interpolation between the SEEVOC peaks.

A reasonable estimation criterion is to seek the minimum of the mean-squared error (MSE),

(10.17)

$$E(\omega_o, \boldsymbol{\phi}) = \frac{1}{N_w} \sum_{n=-(N_w-1)/2}^{(N_w-1)/2} \left| s[n] - \hat{s}[n; \omega_o, \boldsymbol{\phi}] \right|^2$$

over ωo and Image. The MSE in Equation (10.17) can be expanded as

(10.18)

$$E(\omega_o, \boldsymbol{\phi}) = \frac{1}{N_w} \sum_n |s[n]|^2 - \frac{2}{N_w}\, \mathrm{Re} \sum_n s[n]\, \hat{s}^*[n; \omega_o, \boldsymbol{\phi}] + \frac{1}{N_w} \sum_n \left| \hat{s}[n; \omega_o, \boldsymbol{\phi}] \right|^2$$

Observe that the first term of Equation (10.18) is the average energy (power) in the measured signal. We denote this average by Ps [an averaged version of the previously defined total energy in Equation (10.10)], i.e.,

(10.19)

$$P_s = \frac{1}{N_w} \sum_{n=-(N_w-1)/2}^{(N_w-1)/2} |s[n]|^2$$

Substituting Equation (10.16) in the second term of Equation (10.18) leads to the relation

(10.20)

$$\frac{1}{N_w}\, \mathrm{Re} \sum_n s[n]\, \hat{s}^*[n; \omega_o, \boldsymbol{\phi}] = \mathrm{Re} \sum_{k=1}^{K(\omega_o)} \bar{M}(k\omega_o)\, e^{-j\phi_k} \left[ \frac{1}{N_w} \sum_n s[n]\, e^{-jk\omega_o n} \right]$$

Finally, substituting Equation (10.16) in the third term of Equation (10.18) leads to the relation

$$\frac{1}{N_w} \sum_n \left| \hat{s}[n; \omega_o, \boldsymbol{\phi}] \right|^2 \approx \sum_{k=1}^{K(\omega_o)} \bar{M}^2(k\omega_o)$$

where the approximation is valid provided the analysis window duration is large compared with the candidate pitch period, i.e., Nw ≫ 2π/ωo, which is more or less assured by making the analysis window 2.5 times the average pitch period. Letting

(10.21)

$$S(\omega) = \frac{1}{N_w} \sum_{n=-(N_w-1)/2}^{(N_w-1)/2} s[n]\, e^{-j\omega n}$$

denote one slice of the short-time Fourier transform (STFT) of the speech signal and using this in Equation (10.20), then the MSE in Equation (10.18) becomes (Exercise 10.2)

(10.22)

$$E(\omega_o, \boldsymbol{\phi}) = P_s - 2 \sum_{k=1}^{K(\omega_o)} \bar{M}(k\omega_o)\, |S(k\omega_o)| \cos[\, \phi_k - \angle S(k\omega_o) \,] + \sum_{k=1}^{K(\omega_o)} \bar{M}^2(k\omega_o)$$

Since the phase parameters φk affect only the second term in Equation (10.22), the MSE is minimized by choosing the phase estimates

$$\hat{\phi}_k = \angle S(k\omega_o)$$

and the resulting MSE is given by

(10.23)

$$E(\omega_o) = P_s - 2 \sum_{k=1}^{K(\omega_o)} \bar{M}(k\omega_o)\, |S(k\omega_o)| + \sum_{k=1}^{K(\omega_o)} \bar{M}^2(k\omega_o)$$

where the second term is reminiscent of a correlation function in the frequency domain. The unknown pitch affects only the second and third terms in Equation (10.23), and these can be combined by defining

(10.24)

$$\rho(\omega_o) = \sum_{k=1}^{K(\omega_o)} \bar{M}(k\omega_o) \left[ |S(k\omega_o)| - \frac{1}{2} \bar{M}(k\omega_o) \right]$$

The subtractive term −(1/2)M̄(kωo), weighted by the smooth envelope M̄(kωo), biases the method away from excessively low pitch estimates. The MSE can then be expressed as

(10.25)

$$E(\omega_o) = P_s - 2\rho(\omega_o)$$

Because the first term is a known constant, the minimum mean-squared error is obtained by maximizing ρ(ωo) over ωo.

It is useful to manipulate this metric further by making explicit use of the sinusoidal representation of the input speech waveform. Assume, as in Section 10.3, that a frame of the input speech waveform has been analyzed in terms of its sinusoidal components using the analysis system described in Chapter 9.6 The measured speech data s[n] is, therefore, represented as

6 This mean-squared-error pitch extractor is predicated on the assumption that the input speech waveform has been represented in terms of the sinusoidal model. This implicitly assumes that the analysis has been performed using a Hamming window approximately two and one-half times the average pitch period. It seems, therefore, that the pitch must be known in order to estimate the average pitch that is needed to estimate the pitch. This circular dilemma can be broken by using some other method to estimate the average pitch based on a fixed window. Since only an average pitch value is needed, the estimation technique does not have to be accurate on every frame; hence, any of the well-known techniques can be used.

(10.26)

$$s[n] = \sum_{l=1}^{K} A_l \exp[\, j(\omega_l n + \theta_l) \,]$$

where {Al, ωl, θl} represents the amplitudes, frequencies, and phases of the K measured sinewaves. The sinewave representation allows us to extrapolate the speech measurement beyond the analysis window duration Nw to a larger interval N, as we described earlier. With a sinewave representation, it is straightforward to show that the signal power is given by the approximation

$$P_s \approx \sum_{l=1}^{K} A_l^2$$

and substituting the sinewave representation in Equation (10.26) in the short-time Fourier transform defined in Equation (10.21) leads to the expression

$$S(\omega) \approx \sum_{l=1}^{K} A_l\, e^{j\theta_l}\, \mathrm{sinc}(\omega - \omega_l)$$

where

$$\mathrm{sinc}(\omega) = \frac{\sin(N\omega/2)}{N \sin(\omega/2)}$$

Because the sinewaves are well-resolved, the magnitude of the STFT can then be approximated by

$$|S(\omega)| \approx \sum_{l=1}^{K} A_l\, D(\omega - \omega_l)$$

where D(x) = |sinc (x)|. The MSE criterion then becomes

(10.27)

$$\rho(\omega_o) = \sum_{k=1}^{K(\omega_o)} \bar{M}(k\omega_o) \left[ \sum_{l=1}^{K} A_l\, D(\omega_l - k\omega_o) - \frac{1}{2} \bar{M}(k\omega_o) \right]$$

where ωl are the frequencies of the sinewave representation in Equation (10.26) and ωo is the candidate pitch.

To gain some insight into the meaning of this criterion, suppose that the input speech is periodic with pitch frequency ω*. Then (barring measurement error) ωl = lω*, Al = M̄(lω*), and

$$\rho(\omega^*) = \frac{1}{2} \sum_{k=1}^{K(\omega^*)} \bar{M}^2(k\omega^*)$$

When ωo corresponds to sub-multiples of the pitch, the first term in Equation (10.27) remains unchanged, since D(ωl − kωo) = 0 at the submultiple harmonics that fall between the measured frequencies, but the second term, because it is an envelope and always non-zero, will increase at the submultiples of ω*. As a consequence,

$$\rho(\omega^*/m) < \rho(\omega^*), \qquad m = 2, 3, \ldots$$

which shows that the MSE criterion leads to unambiguous pitch estimates. To see this property more clearly, consider the example illustrated in Figure 10.4.

Figure 10.4 Illustration of sinewave-based pitch estimator for a periodic input: (a) the first term in Equation (10.28) with D(f − kfo) (a set of “comb filters”) spaced by the candidate fundamental frequency ƒo, where ƒ denotes frequency in Hertz. The comb is sampled at the measured frequencies ƒl = lƒ*; (b) values of the likelihood function ρ(ƒo) for pitch candidates 50, 100, 200, and 400 Hz. The true pitch is 200 Hz.

In this example, the true pitch is 200 Hz and the sinewave envelope is constant at unity. One can then simplify Equation (10.27) as

(10.28)

$$\rho(\omega_o) = \sum_{k=1}^{K(\omega_o)} \sum_{l} D(\omega_l - k\omega_o) - \frac{1}{2} K(\omega_o)$$

where K(ωo) is the number of harmonics over a fixed bandwidth of 800 Hz. The first term in Equation (10.28) corresponds to laying down a set of “comb filters” D(ω − kωo) (yet again different in nature from those previously described) spaced by the candidate fundamental frequency ωo. The comb is then sampled at the measured frequencies ωl = lω* and the samples are summed. Finally, the resulting value is reduced by half the number of harmonics over the fixed band.

For the candidate (trial) of ƒo = 200 Hz, ρ(ƒo) = 2, as illustrated in Figure 10.4b. For the candidate ƒo = 100 Hz (pitch-halving), the first term is the same as for ƒo = 200 Hz, but the second term decreases (negatively) so that ρ(ƒo) = 0. The remaining two cases in Figure 10.4b are straightforward to evaluate. This argument with constant M̄(ω) holds more generally since we can write Equation (10.27) as

(10.29)

$$\rho(\omega_o) = \sum_{k=1}^{K(\omega_o)} \bar{M}(k\omega_o) \sum_{l=1}^{K} A_l\, D(\omega_l - k\omega_o) \;-\; \frac{1}{2} \sum_{k=1}^{K(\omega_o)} \bar{M}^2(k\omega_o)$$

where the first term is a correlation-like term [similar in style to the frequency-domain correlation-based pitch estimators, i.e., “comb filters,” in Equations (10.7) and (10.15)] and the second term is the generalized negative compensation for low-frequency fundamental candidates. Possibly the most significant attribute of the sinewave-based pitch extractor is that the usual problems with pitch-halving and pitch-doubling do not occur with the new error criterion (Exercise 10.2). This pitch estimator has been further refined to improve its resolution, resolve problems with formant-pitch interaction (as alluded to in the context of the autocorrelation pitch estimator), and improve robustness in additive noise by exploiting the auditory masking principle that low-level tones are masked by neighboring higher-level tones [8]. (This auditory masking principle is described in Chapter 13 in the context of speech enhancement.) The following example compares the sinewave-based pitch estimator for voiced and unvoiced speech:

Example 10.2       In one implementation of the sinewave-based MSE pitch extractor, the speech is sampled at 10 kHz and analyzed using a 1024-point FFT. The sinewave amplitudes and frequencies are determined over a 1000-Hz bandwidth. In Figure 10.5(b), the measured amplitudes and frequencies are shown along with the piecewise-constant SEEVOC envelope for a voiced speech segment. Figure 10.5(c) is a plot of the first term in Equation (10.29) over a candidate pitch range from 38 Hz to 400 Hz and the inherent ambiguity of the correlator (comb-filter) is apparent. It should be noted that the peak at the correct pitch is largest, but during steady vowels the ambiguous behavior illustrated in the figure commonly occurs. Figure 10.5(d) is a plot of the complete likelihood function Equation (10.29) derived from the above MSE criterion and the manner in which the ambiguities are eliminated is clearly demonstrated. Figure 10.6 illustrates typical results for a segment of unvoiced fricative speech for which there is no distinct peak in the likelihood function.
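The likelihood of Equation (10.29) can be evaluated directly from a set of measured sinewave peaks (Al, ωl), an envelope function M̄(ω), and the kernel D(x) = |sinc(x)|. In the sketch below, the pseudo-interval length N, the function names, and the toy periodic input are assumptions; with a flat unit envelope it reproduces the behavior of Figure 10.4, penalizing the halved candidate and scoring the doubled candidate below the true pitch.

```python
import numpy as np

def dirichlet_mag(w, N):
    """D(w) = |sin(N*w/2) / (N*sin(w/2))|, with the w -> 0 limit handled."""
    den = N * np.sin(w / 2.0)
    small = np.abs(den) < 1e-9
    out = np.sin(N * w / 2.0) / np.where(small, 1.0, den)
    return np.abs(np.where(small, 1.0, out))

def rho(wo, A, w_meas, Mbar, band, N=512):
    """Likelihood of Equation (10.29) for a candidate pitch wo (rad/sample)."""
    kw = wo * np.arange(1, int(band / wo) + 1)     # harmonic frequencies k*wo
    Mk = Mbar(kw)                                  # envelope sampled at harmonics
    comb = np.array([np.sum(A * dirichlet_mag(w_meas - f, N)) for f in kw])
    return np.sum(Mk * comb) - 0.5 * np.sum(Mk ** 2)

# Periodic toy input at w* = 0.2*pi with a flat unit envelope.
w_star = 0.2 * np.pi
w_meas = w_star * np.arange(1, 5)                  # measured frequencies l*w*
A = np.ones_like(w_meas)
flat = lambda w: np.ones_like(w)
for cand in (w_star / 2, w_star, 2 * w_star):      # prints roughly -0.4, 2.0, 1.0
    print(round(rho(cand, A, w_meas, flat, band=0.9 * np.pi), 3))
```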

10.4.3 Voicing Detection

In the context of the sinusoidal model, the degree to which a given frame of speech is voiced is determined by the degree to which the harmonic model fits the original sinewave data [5],[6]. The previous example indicated that the likelihood function is useful as a means to determine this degree of voicing. The accuracy of the harmonic fit can be related, in turn, to the signal-to-noise ratio (SNR) defined by

$$\mathrm{SNR} = \frac{\displaystyle\sum_n |s[n]|^2}{\displaystyle\sum_n \left| s[n] - \hat{s}[n; \omega_o] \right|^2}$$

where ŝ[n; ωo] is the sinewave harmonic model at the selected pitch ωo. From Equation (10.25), it follows that

$$\mathrm{SNR} = \frac{P_s}{P_s - 2\rho(\omega_o)}$$

Figure 10.5 Sinewave pitch estimator performance for voiced speech: (a) input speech; (b) piecewise-constant SEEVOC envelope; (c) autocorrelation (comb-filter) component of the likelihood function; (d) complete likelihood function.

SOURCE: R.J. McAulay and T.F. Quatieri, “Sinusoidal Coding,” chapter in Speech Coding and Synthesis [8]. ©1995, Elsevier Science. Reprinted with permission from Elsevier Science.


where the input power Ps can be computed from the sinewave amplitudes. If the SNR is large, then the MSE is small and the harmonic fit is very good, which indicates that the input speech is probably voiced. For small SNR, on the other hand, the MSE is large and the harmonic fit is poor, which indicates that the input speech is more likely to be unvoiced. Therefore, the degree of voicing is functionally dependent on the SNR. Although the determination of the exact functional form is difficult, a rule that has proven useful in several speech applications is the following (Figure 10.7):

Figure 10.6 Sinewave pitch estimator for unvoiced fricative speech: (a) input speech; (b) piecewise-constant SEEVOC envelope; (c) autocorrelation (comb-filter) component of the likelihood function; (d) complete likelihood function.

SOURCE: R.J. McAulay and T.F. Quatieri, “Sinusoidal Coding,” chapter in Speech Coding and Synthesis [8]. ©1995, Elsevier Science. Reprinted with permission from Elsevier Science.


Figure 10.7 Voicing probability measure derived from the SNR associated with the MSE of the sinewave pitch estimator.

$$P_v = \begin{cases} 0, & \mathrm{SNR} \le \mathrm{SNR}_{\mathrm{low}} \\ \dfrac{\mathrm{SNR} - \mathrm{SNR}_{\mathrm{low}}}{\mathrm{SNR}_{\mathrm{high}} - \mathrm{SNR}_{\mathrm{low}}}, & \mathrm{SNR}_{\mathrm{low}} < \mathrm{SNR} < \mathrm{SNR}_{\mathrm{high}} \\ 1, & \mathrm{SNR} \ge \mathrm{SNR}_{\mathrm{high}} \end{cases}$$

where Pv represents the probability that speech is voiced, the SNR is expressed in dB, and the lower and upper thresholds are set as indicated in Figure 10.7. It is this quantity that was used to control the voicing-adaptive sinewave-based modification schemes in Chapter 9 and the voicing-adaptive frequency cutoff for the phase model to be used later in this chapter and in Chapter 12 for sinewave-based speech coding.
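A minimal sketch of such a rule: the SNR in dB is mapped linearly to [0, 1] between two breakpoints. The breakpoint values below are illustrative assumptions, not the values used to generate Figure 10.7.

```python
import numpy as np

def voicing_probability(ps, mse, snr_lo_db=2.0, snr_hi_db=10.0):
    """Map the harmonic-fit SNR = Ps / E(wo) to a voicing probability P_v.
    The 2-dB and 10-dB breakpoints are assumed values for illustration."""
    snr_db = 10.0 * np.log10(ps / max(mse, 1e-12))
    return float(np.clip((snr_db - snr_lo_db) / (snr_hi_db - snr_lo_db), 0.0, 1.0))
```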

The next two examples illustrate the performance of the sinewave-based pitch estimator of Section 10.4.2 and the above voicing estimator for normal and “anomalous” voice types.

Example 10.3       Figure 10.8 illustrates the sinewave pitch and voicing estimates for the utterance, “Which tea party did Baker go to?” from a female speaker. Although the utterance is a question, and thus, from our discussion in Chapter 3, we might expect a rising pitch at the termination of the passage, we see a rapid falling of pitch in the final word because the speaker is relaxing her vocal cords.

Example 10.4       Figure 10.9 illustrates the sinewave pitch and voicing estimates for the utterance, “Jazz hour” from a very low-pitched male speaker. We see that the diplophonia in the utterance causes a sudden doubling of pitch during the second word “hour” where secondary pulses are large within a glottal cycle. Secondary pulses can result in an amplitude dip on every other harmonic (Exercise 3.3) and thus a doubling of the pitch estimate. We also see that diplophonia causes a severe raggedness in the voicing probability measure. The reader is asked to consider these contours further in Exercise 10.13.

10.4.4 Time-Frequency Resolution Perspective

We now summarize the sinewave pitch estimator strategy from a time-frequency resolution perspective. In Equation (10.17), we are attempting to fit the measurement s[n] over an Nw-sample analysis window by a sum of sinewaves that are harmonically related, i.e., by a model signal that is perfectly periodic. We have the spectral envelope of the sinewave amplitudes and thus the sinewave amplitudes for any fundamental frequency.

Figure 10.8 Pitch and voicing contours from sinewave-based estimators for the utterance, “Which tea party did Baker go to?” from female speaker: (a) waveform; (b) pitch contour; (c) voicing probability contour.


When we work through the algebra, we obtain the error function Equation (10.23) to be minimized over the unknown pitch ωo and its equivalent form Equation (10.24) to be maximized. In this analysis, we have not yet made any assumptions about the measurement s[n]. We might simply stop at this point and perform the maximization to obtain a pitch estimate. For some conditions, this will give a reasonable pitch estimate and, in fact, the error function in Equation (10.24) benefits from the presence of the negative term on the right side of the equation in the sense that it helps to avoid pitch halving. But there is a problem with this approach.

Observe that Equation (10.24) contains the short-time Fourier transform magnitude of the measurement, and recall that we make the analysis window as short as possible to obtain a time resolution as good as possible. Because the window is short, the main lobe of the window’s Fourier transform is wide. For a low-pitch candidate ωo, then, the first term in the expression for ρ(ωo) in Equation (10.24) will be sampled many times within the main lobes of the window Fourier transforms embedded inside |S(kωo)|. This may unreasonably bias the pitch to low estimates, even with the negative term on the right. Therefore, we are motivated to make the analysis window very long to narrow the window’s main lobe. This reduces time resolution, however, and can badly blur time variations and voiced/unvoiced transitions.

Figure 10.9 Pitch and voicing contours from sinewave-based estimators for the utterance, “Jazz hour” from a very low-pitched male speaker with diplophonia: (a) waveform; (b) pitch contour; (c) voicing probability contour.


Our goal, then, is a representation of the measurement s[n] that is derived from a short window but has the advantage of being able to represent a long interval in Equation (10.24). Recall that we were faced with this problem with the earlier autocorrelation pitch estimator of Equation (10.1). In that case, we used a sinewave representation of the measurement s[n] to allow extrapolation beyond the analysis window interval. We can invoke the same procedure in this case. We modify the original error criterion in Equation (10.17) so that s[n] is a sinewave representation derived from the amplitudes, frequencies, and phases at the STFT magnitude peaks. This leads to Equation (10.27) in which the main lobe of the function D(ω) = |sinc(ω)| is controlled by the (pseudo) window length N. That is, we can make N long (and longer than the original window length Nw) to make the main lobe of D(ω) narrow, thus avoiding the above problem of biasing the estimate to low pitch. Observe that we are using two sinewave representations: a harmonic sinewave model that we fit to the measurement, and a sinewave model of the measurement that is not necessarily harmonic since it is derived from peak-picking the STFT magnitude.

10.4.5 Evaluation by Harmonic Sinewave Reconstruction

Validating the performance of a pitch extractor can be a time-consuming and laborious procedure since it requires a comparison with hand-labeled data. An alternative approach is to reconstruct the speech using the harmonic sinewave model in Equation (10.16) and to listen for pitch errors. The procedure is not quite so straightforward as Equation (10.16) indicates, however, because, during unvoiced speech, meaningless pitch estimates are made that can lead to perceptual artifacts whenever the pitch estimate is greater than about 150 Hz. This is due to the fact that, in these cases, there are too few sinewaves to adequately synthesize a noiselike waveform. This problem has been eliminated by defaulting to a fixed low pitch (≈ 100 Hz) during unvoiced speech whenever the pitch exceeds 100 Hz. The exact procedure for doing this is to first define a voicing-dependent cutoff frequency, ωc (as we did in Chapter 9):

(10.30)

$$\omega_c(P_v) = P_v\, \pi$$

which is constrained to be no smaller than 2π (1500 Hz/ƒs), where ƒs is the continuous-to-discrete time sampling frequency. If the pitch estimate is ωo, then the sinewave frequencies used in the reconstruction are

(10.31)

$$\omega_k = \begin{cases} k\,\omega_o, & k \le k^* \\ k^*\omega_o + (k - k^*)\,\omega_u, & k > k^* \end{cases}$$

where k* is the largest value of k for which k*ωo ≤ ωc(Pv), and where ωu, the unvoiced pitch, corresponds to 100 Hz [i.e., ωu = 2π(100 Hz/ƒs)]. If ωo < ωu, then we set ωk = kωo for all k. The harmonic reconstruction then becomes

(10.32)

$$\hat{s}[n] = \sum_{k} \bar{M}(\omega_k) \exp[\, j(\omega_k n + \bar{\theta}_k) \,]$$

where θ̄k is the phase obtained by sampling a piecewise-constant phase function derived from the measured STFT phase using the same strategy used to generate the SEEVOC envelope (Section 10.5.3), and M̄(ωk) is the corresponding envelope sample. Strictly speaking, this procedure is harmonic only during strongly-voiced speech because, if the speech is a voiced/unvoiced mixture, the frequencies above the cutoff, although equally spaced by ωu, are aharmonic, since they are themselves not multiples of the fundamental pitch.
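The frequency-selection rule of Equations (10.30) and (10.31) can be sketched as follows. The 1500-Hz floor and the 100-Hz unvoiced spacing follow the description in the text, while the proportional cutoff rule (the fraction Pv of the full band) and the function interface are assumptions.

```python
def reconstruction_freqs(fo, pv, fs, f_unvoiced=100.0, f_floor=1500.0):
    """Sinewave frequencies (Hz) for harmonic reconstruction, per Equation (10.31)."""
    fc = max(pv * (fs / 2.0), f_floor)   # assumed cutoff rule (10.30): fraction P_v
                                         # of the band, floored at 1500 Hz
    f_max = fs / 2.0
    if fo < f_unvoiced:                  # pitch below the unvoiced pitch:
        k_max = int(f_max / fo)          # keep harmonics of fo over the whole band
        return [k * fo for k in range(1, k_max + 1)]
    k_star = int(fc / fo)                # largest k with k*fo <= fc
    voiced = [k * fo for k in range(1, k_star + 1)]
    unvoiced, f = [], k_star * fo + f_unvoiced
    while f <= f_max:                    # 100-Hz spacing above the cutoff
        unvoiced.append(f)
        f += f_unvoiced
    return voiced + unvoiced

print(reconstruction_freqs(fo=120.0, pv=0.5, fs=8000.0)[:6])  # [120, 240, ..., 720]
```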

The synthetic speech produced by this model is of very high quality, almost perceptually equivalent to the original. Not only does this validate the performance of the MSE pitch extractor, but it also shows that if the amplitudes and phases of the harmonic representation could be efficiently coded, then only the pitch and voicing would be needed to code the information in the sinewave frequencies. Moreover, when the measured phases are replaced by those defined by a voicing-adaptive phase model to be derived in the following section, the synthetic speech is also of good quality, although not equivalent to that obtained using the phase samples of the piecewise-flat phase envelope derived from the STFT phase measurements. However, it provides an excellent basis from which to derive a low bit-rate speech coder as described in Chapter 12.

10.5 Glottal Pulse Onset Estimation

In the previous section, the sinewave model was used as the basis for pitch estimation, which also led naturally to a harmonic sinewave model and associated harmonic analysis/synthesis system of high quality. This harmonic representation is important not only for the testing of the pitch estimation algorithm, but also, as stated, for the development of a complete parameterization for speech coding; an efficient parametric representation must be developed to reduce the size of the parameter set. That is, the raw sinewave amplitudes, phases, and frequencies cannot all be coded efficiently without some sort of parameterization, as we will discuss further in Chapter 12. The next step toward this goal is to develop a model for the sinewave phases by explicitly identifying the source linear phase component, i.e., the glottal pulse onset time, and the vocal tract phase component. In Chapter 9, in the application of time-scale modification, we side-stepped the absolute onset estimation problem with the use of a relative onset time derived by accumulating pitch periods. In this chapter, we propose one approach to absolute onset estimation [6],[7],[8]. Other methods include inverse filtering-based approaches described in Chapter 5 and a phase derivative-based approach [14].

10.5.1 A Phase Model Based on Onset Time

We first review the excitation onset time concept that was introduced in Chapter 9. For a periodic voiced speech waveform with period P, in discrete time the excitation is given by a train of periodically spaced impulses, i.e.,

$$e[n] = \sum_{q=-\infty}^{\infty} \delta[n - n_o - qP]$$

where no is a displacement of the pitch pulse train from the origin. The sequence of excitation pitch pulses can also be written in terms of complex sinewaves as

(10.33)

$$e[n] = \sum_{k=1}^{K} a_k \exp[\, j\omega_k (n - n_o) \,]$$

where ωk = (2π/P)k. We assume that the excitation amplitudes ak are unity, implying that the measured sinewave amplitudes at spectral peaks are due solely to the vocal tract system function, the glottal flow function being embedded within the vocal tract system function. Generally, the sinewave excitation frequencies ωk are assumed to be aharmonic but constant over an analysis frame. The parameter no corresponds to the time of occurrence of the pitch pulse nearest the center of the current analysis frame. The occurrence of this temporal event, called the onset time and introduced in Chapter 9, ensures that the underlying excitation sinewaves will be “in phase” at the time of the pitch pulse.

The amplitude and phase of the excitation sinewaves are altered by the vocal tract system function. Letting H(ω) denote the composite system function, the speech signal at its output becomes

(10.34)

$$\hat{s}[n] = \sum_{k=1}^{K} a_k\, H(\omega_k) \exp[\, j\omega_k (n - n_o) \,]$$

which we write in terms of system and excitation components as

(10.35)

$$\hat{s}[n] = \sum_{k=1}^{K} M(\omega_k) \exp\!\left( j \left\{ \Phi(\omega_k) + \psi_k[n] \right\} \right)$$

where

$$M(\omega) = |H(\omega)|, \qquad \Phi(\omega) = \angle H(\omega), \qquad \psi_k[n] = \omega_k (n - n_o)$$

The time dependence of the system function, which was included in Chapter 9, is omitted here under the assumption that the vocal tract (and glottal flow function) is stationary over the duration of each synthesis frame. The excitation phase is linear with respect to frequency and time. As in Chapter 9, we assume that the system function has no linear phase so that all linear phase in Image is due to the excitation.

Let us now write the composite phase in Equation (10.35) as

θk[n] = Φ(ωk) + ψk[n].

At time n = 0 (i.e., the analysis and synthesis frame center), then

θk[0] = Φ(ωk) + ψk[0]

          = Φ(ωk) − noωk.

The excitation and system phase components of the composite phase are illustrated in Figure 10.10 for continuous frequency ω. The phase value θk[0] is obtained from the STFT at frequencies ωk at the center of an analysis frame, as described in Chapter 9. Likewise, the value M(ωk) is the corresponding measured sinewave amplitude. Toward the development of a sinewave analysis/synthesis system based on the above phase decomposition, a method for estimating the onset time will now be described, given the measured sinewave phase θk[0] and amplitude M (ωk) values [6],[7].

Figure 10.10 Vocal tract and excitation sinewave phase contributions. At the center of the analysis frame, i.e., at time n = 0, the excitation phase is linear in frequency and given by − noω.


10.5.2 Onset Estimation

From Chapter 9, the speech waveform can also be represented in terms of measured sinewave amplitude, frequency, and phase parameters as

(10.36)

$$s[n] = \sum_{k=1}^{K} A_k \exp[\, j(\omega_k n + \theta_k) \,]$$

where θk denotes the measured phase θk[0] at the analysis frame center and where sinewave parameters are assumed to be fixed over the analysis frame. We now have two sinewave representations, i.e., s[n] in Equation (10.36) obtained from parameter measurements and the sinewave model ŝ[n] in Equation (10.35) with a linear excitation phase in terms of an unknown onset time. In order to determine the excitation phase, the onset time parameter no is estimated by choosing the value of no so that ŝ[n] is as close as possible to s[n], according to some meaningful criterion [6],[7],[8]. A reasonable criterion is to seek the minimum of the mean-squared error (MSE) over a time interval N (generally greater than the original analysis window duration Nw), i.e.,

(10.37)

$$E[n_o] = \frac{1}{N} \sum_{n=-(N-1)/2}^{(N-1)/2} \left| s[n] - \hat{s}[n; n_o] \right|^2$$

over no and where we have added the argument no in ŝ[n; no] to denote that this sequence is a function of the unknown onset time.

The MSE in Equation (10.37) can be expanded as

(10.38)

$$E[n_o] = \frac{1}{N} \sum_n |s[n]|^2 - \frac{2}{N}\, \mathrm{Re} \sum_n s[n]\, \hat{s}^*[n; n_o] + \frac{1}{N} \sum_n \left| \hat{s}[n; n_o] \right|^2$$

If the sinusoidal representation for s[n] in Equation (10.36) is used in the first term of Equation (10.38), then, as before, the power in the measured signal can be defined as

$$P_s \approx \sum_{k=1}^{K} A_k^2$$

Letting the system transfer function be written in terms of its magnitude M(ω) and phase Φ(ω), namely

H(ω) = M(ω) exp[jΦ(ω)]

and using this as in Equation (10.35), the second term of Equation (10.38) can be written as

$$\frac{1}{N}\, \mathrm{Re} \sum_n s[n]\, \hat{s}^*[n; n_o] \approx \sum_{k=1}^{K} A_k\, M(\omega_k) \cos[\, \theta_k - \Phi(\omega_k) + n_o \omega_k \,]$$

Finally, the third term in Equation (10.38) can be written as

$$\frac{1}{N} \sum_n \left| \hat{s}[n; n_o] \right|^2 \approx \sum_{k=1}^{K} M^2(\omega_k)$$

These relations are valid provided the sinewaves are well-resolved, a condition that is basically assured by making the interval N two and one-half times the average pitch period, which was the condition assumed in the estimation of the sinewave parameters in the first place. [Indeed, we can make N as long as we like because we are working with the sinewave representation in Equation (10.36).] Combining the above manipulations leads to the following expression for the MSE:

(10.39)

$$E[n_o] \approx \sum_{k=1}^{K} A_k^2 - 2 \sum_{k=1}^{K} A_k\, M(\omega_k) \cos[\, \theta_k - \Phi(\omega_k) + n_o \omega_k \,] + \sum_{k=1}^{K} M^2(\omega_k)$$

Equation (10.39) was derived under the assumption that the system amplitude M(ω) and phase Φ(ω) were known. In order to obtain the optimal onset time, therefore, the amplitude and phase of the system function must next be estimated from the data. We choose here the SEEVOC amplitude estimate M̄(ω), to be described in Section 10.5.3. If the system function is assumed to be minimum phase, then, as we have seen in Chapter 6, we can obtain the corresponding phase function by applying a right-sided lifter to the real cepstrum associated with M̄(ω), i.e.,

$$c[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log \bar{M}(\omega)\, e^{j\omega n}\, d\omega$$

from which the system minimum-phase estimate follows as

(10.40)

$$\hat{\Phi}(\omega) = \mathrm{Im} \left\{ \sum_{n=0}^{\infty} l[n]\, c[n]\, e^{-j\omega n} \right\}$$

where l[0] = 1, l[n] = 0 for n < 0, and l[n] = 2 for n > 0. Use of Equation (10.40) is incomplete, however, because the minimum-phase analysis fails to account for the sign of the input speech waveform. This is due to the fact that the sinewave amplitudes from which the system amplitude and phase are derived are the same for −s[n] and s[n]. This ambiguity can be accounted for by generalizing the system phase Φ̂(ω) in Equation (10.40) by

(10.41)

$$\hat{\Phi}_\beta(\omega) = \hat{\Phi}(\omega) + \beta\pi, \qquad \beta \in \{0, 1\}$$

and then choosing no and β to minimize the MSE simultaneously. Substituting Equation (10.40) and Equation (10.41) into Equation (10.39) leads to the following equation for the MSE:

$$E[n_o, \beta] \approx \sum_{k=1}^{K} A_k^2 - 2 \sum_{k=1}^{K} A_k\, \bar{M}(\omega_k) \cos[\, \theta_k - \hat{\Phi}(\omega_k) - \beta\pi + n_o \omega_k \,] + \sum_{k=1}^{K} \bar{M}^2(\omega_k)$$

Since only the second term depends on the phase model, it suffices to choose no and β to maximize the “likelihood” function

$$\rho(n_o, \beta) = \sum_{k=1}^{K} A_k\, \bar{M}(\omega_k) \cos[\, \theta_k - \hat{\Phi}(\omega_k) - \beta\pi + n_o \omega_k \,]$$

However, since

ρ(no, β = 1) = −ρ(no, β = 0)

it suffices to maximize |ρ(no)| where now

(10.42)

$$\rho(n_o) = \sum_{k=1}^{K} A_k\, \bar{M}(\omega_k) \cos[\, \theta_k - \hat{\Phi}(\omega_k) + n_o \omega_k \,]$$

and if n̂o is the maximizing value, then choose β = 0 if ρ(n̂o) > 0, and β = 1 otherwise.
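The onset likelihood of Equation (10.42), together with the cepstral construction of the minimum phase in Equation (10.40), can be sketched as follows. The FFT size, the candidate onset grid, and the function names are assumptions.

```python
import numpy as np

def min_phase(env_mag, nfft):
    """Phase of the minimum-phase system with magnitude env_mag, using the
    right-sided cepstral lifter of Equation (10.40). env_mag must be sampled
    on the full 0..2*pi FFT grid (hence symmetric about pi)."""
    c = np.fft.ifft(np.log(np.maximum(env_mag, 1e-12))).real   # real cepstrum
    l = np.zeros(nfft)
    l[0] = 1.0
    l[1:nfft // 2] = 2.0                                       # l[n] = 2 for n > 0
    l[nfft // 2] = 1.0
    return np.fft.fft(l * c).imag                              # Phi_hat(omega)

def onset_likelihood(A, wk, theta, Mbar_k, Phi_k, n_range):
    """rho(n_o) of Equation (10.42) over candidate onset times (in samples)."""
    return np.array([np.sum(A * Mbar_k * np.cos(theta - Phi_k + no * wk))
                     for no in n_range])

# Usage: rho = onset_likelihood(...); n_hat = n_range[np.argmax(np.abs(rho))],
# with beta = 0 if rho at n_hat is positive and beta = 1 otherwise (Eq. (10.41)).
```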

Example 10.5       In this experiment, the sinewave analysis uses a 10-kHz sampling rate and a 1024-point FFT to compute the STFT [7]. The magnitude of the STFT is computed and the underlying sinewaves are identified by determining the frequencies at which a change in the slope of the STFT magnitude occurs. The SEEVOC algorithm was used to estimate the piecewise-linear envelope of the sinewave amplitudes. A typical set of results is shown in Figure 10.11. The logarithm of the STFT magnitude is shown in Figure 10.11d together with the SEEVOC envelope. The cepstral coefficients are used in Equation (10.40) to compute the system phase, from which the onset likelihood function is computed using Equation (10.42). The result is shown in Figure 10.11b. It is interesting to note that the peaks in the onset function occur at the points which seem to correspond to a sharp transition in the speech waveform, probably near the glottal pulse, rather than at a peak in the waveform. We also see in Figure 10.11c that the phase residual is close to zero below about 3500 Hz, indicating the accuracy of our phase model in this region. The phase residual is the difference between the measured phase and our modeled phase, interpolated across sinewave frequencies.

An example of the application of the onset estimator for an unvoiced fricative case is shown in Figure 10.12 [7]. The estimate of the onset time is meaningless in this case, and this is reflected in the relatively low value of the likelihood function. The onset time, however, can have meaning and importance for non-fricative consonants such as plosives where the phase and event timing is important for maintaining perceptual quality.

10.5.3 Sinewave Amplitude Envelope Estimation

The above results show that if the envelope of the sinewave amplitudes is known, then the MSE criterion can lead to a technique for estimating the glottal pulse onset time under the assumption that the glottal pulse and vocal tract response are minimum-phase. This latter assumption was necessary to derive an estimate of the system phase, hence the performance of the estimator depends on the system magnitude. An ad hoc estimator for the magnitude of the system function is simply to apply linear interpolation between successive sinewave peaks. This results in the function

(10.43)

$$\bar{M}(\omega) = A_k + \frac{A_{k+1} - A_k}{\omega_{k+1} - \omega_k} (\omega - \omega_k), \qquad \omega_k \le \omega < \omega_{k+1}$$

The problem with such a simple envelope estimator is that the system phase is sensitive to low-level peaks that can arise due to time variations in the system function or signal processing artifacts such as side-lobe leakage. Fortunately, this problem can be avoided using the technique proposed by Paul [8],[10] in the development of the Spectral Envelope Estimation Vocoder (SEEVOC).

Figure 10.11 Onset estimation for voiced speech example: (a) voiced speech segment; (b) likelihood function; (c) phase residual; (d) STFT magnitude and superimposed piecewise-linear SEEVOC envelope.

SOURCE: R.J. McAulay and T.F. Quatieri, “Low-Rate Speech Coding Based on the Sinusoidal Model,” chapter in Advances in Speech Signal Processing [7]. ©1992, Marcel Dekker, Inc. Courtesy of Marcel Dekker, Inc.


The SEEVOC algorithm depends on having an estimate of the average pitch, denoted here by ω̄o. The first step is to search for the largest sinewave amplitude in the interval [ω̄o/2, 3ω̄o/2]. Having found the amplitude and frequency of that peak, labeled (A1, ω1), one then searches the interval [ω1 + ω̄o/2, ω1 + 3ω̄o/2] for its largest peak, labeled (A2, ω2), as illustrated in Figure 10.13. The process is continued by searching the intervals [ωk−1 + ω̄o/2, ωk−1 + 3ω̄o/2] for the largest peaks (Ak, ωk) until the edge of the speech bandwidth is reached. If no peak is found in a search bin, then the largest endpoint of the short-time Fourier transform magnitude is used and placed at the bin center, from which the search procedure is continued. The principal advantage of this pruning method is the fact that any low-level peaks within a pitch interval are masked by the largest peak, presumably a peak that is close to an underlying harmonic. Moreover, the procedure is not dependent on the peaks’ being harmonic, nor on the exact value of the average pitch, since the procedure resets itself after each peak has been found. The SEEVOC envelope, the envelope upon which the above minimum-phase analysis is based, is then obtained by applying the linear interpolation rule,7 Equation (10.43), where now the sinewave amplitudes and frequencies are those obtained using the SEEVOC peak-picking routine.
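A sketch of the SEEVOC peak-pruning search just described, keeping the largest measured peak per pitch-wide bin and restarting each bin from the last accepted peak. The interface is an assumption, and the empty-bin case is simplified relative to the text (see the comment).

```python
import numpy as np

def seevoc_peaks(peak_amps, peak_freqs, f_avg, band):
    """Prune measured (amplitude, frequency) peaks to one peak per bin of
    width f_avg (the average pitch), as in the SEEVOC search."""
    sel_A, sel_f = [], []
    left = f_avg / 2.0                            # first bin: [f_avg/2, 3*f_avg/2]
    while left < band:
        right = left + f_avg
        inbin = (peak_freqs >= left) & (peak_freqs < right)
        if np.any(inbin):
            i = int(np.argmax(np.where(inbin, peak_amps, -np.inf)))
            sel_A.append(float(peak_amps[i]))
            sel_f.append(float(peak_freqs[i]))
            left = peak_freqs[i] + f_avg / 2.0    # restart the search at the peak
        else:
            # The text places the larger STFT bin-endpoint value at the bin
            # center; lacking the STFT here, a zero placeholder is used instead.
            sel_A.append(0.0)
            sel_f.append(left + f_avg / 2.0)
            left = right
    return np.array(sel_A), np.array(sel_f)
```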

Figure 10.12 Onset estimation for unvoiced fricative speech example: (a) unvoiced speech segment; (b) likelihood function; (c) phase residual; (d) STFT magnitude and superimposed piecewise-linear SEEVOC envelope.

SOURCE: R.J. McAulay and T.F. Quatieri, “Low-Rate Speech Coding Based on the Sinusoidal Model,” chapter in Advances in Speech Signal Processing [7]. ©1992, Marcel Dekker, Inc. Courtesy of Marcel Dekker, Inc.


7 In the sinewave pitch estimation of the previous section, a piecewise-constant envelope was derived from the pruned peaks rather than a piecewise-linear envelope, having the effect of maintaining at harmonic frequencies the original peak amplitudes.

Figure 10.13 SEEVOC spectral envelope estimator. One spectral peak is selected for each harmonic bin. The envelope estimate is denoted by M̄(ω) and the speech spectrum by |S(ω)|.

10.5.4 Minimum-Phase Sinewave Reconstruction

Consider now a minimum-phase model, which comes naturally from the discussion in Section 10.5.2. In order to evaluate the accuracy of the minimum-phase model, as well as the onset estimator, it is instructive to examine the behavior of the phase error (previously referred to as the “phase residual”) associated with this model. Let

(10.44)

$$\hat{\theta}(\omega) = \hat{\Phi}(\omega) + \beta\pi - \hat{n}_o\, \omega$$

be the model estimate of the sinewave phase at any frequency ω, where Φ̂(ω) denotes the minimum-phase estimate. Then, for the sinewave at frequency ωk, for which the measured phase is θk, the phase error is

$$\epsilon_k = \theta_k - \hat{\theta}(\omega_k)$$

The phase error for the speech sample in Figure 10.11a is shown in Figure 10.11c. In this example, the phase error is small for frequencies below about 3.5 kHz. The larger structured error beyond 3.5 kHz probably indicates the inadequacy of the minimum-phase assumption and the possible presence of noise in the speech source or maximum-phase zeros in the transfer function H (ω) (Chapter 6). Nevertheless, as with the pitch estimator, it is instructive to evaluate the estimators by reconstruction [6],[7],[8].

One approach to reconstruction is to assume that the phase residual is negligible and form a phase function for each synthesis frame that is the sum of the minimum-phase function and a linear excitation phase and sign derived from the onset estimator. In other words, from Equation (10.44), the sinewave phases on each frame are given by

(10.45)

$$\theta_k = \hat{\Phi}(\omega_k) + \beta\pi - \hat{n}_o\, \omega_k$$

When this phase is used, along with the measured sinewave amplitudes at ωk, the resulting quality of voiced speech is quite natural, but some slight hoarseness is introduced due to a few-sample inaccuracy in the onset estimator that results in a randomly changing pitch period (pitch jitter); it was found that even a sporadic one- or two-sample error in the onset estimator can introduce this distortion. This property was further confirmed by replacing the above absolute onset estimator by the relative onset estimator introduced in Chapter 9, whereby onset estimates are obtained by accumulating pitch periods. With this relative onset estimate, the hoarseness attributed to error in the absolute onset time is removed and the synthesized speech is free of artifacts. In spite of this limitation of the absolute onset estimator, it provides a means to obtain useful glottal flow timing information in applications where reduction in the linear phase is important (Exercise 10.5). Observe that we have described reconstruction of only voiced speech. When the phase function of Equation (10.45) is applied to unvoiced speech, particularly fricatives and voicing with frication or strong aspiration, the reconstruction is “buzzy” because an unnatural waveform peakiness arises due to the sinewaves no longer being randomly displaced from one another (Exercise 10.6). An approach to remove this distortion when a minimum-phase vocal tract system is invoked is motivated by the phase-dithering model of Section 9.5.2 and will be described in Chapter 12 in the context of speech coding.

10.6 Multi-Band Pitch and Voicing Estimation

In this section, we describe a generalization of sinewave pitch and voicing estimation that yields a voicing decision in multiple frequency bands. A byproduct is sinewave amplitude estimates derived from a minimum-mean-squared error criterion. These pitch and voicing estimates were developed in the context of the Multi-Band Excitation (MBE) speech representation developed by Griffin and Lim [2],[8] where, as in Section 10.4, speech is represented as a sum of harmonic sinewaves.

10.6.1 Harmonic Sinewave Model

As above, the synthetic waveform for a harmonic set of sinewaves is written as

(10.46)

$$\hat{s}[n] = \sum_{k=1}^{K(\omega_o)} B_k \exp[\, j(k\omega_o n + \phi_k) \,]$$

Whereas in Section 10.4.2 the sinewave amplitudes were assumed to be harmonic samples of an underlying vocal tract envelope, in MBE they are allowed to be unconstrained free variables and are chosen to render ŝ[n] a minimum-mean-squared error fit to the measured speech signal s[n]. For analysis window w[n] of length Nw, the short-time speech segment and its harmonic sinewave representation are denoted by sw[n] = w[n]s[n] and ŝw[n] = w[n]ŝ[n] (thus simplifying to one STFT slice), respectively. The mean-squared error between the two signals is given by

(10.47)

$$E(\omega_o, \mathbf{B}, \boldsymbol{\phi}) = \sum_{n} \left| s_w[n] - \hat{s}_w[n] \right|^2$$

where B and φ are the vectors of unknown amplitudes and phases at the sinewave harmonics. Following the development in [8],

$$S_w(\omega) = \sum_{n} s_w[n]\, e^{-j\omega n}$$

denotes the discrete-time Fourier transform of sw[n] and, similarly, Ŝw(ω) denotes the discrete-time Fourier transform of ŝw[n]. Then, using Parseval’s Theorem, Equation (10.47) becomes

(10.48)

$$E(\omega_o, \mathbf{B}, \boldsymbol{\phi}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| S_w(\omega) - \hat{S}_w(\omega) \right|^2 d\omega$$

The first term in the expansion of Equation (10.48), which is independent of the pitch, amplitude, and phase parameters, is the energy in the windowed speech signal, which we denote by Ew. Letting B̃k = Bk exp(jφk) represent the complex amplitude of the kth harmonic and using the sinewave decomposition in Equation (10.46), Ŝw(ω) can be written as

$$\hat{S}_w(\omega) = \sum_{k=1}^{K(\omega_o)} \tilde{B}_k\, W(\omega - k\omega_o)$$

with

$$W(\omega) = \sum_{n} w[n]\, e^{-j\omega n}$$

where W(ω) is the discrete-time Fourier transform of the analysis window w[n]. Substituting this relation into Equation (10.48), the mean-squared error can be written as

$$E(\omega_o, \tilde{\mathbf{B}}) = E_w - 2\, \mathrm{Re} \sum_{k} \tilde{B}_k^*\, \frac{1}{2\pi} \int_{-\pi}^{\pi} S_w(\omega)\, W^*(\omega - k\omega_o)\, d\omega + \sum_{k} \sum_{l} \tilde{B}_k \tilde{B}_l^*\, \frac{1}{2\pi} \int_{-\pi}^{\pi} W(\omega - k\omega_o)\, W^*(\omega - l\omega_o)\, d\omega$$

For each value of ωo this equation is quadratic in the complex amplitudes B̃k. Therefore, each B̃k is a function of ωo and we denote this set of unknowns by B̃(ωo). It is straightforward to solve for the B̃(ωo) that results in the minimum-mean-squared error, E[ωo, B̃(ωo)]. This process can be repeated for each value of ωo so that the optimal minimum-mean-squared error estimate of the pitch can be determined. Although the quadratic optimization problem is straightforward to solve, it requires solution of a simultaneous set of linear equations for each candidate pitch value. This makes the resulting pitch estimator complicated to implement. However, following [2], we assume that W(ω) is essentially zero in the region |ω| > ωo/2, which corresponds to the condition posed in Section 10.4 to insure that the sinewaves are well-resolved. We then define the frequency region about each harmonic frequency kωo as

(10.49)

$$\Omega_k = \left\{ \omega : \left( k - \tfrac{1}{2} \right) \omega_o \le \omega < \left( k + \tfrac{1}{2} \right) \omega_o \right\}$$

with which the mean-squared error can be approximated as

$$E(\omega_o, \tilde{\mathbf{B}}) \approx E_w - 2\, \mathrm{Re} \sum_{k} \tilde{B}_k^*\, \frac{1}{2\pi} \int_{\Omega_k} S_w(\omega)\, W^*(\omega - k\omega_o)\, d\omega + \sum_{k} |\tilde{B}_k|^2\, \frac{1}{2\pi} \int_{\Omega_k} \left| W(\omega - k\omega_o) \right|^2 d\omega$$

from which it follows that the values of the complex amplitudes that minimize the mean-squared error are

(10.50)

$$\tilde{B}_k = \frac{\displaystyle\int_{\Omega_k} S_w(\omega)\, W^*(\omega - k\omega_o)\, d\omega}{\displaystyle\int_{\Omega_k} \left| W(\omega - k\omega_o) \right|^2 d\omega}$$

The best mean-squared error fit to the windowed speech data is therefore given by

$$\hat{S}_w(\omega) = \sum_{k=1}^{K(\omega_o)} \tilde{B}_k\, W(\omega - k\omega_o)$$

This expression is then used in Equation (10.47) to evaluate the mean-squared error for the given value of ωo. This procedure is repeated for each value of ωo in the pitch range of interest and the optimum estimate of the pitch is the value of ωo that minimizes the mean-squared error. While the procedure is similar to that used in Section 10.4, there are important differences. The reader is asked to explore these differences and similarities in Exercise 10.8. Extensions of this algorithm by Griffin and Lim exploit pitch estimates from past and future frames in “forward-backward” pitch tracking to improve pitch estimates during regions in which the pitch and/or vocal tract are rapidly changing [2].
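Equation (10.50) amounts, in each harmonic region Ωk, to a normalized inner product of the speech spectrum with the shifted window transform. The discrete-frequency sketch below approximates the integrals by sums over FFT bins; the FFT size and the interface are assumptions.

```python
import numpy as np

def mbe_amplitudes(s, w, fo_norm, nfft=4096):
    """Complex amplitudes B_k of Equation (10.50) for harmonics of fo_norm
    (cycles/sample), given a speech frame s and analysis window w."""
    Sw = np.fft.fft(s * w, nfft)                  # S_w(omega) on an nfft grid
    Wf = np.fft.fft(w, nfft)                      # W(omega), the window transform
    bins_per_h = fo_norm * nfft                   # FFT bins per harmonic spacing
    K = int(0.5 / fo_norm)                        # number of harmonics in the band
    B = []
    for k in range(1, K + 1):
        center = int(round(k * bins_per_h))       # bin nearest the k-th harmonic
        half = int(round(bins_per_h / 2.0))
        m = np.arange(center - half, center + half)   # the region Omega_k
        Wk = Wf[(m - center) % nfft]              # window transform shifted to k*fo
        B.append(np.sum(Sw[m % nfft] * np.conj(Wk)) / np.sum(np.abs(Wk) ** 2))
    return np.array(B)
```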

10.6.2 Multi-Band Voicing

As in Section 10.4.3, distinguishing between voiced and unvoiced spectral regions is based on how well the harmonic set of sinewaves fits the measured set of sinewaves. In Section 10.4.3, a signal-to-noise ratio (SNR) was defined in terms of the normalized mean-squared error, and this was mapped into a cutoff frequency below which the sinewaves were declared voiced and above which they were declared unvoiced. This idea, which originated with the work of Makhoul, Viswanathan, Schwartz, and Huggins [4], was generalized by Griffin and Lim [2] to allow for an arbitrary sequence of voiced and unvoiced bands with the measure of voicing in each of the bands determined by a normalized mean-squared error computed for the windowed speech signals. Letting

$$\gamma_m = \{\omega : \omega_{m-1} \le \omega \le \omega_m\}, \qquad m = 1, 2, \ldots, M$$

denote the $m$th band of $M$ multi-bands over the speech bandwidth; then, using Equation (10.48), the normalized mean-squared error for each band can be written as

(10.51)

$$\hat{E}_m = \frac{\displaystyle\int_{\gamma_m}\big|S_w(\omega)-\hat{S}_w(\omega)\big|^{2}\,d\omega}{\displaystyle\int_{\gamma_m}\big|S_w(\omega)\big|^{2}\,d\omega}$$

Each of the $M$ values of the normalized mean-squared error is compared with a threshold function to determine the binary voicing state of the sinewaves in each band. If $\hat{E}_m$ falls below the threshold, the mean-squared error is small; hence the harmonic sinewaves fit the input speech well and the band is declared voiced. The setting of the threshold uses several heuristic rules to obtain the best performance [3].
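A minimal MATLAB sketch of this per-band decision follows, assuming the measured spectrum Sw and the harmonic-model fit Shat are available on a common FFT grid (for example, from the pitch-search sketch above); the function name multiband_voicing, the band-edge vector, and the single fixed threshold are illustrative stand-ins for the heuristic threshold rules of [3].

function v = multiband_voicing(Sw, Shat, bandEdges, thresh)
% Binary voicing decision per band from the normalized error of Eq. (10.51).
%   Sw, Shat  : measured and harmonic-model spectra on the same FFT bins
%   bandEdges : (M+1)-vector of increasing band-edge bin indices
%   thresh    : scalar threshold; [3] adapts this with heuristic rules
M = length(bandEdges) - 1;
v = false(M, 1);
for m = 1:M
    b  = bandEdges(m):bandEdges(m+1);           % band gamma_m
    Em = sum(abs(Sw(b) - Shat(b)).^2) / sum(abs(Sw(b)).^2);
    v(m) = (Em < thresh);                       % small normalized error => voiced
end
end

Because each band is normalized by its own energy, a quiet high-frequency band can still register as strongly voiced; this is the noise-robustness property discussed next.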

It was observed that when the multi-band voicing decisions are combined into a two-band voicing-adaptive cutoff frequency, as was used in Section 10.4.3, no loss in quality was perceived in low-bandwidth (e.g., 3000–4000 Hz) synthesis [3],[8],[9]. Nevertheless, this scheme affords the possibility of a more accurate two-band cutoff than that of Section 10.4.3, and the multi-band extension may be useful in applications such as speech transformations (Exercise 10.9). An additional advantage of multi-band voicing is that it can make reliable voicing decisions when the speech signal has been corrupted by additive acoustic noise [2],[3],[9]. The reason is that the normalized mean-squared error essentially removes the effect of spectral tilt, so that the sinewave amplitudes contribute more or less equally from band to band. When one wide-band voicing decision is made, as in Section 10.4.3, only the largest sinewave amplitudes contribute significantly to the mean-squared error; if these have been corrupted by noise, then the remaining sinewaves, although harmonic, may not contribute enough to the error measure to offset those that are corrupted. Finally, observe that although multi-band voicing refines the two-band voicing strategy, it does not account for the additive nature of noise components in the speech spectrum, as was addressed in the deterministic-stochastic model of Chapter 9.

10.7 Summary

In this chapter, we introduced a frequency-domain approach to estimating the pitch and voicing state of a speech waveform, in contrast to the time-domain approaches based on analysis/synthesis techniques of earlier chapters. Specifically, we saw that pitch estimation can be thought of as fitting a harmonic set of sinewaves to a measured set of sinewaves, with the accuracy of the harmonic fit serving as an indicator of the voicing state. A simple autocorrelation-based estimator, formulated in the frequency domain, was our starting point and led to a variety of sinewave-based pitch estimators of increasing complexity and accuracy. A generalization of the sinewave model to a multi-band pitch and voicing estimation method was also described. In addition, we applied the sinewave model to the problem of glottal pulse onset time estimation. Finally, we gave several mixed- and minimum-phase sinewave-based waveform reconstruction techniques for evaluating the pitch, voicing, and onset estimators. A spinoff of these evaluation techniques is a set of analysis/synthesis structures, based on harmonic sinewaves, a minimum-phase vocal tract, and a linear excitation phase, that form the basis for frequency-domain sinewave-based speech coding methods in Chapter 12. Other pitch and voicing estimators will be described as needed in the context of speech coding, where they are more naturally developed for a particular coding structure.

We saw both the features and the limitations of the various pitch estimators using examples of typical voice types, as well as a diplophonic voice with secondary glottal pulses occurring regularly, roughly midway between the primary glottal pulses within a glottal cycle. Many "anomalous" voice types described earlier in this text, however, were not addressed. These include, for example, the creaky voice with erratically-spaced glottal pulses and the case in which every other vocal tract impulse response is amplitude-modulated. Such cases are generally difficult for pitch and voicing estimators and are considered in Exercises 10.10, 10.11, 10.12, and 10.13. These voice types, together with voiced/unvoiced transitions and rapidly-varying speech events, render the pitch and voicing problem in many ways still challenging and unsolved, in spite of large strides in the improvement of pitch and voicing estimators.

Exercises

10.1 Show that the autocorrelation-based pitch estimator in Equation (10.2) follows from minimizing the error criterion in Equation (10.1) with respect to the unknown pitch period P. Justify why you must constrain $P > \epsilon$ for some small positive value $\epsilon$, i.e., why P must be kept sufficiently far from zero.

10.2 In this problem you are asked to complete the missing steps in the harmonic sinewave model-based pitch estimator of Sections 10.4.1 and 10.4.2.

(a) Show that Equation (10.9) follows from Equation (10.8).

(b) Show that minimizing Equation (10.9) with respect to the sinewave phases gives Equation (10.11), and thus gives Equation (10.12).

(c) Show that minimizing Equation (10.12) with respect to B gives Equation (10.13), and thus gives Equation (10.14).

(d) Explain why Equation (10.15) can lead to pitch-halving errors.

(e) Show how Equation (10.18) can be manipulated to obtain Equation (10.22). Fill in all of the missing steps in the text. Interpret Equation (10.29) in terms of its capability to avoid pitch halving relative to correlation-based pitch estimators. Argue that the pitch estimator also avoids pitch doubling.

(f) Explain how to obtain the results in Figure 10.4b for the cases ƒo = 400 Hz and ƒo = 50 Hz.

(g) Propose an extension of the sinewave-based pitch estimator where an a priori vocal tract system phase envelope is known, as well as an a priori system magnitude envelope. Qualitatively describe the steps in deriving the estimator and explain why this additional phase information might improve the pitch estimate.

10.3 In the context of homomorphic filtering, we saw in Chapter 6 one approach to determining voicing state (i.e., a speech segment is either voiced or unvoiced) which requires the use of the real cepstrum, and in this chapter we derived a voicing measure based on the degree of harmonicity of the short-time Fourier transform. In this problem, you consider some other simple voicing measurements. Justify the use of each of the following measurements as a voicing indicator. For the first two measures, use your knowledge of acoustic phonetics, i.e., the waveform and spectrogram characteristics of voiced and unvoiced phonemes. For the last two measurements use your knowledge of linear prediction analysis.

(a) The relative energy in the outputs of complementary highpass and lowpass filters of Figure 10.14.

(b) The number of zero crossings in the signal.

(c) The first reflection coefficient $k_1$ generated in the Levinson recursion.

(d) The linear prediction residual obtained by inverse filtering the speech waveform by the inverse filter A(z).

Figure 10.14 Highpass and lowpass filters.


10.4 Suppose, in the sinewave-based pitch estimator of Section 10.4.2, we do not replace the spectrum by its sinewave representation. What problems arise that the sinewave representation helps prevent? How does it help resolve problems inherent in the autocorrelation-based sinewave pitch estimators?

10.5 Show how the onset estimator of Section 10.5.2, which provides a linear excitation phase estimate, may be useful in obtaining a vocal tract phase estimate from sinewave phase samples. Hint: Recall from Chapter 9 that one approach to vocal tract system phase estimation involves interpolation of the real and imaginary parts of the complex STFT samples at the sinewave frequencies. When the vocal tract impulse response is displaced from the time origin, a large linear phase can be introduced.

10.6 It was observed that when the phase function in Equation (10.45), consisting of a minimum-phase system function and a linear excitation phase derived from the onset estimator of Section 10.5.2, is applied to unvoiced speech, particularly fricatives and voicing with frication or strong aspiration, the reconstruction is “buzzy.” It was stated that this buzzy perceptual quality is due to an unnatural waveform peakiness. Give an explanation for this peakiness property, considering the characteristics of the phase residual in Figures 10.11 and 10.12, as well as the time-domain characteristics of a linear excitation phase.

10.7 In this problem, you investigate the “magnitude-only” counterpart to the multi-band pitch estimator of Section 10.6. Consider a perfectly periodic voiced signal of the form

x[n] = h[n] * p[n]

where

$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$

P being the pitch period. The windowed signal y[n] = w[n]x[n] (the window w[n] being a few pitch periods in duration) is expressed by

y[n] = w[n]x[n]

                      = w[n](h[n] * p[n]).

(a) Show that the Fourier transform of y[n] is given by

$$Y(\omega) = \frac{1}{P}\sum_{k=-N}^{N} H(k\omega_0)\,W(\omega - k\omega_0)$$

where $\omega_0 = 2\pi/P$ is the fundamental frequency (pitch) and $N$ is the number of harmonics.

(b) A frequency-domain pitch estimator uses the error criterion:

$$E = \frac{1}{2\pi}\int_{-\pi}^{\pi}\big(|S(\omega)| - |Y(\omega)|\big)^{2}\,d\omega$$

where S(ω) is the short-time spectral measurement and Y(ω) is the model from part (a). In the model Y(ω) there are two unknowns: the pitch ωo and the vocal tract spectral values H(kωo). Suppose for the moment that we know the pitch ωo. Consider the error E over a region around a harmonic and suppose that the window transform is narrow enough so that main window lobes are independent (non-overlapping). Then, the error around the kth harmonic can be written as

$$E(k) = \frac{1}{2\pi}\int_{\Gamma_k}\Big(|S(\omega)| - \tfrac{1}{P}\,|H(k\omega_0)|\,\big|W(\omega - k\omega_0)\big|\Big)^{2}\,d\omega$$

and the total error is approximately

$$E \approx \sum_{k=1}^{N} E(k)$$

Given ωo, find an expression for H(kωo) that minimizes E(k). With this solution, write an expression for E. Keep your expression in the frequency domain and do not necessarily simplify. It is possible with Parseval’s Theorem to rewrite this expression in the time domain in terms of autocorrelation functions, which leads to an efficient implementation, but you are not asked to show this here.

(c) From part (b), propose a method for estimating the pitch ωo that invokes minimization of the total error E. Do not attempt to find a closed-form solution, but rather describe your approach qualitatively. Discuss any disadvantages of your approach.

10.8 Consider similarities and differences between the sinewave-based pitch and voicing estimator developed in Sections 10.4.2 and 10.4.3 and the multi-band pitch and voicing estimators of Section 10.6 for the following:

1. Pitch ambiguity with pitch halving or pitch doubling. Hint: Consider the use of unconstrained amplitude estimates in the multi-band pitch estimator and the use of samples of a vocal tract envelope in the sinewave pitch estimator of Section 10.4.2.

2. Voicing estimation with voiced fricatives or voiced sounds with strong aspiration.

3. Computational complexity.

4. Dependence on the phase of the discrete-time Fourier transform over each harmonic lobe. Hint: The phase of the discrete-time Fourier transform is not always constant across every harmonic lobe. How might this changing phase affect the error criterion in Equation (10.48) and thus the amplitude estimation in the multi-band amplitude estimator of Equation (10.50)?

10.9 This problem considers the multi-band voicing measure described in Section 10.6.2.

(a) Propose a strategy for combining individual band decisions into a two-band voicing measure, similar to that described in Section 10.4.3, where the spectrum above a cutoff frequency is voiced and below the cutoff is unvoiced. As a point of interest, little quality difference has been observed between this type of reduced two-band voicing measure and the original multi-band voicing measure when used in low-bandwidth (e.g., 3000–4000 Hz) synthesis.

(b) Explain why the multi-band voicing measure of part (a), reduced to a two-band decision, gives the possibility of a more accurate two-band voicing cutoff than the sinewave-based method of Section 10.4.3.

(c) Explain how the multi-band voicing measure may be more useful in sinewave-based speech transformations than the reduced two-band decision, especially for wide-bandwidth (e.g., > 4000 Hz) synthesis. Consider, in particular, pitch modification.

10.10 In this problem, you investigate the different “anomalous” voice types of Figure 10.15 with diplophonic, creaky, and modulation (pitch periods with alternating gains) characteristics. Consider both the time-domain waveform and the short-time spectrum obtained from the center of each waveform segment.

(a) For the diplophonic voice, describe how secondary pulses generally affect the performance of both time- and frequency-domain pitch estimators.

(b) For the creaky voice, describe how erratic glottal pulses generally affect the performance of both time- and frequency-domain pitch estimators.

(c) For the modulated voice, explain why different spectral bands exhibit different pitch values. Consider, for example, time-domain properties of the signal. Propose a multi-band pitch estimator for such voice cases.

Figure 10.15 Examples of “anomalous” voice types: (a) diplophonic; (b) creaky; (c) alternating gain over successive pitch periods. The waveform and STFT magnitude are shown for each case.


10.11 (MATLAB) In this problem, you investigate the correlation-based pitch estimator of Section 10.2.

(a) Implement in MATLAB the correlation-based pitch estimator in Equation (10.2). Write your function to loop through a speech waveform at a 10-ms frame interval and to plot the pitch contour.

(b) Apply your estimator to the voiced speech waveform speech1_8k (at 8000 samples/s) in workspace ex10M1.mat located in companion website directory Chap_exercises/chapter10. Plot the short-time autocorrelation function for numerous frames. Describe possible problems with this approach for typical voiced speech.

(c) Apply your estimator to the “anomalous” voice types in workspace ex10M2.mat located in companion website directory Chap_exercises/chapter10. Three voice types are given (also shown in Figure 10.15): diplo1_8k (diplophonic), creaky1_8k (creaky), and modulated1_8k (pitch periods with alternating gain). Describe problems that your pitch estimator encounters with these voice types.

10.12 (MATLAB) In this problem, you investigate the “comb-filtering” pitch estimator of Section 10.3.

(a) Implement in MATLAB the comb-filtering pitch estimator of Equation (10.7), as used in Example 10.1, by selecting the last large peak in the pitch likelihood function Q(ω). Write the function to loop through a speech waveform at a 10-ms frame interval and to plot the pitch contour.

(b) Apply your pitch estimator to the voiced speech waveform speech1_8k (at 8000 samples/s) in workspace ex10M1.mat located in companion website directory Chap_exercises/chapter10. Add white noise to the speech waveform and recompute your pitch estimate. Plot the pitch likelihood function for numerous frames. Describe possible problems with this approach.

(c) Apply your estimator to the “anomalous” voice types in workspace ex10M2.mat located in companion website directory Chap_exercises/chapter10. Three voice types are given (and are shown in Figure 10.15): diplo1_8k (diplophonic), creaky1_8k (creaky), and modulated1_8k (pitch periods with alternating gain). Describe problems that your pitch estimator encounters with these voice types.

10.13 (MATLAB) In this problem you investigate the sinewave-based pitch estimators developed in Sections 10.4.1 and 10.4.2.

(a) Implement in MATLAB the sinewave-based pitch estimator in Equation (10.15). Write the function to loop through a speech waveform at a 10-ms frame interval and to plot the pitch contour.

(b) Apply your estimator to the voiced speech waveform speech1_8k (at 8000 samples/s) in workspace ex10M1.mat located in companion website directory Chap_exercises/chapter10. Discuss possible problems with this approach, especially the pitch-halving problem and the problem of a bias toward low pitch.

(c) Apply your estimator from part (a) to the “anomalous” voice types in workspace ex10M2.mat located in companion website directory Chap_exercises/chapter10. Three voice types are given (Figure 10.15): diplo1_8k (diplophonic), creaky1_8k (creaky), and modulated1_8k (pitch periods with alternating gain). Describe problems that your pitch estimator encounters with these voice types.

(d) Compare your results from parts (b) and (c) with the complete sinewave-based estimator in Equation (10.29) of Section 10.4.2. In companion website directory Chap_exercises/chapter10 you will find the scripts ex10_13_speech1.m, ex10_13_diplo.m, ex10_13_creaky.m, and ex10_13_modulate.m. Running each script produces a plot of the pitch and voicing contours from the complete sinewave-based estimator for the four cases speech1_8k, diplo1_8k, creaky1_8k, and modulated1_8k, respectively. Using both time- and frequency-domain signal properties, explain your observations and compare your results with those from parts (b) and (c).

(e) Now consider the time-domain homomorphic pitch and voicing estimators of Chapter 6. Predict the behavior of the homomorphic estimators for the three waveform types of Figure 10.15. How might the behavior differ from the frequency-domain estimator of Section 10.4.2?

Bibliography

[1] B. Gold and L.R. Rabiner, “Parallel Processing Techniques for Estimating Pitch Periods of Speech in the Time Domain,” J. Acoustical Society of America, vol. 46, no. 2, pp. 442–448, 1969.

[2] D. Griffin and J.S. Lim, “Multiband Excitation Vocoder,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. ASSP–36, pp. 1223–1235, 1988.

[3] A. Kondoz, Digital Speech: Coding for Low Bit Rate Communication Systems, John Wiley & Sons, New York, NY, 1994.

[4] J. Makhoul, R. Viswanathan, R. Schwartz, and A.W.F. Huggins, “A Mixed-Source Model for Speech Compression and Synthesis,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Tulsa, OK, pp. 163–166, 1978.

[5] R.J. McAulay and T.F. Quatieri, “Pitch Estimation and Voicing Detection Based on a Sinusoidal Model,” Proc. Int. Conf. Acoustics, Speech, and Signal Processing, Albuquerque, NM, pp. 249–252, 1990.

[6] R.J. McAulay and T.F. Quatieri, “Phase Modeling and its Application to Sinusoidal Transform Coding,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Tokyo, Japan, pp. 1713–1715, April 1986.

[7] R.J. McAulay and T.F. Quatieri, “Low-Rate Speech Coding Based on the Sinusoidal Speech Model,” chapter in Advances in Speech Signal Processing, S. Furui and M.M. Sondhi, eds., Marcel Dekker, 1992.

[8] R.J. McAulay and T.F. Quatieri, “Sinusoidal Coding,” chapter in Speech Coding and Synthesis, W.B. Kleijn and K.K. Paliwal, eds., Elsevier, 1995.

[9] M. Nishiguchi, J. Matsumoto, R. Wakatsuki, and S. Ono, “Vector Quantized MBE with Simplified V/UV Division at 3.0 kb/s,” in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, Minneapolis, MN, vol. 2, pp. 151–154, April 1993.

[10] D.B. Paul, “The Spectral Envelope Estimation Vocoder,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. ASSP–29, pp. 786–794, 1981.

[11] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, NJ, 1978.

[12] L.R. Rabiner, “On the Use of Autocorrelation Analysis for Pitch Detection,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–25, no. 1, pp. 24–33, 1977.

[13] M.M. Sondhi, “New Methods of Pitch Extraction,” IEEE Trans. Audio and Electroacoustics, vol. AU–16, no. 2, pp. 262–266, June 1968.

[14] R. Smits and B. Yegnanarayana, “Determination of Instants of Significant Excitation in Speech Using Group Delay Function,” IEEE Trans. Speech and Audio Processing, vol. 3, no. 5, pp. 325–333, Sept. 1995.
