We have seen throughout the text that different approaches to speech analysis/synthesis naturally lead to different methods of pitch and voicing estimation. For example, in homomorphic analysis, the location of quefrency peaks (or lack thereof) in the cepstrum provides a pitch and voicing estimate. Likewise, the distance between primary peaks in the linear prediction error yields an estimate of the pitch period. We also saw that the wavelet transform lends itself to pitch period estimation by way of the correlation of maxima in filter-bank outputs across different scales; this “parallel processing” approach is similar in style to that of an early successful pitch estimation method conceived by Gold and Rabiner that looks across a set of impulse trains generated by different peaks and valleys in the signal [1]. These latter methods, based on linear prediction, the wavelet transform, and temporal peaks and valleys, provide not only a pitch-period estimate, but also an estimate of the time of occurrence of the glottal pulse, an important parameter in its own right for a variety of applications. We can think of all the above methods as time-domain approaches to pitch, voicing, and glottal pulse time estimation. In this chapter, we take an alternative view of these estimation problems in the frequency domain, motivated by the sinewave representation of Chapter 9.¹
1 We cannot hope to cover the vast variety of all pitch, voicing, and glottal pulse time estimators in the time and frequency domain. By focusing on a few specific classes of estimators, however, we illustrate the goals and problems common to the many approaches.
In Chapter 9, it was shown that it is possible to generate synthetic speech of very high quality using an analysis/synthesis system based on a sinusoidal speech model which, except in updating the average pitch to adjust the width of the analysis window, made no explicit use of pitch and voicing. Pitch and voicing, however, played an important role in accomplishing sinewave-based modification in Chapter 9 and will also play an important role in reducing the bit rate in sinewave-based speech coding in Chapter 12, much as they do in the speech analysis/synthesis and coding based on linear prediction. In this chapter, we will see that the sinewave representation brings new insights to the problems of pitch estimation, voicing detection, and glottal pulse time estimation. Specifically, pitch estimation can be thought of as fitting a harmonic set of sinewaves to the measured set of sinewaves, and the accuracy of the harmonic fit is an indicator of the voicing state. It is the purpose of this chapter to explore this idea in detail. The result is a powerful pitch and voicing algorithm which has become a basic component in all of the applications of the sinewave system.
We begin with a simple pitch estimator based on the autocorrelation function, which becomes the basis for transitioning into the frequency domain through a sinewave representation. A variety of sinewave-based pitch estimators of increasing complexity and accuracy are then derived. We end this chapter with an application of the sinewave model to glottal pulse time estimation and, finally, a generalization of the sinewave model to multi-band pitch and voicing estimation [2].
Consider a discrete-time short-time sequence given by
sn[m] = s[m]w[n − m]
where w[n] is an analysis window of duration Nw. The short-time autocorrelation function rn[τ] is defined by

rn[τ] = Σm sn[m] sn[m + τ]
When s[m] is periodic with period P, rn[τ] contains peaks at or near the pitch period, P. For unvoiced speech, no clear peak occurs near an expected pitch period. Typical sequences rn[τ] for different window lengths were shown in Figure 5.6 of Chapter 5. We see that the location of a peak (or lack thereof) in the pitch period range provides a pitch estimate and voicing decision. This is similar to the strategy used in determining pitch and voicing from the cepstrum in homomorphic analysis.
It is interesting to observe that the above correlation pitch estimator can be obtained more formally by minimizing, over possible pitch periods (P > 0), the error criterion given by
E[P] = Σm (sn[m] − sn[m − P])²   (10.1)
Minimizing E[P] with respect to P yields
P̂ = arg maxP rn[P]   (10.2)
where P > ε, i.e., P is sufficiently far from zero (Exercise 10.1). This alternate view of autocorrelation pitch estimation is used in the following section.
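As a concrete illustration of the peak-picking strategy in Equation (10.2), the following is a minimal sketch (not from the text): it computes the short-time autocorrelation of a zero-mean frame and picks the largest peak over a plausible period range. The helper name, test signal, and lag range are our own illustrative choices.

```python
import numpy as np

def autocorr_pitch(s, pmin, pmax):
    """Hypothetical helper: estimate the pitch period (in samples) as the
    lag of the largest short-time autocorrelation peak in [pmin, pmax],
    per the maximization in Eq. (10.2)."""
    s = s - np.mean(s)
    r = np.correlate(s, s, mode="full")[len(s) - 1:]   # rn[tau] for tau >= 0
    r = r / r[0]                                       # normalize by rn[0]
    tau = pmin + int(np.argmax(r[pmin:pmax + 1]))      # peak lag = period
    return tau, r[tau]

# Periodic test signal: 100 Hz plus its second harmonic at fs = 8 kHz,
# so the true period is 80 samples
fs = 8000
n = np.arange(1024)
s = np.cos(2 * np.pi * 100 * n / fs) + 0.5 * np.cos(2 * np.pi * 200 * n / fs)
period, strength = autocorr_pitch(s, pmin=40, pmax=160)
print(period)      # -> 80
print(strength)    # a bit below 1 due to the window-induced taper
```

Note that the normalized peak height is below unity even for a perfectly periodic input, reflecting the linear envelope decay discussed next.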
The autocorrelation function is a measure of “self-similarity,” so we expect that it peaks near P for a periodic sequence. Partly due to the presence of the window, however, the peak at the pitch period does not always have the greatest amplitude. We saw in Chapter 5 that the envelope of the short-time autocorrelation function of a periodic waveform decreases roughly linearly with increasing P (Exercise 5.2). A longer window helps in assuring the peak at τ = P is largest, but a long window causes other problems when the sequence s[m] is not exactly periodic. Peaks in the autocorrelation of the vocal tract impulse response, as well as peaks at multiple pitch periods,² may become larger than the peak at τ = P due to time-variations in the vocal tract and pitch. Another problem arises as a result of the interaction between the pitch and the first formant. If the formant bandwidth is narrow relative to the harmonic spacing (so that the vocal tract impulse response decays very slowly within a pitch period), the correlation function may reflect the formant frequency rather than the underlying pitch. Nonlinear time-domain processing techniques using various types of waveform center-clipping algorithms have been developed to alleviate this problem [11],[12],[13].
2 Two common problems in pitch estimation are referred to as pitch-halving and pitch-doubling, whereby the pitch estimate is half or double the true pitch.
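As an illustration of the nonlinear preprocessing mentioned above, the sketch below implements one common form of center-clipper, with the threshold set to a fraction of the frame's peak magnitude. The threshold fraction is an assumed, illustrative setting, not a value prescribed by the cited references.

```python
import numpy as np

def center_clip(s, frac=0.3):
    """Center-clipper: zero samples whose magnitude is below a threshold
    cl (a fraction of the frame peak), and shift the remaining samples
    toward zero by cl, suppressing low-level formant ripple before
    autocorrelation analysis. frac = 0.3 is an assumed setting."""
    cl = frac * np.max(np.abs(s))
    out = np.zeros_like(s)
    out[s > cl] = s[s > cl] - cl
    out[s < -cl] = s[s < -cl] + cl
    return out

x = np.array([0.1, 0.9, -0.5, 0.2, -1.0])
print(center_clip(x))   # small samples are zeroed; large ones shrink by cl
```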
In the autocorrelation method of pitch estimation, Equation (10.2), the effective limits on the sum get smaller as the candidate pitch period P increases, i.e., as the windowed data segment shifts relative to itself, causing the autocorrelation to have the roughly linear envelope alluded to above. If we could extrapolate the periodic data segment so as to pretend that a longer periodic segment exists, we could avoid the effect of the window. Extrapolation has two advantages. First, we can use the full interval of duration Nw without assuming the data is zero outside the window duration and, second, we can make the interval long while maintaining stationarity (Figure 10.1). In the next section, we exploit this extrapolation concept together with the error criterion of Equation (10.1).
One can imagine the design of a pitch estimator in the frequency domain by running the waveform through a comb filter with peaks at multiples of a hypothesized fundamental frequency (pitch) and selecting the comb whose pitch matches the waveform harmonics and thus, hopefully, gives the largest output energy. In this section, we derive a type of comb filter for pitch estimation, not by straightforward filtering, but based on the extrapolation and error criterion concepts of the previous section.
One approach to extrapolating a segment of speech beyond its analysis window as in Figure 10.1 is through the sinewave model of Chapter 9. Suppose that, rather than using the short-time segment sn[m] (where the subscript n refers to the window location), we use its sinewave representation. Then we form a sequence expressed as
ŝ[m] = Σk=1..K Ak exp[j(ωk m + θk)]   (10.3)
where the sinewave amplitudes, frequencies, and phases Ak, ωk, and θk are obtained from the short-time segment³ sn[m], but where ŝ[m] is thought of as infinite in duration. The sinewave representation, therefore, can extrapolate the signal beyond the analysis window duration. In this chapter, as in Equation (10.3), it is particularly convenient to use only the complex sinewave representation; hence the real-part notation introduced in Chapter 9 has been omitted.
3 The sinewave amplitudes, frequencies, and phases Ak, ωk , and θk are fixed parameters from one analysis window on one frame and not the interpolation functions used for synthesis in Chapter 9. Recall that θk is a phase offset relative to the analysis frame center, as described in Section 9.3.
Consider now the error criterion in Equation (10.1) over an extrapolated interval of length N. Substituting Equation (10.3) into Equation (10.1), we have
E[P] = (1/N) Σm=−(N−1)/2..(N−1)/2 |ŝ[m] − ŝ[m − P]|²   (10.4)
where the extrapolation interval N is assumed odd and centered about the origin. Note that we could have begun with Equation (10.2), i.e., the autocorrelation perspective, rather than Equation (10.1). Observe also that, in Equation (10.4), the limits on the sum are [−(N − 1)/2, (N − 1)/2] for all P and that data truncation does not occur. Rearranging terms in Equation (10.4), we obtain a set of “diagonal” terms, 2 Σk Ak² [1 − cos(ωk P)], plus cross terms between sinewaves at different frequencies, each weighted by q(ωk − ωl), where

q(ω) = sin(Nω/2) / [N sin(ω/2)]
If we let the extrapolation interval N go to infinity, the function q(ω) approaches zero except at the origin so that the cross terms vanish;
then

E[P] = 2 Σk Ak² [1 − cos(ωk P)]
Let P = 2π/ωo, where ωo is the fundamental frequency. Then
E[P] = Σk=1..K Ak² [1 − cos(2π ωk/ωo)]   (10.5)
where for convenience we have deleted the scale factor of two in Equation (10.5). We now want to minimize E[P], as expressed in Equation (10.5), with respect to ωo. To give this minimization an intuitive meaning, we rewrite Equation (10.5) as
E[P] = Σk=1..K Ak² − Σk=1..K Ak² cos(2π ωk/ωo)   (10.6)
Minimization of E[P] is then equivalent to maximizing with respect to ωo the term
Q(ωo) = Σk=1..K Ak² cos(2π ωk/ωo)   (10.7)
We refer to Q(ωo) as a likelihood function because, as its value increases, so does the likelihood that the hypothesized fundamental frequency is the true value. One way to view Q(ωo) is to first replace ωk in Equation (10.7) by a continuous ω. For each ωo, the continuous function
F(ω) = cos(2πω/ωo)

is sampled at the ωk, each sample F(ωk) is weighted by Ak², and these weighted values are summed to form Equation (10.7). We can, loosely speaking, think of this as “comb-filtering” for each ωo, and we want the ωo whose comb filter has the maximum output. Figure 10.2a shows an example sampling of the function F(ω).
If the ωk’s are multiples of a fundamental frequency ωo*, and if ωo = ωo*, we have F(ωk) = 1 (as in Figure 10.2a) and E[P] = 0 and minimization is achieved. Specifically,

Q(ωo*) = Σk=1..K Ak²
and thus, from Equation (10.6), E[P] = 0. In Figure 10.2b we see that the estimator is insensitive to pitch doubling due to a cancellation effect. Notice, however, a disturbing feature of this pitch estimator. A fundamental frequency estimate of ωo*/2, i.e., half the true pitch, will also yield zero error. Thus, the solution is ambiguous. These properties are illustrated in the following example:
Example 10.1 Figure 10.3a,b,d shows the result of Equation (10.7) for a true pitch of 50, 100, and 200 Hz and where Ak = 1. A frequency increment of 0.5 Hz was used for the hypothesized pitch candidates (Exercise 10.12). One sees multiple peaks in the pitch likelihood function and thus the pitch estimate ambiguity. Nevertheless, the example also shows that the correct pitch is given by the last large peak, i.e., the peak of greatest frequency. There are no peaks beyond the true frequency because there is a cancellation effect for multiples of the true pitch (as in Figure 10.2b), and thus the likelihood function falls rapidly. Finally, Figure 10.3c shows the effect of white noise (added to the measured frequencies) on the likelihood function for a true pitch of 50 Hz. In this case, although the last large peak of Q(ωo) falls at roughly 50 Hz, there is little confidence in the estimate due to the multiple peaks and, in particular, a peak at about 25 Hz as large as the one at 50 Hz.
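To make the behavior in Example 10.1 concrete, the sketch below evaluates the likelihood of Equation (10.7), written with P = 2π/ωo so that Q is a function of the candidate fundamental in Hz. The harmonic set and unit amplitudes mirror the example's idealized conditions; the number of harmonics is an arbitrary choice.

```python
import numpy as np

# Idealized measured sinewaves as in Example 10.1: harmonics of a
# 100 Hz voice with unit amplitudes Ak = 1
f_true = 100.0
fk = f_true * np.arange(1, 21)      # 20 measured harmonic frequencies (Hz)
Ak = np.ones_like(fk)

def Q(f0):
    """Likelihood of Eq. (10.7) with P = 2*pi/w0 expressed in Hz:
    Q(f0) = sum_k Ak^2 cos(2*pi*fk/f0)."""
    return float(np.sum(Ak ** 2 * np.cos(2 * np.pi * fk / f0)))

print(Q(100.0))          # -> 20.0: all cosines equal 1 at the true pitch
print(Q(50.0))           # -> 20.0: half the true pitch fits equally well
print(round(Q(200.0)))   # -> 0: alternating signs cancel at double the pitch
```

The equal maxima at 100 Hz and 50 Hz exhibit the halving ambiguity, while the cancellation at 200 Hz shows why no peaks occur beyond the true pitch.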
This problem of pitch-halving (i.e., underestimating the pitch by a factor of two) is typical of many pitch estimators such as the homomorphic and autocorrelation pitch estimators. In fact, explicit comb filtering approaches have also been applied, but the tendency of these methods to suffer from pitch-halving has limited their use. Our goal is to design a pitch likelihood function that is characterized by a distinct peak at the true pitch. In the following Section 10.4, we propose an alternative sinewave-based pitch estimator which utilizes additional information. This new estimator is capable of resolving the above ambiguity.
The previous sinewave-based correlation pitch estimator was derived by minimizing the mean-squared error between an estimated sinewave model and itself shifted by P samples. In this section, we fit a sinewave model, with unknown amplitudes, phases, and harmonic frequencies, to a waveform measurement [5],[8]. Although the resulting pitch estimator is prone to a pitch-halving ambiguity, we show that, with a priori knowledge of the vocal tract spectral envelope, the pitch-halving problem is alleviated. This approach to pitch estimation leads naturally to a measure of the degree of voicing within a speech segment. Methods to evaluate the pitch estimator are also described.
Consider a sinusoidal waveform model with unknown amplitudes and phases, and with harmonic frequencies:

ŝ[n] = Σk=1..K(ωo) Bk exp[j(kωo n + φk)]
where ωo is an unknown fundamental frequency, where B and φ represent vectors of unknown amplitudes and phases, {Bk} and {φk}, respectively, and where K(ωo) is the number of harmonics in the speech bandwidth. (For clarity, we have changed notation from e^j(·) to exp[·].) A reasonable estimation criterion is to seek the minimum mean-squared error between the harmonic model and the measured speech waveform:
E(ωo, B, φ) = Σn=−(Nw−1)/2..(Nw−1)/2 |s[n] − ŝ[n]|²   (10.8)
where we assume that the analysis window duration Nw is odd and the window is centered about the time origin. Then we can show that
E(ωo, B, φ) = Es − 2Nw Σk=1..K(ωo) Bk Re{e^−jφk S(kωo)} + Nw Σk=1..K(ωo) Bk²   (10.9)
where S(ω) = (1/Nw) Σn s[n] e^−jωn represents one slice of the STFT of the speech waveform over the interval [−(Nw − 1)/2, (Nw − 1)/2] and
Es = Σn=−(Nw−1)/2..(Nw−1)/2 |s[n]|²   (10.10)

is the total energy of the measurement.
Minimizing Equation (10.9) with respect to the phases φk, we see that the phase estimates are
φ̂k = ∠S(kωo)   (10.11)
so that
E(ωo, B) = Es − 2Nw Σk=1..K(ωo) Bk |S(kωo)| + Nw Σk=1..K(ωo) Bk²   (10.12)
and minimizing Equation (10.12) with respect to Bk:
∂E/∂Bk = −2Nw |S(kωo)| + 2Nw Bk = 0   (10.13)
Bk = |S(kωo)|.
Therefore,
E(ωo) = Es − Nw Σk=1..K(ωo) |S(kωo)|²   (10.14)
The reader is asked to work through the algebraic steps of this derivation in Exercise 10.2. Thus, the optimal ωo is given by
ω̂o = arg maxωo>ε Σk=1..K(ωo) |S(kωo)|²   (10.15)
where we assume that ωo > ε (a small positive value close to zero) to avoid a bias toward a low fundamental frequency. As in the previous estimator, we see that this estimator acts like a comb filter, but here it samples the measured spectrum at the candidate harmonics. For a perfectly periodic waveform, ωo would be chosen so that the samples |S(kωo)| fall on the harmonic peaks of the spectrum. This criterion could lead, however, to a pitch-halving error similar to what we found in the sinewave-based correlation pitch estimator (Exercise 10.2).
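A minimal numerical sketch of this comb criterion: we form one STFT slice of a synthetic periodic waveform and sum |S(kωo)|² over the candidate harmonics. The frame length, bandwidth, and test frequencies are illustrative choices of our own; the sub-multiple candidate scores at least as well as the true pitch, exhibiting the halving error just described.

```python
import numpy as np

fs = 8000
frame = 400                                # 50 ms Hamming-windowed frame
n = np.arange(frame)
w = np.hamming(frame)
f_true = 200.0
# Synthetic periodic waveform: three harmonics of 200 Hz, windowed
s = sum(np.cos(2 * np.pi * f_true * k * n / fs) for k in range(1, 4)) * w

nfft = 8192
S = np.abs(np.fft.rfft(s, nfft)) / frame   # one slice of the STFT magnitude

def likelihood(f0, bandwidth=1000.0):
    """Comb criterion of Eq. (10.15): sum |S(k f0)|^2 over the
    candidate harmonics k f0 up to the given bandwidth."""
    ks = np.arange(1, int(bandwidth // f0) + 1)
    bins = np.round(ks * f0 * nfft / fs).astype(int)
    return float(np.sum(S[bins] ** 2))

# The sub-multiple scores at least as well as the true pitch, since its
# comb teeth include all of the true comb's teeth: the halving ambiguity
print(likelihood(200.0), likelihood(100.0))
```

A non-submultiple candidate such as 333 Hz, whose teeth miss the harmonics, scores far lower, which is why the criterion still behaves like a pitch estimator despite the ambiguity.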
Consider now giving ourselves some additional information; in particular, we assume a known vocal-tract spectral envelope,⁴ |H(ω)|. With this a priori information, we will show that the resulting error criterion can resolve the pitch-halving ambiguity [5],[8]. The following Section 10.4.2 derives the pitch estimator, while Section 10.4.4 gives a more intuitive view of the approach with respect to time-frequency resolution considerations.
4 We do not utilize the full capacity of the sinewave pitch estimator since we could have also supplied an a priori phase envelope, thus giving temporal alignment information (Exercise 10.2).
The goal again is to represent the speech waveform by another for which all of the frequencies are harmonic, but now let us assume a priori knowledge of the vocal tract spectral envelope, |H(ω)|. Under the assumption that the excitation amplitudes are unity, i.e., ak(t) = 1 in Equation (9.5), |H(ω)| also provides an envelope for the sinewave amplitudes Ak. The harmonic sinewave model then becomes
ŝ[n] = Σk=1..K(ωo) H̄(kωo) exp[j(kωo n + φk)]   (10.16)
where ωo is the fundamental frequency (pitch), K(ωo) is the number of harmonics in the speech bandwidth, H̄(ω) = |H(ω)| is the vocal tract envelope, and φ = {φk} represents the phases of the harmonics. We would like to estimate the pitch ωo and the phases φ so that ŝ[n] is as close as possible to the speech measurement s[n], according to some meaningful criterion. While a number of methods can be used for estimating the envelope H̄(ω), for example, linear prediction or homomorphic estimation techniques, it is desirable to use a method that yields an envelope that passes through the measured sinewave amplitudes. Such a technique has been developed in the Spectral Envelope Estimation Vocoder (SEEVOC) [10]. This estimator will also be used later in this chapter as the basis of the minimum-phase analysis for estimating the source excitation onset times and so we postpone its description (Section 10.5.3).⁵
5 We will see that, in the application of source onset time estimation, it is appropriate to linearly interpolate between the successive sinewave amplitudes. In the application to mean-squared-error pitch estimation in this section, however, the main purpose of the envelope is to eliminate pitch ambiguities. Since the linearly interpolated envelope could affect the fine structure of the mean-squared-error criterion through its interaction with the measured peaks in the correlation operation (to follow later in this section), better performance is obtained by using piecewise-constant interpolation between the SEEVOC peaks.
A reasonable estimation criterion is to seek the minimum of the mean-squared error (MSE),
E(ωo, φ) = (1/Nw) Σn=−(Nw−1)/2..(Nw−1)/2 |s[n] − ŝ[n]|²   (10.17)
over ωo and . The MSE in Equation (10.17) can be expanded as
E(ωo, φ) = (1/Nw) Σn |s[n]|² − (2/Nw) Re{Σn s[n] ŝ*[n]} + (1/Nw) Σn |ŝ[n]|²   (10.18)
Observe that the first term of Equation (10.18) is the average energy (power) in the measured signal. We denote this average by Ps [an averaged version of the previously defined total energy in Equation (10.10)], i.e.,
Ps = (1/Nw) Σn=−(Nw−1)/2..(Nw−1)/2 |s[n]|²   (10.19)
Substituting Equation (10.16) in the second term of Equation (10.18) leads to the relation
(1/Nw) Σn s[n] ŝ*[n] = Σk=1..K(ωo) H̄(kωo) e^−jφk [(1/Nw) Σn s[n] e^−jkωo·n]   (10.20)
Finally, substituting Equation (10.16) in the third term of Equation (10.18) leads to the relation

(1/Nw) Σn |ŝ[n]|² ≈ Σk=1..K(ωo) H̄²(kωo)
where the approximation is valid provided the analysis window duration satisfies the condition Nw ≫ 2π/ωo (i.e., several pitch periods fall under the window), which is more or less assured by making the analysis window 2.5 times the average pitch period. Letting
S(ω) = (1/Nw) Σn=−(Nw−1)/2..(Nw−1)/2 s[n] e^−jωn   (10.21)
denote one slice of the short-time Fourier transform (STFT) of the speech signal and using this in Equation (10.20), the MSE in Equation (10.18) becomes (Exercise 10.2)
E(ωo, φ) = Ps − 2 Σk=1..K(ωo) H̄(kωo) Re{e^−jφk S(kωo)} + Σk=1..K(ωo) H̄²(kωo)   (10.22)
Since the phase parameters affect only the second term in Equation (10.22), the MSE is minimized by choosing the phase estimates

φ̂k = ∠S(kωo)
and the resulting MSE is given by
E(ωo) = Ps − 2 Σk=1..K(ωo) H̄(kωo) |S(kωo)| + Σk=1..K(ωo) H̄²(kωo)   (10.23)
where the second term is reminiscent of a correlation function in the frequency domain. The unknown pitch affects only the second and third terms in Equation (10.23), and these can be combined by defining
ρ(ωo) = Σk=1..K(ωo) H̄(kωo) [ |S(kωo)| − (1/2) H̄(kωo) ]   (10.24)
The smooth weighting function biases the method away from excessively low pitch estimates. The MSE can then be expressed as
E(ωo) = Ps − 2 ρ(ωo)   (10.25)
Because the first term is a known constant, the minimum mean-squared error is obtained by maximizing ρ(ωo) over ωo.
It is useful to manipulate this metric further by making explicit use of the sinusoidal representation of the input speech waveform. Assume, as in Section 10.3, that a frame of the input speech waveform has been analyzed in terms of its sinusoidal components using the analysis system described in Chapter 9.⁶ The measured speech data s[n] is, therefore, represented as
6 This mean-squared-error pitch extractor is predicated on the assumption that the input speech waveform has been represented in terms of the sinusoidal model. This implicitly assumes that the analysis has been performed using a Hamming window approximately two and one-half times the average pitch period. It seems, therefore, that the pitch must be known in order to estimate the average pitch that is needed to estimate the pitch. This circular dilemma can be broken by using some other method to estimate the average pitch based on a fixed window. Since only an average pitch value is needed, the estimation technique does not have to be accurate on every frame; hence, any of the well-known techniques can be used.
s[n] = Σl=1..K Al exp[j(ωl n + θl)]   (10.26)
where {Al, ωl, θl} represents the amplitudes, frequencies, and phases of the K measured sinewaves. The sinewave representation allows us to extrapolate the speech measurement beyond the analysis window duration Nw to a larger interval N, as we described earlier. With a sinewave representation, it is straightforward to show that the signal power is given by the approximation

Ps ≈ Σl=1..K Al²
and substituting the sinewave representation in Equation (10.26) in the short-time Fourier transform defined in Equation (10.21) leads to the expression

S(ω) ≈ Σl=1..K Al e^jθl sinc(ω − ωl)
where

sinc(x) = sin(Nx/2) / [N sin(x/2)]

with N the (extrapolated) interval length.
Because the sinewaves are well-resolved, the magnitude of the STFT can then be approximated by

|S(ω)| ≈ Σl=1..K Al D(ω − ωl)
where D(x) = |sinc (x)|. The MSE criterion then becomes
ρ(ωo) = Σl=1..K Σk=1..K(ωo) Al H̄(kωo) D(ωl − kωo) − (1/2) Σk=1..K(ωo) H̄²(kωo)   (10.27)
where ωl are the frequencies of the sinewave representation in Equation (10.26) and ωo is the candidate pitch.
To gain some insight into the meaning of this criterion, suppose that the input speech is periodic with pitch frequency ω*. Then (barring measurement error) ωl = lω*, Al = H̄(lω*), and

ρ(ω*) = (1/2) Σl=1..K H̄²(lω*)
When ωo corresponds to sub-multiples of the pitch, the first term in Equation (10.27) remains unchanged, since D(ωl − kωo) = 0 at the comb teeth falling between the measured harmonics, but the second term, because it is an envelope and always non-zero, will increase at the submultiples of ω*. As a consequence,

ρ(ω*) > ρ(ω*/m),  m = 2, 3, …
which shows that the MSE criterion leads to unambiguous pitch estimates. To see this property more clearly, consider the example illustrated in Figure 10.4. In this example, the true pitch is
200 Hz and the sinewave envelope is constant at unity. One can then simplify Equation (10.27) as
ρ(ωo) = Σl=1..K Σk=1..K(ωo) D(ωl − kωo) − (1/2) K(ωo)   (10.28)
where K(ωo) is the number of harmonics over a fixed bandwidth of 800 Hz. The first term in Equation (10.28) corresponds to laying down a set of “comb filters” D(ω − kωo) (yet again different in nature from those previously described) spaced by the candidate fundamental frequency ωo. The comb is then sampled at the measured frequencies ωl = lω* and the samples are summed. Finally, the resulting value is reduced by half the number of harmonics over the fixed band.
For the candidate (trial) of ƒ0 = 200 Hz, ρ(ƒ0) = 2, as illustrated in Figure 10.4b. For the candidate ƒ0 = 100 Hz (pitch-halving), the first term is the same (as for ƒ0 = 200 Hz), but the second term becomes more negative, so that ρ(ƒ0) = 0. The remaining two cases in Figure 10.4b are straightforward to evaluate. This argument with constant H̄(ω) holds more generally since we can write Equation (10.27) as
ρ(ωo) = Σk=1..K(ωo) H̄(kωo) [Σl=1..K Al D(ωl − kωo)] − (1/2) Σk=1..K(ωo) H̄²(kωo)   (10.29)
where the first term is a correlation-like term [similar in style to the frequency-domain correlation-based pitch estimators, i.e., “comb filters,” in Equations (10.7) and (10.15)] and the second term is the generalized negative compensation for low-frequency fundamental candidates. Possibly the most significant attribute of the sinewave-based pitch extractor is that the usual problems with pitch-halving and pitch-doubling do not occur with the new error criterion (Exercise 10.2). This pitch estimator has been further refined to improve its resolution, resolve problems with formant-pitch interaction (as alluded to in the context of the autocorrelation pitch estimator), and improve robustness in additive noise by exploiting the auditory masking principle that small tones are masked by neighboring high tones [8]. (This auditory masking principle is described in Chapter 13 in the context of speech enhancement.) The following example compares the sinewave-based pitch estimator for voiced and unvoiced speech:
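The idealized case of Figure 10.4 can be checked numerically. The sketch below evaluates a two-term likelihood of the style of Equations (10.28) and (10.29) with a flat envelope and a narrow sinc-like comb-tooth kernel; the pseudo-window length, sampling rate, and band edge are assumed values of our own.

```python
import numpy as np

fs = 8000.0
N = 2048                              # assumed pseudo-window length

def D(f):
    """Narrow comb-tooth kernel D = |sinc|, main-lobe width set by N;
    D(0) = 1 at an exact frequency match."""
    return np.abs(np.sinc(f * N / fs))

# Measured sinewaves: harmonics of a 200 Hz voice over an 800 Hz band,
# flat unit envelope (the idealized case of Figure 10.4)
f_true = 200.0
fl = f_true * np.arange(1, 5)         # measured frequencies (Hz)
Al = np.ones_like(fl)                 # measured amplitudes

def Hbar(f):
    """Assumed piecewise-constant spectral envelope (flat here)."""
    return 1.0

def rho(f0, band=800.0):
    """Two-term likelihood: comb correlation against the measured
    sinewaves minus half the envelope energy at the candidate
    harmonics (the anti-halving compensation)."""
    fk = f0 * np.arange(1, int(band // f0) + 1)     # candidate harmonics
    corr = sum(Al[l] * Hbar(fk[k]) * D(fl[l] - fk[k])
               for l in range(len(fl)) for k in range(len(fk)))
    comp = 0.5 * sum(Hbar(f) ** 2 for f in fk)
    return corr - comp

print(rho(200.0))   # ~2: four matched comb teeth minus 4/2
print(rho(100.0))   # ~0: the same four matches, but twice the compensation
```

The compensation term grows with the number of candidate harmonics, so halving the candidate pitch doubles the penalty without adding any correlation, exactly the disambiguating behavior described above.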
Example 10.2 In one implementation of the sinewave-based MSE pitch extractor, the speech is sampled at 10 kHz and analyzed using a 1024-point FFT. The sinewave amplitudes and frequencies are determined over a 1000-Hz bandwidth. In Figure 10.5(b), the measured amplitudes and frequencies are shown along with the piecewise-constant SEEVOC envelope for a voiced speech segment. Figure 10.5(c) is a plot of the first term in Equation (10.29) over a candidate pitch range from 38 Hz to 400 Hz and the inherent ambiguity of the correlator (comb-filter) is apparent. It should be noted that the peak at the correct pitch is largest, but during steady vowels the ambiguous behavior illustrated in the figure commonly occurs. Figure 10.5(d) is a plot of the complete likelihood function Equation (10.29) derived from the above MSE criterion and the manner in which the ambiguities are eliminated is clearly demonstrated. Figure 10.6 illustrates typical results for a segment of unvoiced fricative speech for which there is no distinct peak in the likelihood function.
In the context of the sinusoidal model, the degree to which a given frame of speech is voiced is determined by the degree to which the harmonic model fits the original sinewave data [5],[6]. The previous example indicated that the likelihood function is useful as a means to determine this degree of voicing. The accuracy of the harmonic fit can be related, in turn, to the signal-to-noise ratio (SNR) defined by

SNR = Ps / E(ω̂o)
where E(ω̂o) is the mean-squared error between the measurement and the sinewave harmonic model ŝ[n] at the selected pitch ω̂o. From Equation (10.25), it follows that

SNR = Ps / [Ps − 2 ρ(ω̂o)]
where the input power Ps can be computed from the sinewave amplitudes. If the SNR is large, then the MSE is small and the harmonic fit is very good, which indicates that the input speech is probably voiced. For small SNR, on the other hand, the MSE is large and the harmonic fit is poor, which indicates that the input speech is more likely to be unvoiced. Therefore, the degree of voicing is functionally dependent on the SNR. Although the determination of the exact functional form is difficult, a rule that has proven useful in several speech applications is the following (Figure 10.7):
where Pυ represents the probability that speech is voiced, and the SNR is expressed in dB. It is this quantity that was used to control the voicing-adaptive sinewave-based modification schemes in Chapter 9 and the voicing-adaptive frequency cutoff for the phase model to be used later in this chapter and in Chapter 12 for sinewave-based speech coding.
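Although the exact functional form of the SNR-to-voicing rule is not reproduced here, a piecewise-linear mapping of the kind described can be sketched as follows. The breakpoints lo and hi are assumed for illustration only and are not the book's calibrated values.

```python
import numpy as np

def voicing_probability(snr_db, lo=0.0, hi=20.0):
    """Map the harmonic-fit SNR (in dB) to a degree of voicing Pv in
    [0, 1]. The breakpoints lo/hi are illustrative assumptions: below
    lo the frame is treated as unvoiced, above hi as fully voiced,
    with a linear transition in between."""
    return float(np.clip((snr_db - lo) / (hi - lo), 0.0, 1.0))

print(voicing_probability(-5.0))   # -> 0.0 (poor harmonic fit: unvoiced)
print(voicing_probability(10.0))   # -> 0.5 (mixed excitation)
print(voicing_probability(30.0))   # -> 1.0 (good harmonic fit: voiced)
```

A soft (rather than binary) voicing measure of this kind is what allows the voicing-adaptive modification and coding schemes mentioned below to degrade gracefully in mixed voiced/unvoiced regions.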
The next two examples illustrate the performance of the sinewave-based pitch estimator of Section 10.4.2 and the above voicing estimator for normal and “anomalous” voice types.
Example 10.3 Figure 10.8 illustrates the sinewave pitch and voicing estimates for the utterance, “Which tea party did Baker go to?” from a female speaker. Although the utterance is a question, and thus, from our discussion in Chapter 3, we might expect a rising pitch at the termination of the passage, we see a rapid falling of pitch in the final word because the speaker is relaxing her vocal cords.
Example 10.4 Figure 10.9 illustrates the sinewave pitch and voicing estimates for the utterance, “Jazz hour” from a very low-pitched male speaker. We see that the diplophonia in the utterance causes a sudden doubling of pitch during the second word “hour” where secondary pulses are large within a glottal cycle. Secondary pulses can result in an amplitude dip on every other harmonic (Exercise 3.3) and thus a doubling of the pitch estimate. We also see that diplophonia causes a severe raggedness in the voicing probability measure. The reader is asked to consider these contours further in Exercise 10.13.
We now summarize the sinewave pitch estimator strategy from a time-frequency resolution perspective. In Equation (10.17), we are attempting to fit the measurement s[n] over an Nw-sample analysis window by a sum of sinewaves that are harmonically related, i.e., by a model signal that is perfectly periodic. We have the spectral envelope of the sinewave amplitudes and thus the sinewave amplitudes for any fundamental frequency.
When we work through the algebra, we obtain the error function Equation (10.23) to be minimized over the unknown pitch ωo and its equivalent form Equation (10.24) to be maximized. In this analysis, we have not yet made any assumptions about the measurement s[n]. We might simply stop at this point and perform the maximization to obtain a pitch estimate. For some conditions, this will give a reasonable pitch estimate and, in fact, the error function in Equation (10.24) benefits from the presence of the negative term on the right side of the equation in the sense that it helps to avoid pitch halving. But there is a problem with this approach.
Observe that Equation (10.24) contains the short-time Fourier transform magnitude of the measurement and recall that we make the analysis window as short as possible to obtain a time resolution as good as possible. Because the window is short, the main lobe of the window’s Fourier transform is wide. For a low-pitch candidate ωo, then, the first term in the expression for ρ(ωo) in Equation (10.24) will be sampled many times within the main lobes of the window Fourier transforms embedded inside |S(kωo)|. This may unreasonably bias the pitch to low estimates, even with the negative term on the right. Therefore, we are motivated to make the analysis window very long to narrow the window’s main lobe. This reduces time resolution, however, and can badly blur time variations and voiced/unvoiced transitions.
Our goal, then, is a representation of the measurement s[n] that is derived from a short window but has the advantage of being able to represent a long interval in Equation (10.24). Recall that we were faced with this problem with the earlier autocorrelation pitch estimator of Equation (10.1). In that case, we used a sinewave representation of the measurement s[n] to allow extrapolation beyond the analysis window interval. We can invoke the same procedure in this case. We modify the original error criterion in Equation (10.17) so that s[n] is a sinewave representation derived from the amplitudes, frequencies, and phases at the STFT magnitude peaks. This leads to Equation (10.27) in which the main lobe of the function D(ω) = |sinc(ω)| is controlled by the (pseudo) window length N. That is, we can make N long (and longer than the original window length Nw) to make the main lobe of D(ω) narrow, thus avoiding the above problem of biasing the estimate to low pitch. Observe that we are using two sinewave representations: a harmonic sinewave model that we fit to the measurement, and a sinewave model of the measurement that is not necessarily harmonic since it is derived from peak-picking the STFT magnitude.
Validating the performance of a pitch extractor can be a time-consuming and laborious procedure since it requires a comparison with hand-labeled data. An alternative approach is to reconstruct the speech using the harmonic sinewave model in Equation (10.16) and to listen for pitch errors. The procedure is not quite so straightforward as Equation (10.16) indicates, however, because, during unvoiced speech, meaningless pitch estimates are made that can lead to perceptual artifacts whenever the pitch estimate is greater than about 150 Hz. This is due to the fact that, in these cases, there are too few sinewaves to adequately synthesize a noiselike waveform. This problem has been eliminated by defaulting to a fixed low pitch (≈ 100 Hz) during unvoiced speech whenever the pitch exceeds 100 Hz. The exact procedure for doing this is to first define a voicing-dependent cutoff frequency, ωc (as we did in Chapter 9):
(10.30)
which is constrained to be no smaller than 2π (1500 Hz/ƒs), where ƒs is the continuous-to-discrete time sampling frequency. If the pitch estimate is ωo, then the sinewave frequencies used in the reconstruction are
ωk = kωo,  for kωo ≤ ωc(Pυ)
ωk = k*ωo + (k − k*)ωu,  for kωo > ωc(Pυ)   (10.31)
where k* is the largest value of k for which k*ωo ≤ ωc(Pυ), and where ωu, the unvoiced pitch, corresponds to 100 Hz (i.e., ωu = 2π(100 Hz/ƒs)). If ωo < ωu, then we set ωk = kωo for all k. The harmonic reconstruction then becomes
ŝ[n] = Σk H̄(ωk) exp[j(ωk n + φk)]   (10.32)
where φk is the phase obtained by sampling a piecewise-constant phase function derived from the measured STFT phase using the same strategy used to generate the SEEVOC envelope (Section 10.5.3). Strictly speaking, this procedure is harmonic only during strongly voiced speech because, if the speech is a voiced/unvoiced mixture, the frequencies above the cutoff, although equally spaced by ωu, are aharmonic, since they are themselves not multiples of the fundamental pitch.
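The frequency-selection rule just described can be sketched as follows. The cutoff rule fc = Pv·fmax (floored at 1500 Hz) is an assumed stand-in for the voicing-dependent cutoff of Equation (10.30), whose exact form is not reproduced here; the harmonic region uses multiples of the pitch and the region above the cutoff uses the 100-Hz "unvoiced pitch" spacing.

```python
import numpy as np

def synthesis_frequencies(f0, Pv, fmax=4000.0, fu=100.0, fc_floor=1500.0):
    """Voicing-adaptive sinewave frequencies for harmonic reconstruction:
    harmonics of f0 up to a voicing-dependent cutoff fc, then frequencies
    spaced by the 'unvoiced pitch' fu above it. The rule fc = Pv*fmax
    (floored at fc_floor) is an assumed stand-in for Eq. (10.30)."""
    fc = max(Pv * fmax, fc_floor)
    freqs = []
    k = 1
    while k * f0 <= fc:                  # harmonic region: multiples of f0
        freqs.append(k * f0)
        k += 1
    f = freqs[-1] if freqs else 0.0
    while f + fu <= fmax:                # above the cutoff: spaced by fu
        f += fu
        freqs.append(f)
    return np.array(freqs)

fr = synthesis_frequencies(f0=220.0, Pv=0.5)    # assumed cutoff fc = 2000 Hz
print(fr[:3])       # harmonics: 220, 440, 660, ...
print(fr[8:11])     # 1980 (last harmonic), then 2080, 2180, ...
```

Note that the frequencies above the cutoff are equally spaced but, in general, are not multiples of f0, which is the aharmonicity discussed above.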
The synthetic speech produced by this model is of very high quality, almost perceptually equivalent to the original. Not only does this validate the performance of the MSE pitch extractor, but it also shows that if the amplitudes and phases of the harmonic representation could be efficiently coded, then only the pitch and voicing would be needed to code the information in the sinewave frequencies. Moreover, when the measured phases are replaced by those defined by a voicing-adaptive phase model to be derived in the following section, the synthetic speech is also of good quality, although not equivalent to that obtained using the phase samples of the piecewise-flat phase envelope derived from the STFT phase measurements. However, it provides an excellent basis from which to derive a low bit-rate speech coder as described in Chapter 12.
In the previous section, the sinewave model was used as the basis for pitch estimation, which also led naturally to a harmonic sinewave model and associated harmonic analysis/synthesis system of high quality. This harmonic representation is important not only for the testing of the pitch estimation algorithm, but also, as stated, for the development of a complete parameterization for speech coding; an efficient parametric representation must be developed to reduce the size of the parameter set. That is, the raw sinewave amplitudes, phases, and frequencies cannot all be coded efficiently without some sort of parameterization, as we will discuss further in Chapter 12. The next step toward this goal is to develop a model for the sinewave phases by explicitly identifying the source linear phase component, i.e., the glottal pulse onset time, and the vocal tract phase component. In Chapter 9, in the application of time-scale modification, we side-stepped the absolute onset estimation problem with the use of a relative onset time derived by accumulating pitch periods. In this chapter, we propose one approach to absolute onset estimation [6],[7],[8]. Other methods include inverse filtering-based approaches described in Chapter 5 and a phase derivative-based approach [14].
We first review the excitation onset time concept that was introduced in Chapter 9. For a periodic voiced speech waveform with period P, in discrete time the excitation is given by a train of periodically spaced impulses, i.e.,
e[n] = Σr δ[n − no − rP]
where no is a displacement of the pitch pulse train from the origin. The sequence of excitation pitch pulses can also be written in terms of complex sinewaves as
e[n] = Σk ak exp[jωk(n − no)]        (10.33)
where ωk = (2π/P)k. We assume that the excitation amplitudes ak are unity, implying that the measured sinewave amplitudes at spectral peaks are due solely to the vocal tract system function, the glottal flow function being embedded within the vocal tract system function. Generally, the sinewave excitation frequencies ωk are assumed to be aharmonic but constant over an analysis frame. The parameter no corresponds to the time of occurrence of the pitch pulse nearest the center of the current analysis frame. The occurrence of this temporal event, called the onset time and introduced in Chapter 9, ensures that the underlying excitation sinewaves will be “in phase” at the time of the pitch pulse.
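The equivalence of the impulse train and the sum of unit-amplitude complex sinewaves in Equation (10.33) is easy to verify numerically. The following fragment (an illustration, with period and onset values chosen arbitrarily) shows the sinewaves summing coherently, i.e., being "in phase," exactly at the pitch pulses n = no + rP.

```python
import numpy as np

P = 8     # pitch period in samples (illustrative value)
n0 = 3    # onset time: displacement of the pulse train from the origin
n = np.arange(32)

# Sum of P unit-amplitude complex sinewaves at harmonics w_k = 2*pi*k/P,
# each in phase at n = n0 (Equation (10.33) with a_k = 1), normalized by P.
e = sum(np.exp(1j * (2 * np.pi * k / P) * (n - n0)) for k in range(P)) / P

# The result is (to numerical precision) a train of unit impulses at n = n0 + r*P.
pulses = np.real(np.round(e, 10))
print(np.nonzero(pulses)[0])   # -> [ 3 11 19 27]
```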
The amplitude and phase of the excitation sinewaves are altered by the vocal tract system function. Letting H(ω) denote the composite system function, the speech signal at its output becomes
s[n] = Σk H(ωk) exp[jωk(n − no)]        (10.34)
which we write in terms of system and excitation components as
s[n] = Σk M(ωk) exp{j[φk[n] + Φ(ωk)]}        (10.35)
where φk[n] = ωk(n − no) is the excitation phase, and M(ω) and Φ(ω) denote the magnitude and phase of the system function H(ω).
The time dependence of the system function, which was included in Chapter 9, is omitted here under the assumption that the vocal tract (and glottal flow function) is stationary over the duration of each synthesis frame. The excitation phase φk[n] is linear with respect to frequency and time. As in Chapter 9, we assume that the system function has no linear phase, so that all linear phase in s[n] is due to the excitation.
Let us now write the composite phase in Equation (10.35) as
θk[n] = Φ(ωk) + φk[n].
At time n = 0 (i.e., the analysis and synthesis frame center), then
θk[0] = Φ(ωk) + φk[0]
= Φ(ωk) − noωk.
The excitation and system phase components of the composite phase are illustrated in Figure 10.10 for continuous frequency ω. The phase value θk[0] is obtained from the STFT at frequencies ωk at the center of an analysis frame, as described in Chapter 9. Likewise, the value M(ωk) is the corresponding measured sinewave amplitude. Toward the development of a sinewave analysis/synthesis system based on the above phase decomposition, a method for estimating the onset time will now be described, given the measured sinewave phase θk[0] and amplitude M (ωk) values [6],[7].
From Chapter 9, the speech waveform can also be represented in terms of measured sinewave amplitude, frequency, and phase parameters as
s[n] = Σk M(ωk) exp[j(nωk + θk)]        (10.36)
where θk denotes the measured phase θk[0] at the analysis frame center and where sinewave parameters are assumed to be fixed over the analysis frame. We now have two sinewave representations: s[n] in Equation (10.36), obtained from parameter measurements, and the sinewave model in Equation (10.35) with a linear excitation phase in terms of an unknown onset time, which we denote by ŝ[n; no]. In order to determine the excitation phase, the onset time parameter no is estimated by choosing the value of no so that ŝ[n; no] is as close as possible to s[n], according to some meaningful criterion [6],[7],[8]. A reasonable criterion is to seek the minimum of the mean-squared error (MSE) over a time interval N (generally greater than the original analysis window duration Nw), i.e.,
E(no) = (1/N) Σn |s[n] − ŝ[n; no]|²        (10.37)
minimized over no, where we have added the argument no in ŝ[n; no] to denote that this sequence is a function of the unknown onset time.
The MSE in Equation (10.37) can be expanded as
E(no) = (1/N) Σn |s[n]|² − (2/N) Re{Σn s[n]ŝ*[n; no]} + (1/N) Σn |ŝ[n; no]|²        (10.38)
If the sinusoidal representation for s[n] in Equation (10.36) is used in the first term of Equation (10.38), then, as before, the power in the measured signal can be defined as
(1/N) Σn |s[n]|² ≈ Σk M²(ωk).
Letting the system transfer function be written in terms of its magnitude M (ω) and phase Φ (ω), namely
H(ω) = M(ω) exp[jΦ(ω)]
and using this as in Equation (10.35), the second term of Equation (10.38) can be written as
(2/N) Re{Σn s[n]ŝ*[n; no]} ≈ 2 Σk M²(ωk) cos[θk + noωk − Φ(ωk)].
Finally, the third term in Equation (10.38) can be written as
(1/N) Σn |ŝ[n; no]|² ≈ Σk M²(ωk).
These relations are valid provided the sinewaves are well-resolved, a condition that is basically assured by making the interval N two and one-half times the average pitch period, which was the condition assumed in the estimation of the sinewave parameters in the first place. [Indeed, we can make N as long as we like because we are working with the sinewave representation in Equation (10.36).] Combining the above manipulations leads to the following expression for the MSE:
E(no) = 2 Σk M²(ωk){1 − cos[θk + noωk − Φ(ωk)]}        (10.39)
Equation (10.39) was derived under the assumption that the system amplitude M(ω) and phase Φ(ω) were known. In order to obtain the optimal onset time, therefore, the amplitude and phase of the system function must next be estimated from the data. We choose here the SEEVOC amplitude estimate M̂(ω) to be described in Section 10.5.3. If the system function is assumed to be minimum phase, then, as we have seen in Chapter 6, we can obtain the corresponding phase function by applying a right-sided lifter l[n] to the real cepstrum c[n] associated with M̂(ω), i.e.,
c[n] = (1/2π) ∫−π..π log M̂(ω) exp(jωn) dω
from which the system minimum-phase estimate follows as
Φ̂(ω) = Im{Σn l[n]c[n] exp(−jωn)}        (10.40)
where l[0] = 1, l[n] = 0 for n < 0, and l[n] = 2 for n > 0. Use of Equation (10.40) is incomplete, however, because the minimum-phase analysis fails to account for the sign of the input speech waveform. This is due to the fact that the sinewave amplitudes from which the system amplitude and phase are derived are the same for −s[n] and s[n]. This ambiguity can be accounted for by generalizing the system phase in Equation (10.40) by
Φ̂(ω; β) = Φ̂(ω) + βπ,  β ∈ {0, 1}        (10.41)
and then choosing no and β to minimize the MSE simultaneously. Substituting Equation (10.40) and Equation (10.41) into Equation (10.39) leads to the following equation for the MSE:
E(no, β) = 2 Σk M²(ωk) − 2 Σk M²(ωk) cos[θk + noωk − Φ̂(ωk) − βπ]
Since only the second term depends on the phase model, it suffices to choose no and β to maximize the “likelihood” function
ρ(no, β) = Σk M²(ωk) cos[θk + noωk − Φ̂(ωk) − βπ]
However, since
ρ(no, β = 1) = −ρ(no, β = 0)
it suffices to maximize |ρ(no)| where now
ρ(no) = Σk M²(ωk) cos[θk + noωk − Φ̂(ωk)]        (10.42)
and if n̂o is the maximizing value, then we choose β = 0 if ρ(n̂o) > 0, and β = 1 otherwise.
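The two steps above — obtaining a minimum-phase system estimate from a log-magnitude envelope via the right-sided cepstral lifter, and then scanning candidate onset times for the maximum of |ρ(no)| — can be sketched as follows. This is a minimal illustration; the function names, the FFT length, and the candidate grid are our own choices, and the spectral envelope is assumed to be supplied already sampled on the FFT grid.

```python
import numpy as np

def minimum_phase(log_env, nfft=1024):
    """System phase from a log-magnitude envelope sampled on nfft bins,
    via the right-sided cepstral lifter of Equation (10.40)."""
    c = np.real(np.fft.ifft(log_env))     # real cepstrum of the envelope
    l = np.zeros(nfft)
    l[0] = 1.0
    l[1:nfft // 2] = 2.0                  # right-sided lifter: l[n] = 2 for n > 0
    return np.imag(np.fft.fft(l * c))     # minimum-phase function on the same grid

def estimate_onset(amps, freqs, phases, sys_phase, candidates):
    """Scan candidate onset times n_o, maximizing |rho(n_o)| (Equation (10.42)).
    amps/freqs/phases: measured sinewave parameters; sys_phase: system phase
    estimate sampled at the sinewave frequencies."""
    best = None
    for n0 in candidates:
        rho = np.sum(amps ** 2 * np.cos(phases + n0 * freqs - sys_phase))
        if best is None or abs(rho) > abs(best[1]):
            best = (n0, rho)
    n0_hat, rho_hat = best
    beta = 0 if rho_hat > 0 else 1        # resolves the sign ambiguity of the waveform
    return n0_hat, beta
```

As a check, synthesizing phases θk = Φ(ωk) − noωk from a known onset and scanning over candidates recovers that onset exactly, with β flipping when the waveform sign is negated.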
Example 10.5 In this experiment, the sinewave analysis uses a 10-kHz sampling rate and a 1024-point FFT to compute the STFT [7]. The magnitude of the STFT is computed and the underlying sinewaves are identified by determining the frequencies at which a change in the slope of the STFT magnitude occurs. The SEEVOC algorithm was used to estimate the piecewise-linear envelope of the sinewave amplitudes. A typical set of results is shown in Figure 10.11. The logarithm of the STFT magnitude is shown in Figure 10.11d together with the SEEVOC envelope. The cepstral coefficients are used in Equation (10.40) to compute the system phase, from which the onset likelihood function is computed using Equation (10.42). The result is shown in Figure 10.11b. It is interesting to note that the peaks in the onset likelihood function occur at points that seem to correspond to a sharp transition in the speech waveform, probably near the glottal pulse, rather than at a peak in the waveform. We also see in Figure 10.11c that the phase residual, the difference between the measured phase and our modeled phase interpolated across sinewave frequencies, is close to zero below about 3500 Hz, indicating the accuracy of our phase model in this region.
An example of the application of the onset estimator for an unvoiced fricative case is shown in Figure 10.12 [7]. The estimate of the onset time is meaningless in this case, and this is reflected in the relatively low value of the likelihood function. The onset time, however, can have meaning and importance for non-fricative consonants such as plosives where the phase and event timing is important for maintaining perceptual quality.
The above results show that if the envelope of the sinewave amplitudes is known, then the MSE criterion can lead to a technique for estimating the glottal pulse onset time under the assumption that the glottal pulse and vocal tract response are minimum-phase. This latter assumption was necessary to derive an estimate of the system phase, hence the performance of the estimator depends on the system magnitude. An ad hoc estimator for the magnitude of the system function is simply to apply linear interpolation between successive sinewave peaks. This results in the function
M̂(ω) = Ak + [(Ak+1 − Ak)/(ωk+1 − ωk)](ω − ωk),  ωk ≤ ω < ωk+1        (10.43)
where (Ak, ωk) denote the amplitude and frequency of the kth sinewave peak.
The problem with such a simple envelope estimator is that the system phase is sensitive to low-level peaks that can arise due to time variations in the system function or signal processing artifacts such as side-lobe leakage. Fortunately, this problem can be avoided using the technique proposed by Paul [8],[10] in the development of the Spectral Envelope Estimation Vocoder (SEEVOC).
The SEEVOC algorithm depends on having an estimate of the average pitch, denoted here by ω̄o. The first step is to search for the largest sinewave amplitude in the interval [ω̄o/2, 3ω̄o/2]. Having found the amplitude and frequency of that peak, labeled (A1, ω1), one then searches the interval [ω1 + ω̄o/2, ω1 + 3ω̄o/2] for its largest peak, labeled (A2, ω2), as illustrated in Figure 10.13. The process is continued by searching the intervals [ωk−1 + ω̄o/2, ωk−1 + 3ω̄o/2] for the largest peaks (Ak, ωk) until the edge of the speech bandwidth is reached. If no peak is found in a search bin, then the largest endpoint of the short-time Fourier transform magnitude is used and placed at the bin center, from which the search procedure is continued. The principal advantage of this pruning method is that any low-level peaks within a pitch interval are masked by the largest peak, presumably a peak that is close to an underlying harmonic. Moreover, the procedure depends neither on the peaks being harmonic nor on the exact value of the average pitch, since the procedure resets itself after each peak has been found. The SEEVOC envelope, the envelope upon which the above minimum-phase analysis is based, is then obtained by applying the linear interpolation rule,7 Equation (10.43), where now the sinewave amplitudes and frequencies are those obtained using the SEEVOC peak-picking routine.
7 In the sinewave pitch estimation of the previous section, a piecewise-constant envelope was derived from the pruned peaks rather than a piecewise-linear envelope, having the effect of maintaining at harmonic frequencies the original peak amplitudes.
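The SEEVOC pruning loop can be sketched as follows. This is an illustrative fragment: the function name is our own, and the zero-amplitude stand-in for an empty bin is a simplification of the rule described above (which instead uses the larger STFT endpoint of the bin).

```python
import numpy as np

def seevoc_peaks(peak_freqs, peak_amps, pitch, bandwidth):
    """Prune spectral peaks one pitch-wide bin at a time (SEEVOC).
    peak_freqs/peak_amps: all candidate STFT-magnitude peaks; pitch: the
    average-pitch estimate; all frequencies in one consistent unit."""
    freqs, amps = [], []
    lo = pitch / 2.0                       # current search bin is [lo, lo + pitch)
    while lo < bandwidth:
        hi = lo + pitch
        in_bin = [(a, f) for f, a in zip(peak_freqs, peak_amps) if lo <= f < hi]
        if in_bin:
            a, f = max(in_bin)             # largest peak in the bin masks the rest
        else:
            f = (lo + hi) / 2.0            # no peak: stand-in placed at the bin center
            a = 0.0                        # simplified stand-in amplitude (see text)
        freqs.append(f)
        amps.append(a)
        lo = f + pitch / 2.0               # reset the search relative to the found peak
    return np.array(freqs), np.array(amps)
```

Note how a low-level spurious peak lying between two harmonics is masked by the larger neighboring harmonic peak, and how the bin edges reset after every peak, so the procedure tolerates both aharmonicity and a rough pitch estimate.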
Consider now the minimum-phase model that comes naturally from the discussion in Section 10.5.2. In order to evaluate the accuracy of a minimum-phase model, as well as the onset estimator, it is instructive to examine the behavior of the phase error (previously referred to as the “phase residual”) associated with this model. Let
θ̂(ω) = βπ − n̂oω + Φ̂(ω)        (10.44)
be the model estimate of the sinewave phase at any frequency ω, where Φ̂(ω) denotes the minimum-phase estimate. Then, for the sinewave at frequency ωk, for which the measured phase is θk, the phase error is
εk = θk − θ̂(ωk).
The phase error for the speech sample in Figure 10.11a is shown in Figure 10.11c. In this example, the phase error is small for frequencies below about 3.5 kHz. The larger structured error beyond 3.5 kHz probably indicates the inadequacy of the minimum-phase assumption and the possible presence of noise in the speech source or maximum-phase zeros in the transfer function H (ω) (Chapter 6). Nevertheless, as with the pitch estimator, it is instructive to evaluate the estimators by reconstruction [6],[7],[8].
One approach to reconstruction is to assume that the phase residual is negligible and form a phase function for each synthesis frame that is the sum of the minimum-phase function and a linear excitation phase and sign derived from the onset estimator. In other words, from Equation (10.44), the sinewave phases on each frame are given by
θk = βπ − n̂oωk + Φ̂(ωk)        (10.45)
When this phase is used, along with the measured sinewave amplitudes at ωk, the resulting quality of voiced speech is quite natural, but some slight hoarseness is introduced due to a few-sample inaccuracy in the onset estimator that results in a randomly changing pitch period (pitch jitter); it was found that even a sporadic one- or two-sample error in the onset estimator can introduce this distortion. This property was further confirmed by replacing the above absolute onset estimator with the relative onset estimator introduced in Chapter 9, whereby onset estimates are obtained by accumulating pitch periods. With this relative onset estimate, the hoarseness attributed to error in the absolute onset time is removed and the synthesized speech is free of artifacts. In spite of this limitation of the absolute onset estimator, it provides a means to obtain useful glottal flow timing information in applications where reduction in the linear phase is important (Exercise 10.5). Observe that we have described reconstruction of only voiced speech. When the phase function of Equation (10.45) is applied to unvoiced speech, particularly fricatives and voicing with frication or strong aspiration, the reconstruction is “buzzy” because an unnatural waveform peakiness arises, the sinewaves no longer being randomly displaced from one another (Exercise 10.6). An approach to removing this distortion when a minimum-phase vocal tract system is invoked is motivated by the phase-dithering model of Section 9.5.2 and will be described in Chapter 12 in the context of speech coding.
In this section, we describe a generalization of sinewave pitch and voicing estimation that yields a voicing decision in multiple frequency bands. A byproduct is sinewave amplitude estimates derived from a minimum-mean-squared error criterion. These pitch and voicing estimates were developed in the context of the Multi-Band Excitation (MBE) speech representation developed by Griffin and Lim [2],[8] where, as in Section 10.4, speech is represented as a sum of harmonic sinewaves.
As above, the synthetic waveform for a harmonic set of sinewaves is written as
ŝ[n] = Σk Bk cos(kωon + θk)        (10.46)
Whereas in Section 10.4.2 the sinewave amplitudes were assumed to be harmonic samples of an underlying vocal tract envelope, in MBE they are allowed to be unconstrained free variables and are chosen to render a minimum-mean-squared error fit to the measured speech signal s[n]. For an analysis window w[n] of length Nw, the short-time speech segment and its harmonic sinewave representation are denoted by sw[n] = w[n]s[n] and ŝw[n] = w[n]ŝ[n] (thus simplifying to one STFT slice), respectively. The mean-squared error between the two signals is given by
E(ωo, B, θ) = Σn |sw[n] − ŝw[n]|²        (10.47)
where B and θ are the vectors of unknown amplitudes and phases at the sinewave harmonics. Following the development in [8], let Sw(ω) denote the discrete-time Fourier transform of sw[n] and, similarly, Ŝw(ω) the discrete-time Fourier transform of ŝw[n]. Then, using Parseval’s Theorem, Equation (10.47) becomes
E = (1/2π) ∫−π..π |Sw(ω)|² dω − (1/π) Re{∫−π..π Sw(ω)Ŝw*(ω) dω} + (1/2π) ∫−π..π |Ŝw(ω)|² dω        (10.48)
The first term, which is independent of the pitch, amplitude, and phase parameters, is the energy in the windowed speech signal, which we denote by Ew. Letting Bk exp(jθk) represent the complex amplitude of the kth harmonic and using the sinewave decomposition in Equation (10.46), Ŝw(ω) can be written as
Ŝw(ω) = Σk βk W(ω − kωo)
with
βk = Bk exp(jθk)
where W(ω) is the discrete-time Fourier transform of the analysis window w[n]. Substituting this relation into Equation (10.48), the mean-squared error can be written as
E = Ew − (1/π) Re{Σk βk* ∫−π..π Sw(ω)W*(ω − kωo) dω} + (1/2π) Σk Σl βk βl* ∫−π..π W(ω − kωo)W*(ω − lωo) dω
For each value of ωo, this equation is quadratic in the βk. Therefore, each βk is a function of ωo, and we denote this set of unknowns by β(ωo). It is straightforward to solve for the β(ωo) that results in the minimum-mean-squared error, E[ωo, β(ωo)]. This process can be repeated for each value of ωo so that the optimal minimum-mean-squared-error estimate of the pitch can be determined. Although the quadratic optimization problem is straightforward to solve, it requires the solution of a simultaneous set of linear equations for each candidate pitch value. This makes the resulting pitch estimator complicated to implement. However, following [2], we assume that W(ω) is essentially zero outside the region |ω| ≤ ωo/2, which corresponds to the condition posed in Section 10.4 to ensure that the sinewaves are well-resolved. We then define the frequency region about each harmonic frequency kωo as
Ωk = {ω : kωo − ωo/2 ≤ ω < kωo + ωo/2}        (10.49)
with which the mean-squared error can be approximated as
E ≈ Σk (1/2π) ∫Ωk |Sw(ω) − βk W(ω − kωo)|² dω
from which it follows that the values of the complex amplitudes that minimize the mean-squared error are
βk = [∫Ωk Sw(ω)W*(ω − kωo) dω] / [∫Ωk |W(ω − kωo)|² dω]        (10.50)
The best mean-squared error fit to the windowed speech data is therefore given by
Ŝw(ω) = βk W(ω − kωo),  ω ∈ Ωk.
This expression is then used in Equation (10.47) to evaluate the mean-squared error for the given value of ωo. This procedure is repeated for each value of ωo in the pitch range of interest and the optimum estimate of the pitch is the value of ωo that minimizes the mean-squared error. While the procedure is similar to that used in Section 10.4, there are important differences. The reader is asked to explore these differences and similarities in Exercise 10.8. Extensions of this algorithm by Griffin and Lim exploit pitch estimates from past and future frames in “forward-backward” pitch tracking to improve pitch estimates during regions in which the pitch and/or vocal tract are rapidly changing [2].
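The pitch search above can be sketched as follows. This is an illustration rather than the full MBE algorithm: the per-harmonic band integrals of Equation (10.50) are evaluated in the time domain via Parseval's theorem (βk ≈ Σn sw[n]w[n]e^(−jkωon) / Σn w[n]², valid when the window main lobes around distinct harmonics do not overlap), and the function name, window choice, and candidate grid are our own.

```python
import numpy as np

def mbe_pitch(s, fs, f0_grid):
    """For each candidate pitch, fit one complex amplitude per harmonic
    (time-domain form of Equation (10.50)) and keep the candidate whose
    harmonic model has the smallest mean-squared error to the windowed data."""
    N = len(s)
    n = np.arange(N)
    w = np.hanning(N)
    sw = s * w
    den = np.sum(w ** 2)
    best_f0, best_err = None, np.inf
    for f0 in f0_grid:
        w0 = 2 * np.pi * f0 / fs
        K = int(np.pi / w0)                  # harmonics up to half the sampling rate
        model = np.zeros(N)
        for k in range(1, K + 1):
            # least-squares complex amplitude of the k-th harmonic
            beta = np.sum(sw * w * np.exp(-1j * k * w0 * n)) / den
            model += 2 * np.real(beta * np.exp(1j * k * w0 * n)) * w
        err = np.sum((sw - model) ** 2)      # mean-squared spectral/temporal error
        if err < best_err:
            best_f0, best_err = f0, err
    return best_f0
```

Because the amplitudes are unconstrained free variables, sub-harmonic candidates (e.g., half the true pitch) also fit the data well; in practice this pitch-halving ambiguity is handled by pitch tracking and by the considerations raised in Exercise 10.8, so the candidate grid here deliberately excludes sub-harmonics.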
As in Section 10.4.3, distinguishing between voiced and unvoiced spectral regions is based on how well the harmonic set of sinewaves fits the measured set of sinewaves. In Section 10.4.3, a signal-to-noise ratio (SNR) was defined in terms of the normalized mean-squared error, and this was mapped into a cutoff frequency below which the sinewaves were declared voiced and above which they were declared unvoiced. This idea, which originated with the work of Makhoul, Viswanathan, Schwartz, and Huggins [4], was generalized by Griffin and Lim [2] to allow for an arbitrary sequence of voiced and unvoiced bands with the measure of voicing in each of the bands determined by a normalized mean-squared error computed for the windowed speech signals. Letting
γm = {ω : ωm−1 ≤ ω ≤ ωm}, m = 1, 2, …, M
denote the mth band of M multi-bands over the speech bandwidth. Then, using Equation (10.48), the normalized mean-squared error for each band can be written as
Dm = [∫γm |Sw(ω) − Ŝw(ω)|² dω] / [∫γm |Sw(ω)|² dω]        (10.51)
Each of the M values of the normalized mean-squared error is compared with a threshold function to determine the binary voicing state of the sinewaves in each band. If Dm is below the threshold, the mean-squared error is small; hence, the harmonic sinewaves fit the input speech well and the band is declared voiced. The setting of the threshold uses several heuristic rules to obtain the best performance [3].
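The per-band decision rule can be sketched as follows, assuming the measured and harmonic-model spectra are available on a common FFT grid. The function name, the band representation as bin-index edges, and the threshold value are our own illustrative choices (the actual threshold is set by heuristic rules, as noted above).

```python
import numpy as np

def multiband_voicing(Sw, Sw_hat, band_edges, threshold=0.2):
    """Binary voicing per band from the normalized mean-squared error of
    Equation (10.51). Sw, Sw_hat: measured and harmonic-model spectra on a
    common FFT grid; band_edges: bin indices delimiting the M bands;
    threshold: illustrative value only."""
    decisions = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        num = np.sum(np.abs(Sw[lo:hi] - Sw_hat[lo:hi]) ** 2)
        den = np.sum(np.abs(Sw[lo:hi]) ** 2) + 1e-12   # guard against empty bands
        decisions.append(num / den < threshold)        # small error -> voiced
    return decisions
```

Because the error in each band is normalized by that band's own energy, a low-energy high-frequency band contributes on an equal footing with the strong low-frequency bands, which is precisely the noise-robustness property discussed below.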
It was observed that when the multi-band voicing decisions are combined into a two-band voicing-adaptive cutoff frequency, as was used in Section 10.4.3, no loss in quality was perceived in low-bandwidth (e.g., 300–4000 Hz) synthesis [3],[8],[9]. Nevertheless, this scheme affords the possibility of a more accurate two-band cutoff than that in Section 10.4.3 and the multi-band extension may be useful in applications such as speech transformations (Exercise 10.9). An additional advantage of multi-band voicing is that it can make reliable voicing decisions when the speech signal has been corrupted by additive acoustical noise [2],[3],[9]. The reason for this lies in the fact that the normalized mean-squared error essentially removes the effect of the spectral tilt, which means that the sinewave amplitudes contribute more or less equally from band to band. When one wide-band voicing decision is made, as in Section 10.4.3, only the largest sinewave amplitudes will contribute to the mean-squared error, and if these have been corrupted due to noise, then the remaining sinewaves, although harmonic, may not contribute enough to the error measure to offset those that are corrupted. Finally, observe that although the multi-band voicing provides a refinement to the two-band voicing strategy, it does not account for the additive nature of noise components in the speech spectrum, as was addressed in the deterministic-stochastic model of Chapter 9.
In this chapter, we introduced a frequency-domain approach to estimating the pitch and voicing state of a speech waveform, in contrast to time-domain approaches, based on analysis/synthesis techniques of earlier chapters. Specifically, we saw that pitch estimation can be thought of as fitting a harmonic set of sinewaves to a measured set of sinewaves, and the accuracy of the harmonic fit is an indicator of the voicing state. A simple autocorrelation-based estimator, formulated in the frequency domain, was our starting point and led to a variety of sinewave-based pitch estimators of increasing complexity and accuracy. A generalization of the sinewave model to a multi-band pitch and voicing estimation method was also described. In addition, we applied the sinewave model to the problem of glottal pulse onset time estimation. Finally, we gave numerous mixed- and minimum-phase sinewave-based waveform reconstruction techniques for evaluating the pitch, voicing, and onset estimators. A spinoff of these evaluation techniques is analysis/synthesis structures based on harmonic sinewaves, a minimum-phase vocal tract, and a linear excitation phase that form the basis for frequency-domain sinewave-based speech coding methods in Chapter 12. Other pitch and voicing estimators will be described as needed in the context of speech coding, being more naturally developed for a particular coding structure.
We saw both the features and limitations of the various pitch estimators using examples of typical voice types and of a diplophonic voice with secondary glottal pulses occurring regularly, about midway between primary glottal pulses within a glottal cycle. Many “anomalous” voice types that we have described earlier in this text, however, were not addressed. These include, for example, the creaky voice with erratically spaced glottal pulses and the case of every other vocal tract impulse response being amplitude-modulated. These cases generally give difficulty to pitch and voicing estimators and are considered in Exercises 10.10, 10.11, 10.12, and 10.13. In spite of large strides in the improvement of pitch and voicing estimators, such voice types, as well as voiced/unvoiced transitions and rapidly-varying speech events, render the pitch and voicing problem in many ways still challenging and unsolved.
10.1 Show that the autocorrelation-based pitch estimator in Equation (10.2) follows from minimizing the error criterion in Equation (10.1) with respect to the unknown pitch period P. Justify why you must constrain P > ε (a small positive value close to zero), i.e., P must be sufficiently far from zero.
10.2 In this problem you are asked to complete the missing steps in the harmonic sinewave model-based pitch estimator of Sections 10.4.1 and 10.4.2.
(a) Show that Equation (10.9) follows from Equation (10.8).
(b) Show that minimizing Equation (10.9) with respect to gives Equation (10.11), and thus gives Equation (10.12).
(c) Show that minimizing Equation (10.12) with respect to B gives Equation (10.13), and thus gives Equation (10.14).
(d) Explain why Equation (10.15) can lead to pitch-halving errors.
(e) Show how Equation (10.18) can be manipulated to obtain Equation (10.22). Fill in all of the missing steps in the text. Interpret Equation (10.29) in terms of its capability to avoid pitch halving relative to correlation-based pitch estimators. Argue that the pitch estimator also avoids pitch doubling.
(f) Explain how to obtain the results in Figure 10.4b for the cases ƒo = 400 Hz and ƒo = 50 Hz.
(g) Propose an extension of the sinewave-based pitch estimator where an a priori vocal tract system phase envelope is known, as well as an a priori system magnitude envelope. Qualitatively describe the steps in deriving the estimator and explain why this additional phase information might improve the pitch estimate.
10.3 In the context of homomorphic filtering, we saw in Chapter 6 one approach to determining voicing state (i.e., a speech segment is either voiced or unvoiced) which requires the use of the real cepstrum, and in this chapter we derived a voicing measure based on the degree of harmonicity of the short-time Fourier transform. In this problem, you consider some other simple voicing measurements. Justify the use of each of the following measurements as a voicing indicator. For the first two measures, use your knowledge of acoustic phonetics, i.e., the waveform and spectrogram characteristics of voiced and unvoiced phonemes. For the last two measurements use your knowledge of linear prediction analysis.
(a) The relative energy in the outputs of complementary highpass and lowpass filters of Figure 10.14.
(b) The number of zero crossings in the signal.
(c) The first reflection coefficient generated in the Levinson recursion.
(d) The linear prediction residual obtained by inverse filtering the speech waveform by the inverse filter A(z).
10.4 Suppose, in the sinewave-based pitch estimator of Section 10.4.2, we do not replace the spectrum by its sinewave representation. What problems arise that the sinewave representation helps prevent? How does it help resolve problems inherent in the autocorrelation-based sinewave pitch estimators?
10.5 Show how the onset estimator of Section 10.5.2, which provides a linear excitation phase estimate, may be useful in obtaining a vocal tract phase estimate from sinewave phase samples. Hint: Recall from Chapter 9 that one approach to vocal tract system phase estimation involves interpolation of the real and imaginary parts of the complex STFT samples at the sinewave frequencies. When the vocal tract impulse response is displaced from the time origin, a large linear phase can be introduced.
10.6 It was observed that when the phase function in Equation (10.45), consisting of a minimum-phase system function and a linear excitation phase derived from the onset estimator of Section 10.5.2, is applied to unvoiced speech, particularly fricatives and voicing with frication or strong aspiration, the reconstruction is “buzzy.” It was stated that this buzzy perceptual quality is due to an unnatural waveform peakiness. Give an explanation for this peakiness property, considering the characteristics of the phase residual in Figures 10.11 and 10.12, as well as the time-domain characteristics of a linear excitation phase.
10.7 In this problem, you investigate the “magnitude-only” counterpart to the multi-band pitch estimator of Section 10.6. Consider a perfectly periodic voiced signal of the form
x[n] = h[n] * p[n]
where
p[n] = Σr δ[n − rP]
P being the pitch period. The windowed signal y[n] = w[n]x[n] (the window w[n] being a few pitch periods in duration) is expressed by
y[n] = w[n]x[n]
= w[n](h[n] * p[n]).
(a) Show that the Fourier transform of y[n] is given by
Y(ω) = (1/P) Σk H(kωo)W(ω − kωo)
where ωo = 2π/P is the fundamental frequency (pitch) and the sum runs over the N harmonics in the signal bandwidth.
(b) A frequency-domain pitch estimator uses the error criterion:
E = (1/2π) ∫−π..π |S(ω) − Y(ω)|² dω
where S(ω) is the short-time spectral measurement and Y(ω) is the model from part (a). In the model Y(ω) there are two unknowns: the pitch ωo and the vocal tract spectral values H(kωo). Suppose for the moment that we know the pitch ωo. Consider the error E over a region around each harmonic, and suppose that the window transform is narrow enough so that the main window lobes are independent (non-overlapping). Then the error around the kth harmonic can be written as
E(k) = (1/2π) ∫Ωk |S(ω) − (1/P)H(kωo)W(ω − kωo)|² dω
where Ωk is the band centered on the kth harmonic,
and the total error is approximately
E ≈ Σk E(k).
Given ωo, find an expression for H(kωo) that minimizes E(k). With this solution, write an expression for E. Keep your expression in the frequency domain and do not necessarily simplify. It is possible with Parseval’s Theorem to rewrite this expression in the time domain in terms of autocorrelation functions, which leads to an efficient implementation, but you are not asked to show this here.
(c) From part (b), propose a method for estimating the pitch ωo that invokes minimization of the total error E. Do not attempt to find a closed-form solution, but rather describe your approach qualitatively. Discuss any disadvantages of your approach.
10.8 Consider similarities and differences in the sinewave-based pitch and voicing estimator developed in Sections 10.4.2 and 10.4.3 with the multi-band pitch and voicing estimators of Section 10.6 for the following:
1. Pitch ambiguity with pitch halving or pitch doubling. Hint: Consider the use of unconstrained amplitude estimates in the multi-band pitch estimator and the use of samples of a vocal tract envelope in the sinewave pitch estimator of Section 10.4.2.
2. Voicing estimation with voiced fricatives or voiced sounds with strong aspiration.
3. Computational complexity.
4. Dependence on the phase of the discrete-time Fourier transform over each harmonic lobe. Hint: The phase of the discrete-time Fourier transform is not always constant across every harmonic lobe. How might this changing phase affect the error criterion in Equation (10.48) and thus the amplitude estimation in the multi-band amplitude estimator of Equation (10.50)?
10.9 This problem considers the multi-band voicing measure described in Section 10.6.2.
(a) Propose a strategy for combining individual band decisions into a two-band voicing measure, similar to that described in Section 10.4.3, where the spectrum above a cutoff frequency is voiced and below the cutoff is unvoiced. As a point of interest, little quality difference has been observed between this type of reduced two-band voicing measure and the original multi-band voicing measure when used in low-bandwidth (e.g., 3000–4000 Hz) synthesis.
(b) Explain why the multi-band voicing measure of part (a), reduced to a two-band decision, gives the possibility of a more accurate two-band voicing cutoff than the sinewave-based method of Section 10.4.3.
(c) Explain how the multi-band voicing measure may be more useful in sinewave-based speech transformations than the reduced two-band decision, especially for wide-bandwidth (e.g., > 4000 Hz) synthesis. Consider, in particular, pitch modification.
10.10 In this problem, you investigate the different “anomalous” voice types of Figure 10.15 with diplophonic, creaky, and modulation (pitch periods with alternating gains) characteristics. Consider both the time-domain waveform and the short-time spectrum obtained from the center of each waveform segment.
(a) For the diplophonic voice, describe how secondary pulses generally affect the performance of both time- and frequency-domain pitch estimators.
(b) For the creaky voice, describe how erratic glottal pulses generally affect the performance of both time- and frequency-domain pitch estimators.
(c) For the modulated voice, explain why different spectral bands exhibit different pitch values. Consider, for example, time-domain properties of the signal. Propose a multi-band pitch estimator for such voice cases.
10.11 (MATLAB) In this problem, you investigate the correlation-based pitch estimator of Section 10.2.
(a) Implement in MATLAB the correlation-based pitch estimator in Equation (10.2). Write your function to loop through a speech waveform at a 10-ms frame interval and to plot the pitch contour.
(b) Apply your estimator to the voiced speech waveform speech1_8k (at 8000 samples/s) in workspace ex10M1.mat located in companion website directory Chap_exercises/chapter10. Plot the short-time autocorrelation function for numerous frames. Describe possible problems with this approach for typical voiced speech.
(c) Apply your estimator to the “anomalous” voice types in workspace ex10M2.mat located in companion website directory Chap_exercises/chapter10. Three voice types are given (also shown in Figure 10.15): diplo1_8k (diplophonic), creaky1_8k (creaky), and modulated1_8k (pitch periods with alternating gain). Describe problems that your pitch estimator encounters with these voice types.
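Although the exercise asks for a MATLAB implementation of the estimator in Equation (10.2), the basic structure of a frame-based autocorrelation pitch tracker can be sketched as follows. This is an illustrative Python sketch of a generic short-time autocorrelation estimator, not a reproduction of Equation (10.2); the frame length, search range, and peak-picking rule are assumptions.

```python
import numpy as np

def autocorr_pitch(x, fs, frame_ms=30.0, hop_ms=10.0, fmin=50.0, fmax=400.0):
    """Generic short-time autocorrelation pitch estimator (illustrative;
    the exact criterion of Equation (10.2) may differ)."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)          # 10-ms frame interval
    lag_min = int(fs / fmax)               # shortest plausible pitch period
    lag_max = int(fs / fmin)               # longest plausible pitch period
    pitches = []
    for start in range(0, len(x) - frame, hop):
        seg = x[start:start + frame]
        seg = seg - np.mean(seg)
        # autocorrelation for nonnegative lags
        r = np.correlate(seg, seg, mode="full")[frame - 1:]
        if r[0] <= 0:
            pitches.append(0.0)            # silence: no pitch estimate
            continue
        # pick the lag of the largest autocorrelation peak in the search range
        lag = lag_min + np.argmax(r[lag_min:lag_max])
        pitches.append(fs / lag)           # pitch in Hz
    return np.array(pitches)
```

On the anomalous voice types of part (c), the largest-peak rule above is exactly where this estimator runs into trouble: secondary or erratic glottal pulses shift which autocorrelation peak dominates.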
10.12 (MATLAB) In this problem, you investigate the “comb-filtering” pitch estimator of Section 10.3.
(a) Implement in MATLAB the comb-filtering pitch estimator of Equation (10.7), as used in Example 10.1, selecting the last large peak in the pitch likelihood function Q(ω). Write the function to loop through a speech waveform at a 10-ms frame interval and to plot the pitch contour.
(b) Apply your pitch estimator to the voiced speech waveform speech1_8k (at 8000 samples/s) in workspace ex10M1.mat located in companion website directory Chap_exercises/chapter10. Add white noise to the speech waveform and recompute your pitch estimate. Plot the pitch likelihood function for numerous frames. Describe possible problems with this approach.
(c) Apply your estimator to the “anomalous” voice types in workspace ex10M2.mat located in companion website directory Chap_exercises/chapter10. Three voice types are given (and are shown in Figure 10.15): diplo1_8k (diplophonic), creaky1_8k (creaky), and modulated1_8k (pitch periods with alternating gain). Describe problems that your pitch estimator encounters with these voice types.
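A comb-style pitch likelihood can be formed on a single frame by summing the magnitude spectrum at the harmonics of each candidate fundamental. The Python sketch below is a hedged illustration of this general idea, not the specific Q(ω) of Equation (10.7); the window, FFT size, and normalization are assumptions.

```python
import numpy as np

def comb_likelihood(x, fs, f0_grid, nfft=2048):
    """Illustrative comb-style pitch likelihood: for each candidate
    fundamental, average the magnitude spectrum over its harmonics.
    (The exact form of Q(omega) in Equation (10.7) may differ.)"""
    X = np.abs(np.fft.rfft(x * np.hamming(len(x)), nfft))
    Q = np.zeros(len(f0_grid))
    for i, f0 in enumerate(f0_grid):
        harmonics = np.arange(f0, fs / 2, f0)        # f0, 2*f0, ...
        bins = np.round(harmonics * nfft / fs).astype(int)
        Q[i] = X[bins].sum() / len(bins)             # average harmonic energy
    return Q
```

A simple peak pick, `f0_grid[np.argmax(Q)]`, then gives a per-frame pitch estimate; plotting Q for the noisy and anomalous cases of parts (b) and (c) reveals how competing peaks arise.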
10.13 (MATLAB) In this problem you investigate the sinewave-based pitch estimators developed in Sections 10.4.1 and 10.4.2.
(a) Implement in MATLAB the sinewave-based pitch estimator in Equation (10.15). Write the function to loop through a speech waveform at a 10-ms frame interval and to plot the pitch contour.
(b) Apply your estimator to the voiced speech waveform speech1_8k (at 8000 samples/s) in workspace ex10M1.mat located in companion website directory Chap_exercises/chapter10. Discuss possible problems with this approach, especially the pitch-halving problem and the problem of a bias toward low pitch.
(c) Apply your estimator from part (a) to the “anomalous” voice types in workspace ex10M2.mat located in companion website directory Chap_exercises/chapter10. Three voice types are given (Figure 10.15): diplo1_8k (diplophonic), creaky1_8k (creaky), and modulated1_8k (pitch periods with alternating gain). Describe problems that your pitch estimator encounters with these voice types.
(d) Compare your results from parts (b) and (c) with the complete sinewave-based estimator in Equation (10.29) of Section 10.4.2. In companion website directory Chap_exercises/chapter10 you will find the scripts ex10_13_speech1.m, ex10_13_diplo.m, ex10_13_creaky.m, and ex10_13_modulate.m. By running each script, you will obtain a plot of the pitch and voicing contours from the complete sinewave-based estimator for the four cases: speech1_8k, diplo1_8k, creaky1_8k, and modulated1_8k, respectively. Using both time- and frequency-domain signal properties, explain your observations and compare your results with those from parts (b) and (c).
(e) Now consider the time-domain homomorphic pitch and voicing estimators of Chapter 6. Predict the behavior of the homomorphic estimators for the three waveform types of Figure 10.15. How might the behavior differ from the frequency-domain estimator of Section 10.4.2?
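The harmonic-fitting idea behind Equation (10.15) — choosing the fundamental whose harmonic set best matches the measured sinewave frequencies and amplitudes — can be sketched as below. This is a simplified Python stand-in, not the book's criterion: the ad hoc penalty on predicted harmonics with no measured sinewave nearby is an assumption added to discourage the sub-harmonic (low-pitch) choices examined in part (b).

```python
import numpy as np

def sinewave_pitch(freqs, amps, f0_grid, fmax):
    """Illustrative harmonic-fit pitch estimator: pick the candidate
    fundamental whose harmonics best explain the measured sinewave
    frequencies (a simplified stand-in for Equation (10.15))."""
    best_f0, best_err = f0_grid[0], np.inf
    for f0 in f0_grid:
        # amplitude-weighted deviation of each sinewave from its nearest harmonic
        k = np.maximum(np.round(freqs / f0), 1.0)
        dev = np.sum(amps * np.abs(freqs - k * f0)) / np.sum(amps)
        # ad hoc penalty: predicted harmonics with no measured sinewave nearby,
        # guarding against the bias toward low pitch (pitch halving)
        harmonics = np.arange(f0, fmax, f0)
        unmatched = sum(1 for h in harmonics
                        if np.min(np.abs(freqs - h)) > 0.25 * f0)
        err = dev + 10.0 * unmatched       # 10 Hz per missing harmonic (assumed)
        if err < best_err:
            best_f0, best_err = f0, err
    return best_f0
```

Dropping the penalty term makes the low-pitch bias of part (b) immediate: every candidate at f0/2, f0/3, … fits the measured frequencies exactly as well as f0 itself.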
[1] B. Gold and L.R. Rabiner, “Parallel Processing Techniques for Estimating Pitch Periods of Speech in the Time Domain,” J. Acoustical Society of America, vol. 46, no. 2, pp. 442–448, 1969.
[2] D. Griffin and J.S. Lim, “Multiband Excitation Vocoder,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–36, pp. 1223–1235, 1988.
[3] A. Kondoz, Digital Speech: Coding for Low Bit Rate Communication Systems, John Wiley & Sons, New York, NY, 1994.
[4] J. Makhoul, R. Viswanathan, R. Schwartz, and A.W.F. Huggins, “A Mixed-Source Model for Speech Compression and Synthesis,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Tulsa, OK, pp. 163–166, 1978.
[5] R.J. McAulay and T.F. Quatieri, “Pitch Estimation and Voicing Detection Based on a Sinusoidal Model,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Albuquerque, NM, pp. 249–252, 1990.
[6] R.J. McAulay and T.F. Quatieri, “Phase Modeling and its Application to Sinusoidal Transform Coding,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Tokyo, Japan, pp. 1713–1715, April 1986.
[7] R.J. McAulay and T.F. Quatieri, “Low-Rate Speech Coding Based on the Sinusoidal Speech Model,” chapter in Advances in Speech Signal Processing, S. Furui and M.M. Sondhi, eds., Marcel Dekker, 1992.
[8] R.J. McAulay and T.F. Quatieri, “Sinusoidal Coding,” chapter in Speech Coding and Synthesis, W.B. Kleijn and K.K. Paliwal, eds., Elsevier, 1995.
[9] M. Nishiguchi, J. Matsumoto, R. Wakatsuki, and S. Ono, “Vector Quantized MBE with Simplified V/UV Division at 3.0 kb/s,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Minneapolis, MN, vol. 2, pp. 151–154, April 1993.
[10] D.B. Paul, “The Spectral Envelope Estimation Vocoder,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–29, pp. 786–794, 1981.
[11] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, NJ, 1978.
[12] L.R. Rabiner, “On the Use of Autocorrelation Analysis for Pitch Detection,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–25, no. 1, pp. 24–33, 1977.
[13] M.M. Sondhi, “New Methods of Pitch Extraction,” IEEE Trans. Audio and Electroacoustics, vol. AU–16, no. 2, pp. 262–266, June 1968.
[14] R. Smits and B. Yegnanarayana, “Determination of Instants of Significant Excitation in Speech Using Group Delay Function,” IEEE Trans. Speech and Audio Processing, vol. 3, no. 5, pp. 325–333, Sept. 1995.