Chapter 5
Analysis and Synthesis of Pole-Zero Speech Models

5.1 Introduction

In Chapter 4, based on knowledge of speech production and the wave equation for sound propagation, we developed a continuous- and then a discrete-time transfer function for the relation between the acoustic pressure at the lips output and the volume velocity at the glottis. For the idealized voiced speech case, the transfer function contains poles that correspond to the resonances of the vocal tract cavity. Using a particular model of the glottal airflow, two additional poles were introduced outside the unit circle. The radiation load at the lips was represented by one zero inside the unit circle. More generally, the transfer function from the glottis to the lips also includes zeros that represent the energy-absorbing anti-resonances resulting from the back cavity during unvoiced plosives or fricatives, from the oral passage during nasal consonants, or from the nasal passage during nasalized vowels.

Speech sounds with periodic or impulsive sources are loosely categorized as “deterministic,” while speech sounds with noise sources are loosely classified as “stochastic” sounds. In Sections 5.3 and 5.4 of this chapter we estimate parameters of an all-pole system function for both the deterministic and stochastic sound classes, respectively; we will see that the resulting solution equations for the two classes are similar in structure. The solution approach is referred to as linear prediction analysis from which we derive, in Section 5.6, a method of speech synthesis based on the all-pole model. We also see how this method of analysis is intimately associated with the concatenated lossless tube model of Chapter 4, and as such presents a mechanism for estimating the vocal tract shape from all-pole system parameters. In Section 5.5, we evaluate the “goodness” of the linear prediction solution both in the time domain with respect to the shape of the waveform obtained in synthesis, and in the frequency domain with respect to the magnitude and the phase of the resulting vocal tract impulse response. We then describe in Section 5.7 methods that generalize the all-pole system analysis to estimation of a model transfer function consisting of both poles and zeros. This leads to a “pitch synchronous” technique, based on the glottal closed phase, for separating the glottal flow waveform from the vocal tract impulse response. These methods do not constrain the glottal flow function to a maximum-phase two-pole model. Example estimates illustrate the typical glottal airflow, as well as speaker-dependent contributions such as aspiration and secondary glottal pulses, that occur in the breathy and diplophonic voices, described in Chapter 3. Because the methods require that we view speech through a sliding window, we begin in Section 5.2 with a brief return to short-time processing of speech.

5.2 Time-Dependent Processing

We have seen that an essential property of speech production is that the vocal tract and the nature of its source vary with time and that this variation can be rapid. Many analysis techniques, however, assume that these characteristics change relatively slowly, which means that, over a short-time interval, e.g., 20–40 ms, the vocal tract and its input are stationary. By “stationary,” we mean that the vocal tract shape, and thus its transfer function, remain fixed (or nearly fixed) over this short-time interval. In addition, a periodic source is characterized by a steady pitch and glottal airflow function for each glottal cycle within the short-time interval; likewise, a noise source has fixed statistical properties over this time.

In analyzing the speech waveform, we apply a sliding window whose duration is selected to make the short-time stationarity assumption approximately valid. We can make the window arbitrarily short in time to meet this short-time criterion; this would provide a window short enough to resolve short-duration and rapidly-changing events without (most of the time) blending them with adjacent sounds, thus giving adequate time resolution. On the other hand, we want the window long enough to resolve fine structure in the speech spectrum such as individual harmonics and closely-spaced resonances, thus giving adequate frequency resolution. As we have seen in our discussion of the wideband and narrowband spectrograms in Chapter 3, these time-frequency objectives often cannot be met simultaneously, a reflection of the uncertainty principle that we reviewed in Chapter 2. We select a window duration, therefore, to trade off our objectives, typically of duration 20–40 ms. Our selected window slides at a frame interval sufficient to follow changing speech events, typically 5–10 ms, and thus adjacent sliding windows overlap in time. The specific shape of the window also contributes to the time and frequency resolution properties; typical windows are rectangular or tapered at their ends, and are characterized by different mainlobe and sidelobe structure. The rectangular window, for example, has a narrower mainlobe than the tapered Hamming window, but higher sidelobe structure.
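A rough numerical sketch of this mainlobe/sidelobe tradeoff (the window length and FFT size here are assumed illustrative values, not taken from the text) compares the rectangular and Hamming windows directly from their Fourier transforms:

```python
import numpy as np

# Compare mainlobe width and peak sidelobe level of a rectangular vs. a
# Hamming window of equal length, illustrating the resolution tradeoff.
N = 400                       # e.g., 40 ms at a 10-kHz sampling rate (assumed)
nfft = 8192
rect = np.ones(N)
hamm = np.hamming(N)

def spectrum_db(w):
    # magnitude spectrum normalized to 0 dB at its peak
    W = np.abs(np.fft.rfft(w, nfft))
    return 20 * np.log10(W / W.max() + 1e-12)

def first_null(W_db):
    # index of the first spectral null beyond the mainlobe peak
    return int(np.argmax(np.diff(W_db) > 0))

rect_db = spectrum_db(rect)
hamm_db = spectrum_db(hamm)
# rectangular: narrower mainlobe but higher (~-13 dB) sidelobes;
# Hamming: roughly double the mainlobe width, sidelobes near -43 dB
```

The rectangular window's first null falls at about half the lag of the Hamming window's, while its peak sidelobe sits near −13 dB versus roughly −43 dB for the Hamming window, quantifying the statement above.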

In performing analysis over each window, we estimate the vocal tract transfer function parameters, e.g., vocal tract poles and zeros, as well as parameters that characterize the vocal tract input of our discrete-time model. The short-time stationarity condition requires that the parameters of the underlying system are nearly fixed under the analysis window and therefore that their estimation is meaningful.

5.3 All-Pole Modeling of Deterministic Signals

5.3.1 Formulation

We begin by considering a transfer function model from the glottis to the lips output for deterministic speech signals, i.e., speech signals with a periodic or impulsive source. From Chapter 4, during voicing the transfer function consists of glottal flow, vocal tract, and radiation load contributions given by the all-pole z-transform

H(z) = A / (1 − Σ_{k=1}^{p} a_k z^{−k}),

where we have lumped the gain and glottal airflow (assumed modeled by two poles outside the unit circle) over a single glottal cycle into the transfer function H(z). This corresponds to an idealized input ug[n] that is a periodic impulse train. (Observe that the notation H(z) has taken on a different meaning from that in Chapter 4.) With a single-impulse input for a plosive, there is no glottal flow contribution, G(z). Observe that we have made R(z) all-pole, as well as V(z) and G(z). We saw in Chapter 4, however, that a simplified model for R(z) is a single zero inside the unit circle. A zero inside the unit circle, however, can be expressed as an infinite product of poles inside the unit circle. To see this relation, first recall that the geometric series Σ_{k=0}^{∞} r^k converges to 1/(1 − r) for |r| < 1 so that, letting r = az^{−1}, we have

Σ_{k=0}^{∞} a^k z^{−k} = 1 / (1 − az^{−1}).

Cross-multiplying, we then have

1 − az^{−1} = 1 / (Σ_{k=0}^{∞} a^k z^{−k}),

which, in practice, is approximated by a finite set of poles because a^k → 0 as k → ∞. Consequently, H(z) can be considered an all-pole representation, although representing a zero by a large number of poles is not efficient in light of estimating the zero directly. We study zero estimation methods later in this chapter. Our goal here is to estimate the filter coefficients ak for a specific order p and the gain A. The poles can be inside or outside the unit circle, the outside poles corresponding to the maximum-phase two-pole model for G(z).
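The pole-for-zero substitution above can be checked numerically on the unit circle; the zero location a = 0.7 in this sketch is an arbitrary choice:

```python
import numpy as np

# Verify that the single zero (1 - a z^{-1}) is approximated by the
# reciprocal of the truncated geometric series sum_{k=0}^{K} a^k z^{-k},
# i.e., by a K-pole transfer function, with error shrinking as K grows.
a = 0.7                                   # illustrative zero location, |a| < 1
w = np.linspace(0, np.pi, 512)            # frequencies on the unit circle
z_inv = np.exp(-1j * w)

zero_response = 1 - a * z_inv             # the exact single-zero response

def allpole_approx(K):
    # denominator: truncated geometric series sum_{k=0}^{K} (a z^{-1})^k
    denom = sum((a * z_inv) ** k for k in range(K + 1))
    return 1.0 / denom

err10 = np.max(np.abs(zero_response - allpole_approx(10)))
err40 = np.max(np.abs(zero_response - allpole_approx(40)))
# err40 << err10: the approximation improves geometrically with K
```

The truncation error decays like a^{K+1}, which is why many poles are needed for an accurate zero, and why estimating the zero directly is more efficient.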

The basic idea behind linear prediction analysis is that each speech sample is approximated as a linear combination of past speech samples. This notion leads to a set of analysis techniques for estimating parameters of the all-pole model. To motivate the idea, consider the z-transform of the vocal tract input ug[n], Ug(z), with a gain A [being embedded in H(z)], and let S(z) denote the z-transform of its output. Then S(z) = H(z)Ug(z) and we can write

S(z) = A Ug(z) / (1 − Σ_{k=1}^{p} a_k z^{−k}),

or

S(z) = (Σ_{k=1}^{p} a_k z^{−k}) S(z) + A Ug(z),

which in the time domain is written as

(5.1)

s[n] = Σ_{k=1}^{p} a_k s[n−k] + A ug[n].

Equation (5.1) is sometimes referred to as an autoregressive (AR) model because the output can be thought of as regressing on itself. The coefficients ak are referred to as the linear prediction coefficients, and their estimation is termed linear prediction analysis [2]. Quantization of these coefficients, or of a transformed version of these coefficients, is called linear prediction coding (LPC) and will be described in Chapter 12 on speech coding.

Recall that in voicing we have lumped the glottal airflow into the system function so that ug[n] is a train of unit samples. Therefore, except for the times at which ug[n] is nonzero, i.e., every pitch period, from Equation (5.1) we can think of s[n] as a linear combination of past values of s[n], i.e.,

s[n] = Σ_{k=1}^{p} a_k s[n−k].

This observation motivates the analysis technique of linear prediction. Before describing this analysis method, we note a minor inconsistency in our formulation. We have assumed that the z-transform of the glottal airflow is approximated by two poles outside the unit circle, i.e., G(z) = 1/(1 − βz)² with |β| < 1. Because our all-pole transfer function is written in negative powers of z, and thus assumed causal, it is implied that any delay due to the glottal source has been removed. We will see shortly that removal of this delay is of no consequence in the analysis.
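Equation (5.1) can be simulated directly. In this sketch the two resonances, pole radii, gain, and pitch period are illustrative values, not taken from the text:

```python
import numpy as np
from scipy.signal import lfilter

# Simulate the AR model of Equation (5.1): a periodic impulse train A*ug[n]
# driving the all-pole filter 1/(1 - sum_k a_k z^{-k}).
fs = 10000.0
den = np.array([1.0])
for f0, radius in [(500.0, 0.98), (1500.0, 0.95)]:   # assumed formants
    th = 2 * np.pi * f0 / fs
    den = np.convolve(den, [1.0, -2 * radius * np.cos(th), radius ** 2])
a = -den[1:]                  # a_k in s[n] = sum_k a_k s[n-k] + A ug[n]

A, P = 1.0, 100               # gain and pitch period (100 Hz at 10 kHz)
ug = np.zeros(1000)
ug[::P] = 1.0                 # idealized periodic impulse train
s = lfilter([A], den, ug)     # voiced-speech-like waveform
```

Away from the impulse times, each sample of s is exactly the weighted sum of its p past samples, which is the property linear prediction exploits.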

We first state some definitions [21]. A linear predictor of order p is defined by

(5.2)

s̃[n] = Σ_{k=1}^{p} α_k s[n−k].

The sequence s̃[n] is the prediction of s[n] by the sum of p past weighted samples of s[n]. The system function associated with the pth-order predictor is a finite impulse response (FIR) filter of length p given by

(5.3)

P(z) = Σ_{k=1}^{p} α_k z^{−k}.

Prediction in the z-domain is therefore represented by S̃(z) = P(z)S(z). The prediction error sequence is given by the difference of the sequence s[n] and its prediction s̃[n], i.e.,

(5.4)

e[n] = s[n] − s̃[n] = s[n] − Σ_{k=1}^{p} α_k s[n−k],

and the associated prediction error filter is defined as

(5.5)

A(z) = 1 − P(z) = 1 − Σ_{k=1}^{p} α_k z^{−k},

which is illustrated in Figure 5.1a.

Suppose now that a measurement s[n] is the output of an all-pole system modeled as in Equation (5.1). Then if the predictor coefficients equal the model coefficients, i.e., αk = ak,

Figure 5.1 Filtering view of linear prediction: (a) prediction-error filter A(z) = 1 − P(z); (b) recovery of s[n] with 1/A(z) for αk = ak. A(z) is also considered the inverse filter because it can yield the input to an all-pole transfer function.

Image

we can write the prediction error as

e[n] = s[n] − Σ_{k=1}^{p} a_k s[n−k] = A ug[n],

and thus the input sequence Aug[n] can be recovered by passing s[n] through A(z). For this reason, under the condition αk = ak, the prediction error filter A(z) is sometimes called the inverse filter. Correspondingly, when we pass Aug[n] through the system 1/A(z), we obtain s[n] as shown in Figure 5.1b. Because A(z) is of finite order p, it consists of p zeros; likewise, its inverse 1/A(z) consists of p poles. Observe now that for voiced speech idealized with a periodic impulse train input, when αk = ak, e[n] is an impulse train and therefore equal to zero most of the time except at the periodic impulses. Likewise, when the speech waveform is a single impulse response, as for an idealized plosive sound, then e[n] is zero except at the time of the impulsive input. In a stochastic context, when ug[n] is white noise, then the inverse filter “whitens” the input signal, as we will elaborate in Section 5.4.¹

1 Observe that our subscript notation “g” is not appropriate for plosive and noise sounds because the source of these sounds typically does not occur at the glottis.
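A small sketch of this inverse-filtering relation, using an assumed stable second-order system (the coefficients are arbitrary, not from the text):

```python
import numpy as np
from scipy.signal import lfilter

# Pass the output of an all-pole H(z) through A(z) = 1 - sum_k a_k z^{-k};
# the prediction error returns the scaled impulse-train input A*ug[n],
# as in Figure 5.1.
a = np.array([1.2, -0.8])                # illustrative a_1, a_2 (stable pair)
A_gain = 2.0
den = np.concatenate(([1.0], -a))        # 1 - a_1 z^{-1} - a_2 z^{-2}

ug = np.zeros(300)
ug[::75] = 1.0                           # impulse train, period 75 samples
s = lfilter([A_gain], den, ug)           # all-pole synthesis, s = H(z){ug}
e = lfilter(den, [1.0], s)               # inverse filter: e[n] = A*ug[n]
```

Because A(z) exactly cancels the poles of H(z), the error e is zero everywhere except at the impulse times, mirroring the voiced-speech idealization above.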

Example 5.1       Consider an exponentially decaying impulse response of the form h[n] = a^n u[n], where u[n] is the unit step. Then the response to a scaled unit sample Aδ[n] is simply

s[n] = Aδ[n] * h[n] = A a^n u[n].

Consider now the prediction of s[n] using a linear predictor of order p = 1. The prediction error sequence with α1 = a is given by

e[n] = s[n] − a s[n−1] = Aδ[n],

i.e., the prediction of the signal is exact except at the time origin. This is consistent with A(z) = 1 − P(z) = 1 − az−1 being the inverse filter to H(z). Image

These observations motivate linear prediction analysis that estimates the model parameters when they are unknown. The approach of linear prediction analysis is to find a set of prediction coefficients that minimizes the mean-squared prediction error over a short segment of the speech waveform. The minimization leads to a set of linear equations that can be solved efficiently for the linear predictor coefficients ak and the gain A.

5.3.2 Error Minimization

The prediction of a sample at a particular time no is illustrated in Figure 5.2a. Over all time, we can define a mean-squared prediction error as

E = Σ_{m=−∞}^{∞} e²[m] = Σ_{m=−∞}^{∞} (s[m] − Σ_{k=1}^{p} α_k s[m−k])².

Figure 5.2 Short-time sequences used in linear prediction analysis: (a) prediction at time no; (b) samples in the vicinity of time n, i.e., samples over the interval [n − M, n + M]; (c) samples required for prediction of samples in the interval [n − M, n + M]. This set of samples, denoted by sn[m], includes samples inside and outside the interval [n − M, n + M].

Image

This definition, however, is not desirable because, as we have seen, we must assume short-time stationarity of the speech waveform. In practice, therefore, the prediction error is formed over a finite interval (Figure 5.2b), i.e.,

E_n = Σ_{m=n−M}^{n+M} e²[m] = Σ_{m=n−M}^{n+M} (s[m] − Σ_{k=1}^{p} α_k s[m−k])²,

where the subscript n refers to “in the vicinity of time n.” The time interval [n − M, n + M] is called the prediction error interval. For notational simplicity, we have assumed an interval of odd length 2M + 1 samples. Alternatively, we can write

(5.6)

E_n = Σ_{m=n−M}^{n+M} (s_n[m] − Σ_{k=1}^{p} α_k s_n[m−k])²,

where

(5.7)

s_n[m] = s[m],    for n − M − p ≤ m ≤ n + M,

and zero elsewhere. The sequence sn[m] is defined in the vicinity of time n over the interval [n − M − p, n + M] (and zero elsewhere), including p samples prior to time n − M because these samples are needed in the prediction of the first p samples beginning at time n − M (Figure 5.2c). The error En is a quadratic function in the unknowns αk and as such can be visualized as a “quadratic bowl” in p dimensions. Our goal is to find the minimum of this function, i.e., the bottom of the quadratic bowl.

To minimize En over αk, we set the derivative of En with respect to each variable αk to zero [14],[21]:

∂E_n/∂α_i = 0,    i = 1, 2, …, p,

resulting in

Σ_{m=n−M}^{n+M} 2 (s_n[m] − Σ_{k=1}^{p} α_k s_n[m−k]) (−∂/∂α_i Σ_{k=1}^{p} α_k s_n[m−k]) = 0.

Observing that ∂/∂α_i (α_k s_n[m−k]) = 0 for k ≠ i, because α_k s_n[m−k] is then constant with respect to α_i, we then have, with ∂E_n/∂α_i set to zero:

Σ_{m=n−M}^{n+M} (s_n[m] − Σ_{k=1}^{p} α_k s_n[m−k]) s_n[m−i] = 0,    i = 1, 2, …, p.

Multiplying through gives

(5.8)

Σ_{m=n−M}^{n+M} s_n[m] s_n[m−i] = Σ_{k=1}^{p} α_k Σ_{m=n−M}^{n+M} s_n[m−k] s_n[m−i],    i = 1, 2, …, p.

Define now the function

Φ_n[i, k] = Σ_{m=n−M}^{n+M} s_n[m−i] s_n[m−k].

Then we have

(5.9)

Σ_{k=1}^{p} α_k Φ_n[i, k] = Φ_n[i, 0],    i = 1, 2, …, p.

The set of Equations (5.9) are sometimes referred to as the normal equations and can be put in matrix form as

(5.10)

Φα = b,

where the (i, k)th element of the p × p matrix Φ is given by Φn[i, k], the ith element of the p × 1 vector b is given by Φn[i, 0], and the ith element of the p × 1 vector α is given by αi. The solution then requires obtaining Φn[i, k] from the speech samples, setting up the matrix equation, and inverting.

Finally, the minimum error for the optimal αk can be derived as follows:

(5.11)

E_n = Σ_{m=n−M}^{n+M} s_n²[m] − 2 Σ_{k=1}^{p} α_k Σ_{m=n−M}^{n+M} s_n[m] s_n[m−k] + Σ_{i=1}^{p} Σ_{k=1}^{p} α_i α_k Σ_{m=n−M}^{n+M} s_n[m−i] s_n[m−k].

Interchanging sums, the third term in Equation (5.11) can be rewritten as

Σ_{i=1}^{p} α_i Σ_{m=n−M}^{n+M} s_n[m−i] (Σ_{k=1}^{p} α_k s_n[m−k]) = Σ_{i=1}^{p} α_i Σ_{m=n−M}^{n+M} s_n[m−i] s_n[m],

where the last step invoked the optimal solution [Equation (5.8)] for the bracketed term. Replacing the third term in Equation (5.11) by the above simplified form, we have

(5.12)

E_n = Σ_{m=n−M}^{n+M} s_n²[m] − Σ_{k=1}^{p} α_k Σ_{m=n−M}^{n+M} s_n[m] s_n[m−k] = Φ_n[0, 0] − Σ_{k=1}^{p} α_k Φ_n[0, k].

When the order of an underlying all-pole transfer function is unknown, but the data s[n] follows an order-p autoregression within the prediction error interval, the error En can be monitored to help establish p because, under this condition, the error for a pth-order predictor in theory equals that of a (p + 1)st-order predictor. The predictor coefficients can also be monitored in finding p because the predictor coefficients for k > p equal zero. (The reader should argue both properties from Equations (5.9) and (5.12).)

In Equation (5.6), the prediction error en[m] is non-zero only “in the vicinity” of the time n, i.e., [n − M, n + M]. We saw in Figure 5.2, however, that in predicting values of the short-time sequence sn[m], p values outside (and to the left) of the prediction error interval [n − M, n + M] are required. For a long stationary speech sequence, these values are available. This is in contrast to cases where data is not available outside of the prediction error interval as, for example, with rapidly-varying speech events. The solution that uses for prediction such samples outside the prediction error interval, but forms the mean-squared error from values only within the prediction interval, is referred to as the covariance method of linear prediction. When the signal measurement s[n] corresponds to an underlying all-pole system function with a single or periodic impulse input, then the covariance method can give the exact model parameters from a short-time speech segment (Exercise 5.8). An alternative method assumes speech samples of zero value outside the prediction error interval, but forms the mean-squared error over all time. This method is referred to as the autocorrelation method of linear prediction. Because of the zero-value approximation outside the short-time interval, this method does not yield the exact all-pole model under the above conditions, as the covariance method can. Nevertheless, the approximation leads to a very efficient computational procedure which, interestingly, we will see is tied to our concatenated tube model of Chapter 4.
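The covariance method just described can be sketched directly from Equations (5.9), (5.10), and (5.12); the AR(2) coefficients below are illustrative, and the example confirms that the method is exact for an all-pole impulse response:

```python
import numpy as np

# Covariance method: form Phi_n[i,k] using p samples to the left of the
# prediction error interval, solve the normal equations, and evaluate the
# minimum error.
def covariance_lp(s, p, start, length):
    """Predict s[start .. start+length-1]; the samples
    s[start-p .. start-1] are used only for prediction."""
    phi = np.zeros((p + 1, p + 1))
    for i in range(p + 1):
        for k in range(p + 1):
            phi[i, k] = sum(s[m - i] * s[m - k]
                            for m in range(start, start + length))
    alpha = np.linalg.solve(phi[1:, 1:], phi[1:, 0])   # normal equations
    E = phi[0, 0] - alpha @ phi[0, 1:]                 # minimum error
    return alpha, E

a_true = np.array([1.2, -0.8])            # illustrative AR(2) coefficients
N = 200
s = np.zeros(N)
for n in range(N):                        # all-pole impulse response
    s[n] = 1.0 if n == 0 else 0.0
    if n >= 1: s[n] += a_true[0] * s[n - 1]
    if n >= 2: s[n] += a_true[1] * s[n - 2]

alpha, E = covariance_lp(s, p=2, start=10, length=50)
# alpha recovers a_true exactly and E is (numerically) zero, since every
# predicted sample lies past the impulse at n = 0
```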

Before leaving this section, we show that an alternative “geometric” approach, in contrast to the above “algebraic” approach, to solving the minimization problem leading to the normal equations is through the Projection Theorem [16],[25]. To do so, we interpret Equation (5.7) as an overdetermined set of equations in p unknowns:

Image

The columns of the above (2M + 1) × p matrix Sn are considered basis vectors for a vector space. The Projection Theorem states that the least-squared error occurs when the error vector en, with components e_n[m] for m in the interval [n − M, n + M], is orthogonal to each basis vector, i.e., S_nᵀ e_n = 0, where “T” denotes transpose. This orthogonality condition can be shown to lead to the normal equations that can be equivalently written as [16],[26]

(S_nᵀ S_n) α = S_nᵀ s_n,

where s_n denotes the vector of samples s_n[m] over the prediction error interval.

The beauty of this approach is its simplicity and its straightforward extension to other contexts such as stochastic signals.

5.3.3 Autocorrelation Method

In the previous section, we introduced the covariance method of linear prediction, which uses samples outside the prediction error interval and can result in an exact solution when the signal follows an all-pole model. In this section, we describe an alternative approach to linear prediction analysis, referred to as the autocorrelation method: a suboptimal method, but one that leads to an efficient and stable solution of the normal equations. The autocorrelation method assumes that the samples outside the time interval [n − M, n + M] are all zero and extends the prediction error interval, i.e., the range over which we minimize the mean-squared error, to ±∞. For convenience, the short-time segment begins at time n and ends at time n + Nw − 1 (Nw = 2M + 1), along the time axis m, rather than being centered at time n as we had previously assumed. We then shift the segment to the left by n samples so that the first nonzero sample falls at m = 0. Alternately, we can view this operation as shifting the speech sequence s[m] by n samples to the left and then windowing by an Nw–point rectangular window, w[m] = 1 for m = 0, 1, 2 … Nw − 1, to form the sequence (Figure 5.3)

sn[m] = s[m + n]w[m].

From this perspective, it can be seen with the help of the third-order predictor example in Figure 5.4 that the prediction error is nonzero only in the interval [0, Nw + p − 1] with Nw the window length and p the predictor order. The prediction error is largest at the left and right ends of the segment because we are predicting from and to zeros, respectively. Consequently, a tapered window is often used (e.g., Hamming). Without a tapered window, the mean-squared error Equation (5.6) may be dominated by end effects, tending to give the largest errors inside the window at the left edge, and outside the window at the right edge, as illustrated in Figure 5.4. On the other hand, with a tapered window, the data is distorted, hence biasing the estimates αk. For example, an all-pole waveform may no longer follow an all-pole model when tapered. Tradeoffs exist, therefore, in the selection of a window for linear prediction analysis. Nevertheless, because of the finite-length window, the estimated αk will never be exactly correct even when s[n] follows an all-pole model.

Figure 5.3 Formulation of the short-time sequence used in the autocorrelation method. In this interpretation, the waveform s[m] [panel (a)] is shifted by n samples [panel (b)] and then windowed by an Nw–point rectangular sequence w[m] [panel (c)]. The resulting sequence sn[m] is zero outside the interval [0, Nw − 1].

Image

Figure 5.4 Example of a third-order predictor in the autocorrelation method of linear prediction: (a) sliding predictor filter; (b) prediction error. Prediction error is largest at the beginning and the end of the interval [0, Nw + p − 1].

Image

With this formulation we now state the autocorrelation method of linear prediction [14],[21]. Let the mean-squared prediction error be given by

E_n = Σ_{m=0}^{Nw+p−1} e_n²[m] = Σ_{m=0}^{Nw+p−1} (s_n[m] − Σ_{k=1}^{p} α_k s_n[m−k])²,

where the limits of summation refer to our new time origin and where the prediction error outside this interval is zero. We can then show that the normal equations take on the following form (Exercise 5.1)

(5.13)

Σ_{k=1}^{p} α_k Φ_n[i, k] = Φ_n[i, 0],    i = 1, 2, …, p,

where

Φ_n[i, k] = Σ_{m=−∞}^{∞} s_n[m−i] s_n[m−k],

Figure 5.5 Construction of the autocorrelation function in the autocorrelation method. Overlapping regions of sn[m − k] and sn[m − i] are used in determining Φn[i, k].

Image

with specific summation limits due to windowing the speech waveform. Now, from Figure 5.5, we can write Φn[i, k] as

Φ_n[i, k] = Σ_{m=i}^{k+Nw−1} s_n[m−i] s_n[m−k],    for i ≥ k,

recognizing that only the interval [i, k + Nw − 1] contributes to the sum. With a change in variables m → m + i, we can rewrite the function Φn[i, k] as

Φ_n[i, k] = Σ_{m=0}^{Nw−1−(i−k)} s_n[m] s_n[m + i − k],

which is a function of only the difference i − k and so we denote it as

r_n[i − k] = Φ_n[i, k].

Letting τ = i − k, sometimes referred to as the correlation “lag,” we obtain the short-time autocorrelation function

(5.14)

r_n[τ] = Σ_{m=0}^{Nw−1−τ} s_n[m] s_n[m + τ] = s_n[τ] * s_n[−τ],

which is the short-time sequence sn[m] convolved with itself flipped in time. The autocorrelation function is a measure of the “self-similarity” of the signal at different lags τ. When rn[τ] is large, then signal samples spaced by τ are said to be highly correlated.

Some properties of rn[τ] are as follows (Exercise 5.3):

P1: For an N-point sequence, rn[τ] is zero outside the interval [−(N − 1), N − 1].

P2: rn[τ] is even in τ.

P3: rn[0] ≥ rn[τ].

P4: rn[0] equals the energy in sn[m], i.e., rn[0] = Σ_{m=0}^{Nw−1} sn²[m].

P5: If sn[m] is a segment of a periodic sequence, then rn[τ] is periodic-like, reflecting rn[τ] as a measure of periodic self-similarity. Because sn[m] is short-time, the overlapping data in the correlation decreases as τ increases, and thus the amplitude of rn[τ] decreases as τ increases; e.g., with a rectangular window, the envelope of rn[τ] decreases linearly.

P6: If sn[m] is a random white noise sequence, then rn[τ] is impulse-like, reflecting self-similarity within a small neighborhood (see Appendix 5.A for a brief review of random process theory).
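Equation (5.14) and properties P2–P4 can be checked numerically on an arbitrary short sequence (white noise in this sketch):

```python
import numpy as np

# Short-time autocorrelation r_n[tau] = sum_m s_n[m] s_n[m + tau],
# computed directly and via full correlation (the sequence convolved
# with its time-reverse).
def st_autocorr(x):
    N = len(x)
    return np.array([np.dot(x[:N - t], x[t:]) for t in range(N)])

rng = np.random.default_rng(0)
x = rng.standard_normal(64)          # arbitrary short-time sequence
r = st_autocorr(x)                   # lags 0 .. N-1

# lags -(N-1) .. (N-1); the nonnegative-lag half matches r (P1, P2)
full = np.correlate(x, x, mode="full")
```

The full correlation is even about lag zero (P2), its peak at lag zero equals the sequence energy (P3, P4), and its nonnegative-lag half reproduces r exactly.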

The following example illustrates these properties with speech waveforms:

Example 5.2       Examples of autocorrelation functions for a vowel, unvoiced plosive, voiced plosive, and unvoiced fricative are shown in Figure 5.6. Speech segments were obtained by applying a short-time rectangular window of about 40 ms to various speech waveforms. The autocorrelation function for the fricative phone is noisy and non-periodic as well as impulse-like, while for the voiced phone it is periodic-like with decreasing amplitude. The autocorrelation function of the voiced plosive shows a high-frequency and a noise contribution superimposed on a periodic component, while its unvoiced counterpart is comprised mainly of a high-frequency component with overriding noise due to the turbulent source and aspiration following the burst. Image

Figure 5.6 Illustration of autocorrelation functions of speech: (a) vowel /o/ in “pop”; (b) unvoiced plosive /k/ in “baker”; (c) unvoiced fricative /f/ in “father”; (d) voiced plosive /g/ in “go.”

Image

With our new definitions, by letting Φn[i, k] = rn[i − k], we can rewrite the normal equations as

Σ_{k=1}^{p} α_k r_n[i − k] = r_n[i],    i = 1, 2, …, p,

or

Σ_{k=1}^{p} α_k r_n[|i − k|] = r_n[i],    i = 1, 2, …, p,

which represent p linear equations in p unknowns, αk, for 1 ≤ k ≤ p. Using the normal equation solution, it can then be shown that the corresponding minimum mean-squared prediction error is given by

(5.15)

E_n = r_n[0] − Σ_{k=1}^{p} α_k r_n[k].

As before, the normal equations can be put in matrix form:

(5.16)

R_n α = r_n,

that is,

| r_n[0]     r_n[1]     …  r_n[p−1] | | α_1 |   | r_n[1] |
| r_n[1]     r_n[0]     …  r_n[p−2] | | α_2 | = | r_n[2] |
|   ⋮          ⋮        ⋱     ⋮    | |  ⋮  |   |   ⋮   |
| r_n[p−1]   r_n[p−2]   …  r_n[0]  | | α_p |   | r_n[p] |

The matrix Rn in Equation (5.16) is symmetric, and all elements along each diagonal are equal. We will see that this structure, which is referred to as the “Toeplitz” property, implies an efficient algorithm for solution. Motivation for the name “autocorrelation method” now becomes clear: entries in the matrix Rn are the first p autocorrelation coefficients of sn[m]. It can be shown that the columns of Rn are linearly independent and thus Rn is invertible [16],[25].
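A sketch of the autocorrelation method in this matrix form, applied to a windowed AR(2) impulse response with illustrative (assumed) pole locations:

```python
import numpy as np

# Set up and solve the Toeplitz normal equations R_n alpha = r_n of
# Equation (5.16) from a windowed all-pole impulse response.
a_true = np.array([2 * 0.9 * np.cos(np.pi / 4), -0.81])  # radius 0.9 pole pair

N = 512                                 # long window: solution near the truth
h = np.zeros(N)
for n in range(N):                      # h[n] = sum_k a_k h[n-k] + delta[n]
    h[n] = 1.0 if n == 0 else 0.0
    if n >= 1: h[n] += a_true[0] * h[n - 1]
    if n >= 2: h[n] += a_true[1] * h[n - 2]

p = 2
r = np.array([np.dot(h[:N - t], h[t:]) for t in range(p + 1)])
R = np.array([[r[abs(i - k)] for k in range(p)] for i in range(p)])  # Toeplitz
alpha = np.linalg.solve(R, r[1:])
E = r[0] - alpha @ r[1:]                # minimum error, Equation (5.15)
```

Because the window is long relative to the decay of h[n], the estimate is very close to the true coefficients; for large p, scipy.linalg.solve_toeplitz exploits the same Toeplitz structure more efficiently than a general solve.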

We now consider a number of properties of the autocorrelation method. As we stated earlier, for finite-length data the method can never give an exact solution, even when s[n] follows an all-pole model. In certain cases, however, as the window length arbitrarily increases, the autocorrelation solution approaches the true solution. We illustrate this property in the following example adapted from McClellan [16]:

Example 5.3       Consider an exponentially decaying impulse response of the form h[n] = anu[n], with u[n] the unit step. Then the response to a unit sample is simply

s[n] = δ[n] * h[n] = a^n u[n]

and has a one-pole z-transform

S(z) = 1 / (1 − az^{−1}),    |z| > |a|.

We are interested in estimating the parameter a of a first-order all-pole model from the first N samples of s[n] by using the autocorrelation method of linear prediction. This is equivalent to applying an N–point rectangular window in the interval [0, N − 1]. Using the closed-form solution for a finite geometric series, the first two autocorrelation coefficients can be computed from the first N samples of s[n] as [16]

r_0[0] = (1 − a^{2N}) / (1 − a²),    r_0[1] = a (1 − a^{2(N−1)}) / (1 − a²),

where the subscript “0” refers to the data segment located at time n = 0. The solution for the first-order predictor coefficient α1 should be a, but it is always biased² for finite N. From the autocorrelation normal equations, Equation (5.16), we can compute α1 as

2 It is important to note that the error incurred in predicting the beginning of a causal impulse response is not the cause of bias in either the autocorrelation or the covariance method because the “model’s impulse response is assumed produced with zero initial conditions” [16].

α_1 = r_0[1] / r_0[0] = a (1 − a^{2(N−1)}) / (1 − a^{2N}),

which approaches the true solution a as N → ∞. We can also show that the minimum squared error, Equation (5.15), for this solution is given by (Exercise 5.5)

E_0 = r_0[0] − α_1 r_0[1],

that as N → ∞ gives the value of s²[0] = 1, which is consistent with the first-order predictor’s inability to predict the first nonzero sample of the sequence. We saw this property in Example 5.1, where the prediction error sequence for the true predictor is given by e[n] = s[n] − a s[n − 1] = δ[n], i.e., with α1 = a, the prediction of the signal is exact except at the time origin.

Source: J.H. McClellan, “Parametric Signal Modeling” [16]. ©1988, Pearson Education, Inc. Used by permission. Image
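Example 5.3 can be verified numerically; the value a = 0.8 is an arbitrary choice:

```python
import numpy as np

# Check the closed-form bias of Example 5.3: for s[n] = a^n u[n] truncated
# to N samples, the autocorrelation-method estimate alpha_1 = r0[1]/r0[0]
# equals a(1 - a^(2(N-1))) / (1 - a^(2N)) and approaches a as N grows.
a = 0.8
for N in (4, 16, 64):
    s = a ** np.arange(N)
    r0 = np.dot(s, s)                    # r_0[0]
    r1 = np.dot(s[:-1], s[1:])           # r_0[1]
    est = r1 / r0
    closed = a * (1 - a ** (2 * (N - 1))) / (1 - a ** (2 * N))
    assert abs(est - closed) < 1e-12     # matches the closed form
    assert est < a                       # biased low for every finite N
```

By N = 64 the estimate agrees with a to within about 1e-10, illustrating the convergence claimed in the example.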

The previous example illustrates that, with enough data, the autocorrelation method yields a solution close to the true single-pole model for an impulse input. When the underlying measured sequence is the impulse response of an arbitrary all-pole sequence, then this result can be shown to also hold. There are a number of scenarios in the speech context, however, in which the true solution is not obtained with the autocorrelation method even with an arbitrarily long data sequence. As an important example, consider a periodic sequence, simulating a steady voiced sound, formed by convolving a periodic impulse train p[n] with an all-pole impulse response h[n]. The z-transform of h[n] is given by

H(z) = 1 / (1 − Σ_{k=1}^{p} a_k z^{−k}),

so that

h[n] = Σ_{k=1}^{p} a_k h[n−k] + δ[n],

where we have made the gain of the impulse input of value unity. By multiplying both sides of the above difference equation by h[n − i] and summing over n, and noting that h[n] is causal, we can show that (Exercise 5.7)

(5.17)

r_h[i] = Σ_{k=1}^{p} a_k r_h[i − k],    i = 1, 2, …, p,

where the autocorrelation of h[n] is denoted by rh[τ] = h[τ] * h[−τ]. Equation (5.17) represents an alternate approach to obtaining the normal equations when the signal of interest is known a priori to be an all-pole impulse response. Suppose now that a periodic waveform s[n] is constructed by running a periodic impulse train through h[n], i.e.,

s[n] = (Σ_{k=−∞}^{∞} δ[n − kP]) * h[n] = Σ_{k=−∞}^{∞} h[n − kP],

where P is the pitch period. The normal equations associated with s[n], windowed over multiple pitch periods, for an order p predictor, are given by

(5.18)

Σ_{k=1}^{p} α_k r_n[i − k] = r_n[i],    i = 1, 2, …, p,

where rn[τ] can be shown to equal periodically repeated replicas of rh[τ], i.e.,

r_n[τ] ≈ Σ_{q=−∞}^{∞} r_h[τ − qP],

but with decreasing amplitude due to the window (Exercise 5.7). The autocorrelation function of the windowed s[n], rn[τ], can thus be thought of as “aliased” versions of rh[τ]. When the aliasing is minor, the two solutions, Equations (5.17) and (5.18), are approximately equal. The accuracy of this approximation, however, decreases as the pitch period decreases because the autocorrelation function replicas, repeated every P samples, overlap and distort the underlying desired rh[τ] and this effect is potentially more severe for higher-pitched speakers (Figure 5.7). To summarize, from Equation (5.17), if the autocorrelation function in the normal equations equals that of h[n], then the solution to the normal equations must equal the αk’s of H(z) and thus the true solution. This true solution is achieved approximately for a periodic waveform for large pitch periods.
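A sketch of this aliasing effect, loosely following Figure 5.7 (the pole radius and durations below are assumed values, not from the figure):

```python
import numpy as np
from scipy.signal import lfilter

# Autocorrelation "aliasing": the short-time autocorrelation of a windowed
# periodic waveform superimposes replicas of r_h[tau] every P samples, so
# the low-order lags are distorted more as the pitch period shrinks.
fs = 10000.0
den = np.array([1.0])
for f0 in (500.0, 1500.0):                # two resonances, as in Figure 5.7
    th = 2 * np.pi * f0 / fs
    den = np.convolve(den, [1.0, -2 * 0.97 * np.cos(th), 0.97 ** 2])

imp = np.zeros(400); imp[0] = 1.0
h = lfilter([1.0], den, imp)              # all-pole impulse response

def norm_ac(x, maxlag=40):
    r = np.array([np.dot(x[:len(x) - t], x[t:]) for t in range(maxlag)])
    return r / r[0]

def windowed_ac(P):                       # P = pitch period in samples
    ug = np.zeros(400); ug[::P] = 1.0     # 40-ms rectangular window at 10 kHz
    return norm_ac(lfilter([1.0], den, ug))

rh = norm_ac(h)
err_50hz = np.abs(windowed_ac(200) - rh).max()   # replicas barely overlap
err_200hz = np.abs(windowed_ac(50) - rh).max()   # replicas overlap heavily
```

The low-lag distortion for the 200-Hz pitch is markedly larger than for the 50-Hz pitch, consistent with the accuracy loss for higher-pitched speakers noted above.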

In practice, in the autocorrelation method when a finite segment is extracted, and even with little autocorrelation aliasing present, distortion in the true autocorrelation can occur due to window truncation or tapering effects, as well as time-variation of the vocal tract (Exercise 5.10). We will see later in this chapter that the covariance method, on the other hand, can obtain the true solution even from a periodic waveform. The covariance method, however, has the disadvantage that, unlike the autocorrelation method, stability is not guaranteed when the underlying signal does not follow an all-pole model [16],[21].

Figure 5.7 Illustration of autocorrelation “aliasing” in the autocorrelation method of linear prediction for a windowed periodic waveform: (a) rh[τ]; (b) rn[τ] for 50-Hz pitch; (c) rn[τ] for 100-Hz pitch; (d) rn[τ] for 200-Hz pitch. Aliasing increases with rising pitch. For the 200-Hz pitch case, low-order autocorrelation coefficients are considerably different from those of rh[τ], thus illustrating that accuracy of linear prediction analysis decreases with increasing pitch. In this example, h[m] consists of two poles at 500 Hz and 1500 Hz and sn[m] is the convolution of h[m] with a periodic impulse train, rectangularly windowed over a 40-ms duration.

Image

Observe that we have not yet computed the unknown gain A of our model. We will return to this in a moment, but first we describe an efficient recursive method for solving the normal equations for the autocorrelation method. This recursive solution is the basis for a number of important properties of the autocorrelation method.

5.3.4 The Levinson Recursion and Its Associated Properties

Although we can solve for the predictor coefficients using a matrix inverse in Equations (5.10) and (5.16), the Toeplitz form of Rn in the normal equations (5.16) of the autocorrelation method leads to a more efficient recursive approach. A direct inverse, i.e., α = R_n⁻¹ r_n, requires on the order of p³ multiplies and additions using a conventional inversion algorithm such as Gaussian elimination for matrix equations of the general form Ax = b [25]. On the other hand, the recursive solution requires on the order of p² multiplies and additions and, furthermore, also leads to a direct link to the concatenated lossless tube model of Chapter 4, and thus a mechanism for estimating the vocal tract area function from an all-pole-model estimation. The recursive method was first published in 1947 by Levinson [13] and is known as the Levinson recursion.

The Recursion — The basic idea of the Levinson recursion is to build an order i + 1 solution from an order i solution until we reach the desired order p. Each recursion solves the i th-pole (or i th-order prediction) problem, finding the solution that minimizes the mean-squared error for each order predictor, by updating the lower-order solution. Alternatively, using the Projection Theorem, we can interpret each solution as giving an error vector that is orthogonal to columns of a matrix similar in structure to Sn but of lower order; at each stage the error vector becomes orthogonal to a vector space of increasing dimensionality, becoming progressively decorrelated and eventually “whitened” when i = p [26]. Specifically, the Levinson recursion is given in the following steps. For simplicity, we occasionally drop the subscript n notation on sn[m] and rn[τ], unless otherwise stated.

S1: Initialize the prediction error and predictor coefficient of an order-zero predictor, i.e.,

α^(0) = 0,    E0 = r[0]

Because we assume that the initial predictor is zero, i.e., all predictor coefficients αk are zero, the predicted sequence is zero. The mean-squared prediction error, therefore, is the energy in the sequence which, as we saw in a previous section, is the first value of the autocorrelation of the windowed s[n], i.e., r[0].

S2: Form a weighting factor for the i th pole model as

ki = ( r[i] − Σ_{j=1}^{i−1} α_j^{(i−1)} r[i − j] ) / E_{i−1}

S3: Using the result of step S2, update coefficients for the i th pole model as

α_i^{(i)} = ki
α_j^{(i)} = α_j^{(i−1)} − ki α_{i−j}^{(i−1)},    1 ≤ j ≤ i − 1

where we have set the highest-order coefficient to ki and where ki has also been used as a weighting factor on the coefficients of order i − 1 to obtain the coefficients of order i. We refer to the constants ki as the partial correlation coefficients for which the motivation will shortly become clear. These coefficients are also called the PARCOR coefficients.

S4: Update the minimum mean-squared prediction error for the i th pole model as

E_i = (1 − ki²) E_{i−1}

Ei, being a mean-squared prediction error, provides a way of monitoring the accuracy of prediction when the predictor order is unknown.

S5: Repeat steps S2 to S4 for i = 1, 2, … p. At the pth step we have the desired pth order predictor coefficients

α_k^* = α_k^{(p)},    k = 1, 2, …, p

where “*” denotes the (optimal) coefficients that minimize the mean-squared pth-order prediction error.

The derivation of the Levinson recursion can be found in [15],[16]. From step S2, we can show on each iteration that the predictor coefficients αk can be written solely as functions of the autocorrelation coefficients (Exercise 5.11). Finally, the desired transfer function is given by

H(z) = A / ( 1 − Σ_{k=1}^{p} αk z^{−k} )

where the gain A has yet to be determined. Calculation of the gain depends on certain properties of the estimated prediction coefficients and will be described later in this section.
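
To make steps S1–S5 concrete, here is a minimal sketch of the recursion in Python (our own illustration, not from the text; the function name and array layout are ours). It builds the order-i predictor from the order-(i − 1) predictor, returning the final coefficients αk, the PARCOR coefficients ki, and the error sequence Ei:

```python
import numpy as np

def levinson(r, p):
    """Levinson recursion (steps S1-S5): solve the Toeplitz normal
    equations of the autocorrelation method for an order-p predictor,
    given autocorrelation values r[0..p]."""
    r = np.asarray(r, dtype=float)
    a = np.zeros(p + 1)                 # a[j] holds the current alpha_j
    k = np.zeros(p + 1)                 # k[i] is the i-th PARCOR coefficient
    E = np.zeros(p + 1)                 # E[i] is the order-i prediction error
    E[0] = r[0]                         # S1: order-0 error = signal energy
    for i in range(1, p + 1):
        # S2: k_i = (r[i] - sum_j alpha_j^(i-1) r[i-j]) / E_{i-1}
        k[i] = (r[i] - np.dot(a[1:i], r[i-1:0:-1])) / E[i - 1]
        # S3: alpha_i^(i) = k_i; alpha_j^(i) = alpha_j^(i-1) - k_i alpha_{i-j}^(i-1)
        a_prev = a.copy()
        a[i] = k[i]
        for j in range(1, i):
            a[j] = a_prev[j] - k[i] * a_prev[i - j]
        # S4: E_i = (1 - k_i^2) E_{i-1}
        E[i] = (1.0 - k[i] ** 2) * E[i - 1]
    return a[1:p + 1], k[1:p + 1], E    # S5: order-p coefficients
```

Applied to the autocorrelation of a true all-pole impulse response, the recursion recovers the model coefficients exactly, and every |ki| < 1, anticipating property P1 below.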

Properties — Properties of the autocorrelation method motivated by the Levinson recursion are given below.

P1: The magnitude of each of the partial correlation coefficients is less than unity, i.e., |ki| < 1. To show this, we first recall that the mean-squared prediction error is always greater than zero because we cannot attain perfect prediction even when the sequence follows an all-pole model. Therefore, from the Levinson recursion, we have

E_i = (1 − ki²) E_{i−1} > 0

from which

1 − ki² = E_i / E_{i−1} > 0

and so it follows that

|ki| < 1

P2: The roots of the predictor polynomial A(z) lie strictly inside the unit circle, i.e., the transfer function H(z) has poles inside the unit circle and thus is minimum-phase. This can be shown by using the previous property that |ki| < 1 [16]. Therefore, the condition |ki| < 1 is sufficient for stability. In contrast, with the covariance method, the poles are not restricted to lie inside the unit circle, and thus stability is not guaranteed.

P3: Suppose that a stable sequence s[n] has a z-transform S(z) which is all-pole, but with both a minimum-phase and a maximum-phase contribution:

S(z) = A / ( Π_{k=1}^{Ni} (1 − ck z^{−1}) Π_{k=1}^{No} (1 − dk z) ),    |ck| < 1, |dk| < 1

where we assume poles occur in complex conjugate pairs. If the prediction error interval is infinite in duration and two-sided, then the autocorrelation method yields the true minimum-phase poles. The resulting maximum-phase poles, however, fall in the conjugate reciprocal locations of their true counterparts, i.e., the maximum-phase poles are converted to minimum-phase poles, so that the resulting transfer function is of the form

Ĥ(z) = A / ( Π_{k=1}^{Ni} (1 − ck z^{−1}) Π_{k=1}^{No} (1 − dk z^{−1}) )

We can argue this property as follows. The sequence s[n] follows an all-pole model, with some poles inside and some poles outside the unit circle. Observe that if we flip all maximum-phase poles inside the unit circle to their conjugate reciprocal locations, then the autocorrelation of s[n], i.e., rs[τ] = s[n] * s[−n] remains intact. This invariance can be seen by first observing that the spectral magnitude of s[n], when squared and inverse Fourier transformed, gives the autocorrelation function. As we saw in Chapter 2, the squared spectral magnitude function does not change with conjugate reciprocal zero-flipping. Now, we know from our property (P1) above that the autocorrelation method of linear prediction must yield poles inside the unit circle. Using the fact that the autocorrelation method gives the true solution for a minimum-phase sequence with an infinitely-long prediction error interval (as in Example 5.3), the reader can then deduce the remainder of the argument.
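
A small numerical check of this argument (our own illustration, with hypothetical pole values): convolve a causal sequence with a pole at z = 0.5 and an anticausal sequence with a pole at z = 1/0.6 (outside the unit circle), then solve the order-2 normal equations from the two-sided autocorrelation. The solution returns both poles inside the unit circle, at 0.5 and at the flipped location 0.6:

```python
import numpy as np

N = 80
h_min = 0.5 ** np.arange(N)              # causal: pole at z = 0.5 (minimum phase)
h_max = 0.6 ** np.arange(N)[::-1]        # anticausal: pole at z = 1/0.6 (maximum phase)
s = np.convolve(h_min, h_max)            # two-sided all-pole sequence

# Autocorrelation over the (effectively infinite) support of s
r = np.array([np.dot(s[:len(s) - t], s[t:]) for t in range(3)])

# Order-2 normal equations (autocorrelation method)
alpha = np.linalg.solve(np.array([[r[0], r[1]], [r[1], r[0]]]),
                        np.array([r[1], r[2]]))

# Predicted minimum-phase result: A(z) = (1 - 0.5 z^-1)(1 - 0.6 z^-1),
# i.e., alpha_1 = 1.1 and alpha_2 = -0.3
roots = np.roots([1.0, -alpha[0], -alpha[1]])
```

Both roots of the prediction-error polynomial land inside the unit circle, with the maximum-phase pole converted to its reciprocal location, as the property states.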

It is interesting to observe the meaning of this property with respect to the spectral phase of a measurement and its all-pole estimate. Suppose that s[n] follows an all-pole model, and suppose the prediction error function is defined over all time, i.e., we assume no window truncation effects. Let us write the Fourier transform of s[n] as

S(ω) = |S(ω)| e^{j[θ_smin(ω) + θ_smax(ω)]}

where θ_smin(ω) and θ_smax(ω) are the Fourier transform phase functions for the minimum- and maximum-phase contributions of S(ω), respectively. Then it is straightforward to argue (Exercise 5.14) that the autocorrelation method solution can be expressed as

(5.19)

Ŝ(ω) = |S(ω)| e^{j[θ_smin(ω) − θ_smax(ω)]}

i.e., the maximum-phase component is made negative, thus distorting the desired phase function of S(ω). In the following example, we show an effect of this phase change in the linear prediction analysis of speech.

Example 5.4       Consider the discrete-time model of the complete transfer function from the glottis to the lips derived in Chapter 4, but without zero contributions from the radiation and vocal tract:

H(z) = A / ( (1 − βz)² Π_{k=1}^{Ci} (1 − ck z^{−1})(1 − ck* z^{−1}) ),    β, |ck| < 1

In this model the glottal flow is represented by the double maximum-phase pole at z = 1/β. Suppose we measure a single impulse response denoted by h[n], which is equal to the inverse z-transform of H(z). Then application of the autocorrelation method with a model order set to the number of poles of H(z), i.e., p = 2 + 2Ci, where Ci is the number of complex-conjugate vocal tract pole pairs, with the prediction error defined over the entire extent of h[n], yields a solution

Ĥ(z) = A / ( (1 − β z^{−1})² Π_{k=1}^{Ci} (1 − ck z^{−1})(1 − ck* z^{−1}) )

i.e., the two maximum-phase poles due to the glottal flow are converted to their minimum-phase counterparts (Figure 5.8).

Figure 5.8 Transformation of glottal maximum-phase poles to minimum-phase poles by the autocorrelation method of linear prediction: (a) maximum-phase glottal flow (with displacement from the time origin), vocal tract impulse response, and resulting speech waveform for one excitation impulse; (b) minimum-phase glottal flow (with displacement from the time origin), vocal tract impulse response, and resulting speech waveform for one excitation impulse as obtained by the autocorrelation method of linear prediction. This example illustrates how the “attack” of an impulse response (and thus its perception) can be altered through linear prediction analysis. The simulated vocal tract consists of two resonances at 200 Hz and 600 Hz. The glottal flow waveform in (a) is the time-reversed decaying exponential 0.93^n u[n] (with u[n] the unit step) convolved with itself.

Image

Consider now the true frequency response written as

H(ω) = |H(ω)| e^{j[θ_v(ω) + θ_g(ω)]}

where θ_v(ω) is the minimum-phase contribution due to the vocal tract poles inside the unit circle, and θ_g(ω) is the maximum-phase contribution due to the glottal poles outside the unit circle. From Equation (5.19) and the above expression for Ĥ(z), the resulting estimated frequency response can be expressed as

Ĥ(ω) = |H(ω)| e^{j[θ_v(ω) − θ_g(ω)]}

so that the original phase function has been modified by the subtraction of twice the glottal phase component. This phase distortion can have a perceptual consequence when constructing a synthetic speech waveform from ĥ[n] (the inverse Fourier transform of Ĥ(ω)) because, as seen in Figure 5.8, the gradual onset of the glottal flow, and thus of the speech waveform during the open phase of the glottal cycle, is transformed to a “sharp attack,” consistent with the energy concentration property of minimum-phase sequences that we reviewed in Chapter 2.

P4: There is a one-to-one correspondence between the two sets of parameters [α1, α2, … , αp, A] and [rn[0], rn[1], … , rn[p]]. We have already indicated that the predictor coefficients are uniquely determined by the first p autocorrelation coefficients (Exercise 5.11). It is also possible to show that we can go in the other direction [16], i.e., the autocorrelation can be obtained from the predictor coefficients. Likewise, it is possible to show that the set consisting of the partial correlation coefficients along with the gain, i.e., [k1, k2, … , kp, A], has a one-to-one correspondence with either of the two sets. Although we will not formally prove these correspondences here, we can motivate them by reversing the Levinson recursion to obtain the partial correlation coefficients from the predictor coefficients [14]:

(5.20)

ki = α_i^{(i)}
α_j^{(i−1)} = ( α_j^{(i)} + ki α_{i−j}^{(i)} ) / ( 1 − ki² ),    1 ≤ j ≤ i − 1

thus obtaining lower-order models from higher-order models, rather than higher from lower order as in the (forward) Levinson recursion. You will prove this reverse Levinson recursion in Exercise 5.12 and use it in that problem to argue the one-to-one correspondence between different parameter sets.

P5: This next property is called autocorrelation matching. Consider a measurement sn[m] = s[n + m]w[m] with autocorrelation rn[τ]. Let h[n] denote the impulse response associated with the estimated transfer function of order p, Ĥ(z), obtained from the autocorrelation method of linear prediction using the measurement sn[m]. We denote the autocorrelation of h[n] by rh[τ]. Suppose now that the energy in the impulse response h[n] equals the energy in the measurement sn[m], i.e., rh[0] = rn[0]. Then it follows from this constraint that the two autocorrelation functions are equal for |τ| ≤ p, i.e.,

(5.21)

rh[τ] = rn[τ],    |τ| ≤ p

The proof of this property is quite involved and so we refer the interested reader to [16]. In the following subsection, we use the autocorrelation matching property to compute the gain A of the estimated system response; in particular, we show that when rh[0] = rn[0], the squared gain factor is given by the minimum average prediction error and thus can be expressed as

A² = En = rn[0] − Σ_{k=1}^{p} αk rn[k]

Computation of Gain in Autocorrelation Method — One approach3 to select the gain A is to require that the energy in the all-pole impulse response h[n] equal the energy in the measurement s[n], i.e., rh[0] = rn[0]. As we have seen, this constraint results in the matching of the autocorrelations of the two sequences for |τ| ≤ p, a property that will be exploited in determining the value of A.

3 We will see in a later section of this chapter that this particular selection of the gain A can be placed within a more rigorous framework in which our time-domain prediction error function En is mapped to the frequency domain. Nevertheless, in practice it may be desirable to require the energy in h[n] to equal the energy over a pitch period, rather than over the complete short-time waveform, the energy in a pitch period perhaps more accurately reflecting the energy in the vocal tract impulse response.

We begin with the all-pole transfer function

(5.22)

H(z) = A / ( 1 − Σ_{k=1}^{p} αk z^{−k} )

where αk for k = 1, 2 … p are determined by the autocorrelation method using short-time autocorrelation coefficients rn[τ], τ = 0, 1, 2 … p. Transforming Equation (5.22) to the time domain, we have

(5.23)

h[m] = Σ_{k=1}^{p} αk h[m − k] + A δ[m]

where δ[m] is the unit sample function. Multiplying Equation (5.23) by h[m − i] results in

(5.24)

h[m] h[m − i] = Σ_{k=1}^{p} αk h[m − k] h[m − i] + A δ[m] h[m − i]

and letting i = 0 and summing over m gives

(5.25)

rh[0] = Σ_{k=1}^{p} αk rh[k] + A h[0]

Because h[0] = A from Equation (5.23), the last term in Equation (5.25) becomes A² and so Equation (5.25) can be written as

(5.26)

rh[0] = Σ_{k=1}^{p} αk rh[k] + A²

Under the condition that

Energy in h[m] = Energy in short time measurement sn[m]

or

(5.27)

rh[0] = rn[0]

then the autocorrelation matching property holds, i.e.,

(5.28)

rh[τ] = rn[τ],    |τ| ≤ p

Therefore, from Equations (5.26) and (5.28),

(5.29)

rn[0] = Σ_{k=1}^{p} αk rn[k] + A²

Finally, from an earlier result for the prediction error (Equation (5.15)),

(5.30)

A² = rn[0] − Σ_{k=1}^{p} αk rn[k] = En

where En is the average minimum prediction error for the (optimal) pth-order predictor. Therefore, requiring that the energy in the all-pole impulse response h[m] equals the energy in the measurement sn[m] leads to a squared gain equal to the minimum prediction error.
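
A numerical check of this result (our own sketch, with a hypothetical two-pole model): generate the impulse response of H(z) = A/(1 − 0.9z⁻¹ + 0.2z⁻²) with true gain A = 2, solve the order-2 normal equations from its autocorrelation, and evaluate Equation (5.30):

```python
import numpy as np

# Impulse response of H(z) = A / (1 - 0.9 z^-1 + 0.2 z^-2), true gain A = 2
N, A_true = 200, 2.0
h = np.zeros(N)
h[0] = A_true                       # h[0] = A, as used below Equation (5.25)
h[1] = 0.9 * h[0]
for n in range(2, N):
    h[n] = 0.9 * h[n - 1] - 0.2 * h[n - 2]

r = np.array([np.dot(h[:N - t], h[t:]) for t in range(3)])

# Autocorrelation-method predictor coefficients
alpha = np.linalg.solve(np.array([[r[0], r[1]], [r[1], r[0]]]),
                        np.array([r[1], r[2]]))

# Equation (5.30): squared gain = minimum average prediction error
A_sq = r[0] - alpha[0] * r[1] - alpha[1] * r[2]
```

With the prediction error defined over the entire impulse response, A_sq recovers the true squared gain of 4, i.e., the prediction error reduces to the single sample Aδ[m].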

5.3.5 Lattice Filter Formulation of the Inverse Filter

Recall that we wrote the pth-order model as

H(z) = A / A(z)

where

A(z) = 1 − Σ_{k=1}^{p} αk z^{−k}

was referred to as the inverse filter or as the prediction error filter. In this section, we exploit the Levinson recursion, together with prediction error filters at different stages of the recursion, to obtain the lattice filter formulation of the autocorrelation method [15],[21].

Lattice Structure — Consider the ith-order prediction error filter

A_i(z) = 1 − Σ_{k=1}^{i} α_k^{(i)} z^{−k}

which we can think of as the ith stage of the Levinson recursion. The output of Ai(z) for the short-time input sequence sn[m] is given by (henceforth, for simplicity in this section, we again drop the n subscript notation)

(5.31)

e_i[m] = s[m] − Σ_{k=1}^{i} α_k^{(i)} s[m − k]

We can think of ei[m] as the forward prediction error sequence for the i th-order prediction error (inverse) filter. In the z-domain, this forward prediction error is given by

(5.32)

E_i(z) = A_i(z) S(z)

Consider now the counterpart to the forward prediction error sequence, i.e., we define the backward prediction error sequence as

(5.33)

b_i[m] = s[m − i] − Σ_{k=1}^{i} α_k^{(i)} s[m − i + k]

which represents the prediction of s[m − i] from i samples of the future input, i.e., the samples s[m − i + k] for k = 1, 2, … i follow the sample s[m − i], as illustrated in Figure 5.9a. The z-transform of bi[m] can be found as follows. Taking the z-transform of Equation (5.33),

B_i(z) = [ z^{−i} − Σ_{k=1}^{i} α_k^{(i)} z^{−(i−k)} ] S(z)

and because z^{−i} A_i(1/z) = z^{−i} − Σ_{k=1}^{i} α_k^{(i)} z^{−(i−k)}, then

(5.34)

B_i(z) = z^{−i} A_i(1/z) S(z)

With the above time- and z-domain expressions for the forward and backward prediction errors, and using step S3 of the Levinson recursion, i.e., α_j^{(i)} = α_j^{(i−1)} − ki α_{i−j}^{(i−1)} for 1 ≤ j ≤ i − 1, with much algebra we can arrive at the following relations (Appendix 5.B):

(5.35)

e_i[m] = e_{i−1}[m] − ki b_{i−1}[m − 1]
b_i[m] = b_{i−1}[m − 1] − ki e_{i−1}[m]

which can be interpreted as the forward and backward prediction error sequences, respectively, for the i th-order predictor in terms of prediction errors of the (i − 1) th-order predictor. The recursion in Equation (5.35) begins with the 0th-order predictor giving

e0[m] = s[m]
b0[m] = s[m]

Figure 5.9 Lattice formulation of the linear prediction solution [15], [21]: (a) forward and backward prediction using an i th-order predictor; (b) flow graph of the lattice implementation.

SOURCE: L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals [21]. ©1978, Pearson Education, Inc. Used by permission.

Image

because the 0th-order predictor is equivalent to no prediction. These equations can be depicted by a flow graph which we refer to as a lattice network (Figure 5.9b). The output of the pth stage of the lattice equals the forward prediction error for the pth-order predictor, i.e., ep[m] = e[m], which equals the output of A(z). When the underlying measurement follows an all-pole model, the output of the lattice is therefore an impulse for αk = ak. The lattice represents an elegant implementation of the prediction error (inverse) filter A(z).

The partial correlation coefficients ki required in the lattice are obtained by way of the Levinson recursion. This requires computation of the predictor coefficients at each stage. On the other hand, it was shown by Itakura and Saito [11] that the partial correlation coefficients could (remarkably!) be found recursively through the prediction errors derived from the lattice. In particular, the partial correlation coefficients can be written as

ki = Σ_m e_{i−1}[m] b_{i−1}[m − 1] / sqrt( ( Σ_m e_{i−1}²[m] ) ( Σ_m b_{i−1}²[m − 1] ) )

which is the correlation at lag zero between the forward and backward prediction errors at stage i − 1, normalized by the square root of the product of their energies; hence the terminology, “partial correlation (PARCOR) coefficients.” This implies that we can compute the required ki without computation of the predictor coefficients. We can then use these computed ki’s in the Levinson recursion without ever having to compute the autocorrelation coefficients rn[τ]. We return to this result in discussion of alternative lattice structures for linear prediction analysis and synthesis.
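
This computation can be sketched directly from the lattice relations of Equation (5.35) (a minimal illustration of the Itakura–Saito formula; the function name and zero-padding choice are ours). No predictor or autocorrelation coefficients are formed:

```python
import numpy as np

def parcor_lattice(s, p):
    """PARCOR coefficients k_1..k_p from forward/backward prediction
    errors via the lattice recursion of Equation (5.35)."""
    # Windowed data: zero outside the analysis interval; pad so that the
    # error sequences, which extend p samples past the end, are kept whole.
    e = np.concatenate([np.asarray(s, dtype=float), np.zeros(p)])  # e_0[m] = s[m]
    b = e.copy()                                                   # b_0[m] = s[m]
    k = np.zeros(p)
    for i in range(p):
        b1 = np.concatenate([[0.0], b[:-1]])                       # b_{i-1}[m-1]
        # normalized cross-correlation at lag zero
        k[i] = np.dot(e, b1) / np.sqrt(np.dot(e, e) * np.dot(b1, b1))
        e, b = e - k[i] * b1, b1 - k[i] * e                        # Equation (5.35)
    return k
```

For windowed data, zero outside the analysis interval as in the autocorrelation method, these ki’s equal the ones produced by the Levinson recursion.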

Relation of Linear Prediction to the Lossless Tube Model — The recursive solution to the linear prediction estimation bears a strong resemblance to the recursion given in Chapter 4 for the lossless tube model. Recall that for the lossless concatenated tube model, with glottal impedance Zg(z) = ∞, i.e., an open circuit, we wrote a recursion for computing the denominator polynomial in the all-pole transfer function V(z) from the reflection coefficients rk, i.e., for

Image

the recursion for the denominator polynomial D(z) is given by

(5.36)

D_0(z) = 1
D_k(z) = D_{k−1}(z) + rk z^{−k} D_{k−1}(1/z),    k = 1, 2, …, N
D(z) = D_N(z)

where N is the number of tubes and where the reflection coefficient rk is a function of cross-sectional areas of successive tubes, i.e.,

(5.37)

rk = ( A_{k+1} − A_k ) / ( A_{k+1} + A_k )

Likewise, the Levinson recursion for obtaining the denominator in the all-pole transfer function

H(z) = A / ( 1 − Σ_{k=1}^{p} αk z^{−k} ) = A / A_p(z)

can be written in the z-domain by mapping the coefficient update of step S3 and its starting and ending conditions to obtain (Appendix 5.B)

A_0(z) = 1
A_i(z) = A_{i−1}(z) − ki z^{−i} A_{i−1}(1/z),    i = 1, 2, …, p

where ki are the partial correlation coefficients that are determined at each step in the Levinson recursion. In this recursion, the starting condition A0(z) = 1 corresponds to the zeroth-order predictor, which performs no prediction. The two recursions are seen to be identical when the reflection coefficients in the former equal the negative of the partial correlation coefficients in the latter, i.e., when ri = −ki, then Di(z) = Ai(z), showing the relation of linear prediction analysis (the autocorrelation method) with the lossless concatenated tube model.

If we could measure the cross-sectional areas of the tubes in the concatenated tube model, then we would know the reflection coefficients because rk = (Ak+1 − Ak)/(Ak+1 + Ak). And suppose we set the partial correlation coefficients to the negative of the reflection coefficients, i.e., ki = −ri. Then, as we noted earlier, the two above recursions are identical, and so the Levinson recursion yields at its last (pth) step the denominator of the transfer function of the vocal tract, i.e.,

D(z) = Ap(z).

Alternatively, if we don’t know the cross-sectional area ratios, we can first compute the autocorrelation coefficients from the speech measurement. We then solve the Levinson recursion, from which we can obtain the relative cross-sectional areas given by Ak+1/Ak = (1 + rk)/(1 − rk), with rk = −kk. The Levinson recursion thus provides a means for obtaining vocal tract shape from a speech measurement.
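
A sketch of this step-down from the lip area (our own illustration; the indexing convention, with a[0] nearest the glottis, and the lip-area normalization A_lip are assumptions):

```python
def area_function(k, A_lip=1.0):
    """Relative tube areas from PARCOR coefficients, using r_i = -k_i and
    Equation (5.37): r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k)."""
    r = [-ki for ki in k]              # reflection coefficients r_i = -k_i
    a = [0.0] * (len(k) + 1)
    a[-1] = A_lip                      # normalize by the lip-opening area
    for i in range(len(k), 0, -1):     # step down toward the glottis
        a[i - 1] = a[i] * (1.0 - r[i - 1]) / (1.0 + r[i - 1])
    return a                           # a[0] near the glottis ... a[-1] at the lips
```

Inverting Equation (5.37) on the returned areas reproduces ri = −ki, confirming consistency of the step-down.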

Observe that we have not included boundary conditions in the above transfer function, i.e., that the transfer function V(z) represents the ratio between an ideal volume velocity at the glottis and at the lips:

V(z) = U_L(z) / U_g(z)

Our speech pressure measurement at the lips output, however, has embedded within it the glottal shape G(z), as well as the radiation load R(z). Recall that in the voiced case (with no vocal tract zeros), the complete transfer function is given by

H(z) = A (1 − α z^{−1}) V(z) / (1 − βz)²

where the factors α and β represent a zero and double pole due to the radiation and glottal shape, respectively. The presence of the glottal shape introduces poles that are not part of the vocal tract. As we saw earlier, the presence of the zero from R(z) can also be thought of as a multiple-pole contribution. Because these poles are not part of the vocal tract, they distort the relative area function estimate. Wakita has observed that, because the net effect of the radiation and glottal shape is typically a 6 dB/octave fall-off contribution to the spectral tilt of the measured spectrum, the influence of the glottal flow shape and radiation load can be approximately removed with a pre-emphasis giving a 6 dB/octave spectral rise [27]. Observe that because we have modeled the frequency response of the glottal flow shape with two maximum-phase poles, the pre-emphasis must account for this phase characteristic if a complete equalization is to be performed. However, because the autocorrelation method of linear prediction discards phase, the glottal phase component, in practice (for this application), need not be removed.

Figure 5.10 Estimation of vocal tract area functions using the autocorrelation method of linear prediction: (a) the vowel /a/; (b) the vowel /i/. The dotted curve is the cross-section measurement made for a Russian vowel (from Fant [27]), while the solid curve is the cross-section tube estimates based on linear prediction using the American vowel counterpart.

SOURCE: H. Wakita, “Estimation of the Vocal Tract Shape by Optimal Inverse Filtering and Acoustic/Articulatory Conversion Methods” [27]. ©1972, H. Wakita. Used by permission.

Image

Example 5.5       Two examples are given in Figure 5.10 that show good matches to measured vocal tract area functions for the vowels /a/ and /i/ derived from estimates of the partial correlation coefficients [27]. In these examples, a cross-sectional area AN for the last tube in the concatenated tube model was obtained from an average cross-sectional area of the lip opening, thus allowing for absolute area function measurements, rather than the above relative measures. With AN, we can obtain AN−1, from which we obtain AN−2, and so on. For the examples in Figure 5.10, the speech bandwidth is 3500 Hz and a seven-tube model was assumed, allowing for three resonances in the middle of the band and one at either band edge.4 In this approach, Wakita found that the area function estimates are sensitive to the spectral slope of the glottal/radiation equalization, the area function of the lips used for normalization, and the accuracy of the all-pole model parameter estimates, so that improvements in these characteristics should give improved cross sections. Furthermore, the lossless concatenated tube model does not allow distribution of energy loss along the vocal tract; accounting for this limitation, as well as for other effects such as vibrating vocal tract walls, nonlinearities, and distributed sources not represented in the model, may lead to refined area function estimates.

4 Recall from Chapter 4 that two tubes are needed to represent one complex-conjugate pole pair associated with a typical resonance. Thus, seven tubes can account for three complex-conjugate pole pairs and one solitary pole at ω = 0 or ω = π.

5.3.6 Frequency-Domain Interpretation

We have defined for linear prediction analysis a mean-squared prediction error (Equation (5.6)) in the time-domain. In this section, we show that this time-domain error has a frequency-domain counterpart, based on the difference in the log-magnitude of the measurement and model frequency responses, that leads to our earlier solution for the all-pole model coefficients and, in addition, to a rigorous way to justify our earlier (heuristic) selection of the gain A. Furthermore, in Section 5.5.2 this frequency-domain error leads to insights into the performance of the autocorrelation method of linear prediction, different from insights that can be obtained in the time domain. We here outline the proof of this time-frequency equivalence; the details are left as an exercise (Exercise 5.17).

Consider the all-pole model of speech production

H(ω) = A / A(ω)

with

A(ω) = 1 − Σ_{k=1}^{p} αk e^{−jωk}

Suppose we define a new frequency-domain error function [15]

(5.38)

I = (1/2π) ∫_{−π}^{π} [ e^{Q(ω)} − Q(ω) − 1 ] dω

with Q(ω) defined by the difference of the log-magnitude of measured and modeled spectra, i.e.,

(5.39)

Q(ω) = log |S(ω)|² − log |H(ω)|²

where S(ω) is the Fourier transform of the short-time segment of speech being modeled. Our goal is the minimization of the error criterion I, with respect to the unknown αk’s and unknown gain A.

Using Equation (5.39), we can write Q(ω) as

Q(ω) = log |E(ω)|² − log A²

where E(ω) = A(ω)S(ω) is the Fourier transform of the prediction error sequence. Suppose now that the zeros of A(z) are constrained to lie inside the unit circle. Writing |A(ω)|² = A(ω)A*(ω), it follows that

(1/2π) ∫_{−π}^{π} log |A(ω)|² dω = 0

from which we can show that

(1/2π) ∫_{−π}^{π} Q(ω) dω = (1/2π) ∫_{−π}^{π} log |S(ω)|² dω − log A²

Substituting Q(ω) into I, and using the previous result, we can then show that minimization of I with respect to the unknown filter coefficients αk is equivalent to minimization of the average squared prediction-error filter output Σn e²[n], and hence yields the same solution as the autocorrelation method of linear prediction! Finally, it is possible to show that minimization of I with respect to the unknown A² gives the optimal gain solution

A² = En

i.e., the mean-squared linear prediction error, and, therefore, the optimal A² is the same as the solution obtained by the heuristic energy-matching approach of the previous section, where we constrained the energy in our all-pole estimated impulse response h[n] to equal the energy in the measurement s[n]. The frequency-domain error function I thus provides a rigorous way of obtaining both the unknown filter coefficients and the unknown gain.

Jumping ahead to Figure 5.14a, we see an illustration of the function f(Q) = e^Q − Q − 1 used in the error criterion I. From this view of the error criterion, we can argue that linear prediction analysis tends to fit spectral peaks more accurately than spectral valleys. We return to this observation in our discussion of the performance of the autocorrelation method.
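
The asymmetry behind that observation is easy to verify numerically (our own check): for equal-magnitude log-spectral errors, f penalizes the model falling below the measurement (Q > 0, as at an underfit spectral peak) more heavily than falling above it (Q < 0, as in a valley):

```python
import numpy as np

# Error weighting of the criterion I: f(Q) = exp(Q) - Q - 1
f = lambda Q: np.exp(Q) - Q - 1.0

# f(1) = e - 2 ~ 0.718 versus f(-1) = 1/e ~ 0.368: under-modeling a peak
# costs roughly twice as much as over-modeling a valley by the same amount
```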

5.4 Linear Prediction Analysis of Stochastic Speech Sounds

We motivated linear prediction analysis with the observation that for a single impulse or periodic impulse train input to an all-pole vocal tract, the prediction error is zero “most of the time.” Such analysis therefore appears not to be applicable to speech sounds with fricative or aspirated sources modeled as a stochastic (or random) process (Appendix 5.A), specifically, often as a white noise sequence. In this section we see that the autocorrelation method of linear prediction, however, can be formulated for the stochastic case where a white noise input takes on the role of the single impulse. The solution to a stochastic optimization problem, analogous to the minimization of the mean-squared error function En in Section 5.3, leads to normal equations which are the stochastic counterparts to our earlier solution [14].

5.4.1 Formulation

Consider a speech sound where turbulence is created at a constriction in the vocal tract or at a constricted glottis as, for example, with unvoiced fricatives and breathy sounds, respectively.5 One model of the generation of this sound is the excitation of an all-pole vocal tract with a sample function of white noise uo[n] that represents the turbulence at the constriction and that is characterized by zero mean and unit variance, i.e., E(uo[n]) = 0 and E(uo²[n]) = 1, where the operator E(·) denotes expected value (Appendix 5.A) and where the subscript “o” denotes that the source turbulence can occur anywhere along the oral (or pharyngeal) tract and not necessarily at the glottis. The difference equation representation of the output waveform is given by

5 As we have seen in Example 3.4, for voiced fricatives, turbulence at a vocal tract constriction is modulated by the cyclic glottal flow. This nonlinear model and other nonlinear models of sound production, e.g., distributed stochastic sources due to vortical flow, for the moment, we do not consider.

(5.40)

s[n] = Σ_{k=1}^{p} ak s[n − k] + A uo[n]

where s[n] and uo[n] are each sample functions of a random process and where A denotes the gain factor for the white noise source. The frequency-domain representation of the stochastic output is given by the power density spectrum or power spectrum (for clarity, notationally different from that in Appendix 5.A) of the output waveform s[n], expressed as

(5.41)

Ps(ω) = A² |V(ω) R(ω)|²

where there is no glottal pulse shape component as in the voiced speech case. Equation (5.41) is the stochastic counterpart to our frequency response model of voiced speech with an impulse input.

An estimate of the power spectrum is obtained by computing the squared magnitude of the Fourier transform of s[n] over an analysis window of finite duration Nw and dividing by Nw, i.e.,

(1/Nw) |S_Nw(ω)|²,    where S_Nw(ω) = Σ_{n=0}^{Nw−1} s[n] e^{−jωn}

which is referred to as the periodogram and whose expected value is the desired power spectrum smoothed by the Fourier transform of the rectangular window w[n] = 1 for 0 ≤ n < Nw (Appendix 5.A) [17]. The periodogram fluctuates about the power spectrum with a variance that is proportional to the square of the power spectrum for large Nw. Various smoothing techniques have been developed to reduce this variance, but at the expense of spectral resolution [17]. For large Nw, where end effects at the boundaries of the data segment are negligible [17], we can write

(1/Nw) |S_Nw(ω)|² ≈ |H(ω)|² (1/Nw) |U_oNw(ω)|²

where (1/Nw)|UoNw(ω)|² is the Nw-point periodogram of the white noise input; it is shaped by the square of the system frequency response magnitude, |H(ω)|² = |A V(ω)R(ω)|², to form the Nw-point periodogram of the speech waveform.

As with the voiced case, we define the prediction error between s[n] and its prediction s̃[n] as

(5.42)

e[n] = s[n] − s̃[n]

where the prediction s̃[n] is given by

s̃[n] = Σ_{k=1}^{p} αk s[n − k]

And, as also with the voiced case, when the prediction coefficients equal the actual coefficients of the underlying all-pole model, i.e., αk = ak, the prediction error equals the source input Auo[n]. Under this condition, the expected value of the squared prediction error can be shown to equal the variance of the input sequence; specifically, because the unit-variance white-noise input is weighted by the gain A, we have the prediction error variance E(e²[n]) = A². As in the deterministic case, the resulting filter A(z) = 1 − Σ_{k=1}^{p} αk z^{−k} is called the inverse filter. When the underlying system is all-pole of order p and αk = ak, then A(z) “whitens” the measurement, i.e., as we stated, the output of A(z) with input s[n] is the white-noise process Auo[n]. This is analogous to the deterministic case, where the inverse filter output is an impulse or impulse train. Consequently, we can take a path to finding the unknown αk’s that is similar to that for the deterministic case. However, unlike the deterministic case, i.e., an impulse or impulse train as input, the prediction error is not zero over contiguous time intervals.

5.4.2 Error Minimization

In the stochastic case, we define an error function as the expected value of the squared prediction error

(5.43)

E( e²[n] ) = E( ( s[n] − s̃[n] )² )

where e[n] and s[n] are viewed as sample functions from a random process and s̃[n] is the prediction defined in Equation (5.42). Minimizing Equation (5.43) with respect to the unknown αk, we obtain the normal equations for the stochastic case. We arrive at the optimizing solution by taking derivatives with respect to αk, as in the deterministic case, or by using the Projection Theorem for linear least-squared-error optimization, which states that the minimum error e[n] must be orthogonal to the subspace spanned by the signal basis that makes up our prediction.6 In the latter case, the “basis vectors” are the observations s[n − k], so we require the orthogonality condition

6 We could have used this same approach in optimization for the deterministic case by replacing the expected value operator by the sum operator.

E(e[n] s[n − k]) = 0,       k = 1, 2, … p.

Because the white noise input is stationary in a stochastic sense (Appendix 5.A) and the vocal tract filter is linear and time-invariant, the output is also stationary. (The reader should argue the validity of this property.) Under this condition, the optimizing solution can be found as

(5.44)

Σ_{k=1}^{p} αk r[i − k] = r[i],    i = 1, 2, …, p

where the elements of the p × p matrix R, denoted by Φ[i, k], are defined by

Φ[i, k] = r[τ]|τ = i − k = E(s[m] s[m + τ]),     1 ≤ i ≤ p,     0 ≤ k ≤ p

and where r[τ] is the autocorrelation function of the random process (denoted by Imagess[τ] in Appendix 5.A). The expected value of the squared prediction error, i.e., its variance, for the minimization solution is given by

Image

Because the random process is stationary, all elements along each diagonal are equal and the matrix is symmetric about the main diagonal. Thus the matrix is Toeplitz. Consequently, the normal equations for the stochastic case are identical in structure to those for the autocorrelation method solution in the deterministic case and can be solved similarly. When s[n] follows a pth-order all-pole model, then Equation (5.44) for order p yields the true all-pole coefficients; this can be argued in a style similar to that for the deterministic case where Equation (5.17) was applied. Observe that r[τ] = E(s[m]s[m + τ]) represents an ensemble average or, equivalently, a time average (over all time) under an ergodicity constraint (Appendix 5.A). In practice, however, r[τ] must be estimated from a finite-length sequence and, as with the deterministic case, the manner in which the boundaries of the data segment are treated leads to the autocorrelation or covariance method of solution.
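Because the matrix is Toeplitz, the Levinson recursion applies directly. The following sketch (not from the text; the AR(2) numbers are chosen for illustration) solves the normal equations for a process whose exact autocorrelation follows from the Yule-Walker relations:

```python
import numpy as np

def levinson(r, p):
    """Solve the Toeplitz normal equations with autocorrelations r[0..p]
    by the Levinson recursion.  Returns the prediction coefficients
    alpha[1..p] and the final prediction-error variance E."""
    a = np.zeros(p)
    E = float(r[0])
    for i in range(1, p + 1):
        a_prev = a[:i - 1]
        # partial correlation (reflection) coefficient for order i
        k = (r[i] - np.dot(a_prev, r[i - 1:0:-1])) / E
        a[:i - 1] = a_prev - k * a_prev[::-1]
        a[i - 1] = k
        E *= (1.0 - k * k)
    return a, E

# Exact autocorrelation (normalized so r[0] = 1) of the AR(2) process
# s[n] = 0.9 s[n-1] - 0.2 s[n-2] + u[n], from the Yule-Walker relations:
r = np.array([1.0, 0.9 / 1.2, 0.9 * (0.9 / 1.2) - 0.2])  # r[0], r[1], r[2]
alpha, E = levinson(r, p=2)
# alpha recovers the true coefficients [0.9, -0.2] exactly.
```

Consistent with the discussion above, when αk = ak the final prediction-error variance E equals the input variance implied by the normalized autocorrelation.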

5.4.3 Autocorrelation Method

In the autocorrelation method, the correlation coefficients are estimated from a truncated speech segment, i.e.,

Image

where the data is assumed zero outside of the truncation interval. The discrete-time Fourier transform of this autocorrelation estimate is the periodogram of sn[m], as defined in the previous section. Except for the scale factor Image, this estimate is identical to that in the autocorrelation method for the deterministic case. As in the deterministic case, therefore, we form a matrix Rn which is identical (except for the scale factor Image) to that given in Equation (5.16) of Section 5.3.3. As before, because the resulting matrix Rn is Toeplitz, the Levinson recursion can be used for the computation of αk by constructing successively higher-order predictors. This leads again to a lattice filter implementation of the inverse filter A(z) and lends itself to an interpretation of “progressive whitening,” with more and more of the correlation of the input signal removed at each stage. The formal stochastic development of this property is beyond the scope of our treatment and can be found in [26].
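The stated equivalence between the biased autocorrelation estimate and the periodogram can be verified numerically. This is an illustrative sketch (not from the text); the segment s is an arbitrary stand-in for a windowed measurement:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal(64)        # stand-in for a windowed segment s_n[m]
N = len(s)

# Biased autocorrelation estimate r[tau] = (1/N) sum_m s[m] s[m + tau]
r = np.correlate(s, s, mode="full") / N          # lags -(N-1) .. (N-1)

# Its Fourier transform is the periodogram |S(w)|^2 / N; compare on a
# (2N-1)-point FFT grid, where no autocorrelation lags are aliased.
P = np.abs(np.fft.fft(s, 2 * N - 1)) ** 2 / N
r_circ = np.concatenate([r[N - 1:], r[:N - 1]])  # FFT (circular) lag order
```

The inverse FFT of the periodogram P reproduces r_circ to machine precision, confirming the transform pair.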

Finally, the gain A of the source remains to be determined. Given the stochastic nature of the signals, it is reasonable to select a gain such that the average power in the measurement equals the average power in the response to a white-noise input. In a stationary stochastic framework, the average power of a zero-mean sequence is its variance, i.e., the mean of the squared sequence (Appendix 5.A), so that we require that the variance of the measurement equal the variance of the model output. With a known autocorrelation function r[τ], this condition is equivalent to writing

r[0] = A2rh[0]

where rh[τ] is the autocorrelation function of the impulse response of the estimated all-pole transfer function. The squared gain is thus Image. When the autocorrelation r[τ] is not known, r[0] is replaced by its estimate rn[0].
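As a sketch of this gain computation (the single-pole “estimated” model and the measured power are hypothetical numbers), rh[0] can be computed from a long truncation of the model impulse response and the squared gain set to r[0]/rh[0]:

```python
import numpy as np

def allpole_impulse_response(a, n):
    """First n samples of h[m] for H(z) = 1/(1 - sum_k a[k] z^-k)."""
    h = np.zeros(n)
    for m in range(n):
        h[m] = (1.0 if m == 0 else 0.0) + sum(
            ak * h[m - k] for k, ak in enumerate(a, 1) if m - k >= 0)
    return h

# Hypothetical estimated model H(z) = 1/(1 - 0.5 z^-1), so that
# rh[0] = sum_m h[m]^2 = sum_m 0.25^m = 1/(1 - 0.25) = 4/3.
h = allpole_impulse_response([0.5], 2000)   # well past the decay of h[m]
rh0 = float(np.dot(h, h))
r0 = 2.0                                    # measured power (hypothetical)
A2 = r0 / rh0                               # squared gain from r[0] = A^2 rh[0]
```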

5.5 Criterion of “Goodness”

We can evaluate the accuracy of the linear prediction analysis in either the time domain or the frequency domain, and either through the system or source estimation.

5.5.1 Time Domain

Suppose that the underlying speech model is all-pole of order p and suppose that the autocorrelation method is used in the estimation of the coefficients of the predictor polynomial P(z). If we estimate the predictor coefficients exactly then, with a measurement s[n] as input, the output of the inverse filter A(z) = 1 − P(z), i.e., the prediction error, is a perfect impulse train for voiced speech, a single impulse for a plosive, and white noise for noisy (stochastic) speech. In practice, however, the autocorrelation method of linear prediction analysis does not yield such idealized outputs when the measurement s[n] is inverse filtered by the estimated system function A(z). We have seen that the true solution is not obtained exactly by the autocorrelation method, even when the vocal tract response follows an all-pole model. Furthermore, in a typical waveform segment, the actual vocal tract impulse response is not all-pole for a variety of reasons, including the presence of zeros due to the radiation load, nasalization, and a back vocal cavity during frication and plosives.7 In addition, the glottal flow shape, even when adequately modeled by a two-pole function, is not minimum phase, introducing an undesirable phase component in the output of the inverse filter, as demonstrated in the following example:

7 We have seen that although a zero can be approximated by a large number of poles, this is not an efficient modeling strategy.

Example 5.6       Consider the vocal tract transfer function model H(z) in Example 5.4. In that example, the speech waveform s[n] consisted of simply a single impulse response h[n], i.e., s[n] = h[n], in which the glottal flow waveform was embedded. We saw that application of the autocorrelation method converts the two maximum-phase poles of the glottal flow contribution to their minimum-phase counterparts. Writing the true frequency response as

S(ω) = H(ω) = Mh(ω)e^{j[θυ(ω) + θg(ω)]}

where θυ(ω) is the minimum-phase contribution from vocal tract poles inside the unit circle and θg(ω) is the maximum-phase contribution from glottal poles outside the unit circle, then the frequency response of the resulting inverse filter A(z) from the autocorrelation method is given by

A(ω) = A Mh(ω)^{−1} e^{j[−θυ(ω) + θg(ω)]}

where the maximum-phase glottal contribution becomes minimum-phase and where the source gain factor A is applied to cancel the gain in H(ω) because it is not part of A(ω). The frequency response of the output of the inverse filter A(z) with input s[n] is, therefore, given by

Image

The frequency response of the prediction error, therefore, can be viewed as having a flat Fourier transform magnitude distorted by an all-pass function e^{j2θg(ω)}. In the time domain, this is represented by a distorted impulse:

e[n] = δ[n] * ag[n]

where ag[n] is the inverse Fourier transform of e^{j2θg(ω)}. Image
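The conversion of maximum-phase components into minimum-phase counterparts can be demonstrated numerically. In the following sketch (all signals and coefficients are hypothetical), the measurement contains a maximum-phase pulse with a gradual attack and abrupt ending, yet the inverse filter estimated by the autocorrelation method has all of its roots strictly inside the unit circle, as the theory guarantees:

```python
import numpy as np

# Hypothetical "glottal" pulse: a time-reversed decaying exponential,
# i.e., a maximum-phase component (gradual attack, abrupt close).
g = 0.9 ** np.arange(40, -1, -1)

# Minimum-phase "vocal tract": A_v(z) = 1 - z^-1 + 0.81 z^-2, a resonance
# with poles of magnitude 0.9 inside the unit circle.
N = 200
x = np.concatenate([g, np.zeros(N - len(g))])
h = np.zeros(N)
for n in range(N):   # all-pole filtering of the pulse by 1/A_v(z)
    h[n] = x[n] + (h[n - 1] if n >= 1 else 0.0) \
                - (0.81 * h[n - 2] if n >= 2 else 0.0)

# Autocorrelation method of order p on the measurement h[n].
p = 6
r = np.array([np.dot(h[:N - t], h[t:]) for t in range(p + 1)])
R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
alpha = np.linalg.solve(R, r[1:])

# The estimated inverse filter A(z) = 1 - sum_k alpha_k z^-k is minimum phase:
roots = np.roots(np.concatenate([[1.0], -alpha]))
```

All roots lie inside the unit circle even though the measurement has a maximum-phase component, illustrating why the residual retains the all-pass distortion described above.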

Figure 5.11 Prediction error residuals for (a) voiced; (b) unvoiced plosive; (c) unvoiced fricative sounds. A 14th-order predictor is used for 5000 Hz-bandwidth signals.

Image

We conclude then that, because our assumed minimum-phase all-pole model, our estimation method, or both are inadequate, we cannot expect to see an idealized glottal source waveform at the output of the inverse filter. Figure 5.11 shows examples of inverse filtering segments of a vowel, unvoiced plosive, and unvoiced fricative with an A(z) derived from the autocorrelation method of linear prediction (order 14). The figure shows pictorially that the estimated all-pole model is not an adequate representation, particularly for the vowel and plosive sounds. The estimation was performed on 20-ms (Hamming) windowed speech segments. In reconstructing a residual from an entire utterance (we discuss synthesis methods shortly), one typically hears in the prediction error not a noisy buzz, as expected from an idealized residual comprising impulse trains, solitary impulses, and white noise, but rather, roughly, the speech itself, implying that some of the vocal tract spectrum is passing through the inverse filter.
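For contrast with the real-speech residuals of Figure 5.11, the following sketch (hypothetical coefficients) shows the idealized case: when a synthetic “voiced” measurement is inverse filtered with its exact A(z), the residual collapses to the driving impulse train:

```python
import numpy as np

# Synthetic "voiced" measurement: an impulse train of period 80 driving a
# known stable all-pole filter, s[n] = 1.2 s[n-1] - 0.6 s[n-2] + u[n].
N, P = 400, 80
u = np.zeros(N)
u[::P] = 1.0
s = np.zeros(N)
for n in range(N):
    s[n] = u[n] + (1.2 * s[n - 1] if n >= 1 else 0.0) \
                - (0.6 * s[n - 2] if n >= 2 else 0.0)

# Inverse filtering with the exact A(z) = 1 - 1.2 z^-1 + 0.6 z^-2
# returns the idealized residual: the impulse train itself.
e = np.convolve(s, np.array([1.0, -1.2, 0.6]))[:N]
```

With estimated (windowed) coefficients rather than the exact ones, the residual would instead resemble the imperfect traces of Figure 5.11.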

5.5.2 Frequency Domain

Alternatively, we can study the behavior of linear prediction analysis in the frequency domain with respect to how well the spectrum derived from linear prediction analysis matches the spectrum of a sequence that follows an all-pole model. We are also interested in evaluating the method when the measurement does not follow an all-pole model. In this section, we present properties of spectral estimation for periodic and stochastic sounds using the autocorrelation method of linear prediction analysis.

Consider a periodic impulse train source Image with Fourier transform denoted by Ug(ω), and vocal tract impulse response with all-pole frequency response, H(ω). We have seen that the Fourier transform of the windowed output sn[m] is given by

Image

where W(ω) is the window transform, Image is the fundamental frequency, and where, for simplicity, we remove reference to the window location n. Linear prediction analysis attempts to estimate |H(ω)| (via the autocorrelation function) and hence the spectral envelope of the harmonic spectrum S(ω) (introduced in Chapter 3), as illustrated in Figure 5.12a.

For stochastic sounds, we saw that we can express the periodogram of our white-noise driven model as

|SNw(ω)|² ≈ |H(ω)|² |UoNw(ω)|²

where |UoNw(ω)|² is the periodogram of an Nw-point sample function of white noise. Typically, the periodogram is “noisy,” having high variance; its expectation is a smooth version of the desired power spectrum |H(ω)|² (assuming a unity-variance input), i.e., the periodogram estimate is biased. For this case, we can interpret the goal of linear prediction analysis in the frequency domain as estimating the smoothed |H(ω)|², which can be viewed as the spectral envelope of the periodogram, as illustrated in Figure 5.12b. This perspective of estimating the spectral envelope is thus analogous to that for periodic sounds. We now consider frequency-domain properties for these two signal classes.

Properties — P1: Recall the autocorrelation matching property: When the gain of the all-pole model is chosen so that energy in the model and the measurement are equal, then their respective autocorrelation functions match for the first p lags. This implies for large p that |H(ω)| matches the Fourier transform magnitude of the windowed signal, |S(ω)|.
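Property P1 can be checked numerically. In this sketch (the measurement and model order are illustrative), the gain choice A² = r[0] − Σk αk r[k] from the autocorrelation method forces A²rh[τ] = r[τ] for the first p lags:

```python
import numpy as np

def impulse_response(alpha, n):
    """First n samples of h[m] for H(z) = 1/(1 - sum_k alpha[k] z^-k)."""
    h = np.zeros(n)
    for m in range(n):
        h[m] = (1.0 if m == 0 else 0.0) + sum(
            a * h[m - j] for j, a in enumerate(alpha, 1) if m - j >= 0)
    return h

# "Windowed measurement": a finite chunk of a damped two-pole response
# (coefficients are illustrative, not from the text).
s = impulse_response([1.2, -0.6], 100)
p = 2
r = np.array([np.dot(s[:len(s) - t], s[t:]) for t in range(p + 1)])
R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
alpha = np.linalg.solve(R, r[1:])
A2 = r[0] - np.dot(alpha, r[1:])   # squared gain = minimum prediction error

# Autocorrelation matching: A^2 rh[tau] = r[tau] for tau = 0, 1, ..., p.
h = impulse_response(alpha, 5000)  # long truncation of the model response
rh = np.array([np.dot(h[:len(h) - t], h[t:]) for t in range(p + 1)])
```

The first p + 1 lags of the (gain-scaled) model autocorrelation equal the measured lags; beyond lag p the two generally differ, which is why |H(ω)| only approaches |S(ω)| as p grows.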

Figure 5.12 Schematics of spectra of periodic and stochastic speech sounds: (a) harmonic spectrum of a voiced sound. For simplicity, we assume Image; (b) periodogram of a stochastic sound. For simplicity, we assume a unit variance white-noise source and that bias in the periodogram estimate is negligible.

Image

Figure 5.13 Linear prediction analysis of steady vowel sound with different model orders using the autocorrelation method: (a) order 6; (b) order 14; (c) order 24; (d) order 128. In each case, the all-pole spectral envelope (thick) is superimposed on the harmonic spectrum (thin), and the gain is computed according to Equation (5.30).

Image

Example 5.7       The spectral matching property for the autocorrelation method is shown in Figure 5.13 for a segment of a steady vowel sound. As the order p increases, the harmonic structure of S(ω) is revealed in the all-pole spectrum, indicating that we have overestimated the model order. Observe that this property can also be seen in the time domain; from Figure 5.7, as the order p increases, replicas of the autocorrelation function of h[n] have an increasingly stronger influence on the solution. Figure 5.13 also illustrates the result of underestimating the model order; as p decreases, the fine resonant structure of the underlying all-pole system is lost, with only the largest resonances being represented. The order p therefore can be thought of as controlling the degree of smoothing of the measured spectrum. In sound synthesis, over-estimation of p results in a “tonal” quality, with spurious resonances coming and going in time, while under-estimation of the order results in a “muffled” quality because resonant bandwidths are broadened and closely-spaced resonances are smeared. Image

From the autocorrelation matching property, we might expect H(ω) to converge to S(ω) as p → ∞. However, this convergence does not occur. Suppose in the autocorrelation method, we let the model order p go to infinity. Then, from the autocorrelation matching property, we have

Image

However, because the measurement will generally have both a minimum- and maximum-phase component,

Image

Therefore, the output of the inverse filter A(z) with p → ∞ and with input sn[m] is not the unit sample δ[m].

P2: Recall in our frequency-domain interpretation of the mean-squared prediction error (Section 5.3.6) that we alluded to the fact that linear prediction spectral analysis tends to fit spectral peaks more accurately than spectral valleys. We now look at this property in more detail. Recall the definition of the function Q(ω) as

Image

and recall the function

ƒ(Q) = e^Q − Q − 1

which is the integrand of the frequency-domain error in Equation (5.38). We see in Figure 5.14a that ƒ(Q) is asymmetric, with a steeper slope for positive Q. This implies that a larger error penalty is invoked when |S(ω)| > |H(ω)| than when |S(ω)| < |H(ω)|. In other words, the linear prediction error favors a good spectral fit near spectral peaks, as illustrated schematically in Figure 5.14b. This interpretation has implications when the underlying signal does not follow the assumed all-pole model [14]. For example, it implies that resonant peaks are better matched than valleys when zeros are present in the vocal tract transfer function. It also implies that the highest resonances are best matched when the model order is under-estimated. When the order is over-estimated, harmonics are best matched for voiced speech (consistent with our earlier observation), while spurious spectral peaks are best matched for unvoiced speech. This spectral matching property for an unvoiced plosive and unvoiced fricative is illustrated in Figure 5.15.
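The asymmetry of ƒ(Q) is easy to quantify. In this small sketch, equal-magnitude log-spectral deviations Q = +1 and Q = −1 (i.e., |S(ω)|² above or below |H(ω)|² by a factor of e) incur unequal penalties:

```python
import math

def f(Q):
    """Integrand of the frequency-domain error: f(Q) = e^Q - Q - 1."""
    return math.exp(Q) - Q - 1.0

over = f(1.0)    # model under the signal: |S|^2 = e * |H|^2
under = f(-1.0)  # model over the signal:  |S|^2 = |H|^2 / e
# over = e - 2 (about 0.718) exceeds under = 1/e (about 0.368):
# an under-estimated peak costs roughly twice what an over-estimated
# valley does, so the fit hugs the peaks.
```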

Figure 5.14 Illustration of favoring a match to spectral peaks in linear prediction analysis: (a) the asymmetric function ƒ(Q) = e^Q − Q − 1; (b) schematic that shows favoring spectral peaks over spectral valleys. Upper (solid) envelope is favored over lower (dashed) envelope.

Image

Figure 5.15 Spectral peak matching in linear prediction analysis for (a) an unvoiced plosive (“t” in “tea”) and (b) an unvoiced fricative component of an affricate (“ch” in “which”) of Figure 5.11. Order 14 was used in the autocorrelation method.

Image

5.6 Synthesis Based on All-Pole Modeling

We are now in a position to synthesize the waveform from model parameters estimated using linear prediction analysis. The synthesized signal is given by

Image

where u[n] is either a periodic impulse train, an impulse, or white noise, and the parameters αk and A are estimated by way of the normal equations and specific gain requirements. We use here the autocorrelation method because it gives a stable and efficient solution. Nevertheless, the covariance method does have its role in speech analysis/synthesis, an example of which we see later in the chapter. There remain a number of considerations in the design of the analyzer and synthesizer.

Window Duration: Typically, we select the window duration to be fixed and to fall in the range of 20–30 ms to give a satisfactory time-frequency tradeoff (Exercise 5.10). The duration can be refined to adapt to pitch, voicing state, and even phoneme class to account for different time-frequency resolution requirements.

Frame Interval: A typical rate at which to perform the analysis is 10 ms, the frame interval with which the analysis window slides, selected to approximately capture the time evolution of the speech signal. There may be more complicated strategies in which different parameters are updated and interpolated at different rates [21]. For example, the pitch and voicing state may be updated at a 10-ms rate while the system parameters are updated at a 20-ms rate, under the assumption that the vocal tract is changing more slowly than the source parameters.

Model Order: There are three components to be considered: the vocal tract, the source, and the radiation. For the vocal tract, from Chapters 3 and 4, we found it was typical to assume an average “resonant density” of one resonance per 1000 Hz. We assume for illustration a bandwidth of 5000 Hz; then 5 poles are required and hence a model order of p = 10 for the vocal tract representation. The radiation load is modeled in discrete time by one zero inside the unit circle, which we have shown is represented by a product of poles inside the unit circle. By convention, it has been found that about 4 poles are adequate in this representation [14]. Finally, we must account for the glottal flow; here we select our two-pole maximum-phase model. In summary, we require a total of 10 (vocal tract) + 4 (radiation) + 2 (glottal flow) = 16 poles. Keep in mind that, although the magnitude of the speech frequency response is approximately matched by this model, we lose the shape of the measured time-domain speech waveform since the frequency response phase of the actual glottal flow waveform is not preserved. This is because, as we saw in Example 5.4, the maximum-phase poles of the modeled glottal flow are transformed to their minimum-phase counterparts by the linear prediction analysis. In addition, any maximum-phase zeros that arise in the vocal tract transfer function are also transformed to their minimum-phase counterparts.

Voiced/Unvoiced State and Pitch Estimation: Estimation of these parameters is described in Chapter 10. Current analysis and synthesis strategies do not account for distinguishing between plosive and fricative unvoiced speech sound categories; they are typically lumped into the unvoiced state of a binary voiced/unvoiced state decision. The unvoiced state corresponds to the condition where the vocal folds are not vibrating. The speaker’s pitch is estimated during voiced regions only. A degree of voicing may also be desired in a more complex analysis and synthesis, where voicing and turbulence occur simultaneously, as with voiced fricatives and breathy vowels; we save such refinements for later chapters. We assume then that we are provided a binary voiced/unvoiced decision on each frame and a pitch estimate on each voiced frame.

Synthesis Structures: In synthesis, the source excitation is created by concatenating an impulse train during voiced regions, with spacing determined by the time-varying pitch contour, and white noise during unvoiced regions. In one strategy, the gain for the voiced state is determined by forcing the energy in the impulse response to equal the energy measured on each frame. In an unvoiced state, the gain is obtained by requiring that the variance of the measurement equal the variance of the model output. For each frame, either a series of periodic impulses or a random white-noise sequence is convolved with an impulse response hk[n] that changes on each (kth) frame; the output of each convolution is then overlapped and added with adjacent frame outputs (Figure 5.16). Alternatively, the filtering with hk[n] can be performed with the direct-form implementation (Chapter 2) motivated by the difference equation representation of its all-pole z-transform (Figure 5.17a). Filter values are updated on each frame, and a natural continuity exists in time across successive frames because old values are stored in delay elements. Other structures also exist, one derived directly from the concatenated-tube flow graph, and a second from an all-pole version of the all-zero lattice inverse filter in Figure 5.9 [16]. These two alternative implementations of Image are illustrated in Figure 5.17b,c. An advantage of the lattice structures is that they are expressed in terms of the reflection coefficients ri (or partial correlation coefficients ki), which are more suitable for quantization (Chapter 12) because their magnitude, as we saw earlier, does not exceed unity, unlike the linear prediction coefficients αk, which have a large dynamic range. This property also makes the reflection coefficients amenable to interpolation without violating the all-pole filter stability constraint (Exercise 5.15).
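The conversion between reflection coefficients and prediction coefficients that underlies the lattice structures can be sketched as follows (illustrative values; note that sign conventions for ki vary across texts, and here A(z) = 1 − Σk αk z^-k):

```python
import numpy as np

def reflection_to_lpc(k):
    """Step-up recursion: reflection (PARCOR) coefficients k_i ->
    prediction coefficients alpha_k of A(z) = 1 - sum_k alpha_k z^-k."""
    a = np.array([])
    for ki in k:
        a = np.concatenate([a - ki * a[::-1], [ki]])
    return a

def lpc_to_reflection(a):
    """Step-down recursion: prediction coefficients -> reflection coefficients."""
    a = np.asarray(a, dtype=float)
    k = []
    while a.size:
        ki = float(a[-1])
        k.append(ki)
        a = (a[:-1] + ki * a[-2::-1]) / (1.0 - ki * ki)
    return list(reversed(k))

# Hypothetical reflection coefficients; |k_i| < 1 guarantees a stable
# all-pole synthesis filter, which is why they interpolate safely.
k = [0.5, -0.3, 0.2]
alpha = reflection_to_lpc(k)       # -> [0.71, -0.43, 0.2]
k_back = lpc_to_reflection(alpha)  # round trip recovers k
```

Interpolating two sets of reflection coefficients, each with magnitudes below unity, yields magnitudes still below unity, preserving stability; interpolated αk carry no such guarantee.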

Figure 5.16 Overlap-add synthesis using an all-pole model. The waveform is generated frame by frame by convolutional synthesis and the filter output on each frame is overlapped and added with adjacent frame outputs.

Image

The synthesized waveform based on the autocorrelation method of linear prediction looks speech-like, but with loss of absolute phase structure because of its minimum-phase characteristic. As we see in the example of Figure 5.18, the reconstructed waveform is more “peaky” than the original; the idealized glottal flow, being made minimum-phase, is time-reversed and thus has a sharper attack than the actual glottal flow. The peakiness at the origin is consistent with a minimum-phase sequence having the greatest energy concentration near the origin of all causal sequences with the same frequency response magnitude. (This property was expressed in Equation (2.23) of Chapter 2.) This peakiness is said to impart a “buzzy” quality to the linear prediction synthesis. A number of techniques have been applied to reduce this buzziness, such as post-filtering the synthesized speech by all-pass dispersion networks. In another method, pre-emphasis to reduce the glottal contribution prior to linear prediction analysis is followed by postfiltering with a glottal flow-like waveform with a gradual attack [22]. In addition, in this method the glottal pulse duration was varied with the pitch period, the ratio of the open-to-closed glottal phase being maintained. Both the reduction in peakiness and the nonstationarity of the glottal pulse were observed to help reduce buzziness in the synthesis.

Figure 5.17 Alternative synthesis structures: (a) direct form; (b) signal flow graph from lossless tube model [21]; (c) all-pole lattice [16]. Open circles denote addition.

SOURCES: L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals [21]. ©1978, Pearson Education, Inc. Used by permission; J.H. McClellan, “Parametric Signal Modeling” [16]. ©1988, Pearson Education, Inc. Used by permission.

Image

In doing speech synthesis, we can investigate the importance of the various linear prediction parameters and the limitations of the approach. Quality8 degrades as, for example, the predictor order decreases [2]. As the order decreases, the number of poles is underestimated, thus smearing the desired frequency response and removing the lower-energy formants (Figure 5.13), resulting in a muffled and lowpass character to the reconstructed sound. The synthesis quality is also found to be a function of the speaker’s characteristics. For example, as the rate of articulation or the pitch of a speaker increases, the quality is found to degrade, the former illustrating the importance of the stationarity assumption in all-pole estimation over a 20-ms analysis window, and the latter illustrating the problematic nature of high-pitched speakers due to increased aliasing in the autocorrelation function with increasing pitch.

8 The term “quality” refers to speech attributes such as naturalness, intelligibility, and speaker recognizability. More discussion on quality is given in the introduction of Chapter 12 on speech coding.

Figure 5.18 Speech reconstruction based on linear prediction analysis using the autocorrelation method: (a) original; (b) synthesized. The reconstruction was performed with overlap-add synthesis and using a 14th-order all-pole model.

Image

5.7 Pole-Zero Estimation

In this section, we consider the more general problem of estimating parameters of a pole-zero rational z-transform model of a speech measurement s[n]. The rational transform is expressed as a quotient of a numerator and denominator polynomial, i.e.,

(5.45)

Image

where we assume that the signal model x[n] is causal with initial value at the origin n = 0 and that the gain term A is embedded in the numerator polynomial. The zeros in the numerator polynomial can be introduced by coupling of the nasal and oral cavities, by a back cavity behind a constriction, or by the radiation load, and the poles can be introduced by the vocal tract and glottal flow. In this section, however, we take a different perspective and view the glottal flow as a finite-length sequence, thus contributing zeros in the numerator of Equation (5.45). This view allows more flexibility in the glottal flow model, motivated by the observation that the glottal flow differs with speaker and speaking condition and contains fine structure such as aspiration and other perturbations.

Our objective is to estimate the numerator and denominator coefficients bk and ak, respectively. One approach is to minimize the mean-squared error between the model and the measurement, i.e., minimize Image with respect to the unknown coefficients. Using Parseval’s relation, we can map this error function to the frequency domain in terms of A(ω) and B(ω). Minimizing this error with respect to the unknown polynomial coefficients leads to equations that are nonlinear in these parameters and thus difficult to solve (Exercise 5.21) [16].

5.7.1 Linearization

An alternative approach to solving for the above polynomial coefficients is to linearize the parameter estimation problem by first cross-multiplying the model in Equation (5.45) to obtain

(5.46)

Image

which, in the time domain, is given by the difference equation

Image

from which we see that

(5.47)

Image

We can, therefore, set up an error minimization, involving the model x[n] and the measurement s[n], that solves for the predictor coefficients in Equation (5.47) and that is identical to the all-pole case. Specifically, let Image be the prediction of the measurement Image. We then minimize the error function Image with respect to the coefficients αk. We have assumed here the correct model order p; in practice, we would set p equal to twice the expected number of resonances for a particular bandwidth. Let us now think of the estimated coefficients of A(z) as a time sequence −αn for n = 0, 1, 2, …, p (α0 = −1). With Equation (5.46) as the basis, we then “inverse filter” the measurement s[n] by −αn to obtain an estimate of the numerator “sequence,” i.e.,

(5.48)

Image

where we denote the estimates of bn, the coefficients of the numerator polynomial B(z), by βn. We have therefore separated the pole estimation problem from the zero estimation problem while keeping the solution linear in the unknown parameters.
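Equation (5.48) can be illustrated with a small numerical sketch (hypothetical pole-zero model). When the denominator estimate is exact, inverse filtering the measurement by −αn returns the numerator coefficients followed by zeros:

```python
import numpy as np

# Hypothetical pole-zero model: B(z) = 1 + 0.5 z^-1 - 0.3 z^-2 over
# A(z) = 1 - 0.9 z^-1 + 0.2 z^-2 (poles at z = 0.4 and z = 0.5).
b = np.array([1.0, 0.5, -0.3])
N = 50
x = np.zeros(N)
for n in range(N):
    x[n] = (b[n] if n < len(b) else 0.0) \
        + (0.9 * x[n - 1] if n >= 1 else 0.0) \
        - (0.2 * x[n - 2] if n >= 2 else 0.0)

# "Inverse filter" the measurement by -alpha_n = [1, -0.9, 0.2] (here the
# exact denominator): beta_n = x[n] - sum_k alpha_k x[n-k] recovers B(z).
beta = np.convolve(x, np.array([1.0, -0.9, 0.2]))[:N]
```

With an estimated, inexact A(z), the recovered βn would not terminate cleanly after the true numerator order, which motivates the least-squares formulations below.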

In practice, however, the inverse filtering of s[n] does not lead to reliable estimates of the numerator coefficients, and thus of the true zeros of the underlying signal. This is because the method relies on an accurate estimate of A(z) and because it assumes the relation S(z)A(z) = B(z), both of which, in reality, suffer from inexactness. More reliable estimates rely on formulating a least-squared-error estimation for the unknown βn’s. In one such approach, the impulse response corresponding to the estimated all-pole component of the system function, i.e., Image, is first obtained. Denoting this impulse response by hα[n], we then define an error function

(5.49)

Image

where the sequence βn to be estimated represents the unknown coefficients of B(z) in the time domain, and where the term βn * hα[n] represents an estimate of the rational z-transform model in the time domain. We then minimize the error function Image with respect to the unknown βn for n = 0, 1, 2, … q. (As with the number of poles, the number of zeros is not known and must be surmised or estimated.) This leads to a set of normal equations that require the autocorrelation function of the impulse response hα[n] and the cross-correlation function of hα[n] with s[n] (Exercise 5.9). This approach was originally introduced by Shanks [12],[23]. Furthermore, it can be shown using Parseval’s Theorem that the mean-squared error E in the frequency domain is given by

(5.50)

Image

Shanks’ method thus can be interpreted as estimating the coefficients of the numerator B(z) by fitting a polynomial to the Fourier transform of the linear prediction error Â(ω)S(ω), i.e., of the estimated inverse filter output [12]. We also see in Equation (5.50) that the error function is given more weight in the important high-energy resonant regions corresponding to the poles of Image.
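A minimal sketch of Shanks’ least-squares zero estimation follows (hypothetical model; the all-pole estimate is taken to be exact here so that the numerator is recoverable without error). The normal equations are solved implicitly by least squares on the convolution matrix of hα[n]:

```python
import numpy as np

# Assume the all-pole part was already estimated: A(z) = 1 - 0.9 z^-1 + 0.2 z^-2,
# and s[n] follows B(z)/A(z) with unknown numerator b = [1, 0.5, -0.3].
a_hat = np.array([1.0, -0.9, 0.2])
b_true = np.array([1.0, 0.5, -0.3])
N, q = 60, 2

def allpole_ir(a_poly, n):
    """Impulse response h_alpha[m] of 1/A(z), a_poly[0] = 1."""
    h = np.zeros(n)
    for m in range(n):
        h[m] = (1.0 if m == 0 else 0.0) - sum(
            a_poly[k] * h[m - k] for k in range(1, len(a_poly)) if m - k >= 0)
    return h

h_alpha = allpole_ir(a_hat, N)
s = np.convolve(b_true, h_alpha)[:N]     # the measurement

# Shanks' method: minimize sum_n (s[n] - (beta * h_alpha)[n])^2 over beta.
# Columns of H are delayed copies of h_alpha (the convolution matrix).
H = np.column_stack(
    [np.concatenate([np.zeros(j), h_alpha[:N - j]]) for j in range(q + 1)])
beta, *_ = np.linalg.lstsq(H, s, rcond=None)
```

With a real measurement and an inexact Â(z), the least-squares fit degrades gracefully instead of propagating inverse-filter errors directly into βn.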

Another approach, motivated by Shanks’ method, is to iteratively build up more accurate estimates of the poles and zeros simultaneously. Suppose on the i th iteration we have an estimate of A(ω) denoted by Ai(ω) and an estimate of B(ω) denoted by Bi(ω). Then we form the error function on the next iteration as [16]

Image

where, as before, we can view Image as more heavily weighting the error Ai+1(ω)S(ω) − Bi+1(ω) at the higher-energy resonant regions of the signal. We then minimize ei+1[n] (the reader should make the mapping to the time domain) with respect to the unknown αk’s and βk’s to form estimates for the next iteration.

In this section, we have introduced a few of many methods of estimating zeros of a rational z-transform. Two other methods of zero estimation are, for example, “inverse linear prediction” [12] (Exercise 5.22) and a method proposed by Steiglitz [24] (Exercise 5.6).

5.7.2 Application to Speech

We now return to our transfer function for voiced speech from the glottis to the lips that we have expressed in the form

Image

where G(z) is the z-transform of the glottal flow over one glottal cycle, Image represents the minimum-phase all-pole (resonant) component (assuming here that the vocal tract does not introduce zeros), and R(z) represents a zero inside the unit circle due to the radiation load. In the simplifying case when the glottal flow is an impulse, then H(z) = AV(z)R(z) and the methods of the previous section can be applied to estimating the single zero of R(z) and the poles of V(z) (Exercise 5.4). These methods are not constrained to estimating zeros inside the unit circle; the zeros can be minimum- or maximum-phase.

With the presence of the glottal flow, however, we are faced with a new problem. Consider our idealized glottal flow model of two maximum-phase poles for a single glottal cycle. We have not yet discussed methods for estimating maximum-phase poles of a z-transform; we will see such methods in Chapter 6. For now, we take a different and less constraining approach to modeling the glottal flow as a finite-length sequence. Our idealized two-pole model is, after all, simply an approximate representation of a certain class of smooth glottal flow sequences. The glottal flow typically consists of additional components of fine structure as, for example, aspiration and ripple components. As such, we represent the glottal flow, together with the radiation load and gain, by a finite number of zeros, i.e.,

Image

with Mi zeros inside and Mo zeros outside the unit circle and where we embedded the gain A in the coefficients bk . We now introduce the z-transform P(z) of a periodic impulse train Image with period P to represent the periodicity of a voiced sound. (Up to now in this section, we have assumed only one impulse response in the measurement s[n].) Our discrete-time model of voiced speech then becomes in the z-domain

Image

where B(z) = AG(z)R(z). The z-transform S(z) in the time-domain is then given by

Image

The sequence bn represents the convolution of the scaled all-zero glottal flow g[n] [inverse z-transform of AG(z)] with the radiation operator rL[n] = δ[n] − γδ[n − 1] [inverse z-transform of the single-zero radiation component R(z)], i.e., bn = Ag[n] * rL[n]. Because, as we have seen, the radiation sequence rL[n] acts as a differentiator, the sequence bn represents the scaled glottal flow derivative over a single glottal cycle.

We denote the length of the sequence bn by L, which we assume is sufficiently shorter than a glottal cycle that there exists a time region I within which a glottal cycle is “free of zeros.” In other words, there exists a region I in which the above difference equation is not driven by bn. This interval I is the closed-phase glottal region (minus one sample due to the radiation operator rL[n], which for simplicity we ignore in our discussion), as illustrated in Figure 5.19. Under this condition, we can write

Image

Therefore, if we are to use the two-step methods of the previous section for pole-zero estimation, the initial pole estimation must be performed over the limited region I which is “free of zeros.” This approach is referred to as pitch synchronous analysis [5],[19],[28],[30].

The duration of the interval I may be very short. For example, for a high-pitched female voice with a 400-Hz pitch at a 10,000-Hz sampling rate, the interval I consists of 25 samples minus the duration of bn, which is very short indeed! With the autocorrelation method of linear prediction, where s[n] is set to zero outside the region I, the error function is therefore dominated by prediction-error end effects. For this reason, i.e., under this condition of very limited data, the covariance method gives a more accurate solution for the predictor coefficients. The glottal flow derivative estimate is then obtained by inverse filtering the speech waveform with this all-pole vocal tract filter. A consequence of the inverse filtering is an approximate separation of the vocal tract filter from the speech waveform. Indeed, the inverse-filtered speech is essentially unintelligible, perceived largely as a “buzz.” Although we do not currently have a quantitative way of measuring the accuracy of this separation on real speech, a frequency-domain view confirms that negligible vocal tract formant energy is present [19]; the inverse-filtered spectrum typically consists of a smooth lowpass function with an occasional weak peak in the vicinity of the first formant, due to the presence of a ripple component, consistent with the Ananthapadmanabha and Fant theory of nonlinear source/vocal tract interaction [1].
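The covariance analysis over a short “zero-free” region, followed by inverse filtering, can be sketched as follows. A synthetic second-order all-pole signal driven by an impulse train stands in for real speech; the filter coefficients, region placement, and predictor order are illustrative assumptions:

```python
import numpy as np
from scipy.signal import lfilter

def covariance_lp(s, start, length, p):
    """Covariance-method linear prediction over s[start:start+length].

    Solves sum_k a_k phi[i,k] = phi[i,0], i = 1..p, where
    phi[i,k] = sum_n s[n-i] s[n-k] over the analysis region.
    """
    n = np.arange(start, start + length)
    phi = np.empty((p + 1, p + 1))
    for i in range(p + 1):
        for k in range(p + 1):
            phi[i, k] = np.dot(s[n - i], s[n - k])
    a = np.linalg.solve(phi[1:, 1:], phi[1:, 0])
    return a                                    # predictor coefficients a_1..a_p

# Synthetic signal: impulse train through a known all-pole filter,
# standing in for the vocal tract response (poles inside the unit circle)
a_true = np.array([1.3, -0.8])
excitation = np.zeros(400)
excitation[::80] = 1.0                          # "glottal" impulses, period 80
s = lfilter([1.0], np.concatenate(([1.0], -a_true)), excitation)

# Analyze over a region free of new excitation (a "closed phase")
a_hat = covariance_lp(s, start=100, length=30, p=2)

# Inverse filtering with A(z) = 1 - sum a_k z^{-k} recovers the excitation
residual = lfilter(np.concatenate(([1.0], -a_hat)), [1.0], s)
```

Because the analysis region contains no excitation, the covariance solution recovers the all-pole coefficients essentially exactly, and the inverse-filtered residual reduces to the excitation.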

Figure 5.19 Schematic of the glottal closed-phase region I and its timing relation to the speech waveform, the glottal flow derivative duration L, and the pitch period P.

Image

Of course, one must first estimate the glottal closed-phase region to obtain the interval I; this is a difficult task for which a number of methods have been developed. A detailed description of these methods is beyond our scope here, but we give the reader a flavor of a few of them. Childers, for example, uses electroglottographic (EGG) analysis to identify the glottal closed phase, over which the covariance method of linear prediction is applied to obtain an estimate of the vocal tract transfer function [4]. Wong, Markel, and Gray [28] and Cummings and Clements [5] perform, on the speech waveform itself, a sliding covariance analysis with a one-sample shift, using a function of the linear prediction error to identify the glottal closed phase. This approach of relying on the prediction error was observed to have difficulty when the vocal folds do not close completely or when the folds open slowly.

Yet another method of glottal closed phase estimation relies also on a sliding covariance analysis but, rather than using the prediction error from this analysis, uses vocal tract formant modulation which, we have seen, is predicted by Ananthapadmanabha and Fant [1] to vary more slowly in the glottal closed phase than in its open phase and to respond quickly to a change in glottal area (Figure 4.25). A “stationary” region of formant modulation gives a closed-phase time interval, over which we estimate the vocal tract transfer function; a stationary region is present even when the vocal folds remain partly open9 [19].

9 The formant frequencies and bandwidths are expected to remain constant during the closed phase, but shift during the open phase. For voices in which the vocal folds never completely close, such as breathy voices, a similar formant modulation occurs. For such voices, during the nominally closed phase, the glottal opening remains approximately constant, resulting in a nearly fixed change in the formant frequencies. When the vocal folds begin to open, the formants move from their relatively stationary values during the closed phase. The first formant is found to be more stationary than higher formants during the closed phase and exhibits a more observable change at the start of the open phase [19].

With an estimate of the inverse filter A(z), we can apply the methods of the previous section to obtain an estimate of bn and thus the glottal flow convolved with the radiation load, i.e., the glottal flow derivative. An approximate estimate of the glottal flow, unaffected by radiation, can be obtained by pre-emphasizing s[n] at 12 dB/octave or by integrating the glottal flow derivative estimate. In the following example, the autocorrelation method applied over multiple pitch periods is compared against the above pitch synchronous covariance method for the purpose of inverse filtering the speech waveform.

Example 5.8       An example of a glottal flow derivative estimate obtained by the pitch-synchronous covariance method followed by inverse filtering is shown in Figure 5.20c. In this example, the vocal tract poles are first estimated over the glottal closed-phase region I using the covariance method of linear prediction with a 14th-order predictor. The closed-phase glottal region corresponds to a stationary formant region as described above [19]. The resulting A(z) is then used to inverse filter s[n] to obtain an estimate of AG(z)R(z). Because of the differentiation by R(z), a large negative glottal pulse is observed in the glottal flow derivative at glottal closure. We also see a ripple component, consistent with the Ananthapadmanabha and Fant theory of nonlinear source/vocal tract interaction [1]. In addition, we see, in Figure 5.20b, the result of the counterpart autocorrelation method using a 14th-order predictor and a 20-ms analysis window covering multiple pitch periods. Because this method attempts to estimate the combined glottal flow/vocal tract/radiation frequency response, it gives a more pulse-like residual consisting of a series of sharp negative-going peaks at roughly the locations of the glottal pulses, very different from the actual glottal flow derivative. The result of the covariance method, on the other hand, shows a clear view of the closed, open, and return glottal phases, as well as the glottal pulse and ripple component overriding the coarse glottal flow derivative. Image

Figure 5.20 Example of estimation of glottal flow derivative: (a) original speech; (b) “whitened” speech using an inverse filter derived from the autocorrelation method over multiple pitch periods; (c) estimated glottal flow derivative using the pitch-synchronous covariance method over a closed glottal phase.

SOURCE: M.D. Plumpe, T.F. Quatieri, and D.A. Reynolds, “Modeling of the Glottal Flow Derivative Waveform with Application to Speaker Identification” [19]. ©1999, IEEE. Used by permission.

Image

The next example illustrates a more atypical glottal flow contribution: the prominent secondary glottal pulses described in Chapter 3.

Example 5.9       Figure 5.21 shows glottal flow derivative estimates in the presence of secondary glottal pulses. As in the previous example, the vocal tract poles are first estimated over the glottal closed-phase region I using the pitch-synchronous covariance method of linear prediction with a 14th-order predictor. As before, the closed-phase glottal region corresponds to a stationary formant region. Manifestation of the secondary pulses can also be seen in the speech waveform itself. Image

We note that the alternate zero estimation methods of Section 5.7.1 (e.g., Shanks’ method) are also applicable to glottal flow estimation. In addition, the basic method and its extensions can be applied to estimating zeros in the vocal tract transfer function, as in nasalized vowels. The separation of glottal/radiation zeros and vocal tract zeros in this case, however, poses a new and challenging problem (Exercise 5.25). Observe that the methods of zero estimation of this chapter cannot be extended to unvoiced speech modeled by stochastic processes because there does not exist a region “free of zeros” for this speech class. Finally, having a pole-zero representation of speech production, we can generalize the synthesis structures of Section 5.6. Lattice structures, for example, can be generalized to include zeros [18].

Figure 5.21 Two examples of multiple pulses in the glottal flow derivative estimate. In each case, the speech waveform is shown above the glottal flow derivative waveform.

SOURCE: M.D. Plumpe, T.F. Quatieri, and D.A. Reynolds, “Modeling of the Glottal Flow Derivative Waveform with Application to Speaker Identification” [19]. ©1999, IEEE. Used by permission.

Image

5.7.3 High-Pitched Speakers: Using Two Analysis Windows

For high-pitched speakers, it is possible that the above techniques yield too small a glottal closed-phase interval I. For example, a speaker with a fundamental frequency of 200 Hz and an open phase occupying 70% of a pitch period will have a closed phase of only 0.30/(200 Hz) = 1.5 ms. To address this problem, we use a covariance analysis based on two windows across two successive pitch periods [19]. Assuming that the rate of change of the vocal tract depends on time and not on the number of pitch periods, the vocal tract variation over two frames of a 200-Hz voice, for example, is approximately the same as over one frame of a 100-Hz voice, since both last 10 ms. By splitting the covariance method analysis window into two parts, we double the length of our interval I. Because this technique depends on stationarity of both the vocal tract and the source across multiple pitch periods, it is used only when the pitch period is small (chosen empirically as less than 6.5 ms).

As an extension to the covariance method of linear prediction, two windows of speech data can be used to calculate the matrix Φ, i.e.,

(5.51)

Image

where M1 is the start of the first region, L1 is the length of the first region, M2 is the start of the second region, and L2 is the length of the second region. The only change required to convert the standard covariance linear prediction procedure into a two-window procedure is this change in the calculation of the matrix Φ. The properties of the matrix Φ with elements Φ[i, k] are the same as for the one-window case as long as the windows are non-overlapping, allowing efficient solution by Cholesky decomposition (Exercise 5.19).
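In Python, the two-window accumulation of Equation (5.51) might be sketched as follows (the test signal and window placements are illustrative assumptions):

```python
import numpy as np

def two_window_phi(s, regions, p):
    """Covariance-method matrix phi[i,k] accumulated over multiple windows.

    Each (start, length) pair plays the role of (M1, L1) and (M2, L2) in
    Equation (5.51); the windows are assumed non-overlapping.
    """
    phi = np.zeros((p + 1, p + 1))
    for start, length in regions:
        n = np.arange(start, start + length)
        for i in range(p + 1):
            for k in range(p + 1):
                phi[i, k] += np.dot(s[n - i], s[n - k])
    return phi

# Illustrative data: two closed-phase windows of successive pitch periods
rng = np.random.default_rng(0)
s = rng.standard_normal(200)
phi_two = two_window_phi(s, [(50, 20), (120, 20)], p=4)
```

Since the two windows contribute additively to Φ, the matrix remains symmetric and the one-window solution machinery applies unchanged.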

5.8 Decomposition of the Glottal Flow Derivative

In previous sections of this chapter, we described a method of estimating the glottal flow derivative waveform from voiced speech that relies on inverse filtering the speech waveform with an estimate of the vocal tract transfer function. Estimation of the vocal tract transfer function is performed during the glottal closed phase, within which the vocal folds are in a closed position and there is no dynamic source/vocal tract interaction. With the objective of modeling the temporal structure of the glottal flow and decomposing glottal flow components, in this section we develop a time-domain feature representation of the glottal flow derivative during voicing. The coarse structure of the flow derivative is represented by the piecewise-functional Liljencrants-Fant (LF) model [8] consisting of seven parameters, obtained by a nonlinear estimation method of the Gauss-Newton type. The coarse structure is then subtracted from the glottal flow derivative estimate to give its fine-structure component, reflecting characteristics not captured by the general flow shape, such as aspiration and the ripple component we have previously described.

5.8.1 Model

The relation between the coarse structure of the glottal flow, denoted here by ugc(t), and its derivative, denoted by Image, was shown in Figure 4.21. The glottal flow derivative is characterized by a large negative impulse-like glottal pulse and the open- and closed-glottal phases. Fine structure of the glottal flow derivative, denoted by υgf (t), is the residual waveform υgf (t) = υg(t) – υgc(t), υg(t) being the complete glottal flow derivative. Two contributions of fine structure are considered, ripple and aspiration. As was illustrated in Figure 4.23, ripple is a sinusoidal-like perturbation that overlays the coarse glottal flow, and arises from the time-varying and nonlinear coupling of the glottal flow with the vocal tract cavity. Our second form of fine structure, aspiration at the glottis, arises when turbulence is created as air flows through constricted vocal folds, and is also dependent on the glottis for its timing and magnitude. For example, a long, narrow opening, which constricts the air flow along the entire glottal length, tends to produce more aspiration than, for example, a triangular-shaped opening with partial constriction. The creation of turbulence at the glottis is highly nonlinear and a satisfactory physical model has yet to be developed. A simplification is to model aspiration as a random noise process, which is the source to the vocal tract. We model the complete fine-structure source as the addition of the aspiration and ripple source components, although other fine-structure components may be present.

We now propose functional models of the coarse and fine structure useful for feature representation and, in particular, in speaker recognition described in Chapter 14. The features we want to capture through the coarse structure include the glottal open, closed, and return phases, the speeds of opening and closing, and the relationship between the glottal pulse and the peak glottal flow. To model the coarse component of the glottal flow derivative, vgc(t), we use the Liljencrants-Fant (LF) model [8], expressed over a single glottal cycle by (Figure 5.22)

(5.52)

Image

where Image and where the time origin, t = 0, is the start time of the closed phase (also the end of the return phase of the previous glottal cycle, which we later also denote by Tc–1), To is the start time of the open phase (also the end of the closed phase), Te is the start time of the return phase (also the end of the open phase and time of the glottal pulse), and Tc is the end time of the return phase (also the beginning of the closed phase of the next glottal cycle). Three of the parameters, Eo, Ωo, and α, describe the shape of the glottal flow during the open phase. The two parameters Ee and β describe the shape of the return phase. Because at time t = Te, Eo can be calculated from Ee using the relation Image, we reduce the number of waveshape parameters to four, i.e., Ωo, α, Ee, and β. For the LF model, we estimate Ee, not E or E1; Ee is the absolute value of the negative peak for which an initial estimate is easily obtained. The resulting four waveshape parameters do not include the glottal timing parameters; therefore, the times To, Te, and Tc must also be made variables. We thus have a seven-parameter model to describe the glottal flow, with the four LF model shape parameters and three parameters indicating the timing of the flow components. A descriptive summary of the seven parameters of the LF model is given in Table 5.1.
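Equation (5.52) is piecewise over the phases delimited by To, Te, and Tc. As a sketch only, the Python function below implements one common LF-style parameterization; the return-phase normalization and the continuity convention at Te vary across references, so this is an assumption, not a transcription of Equation (5.52):

```python
import numpy as np

def lf_model(t, To, Te, Tc, Ee, omega0, alpha, beta):
    """Sketch of an LF-style glottal flow derivative over one cycle [0, Tc].

    Open phase:  E0 * exp(alpha*(t - To)) * sin(omega0*(t - To)), with E0
    chosen so the open phase reaches -Ee (the glottal pulse) at t = Te.
    Return phase: exponential recovery from -Ee at Te to zero at Tc.
    """
    v = np.zeros_like(t)
    # Continuity at t = Te fixes E0 (Ee is the magnitude of the negative peak)
    E0 = -Ee / (np.exp(alpha * (Te - To)) * np.sin(omega0 * (Te - To)))
    opn = (t >= To) & (t < Te)
    v[opn] = E0 * np.exp(alpha * (t[opn] - To)) * np.sin(omega0 * (t[opn] - To))
    ret = (t >= Te) & (t <= Tc)
    scale = 1.0 - np.exp(-beta * (Tc - Te))
    v[ret] = -Ee * (np.exp(-beta * (t[ret] - Te)) - np.exp(-beta * (Tc - Te))) / scale
    return v

# One glottal cycle with illustrative timing and shape values
t = np.linspace(0.0, 7e-3, 7001)
v = lf_model(t, To=2e-3, Te=6e-3, Tc=7e-3,
             Ee=1.0, omega0=2 * np.pi * 187.5, alpha=400.0, beta=3000.0)
```

With these values the waveform is zero over the closed phase [0, To), rises and then swings negative during the open phase, reaches the glottal pulse −Ee at Te, and returns to zero at Tc.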

Figure 5.22 LF Model for the glottal flow derivative waveform.

SOURCE: M.D. Plumpe, T.F. Quatieri, and D.A. Reynolds, “Modeling of the Glottal Flow Derivative Waveform with Application to Speaker Identification” [19]. ©1999, IEEE. Used by permission.

Image

Table 5.1 Description of the seven parameters of the LF model for the glottal flow derivative waveform.

SOURCE: M.D. Plumpe, T.F. Quatieri, and D.A. Reynolds, “Modeling of the Glottal Flow Derivative Waveform with Application to Speaker Identification” [19]. ©1999, IEEE. Used by permission.

Image

In characterizing the glottal flow derivative, we also define five time intervals within a glottal cycle. The first three intervals correspond to timing within the LF model of coarse structure, while the last two intervals come from timing measurements made on formant modulation described earlier. The time intervals are given as

1. I1 = [0, To] is the closed phase for the LF model.

2. I2 = [To, Te] is the open phase for the LF model.

3. I3 = [Te, Tc] is the return phase for the LF model.

4. I4 = [0, Tƒ] is the closed phase for the formant modulation.

5. I5 = [Tƒ, Te] is the open phase for the formant modulation.

5.8.2 Estimation

With a glottal source waveform estimate, we can estimate parameters of the coarse and fine model components of the glottal flow derivative. The seven parameters of the LF model to be estimated for each glottal cycle were summarized in Table 5.1. A least-squared error minimization problem can be formulated to fit the LF model of Equation (5.52) to the glottal flow derivative waveform. For the glottal flow derivative estimate υg[n], the error criterion is given as

Image

where No, Ne, and Nc are sample-time10 counterparts to the continuous-time variables To, Te, and Tc of Equation (5.52), ωo is a discrete-time frequency, and Image is a vector of the seven model parameters. The error Image is a nonlinear function of the seven model parameters with no closed-form solution, and thus the problem is solved iteratively using a nonlinear least-squared error minimization with calculation of first- and second-order gradients.

10 For notational convenience, the sampling time interval is normalized to unity.
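The flavor of such an iterative, bounded fit can be sketched with SciPy's generic trust-region least-squares routine standing in for a specialized solver. To keep the sketch short, a single decaying exponential plays the role of the LF return phase; the data, model, and all numeric values below are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import least_squares

# Synthetic "return phase" data: a decaying exponential plus small noise
# standing in for ripple and aspiration (all values illustrative)
rng = np.random.default_rng(1)
t = np.linspace(0.0, 2e-3, 200)
Ee_true, beta_true = 0.8, 3000.0
data = -Ee_true * np.exp(-beta_true * t) + 0.02 * rng.standard_normal(t.size)

def residual(theta):
    """Model error for parameters theta = (Ee, beta)."""
    Ee, beta = theta
    return -Ee * np.exp(-beta * t) - data

# Lower bounds keep Ee and beta physically reasonable (positive), in the
# same spirit as the bounds placed on the LF parameters in the text
fit = least_squares(residual, x0=[0.5, 1000.0],
                    bounds=([1e-6, 1e-6], [np.inf, np.inf]))
Ee_hat, beta_hat = fit.x
fine = -residual(fit.x)        # data minus fitted coarse structure
```

The residual of the converged fit, data minus model, plays the role of the fine structure in the decomposition described in this section.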

Standard methods for solving nonlinear least-squares problems, such as the Gauss-Newton method, are not adequate when the minimum error Image is large [6]. This is often the case in fitting the LF model to the glottal flow derivative waveform because ripple and aspiration, not represented by the LF model, manifest themselves in Image. An algorithm more amenable to large optimization error is an adaptive nonlinear least-squares regression technique, referred to as the NL2SOL algorithm.11 For specifics on parameter estimation with the NL2SOL algorithm, the interested reader is referred to [19]. This algorithm also has the advantage of allowing bounds that limit parameters to physically reasonable values. For example, if the value ωo is less than π, the model will have no negative flow derivative during the open phase, which is inconsistent with a negative-going glottal pulse. Another example of an unrealistic condition is the parameter Ee taking on a negative value or a value near zero. Therefore, π and 0 are the lower bounds for estimates of the model parameters ωo and Ee, respectively. When a resulting model parameter estimate is too close to its bound, specifically within 1% of its bound, a threshold obtained experimentally, we consider data for that frame to be unreliable. We refer to such parameters as “singularities.” Subtracting the estimated coarse structure from the glottal flow derivative waveform yields the fine structure with contributions of aspiration due to turbulence at the glottis and ripple due to source/vocal tract interaction, as illustrated in the following example:

11 The NL2SOL algorithm is the Association for Computing Machinery algorithm 573 [6],[7]. The acronym derives from its being a NonLinear secant approximation To the Second Order part of the Least-squared Hessian.

Example 5.10       Figure 5.23a shows an example of the coarse-structure estimate (dashed) superimposed on the glottal flow derivative estimate (solid), along with the timing estimates To, Te, and Tc of the LF model. The closed phase interval I1 = [0, To), as determined by the LF model, comprises an interval of aspiration followed by ripple, the ripple continuing into the open phase I2 = [To, Te), after which we see the occurrence of a sharp glottal pulse and a gradual return phase I3 = [Te, Tc). The residual (Figure 5.23b), formed by subtracting the coarse structure from the glottal flow derivative estimate, forms the fine-structure estimate. The starting time of the open phase, Tƒ, according to formant modulation, is also shown.12 In this case, the closed phase I4 = [0, Tƒ), as determined by formant modulation, consists of primarily aspiration, while the following open phase I5 = [Tƒ, Te) is initially dominated by ripple, with no significant glottal flow, and then a slowly rising glottal flow with superimposed ripple.Image

12 This example also illustrates the improved temporal resolution that can be gained by the timing parameter Tƒ in representing fine structure.

Figure 5.23 Example of a glottal flow derivative estimate and its coarse and fine structure: (a) estimated glottal flow derivative (solid) and overlaid LF model, i.e., the coarse structure (dashed); (b) the fine structure obtained by subtracting the coarse structure from the glottal flow derivative estimate. Aspiration and ripple are seen in different intervals of the glottal cycle.

SOURCE: M.D. Plumpe, T.F. Quatieri, and D.A. Reynolds, “Modeling of the Glottal Flow Derivative Waveform with Application to Speaker Identification” [19]. ©1999, IEEE. Used by Permission.

Image

Through experiments of this kind, it has been observed that coarse structure and fine structure of ripple and aspiration during voicing are speaker-dependent [19]. In addition, similar experiments have shown that such characteristics may have more general dependencies such as dialect [29]. We will exploit such dependencies in Chapter 14 in automatic speaker recognition.

5.9 Summary

In this chapter, we introduced the concept of linear prediction analysis of speech sounds whose underlying system function follows an all-pole model. We saw that linear prediction analysis can be formulated for both deterministic and stochastic sound classes using the autocorrelation or covariance method of solution. For the autocorrelation method, the solution equations are referred to as the normal equations. The autocorrelation method of linear prediction analysis assumes zero data outside of a short-time windowed speech segment and thus typically does not result in an exact solution, even when the data follows an all-pole model. Nevertheless, the autocorrelation method provides a stable and efficient solution to all-pole parameter estimation. Techniques of synthesis based on linear prediction analysis were presented. Limitations of speech analysis/synthesis based on the autocorrelation method of linear prediction were also described. One problem that plagues this approach is the requirement of stationarity of the speech source and system under the analysis window. Exercise 5.13 gives an interesting alternate modeling scheme whereby the predictor coefficients are themselves modeled as polynomials. This requirement of stationarity, however, is not exclusive to linear prediction-based analysis/synthesis; we will encounter it in all analysis/synthesis techniques throughout the text.

We also saw in this chapter that a more accurate approach to speech modeling is to allow zeros, as well as poles, in the model vocal tract transfer function, representing the “anti-resonances” of the vocal tract, as well as an arbitrary glottal flow function. We described a number of techniques for pole-zero estimation and decomposition. This modeling approach, together with the covariance method of linear prediction applied pitch-synchronously, provided a means for glottal flow estimation that led to a decomposition of the glottal flow derivative into its coarse and fine-structure (aspiration and ripple) components. These glottal components are used in Chapter 14 in an automatic speaker recognition application. Such methods of separating speech components represent just some of the decomposition approaches we describe in the text. In the next chapter, for example, we describe a different approach to separating the pole and zero contributions of a vocal tract transfer function, based on a method referred to as homomorphic deconvolution.

Appendix 5.A: Properties of Stochastic Processes

In this brief review, we assume familiarity with fundamental concepts of probability theory, including random variables, the probability density (distribution), and stochastic averages. Further review of this material can be found in [17],[26].

Random Processes

Consider tossing a coin. On each toss at time n, we assign the value 1 (heads) or −1 (tails) to a sequence x[n]. Suppose the probability of heads is p and the probability of tails is 1 − p. An example is Image and Image. We interpret the sequence value x[n] as a particular value of the random variable xn. That is, the value x[n] is the outcome of the experiment of tossing a coin. The random variable xn has associated with it the probability Pxn (x[n]). We now give numerous definitions associated with properties of random variables.
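This coin-toss process is easy to simulate. The sketch below (the sample sizes and the choice p = 0.7 are arbitrary) draws an ensemble of realizations and forms an ensemble average at a fixed time, anticipating the definitions that follow:

```python
import numpy as np

# Ensemble of coin-toss sample sequences: x[n] = +1 (heads) with
# probability p, -1 (tails) otherwise; p = 0.7 is an illustrative choice
p = 0.7
rng = np.random.default_rng(0)
ensemble = np.where(rng.random((5000, 100)) < p, 1, -1)   # 5000 realizations

# Ensemble average at a fixed time n estimates E[x_n] = p - (1 - p) = 2p - 1
mean_at_n = ensemble[:, 40].mean()
```

Each row of `ensemble` is one sample sequence; averaging down a column approximates the ensemble average at that time index.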

Definition: A random process is an indexed set of random variables

{xn},       −∞ < n < ∞

with a probability of each random variable Pxn (x[n]) which is a function of the index n.

Definition: A sample sequence (or “realization”) of the random process is a particular sequence of values

{x[n]},       −∞ < n < ∞

The number of possible sample sequences is infinite.

Definition: An ensemble (of sample sequences) is a collection of all possible sample sequences.

In applying random process theory to speech processing, we consider a particular sequence to be one of an ensemble of sample sequences. We don’t always know the underlying probability description of the random process, and so it must be inferred or estimated.

In a generalization of the binary coin tossing experiment, each random variable xn in the random process takes on a continuum of values and has an associated probability density function (pdf) denoted by pxn (x[n]), where x[n] is a particular value of the random variable. The pdf integrates to unity (expressing the certainty that one of the possible events must occur):

Image

Inter-dependence between two random variables xn and xm of the process is given by the joint pdf:

pxn, xm (x[n], x[m]).

A complete characterization of the random process would require all possible joint density functions.

Definition: Statistical independence of two random variables xn and xm must satisfy the condition

pxn, xm (x[n], x[m]) = pxn (x[n]) pxm (x[m]).

If this is true for all n, m (n ≠ m), then the process is “white.”

Definition: Stationarity of a random process requires that all joint pdf’s are independent of a shift in the time origin. For a second-order pdf, for example, we must have

pxn+k, xm+k(x[n + k], x[m + k] ) = pxn,xm(x[n], x[m]).

The pdf depends on the distance between variables, not absolute time (Figure 5.24).

Figure 5.24 Illustration of stationarity of a random process. The distance between samples is the same for each depicted pair.

Image

Ensemble Averages

Average properties of the random process often serve as a useful, but incomplete, characterization of the process. Below we consider some essential definitions.

Definition: The mean of a random variable xn from a process is given by

Image

which satisfies the property of superposition, i.e.,

Image

mxn is sometimes called an ensemble average for a particular xn, i.e., the average over all sample sequences at time n.

Definition: The mean-squared value of a random variable xn from a process is given by

Image

which is sometimes called the “average power.”

Definition: The variance of xn is given by

Image

Taking the square root, we obtain σxn which is called the standard deviation.

Definition: The autocorrelation of two random variables xn and xm from a process is given by

Image

which is the measure of the dependence between variables of the random process at different times.

Definition: The autocovariance of two random variables from a process is given by

Image

which is the autocorrelation without the mean.

Definition: Uncorrelated random variables (or “linearly independent”) must satisfy the condition

E[xnym] = E[xn]E[ym].

A sufficient condition for linear independence is given by

pxn, ym (x[n], y[m]) = pxn(x[n] )pym (y[m] ),

i.e., the variables are statistically independent. In general, linear independence is not sufficient for statistical independence. If the mean of either random variable is zero, then uncorrelated implies

E[xnym] = 0.

Stationary Random Process

Definition: For wide sense stationarity, both the mean and variance of the random process variables are constant:

Image

and the autocorrelation is a function only of the time difference, τ, i.e.,

Image

Definition: For stationarity in the strict sense, all joint pdf’s are functions of time differences (corresponding to our previous definition of stationarity).

For a white noise sequence, we have

Image

where δ[τ] is the unit sample sequence. A white noise sequence is thus wide sense stationary and is uncorrelated for all time differences.

Time Averages

The notion of ensemble average (the above statistical averages) is conceptually important but not practical. Under a condition known as ergodicity, ensemble averages are equal to time averages.

Definition: An ergodic random process is a random process for which time averages equal ensemble averages. Here, wide-sense stationarity is implicit.

A consequence of ergodicity is that for any single sample sequence, (< · > denoting time average)

Image

and

Image

With finite limits on the sums, we obtain only estimates of mean and autocorrelation.
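As a numerical illustration of ergodicity, using an i.i.d. Gaussian sequence as a conveniently ergodic process (σ = 1.5 is an arbitrary choice), time averages over one long realization approximate the ensemble mean and autocorrelation:

```python
import numpy as np

# One long realization of an i.i.d. (hence ergodic, wide-sense stationary)
# Gaussian process with mean 0 and variance sigma^2
rng = np.random.default_rng(2)
sigma = 1.5
x = sigma * rng.standard_normal(100_000)

time_mean = x.mean()                     # time average, estimates E[x_n] = 0

def r_hat(tau):
    """Time-average estimate of the autocorrelation at lag tau >= 0."""
    return np.dot(x[:x.size - tau], x[tau:]) / (x.size - tau)
```

For this white sequence the estimates approach mx = 0, rx[0] = σ² = 2.25, and rx[τ] ≈ 0 for τ ≠ 0, with accuracy that improves as the averaging length grows.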

Power Density Spectrum

Definition: The discrete-time Fourier transform of the autocorrelation function of a wide sense stationary random process is referred to as the power density spectrum or power spectrum. Denoting the power spectrum of a process x[n] by Sx(ω), we have the Fourier transform pair

Image

The description “power density” is motivated by seeing from the above inverse Fourier transform that the average power in x[n] is given by [17],[26]

Image

and thus each frequency band of (infinitesimally small) width makes a contribution Image to the total average power. The power spectrum is sometimes normalized by its total average power in order to have the properties of a probability density. The interpretation of Sx(ω) as a power density is analogous to our formulation of the energy density for deterministic signals through Parseval’s Theorem in Chapter 2, i.e., Image

Definition: An estimate of the power spectrum is obtained by computing the squared magnitude of the discrete-time Fourier transform of x[n] over a finite interval N, i.e.,

Image

which is referred to as the periodogram. The periodogram is related to the estimate of the autocorrelation Image through the Fourier transform. The periodogram fluctuates about the power spectrum with a variance that is proportional to the square of the power spectrum for large N [17]. The expected value of the periodogram is given by

Image

i.e., it is the circular convolution of the power spectrum of x[n] with the Fourier transform of the rectangular window w[n] = 1 for 0 ≤ n < N, and thus the periodogram is a biased estimate of the power spectrum in the sense that spectral resolution is lost. Various smoothing techniques have been developed to reduce the variance of the periodogram, such as windowing the autocorrelation estimate or applying non-rectangular windows to the time waveform, but at the expense of further loss in spectral resolution [17].
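The fluctuation of the periodogram about the power spectrum, and its reduction by smoothing, can be seen numerically for white noise. In the sketch below, the segment length, segment count, and σ are arbitrary choices, and averaging periodograms over segments is used as one simple smoothing technique of the kind mentioned above:

```python
import numpy as np

def periodogram(x):
    """|X_N(omega_k)|^2 / N at the N DFT frequencies."""
    return np.abs(np.fft.fft(x)) ** 2 / x.size

# White noise with variance sigma^2 has the flat power spectrum sigma^2
rng = np.random.default_rng(3)
sigma = 2.0
N, K = 256, 200                      # segment length and number of segments
segs = sigma * rng.standard_normal((K, N))

single = periodogram(segs[0])                 # fluctuates about sigma^2 = 4
averaged = np.mean([periodogram(s) for s in segs], axis=0)   # smoothed
```

A single periodogram scatters widely about σ², while the segment average hugs it, at the cost of the spectral resolution a longer single transform would have offered.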

Often we are concerned with passing white noise through a linear time-invariant system. Because the power spectrum of a white noise input x[n] with variance σ2 is given by Sx(ω) = σ2, the power spectrum of the linear time-invariant system output y[n] can be expressed as

Sy(ω) = σ2|H(ω)|2

where H(ω) is the frequency response of the system. For an N-point measurement of the system output y[n], with large N, where end effects at the boundaries of the data segment are negligible [17], we can write the periodogram of the output as

|YN(ω)|2 ≈ |H(ω)|2|XN(ω)|2

where |XN(ω)|2 is the N-point periodogram of the white noise and is shaped by the system frequency response |H(ω)|2 to form the N-point periodogram of the waveform.
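The relation Sy(ω) = σ²|H(ω)|² can be checked numerically. In the sketch below, the single-pole filter H(z) = 1/(1 − 0.9 z⁻¹) and all lengths are illustrative choices, and the output power spectrum is estimated by an averaged periodogram:

```python
import numpy as np
from scipy.signal import lfilter, freqz

# White noise through H(z) = 1/(1 - 0.9 z^{-1}); the pole at z = 0.9 is an
# arbitrary illustrative choice
rng = np.random.default_rng(4)
sigma = 1.0
x = sigma * rng.standard_normal(1 << 16)
y = lfilter([1.0], [1.0, -0.9], x)

# Averaged periodogram of y against the theoretical sigma^2 |H(omega)|^2
N = 512
segs = y.reshape(-1, N)
Sy_hat = np.mean(np.abs(np.fft.fft(segs, axis=1)) ** 2, axis=0) / N
w, H = freqz([1.0], [1.0, -0.9], worN=N, whole=True)
Sy_theory = sigma ** 2 * np.abs(H) ** 2
```

The estimate follows the lowpass shape of |H(ω)|², largest near ω = 0 where the pole sits and smallest near ω = π.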

Appendix 5.B: Derivation of the Lattice Filter in Linear Prediction Analysis

Define the forward ith-order prediction error as

Image

with z-transform

(5.53)

Image

where

(5.54)

A^(i)(z) = 1 − Σ_{k=1}^{i} α_k^(i) z^(−k)

and define the backward ith-order prediction error as

b^(i)[n] = x[n − i] − Σ_{k=1}^{i} α_k^(i) x[n − i + k]

with z-transform

(5.55)

B^(i)(z) = z^(−i) A^(i)(z^(−1)) X(z)

From Levinson’s recursion,

(5.56)

α_k^(i) = α_k^(i−1) − k_i α_{i−k}^(i−1),   k = 1, 2, …, i − 1,   with α_i^(i) = k_i

Substituting Equation (5.56) into Equation (5.54), we have

(5.57)

A^(i)(z) = 1 − Σ_{k=1}^{i−1} [α_k^(i−1) − k_i α_{i−k}^(i−1)] z^(−k) − k_i z^(−i)

In the second term of Equation (5.57), letting k′ = i − k, Equation (5.57) becomes

A^(i)(z) = A^(i−1)(z) − k_i z^(−i) [1 − Σ_{k′=1}^{i−1} α_{k′}^(i−1) z^(k′)]

or

(5.58)

A^(i)(z) = A^(i−1)(z) − k_i z^(−i) A^(i−1)(z^(−1))

Substituting Equation (5.58) into Equation (5.53),

(5.59)

E^(i)(z) = A^(i−1)(z) X(z) − k_i z^(−i) A^(i−1)(z^(−1)) X(z)

From Equations (5.59), (5.53), and (5.55),

E^(i)(z) = E^(i−1)(z) − k_i z^(−1) [z^(−(i−1)) A^(i−1)(z^(−1)) X(z)]

so that

(5.60)

E^(i)(z) = E^(i−1)(z) − k_i z^(−1) B^(i−1)(z)

which, in the time domain, is given by

(5.61)

e^(i)[n] = e^(i−1)[n] − k_i b^(i−1)[n − 1]

Now, substituting Equation (5.58) into Equation (5.55),

B^(i)(z) = z^(−i) A^(i−1)(z^(−1)) X(z) − k_i A^(i−1)(z) X(z)

or

(5.62)

B^(i)(z) = z^(−1) B^(i−1)(z) − k_i E^(i−1)(z)

which, in the time domain, is given by

(5.63)

b^(i)[n] = b^(i−1)[n − 1] − k_i e^(i−1)[n]
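Equations (5.61) and (5.63) can be verified numerically: an order-2 lattice driven by reflection coefficients k1 and k2 must produce the same forward error as the direct-form predictor whose coefficients come from the Levinson recursion (5.56). A minimal Python check, with arbitrary test data and our own variable names:

```python
# Levinson step for order 2: alpha_1^(2) = k1*(1 - k2), alpha_2^(2) = k2
k1, k2 = 0.5, -0.3
a1, a2 = k1 * (1 - k2), k2

x = [1.0, 0.4, -0.2, 0.9, 0.3, -0.5]   # arbitrary test data

def at(seq, n):
    # zero-padded indexing
    return seq[n] if 0 <= n < len(seq) else 0.0

e_prev = list(x)   # e^(0)[n] = x[n]
b_prev = list(x)   # b^(0)[n] = x[n]
for k in (k1, k2):
    e = [at(e_prev, n) - k * at(b_prev, n - 1) for n in range(len(x))]  # (5.61)
    b = [at(b_prev, n - 1) - k * at(e_prev, n) for n in range(len(x))]  # (5.63)
    e_prev, b_prev = e, b

# direct-form prediction error x[n] - a1*x[n-1] - a2*x[n-2]
direct = [at(x, n) - a1 * at(x, n - 1) - a2 * at(x, n - 2)
          for n in range(len(x))]
# e_prev (the lattice output e^(2)[n]) matches direct sample for sample
```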

Exercises

5.1 Show that for the autocorrelation method of linear prediction of Section 5.2.3, the normal equations are written as

Σ_{k=1}^{p} α_k r_n[|i − k|] = r_n[i],   i = 1, 2, …, p

where

r_n[τ] = Σ_{m=0}^{N−1−τ} s_n[m] s_n[m + τ]

and where N is the data length.

5.2 In this problem, you show that if sn[m] is a segment of a periodic sequence, then its autocorrelation rn[τ] is periodic-like. You also investigate a small deviation from periodicity. In particular, consider the periodic impulse train

x[n] = Σ_{k=0}^{∞} δ[n − kP]

(a) Compute and sketch the autocorrelation rn[τ] of the windowed sequence x[n], i.e., x[n]w[n], when the window applied to x[n] is rectangular over the interval 0 ≤ n < Nw with length Nw = 4P, showing that rn[τ] is periodic-like and falls off roughly linearly with increasing τ.

(b) How does your result from part (a) change if the pitch period increases by one sample on each period and the window length is long enough to cover the first three impulses of x[n]? Observe that the first three impulses occur at n = 0, P + 1, and 2P + 2.

(c) Suppose now that x[n] is convolved with a single decaying exponential

y[n] = x[n] * aⁿu[n]

with u[n] the unit step. How does your result in parts (a) and (b) change with y[n]? Do not compute; just sketch the results.
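As a numerical sketch of part (a) (Python; P, the impulse amplitudes, and the variable names are our own choices), the rectangularly windowed train with Nw = 4P has autocorrelation peaks at multiples of P that fall off linearly:

```python
# windowed periodic impulse train: impulses at n = 0, P, 2P, 3P
P = 10
Nw = 4 * P
x = [1.0 if n % P == 0 else 0.0 for n in range(Nw)]

def r(tau):
    # short-time autocorrelation of the windowed sequence
    return sum(x[n] * x[n + tau] for n in range(Nw - tau))

# peaks at tau = 0, P, 2P, 3P fall off linearly with tau
peaks = [r(m * P) for m in range(4)]   # -> [4.0, 3.0, 2.0, 1.0]
```

The linear falloff comes directly from the shrinking overlap of the window with its shifted copy, which is the effect part (a) asks you to show analytically.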

5.3 Prove the autocorrelation properties (P1)-(P3) of Section 5.3.3.

5.4 Occasionally the vocal cords vibrate with a “secondary flap” within a pitch period. Consider a model for the response of the vocal tract to one primary glottal impulse δ[n] and one secondary glottal impulse δ[n − n0] at time n0 within a pitch period (here we are assuming an idealized glottal pulse):

Image

where the scale factor α on the secondary pulse is typically less than one.

(a) Suppose you are given h[n] and the time delay of the secondary pulse n0. Derive an estimate of the scaling factor α by minimization of the error criterion:

Image

with respect to the unknown α and where s[n] is the speech measurement.

(b) Suppose now that the z-transform of the vocal tract impulse response is given by

Image

consisting of a minimum-phase (all-pole) Image and one maximum-phase zero (1 − bz) with |b| < 1. Suppose you are given A(z), the speech measurement s[n], and the delay n0. Estimate the maximum-phase zero.

(c) Suppose you do not know A(z). Describe a two-step procedure for estimating the zero (1 − bz) and the poles of A(z) in part (b).

5.5 In this problem you complete the missing steps of Example 5.3 in which we build a first-order all-pole model from the first N samples of x[n] = aⁿu[n] by using the autocorrelation method of linear prediction.

(a) Show how the estimate α1 is found in Example 5.3 from the autocorrelation normal equations.

(b) Give the steps in finding the minimum squared error E0 in Example 5.3 and show that this error converges to unity as N → ∞.
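The computation in this exercise can be sketched numerically (Python; the closed form in the comment follows from summing geometric series, and the specific values of a and N are our own choices):

```python
a = 0.8

def estimate(N):
    # autocorrelation method, p = 1, on the first N samples of a^n u[n]
    x = [a ** n for n in range(N)]
    r0 = sum(v * v for v in x)
    r1 = sum(x[n] * x[n + 1] for n in range(N - 1))
    alpha1 = r1 / r0              # normal equation: r0 * alpha1 = r1
    E0 = r0 - alpha1 * r1         # minimum squared error
    return alpha1, E0

# closed form: alpha1 = a (1 - a^(2N-2)) / (1 - a^(2N)),
# so alpha1 < a for finite N (a bias of the autocorrelation method),
# alpha1 -> a and E0 -> 1 as N -> infinity
a1_small, _ = estimate(5)
a1_big, E_big = estimate(200)
```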

5.6 Suppose we want to derive a rational model for an unknown system S using the technique in Figure 5.25. A known input x[n] is applied to the system and the output y[n] is measured. Then the parameters of two FIR filters are chosen to minimize the energy in the error signal e[n] in Figure 5.25.

(a) Write the normal equations that describe the optimal least-squared error solution for A(z) and B(z).

(b) The philosophy of this method is that if the error is small, then Image is a reasonable model for S. Suppose S is an LTI rational system; show that this method identifies the parameters of S exactly when the orders of the numerator and denominator polynomials of S are known.

Figure 5.25 Pole-zero estimation for a rational model.

SOURCE: K. Steiglitz, “On the Simultaneous Estimation of Poles and Zeros in Speech Analysis” [24]. ©1977, IEEE. Used by permission.

Image

5.7 This problem addresses the difficulty inherent in linear prediction analysis for high-pitched speakers.

(a) Suppose h[n] is the impulse response of an all-pole system where

Image

so that

Image

Show that

Image.

Hint: Multiply both sides of the difference equation by h[ni] and sum over n. Note that h[n] is causal.

(b) Assume s[n] is a periodic waveform, given by

Image

where P is the pitch period. Show that the autocorrelation of s[n], windowed over multiple pitch periods, consists of periodically repeated replicas of rh[τ], i.e.,

Image

but with decreasing amplitude due to the window.

(c) Using your result in parts (a) and (b), explain the difference between your result in part (a) and the normal equations for the autocorrelation method using the windowed speech signal s[n].

(d) Using your results from parts (b) and (c), explain why linear prediction analysis is more accurate for low-pitched speakers than high-pitched speakers.

5.8 Unlike the autocorrelation method, the covariance method of linear prediction can determine the exact all-pole model parameters from a finite data segment. Consider a signal of the form x[n] = aⁿu[n] with u[n] the unit step.

(a) Suppose the rectangular window w[n] has duration Nw ≥ 2. Determine the normal equations for the covariance method for w[n]x[n] = w[n]aⁿ (Figure 5.26). Solve for the one unknown prediction coefficient α1, as a function of the true value a. Compute the mean-squared prediction error Image

Figure 5.26 Sequence w[n]x[n] = w[n]aⁿ with w[n] a rectangular window.

Image

(b) Suppose now a “quasi-periodic” sequence is formed by two replicas of x[n]:

Image

as shown in Figure 5.27. Again use the covariance method to obtain an estimate of a, where the window is placed “pitch synchronously,” beginning at time sample n = P + 1 with Nw − 1 > P + 1 as shown in Figure 5.27. What is the implication for your solution of not requiring “pitch synchrony?”

(c) If you were to repeat part (b) using the autocorrelation method, how would your estimate of a change? Do not solve. Explain your reasoning.

Figure 5.27 Image formed from two replicas of x[n].

Image

5.9 Suppose the z-transform of a sequence x[n] consists of both poles and zeros:

Image

where

Image

(a) Given a measurement s[n], formulate a least mean-squared prediction error problem to solve for the coefficients of A(z), free of the coefficients bk. Assume you know the order p of A(z). Select the prediction-error window to be right-sided of infinite length. Derive the normal equations for the covariance method with the appropriate limits of summation. Do your normal equations give the true coefficients of A(z)? Do not solve.

(b) Assume now an estimate Image of A(z). Denote Image as the impulse response of Image. We want to formulate a linear least mean-squared error problem to solve for the coefficients bk of B(z). To do so, we form the error

Image

where

Image

Find the normal equations by minimization of the error function

Image

with respect to the unknown b[n] for n = 0, 1, …, q. Set up the normal equations in matrix form. Is the governing matrix Toeplitz? Using Parseval’s Theorem, show that the mean-squared error E has the frequency-domain interpretation of Equation (5.50).

(c) Suppose now that Image is the z-transform of the vocal tract output volume velocity where Image represents the vocal tract poles and B(z) represents the glottal flow of one glottal cycle during voicing. Suppose, however, that during the glottal open phase, i.e., the time over which the glottal flow occurs, the effective vocal tract shape changes as the glottal slit area increases and then decreases. Given that the glottal pulse width is q samples, and A(z) is of order p, repeat part (a) estimating the poles (prediction coefficients) using only the waveform during the closed glottal phase, and also not allowing the time-varying vocal tract during the open glottal phase to influence the pole estimates. Hint: Consider a prediction error window in which x[n] is not a function of the changing vocal tract parameters, but rather only the stationary vocal tract parameters.

5.10 Describe the effect on the short-time autocorrelation function rn[τ] of a change in the vocal tract impulse response over successive periods when the analysis window falls over multiple pitch periods. What does this imply about the time resolution of the autocorrelation method of linear prediction analysis? Describe advantages and disadvantages in the autocorrelation method of using an analysis window one period in duration with respect to time and frequency resolution.

5.11 Consider a 2nd-order linear predictor where the autocorrelation method solution is used to obtain the predictor coefficients [α1, α2] through:

r[0]α1 + r[1]α2 = r[1]
r[1]α1 + r[0]α2 = r[2]

where r[n] represents the estimated autocorrelation coefficients. This matrix equation can be solved recursively for Image using the Levinson recursion. Begin with the 0th-order predictor coefficient and mean-squared error given, respectively, by

α^(0) = 0,   E^(0) = r[0]

and show that for the first two passes through the Levinson recursion (i = 1, 2) the partial correlation coefficients ki, predictor coefficients αi, and mean-squared prediction error Ei can each be written as functions of autocorrelation coefficients, the final pass (i = 2) being given by

Image

where the solution for the optimal 2nd-order coefficients is given by

Image

with * denoting the result of the autocorrelation method. It is not surprising that the prediction error, as well as the partial correlation coefficients, are functions of the autocorrelation coefficients, because the Levinson recursion solves the normal equations for the autocorrelation method.

5.12 We saw in Exercise 5.11 that the forward Levinson recursion can be viewed as a mapping of a set of p autocorrelation coefficients r[τ] into a set of predictor filter coefficients αk for the autocorrelation method of linear prediction analysis. In this problem you are asked to look further into this correspondence and the correspondence among other all-pole representations, as well as the stability of the resulting all-pole filter 1/A(z).

(a) Derive the backward Levinson recursion in Equation (5.20) that maps a set of predictor filter coefficients into a set of partial correlation coefficients.

(b) Using the forward Levinson recursion and backward Levinson recursion, argue for a one-to-one correspondence between the autocorrelation coefficients [r[0], r[1], … r[p]] and predictor coefficients and gain [α1, α2, … αp, A]. Then argue that a one-to-one mapping exists between the predictor coefficients along with the gain, [α1, α2, … αp, A], and the partial correlation coefficients along with the gain, [k1, k2, … kp, A], i.e., one coefficient set can be obtained from the other.

(c) State a stability test for the all-pole filter 1/A(z) in terms of the reflection coefficients (as derived from the prediction filter coefficients). Give your reasoning.

(d) Is the filter

H(z) = 1/(1 − 2z^(−1) − 6z^(−2) + z^(−3) − 2z^(−4))

stable? Use your stability test from part (c).
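A sketch of the stability test of parts (c) and (d) in Python (the step-down formula α_k^(i−1) = (α_k^(i) + k_i α_{i−k}^(i))/(1 − k_i²) is the backward Levinson recursion of part (a); the function names are ours):

```python
def reflection_coeffs(alphas):
    # alphas[k-1] = alpha_k for A(z) = 1 - sum_k alpha_k z^(-k)
    a = list(alphas)
    ks = []
    for i in range(len(a), 0, -1):
        ki = a[i - 1]                 # k_i = alpha_i^(i)
        ks.append(ki)
        if abs(ki) >= 1.0:
            break                     # already unstable; stop stepping down
        a = [(a[k - 1] + ki * a[i - k - 1]) / (1 - ki * ki)
             for k in range(1, i)]    # backward Levinson step-down
    return ks

def is_stable(alphas):
    # stable iff every reflection coefficient satisfies |k_i| < 1
    ks = reflection_coeffs(alphas)
    return len(ks) == len(alphas) and all(abs(k) < 1.0 for k in ks)

# part (d): A(z) = 1 - 2z^-1 - 6z^-2 + z^-3 - 2z^-4 -> k4 = 2, unstable
unstable_case = is_stable([2.0, 6.0, -1.0, 2.0])
# sanity check: poles at +/-0.5, i.e., A(z) = 1 - 0.25 z^-2 -> stable
stable_case = is_stable([0.0, 0.25])
```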

5.13 Suppose that a speech sound is characterized by time-varying pth order all-pole predictor coefficients that change linearly as

Image

Assume a speech segment is extracted with a rectangular window so that the segment is zero outside the interval [0,N). Determine the normal equations for the autocorrelation method of linear prediction that give the unknown parameters Image and Image. Do not solve.

5.14 Derive the minimum- and maximum-phase contributions in Equation (5.19) that arise when the autocorrelation method is applied to a sequence having a Fourier transform with a minimum- and maximum-phase component.

5.15 Suppose that we are given a linear prediction analysis/synthesis system whose synthesizer is shown in Figure 5.28. The control parameters available to the synthesizer on each frame are:

1. Pitch and voicing;

2. The first autocorrelation coefficient r[0] that is the energy in the windowed speech signal;

3. α, a pth order set of predictor coefficients, that are guaranteed to represent a stable filter.

(a) Describe a method for determining the gain A from the received parameters.

(b) Suppose that the synthesizer filter is a direct-form realization of the transfer function T(z) using the α. It is determined experimentally that interpolation of the filter parameters between frame updates is desirable. Given α1 and α2, the sets of predictor coefficients from two successive speech frames at a 20-ms frame update, as well as the first autocorrelation coefficients on the successive frames, denoted by r1[0] and r2[0], describe qualitatively a method that produces a stable pth order set of interpolated filter coefficients at a 10-ms frame update. In developing your method, assume the partial correlation coefficients ki can be generated from the reverse Levinson algorithm. Also assume there is a one-to-one correspondence between the predictor coefficients and partial correlation coefficients, i.e., we can derive one from the other. Do not prove this correspondence, but use it in the development of your method. Hint: If the partial correlation coefficients are such that |ki| < 1, then the predictor coefficients represent a stable all-pole system. Note also that the interpolation of two sets of stable predictor coefficients does not necessarily result in a stable system.

Figure 5.28 Linear prediction synthesizer.

Image

(c) Consider the lattice realization of the inverse filter corresponding to your interpolated predictor coefficients from part (b). Is the lattice filter stable? Explain.

5.16 We have seen that the lattice formulation of linear prediction bears a strong resemblance to the lossless concatenated tube model. This relationship, based on the equivalence of the Levinson recursion and the vocal tract transfer function recursion, leads to a procedure for obtaining the vocal tract tube cross-sectional areas from the partial correlation coefficients. List reasons for possible inaccurate prediction of the cross-sectional areas. Consider, for example, the validity of the planar-wave equation in individual tubes, energy loss, glottal and radiation effects, and nonlinear phenomena.

5.17 Consider the all-pole model of speech production:

Image

with

Image

In this problem, you will show that the error function

(5.64)

Image

with

Image

where S(ω) is the Fourier transform of the short-time segment of speech being modeled, leads to the autocorrelation method of linear prediction.

(a) Show that Q(ω) can be written as

(5.65)

Image

where E(ω) = A(ω)S(ω) is the Fourier transform of the prediction error sequence.

(b) Suppose that the zeros of A(z) are constrained to lie inside the unit circle. Show13 that

Image

13 One approach to this solution is through the complex cepstrum (described in Chapter 6) associated with |A(ω)|² = A(ω)A*(ω).

(c) Using your result in part (b), show that

Image

(d) Substitute Equation (5.65) into Equation (5.64) and use your result from part (c) to show that minimization of I, with respect to the unknown filter coefficients αk, is equivalent to minimization of the average squared filter output Image and hence yields the same solution as the autocorrelation method of linear prediction.

(e) Show that minimization of I with respect to the unknown A2 gives the optimal gain solution

Image

Argue, therefore, that the optimal A2 is the same gain solution of Section 5.3.4 as obtained by requiring that the energy in the all-pole impulse response h[n] equals the energy in the measurement s[n]. The error function I thus provides a rigorous way of obtaining both the unknown filter coefficients and the unknown gain.

(f) Using Figure 5.14a, illustrating the function f(Q) = e^Q − Q − 1, argue that linear prediction spectral analysis tends to fit spectral peaks more accurately than spectral valleys.

(g) Using a Taylor series expansion for f(Q) = e^Q − Q − 1, show that f(Q) is symmetric for small Q. What does this imply for spectral matching?

5.18 It has been observed that the linear prediction residual (from the inverse filter A(z)) often contains “intelligible” speech. What may this imply about an all-pole model and the linear prediction estimate of the model parameters?

5.19 Derive Equation (5.51) for the use of two analysis windows in the covariance method of linear prediction analysis. Argue that the covariance properties of the matrix Φ, with elements Φ[i, k] of Equation (5.51), are the same as for the one-window analysis case provided that the two windows are non-overlapping.

5.20 Consider the two-pole model for the glottal flow over one cycle given by

(5.66)

Image

with α and β both real, positive, and less than one, and where the region of convergence includes the unit circle. In the time domain, g[n] can be expressed as the convolution of two decaying exponentials:

(5.67)

Image

(a) Sketch g[n] and |G(ω)|. Assume α and β are close to unity, say about 0.95. Why are Equations (5.66) and (5.67) a reasonable model for the spectral magnitude, but not for the shape of the glottal flow?

(b) Explain why a better model for the glottal flow is given by

(5.68)

Image

Derive the z-transform of Image. Where are the poles of Image in relation to those of g[n]? Describe the difficulty in estimating the model parameters of Image by using the autocorrelation method of linear prediction.

(c) Suppose now that the speech waveform for a voiced segment is given by

Image

where Image is given by Equations (5.67) and (5.68), where u[n] consists of two pulses

u[n] = δ[n] + b1δ[n − P]

with P the pitch period and b1 real, positive, and less than unity, and where υ[n] is a minimum-phase vocal tract response with only one resonance:

Image

Suppose that a fourth-order linear prediction analysis (using the autocorrelation method) is applied to Image (i.e., assume no influence by u[n]) resulting in a system function Image such that

Image

What are the prediction coefficients αk, k = 1, 2, 3, 4 of the function A(ω)? Now apply the inverse filter A(ω) to the speech signal s[n]. Determine an analytical expression for the Fourier transform of the resulting source estimate. Show that the Fourier transform magnitude of your result is only a function of the pulse train u[n]. Express the Fourier transform phase in terms of the phase of the glottal pulse Image and the phase of u[n].

5.21 Suppose we model the z-transform of a measured sequence with the rational function in Equation (5.45). Consider minimizing the mean-squared error Image where s[n] is the measurement and x[n] is the inverse z-transform of our model. Map this error criterion to the frequency domain in terms of A(ω) and B(ω) using Parseval’s Theorem. Then show that least mean-squared error minimization with respect to the unknown model numerator and denominator coefficients [bk and ak of Equation (5.45)] leads to equations that are nonlinear in the unknown coefficients.

5.22 Consider the problem of estimating zeros of the numerator polynomial of a rational z-transform model. Develop a method of “inverse linear prediction” by converting zero estimation to pole estimation.

5.23 (MATLAB) In this exercise, use the voiced speech signal speech1_10k (at 10000 samples/s) in the workspace ex5M1.mat located in companion website directory Chap_exercises/chapter5. This problem illustrates the autocorrelation method of linear prediction.

(a) Window speech1_10k with a 25-ms Hamming window. Compute the autocorrelation of the resulting windowed signal and plot.

(b) Assume that two resonances represent the signal and model the vocal tract with 4 poles. Set up the autocorrelation matrix Rn, using your result from part (a). The autocorrelation matrix is of dimension 4 × 4.

(c) Solve for the linear predictor coefficients by matrix inversion.

(d) Plot the log-magnitude of the resulting frequency response:

Image

where the gain is given by Equation (5.30). Compare your result with the log-magnitude of the Fourier transform of the windowed signal. What similarities and differences do you observe?

(e) Using your estimates of the predictor coefficients from part (c), compute the prediction error sequence associated with speech1_10k and plot. From the prediction error sequence, what conclusions might one draw about the model (i.e., all-pole/impulse-train-driven) and estimation accuracy?
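Because speech1_10k lives in the book's companion workspace, the pipeline of parts (a)–(d) can be sketched on a synthetic stand-in. The Python version below (all names are ours) builds the impulse response of a known 2nd-order all-pole system, forms its autocorrelation, and solves the 2 × 2 normal equations; the same steps apply to the 4 × 4 case in the exercise:

```python
import math

# known 2nd-order all-pole system: poles at 0.9 e^{+/- j pi/4}
a1_true = 2 * 0.9 * math.cos(math.pi / 4)
a2_true = -0.81

# impulse response via the difference equation h[n] = d[n] + a1 h[n-1] + a2 h[n-2]
N = 512
h = []
for n in range(N):
    v = 1.0 if n == 0 else 0.0
    v += a1_true * (h[n - 1] if n >= 1 else 0.0)
    v += a2_true * (h[n - 2] if n >= 2 else 0.0)
    h.append(v)

def r(tau):
    # autocorrelation of the (rectangularly windowed) signal
    return sum(h[n] * h[n + tau] for n in range(N - tau))

# normal equations [r0 r1; r1 r0][a1; a2] = [r1; r2], solved by Cramer's rule
det = r(0) * r(0) - r(1) * r(1)
alpha1 = (r(0) * r(1) - r(1) * r(2)) / det
alpha2 = (r(0) * r(2) - r(1) * r(1)) / det
gain_sq = r(0) - alpha1 * r(1) - alpha2 * r(2)   # squared gain, as in Eq. (5.30)
```

Since the data here is exactly the all-pole impulse response and the window covers essentially all of it, the estimates recover the true coefficients almost perfectly; on real windowed speech, the estimates are biased in the ways explored in Exercises 5.2 and 5.7.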

5.24 (MATLAB) In this problem you will use your results from Exercise 5.23 to perform speech synthesis of the speech waveform speech1_10k in the workspace ex5M1.mat located in companion website directory Chap_exercises/chapter5.

(a) Using your estimates of the predictor coefficients from Exercise 5.23, compute an estimate of the vocal tract impulse response.

(b) Using the prediction error sequence you computed in Exercise 5.23, estimate an average pitch period of the voiced sequence speech1_10k.

(c) Using your results from parts (a) and (b), synthesize an estimate of speech1_10k. How does your waveform estimate differ from the original? Consider the minimum-phase nature of the impulse response estimate.

(d) Using the MATLAB random number generator randn.m, synthesize the “whispered” counterpart to your voiced synthesis of part (c). Using the MATLAB sound.m function, listen to your two estimates and compare them to the original.

5.25 (MATLAB) Consider a nasalized vowel where the nasal tract introduces zeros into the vocal tract system function. The source to the vocal tract is a periodic glottal flow that we assume has an effective duration L over a single glottal cycle. Under this condition, the glottal flow over one period can be modeled as L − 1 zeros, and we will assume that all of the zeros lie outside the unit circle. Suppose the vocal tract consists of N poles inside the unit circle and M zeros (due to the nasal passage) also inside the unit circle. Then we can express the system function of the nasalized vowel as

Image

Suppose also that the pitch period is P samples in duration. Our goal is to obtain the glottal waveform and poles and zeros of the vocal tract system function.

(a) Find a time interval I within a pitch period over which the speech waveform is “free” of the effect of zeros.

(b) Set up the covariance method normal equations over the interval I to solve for the vocal tract poles. (The corresponding polynomial coefficients is sufficient.) Comment on the drawback of this method when a speech formant has a very wide bandwidth.

(c) We have assumed that zeros of the glottal flow are maximum phase and the zeros due to the nasal branch are minimum phase. Using your result from part (b), describe a method for obtaining the unknown zeros by inverse filtering. (The corresponding polynomial coefficients are sufficient.) Then propose a method for separating the minimum and maximum phase zeros. Hint: Consider applying linear prediction analysis to the reciprocal of the zero polynomial.

(d) Implement parts (b) and (c) in MATLAB using the synthetic waveform speech_10k (at 10000 samples/s) in workspace ex5M2.mat located in the companion website directory Chap_exercises/chapter5. Assume 4 vocal tract poles, 2 vocal tract zeros, a glottal flow width of 20 samples, and a pitch period of 200 samples. You should compute the predictor coefficients associated with the vocal tract poles and then inverse filter. Then find the coefficients of the vocal tract numerator polynomial and the glottal flow. You will find in ex5M2.readme located in Chap_exercises/chapter5 a description of how the synthetic waveform was produced and some suggested MATLAB routines of possible use.

(e) Repeat the MATLAB exercise in part (d), but now use the basic covariance method, i.e., the prediction-error window lies over multiple pitch periods (not synchronized within a glottal cycle). Here your goal is to match the spectrum of the combined vocal tract response and glottal flow, and not to separate these components. Therefore, show only the result of your inverse filtering. Keep in mind that you need to use a large number of poles to model the vocal tract zeros. Plot the log-magnitude spectrum of one glottal cycle of the result of your inverse filter and compare (superimpose) it with the result of your inverse filter over one glottal cycle for the synchronized covariance method of part (d). Explain your results.

Bibliography

[1] T.V. Ananthapadmanabha and G. Fant, “Calculation of True Glottal Flow and Its Components,” Speech Communication, vol. 1, pp. 167–184, 1982.

[2] B.S. Atal and S.L. Hanauer, “Speech Analysis and Synthesis by Linear Prediction of the Speech Waveform,” J. Acoustical Society of America, vol. 50, pp. 637–655, 1971.

[3] D.G. Childers and C. Ahn, “Modeling the Glottal Volume-Velocity Waveform for Three Voice Types,” J. Acoustical Society of America, vol. 97, no. 1, pp. 505–519, Jan. 1995.

[4] D.G. Childers and C.K. Lee, “Vocal Quality Factors: Analysis, Synthesis, and Perception,” J. Acoustical Society of America, vol. 90, no. 5, pp. 2394–2410, Nov. 1991.

[5] K.E. Cummings and M.A. Clements, “Analysis of Glottal Waveforms Across Stress Styles,” Proc. Int. Conf. Acoustics, Speech, and Signal Processing, pp. 369–372, Albuquerque, NM, 1990.

[6] J.E. Dennis, D.M. Gay, and R.E. Welsch, “An Adaptive Nonlinear Least-Squares Algorithm,” ACM Trans. on Mathematical Software, vol. 7, no. 3, pp. 348–368, Sept. 1981.

[7] J.E. Dennis, D.M. Gay, and R.E. Welsch, “Algorithm 573: NL2SOL—An Adaptive Nonlinear Least-Squares Algorithm,” ACM Trans. on Mathematical Software, vol. 7, no. 3, pp. 369–383, Sept. 1981.

[8] G. Fant, “Glottal Flow: Models and Interaction,” J. Phonetics, vol. 14, pp. 393–399, 1986.

[9] J.L. Flanagan, Speech Analysis, Synthesis, and Perception, Springer-Verlag, New York, NY, 1972.

[10] J.L. Flanagan and K. Ishizaka, “Computer Model to Characterize the Air Volume Displaced by the Vibrating Vocal Cords,” J. Acoustical Society of America, vol. 63, pp. 1559–1565, Nov. 1978.

[11] F. Itakura and S. Saito, “A Statistical Method for Estimation of Speech Spectral Density and Formant Frequencies,” Elec. and Comm. in Japan, vol. 53-A, no. 1, pp. 36–43, 1970.

[12] G.E. Kopec, A.V. Oppenheim, and J.M. Tribolet, “Speech Analysis by Homomorphic Prediction,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–25, no. 1, pp. 40–49, Feb. 1977.

[13] N. Levinson, “The Wiener RMS Error Criterion in Filter Design and Prediction,” J. Mathematical Physics, vol. 25, pp. 261–278, Jan. 1947.

[14] J. Makhoul, “Linear Prediction: A Tutorial Review,” Proc. IEEE, vol. 63, no. 4, pp. 561–580, April 1975.

[15] J.D. Markel and A.H. Gray, Linear Prediction of Speech, Springer-Verlag, New York, NY, 1976.

[16] J.H. McClellan, “Parametric Signal Modeling,” chapter in Advanced Topics in Signal Processing, J.S. Lim and A.V. Oppenheim, eds., Prentice Hall, Englewood Cliffs, NJ, 1988.

[17] A.V. Oppenheim and R.W. Schafer, Digital Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1975.

[18] A.V. Oppenheim and R.W. Schafer, Discrete-Time Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1989.

[19] M.D. Plumpe, T.F. Quatieri, and D.A. Reynolds, “Modeling of the Glottal Flow Derivative Waveform with Application to Speaker Identification,” IEEE Trans. Speech and Audio Processing, vol. 7, no. 5, pp. 569–586, Sept. 1999.

[20] M.R. Portnoff, A Quasi-One-Dimensional Digital Simulation for the Time-Varying Vocal Tract, SM Thesis, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, May 1973.

[21] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, NJ, 1978.

[22] M.R. Sambur, A.E. Rosenberg, L.R. Rabiner, and C.A. McGonegal, “On Reducing the Buzz in LPC Synthesis,” J. Acoustical Society of America, vol. 63, no. 3, pp. 918–924, March 1978.

[23] J.L. Shanks, “Recursion Filters for Digital Processing,” Geophysics, vol. 32, pp. 33–51, 1967.

[24] K. Steiglitz, “On the Simultaneous Estimation of Poles and Zeros in Speech Analysis,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 25, pp. 229–234, June 1977.

[25] G. Strang, Linear Algebra and Its Applications, Academic Press, New York, NY, 1976.

[26] C.W. Therrien, Discrete Random Signals and Statistical Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1992.

[27] H. Wakita, “Estimation of the Vocal Tract Shape by Optimal Inverse Filtering and Acoustic/Articulatory Conversion Methods,” SCRL Monograph, no. 9, July 1972.

[28] D.Y. Wong, J.D. Markel, and A.H. Gray, Jr., “Least-Squares Glottal Inverse Filtering from the Acoustic Speech Waveform,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–27, no. 4, pp. 350–355, Aug. 1979.

[29] L.R. Yanguas, T.F. Quatieri, and F. Goodman, “Implications of Glottal Source for Speaker and Dialect Identification,” Proc. Int. Conf. Acoustics, Speech, and Signal Processing, vol. 2, pp. 813–816, Phoenix, AZ, 1999.

[30] B. Yegnanarayana and R.N.J. Veldhuis, “Extraction of Vocal-Tract System Characteristics from Speech Signals,” IEEE Trans. Speech and Audio Processing, vol. 6, no. 4, pp. 313–327, July 1998.
