Chapter 12
Speech Coding

12.1 Introduction

We define speech coding as any process that leads to the representation of analog waveforms by sequences of binary digits (i.e., 0’s and 1’s) or bits. Although high-bandwidth channels and networks are becoming more viable (as with fiber optics), speech coding for bit-rate reduction has retained its importance. This is due to the need for low rates with, for example, cellular and Internet communications over constrained-bandwidth channels, and with voice storage playback systems. In addition, coded speech, even at high bit rates, is less sensitive than analog signals to transmission noise and is easier to error-protect, encrypt, multiplex, and packetize. An example of speech coding in a digital telephone communication scenario is illustrated in Figure 12.1. An analog waveform is first digitized with an A/D converter and then analyzed with one of the speech analysis techniques of the previous chapters. The resulting speech parameters are quantized and then encoded into a bit pattern that may also be encrypted. Next, the bits are converted back to an analog communication signal, i.e., they are modulated using, for example, phase or frequency shift keying [65], and transmitted over an analog telephone channel. With a digital link, e.g., a digital satellite channel, this step is not performed. Because the channel may introduce distortion, the encrypted bits are error-protected before modulation. Finally, at the receiver the inverse operations are performed.

Speech coders are typically categorized in three groups: waveform coders, hybrid coders, and vocoders. Waveform coders quantize the speech samples directly and operate at high bit rates in the range 16–64 kbps (bps, denoting bits per second). Hybrid coders are partly waveform-based and partly speech model-based and operate in the 2.4–16 kbps range. Finally, vocoders are largely model-based, operate at the low bit rate range of 1.2–4.8 kbps, and tend to be of lower quality than waveform and hybrid coders. In this chapter, a representative set of coders from each class is developed, using speech modeling principles and analysis/synthesis methods described throughout the text.

Figure 12.1 An example digital telephone communication system.

Image

Observe that we have used the word “quality” without definition. We can think of quality in terms of the closeness of the processed speech (e.g., coded, enhanced, or modified) to the original speech or to some other desired speech waveform (e.g., a time-scaled waveform). Quality has many dimensions. For example, a processed speech waveform can be rated in terms of naturalness (e.g., without a machine-like characteristic), background artifacts (e.g., without interfering noise or glitches), intelligibility (e.g., understandability of words or content), and speaker identifiability (i.e., fidelity of speaker identity). Different subjective and objective tests have been developed to measure these different attributes of quality. For example, in subjective testing, based on opinions formed from comparative listening tests, the diagnostic rhyme test (DRT) measures intelligibility, while the diagnostic acceptability measure (DAM) and mean opinion score (MOS) tests provide a more complete quality judgment. Examples of objective tests include the segmental signal-to-noise ratio (SNR), where the average of SNR over short-time segments is computed, and the articulation index that relies on an average SNR across frequency bands. Although useful in waveform coders, SNR in the time domain is not always meaningful, however, for model-based coders because the phase of the original speech may be modified, disallowing waveform differencing (Exercise 12.19). Alternative objective measures have thus been developed that are based on the short-time Fourier transform magnitude only. A more complete study of subjective and objective measures can be found in the literature, as in [17],[70], and a number of these tests will be further described as they are needed. Our purpose here is to give the reader a qualitative understanding of the term “quality” to be used in this chapter in describing the output of speech coders (as well as the output of speech processing systems in other parts of the text); a careful assessment of individual quality attributes for each coder is beyond our scope.

We begin in Section 12.2 with statistical models of speech that form the backbone of coding principles and lead naturally to the waveform coding and decoding techniques of Section 12.3. These methods are based on one-dimensional scalar quantization where each sample value is coded essentially independently. An optimal method of scalar quantization, referred to as the Max quantizer, is derived. Extensions of this basic approach include companding, adaptive quantization, and differential quantization, and they form the basis for waveform coding standards at high rates such as 32–64 kbps. In Section 12.4, scalar quantization is then generalized to vector quantization that quantizes a set of waveform or speech parameters rather than single scalars, exploiting the correlation across values. Vector quantization is an important tool in the development of mid-rate and low-rate hybrid and vocoding strategies. In Section 12.5, we move to the frequency domain and describe two important hybrid coding techniques: subband coding and sinusoidal coding that are based on the STFT and sinewave analysis/synthesis techniques in Chapters 8 and 9, respectively. These coding strategies provide the basis for 2.4–16 kbps coding.

Section 12.6 then develops a number of all-pole model-based coding methods using the linear prediction analysis/synthesis of Chapter 5. The section opens with one of the early classic vocoding techniques at 2.4 kbps, referred to as linear prediction coding (LPC), which is based on a simple binary impulse/noise source model. To design higher-quality linear-prediction-based speech coders at such low rates or at higher rates, generalizations of the binary excitation model are required. One such approach is the mixed excitation linear prediction (MELP) vocoder, which uses different blends of impulses and noise in different frequency bands and also exploits many excitation properties seen in earlier chapters, such as jittery and irregular pitch pulses and waveform properties that reflect the nonlinear coupling between the source and vocal tract. Another such approach is multipulse LPC, which uses more than one impulse per glottal cycle to model the voiced speech source and a possibly random set of impulses to model the unvoiced speech source. An important extension of multipulse LPC is code-excited linear prediction (CELP), which models the excitation as one of a number of random sequences, or "codewords," together with periodic impulses. Although these latter two approaches, studied in the final Section 12.7 of this chapter, seek a more accurate representation of the speech excitation, they also address the inadequacies in the basic linear all-pole source/filter model; as such, these methods are referred to as linear prediction residual coding schemes.

12.2 Statistical Models

When applying statistical notions to speech signals, it is necessary to estimate probability densities and averages (e.g., the mean, variance, and autocorrelation) from the speech waveform, which is viewed as a random process. One approach to estimating a probability density function (pdf) of x[n] is through the histogram. Assuming the speech waveform is an ergodic process (Appendix 5.A), in obtaining the histogram we count the number of occurrences of the speech sample values in each of a set of contiguous amplitude ranges covering all speech values. We do this for many speech samples over a long time duration, and then normalize the area of the resulting curve to unity.

A histogram of the speech waveform was obtained by Davenport [13] and also by Paez and Glisson [66]. The histogram was shown to approximate a gamma density which is of the form [71]

px(x) = ( √3 / (8π σx |x|) )^(1/2) exp( −√3 |x| / (2σx) )

where σx is the standard deviation of the pdf. An even simpler approximation is given by the Laplacian pdf of the form

px(x) = ( 1 / (√2 σx) ) exp( −√2 |x| / σx )

(In the above pdfs, we have simplified the x[n] notation to x.) The two density approximations, as well as the speech waveform histogram, are shown in Figure 12.2. Features of the speech pdf include a distinct peak at zero, which is the result of pauses and low-level speech sounds. In spite of this characteristic, there is a significant probability of high amplitudes, i.e., out to about 4σx. Histogram analysis is useful in determining the statistical properties not only of the speech waveform, but also of speech model parameters such as linear prediction and cepstral coefficients, or of non-parametric representations such as filter-bank outputs. The mean, variance, and autocorrelation of the speech pressure waveform can be obtained from its pdf or from long-time averages of the waveform itself when ergodicity holds (Appendix 5.A). Alternatively, given the nonstationarity of speech, a counterpart pdf over a short time duration, and associated short-time measures of the mean, variance, and autocorrelation (as we saw in Chapter 5), may be a more appropriate characterization for certain coding strategies.
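To make the histogram procedure concrete, the following minimal numpy sketch estimates an amplitude pdf and compares it to a Laplacian fit with the same standard deviation. The array x, the function names, and the bin count are illustrative assumptions; here a synthetic Laplacian sequence stands in for a long, zero-mean speech waveform.

```python
import numpy as np

def empirical_pdf(x, num_bins=200):
    """Histogram estimate of the amplitude pdf, normalized to unit area."""
    counts, edges = np.histogram(x, bins=num_bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, counts

def laplacian_pdf(x, sigma):
    """Laplacian density with standard deviation sigma and zero mean."""
    return np.exp(-np.sqrt(2.0) * np.abs(x) / sigma) / (np.sqrt(2.0) * sigma)

# Stand-in for a long, zero-mean speech waveform; replace with real samples.
x = np.random.default_rng(0).laplace(scale=1.0 / np.sqrt(2.0), size=100_000)
centers, pdf_est = empirical_pdf(x)
print(np.max(np.abs(pdf_est - laplacian_pdf(centers, np.std(x)))))
```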

12.3 Scalar Quantization

Assume that a speech waveform has been lowpass-filtered and sampled at a suitable rate giving a sequence x[n] and that the sampling is performed with infinite amplitude precision, i.e., the A/D is ideal. We can view x[n] as a sample sequence of a discrete-time random process. In the first step of the waveform coding process, the samples are quantized to a finite set of amplitudes, which are denoted by x̂[n]. Associated with the quantizer is a quantization step size Δ that we will specify shortly. This quantization allows the amplitudes to be represented by a finite set of bit patterns, or symbols, which is the second step in the coding process. The mapping of x̂[n] to a finite set of symbols is called encoding the quantized values, and this yields a sequence of codewords, denoted by c[n], as illustrated in Figure 12.3a. Likewise, a decoder takes a sequence of codewords c′[n], the prime denoting a codeword that may be altered in the transmission process, and transforms them back to a sequence of quantized samples, as illustrated in Figure 12.3b. In this section, we look at the basic principles of quantizing individual sample values of a waveform, i.e., the technique of scalar quantization, and then extend these principles to adaptive and differential quantization techniques that provide the basis of many waveform coders used in practice.

Figure 12.2 Comparison of histograms from real speech and gamma and Laplacian probability density fits to real speech. The densities are normalized to have mean mx = 0 and variance σx² = 1. Dots (and the corresponding fitted curve) denote the histogram of the speech.

SOURCE: M.D. Paez and T.H. Glisson, “Minimum Mean-Squared Error Quantization in Speech” [66]. ©1972, IEEE. Used by permission.

Image

12.3.1 Fundamentals

Suppose we quantize a signal amplitude into M levels. (We use the term “amplitude” to mean signed signal value.) We denote the quantizer operator by Q(x); specifically,

x̂[n] = Q(x[n])

where x̂_i denotes the M possible reconstruction levels, also referred to as quantization levels, with 1 ≤ i ≤ M, and where x_i denotes the M + 1 possible decision levels with 0 ≤ i ≤ M. Therefore, if x_{i−1} < x[n] ≤ x_i, then x[n] is quantized (or mapped) to the reconstruction level x̂_i. As above, we denote the quantized x[n] by x̂[n]. We call this scalar quantization because each value (i.e., scalar) of the sequence is individually quantized, in contrast to vector quantization where a group of values of x[n] are coded as a vector. Scalar quantization is best illustrated through an example.

Figure 12.3 Waveform (a) coding and (b) decoding. Coding involves quantizing x[n] to obtain a sequence of samples x̂[n] and encoding them to codewords c[n]. Decoding takes a sequence of codewords c′[n] back to a sequence of quantized samples x̂′[n]. c′[n] denotes the codeword c[n] that may be distorted by a channel and Δ is the quantization step size.

SOURCE: L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals [71]. ©1978, Pearson Education, Inc. Used by permission.

Image

Example 12.1       Suppose that the number of reconstruction levels M = 4 and assume that the amplitude of the input x[n] falls in the range [0, 1]. In addition, assume that the decision and reconstruction levels are equally spaced. Then the decision levels are {0, 1/4, 1/2, 3/4, 1}, as shown in Figure 12.4. There are many possibilities for equally spaced reconstruction levels, one of which is given in Figure 12.4 as {1/8, 3/8, 5/8, 7/8}.

Example 12.1 illustrates the notion of uniform quantization. Simply stated, a uniform quantizer is one whose decision and reconstruction levels are uniformly spaced. Specifically, we write the uniform quantizer as

(12.1)

x̂_i − x̂_{i−1} = Δ,     x_i − x_{i−1} = Δ

where Δ is the step size equal to the spacing between two consecutive decision levels, which is the same as the spacing between two consecutive reconstruction levels (Exercise 12.1).

Having selected the reconstruction levels, we then attach to each reconstruction level a symbol, i.e., the codeword. The collection of codewords is called the codebook. In most cases, it is convenient to use binary numbers to represent the quantized samples. Figure 12.4 illustrates the codeword assignment for the 4-level uniform quantizer of Example 12.1, given by binary codewords [00, 01, 10, 11]. We think of this set as a 2-bit binary codebook because for each of the two places in the binary number we can select a 0 or 1, so that there exist 2² = 4 possible binary numbers. More generally, it is possible with a B-bit binary codebook to represent 2^B different quantization (or reconstruction) levels; that is, B bits give 2^B quantization levels. Bit rate in a coded representation of a signal is defined as the number of bits B per sample multiplied

Figure 12.4 An example of uniform 2-bit quantization where the reconstruction and decision levels are uniformly spaced. The number of reconstruction levels M = 4 and the input falls in the range [0, 1].

Image

by the number of samples per second fs, i.e.,

I = Bfs

which is sometimes referred to as the “information capacity” required to transmit or store the digital representation. Finally, the decoder inverts the coder operation, taking the codeword c[n] back to a quantized amplitude value x̂[n]. Often the goal in speech coding/decoding is to keep the bit rate as low as possible while maintaining a required level of quality. Because the speech sampling rate is fixed in most applications, this goal implies that the bit rate be reduced by decreasing the number of bits per sample.

In designing the decision regions for a scalar quantizer, we must consider the maximum value of the sequence. In particular, it is typical to assume that the range of the speech signal is proportional to the standard deviation of the signal. Specifically, in designing decision regions, we often assume that −4σx ≤ x[n] ≤ 4σx, where σx is the signal’s standard deviation. There will, however, occur values outside of this region. For example, if σx is derived under the assumption that speech values are characterized by a Laplacian pdf, then approximately 0.35% of speech samples fall outside of the range −4σx ≤ x[n] ≤ 4σx. Nevertheless, under this assumed range we select parameters for the uniform quantizer. If we want to use a B-bit binary codebook, then we want the number of quantization (reconstruction) levels to be 2^B. Denoting the (almost) maximum signal value xmax = 4σx, then with a uniform quantization step size Δ, we have

2xmax = Δ · 2^B

or

2xmax / Δ = 2^B

so that the quantization step size is given by

(12.2)

Δ = 2xmax / 2^B

The size of Δ is related to the notion of quantization noise.
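As a concrete illustration of the step size of Equation (12.2) and the codeword assignment of Figure 12.3, here is a minimal sketch of a B-bit uniform (midrise) quantizer; the function name, the clipping at the upper range edge, and the midpoint reconstruction convention are assumptions made for illustration.

```python
import numpy as np

def uniform_quantize(x, n_bits, x_max):
    """B-bit uniform midrise quantizer over [-x_max, x_max).
    Step size follows Eq. (12.2): delta = 2*x_max / 2**n_bits."""
    levels = 2 ** n_bits
    delta = 2.0 * x_max / levels
    # Codeword: integer index 0 .. levels-1 of the decision interval containing x.
    idx = np.floor((np.clip(x, -x_max, x_max - 1e-12) + x_max) / delta).astype(int)
    # Reconstruction level: midpoint of that decision interval.
    x_hat = -x_max + (idx + 0.5) * delta
    return idx, x_hat

# Example: the 2-bit quantizer of Example 12.1, shifted from [0, 1) to [-0.5, 0.5).
codes, recon = uniform_quantize(np.array([0.1, 0.4, 0.6, 0.9]) - 0.5, 2, 0.5)
print(codes, recon + 0.5)   # codes 0..3, reconstruction levels 1/8, 3/8, 5/8, 7/8
```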

12.3.2 Quantization Noise

There are two classes of quantization noise, also referred to as quantization error. The first is granular distortion. Let

x̂[n] = x[n] + e[n]

where x[n] is the unquantized signal and e[n] is the quantization noise. Consider a quantization step size Δ. For this step size, the magnitude of the quantization noise e[n] can be no greater than Δ/2, i.e.,

|e[n]| ≤ Δ/2

This can be seen by plotting the error e[n] = x̂[n] − x[n] as a function of the input, as illustrated in Figure 12.5.

The second form of quantization noise is overload distortion. From our maximum-value constraint, xmax = 4σx (with −4σx ≤ x ≤ 4σx). Assuming a Laplacian pdf, we noted that 0.35% of the speech samples fall outside of the range of the quantizer. These clipped samples incur a quantization error in excess of Δ/2. Nevertheless, the number of clipped samples is so small that it is common to neglect the infrequent large errors in theoretical calculations.

Figure 12.5 Quantization noise for a linearly changing input x[n] = n. The quantization noise is given by e[n] = x̂[n] − x[n], where x̂[n] = Q(x[n]), with Q the quantization operator. For this case of a linearly-changing input, the quantization noise is seen to be signal-dependent.

Image

In some applications, it is useful to work with a statistical model of quantization noise. In doing so, we assume that the quantization error is an ergodic white-noise random process. The autocorrelation function of such a process is expressed as

r_e[m] = E(e[n]e[n + m]) = σe² δ[m]

where the operator E denotes expected value. That is, the process is uncorrelated (Appendix 5.A). We also assume that the quantization noise and the input signal are uncorrelated, i.e.,

E(x[n]e[n + m]) = 0,           for all m.

Finally, we assume that the pdf of the quantization noise is uniform over the quantization interval, i.e.,

p_e(e) = 1/Δ,     −Δ/2 ≤ e ≤ Δ/2
       = 0,        otherwise

We must keep in mind, however, that these assumptions are not always valid. Consider for example a signal that is constant, linearly changing, or, more generally, slowly varying. In particular, when the signal is linearly changing, then e[n] also changes linearly and is signal-dependent, violating the assumption that the noise is uncorrelated with itself (i.e., white) as well as with the input. An example of this signal type was illustrated in Figure 12.5. Correlated quantization noise can be aurally quite annoying. On the other hand, the assumptions that the noise is uncorrelated with itself and with the signal are roughly valid when the signal fluctuates rapidly among all quantization levels and when Δ is small. Then the signal traverses many quantization steps over a short-time interval and the quantization error approaches a white-noise process with an impulsive autocorrelation and flat spectrum [33],[71].

It is interesting to observe that one can force e[n] to be white and uncorrelated with x[n] by the deliberate addition of white noise to x[n] prior to quantization. This is called dithering or the Roberts’ pseudo noise technique [72]. Such decorrelation can be useful in improving the performance of the quantizer and in improving the perceptual quality of the quantization noise. This technique has been used not only with speech and audio, but also with image signals.1

1 Dithering is used in image coding to reduce the signal-dependent quantization noise that can give an image a “contouring” distortion, which is the result of abrupt changes of output level in a slowly changing gray area of the input scene. The use of dithering breaks up these contours [41].
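A minimal sketch in the spirit of subtractive (Roberts-style) dithering is shown below; the function name and the use of a seeded random generator as a stand-in for a pseudo-noise sequence shared by coder and decoder are assumptions.

```python
import numpy as np

def dither_quantize(x, delta, seed=0):
    """Subtractive dither sketch: add pseudo-noise uniform on [-delta/2, delta/2)
    before rounding to the nearest multiple of delta, then subtract the same
    noise after decoding.  The error x_hat - x then tends toward white noise
    that is uncorrelated with x, even for slowly varying inputs."""
    d = np.random.default_rng(seed).uniform(-delta / 2, delta / 2, size=np.shape(x))
    x_hat = delta * np.round((x + d) / delta) - d   # decoder regenerates d from the seed
    return x_hat
```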

In order to quantify the severity of the quantization noise, we define signal-to-noise ratio (SNR) by relating the strength of the signal to the strength of the quantization noise. As such, we define SNR as

SNR = σx²/σe² ≈ [ Σ_{n=0}^{N−1} x²[n] ] / [ Σ_{n=0}^{N−1} e²[n] ]

where the latter estimates, over a time duration N, are based on a zero-mean ergodic assumption. Given our assumed quantizer range 2xmax and a quantization interval Δ = 2xmax/2^B for a B-bit quantizer, and the uniform pdf, it is possible to show that (Exercise 12.2)

(12.3)

σe² = Δ²/12 = xmax² / (3 · 2^(2B))

We can then express the SNR as

SNR = σx²/σe² = 3 · 2^(2B) (σx/xmax)²

or in decibels (dB) as

(12.4)

SNR(dB) = 10 log10(σx²/σe²) ≈ 6.02B + 4.77 − 20 log10(xmax/σx)

Because xmax = 4σx, then

SNR(dB) ≈ 6B − 7.2

and thus each bit contributes 6 dB to the SNR.
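The 6-dB-per-bit rule can be checked numerically. In the sketch below, a Laplacian random sequence stands in for speech and a simple midrise quantizer with range ±4σx is assumed; the measured SNR should track 6B − 7.2 dB closely.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.laplace(scale=1 / np.sqrt(2), size=200_000)   # unit-variance stand-in for speech
x_max = 4 * np.std(x)                                  # assumed quantizer range
for B in (6, 8, 10):
    delta = 2 * x_max / 2 ** B                         # Eq. (12.2)
    x_hat = np.clip(delta * (np.floor(x / delta) + 0.5), -x_max, x_max)  # midrise quantizer
    snr_db = 10 * np.log10(np.mean(x ** 2) / np.mean((x - x_hat) ** 2))
    print(f"B = {B:2d}   measured SNR = {snr_db:5.1f} dB   6B - 7.2 = {6 * B - 7.2:5.1f} dB")
```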

This simple uniform quantization scheme is called pulse code modulation (PCM) [33],[71]. Here B bits of information per sample are transmitted as a codeword. The advantages of the scheme are that it is instantaneous, i.e., there is no coding delay, and it is not signal-specific, e.g., it does not distinguish between speech and music. A disadvantage is that at least 11 bits are required for “toll quality,” i.e., equivalent to typical telephone quality. For a sampling rate of 10,000 samples/s, for example, the required bit rate is I = (11 bits) × (10,000 samples/s) = 110,000 bps in transmission systems.

Example 12.2       Consider a compact disc (CD) player that uses 16-bit PCM. This gives an SNR = 96 − 7.2 dB = 88.8 dB for a bit rate of 320,000 bps. This high bit rate is not of concern because space is not a limitation in this medium.

Although uniform quantization is quite straightforward and appears to be a natural approach, it may not be optimal, i.e., the SNR may not be as high as we could obtain for a given number of decision and reconstruction levels. To understand this limitation, suppose that the amplitude of x[n] is much more likely to be in one particular region than in another, e.g., low values occurring much more often than high values. This certainly is the case for a speech signal, given the speech pdf of Figure 12.2. Large values occur relatively rarely, corresponding to a very large peak-to-rms ratio of about 15 dB. (Recall the definition of peak-to-rms value in Chapter 9, Section 9.5.2.) Thus it seems that we are not effectively utilizing decision and reconstruction levels with uniform intervals over ±xmax. Rather, for a random variable x[n] with such a pdf, intuition tells us to select small intervals where the probability of occurrence is high and large intervals where the probability of occurrence is low. Quantization in which reconstruction and decision levels do not have equal spacing is called nonuniform quantization. A nonuniform quantizer that is optimal (in a least-squared-error sense) for a particular pdf is referred to as the Max quantizer; an example is illustrated in Figure 12.6 for a Laplacian pdf, where spacing is seen to be finer for low signal values than for high values. Scalar quantization of speech based on this method can achieve toll quality at 64 kbps and 32 kbps. We study the optimal Max quantizer in the following section.

Figure 12.6 3-bit nonuniform quantizer: (a) Laplacian pdf; (b) decision and reconstruction levels.

SOURCE: L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals [71]. ©1978, Pearson Education, Inc. Used by permission.

Image

12.3.3 Derivation of the Max Quantizer

Max in 1960 [48] solved the following problem in scalar quantization: for a random variable with a known pdf and a fixed number of quantization levels, find the decision and reconstruction levels that minimize the mean-squared quantization error.

To determine the optimal (generally nonuniform) quantizer in this sense for a speech sequence, suppose x is a random variable denoting a value of the sequence x[n] with pdf px(x). For simplicity, we have removed the time notation. Then, using the minimum MSE criterion, we determine the x_i and x̂_i by minimizing

(12.5)

D = E[(x̂ − x)²] = ∫_{−∞}^{∞} (x̂ − x)² px(x) dx

Noting that x̂ = Q[x] takes on one of the M reconstruction levels x̂_i, we can write D as

D = Σ_{i=1}^{M} ∫_{x_{i−1}}^{x_i} (x̂_i − x)² px(x) dx

where the integral is divided into contributions to D over each decision interval (Figure 12.7).

To minimize D, the optimal decision and reconstruction levels must satisfy

(12.6)

∂D/∂x̂_k = 0,     k = 1, 2, . . . , M

(12.7)

∂D/∂x_k = 0,     k = 1, 2, . . . , M − 1

Figure 12.7 Contribution of the MSE over each decision interval in the derivation of the Max quantizer.

Image

where we assume (Figure 12.8)

x0 = −∞,         xM = +∞

In order to differentiate in Equations (12.6) and (12.7), we note that there are two contributions to the sum for each k. For Equation (12.7), we thus have

∂D/∂x_k = ∂/∂x_k [ ∫_{x_{k−1}}^{x_k} (x̂_k − x)² px(x) dx + ∫_{x_k}^{x_{k+1}} (x̂_{k+1} − x)² px(x) dx ]

and flipping the limits on the second term,

∂D/∂x_k = ∂/∂x_k [ ∫_{x_{k−1}}^{x_k} (x̂_k − x)² px(x) dx − ∫_{x_{k+1}}^{x_k} (x̂_{k+1} − x)² px(x) dx ]

Now, recall the Fundamental Theorem of Calculus:

d/dα ∫_{a}^{α} f(x) dx = f(α)

Therefore,

∂D/∂x_k = (x̂_k − x_k)² px(x_k) − (x̂_{k+1} − x_k)² px(x_k) = 0

and thus with some algebraic manipulation, we obtain

(12.8)

x_k = (x̂_k + x̂_{k+1}) / 2

The optimal decision level x_k is then the average of the reconstruction levels x̂_k and x̂_{k+1}.

Figure 12.8 Nonuniform quantization example with number of reconstruction levels M = 6.

Image

To solve (12.6), we have

∂D/∂x̂_k = ∂/∂x̂_k ∫_{x_{k−1}}^{x_k} (x̂_k − x)² px(x) dx = 2 ∫_{x_{k−1}}^{x_k} (x̂_k − x) px(x) dx

and setting this result to zero

∫_{x_{k−1}}^{x_k} (x̂_k − x) px(x) dx = 0

or

x̂_k ∫_{x_{k−1}}^{x_k} px(x) dx = ∫_{x_{k−1}}^{x_k} x px(x) dx

and thus

x̂_k = [ ∫_{x_{k−1}}^{x_k} x px(x) dx ] / [ ∫_{x_{k−1}}^{x_k} px(x) dx ]

The optimal reconstruction level x̂_k is the centroid of px(x) over the interval x_{k−1} ≤ x ≤ x_k (Figure 12.9). Alternatively,

(12.9)

x̂_k = ∫_{x_{k−1}}^{x_k} x p̃x(x) dx,     where p̃x(x) = px(x) / ∫_{x_{k−1}}^{x_k} px(α) dα

which is interpreted as the mean value of x over the interval x_{k−1} ≤ x ≤ x_k for the normalized pdf p̃x(x).

Figure 12.9 Centroid calculation in determining the optimal MSE reconstruction level.

Image

Solving Equations (12.8) and (12.9) simultaneously for x_k and x̂_k is a nonlinear problem in these two variables. For a known pdf, Max proposed an iterative technique involving a numerical solution to the required integrations [48]. This technique, however, is cumbersome and, furthermore, requires obtaining the pdf, which can be difficult. An alternative approach proposed by Lloyd [43], not requiring an explicit pdf, is described in a following section where we generalize the optimal scalar quantizer to more than one dimension.
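For a known pdf, the pair of conditions can be iterated numerically; the sketch below is one such Lloyd–Max design on a dense amplitude grid, with the grid limits, initialization, and iteration count chosen arbitrarily for illustration rather than taken from [48].

```python
import numpy as np

def lloyd_max(pdf, x_grid, n_levels, n_iter=100):
    """Numerical Lloyd-Max design: alternate Eq. (12.8) (decision levels are midpoints
    of adjacent reconstruction levels) and Eq. (12.9) (reconstruction levels are pdf
    centroids over each decision interval), with integrals done on a dense grid."""
    p = pdf(x_grid)
    p = p / np.trapz(p, x_grid)                    # normalize the sampled pdf
    r = np.linspace(x_grid[0], x_grid[-1], n_levels + 2)[1:-1]   # initial recon levels
    for _ in range(n_iter):
        d = 0.5 * (r[:-1] + r[1:])                 # Eq. (12.8)
        edges = np.concatenate(([x_grid[0]], d, [x_grid[-1]]))
        for k in range(n_levels):                  # Eq. (12.9): centroid of each cell
            m = (x_grid >= edges[k]) & (x_grid <= edges[k + 1])
            r[k] = np.trapz(x_grid[m] * p[m], x_grid[m]) / np.trapz(p[m], x_grid[m])
    return d, r

# 3-bit (M = 8) design for a unit-variance Laplacian pdf, as in Figure 12.6.
lap = lambda x: np.exp(-np.sqrt(2.0) * np.abs(x)) / np.sqrt(2.0)
decision, recon = lloyd_max(lap, np.linspace(-8.0, 8.0, 4001), 8)
print(np.round(recon, 3))
```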

We now look at an example of the Max quantizer designed using the iterative solution proposed by Max, comparing the quantization schemes that result from uniform and Laplacian pdfs [48],[66],[71].

Example 12.3       For a uniform pdf, a set of uniform quantization levels results from the Max quantizer, as expected, because all values of the signal are equally likely (Exercise 12.5). Consider, however, a nonuniform Laplacian pdf with mean zero and variance σ² = 1, as illustrated in Figure 12.6a, and let the number of quantization levels M = 8. The resulting optimal quantization levels were shown in Figure 12.6b, where we see more closely spaced decision and reconstruction levels for low values of x[n] because, with a Laplacian pdf of zero mean, low signal values occur most often.

For a certain number of bits, the variance of the quantization noise for the uniform quantizer is always greater than the optimal design (or equal if the pdf is uniform). Generally, the more the pdf deviates from being uniform, the higher the gain in this sense from a nonuniform quantizer; the performance gain also increases with an increasing number of bits [33].

In this section, we have designed an optimal quantizer for a given signal. An alternative quantizer design strategy is to achieve the effect of nonuniform quantization by designing the signal to match a uniform quantizer, rather than designing the quantizer to match the signal. Companding is one such approach described in the following section.

12.3.4 Companding

An alternative to the nonuniform quantizer is companding, a method that is suggested by the fact that the uniform quantizer is optimal for a uniform pdf. The approach is illustrated in Figure 12.10. In the coding stage, a nonlinearity is applied to the waveform x[n] to form a new sequence g[n] whose pdf is uniform. A uniform quantizer is then applied, giving ĝ[n], which at the decoder is inverted with the inverse nonlinearity. One can show that the following nonlinear transformation T results in the desired uniform density:

Figure 12.10 The method of companding in coding and decoding: (a) coding stage consisting of a nonlinearity followed by uniform quantization and encoding; (b) an inverse nonlinearity occurring after decoding.

Image

(12.10)

g[n] = T{x[n]} = ∫_{−∞}^{x[n]} px(α) dα

which acts like a dynamic range compression, bringing small values up and large values down (Exercise 12.6). This procedure of transforming from a random variable with an arbitrary pdf to one with a uniform pdf is easier to solve than the Max quantizer nonlinear equations; it is not, however, optimal in a mean-squared-error sense [41]. This is because the procedure minimizes the distortion measure

E[(ĝ[n] − g[n])²]

and does not minimize the optimal criterion

E[(x̂[n] − x[n])²]

In addition, the method still requires a pdf measurement. Nevertheless, the transformation has the effect of a nonuniform quantizer.

Other nonlinearities approximate the companding operation, but are easier to implement and do not require a pdf measurement. These transformations also give the effect of a nonuniform quantizer, and tend to make the SNR resistant to signal characteristics. The logarithm, for example, makes the SNR approximately independent of signal variance and dependent only upon quantization step size (Exercise 12.6). The μ-law transformation approximates the logarithm operation, but avoids the infinite dynamic range of the logarithm. The μ-law companding, ubiquitous in waveform coding, is given by [71]

T(x[n]) = xmax · [ log(1 + μ|x[n]|/xmax) / log(1 + μ) ] · sign(x[n])

Together with uniform quantization, this nonlinearity for large values of μ yields an SNR approximately independent of the ratio of the signal peak xmax and rms σx over a large range of signal input, unlike the SNR of the uniform quantizer in Equation (12.4) [71],[80]. An example of the use of the μ-law companding is the international (CCITT) standard coder at 64 kbps [37]. Here, a μ-law transformation is applied followed by 7-bit uniform quantization (some bits remaining for channel bit-error protection), giving toll quality speech. This quality is equivalent to that which would be produced by a uniform 11-bit quantization on the original speech.
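A sketch of the μ-law compressor and its inverse (the expander applied after decoding) follows; the function names and the default μ = 255 are illustrative assumptions, and a coder would uniformly quantize the compressed value between the two steps.

```python
import numpy as np

def mu_law_compress(x, x_max, mu=255.0):
    """Mu-law compressor: logarithmic spacing of small amplitudes, |T(x)| <= x_max."""
    return x_max * np.log1p(mu * np.abs(x) / x_max) / np.log1p(mu) * np.sign(x)

def mu_law_expand(g, x_max, mu=255.0):
    """Inverse nonlinearity applied after uniform decoding at the receiver."""
    return x_max / mu * np.expm1(np.abs(g) * np.log1p(mu) / x_max) * np.sign(g)

# A coder would uniformly quantize mu_law_compress(x, x_max), transmit the codewords,
# then apply mu_law_expand to the decoded values.
```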

12.3.5 Adaptive Quantization

The optimal nonuniform quantizer helps in addressing the problem of large values of a speech signal occurring less often than small values, i.e., it accounts for a nonuniform pdf. We might ask at this point, however, given the time-varying nature of speech, whether it is reasonable to assume a single pdf derived from a long-time speech waveform, as we did with a Laplacian or gamma density at the beginning of this chapter. There can be significant changes in the speech waveform as in transitioning from unvoiced to voiced speech. These changes occur in the waveform temporal shape and spectrum, but can also occur in less complex ways such as with slow, as well as rapid, changes in volume. Motivated by these observations, rather than estimate a long-time pdf, we may, on the other hand, choose to estimate a short-time pdf. Pdfs derived over short-time intervals, e.g., 20–40 ms, are also typically single-peaked, as is the gamma or Laplacian pdf, but are more accurately described by a Gaussian pdf regardless of the speech class [33]. A pdf derived from short-time speech segments, therefore, more accurately represents the speech nonstationarity, not blending many different sounds. One approach to characterizing the short-time pdf is to assume a pdf of a specific shape, in particular a Gaussian, with an unknown variance σ². If we can measure the local variance σ², then we can adapt a nonuniform quantizer to the resulting local pdf. This quantization method is referred to as adaptive quantization.

The basic idea of adaptive quantization then is to assume a known pdf, but with unknown variance that varies with time. For a Gaussian, we have the form

px(x) = ( 1 / √(2πσ²) ) exp( −x² / (2σ²) )

The variance σx²[n] of the sequence x[n] is measured as a function of time, and the resulting pdf is used to design the optimal Max quantizer. Observe that a change in the variance simply scales the time signal, i.e., if y[n] = x[n]/σx, then y[n] has unity variance, so that one needs to design only one nonuniform quantizer with unity variance and then scale the decision and reconstruction levels according to a particular variance. Alternatively, one can fix the quantizer and apply a time-varying gain to the signal2 according to the estimated variance, i.e., we scale the signal to match the quantizer. When adapting to a changing variance, as shown in Figure 12.11, in a coder/decoder we now need to transmit both the quantized signal and the variance. This implies, therefore, that the variance itself must be quantized and transmitted; such a parameter is sometimes referred to as side information because it is used “on the side” to define the design of the decoder at the receiver. Observe in Figure 12.11 that errors in decoding can now occur due to errors in both the signal codewords c1[n] and the variance codewords c2[n].

2 Motivation for a time-varying gain is similar to that for applying dynamic range compression prior to audio transmission, as we did in Chapter 9. In coding or in audio transmission, we preprocess the signal so that the SNR is improved in the face of quantization or background noise, respectively.

Figure 12.11 Adapting a nonuniform quantizer to a local pdf. By measuring the local variance σx²[n], we characterize the assumed Gaussian pdf. c1[n] and c2[n] are codewords for the quantized signal x̂[n] and the time-varying variance σx²[n], respectively. This feed-forward structure is one of a number of adaptive quantizers that exploit local variance.

Image

Two approaches for estimation of a time-varying variance σx²[n] are based on a feed-forward method (shown in Figure 12.11), where the variance (or gain) estimate is obtained from the input, and a feedback method, where the estimate is obtained from a quantizer output. The feedback method has the advantage of not needing to transmit extra side information; however, it has the disadvantage of additional sensitivity to transmission errors in codewords [71]. Because our intent here is not to be exhaustive, we describe the feed-forward approach only, and we refer the reader to [33],[71] for more complete expositions. In one feed-forward estimation method, the variance is assumed proportional to the short-time energy in the signal, i.e.,

(12.11)

σx²[n] = β Σ_{m=−∞}^{∞} x²[m] h[n − m]

where h[n] is a lowpass filter and β is a constant. The bandwidth of the time-varying variance σx²[n] is controlled by the time width of the filter h[n]. There is thus a tradeoff between tracking the true variance of a signal and the amenability of the time-varying variance estimate for adaptive quantization (Exercise 12.8). Because σx²[n] must also be sampled and transmitted, we want to make this slowly varying, i.e., with small bandwidth, but without sacrificing the time resolution of the estimate. In general, adaptive quantizers can have a large advantage over their nonadaptive counterparts. A comparison for a feed-forward quantizer by Noll [60],[61] is given in the following example. This example also illustrates the importance of assuming a Gaussian function for the local pdf, in contrast to a pdf tailored to long-term statistics.
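The following sketch illustrates the feed-forward idea with a block variance estimate (a rectangular h[n] in Equation (12.11)) and, for brevity, a uniform rather than Gaussian-optimized unit-variance quantizer; the block length, bit allocation, and function name are assumptions.

```python
import numpy as np

def feedforward_adaptive_quantize(x, n_bits=3, block=128):
    """Feed-forward adaptive quantization sketch: estimate the local variance over
    each block, scale the block to unit variance, quantize with a fixed quantizer
    spanning +/- 4, and rescale.  The per-block gain is the side information."""
    levels = 2 ** n_bits
    delta = 8.0 / levels                          # fixed step size for a unit-variance design
    x_hat = np.zeros(len(x))
    gains = []
    for s in range(0, len(x), block):
        blk = x[s:s + block]
        sigma = np.sqrt(np.mean(blk ** 2)) + 1e-12
        gains.append(sigma)                       # side information (codewords c2[n])
        q = np.clip(np.floor(blk / (sigma * delta)) + 0.5,
                    -levels / 2 + 0.5, levels / 2 - 0.5)
        x_hat[s:s + block] = q * sigma * delta    # quantized samples (codewords c1[n])
    return x_hat, np.array(gains)
```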

Example 12.4       Noll considered a feed-forward scheme in which the variance estimate is [60],[62]

Image

and where the variance is evaluated and transmitted every M samples. Table 12.1 shows that the adaptive quantizer can achieve as much as 8 dB better SNR for the speech material used by Noll. Both a Gaussian and a Laplacian pdf are compared, as well as a μ-law companding scheme, which provides an additional reference. Observe that the faster smoothing with M = 128, although requiring more samples/s, gives a better SNR than the slower smoothing with M = 1024. In addition, with M = 128 the Gaussian pdf provides a higher SNR than the Laplacian pdf.

Although we see in the above example that the optimal adaptive quantizer can achieve higher SNR than the use of μ-law companding, μ-law companding is generally preferred for high-rate waveform coding because of its lower background noise when the transmission channel is idle [37]. Nevertheless, the methodology of optimal adaptive quantization is useful in a variety of other coding schemes.

Table 12.1 Comparison of 3-bit adaptive and nonadaptive quantization schemes [60]. Adaptive schemes use feed-forward adaptation.

SOURCE: Table from L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals [71]. ©1978, Pearson Education, Inc. Used by permission. Data by Noll [60].

Image

12.3.6 Differential and Residual Quantization

Up to now we have investigated instantaneous quantization, where individual samples are quantized. We have seen, however, that speech is highly correlated both on a short-time basis (e.g., on the order of 10–15 samples) and on a long-time basis (e.g., a pitch period). In this section, we exploit the short-time correlation to improve coding performance; later in this chapter, we will exploit the long-time correlation.

The meaning of short-time correlation is that neighboring samples are “self-similar,” not changing too rapidly from one another. The difference between adjacent samples should, therefore, have a lower variance than the variance of the signal itself, thus making more effective use of quantization levels. With a lower variance, we can achieve an improved SNR for a fixed number of quantization levels [33],[71]. More generally, we can consider predicting the next sample from previous ones and finding the best prediction coefficients to yield a minimum mean-squared prediction error, just as we did in Chapter 5. We can use in the coding scheme a fixed prediction filter to reflect the average correlation of a signal, or we can allow the predictor to short-time adapt to the signal’s local correlation. In the latter case, we need to transmit the quantized prediction coefficients as well as the prediction error. One particular prediction error encoding scheme is illustrated in Figure 12.12 where the following sequences are required:

x̃[n] = prediction of the input sample x[n]; this is the output of the predictor P(z) whose input is a quantized version of the input signal x[n], i.e., x̂[n].

r[n] = prediction error signal.

r̂[n] = quantized prediction error signal.

The prediction error signal r[n] is also referred to as the residual and thus this quantization approach is sometimes referred to as residual coding. The quantizer can be of any type, e.g., fixed, adaptive, uniform, or nonuniform. In any case, its parameters are adjusted to match the variance of r[n]. Observe that this differential quantization approach can be applied not only to the speech signal itself, but also to parameters that represent the speech, e.g., linear prediction coefficients, cepstral coefficients from homomorphic filtering, or sinewave parameters. In fact, our focus in the later part of this chapter will be on quantization of such representations.

Figure 12.12 Differential coding (a) and decoding (b) schemes. The coefficients of the predictor P(z), as well as the variance of the residual r[n], must be quantized, coded, and decoded (not shown in the figure).

SOURCE: L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals [71]. ©1978, Pearson Education, Inc. Used by permission.

Image

In the differential quantization scheme of Figure 12.12, we are interested in properties of the quantized residual and, in particular, its quantization error. As such, we write the quantized residual as

r̂[n] = r[n] + e[n]

where e[n] is the quantization error. Figure 12.12 shows that the quantized residual r̂[n] is added to the predicted value x̃[n] to give the quantized input x̂[n], i.e., (since r[n] = x[n] − x̃[n])

x̂[n] = x̃[n] + r̂[n] = x̃[n] + r[n] + e[n] = x[n] + e[n]

Therefore, the quantized signal samples differ from the input only by the quantization error e[n], which is the quantization error of the residual. If the prediction of the signal is accurate, the variance of r[n] will be smaller than the variance of x[n] so that a quantizer with a given number of levels can be adjusted to give a smaller quantization error than would be possible when quantizing the signal directly [71]. Observe that the procedure for reconstructing the quantized input from the codewords associated with the quantized residual r̂[n] is implicit in the encoding scheme, i.e., if c′[n] = c[n], i.e., no channel errors occur, then the decoded signal is simply the quantized signal x̂[n] formed at the encoder, as illustrated in Figure 12.12.
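A minimal first-order sketch of the structure in Figure 12.12 follows; the predictor coefficient, residual step size, and function name are illustrative assumptions. The essential point is that the predictor is driven by quantized values, so that an error-free decoder running the same recursion x̂[n] = a·x̂[n − 1] + c[n]·Δ stays in lockstep with the encoder.

```python
import numpy as np

def dpcm_encode(x, a=0.9, delta=0.1):
    """First-order DPCM sketch (Figure 12.12): predict from the *quantized* past,
    quantize the residual r[n] = x[n] - a*x_hat[n-1] with step delta, and keep the
    locally decoded x_hat[n] = prediction + quantized residual = x[n] + e[n]."""
    codes = np.zeros(len(x), dtype=int)
    x_hat = np.zeros(len(x))
    prev = 0.0
    for n in range(len(x)):
        x_tilde = a * prev                                   # prediction
        codes[n] = int(np.round((x[n] - x_tilde) / delta))   # quantize/encode residual
        x_hat[n] = x_tilde + codes[n] * delta                # locally decoded sample
        prev = x_hat[n]
    return codes, x_hat
```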

The differential coder of Figure 12.12, when using a fixed predictor and fixed quantization, is referred to as differential PCM (DPCM). Although this scheme can improve SNR, i.e., improve the ratio of the variance of the signal x[n] to the variance of the quantization noise e[n] relative to what we achieve by a direct quantization scheme, its improvement is not dramatic [71]. Rather, DPCM with both adaptive prediction (i.e., adapting the predictor to the local correlation) and adaptive quantization (i.e., adapting the quantizer to the local variance of r[n]), referred to as ADPCM, yields the greatest gains in SNR for a fixed bit rate. The international coding standard, CCITT G.721, with toll-quality speech at 32 kbps (8000 samples/s × 4 bits/sample), has been designed based on ADPCM techniques. Both the predictor parameters and the quantizer step size are calculated using only the coded quantizer output (feedback adaptation), which means that there is no need to transmit this information separately [7]. The interested reader is referred to [33],[37],[71] for a thorough comparison of DPCM and ADPCM approaches, in addition to other classic differential waveform coding schemes such as delta modulation and continuously variable slope delta modulation, where the sampling rate is increased to many times the Nyquist rate, thus increasing sample-to-sample correlation, and one-bit quantization is exploited. These methods are also used for high bit-rate waveform coding. To achieve lower rates with high quality requires further dependence on speech model-based techniques and the exploitation of long-time, as well as short-time, prediction, to be described later in this chapter.

Before closing this section, we describe an important variation of the differential quantization scheme of Figure 12.12. Observe that our prediction has assumed an all-pole, or autoregressive, model. Because in this model a signal value is predicted from its past samples, any error in a codeword (due, for example, to bit errors over a degraded channel) propagates over considerable time during decoding. Such error propagation is particularly severe when the signal values represent speech model parameters computed frame-by-frame, rather than sample-by-sample. An alternative is a finite-order moving-average predictor derived from the residual. One common approach to the use of a moving-average predictor is illustrated in Figure 12.13 [75]. At the coder stage, for a predictor parameter time sequence a[n], we write the residual as the difference of the true value and the value predicted from the moving average of K quantized residuals, i.e.,

r[n] = a[n] − Σ_{k=1}^{K} p[k] r̂[n − k]

where p[k] represents the coefficients of P(z). Then the predicted value at the decoder, also embedded within the coder, is given by

â[n] = r̂[n] + Σ_{k=1}^{K} p[k] r̂[n − k]

for which one can see that error propagation is limited to only K samples (or K analysis frames in the case of model parameters).

Figure 12.13 Differential coding (a) and decoding (b) schemes with moving average predictor.

Image
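The moving-average variant can be sketched for a frame-by-frame parameter track as follows; the step size and the coefficient vector p are placeholders. Because the prediction memory holds only the last K quantized residuals, a corrupted codeword can affect at most K consecutive frames.

```python
import numpy as np

def ma_differential_encode(a, p, delta=0.01):
    """Moving-average differential coding of a parameter track a[n] (Figure 12.13).
    p holds the K coefficients of P(z); prediction uses only quantized residuals."""
    K = len(p)
    r_hat = np.zeros(len(a) + K)                 # quantized residuals, zero-padded history
    codes = np.zeros(len(a), dtype=int)
    for n in range(len(a)):
        pred = np.dot(p, r_hat[n:n + K][::-1])   # sum_k p[k] * r_hat[n - k], k = 1..K
        r = a[n] - pred                          # residual of the coder stage
        codes[n] = int(np.round(r / delta))      # quantize and encode
        r_hat[n + K] = codes[n] * delta          # update quantized-residual history
    return codes
```

The decoder forms â[n] = r̂[n] + Σ p[k] r̂[n − k] from the same K quantized residuals, which is why the effect of a single codeword error dies out after K frames.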

12.4 Vector Quantization (VQ)

We have in the previous section investigated scalar quantization, which is the basis for high-rate waveform coders. In this section, we describe a generalization of scalar quantization, referred to as vector quantization, in which a block of scalars are coded as a vector, rather than individually. As with scalar quantization, an optimal quantization strategy can be derived based on a mean-squared error distortion metric. Vector quantization is essential in achieving low bit rates in model-based and hybrid coders.

12.4.1 Approach

We motivate vector quantization with a simple example in which the vocal tract transfer function is characterized by only two resonances (two complex pole pairs), thus requiring four reflection coefficients. Furthermore, suppose that the vocal tract can take on only one of four possible shapes. This implies that there exist only four possible sets of the four reflection coefficients, as illustrated in Figure 12.14. Now consider scalar quantizing each of the reflection coefficients individually. Because each reflection coefficient can take on 4 different values, 2 bits are required to encode each coefficient. Since there are 4 reflection coefficients, we need 2 × 4 = 8 bits per analysis frame to code the vocal tract transfer function. On the other hand, we know that there are only four possible positions of the vocal tract corresponding to only four possible vectors of reflection coefficients. The scalar values of each vector are, therefore, highly correlated in the sense that a particular value of one reflection coefficient allows a limited choice of the remaining coefficients. In this case, we need only 2 bits to encode the 4 reflection coefficients. If the scalars were independent of each other, treating them as a vector would have no advantage over treating them individually. In vector quantization, we are exploiting correlation in the data to reduce the required bit rate.

We can think of vector quantization (VQ) as first grouping scalars into blocks, viewing each block as a unit, and then quantizing each unit. This scheme is, therefore, also sometimes called block quantization. We now look at VQ more formally. Consider a vector of N continuous

Figure 12.14 Comparison of bit rates required by scalar and vector representations of reflection coefficients for four two-pole vocal tract frequency responses. The four required reflection coefficients are highly correlated, the vocal tract being limited to only four configurations.

Image

scalars

x = [x_1, x_2, . . . , x_N]^T

where T denotes matrix transpose. With quantization, the vector x is mapped to another N-dimensional vector

x̂ = [x̂_1, x̂_2, . . . , x̂_N]^T

The vector x̂ is chosen from M possible reconstruction (quantization) levels, thus representing a generalization of the scalar case, i.e.,

x̂ = VQ(x) = r_i,     if x ∈ C_i

Image

Figure 12.15 Comparison of scalar and vector quantization.

where

VQ = vector quantization operator,

r_i = M possible reconstruction levels for 1 ≤ i ≤ M,

C_i = the ith “cell” or cell boundary,

and where if x is in the cell C_i, then x is mapped to r_i. As in the scalar case, we call r_i the codeword and the complete set of codewords {r_i} the codebook. An example of one particular cell configuration is given in Figure 12.15 in which the order of the vector N = 2 and the number of reconstruction levels is M = 6. The dots represent the reconstruction levels and the solid lines the cell boundaries. Also in Figure 12.15, we summarize the comparison between scalar and vector quantization. In this figure, we see the following properties that we elaborate on shortly:

P1: In vector quantization a cell can have an arbitrary size and shape. By contrast, in scalar quantization a “cell” (region between two decision levels) can have an arbitrary size, but its shape is fixed.

P2: As in the scalar case, we define a distortion measure D(x, x̂), which is a measure of dissimilarity or error between x and x̂.

12.4.2 VQ Distortion Measure

Vector quantization noise is represented by the vector e = x̂ − x, so that the distortion is the average of the sum of squares of the scalar components, given by D = E[(x̂ − x)^T (x̂ − x)], where E denotes expected value. We then can write, for a multi-dimensional pdf px(x),

(12.12)

D = Σ_{i=1}^{M} ∫_{x ∈ C_i} (r_i − x)^T (r_i − x) px(x) dx

where C_i are the cell boundaries. We can think of this distortion as a generalization of the 1-D (scalar) case in Equation (12.5). Note the change of notation for the reconstruction levels from the formulation of the 1-D case: x̂_i → r_i. Our next goal is to minimize D with respect to the unknown reconstruction levels r_i and cell boundaries C_i (similar to the unknown decision levels for the 1-D scalar case). The solution to this optimization problem is nonlinear in the unknown reconstruction levels and cell boundaries. Following an approach similar to that used in deriving the Max quantizer for scalars results in two necessary constraints for optimization. These two constraints have been formulated by Lim as [41]:

C1: A vector x must be quantized to a reconstruction level r_i that gives the smallest distortion between x and r_i. This is because each of the elements of the sum in Equations (12.5) and (12.12) must be minimized.

C2: Each reconstruction level r_i must be the centroid of the corresponding decision region, i.e., of the cell C_i.

Observe that the first condition implies that, having the reconstruction levels, we can vector-quantize without explicit need of the cell boundaries; i.e., to quantize a given vector, we find the reconstruction level that minimizes its distortion. This requires a very large search, but in theory the search is possible. The second condition gives a way to determine a reconstruction level from a cell boundary.

The two conditions motivate the following iterative solution [41]. Assume that we are given an initial estimate of the reconstruction levels r_i. Then from the first condition, we can determine all vectors from an ensemble that quantize to each r_i, and thus it follows that we have the corresponding cell boundaries for the ensemble. Having the cell boundaries, we use the second condition to obtain a new estimate of the reconstruction levels, i.e., the centroid of each cell. We continue the iteration until the change in the reconstruction levels is small. A problem with this algorithm is the requirement of using an ensemble of all vectors x and their corresponding joint pdfs, the latter required in the determination of the distortion measure and the multi-dimensional centroid. An alternative algorithm, proposed by Lloyd for 1-D [43] and extended to multi-D by Forgy [18], uses a finite set of input vectors as training vectors to replace the difficult-to-obtain joint pdfs and the ensemble requirement. The algorithm is referred to as the k-means algorithm, where k is the number of reconstruction levels; the algorithm (nearly) minimizes D. The k-means algorithm is given by the following steps:

S1: Replace the ensemble average D with the training-set average D′ = (1/L) Σ_{k=1}^{L} (x̂_k − x_k)^T (x̂_k − x_k), where x_k, 1 ≤ k ≤ L, are the training vectors and x̂_k are the quantized vectors that are functions of cell boundaries and reconstruction levels. (We can think of the unknown pdf as having been replaced by the set of training vectors.) Our new goal is to minimize D′ with respect to cell boundaries and reconstruction levels.

Figure 12.16 Two steps in the k-means algorithm: (a) cluster formation; (b) mean (centroid) formation.

Image

S2: Pick an initial guess at the reconstruction levels {r_i}.

S3: For each training vector x_k, select the r_i closest to x_k. The set of all x_k nearest a given r_i forms a cluster (Figure 12.16a), which is why this iteration is sometimes referred to as a “clustering algorithm.”

S4: Find the mean of the x_k in each cluster, which gives a new r_i approximating the centroid of the cluster from S3 (Figure 12.16b). Calculate D′.

S5: Stop when the change in D′ over two consecutive iterations is insignificant.

The iteration can be shown to converge to a local minimum of D′. Note that minimizing D′ does not necessarily minimize D, but it is close. Because a local minimum may be found, we can try a number of different initial conditions and pick the solution with the smallest mean-squared error.
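A compact numpy sketch of steps S2–S4 follows, with a fixed iteration count standing in for the D′-based stopping rule of S5 and random initialization from the training set; the function name and defaults are assumptions.

```python
import numpy as np

def kmeans_codebook(train, M, n_iter=50, seed=0):
    """k-means VQ design on a set of training vectors `train` (shape L x N)."""
    rng = np.random.default_rng(seed)
    codebook = train[rng.choice(len(train), M, replace=False)].astype(float)  # S2
    for _ in range(n_iter):
        # S3: cluster each training vector with its nearest reconstruction level.
        d2 = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        # S4: move each reconstruction level to the centroid (mean) of its cluster.
        for i in range(M):
            members = train[nearest == i]
            if len(members):
                codebook[i] = members.mean(axis=0)
    # S5 would monitor D' (the mean of the minimum distortions) and stop when it plateaus.
    return codebook
```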

An advantage of VQ over scalar quantization is that it can reduce the number of reconstruction levels when the distortion D′ is held constant, and thus lower the number of required bits (Exercise 12.10). Likewise, when the number of reconstruction levels is held fixed, it can lower the distortion D′. As indicated, these advantages are obtained by exploiting the correlation (i.e., linear dependence) among scalars of the vector.3 Therefore, any transformation4 of the vector that reduces this correlation also reduces the advantage of vector quantization [41]. Nevertheless, there are conditions under which VQ has an advantage even without correlation across vector elements because VQ gives more flexibility than scalar quantization in partitioning an M-dimensional space [44]; this advantage is clarified and illustrated later in this chapter in the description of code-excited linear prediction coding. A quantitative discussion of these VQ properties is beyond our scope here; the interested reader can find further tutorials and examples in [20],[41],[44].

3 Scalars may be linearly independent yet statistically dependent, corresponding to nonlinear dependence between scalars. VQ can also exploit such nonlinear dependence [41],[44].

4 For example, the elements of a waveform vector may be linearly dependent, but its DFT or cepstral coefficients may be uncorrelated.

12.4.3 Use of VQ in Speech Transmission

A generic data transmitter/receiver based on VQ is described as follows. At the transmitter, the first step is to generate the codebook of vector codewords. To do so, we begin with a set of training vectors, and then use the k-means algorithm to obtain a set of reconstruction levels r_i and possibly cell boundaries C_i. Attached to each reconstruction level is a codeword with an associated index in a codebook. Our second step is to quantize the vectors to be transmitted. As mentioned above, we can vector quantize without explicit need of the cell boundaries, i.e., to quantize a given vector, we find the reconstruction level that minimizes its distortion. Therefore, we do not need to store the cell boundaries. We then transmit the selected codeword. At the receiver, the codeword is decoded in terms of a codebook index and a table lookup procedure is used to extract the desired reconstruction level. This VQ strategy has been particularly useful in achieving high-quality coded speech at middle and low bit rates, as we will see in the following sections.
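In code, the transmitter/receiver pair reduces to a nearest-neighbor search and a table lookup; the sketch below assumes a codebook array such as one trained with the k-means sketch above.

```python
import numpy as np

def vq_encode(vec, codebook):
    """Transmitter: send only the index of the minimum-distortion codeword."""
    return int(((codebook - vec) ** 2).sum(axis=1).argmin())

def vq_decode(index, codebook):
    """Receiver: table lookup of the reconstruction level for the received index."""
    return codebook[index]
```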

12.5 Frequency-Domain Coding

We have seen that with basic PCM waveform coding and adaptive and differential refinements of PCM, we can obtain good quality at about 32 kbps for telephone-bandwidth speech. These are purely time-domain approaches. To attain lower rates, as required, for example, over low-bandwidth channels, we take a different strategy. One approach is to exploit the frequency-domain structure of the signal using the short-time Fourier transform (STFT). From Chapter 7, we saw that the STFT can be interpreted by either a filter-bank or Fourier-transform perspective. In this section, we first describe speech coding techniques based, in particular, on the filter-bank view and using the scalar quantization schemes of the previous section. This strategy is called subband coding and leads naturally into sinusoidal coding based on sinewave analysis/synthesis and its multi-band variations of Chapter 9; these methods exploit both scalar and vector quantization. Sinusoidal coders are hybrid coders in the sense of imposing speech model-based structure on the spectrum, in contrast to subband coders.

12.5.1 Subband Coding

Subband coding is based on the generalized filter-bank summation method introduced in Chapter 7, which is essentially the FBS method discretized in time. In this analysis/synthesis method, to reduce the number of samples used in coding, the output of the kth complex5 analysis filter, h_k[n] = w[n]e^(j2πnk/N), is first decimated by the time-decimation factor L and modulated down to DC by e^(−j2πnk/N) to form the subband sequence X(nL, k). At the receiving end of the coder, the decimated filter outputs are interpolated by the synthesis filter f[n] and the resulting interpolated sequences are modulated and summed, thus giving Equation (7.17), i.e.,

5 In practice, the subbands are implemented as a lowpass translation of a frequency band to DC in a manner similar to single-sideband modulation, giving real signals rather than the complex signals [81].

y[n] = Σ_{k=0}^{N−1} [ Σ_{m=−∞}^{∞} X(mL, k) f[n − mL] ] e^(j2πnk/N)

We saw in Chapter 7 that the condition for perfect reconstruction of x[n] is given by Equation (7.18). In coding, the time-decimated subband outputs are quantized and encoded, then are decoded at the receiver.

In subband coding, a small number of filters (≤ 16) with wide and overlapping bandwidths6 are chosen and each output is quantized using scalar quantization techniques described in the previous section. Typically, each bandpass filter output is quantized individually. The subband coder derives its advantage over (fullband) waveform coding by limiting the quantization noise from the coding/decoding operation largely to the band in which it is generated and by taking advantage of known properties of aural perception. To satisfy the first condition, although the bandpass filters are wide and overlapping, careful design of the filters results in a cancellation of quantization noise that leaks across bands, thus keeping the bandpass quantization noise independent across channels from the perspective of the synthesis. Quadrature mirror filters are one such filter class; for speech coding, these filters have the additional important property of, under certain conditions, canceling aliasing at the output of each decimated filter when the Nyquist criterion is not satisfied [81]. Figure 12.17 shows an example of a two-band subband coder using two overlapping quadrature mirror filters [81]. Quadrature mirror filters can be further subdivided from high to low filters by splitting the fullband into two (as in Figure 12.17), then the resulting lower band into two, and so on. Because of the octave-band nature of the resulting bandpass filters, repeated decimation is invoked. This octave-band splitting, together with the iterative decimation, can be shown to yield a perfect reconstruction filter bank (i.e., its output equals its input) [77]. In fact, such octave-band filter banks, and their conditions for perfect reconstruction, are closely related to wavelet analysis/synthesis structures and their associated invertibility condition as described in Chapter 8 [12],[77].

6 In contrast, ideal and nonoverlapping bandpass filters imply a long time-domain filter impulse response and thus difficulty in achieving the perfect reconstruction filter-bank property.
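To make the structure of Figure 12.17 concrete, the following Python sketch implements a two-band, critically sampled analysis/synthesis filter bank in the quadrature-mirror style. The prototype filter here is a generic windowed-sinc halfband lowpass, not an optimized quadrature mirror design, so reconstruction is only approximate; in an actual subband coder, each decimated band signal would be quantized (e.g., with ADPCM) between the analysis and synthesis stages.

```python
import numpy as np
from scipy.signal import firwin, lfilter

# Generic halfband prototype; practical subband coders use optimized
# quadrature mirror designs (e.g., Johnston filters) for near-perfect
# reconstruction.
h0 = firwin(32, 0.5)                                # lowpass prototype
h1 = h0 * (-1.0) ** np.arange(len(h0))              # highpass mirror filter

def analysis(x):
    # Split into two bands and decimate by 2 (critical sampling).
    return lfilter(h0, 1.0, x)[::2], lfilter(h1, 1.0, x)[::2]

def synthesis(v0, v1):
    # Upsample by 2, interpolate, and recombine; the sign flip on the
    # high band cancels the aliasing introduced by the decimation.
    u0 = np.zeros(2 * len(v0)); u0[::2] = v0
    u1 = np.zeros(2 * len(v1)); u1[::2] = v1
    return 2.0 * (lfilter(h0, 1.0, u0) - lfilter(h1, 1.0, u1))

x = np.random.randn(1024)
v0, v1 = analysis(x)        # these band signals would be ADPCM-coded
y = synthesis(v0, v1)       # approximate, delayed copy of x
```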

We also saw in Chapter 8 that the octave-band filters, having a constant-Q property, provide an auditory front-end-like analysis. In the context of speech coding, we are interested in the effect of quantization noise in each auditory-like subband on speech perception. In particular, it is known that the SNR in each subband is correlated with speech intelligibility, and that for high intelligibility, as evaluated in speech-perception tests, it is desirable to maintain a constant SNR across the near constant-Q subbands that span the speech spectrum7 [39]. In addition, when the SNR in a subband is high enough, then the quantization noise in a subband can be perceptually masked, or “hidden,” by the signal in that subband.8

7 As mentioned in this chapter’s introduction, speech intelligibility is sometimes represented by the articulation index, which uses a weighted sum of the SNR in each auditory-like subband [39].

8 Other principles of auditory masking that involve hiding of small spectral components by large adjacent components have also been exploited in speech coding, as well as in speech enhancement. We study one such approach in Chapter 13.

Figure 12.17 Example of a two-band subband coder using two overlapping quadrature mirror filters and ADPCM coding of the filter outputs: (a) complete analysis, coding/decoding, and synthesis configuration; (b) overlapping quadrature mirror filters.

SOURCE: J.M. Tribolet and R.E. Crochiere, “Frequency Domain Coding of Speech” [81]. ©1979, IEEE. Used by permission.

Image

As in fullband waveform coding, the variance of each subband signal, denoted by Image for the kth subband on the nth frame, is determined and used to control the decision and reconstruction levels of the quantization of each subband signal.9 For example, the adaptive quantization scheme of Section 12.3.5 can be used, where the subband variance estimate controls the quantizer step size in each band. Given that we have a fixed number of bits, one must decide on the bit allocation across the bands. Suppose that each band is given the same number of bits. Then, bands with lower signal variance have smaller step sizes and contribute less quantization noise. Subbands with larger signal variance have larger step sizes and, therefore, contribute more quantizing noise. The quantization noise, by following the input speech spectrum energy, results in a constant SNR across subbands.

9 The variance of each subband signal can be obtained by local averaging of the spectrum in frequency, and thus it increases with increasing spectral energy [81]. This variance is sent as side information by the coder.

It is of interest to ask whether this uniform bit allocation is “optimal” in the sense of minimizing an average measure of quantization noise when the number of bits assigned to each frequency band is allowed to vary with the time-varying variance of each band. Suppose we are given a total of B bits for N channels; then

(12.13)

Image

where b(nL, k) denotes the number of bits for the kth subband at time nL. In order to determine an optimal bit assignment, we first define the quantization noise over all bands as

(12.14)

Image

where Image is the quantized version of the STFT X(nL, k) and where Image denotes the variance of the quantization noise on the kth channel at time nL. Suppose, for example, that x[n] is a Gaussian random process. Then each subband signal has a Gaussian pdf because a linear transformation of a Gaussian random process remains a Gaussian random process [63]. It can then be shown (Exercise 12.11) that minimizing the mean-squared error yields the optimal bit assignment given approximately by [14],[27]

(12.15)

Image

where Image are the variances of the channel values and where D* denotes the minimum-mean-squared error. The number of bits b(nL, k), therefore, increases with increasing variance. It can be shown that this bit assignment rule, based on the minimum-mean-squared error criterion, leads to a flat noise variance (power spectrum) in frequency, i.e., Image (Exercise 12.11). A flat noise variance, however, is not the most desirable because the SNR decreases with decreasing subband signal energy and thus a loss in intelligibility is incurred due to subbands with relatively low energy (assuming the importance of equal SNR in all bands).

We can modify the shape of the quantization noise variance across subbands by allowing a positive weighting factor Q(k) that weights the importance of the noise in different frequency bands. The new distortion measure becomes

(12.16)

Image

which, when minimized, gives the resulting noise spectrum

Image

where α is a constant. The quantization noise variance, therefore, is inversely proportional to the weighting. The corresponding optimal bit assignment is given by (Exercise 12.11).

(12.17)

Image

When the weighting function is given by the reciprocal of the subband signal variance, i.e., Image, giving less weight to high-energy regions, then the optimal bit assignment is constant. In this case, the quantization noise spectrum follows the signal spectrum (with more error in the high-energy regions) and the SNR is constant as a function of frequency. Thus the uniform bit assignment that we had described previously is not only perceptually desirable, but also approximately satisfies an optimal error criterion with a reasonable perceptual weighting. This subband coding strategy using ADPCM techniques, along with variations of these approaches, has been applied to speech with good quality in the range of 12–16 kbps [10],[11].
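The bit-assignment rules of Equations (12.15) and (12.17) can be summarized in a short sketch. The rule below assumes the standard form in which each band receives the average allocation B/N plus half the base-2 logarithm of the ratio of its (weighted) variance to the geometric mean of the (weighted) variances; practical coders round the result to nonnegative integers.

```python
import numpy as np

def allocate_bits(var, B, Q=None):
    # var: subband variances sigma_k^2; B: total bits; Q: optional perceptual weights.
    var = np.asarray(var, dtype=float)
    w = var if Q is None else np.asarray(Q, dtype=float) * var
    geo_mean = np.exp(np.mean(np.log(w)))
    # Equal share plus half the log-ratio to the geometric mean (sums to B).
    return B / len(var) + 0.5 * np.log2(w / geo_mean)

sigma2 = np.array([4.0, 1.0, 0.25, 0.0625])
print(allocate_bits(sigma2, B=16))                  # more bits to high-variance bands
print(allocate_bits(sigma2, B=16, Q=1.0 / sigma2))  # inverse-variance weighting: 4 bits each
```

With Q(k) chosen as the reciprocal of the subband variance, the weighted variances are all equal and the rule reduces to the uniform assignment discussed above.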

The coding strategies of this section have also been applied to the STFT viewed from the Fourier transform perspective [81]. Typically, the number of frequency channels (e.g., transform sizes of 64 to 512) used in this approach is much greater than in subband coding in order to capitalize on the spectral harmonic structure of the signal, as well as the general spectral formant shape. This technique, referred to as adaptive transform coding, can be thought of as narrowband analysis/synthesis, in contrast to subband coding (with far fewer and wider bands) classified as wideband analysis/synthesis, analogous to the difference between narrowband and wideband spectrogram analysis.

Although subband and transform coding methods can attain good quality speech in the range of 12–16 kbps, they are hard-pressed to attain such quality at lower bit rates. One approach to attaining these lower bit rates with frequency-domain approaches is to incorporate more speech-specific knowledge, as we did in the framework of the phase vocoder (which can also be thought of as falling in the class of subband coders). One of the earliest such coders, even prior to the introduction of the phase vocoder, was the channel vocoder first introduced by Dudley [15],[23] and extended by Gold and Rader [22]. This coder uses a representation of the speech source, as well as a subband spectral representation (Exercise 12.14). In the remainder of this chapter, we investigate two current approaches to explicitly using knowledge of the speech production mechanism in the frequency domain. These coders, based on a sinewave representation of speech, are capable of attaining high quality at lower bit rates by speech-dependent frequency-domain modeling.

12.5.2 Sinusoidal Coding

In Chapter 10, it was shown that synthetic speech of high quality could be synthesized using a harmonic set of sinewaves, provided the amplitudes and phases are the harmonic samples of a magnitude and phase function derived from the short-time Fourier transform at the sinewave frequencies. Specifically, the harmonic sinewave synthesis is of the form

(12.18)

Image

where ωo is the fundamental frequency (pitch) estimate [but where ωo is fixed above an adaptive voicing frequency cutoff, as in Equation (10.31)], where Image are samples of the vocal tract piecewise-linear spectral envelope, derived from the SEEVOC estimation strategy of Chapter 10, and where Image is the phase obtained by sampling a piecewise-flat phase function derived from the measured STFT phase, using the same strategy to generate the SEEVOC envelope. (Recall from Chapter 10 that use of a piecewise-flat phase, in contrast to a piecewise-linear function, avoids the difficult problem of phase interpolation across frequency.)

In the context of speech coding, even though the harmonic synthesis eliminates the need to code the sinewave frequencies, the amplitudes and phases must be quantized, and, in general, there remain too many parameters to encode and achieve operation at low bit rates. A further reduction of required sinewave parameters was obtained in Chapter 10 for voiced speech by using speech production to model the measured phase samples. It was shown that, using the properties of the speech production mechanism, a minimum-phase model of the vocal tract and an onset model of the source allows a simplified representation during voicing which was said to provide a reduced parameter set for coding. In this section, this model for the glottal excitation and vocal tract transfer function is generalized to both voiced and unvoiced speech. Then a coding scheme based on the resulting reduced parameter set is developed.

Minimum-phase harmonic sinewave speech model —
Voiced speech sinewave model:
In Chapter 9, the excitation impulse train during voicing is represented by a sum of sinewaves “in phase” at glottal pulse times, and the amplitude and phase of the excitation sinewaves are altered by the glottal airflow and vocal tract filter. Letting H(ω) = |H(ω)| exp[jΦ(ω)] denote the composite transfer function for these latter effects, which we have also been referring to as the system function, then the speech signal at its output due to the excitation impulse train at its input can be written as

Image

where the onset time no corresponds to the time of occurrence of the impulse nearest the center of the current analysis frame. (In this section, we avoid frame index notation unless necessary.) We saw that |H(ω)| can be estimated as the SEEVOC envelope, denoted by Image. Furthermore, if we assume that the vocal tract is minimum-phase, then (as we saw in Chapter 10) a system phase can be obtained through the cepstral representation of |H(ω)| that we denote by Image. Then for harmonic frequencies, these minimum-phase amplitude and phase functions can be sampled at ωk = kωo.

Since the function of the onset time is to bring the sinewaves into phase at times corresponding to the occurrence of a glottal pulse, rather than attempt to estimate the absolute onset time from the data, we saw in Chapters 9 and 10 that it is possible to achieve the same perceptual effect simply by keeping track of successive onset times generated by a succession of pitch periods that are available at the synthesizer. This relative onset time was given by Equation (9.38). Another way to compute the onset times while accounting for the effects of time-varying pitch is to define the phase of the fundamental frequency to be the integral of the instantaneous frequency. For the lth frame, we express this phase as

(12.19)

Image

where Ωo(t) is the time-varying pitch frequency in continuous time t, T is the sampling time, l denotes the analysis frame index, and L is the frame interval in discrete time. Because this phase is monotonically increasing with time n, a sequence of onset times can be found at the values of n for which Φo[no] = 2πM for integer values of M. If Image and Image denote the estimated pitch frequencies on frames l − 1 and l, respectively, then a reasonable model for the frequency variation in going from frame l − 1 to frame l is

Image

which can be used to numerically compute the phase in Equation (12.19) and subsequently the onset time. If all of the sinewaves are harmonically related, then the phase of the kth sinewave is simply k times the phase of the fundamental, which means that the excitation sinewaves are in phase for every point in time. This leads to a phase model for which it is unnecessary to compute the onset time explicitly (i.e., the absolute onset time). Therefore, using the minimum-phase system phase representation, denoted above by Image, and the excitation phase onset representation, we have with harmonic frequencies the sinewave voiced speech model

(12.20)

Image

where Image is given in Equation (12.19). This shows that for voiced speech the sinewave reconstruction depends only on the pitch and, through the SEEVOC envelope, on the sinewave amplitudes; the system phase is derived from the amplitude envelope under a minimum-phase assumption. In synthesis, where cubic phase interpolation (or overlap-add) is invoked, we define the excitation phase offset for the kth sinewave as k times the phase of the fundamental [Equation (12.19)], evaluated at the center of the current synthesis frame [50].
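As an illustration of Equation (12.19) and the linear pitch-variation model, the sketch below accumulates the phase of the fundamental over one frame and locates the samples at which it crosses multiples of 2π, i.e., the implied onset times. The sampling rate, frame length, and pitch values are illustrative.

```python
import numpy as np

def fundamental_phase(w_prev, w_cur, L, phi_prev=0.0):
    """Integrate the instantaneous pitch frequency over one frame.

    w_prev, w_cur : pitch (radians/sample) at the centers of frames l-1 and l
    L             : frame interval in samples
    Assumes the pitch varies linearly across the frame, so the phase is the
    running sum (discrete integral) of the interpolated frequency.
    """
    n = np.arange(1, L + 1)
    w_n = w_prev + (w_cur - w_prev) * n / L       # linear pitch trajectory
    return phi_prev + np.cumsum(w_n)

fs = 8000.0
phi = fundamental_phase(2 * np.pi * 100 / fs, 2 * np.pi * 110 / fs, L=160)
onsets = np.where(np.diff(np.floor(phi / (2 * np.pi))) > 0)[0] + 1
print(onsets)   # samples where the phase passes a multiple of 2*pi (onset times)
```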

Unvoiced speech sinewave model: If the above phase model is used in place of the measured sinewave phases, the synthetic speech is quite natural during voiced speech, but “buzzy” during the unvoiced segments. As noted in Chapter 10, an important difference between the measured and synthetic phase is a phase residual which typically contains a noise component that is particularly important during unvoiced speech and during voicing with a high aspiration component. On the other hand, if the phases are replaced by uniformly distributed random variables on [−π, π], then the speech is quite natural during unvoiced speech but sounds like whispered speech during the voiced segments (look ahead to Figure 14.12). This suggests that the phase model in Equation (12.20) can be generalized by adding a voicing-dependent component which would be zero for voiced speech and random on [−π, π] for unvoiced speech. However, it should be expected that such a binary voiced/unvoiced phase model would render the sinewave system overly dependent on the voicing decision, causing artifacts to occur in the synthetic speech when this decision was made erroneously. The deleterious effects of the binary decision can be reduced significantly by using a more general mixed excitation model based on a voicing-dependent frequency cutoff described in Chapters 9 and 10. In this model, a voicing transition frequency is estimated below which voiced speech is synthesized and above which unvoiced speech is synthesized. Letting ωc denote the voicing-dependent cutoff frequency, then the unvoiced phase component can be modeled by

(12.21)

Image

representing an estimate of the actual phase residual introduced in Section 10.5.4, where U[−π, π] denotes a uniformly distributed random variable on [−π, π]. If this is added to the voiced-speech phase model in Equation (12.20), the complete sinewave model becomes

(12.22)

Image

where the residual Image is given in Equation (12.21). As in the basic sinewave reconstruction system described in Chapter 9, speech is synthesized over contiguous frames using the linear amplitude and cubic phase interpolation algorithms, the amplitude and phase functions in Equation (12.22) being given at each frame center prior to interpolation. Using this interpolation, a 10-ms frame interval was found adequate; with longer frame intervals, perceptual artifacts become apparent. Approaches to estimating the voicing-adaptive cutoff frequency, based on the degree of harmonicity in frequency bands, were described in Chapter 10.
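A minimal single-frame rendering of the phase model in Equations (12.20)–(12.22) is sketched below, with the SEEVOC amplitudes, minimum-phase system phase, and fundamental phase assumed given. An actual synthesizer would instead interpolate these parameters across frames (linear amplitude, cubic phase) rather than generating each frame independently.

```python
import numpy as np

def synth_frame(A, phi_sys, phi0, w0, wc, N, seed=0):
    """Harmonic synthesis with voicing-dependent phases for one frame.

    A       : sinewave amplitudes (SEEVOC envelope sampled at k*w0)
    phi_sys : minimum-phase system phase sampled at k*w0
    phi0    : phase of the fundamental at the frame center [Eq. (12.19)]
    w0, wc  : pitch and voicing cutoff frequencies (radians/sample)
    N       : number of output samples
    """
    rng = np.random.default_rng(seed)
    n = np.arange(N)
    s = np.zeros(N)
    for k in range(1, len(A) + 1):
        wk = k * w0
        # Phase residual: zero below the voicing cutoff, random above [Eq. (12.21)].
        eps = 0.0 if wk <= wc else rng.uniform(-np.pi, np.pi)
        s += A[k - 1] * np.cos(wk * n + k * phi0 + phi_sys[k - 1] + eps)
    return s
```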

Postfiltering: While the synthetic speech produced by this system is of good quality, a “muffling” effect can be detected, particularly for certain low-pitched speakers. Such a quality loss has also been found in code-excited linear prediction systems (which we study later in this chapter), where it has been argued that the muffling is due to coder noise in the formant nulls. Because the synthetic speech produced by the minimum-phase harmonic sinewave system has not yet been quantized, the muffling cannot be attributed to quantization noise, but to the front-end analysis that led to the sinewave representation.10 The origin of this muffling effect is not completely understood. Speculations include sidelobe leakage of the window transform filling in the formant nulls, or harmonic sampling of the vocal tract SEEVOC estimate that broadens the formant bandwidths. These speculations, however, are in question because muffling is effectively removed when the synthetic phase function of Equation (12.22) is replaced by the piecewise-flat phase derived from the measured phase at sinewave frequencies. This reduced muffling is due, perhaps, to the auditory system’s exploiting the original phase to obtain a high-resolution formant estimate, as we saw in Chapter 8. In addition, the original phase recovers proper timing of speech events, such as plosives and voiced onset times.

10 Because some additional muffling can be introduced with quantization, the postfilter should be applied at the decoding stage.

Techniques have been developed for filtering out quantization noise in formant nulls and sharpening formant bandwidths by passing the synthesized speech through a postfilter [9]. A variant of a code-excited linear prediction postfilter design technique that uses a frequency-domain design approach has been developed for sinewave systems [49],[52]. Essentially, the postfilter is a normalized, compressed version of the spectrally-flattened vocal tract envelope, which, when applied to the vocal tract envelope, results in formants having deeper nulls and sharper bandwidths that, in turn, result in synthetic speech that is less muffled. The first step in the postfilter design is to remove the spectral tilt from the log-spectrum. (Spectral tilt was described in Chapter 3, Example 3.2.) One approach to estimating the spectral tilt is to use the first two real cepstrum coefficients. These coefficients are computed using the equation

Image

where Image is the SEEVOC envelope. The spectral tilt is then given by

log T(ω) = c[0] + 2c[1] cosω

and this is removed from the measured speech spectral envelope to give the residual envelope

(12.23)

Image

This is then normalized to have unity gain, and compressed using a root-γ compression rule (γ ≈ 0.2). That is, if Rmax is the maximum value of the residual envelope, then the postfilter is taken to be

Image

The idea is that at the formant peaks the normalized residual envelope has unity gain and is not altered by the compressor (Figure 12.18). In the formant nulls, the compressor reduces the fractional values so that overall, a Wiener-like filter characteristic will result (see Chapter 13). In order to insure that excessive spectral shaping is not applied, giving an unnatural tonal characteristic, a clipping rule [32] is introduced such that the final postfilter value cannot fall below 0.5. The resulting compressed envelope then becomes the postfilter and is applied to the measured envelope to give

(12.24)

Image

Figure 12.18 Frequency-domain postfilter design: (a) spectral tilt, log T(ω), as provided by the order-two cepstrum; (b) spectral tilt removal to obtain flattened spectral envelope, log R(ω), and spectral compression to obtain postfilter, log P(ω); (c) postfiltered log-spectral envelope.

SOURCE: R.J. McAulay, T.M. Parks, T.F. Quatieri, and M. Sabin, “Sinewave Amplitude Coding at Low Data Rates” [52]. ©1989, IEEE. Used by permission.

Image

which causes the formants to narrow and the formant nulls to deepen, thereby reducing the effects of the analysis and coder noise. Unfortunately, the spectral tilt does not always adequately track the formant peaks and the resulting postfilter can introduce significant spectral imbalance. The effect of this imbalance can be reduced somewhat by renormalizing the postfiltered envelope so that the energy after postfiltering is equal to the energy before postfiltering. Other methods of estimating spectral tilt have also been developed based on all-pole modeling [9],[50].
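The postfilter design just described can be summarized in a few lines. The sketch below assumes the SEEVOC envelope is available on a uniform frequency grid and uses simple numerical approximations for the first two cepstral coefficients; the constants follow the text (γ ≈ 0.2, a clipping floor of 0.5), but other details of the published designs [49],[52] may differ.

```python
import numpy as np

def sinewave_postfilter(env, gamma=0.2, floor=0.5):
    # env: vocal tract magnitude envelope sampled uniformly on [0, pi).
    w = np.linspace(0.0, np.pi, len(env), endpoint=False)
    logenv = np.log(env)
    # First two real-cepstrum coefficients (numerical approximation).
    c0 = np.mean(logenv)
    c1 = np.mean(logenv * np.cos(w))
    tilt = c0 + 2.0 * c1 * np.cos(w)                 # log T(w)
    R = np.exp(logenv - tilt)                        # flattened residual envelope, Eq. (12.23)
    P = np.maximum((R / R.max()) ** gamma, floor)    # root-gamma compression and clipping
    post = env * P                                   # postfiltered envelope, Eq. (12.24)
    # Renormalize so that the energy after postfiltering equals that before.
    return post * np.sqrt(np.sum(env ** 2) / np.sum(post ** 2))
```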

Experimental results: When the voicing-dependent synthetic-phase model in Equation (12.22) is used to replace the measured phases in the harmonic sinewave synthesizer and the above postfilter is applied, the speech is found to be of very high quality. It is particularly notable that the synthetic speech does not have a “reverberant” characteristic, an effect which, as we saw in Chapter 9, arises in sinewave synthesis when the component sinewaves are not forced to be phase locked. The minimum-phase model preserves a (controlled) dispersive characteristic in the voiced and unvoiced sinewave phases, helping the synthetic speech to sound natural. Moreover, the effect of the postfilter is to make the synthetic speech sound more crisp, considerably reducing the muffled quality often perceived in low-rate speech coders.

Sinewave parameter coding at low data rates — We saw in the previous section that a synthetic minimum-phase function can be obtained from the SEEVOC magnitude estimate and that the excitation sinewave phases can be modeled in terms of a linear phase due to the onset times and voicing-adaptive random phases due to a noise component. Because the pitch can be scalar-quantized using ≈ 8 bits and the voicing probability using ≈ 3 bits, good-quality speech at low data rates appears achievable provided the sinewave amplitudes can be coded efficiently. A number of approaches can be applied to code the sinewave amplitudes. To give a flavor of the coding issues involved, we describe a cepstral modeling approach and then briefly allude to other techniques.

The cepstral model: The first step in coding the sinewave amplitudes is to develop a parametric model for the SEEVOC envelope11 using the cepstral representation. Because the SEEVOC estimate is real and even, such a model can be written as

11 The estimate of the log-magnitude of the vocal tract transfer function can also be taken as the cubic spline fit to the logarithm of the sinewave amplitudes at the frequencies obtained using the SEEVOC peak-picking routine, providing a smoother function than the SEEVOC estimate [50].

(12.25)

Image

The real cepstrum c[m] is then truncated to M points (M ≤ 40 has been found to be adequate) and the cepstral length becomes a design parameter that is varied depending on the coder rate. This representation provides a functional form for warping the frequency axis prior to quantization, which can exploit perceptual properties of the ear to reduce the number of required cepstral coefficients [49].
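Numerically, the truncated cepstral model of Equation (12.25) can be computed directly from samples of the log SEEVOC envelope, as in the sketch below; an FFT-based computation is equivalent. The cosine-series form assumes the standard relation log|H(ω)| ≈ c[0] + 2∑ c[m] cos(mω) for a real, even log-spectrum.

```python
import numpy as np

def seevoc_cepstrum(env, M=40):
    # env: SEEVOC magnitude envelope sampled uniformly on [0, pi); M: retained coefficients.
    w = np.linspace(0.0, np.pi, len(env), endpoint=False)
    logenv = np.log(env)
    # c[m] = (1/pi) * integral over [0, pi] of log|H(w)| cos(m w) dw, approximated by an average.
    return np.array([np.mean(logenv * np.cos(m * w)) for m in range(M)])

def envelope_from_cepstrum(c, w):
    # Smoothed log envelope reconstructed from the truncated cepstrum.
    m = np.arange(1, len(c))
    return c[0] + 2.0 * np.cos(np.outer(w, m)) @ c[1:]
```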

Spectral warping: We saw in Chapter 8 that, above about 1000 Hz, the front-end human auditory system consists of filters whose center frequency increases roughly logarithmically and whose bandwidth is near constant-Q with increasing frequency. This increasing decimation and smoothing with frequency, said to make the ear less sensitive to details in the spectral envelope at higher frequencies, can be exploited by a nonlinear sampling at the higher frequencies, thereby providing a more efficient allocation of the available bits. Let ω′ = W(ω) and ω = W−1(ω′) represent the warping function and its inverse. An example of a warping function is shown in Figure 12.19. Then the warped envelope is computed from the SEEVOC envelope through the cepstral representation in Equation (12.25) using

Image

where the mapping occurs on uniform DFT frequency samples. In order to simulate the effect of auditory filters, including uniformly-spaced filters at low frequency, we select a mapping that is linear in the low-frequency region (≤ 1000 Hz) and exponential in the high-frequency region, a relationship which is described parametrically as

Figure 12.19 Typical spectral warping function (a) and its inverse (b).

SOURCE: R.J. McAulay and T.F. Quatieri, “Low-Rate Speech Coding Based on the Sinusoidal Model,” chapter in Advances in Speech Signal Processing [49]. ©1992, Marcel Dekker, Inc. Courtesy of Marcel Dekker, Inc.

Image

Image

If M cepstral coefficients are used in the representation, then the amplitude envelope on the original frequency scale is approximately recovered using the relation

Image

The value of M, as well as the warping parameters (α, ωL), are varied to give the best performance at a given bit rate. Some typical designs which have been found to give reasonable performance are given in [49],[50].

Cepstral transform coding: A potential problem in quantizing the speech envelope using the cepstral coefficients is their large dynamic range (Exercise 12.24) (see also Chapter 6). To avoid this problem, the M cepstral coefficients (derived from the warped spectral envelope) are transformed back into the frequency domain to result in another set of coefficients satisfying the cosine transform pair [49]:

Image

The advantage in using the transform coefficients gk instead of the cepstral coefficients c[m] is that the transform coefficients correspond to equally-spaced samples on the warped frequency scale (and thus nonlinearly-spaced on the original frequency scale) of the log-magnitude and, hence, can be quantized using the subband coding principles described earlier in Section 12.5.1. In addition, because of the spectral warping, the transform coefficients can be coded in accordance with the perceptual properties of the ear. Finally, the lowpass filtering incurred in truncation to M coefficients implies that the coefficients are correlated and thus differential PCM techniques in frequency can be exploited12 [49]. The overall spectral level is set using 4–5 bits to code the first channel gain. Depending on the data rate, the available bits are partitioned among the M−1 remaining channel gains using step-sizes that range from 1–6 dB with the number of bits decreasing with increasing frequency. Bit rates from 1.2–4.8 kbps have been obtained with a variety of bit allocation schemes. An example of a set of coded channel gains is shown in Figure 12.20 for 2.4 kbps operation. The reconstructed (decoded) envelope is also shown in Figure 12.20. Coding strategies include methods for frame-fill interpolation with decreasing frame rate [49],[58]. This is a simple method to reduce bit rate by transmitting the speech parameters every second frame, using control information to instruct the synthesizer how to reconstruct or “fill in” the missing data [58]. The frame-fill method has also proved useful at higher rates because it adds temporal resolution to the synthetic waveform [49].

12 Techniques have been invoked that protect against a positive slope-overload condition in which the coder fails to track the leading edge of a sharply-peaked formant [49],[74].

This section has described a cepstral approach to parameterizing and quantizing the sinewave amplitudes. An advantage of the cepstral representation is that it assumes no constraining model shape, except that the SEEVOC envelope must represent a vocal tract transfer function that is minimum-phase. Other methods further constrain the vocal tract envelope to be all-pole [50],[53]. This more constrained model leads to quantization rules that are more bit-rate efficient than those obtained with cepstral modeling methods, leading to higher quality at lower bit rates such as 2.4 kbps and 1.2 kbps [50]. In these schemes, the SEEVOC envelope itself is all-pole-modeled, rather than the speech waveform. Using the frequency-domain view in Chapter 5 of linear prediction analysis, we can derive an all-pole fit to the SEEVOC envelope simply by replacing the autocorrelation coefficients of the speech by those associated with the squared SEEVOC envelope (Exercise 12.17). An advantage of the all-pole fit to the SEEVOC (rather than the waveform) is that it removes dependency on pitch and allows variable-order models which in the limit fit the SEEVOC exactly, i.e., there is no problem with the autocorrelation aliasing illustrated in Figure 5.7. In addition, spectral warping can be applied and further coding efficiencies can be gained [50] by applying frequency-differential methods to a transformation of the all-pole predictor coefficients, referred to as line spectral frequencies (LSFs), which we study later in the chapter.
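The all-pole fit to the SEEVOC envelope alluded to above (Exercise 12.17) amounts to replacing the speech autocorrelation by the inverse DFT of the squared envelope and then running the Levinson-Durbin recursion, as in this sketch; grid sizes and the model order are illustrative.

```python
import numpy as np

def allpole_from_seevoc(env, p=10):
    # env: SEEVOC magnitude envelope sampled uniformly on [0, pi] inclusive; p: model order.
    power = np.concatenate([env, env[-2:0:-1]]) ** 2   # even power spectrum on the full circle
    r = np.fft.ifft(power).real[:p + 1]                # "autocorrelation" lags 0..p

    # Levinson-Durbin recursion for A(z) = 1 + a_1 z^-1 + ... + a_p z^-p.
    a = np.array([1.0])
    E = r[0]
    for i in range(1, p + 1):
        k = -(r[i] + np.dot(a[1:], r[i - 1:0:-1])) / E   # reflection (PARCOR) coefficient
        a = np.concatenate([a, [0.0]])
        a = a + k * a[::-1]
        E *= 1.0 - k * k
    return a, E                                          # predictor polynomial and residual energy
```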

Figure 12.20 Typical unquantized and quantized channel gains derived from the cepstral transform coefficients. (Only one gain is shown when the two visually coincide.) The dewarped spectral envelope after decoding is also illustrated.

SOURCE: R.J. McAulay and T.F. Quatieri, “Low-Rate Speech Coding Based on the Sinusoidal Model,” chapter in Advances in Speech Signal Processing [49]. ©1992, Marcel Dekker, Inc. Courtesy of Marcel Dekker, Inc.

Image

Finally, we note that higher rates than 4.8 kbps (e.g., 6.4 kbps) can be achieved with improved quality by utilizing measured sinewave phases, rather than the synthetic minimum phase used at the lower rates. In one scheme, the measured sinewave phases for the eight lowest harmonics are coded and the remaining high-frequency sinewave phases are obtained from the synthetic phase model of Equation (12.22) [1]. However, because the linear phase component of the measured phases differs from that of the synthetic phase model, a linear phase from an absolute onset time must be estimated and subtracted from the measured phases. The measured phases are then aligned with the synthetic phases by adding to them the synthetic linear phase. Other methods of incorporating phase have also led to improved synthesis quality [2].

Multi-Band Excitation Vocoder — One of the many applications of the sinewave modeling technique to low-rate speech coding is the multi-band excitation (MBE) speech coder developed by Griffin, Lim, and Hardwick [24],[25]. The starting point for MBE is to represent speech as a sum of harmonic sinewaves. This framework and the resulting pitch and multi-band voicing estimates were described in Chapter 10 (Section 10.6). Versions of this coder have been chosen as standards for the INMARSAT-M system in 1990 [28], the APCO/NASTD/Fed Project 25 in 1992 [32], and the INMARSAT-Mini-M system in 1994 [16], demonstrating good quality in a multitude of conditions, including acoustic noise and channel errors. The purpose of this section is to show the similarities and differences between MBE and the generic sinewave coding methods described in previous sections.

Sinewave analysis and synthesis: The amplitudes that were computed in Chapter 10 as a byproduct during the MBE pitch estimation process have proven not to be reliable estimates of the underlying sinewave amplitudes (Exercise 10.8). Consequently, the MBE coder uses a different method to estimate sinewave amplitudes for synthesis and coding. Because the pitch has been determined, the discrete-time Fourier Transform of the windowed harmonic sinewave speech model, ŝ[n] [Equation (10.46)], within the region Δk of the kth harmonic [Equation (10.49)] can be written explicitly as (ignoring a 2π scaling)

Image

where ωo is the fundamental frequency estimate and where we have used the notation of Chapter 10 in which Image denotes the discrete-time Fourier Transform of Image with w[n] the analysis window. In the MBE coder, the amplitude is chosen such that over each harmonic region, Δk, the energy in the model matches the measured energy. This leads to the amplitude estimator

Image

An additional processing step that is required before the sinewave amplitude data is presented to the synthesizer is the postfiltering operation based on the frequency-domain design principles introduced in Section 12.5.2. The phase function in MBE uses a synthetic excitation phase model described in Section 12.5.2 in which the excitation phase of the kth sinewave is k times the phase of the fundamental.
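One form of the energy-matching amplitude estimator described above is sketched below: in each harmonic band the model energy, |A_k|² times the window-transform energy, is set equal to the measured energy of the speech transform. The DFT size, the band edges at harmonic midpoints, and the use of the window-transform magnitude centered on each harmonic are assumptions of this sketch; the estimator used in the MBE standard [37] may differ in detail.

```python
import numpy as np

def mbe_band_amplitudes(x, w, f0, fs, nfft=1024):
    # x, w: speech frame and analysis window (equal length); f0: pitch in Hz; fs: sample rate.
    X = np.fft.rfft(x * w, nfft)
    W = np.fft.rfft(w, nfft)                    # analysis-window transform
    b0 = f0 * nfft / fs                         # harmonic spacing in DFT bins
    K = int(np.floor((nfft / 2) / b0 - 0.5))    # harmonics whose band fits below Nyquist
    amps = np.zeros(K)
    for k in range(1, K + 1):
        lo = int(round((k - 0.5) * b0))
        hi = int(round((k + 0.5) * b0))
        bins = np.arange(lo, hi)
        Ex = np.sum(np.abs(X[bins]) ** 2)       # measured energy in the kth harmonic band
        # Window-transform energy shifted to the harmonic center (magnitude is symmetric).
        Ew = np.sum(np.abs(W[np.abs(bins - int(round(k * b0)))]) ** 2)
        amps[k - 1] = np.sqrt(Ex / Ew) if Ew > 0 else 0.0
    return amps
```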

In MBE, two distinct methods are used to synthesize voiced and unvoiced speech (Figure 12.21). Voiced speech is synthesized using either sinewave generation similar to that in Chapter 9 with frequency matching and amplitude and phase interpolation, or using the overlap-add method of Section 9.4.2. Details of the conditions under which different harmonics use each synthesis method are given in [37].

For those speech bands for which the sinewaves have been declared unvoiced, MBE synthesis is performed using filtered white noise. Care must be taken to insure that the effect of the analysis window has been removed so that the correct synthesis noise level is achieved. The details of the normalization procedure are given in [37]. This approach to unvoiced synthesis is in contrast to the sinewave synthesis of Section 12.5.2 which uses random phases in the unvoiced regions. The advantage of using random sinewave phases is that the synthesizer is simpler to implement and more natural “fusing” of the two components may occur, as the same operations are performed for voiced and unvoiced speech.

Sinewave parameter coding: In order to operate MBE as a speech coder, the pitch, voicing, and sinewave amplitudes must be quantized. For low-rate coding, measured phase is not coded and the above synthetic phase model is derived from pitch and a sinewave amplitude envelope when minimum-phase dispersion is invoked. Consider, for example, the particular MBE coding scheme used in the INMARSAT-M system. Here 8 bits are allocated to pitch quantization. Because allowing all voiced/unvoiced multi-band decisions for each harmonic requires an excessive number of bits, the decisions are made on groups of adjacent harmonics. A maximum of 12 bands (and thus 12 bits for one bit per binary decision) is set; if this maximum is exceeded, as with low-pitched speakers, then all remaining high-frequency harmonics are designated unvoiced. For the INMARSAT-M system, 83 bits per frame are allocated. The bits allocated for the sinewave amplitudes per frame are then given by the required total number of bits per frame reduced by the sum of the pitch and voicing bits, i.e., 83 − 8 − b, where b denotes the bits (≤ 12) for multi-band voicing decisions. For a 20-ms frame interval, 83 bits result in 4.15 kbps. The gross bit rate in the INMARSAT-M system is 6.4 kbps with 45 bits per frame allocated for bit transmission error correction [37]. As with the basic sinewave coder of Section 12.5.2, lower rates (e.g., 1.2 and 2.4 kbps) and more efficient quantization can be obtained using an all-pole model of the sinewave amplitude estimates [37],[83]. In higher-rate MBE coding, time-differential coding of both the sinewave amplitudes and measured phases is used [24].

Figure 12.21 Schematic of the MBE synthesis as a basis for speech coding.

Image

12.6 Model-Based Coding

We have already introduced the notion of model-based coding in using an all-pole model to represent the sinewave-based SEEVOC spectral envelope for an efficient coding of sinewave amplitudes; this can be considered a hybrid speech coder, blending a frequency-domain representation with the all-pole model. The purpose of model-based approaches to coding is to increase the bit efficiency of waveform-based or frequency-domain coding techniques either to achieve higher quality for the same bit rate or to achieve a lower bit rate for the same quality. In this section we move back in history and trace through a variety of model-based and hybrid approaches in speech coding using, in particular, the all-pole speech representation. We begin with a coder that uses the basic all-pole linear prediction analysis/synthesis of the speech waveform. The deficiencies of scalar quantization in a simple binary excitation impulse/noise-driven linear-prediction coder lead to the need for applying vector quantization. Deficiencies in the binary source representation itself lead to a hybrid vocoder, the mixed excitation linear prediction (MELP) coder. In this coder, the impulse/noise excitation is generalized to a multi-band representation, and other important source characteristics are incorporated, such as pitch jitter, glottal flow velocity, and time-varying formant bandwidths from nonlinear interaction of the source with the vocal tract. A different approach to source modeling, not requiring explicit multi-band decisions and source characterization, uses a multipulse source model, ultimately resulting in the widely used code-excited linear prediction (CELP) coder.

12.6.1 Basic Linear Prediction Coder (LPC)

Recall the basic speech production model in which the vocal tract system function, incorporating the glottal flow velocity waveform during voicing, is assumed to be all-pole and is given by

Image

where the predictor polynomial

Image

and whose input is the binary impulse/noise excitation. Suppose that linear prediction analysis is performed at 100 frames/s and 13 parameters (10 all-pole spectrum parameters, pitch, voicing decision, and gain) are extracted on each frame, resulting in 1300 parameters/s to be coded. For telephone bandwidth speech of 4000 Hz, 1300 parameters/s is clearly a marked reduction over the 8000 samples/s required by strict waveform coding. Hence we see the potential for low bit rate coding using model-based approaches. We now look at this linear prediction coder (LPC), first introduced by Atal and Hanauer [5], in more detail.

Rather than use the prediction coefficients ai, we use the corresponding poles bi, the partial correlation (PARCOR) coefficients ki (equivalently, from Chapter 5, the reflection coefficients), or some other transformation on the prediction coefficients. This is because, as we noted in Chapter 5, the behavior of the prediction coefficients is difficult to characterize, having a large dynamic range (and thus a large variance); in addition, their quantization can lead to an unstable system function (poles outside the unit circle) at synthesis. On the other hand, the poles bi or PARCOR coefficients ki, for example, have a limited dynamic range and can be enforced to give stability because |bi| < 1 and |ki| < 1. There are many ways to code the linear prediction parameters. Ideally, the optimal quantization uses the Max quantizer based on known or estimated pdfs of each parameter. One early scenario at 7200 bps involves the following bit allocation per frame [5]:

1. Voiced/unvoiced decision: 1 bit (on or off)

2. Pitch (if voiced): 6 bits (uniform)

3. Gain: 5 bits (nonuniform)

4. Poles bi: 10 bits (nonuniform), including 5 bits for bandwidth and 5 bits for center frequency, for each of 6 poles giving a total of 60 bits

At 100 frames/s with (1+6+5+60) bits per frame, we then have the desired 7200 bps. This gives roughly the same quality as the uncoded synthesis, although a mediocre quality by current standards, limited by the simple impulse/noise excitation model.

Refinements to this basic coding scheme involve, for example, companding in the form of a logarithmic operator on pitch and gain, typical values being 5-bit or 6-bit logarithmic coding of pitch and 5-bit logarithmic coding of gain. Another improvement uses the PARCOR coefficients that are more amenable to coding than the pole locations. As noted earlier, the stability condition on the PARCOR coefficients ki is |ki| < 1 and is simple to preserve under quantization; therefore, interpolation between PARCOR coefficients of stable filters guarantees stable filters.

It has been shown empirically (using histogram analysis) that the first few PARCOR coefficients, k1 and k2, have an asymmetric pdf for many voiced sounds, k1 being near − 1 and k2 being near +1 [46]. The higher-order coefficients have a pdf closer to Gaussian, centered around zero. Therefore, nonuniform quantization is desirable. Alternatively, using companding, the PARCOR coefficients can be transformed to a new set of coefficients with a pdf close to uniform. In addition, there is a second reason for the companding transformation: The PARCOR coefficients do not have as low a spectral sensitivity as one would like. By spectral sensitivity, we mean the change in the spectrum with a change in the spectral parameters, a characteristic that we aim to minimize. It has been found that a more desirable transformation in this sense is the logarithm of the vocal tract area function ratio Image (Chapter 5), i.e.,

Image

where the parameters gi have a pdf close to uniform and a smaller spectral sensitivity than the PARCOR coefficients, i.e., the all-pole spectrum changes less with a change in gi than with a change in ki13 (and less than with a change in pole positions). Typically, these parameters can be coded at 5–6 bits each, a marked improvement over the 10-bit requirement of pole parameters. Therefore, with 100 frames/s and an order 6 predictor (6 poles), we require (1 + 6 + 5 + 36) × 100 = 4800 bps for about the same quality as attained at 7200 bps by coding pole positions for telephone bandwidth speech (4000 Hz) [46].

13 The log-area ratio represents a step improvement over the PARCOR coefficients in quantization properties. An even greater advantage is gained by line spectral frequencies that we introduce in a following section.
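A common form of the log-area-ratio companding of the preceding paragraph is gi = log[(1 − ki)/(1 + ki)]; depending on the area-ratio convention of Chapter 5, the reciprocal ratio may be used instead, which only changes the sign of gi. A minimal sketch:

```python
import numpy as np

def parcor_to_lar(k):
    # Log-area ratios from PARCOR (reflection) coefficients, |k_i| < 1.
    k = np.asarray(k, dtype=float)
    return np.log((1.0 - k) / (1.0 + k))

def lar_to_parcor(g):
    # Inverse companding; |k_i| < 1 is guaranteed for any finite g_i,
    # so uniform quantization of g_i cannot produce an unstable synthesis filter.
    g = np.asarray(g, dtype=float)
    return (1.0 - np.exp(g)) / (1.0 + np.exp(g))
```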

Observe that by reducing the frame rate by a factor of two to 50 frames/s, we attain a bit rate of 2.4 kbps. This basic LPC structure was used as a government standard for secure communications at 2.4 kbps for about a decade. Although the quality of this standard allowed a usable system, in time, quality judgments became stricter and the need for a new generation of speech coders arose. This opened up research on two primary problems with speech coders based on all-pole linear prediction analysis: (1) the inadequacy of the basic source/filter speech production model, and (2) the failure to account for possible parameter correlation when one-dimensional scalar quantization techniques are used. We first explore vector quantization methods to exploit correlation across prediction parameters.

12.6.2 A VQ LPC Coder

One of the first speech coders that exploited linear dependence across scalars applied vector quantization (VQ) to the LPC PARCOR coefficients. The components of this coder are illustrated in Figure 12.22. A set of PARCOR coefficients provides the training vectors from which reconstruction levels ri are derived with the k-means algorithm. For given PARCOR coefficients k on each analysis frame, the closest reconstruction level is found to give quantized value Image and a table index is transmitted as a codeword. At the receiver, a binary codeword is decoded as an index for finding the reconstruction level in a table lookup. The pitch and voicing decision are also coded and decoded.
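The codebook training and table-lookup decoding just described can be sketched with a basic Lloyd (k-means) iteration over a training set of PARCOR vectors. The codebook size, iteration count, and brute-force distance computation are illustrative assumptions; production systems use structured codebooks and perceptually motivated distortion measures.

```python
import numpy as np

def train_codebook(train_vecs, bits=10, iters=20, seed=0):
    # train_vecs: (num_frames, p) PARCOR vectors; requires num_frames >= 2**bits.
    rng = np.random.default_rng(seed)
    codebook = train_vecs[rng.choice(len(train_vecs), 2 ** bits, replace=False)]
    for _ in range(iters):
        # Nearest-codeword assignment under squared error (memory-heavy but simple).
        d = ((train_vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)
        for k in range(len(codebook)):            # centroid update
            cell = train_vecs[idx == k]
            if len(cell):
                codebook[k] = cell.mean(axis=0)
    return codebook

def encode(k_vec, codebook):
    return int(((codebook - k_vec) ** 2).sum(axis=1).argmin())   # transmitted index

def decode(index, codebook):
    return codebook[index]                                        # table lookup at the receiver
```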

There are two cases that have been investigated with the VQ LPC structure shown in Figure 12.22. First, we try to achieve the same quality as obtained with scalar quantization of the PARCOR coefficients, but at a lower bit rate. Wong, Juang, and Gray [82] experimented with a 10-bit codebook (1024 codewords) and found that a roughly 800-bps vector quantizer could achieve quality comparable to that of a 2400-bps scalar quantizer. At 44.4 frames/s, 440 bits were required to code the PARCOR coefficients each second; 8 bits were used for pitch, voicing, and gain on each frame, and 1 bit for frame synchronization each second, which roughly makes up the remaining bits. In the second case, the goal is to maintain the bit rate and achieve a higher quality. A higher-quality 2400-bps coder was achieved with a 22-bit codebook, corresponding to 2²² ≈ 4,200,000 codewords that need to be searched. These systems, which were developed in the early 1980s, were not pursued for a number of reasons. First, they were intractable with respect to computation and storage. Second, the quality of vector-quantized LPC was characterized by a “wobble” due to the quantization of the LPC-based spectrum. When a spectral vector representation is near a VQ cell boundary, the quantized spectrum wobbles back and forth between two cell centroids from frame to frame; perceptually, the quantization never seems fine enough. Therefore, in the mid-1980s the emphasis shifted from improved VQ of the spectrum to better excitation models, and ultimately to a return to VQ applied to the excitation. We now look at some of these approaches, beginning with a refinement of the simple binary impulse/noise excitation model.

Figure 12.22 A VQ LPC vocoder. In this example, the synthesizer uses an all-pole lattice structure.

Image

12.6.3 Mixed Excitation LPC (MELP)

A version of LPC vocoding developed by McCree and Barnwell [55] exploits the multi-band voicing decision concept introduced in Section 12.5.2 and is referred to as Mixed Excitation (multi-band) LPC (MELP). MELP addresses shortcomings of conventional linear prediction analysis/synthesis primarily through a realistic excitation signal, time-varying vocal tract formant bandwidths due to the nonlinear coupling of the glottal flow velocity and vocal tract pressure, and production principles of the “anomalous” voice that we introduced in Chapters 3, 4, and 5. Aspects of this improved linear prediction excitation were also explored by Kang and Everett [34].

Model — As with the foundation for MBE, in the model on which MELP is based, different mixtures of impulses and noise are generated in different frequency bands (4–10 bands). The impulse train and noise in the MELP model are each passed through time-varying spectral shaping filters and are added together to form a fullband signal. Important unique components of MELP are summarized as:

1. An auditory-based approach to multi-band voicing estimation for the mixed impulse/noise excitation.

2. Aperiodic impulses due to pitch jitter, the creaky voice, and the diplophonic voice.

3. Time-varying resonance bandwidth within a pitch period accounting for nonlinear source/system interaction and introducing the truncation effect.

4. More accurate shape of the glottal flow velocity source.

We look briefly at how each component is incorporated within the LPC framework.

To estimate the degree of voicing (i.e., the degree of harmonicity) within each band, a normalized correlation coefficient of each output x[n] from a bank of constant-Q bandpass filters is calculated over a duration N at the pitch period lag P as

Image

A problem with this approach is its sensitivity to varying pitch, the value of r[P] being significantly reduced, resulting in a slightly whispered quality to the synthetic speech. On the other hand, motivated by auditory signal processing principles, McCree and Barnwell [55] found that the envelope of the bandpass signal output is less sensitive to nonstationarity, characterized by a broad peak at the pitch period. Therefore, the maximum of the normalized correlation at the pitch lag of the waveform and of the envelope is selected as a measure of the degree of voicing, and this determines how much noise is added in each band.
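The per-band voicing measure can be sketched as a normalized autocorrelation at the pitch lag, evaluated on both the bandpass waveform and a simple envelope of it, with the larger of the two taken as the voicing strength. The one-pole envelope smoother used here is an assumption of the sketch, not the filter specified in [55].

```python
import numpy as np
from scipy.signal import lfilter

def normalized_corr(x, P):
    # Normalized autocorrelation of x at lag P (near 1 for periodic, near 0 for noise).
    x0, xP = x[:-P], x[P:]
    return np.dot(x0, xP) / np.sqrt(np.dot(x0, x0) * np.dot(xP, xP))

def band_voicing(x_band, P):
    env = lfilter([0.05], [1.0, -0.95], np.abs(x_band))   # crude envelope of the bandpass signal
    return max(normalized_corr(x_band, P), normalized_corr(env, P))
```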

To reduce buzziness due to a pitch contour that is unnaturally stationary (characteristic of conventional LPC), pitch periods are varied with a pulse position jitter uniformly distributed up to ±25%, mimicking the erratic behavior of the glottal pulses that can occur in regions such as transitions, creakiness, or diplophonia. However, to avoid excessive hoarseness from the jitter (recall from Chapter 3 that the hoarse voice is characterized by irregularity in the pitch period), the “jittery state” is detected using both correlation and peakiness in the speech waveform as given by a peak-to-rms value, i.e., the jitter increases with decreasing correlation (at the pitch period) and increasing peak-to-rms. Each speech frame is classified as either voiced, jittery voiced, or unvoiced. In both voiced states, the synthesizer uses a mixed impulse/noise excitation, but in the jittery voiced state the synthesizer uses controlled aperiodic pulses to account for irregular movement of glottal pulses. Allowing irregular pulses also allows better representation of plosives, detected through the peakiness measure for the jittery voiced state, without the need for a separate plosive class. This finding is consistent with the suggestion by Kang and Everett [34] of improving the representation of plosives in linear prediction synthesis with the use of randomly spaced pulses.

The synthesizer also includes a mechanism for introducing a time-varying bandwidth into the vocal tract response that we saw in Chapters 4 and 5 can occur within a pitch period. Specifically, a formant bandwidth can increase as the glottis opens, corresponding to an increasing decay rate of the response into the glottal cycle. This effect can be mimicked in linear prediction synthesis by replacing z−1 in the all-pole transfer function by αz−1 and letting α decrease, thus moving the filter poles away from the unit circle; this and other pole bandwidth modulation techniques, as well as an approach to their estimation, were considered in [55]. Finally, the simple impulsive source during voicing is dispersed with an FIR filter to better match an actual glottal flow input.

MELP Coding — A 2.4-kbps coder has been implemented based on the MELP model and has been selected as a government standard for secure telephone communications [56]. In the original version of MELP, 34 bits are allocated to scalar quantization of the LPC coefficients [specifically, the line spectral frequencies (LSFs) described below], 8 bits for gain, 7 bits for pitch and overall voicing (estimated using an autocorrelation technique on the lowpass filtered LPC residual), 5 bits to multi-band voicing, and 1 bit for the jittery state (aperiodic) flag. This gives a total of 54 bits per 22.5-ms frame equal to 2.4 kbps. In the actual 2.4-kbps standard, greater bit efficiency is achieved with vector quantization of the LSF coefficients [56]. Further quality improvement was obtained at 2.4 kbps by combining the MELP model with sinusoidal modeling of the low-frequency portion of the linear prediction residual, formed as the output of the vector-quantized LPC inverse filter [55], [57]. Similar approaches were also used to achieve a higher rate 4.8-kbps MELP coder.

We have indicated that in MELP, as well as in almost all linear prediction-based coders, a more efficient parameter set for coding the all-pole model is the line spectral frequencies (LSFs) [78], [79]. We saw that the LSF parameter set is also used in the sinewave coder to represent the all-pole model of the sinewave amplitudes for coding at 2.4–4.8 kbps. The LSFs for a pth order all-pole model are defined as follows. Two polynomials of order p + 1 are created from the pth order inverse filter A(z) according to

(12.26)

Image

The line spectral frequencies ωi correspond to the roots of P(z) and Q(z) which are on the unit circle (i.e., at z = ejωi), where the trivial roots that always occur at ωi = π and ωi = 0 are ignored. Substantial research has shown that the LSFs can be coded efficiently and the stability of the resulting synthesis filter can be guaranteed when they are quantized. This parameter set has the advantage of better quantization and interpolation properties than the corresponding PARCOR coefficients [79]. However, it has the disadvantage that solving for the roots of P(z) and Q(z) can be more computationally intensive than computing the PARCOR coefficients. Finally, the polynomial A(z) is easily recovered from the LSFs (Exercise 12.18).
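Given the inverse-filter coefficients, the LSFs follow directly from Equation (12.26) by rooting P(z) and Q(z) and discarding the trivial roots at z = ±1, as in the sketch below. General-purpose polynomial rooting is used here for clarity; practical coders use cheaper search methods on the unit circle.

```python
import numpy as np

def lsf_from_lpc(a):
    # a = [1, a_1, ..., a_p]: coefficients of A(z) in powers of z^{-1}.
    a = np.asarray(a, dtype=float)
    P = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])   # A(z) + z^-(p+1) A(1/z)
    Q = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])   # A(z) - z^-(p+1) A(1/z)
    lsf = []
    for poly in (P, Q):
        r = np.roots(poly)
        ang = np.angle(r[np.imag(r) >= 0.0])                  # one root per conjugate pair
        lsf.extend(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)])  # drop trivial roots at 0 and pi
    return np.sort(np.array(lsf))

print(lsf_from_lpc([1.0, -0.9, 0.5]))   # two interlaced LSFs for a second-order example
```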

12.7 LPC Residual Coding

In linear prediction analysis/synthesis based on the all-pole model Image, we first estimate the parameters of Image (and the gain A) using, for example, the autocorrelation method of linear prediction. A residual waveform u[n] is obtained by inverse filtering the speech waveform s[n] by A(z), i.e.,

u[n] ↔ U(z) = S(z)A(z).

In the binary impulse/noise excitation model, this residual is approximated during voicing by a quasi-periodic impulse train and during unvoicing by a white noise sequence. We denote this approximation by Image. We then pass Image through the filter Image and compare the result to the original speech signal. During voicing, if the residual does indeed resemble a quasi-periodic impulse train, then the reconstructed signal is close to the original. More often, however, the residual during voicing is far from a periodic impulse train, as illustrated through examples in Chapter 5 (Figure 5.11). As with MELP, the motivation for the two coding schemes in this section is the inability of basic linear prediction analysis/synthesis to accurately model the speech waveform due, in part, to the complexity of the excitation. Unlike MELP, the methods of this section view the inverse filter output without necessarily explicit recourse to the underlying physics of source production, hence the use of the term “residual.”
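Inverse filtering and resynthesis are one-line operations once A(z) is known, as the sketch below shows with the convention a = [1, a_1, ..., a_p]; the coefficient signs depend on how the predictor polynomial is defined.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_residual(s, a):
    # U(z) = A(z) S(z): inverse-filter the speech frame to obtain the residual.
    u = lfilter(a, [1.0], s)
    # Passing u[n] back through the all-pole filter 1/A(z) recovers s[n].
    s_rec = lfilter([1.0], a, u)
    return u, s_rec
```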

12.7.1 Multi-Pulse Linear Prediction

Multi-pulse linear prediction analysis/synthesis can be considered the basis of a hybrid speech coder that blends waveform-based and all-pole model-based approaches to coding. During voicing, the basic goal of the multi-pulse coder is to represent the excitation waveform, which passes into Image, with additional impulses between the primary pitch impulses in order to make the resulting reconstruction Image a closer fit to s[n] than can be achieved with pitch impulses only. There are numerous justifications for such additional impulses, including the inadequacy of the all-pole model in representing the vocal tract spectrum and glottal flow derivative, and including “secondary” impulses due, for example, to vocal fry and diplophonia, as well as aspiration from turbulent air flow and air displacements resulting from small movements of the surface vocal folds [26]. Other events that help explain the complexity of the residual are due to nonlinear phenomena, including, for example, the glottal flow ripple we encountered in Chapter 5 (Section 5.7.2) and the vortical shedding proposed in Chapter 11 (Section 11.4). In addition to an improved representation of speech during voicing, multi-pulse prediction provides a more accurate representation of speech during unvoicing and in regions where the binary voiced/unvoiced decision is ambiguous, such as in voiced/unvoiced transitions, voiced fricatives, and plosives. The beauty of this approach is that a voicing decision is not required, the analysis being freed of a binary decision over the full speech band or its multi-bands.

Analysis-by-Synthesis— The objective in the analysis stage of multi-pulse linear prediction is to estimate the source pulse positions and amplitudes. To do so, we define an error criterion between the reconstructed waveform and its original over the lth synthesis frame as

(12.27)

Image

where the sequence Image is the output of the filter Image (where we assume here the gain is embedded in the excitation) with a multi-pulse input Image, i.e.,

(12.28)

Image

where each input impulse has amplitude Ak and location nk. For each speech frame, we minimize E [lL] with respect to the unknown impulse positions and amplitudes. Because we need to synthesize the speech waveform for different candidate impulse parameters in doing the minimization, this approach falls under the class of analysis-by-synthesis methods whereby analysis is accomplished via synthesis, as illustrated in Figure 12.23. Finding Ak and nk for k = 1, 2, … , M by this minimization problem is difficult, given that the impulse positions are nonlinear functions of Image. Before addressing this general problem, we simplify the optimization in the following example:

Figure 12.23 Multi-pulse linear prediction as an analysis-by-synthesis approach.

Image

Example 12.5       Consider a single impulse Aδ[n − N] so that the error in Equation (12.27) becomes

Image

with h[n] the all-pole impulse response and with impulse position N and amplitude A unknown. Differentiating with respect to A, we have

Image

resulting in the optimal value of A:

Image

To obtain the optimal value of N, we substitute the expression for A* into E(lL; N, A) which becomes a function of only the unknown position N. We then have a nonlinear minimization problem in solving for the optimal position N*. One solution is to perform an exhaustive search over the frame interval [lL, (l + 1)L − 1] to obtain N* from which the value for A* follows. Image

Example 12.5 motivates a suboptimal solution to the general nonlinear multi-pulse minimization problem. For the general multi-pulse problem, an efficient but suboptimal solution is obtained by determining the location and amplitude of impulses, one impulse at a time. Thus an optimization problem with many unknowns is reduced to two unknowns.14 A closed-form solution for an impulse amplitude is given as above, followed by determination of the impulse location by an exhaustive search. We then subtract out the effect of the impulse from the speech waveform and repeat the process. More specifically, we have the following steps for a particular frame [6]:

14 The optimal solution requires consideration of the interaction among all impulses. One iterative approach first assumes an initial pulse positioning and solves the resulting linear optimization problem for the unknown amplitudes (Exercise 12.21). Search methods can then be used to re-estimate the impulse positions and the iteration proceeds until a small enough error is achieved. Although this approach can give greater bit efficiency, it does so at great computational expense [37].

Figure 12.24 Illustration of the suboptimal (iterative) impulse-by-impulse multi-pulse solution. Each panel shows the waveform of a 5-ms frame interval, the excitation, the memory hangover from the previous frame, and the error resulting from estimating and subtracting successive impulses, and shows the error decreasing with each additional impulse.

SOURCE: B.S. Atal and J.R. Remde, “A New Model of LPC Excitation for Producing Natural-Sounding Speech at Low Bit Rates,” [6]. ©1982, IEEE. Used by permission.

Image

S1: We begin with no excitation and an estimate of the all-pole vocal tract impulse response. Contribution from the past all-pole filter memory is subtracted from the current frame.

S2: The amplitude and location of a single impulse that minimize the mean-squared error are determined.

S3: A new error is determined by subtracting out the contribution from step S2.

S4: Repeat steps S2–S3 until a desired minimum error is obtained.

Figure 12.24 illustrates the method [6]. Panel (a) shows the waveform of a 5-ms frame interval, an initial zero excitation, the memory hangover from the previous frame, and the error resulting from subtracting the memory hangover. Panels (b)-(e) show the result of estimating and subtracting successive impulses, and show the error decreasing with each additional impulse. It has been found that negligible error decrease occurs after eight impulses over a 10-ms interval and the resulting speech is perceptually close to the original. The synthetic speech is not characterized by the unnatural or buzzy quality of basic linear prediction analysis/synthesis.
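The following is a minimal numerical sketch of steps S1–S4 in Python/NumPy. It assumes the memory hangover from previous frames has already been subtracted from the frame (step S1), omits the perceptual weighting introduced below, and restricts candidate locations to the current frame; the function name and defaults are illustrative and are not taken from [6].

import numpy as np
from scipy.signal import lfilter

def multipulse_frame(s_frame, a, n_pulses=8):
    # s_frame : speech samples of the current frame, with the memory hangover of the
    #           synthesis filter from past frames already removed (step S1)
    # a       : coefficients [1, a_1, ..., a_p] of the prediction-error filter A(z);
    #           the synthesis filter is 1/A(z)
    L = len(s_frame)
    h = lfilter([1.0], a, np.r_[1.0, np.zeros(L - 1)])       # all-pole impulse response, truncated to the frame
    e = np.asarray(s_frame, dtype=float).copy()              # current error waveform
    positions, amplitudes = [], []
    for _ in range(n_pulses):                                # steps S2-S4: one impulse at a time
        best_red, best_N, best_A = -1.0, 0, 0.0
        for N in range(L):                                   # exhaustive search over candidate locations
            hN = np.r_[np.zeros(N), h[:L - N]]               # impulse response shifted to location N
            num, den = e @ hN, hN @ hN
            red = num * num / den                            # reduction in squared error for this location
            if red > best_red:
                best_red, best_N, best_A = red, N, num / den # num/den is the closed-form amplitude
        positions.append(best_N)
        amplitudes.append(best_A)
        e -= best_A * np.r_[np.zeros(best_N), h[:L - best_N]]  # step S3: subtract this impulse's contribution
    return positions, amplitudes, e

Each pass places the impulse that most reduces the squared error, which is simply the single-impulse solution of Example 12.5 applied to the current error waveform.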

Example 12.6       An all-pole filter with 16 poles is estimated every 10 ms using the covariance method of linear prediction with a 20-ms analysis window. Impulse locations and amplitudes are determined with the method of the above steps S1–S4 over successive 5-ms frame intervals, i.e., the 10-ms all-pole response estimation time interval was split into two sub-frames. Two examples from this procedure are illustrated in Figure 12.25 [6]. As seen, the multi-pulse excitation is able to follow rapid changes in the speech, as during transitions. It is also interesting to observe various secondary pulses, including pulses in the excitation very close to the primary pulse, contributed perhaps from secondary glottal pulses, secondary nonacoustic sources, or zeros in the vocal tract system function that are not adequately modeled by the all-pole transfer function (Exercise 12.20). Image

Figure 12.25 Two illustrations of multi-pulse synthesis. Each case shows the (a) original speech, (b) synthetic speech, (c) excitation signal, and (d) error.

SOURCE: B.S. Atal and J.R. Remde, “A New Model of LPC Excitation for Producing Natural-Sounding Speech at Low Bit Rates,” [6]. ©1982, IEEE. Used by permission.

Image

We have omitted up to now an important component of the error minimization. This component is illustrated in Figure 12.23 as a perceptual weighting filter, accounting for the inadequacy of the mean-squared error criterion for human perception. This weighting de-emphasizes error in the formant regions by applying a linear filter that attenuates the error energy in these regions, i.e., we give less weight to the error near formant peaks, much as we did in selecting quantization noise in subband speech coders. The result is a synthetic speech spectrum with more error near formant peaks, roughly preserving the signal-to-noise ratio across frequency. In the frequency domain, the error criterion becomes

Image

where Q(ω) is the weighting function chosen to de-emphasize the error near formant peaks, and where S(ω) and Image(ω) refer to the Fourier transforms of s[n] and Image, respectively, over a frame duration. Thus, qualitatively, we want Q(ω) to take an inverse-like relation to S(ω) as we did in the subband coding context. Let P(z) be the predictor corresponding to an all-pole vocal tract transfer function. Then the inverse (prediction-error) filter is A(z) = 1 − P(z). One choice of Q(z) is given by

Image

where 0 ≤ α ≤ 1 [6]. The filter changes from Q(z) = 1 for α = 1 to Q(z) = 1 − P(z) (the inverse filter) for α = 0. The particular choice of α, and thus the degree to which error is de-emphasized near formant peaks, is determined by perceptual listening tests. Figure 12.26 shows an example for the typical value α = 0.8 used in Example 12.6 [6].
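The weighting filter itself is shown above only as an image; one form consistent with the two limiting cases just stated (Q(z) = 1 at α = 1 and Q(z) = 1 − P(z) at α = 0) is Q(z) = [1 − P(z)]/[1 − P(z/α)] = A(z)/A(z/α), i.e., the inverse filter divided by a bandwidth-expanded copy of itself. Treating that form as an assumption, a short sketch of its frequency response is:

import numpy as np
from scipy.signal import freqz

def weighting_response(a, alpha=0.8, n_freq=512):
    # a : coefficients [1, a_1, ..., a_p] of the inverse filter A(z) = 1 - P(z)
    a = np.asarray(a, dtype=float)
    k = np.arange(len(a))
    a_expanded = a * alpha ** k               # coefficients of A(z/alpha): a_k -> a_k * alpha^k
    w, Q = freqz(a, a_expanded, worN=n_freq)  # Q(z) = A(z) / A(z/alpha)
    return w, Q

# alpha = 1 gives Q(z) = 1 (no weighting); alpha = 0 gives Q(z) = A(z) (full inverse filtering)

Because A(z/α) pulls the zeros of A(z) toward the origin, |Q(ω)| dips near the formant frequencies, which is the de-emphasis described above.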

Figure 12.26 Example of perceptual weighting filter.

SOURCE: B.S. Atal and J.R. Remde, “A New Model of LPC Excitation for Producing Natural-Sounding Speech at Low Bit Rates,” [6]. ©1982, IEEE. Used by permission.

Image

Multi-Pulse Parameter Coding — In multi-pulse speech coding, the predictor coefficients (or transformations thereof such as the log area ratios or LSFs), pulse locations, and pulse amplitudes are quantized. In one 9.6-kbps coder, for example, 8 impulses per 10-ms frame are used, or 800 impulses/s. In this particular scheme, the corresponding bit allocation is 2.4 kbps for the all-pole system function A(z) and 7.2 kbps for the impulses (or 7200/800 = 9 bits/impulse) [37].

Given only 9 bits/impulse for impulse amplitudes and locations, the quantization strategy must be carefully selected. The impulse locations are coded differentially to reduce dynamic range, i.e., the difference Δi = ni − ni−1 between impulse locations is quantized rather than the absolute locations. The amplitudes are normalized to unity to push the dynamic range into one gain parameter, making all amplitudes lie in the range [0, 1]. The result is good quality at 9.6 kbps, but quality degrades rapidly as the bit allocation is reduced for bit rates below 9.6 kbps. This is particularly true for high-pitched voices, where a large portion of the 8 impulses/frame is used in representing the primary pitch pulses, leaving few impulses for the remaining excitation. For good quality below 9.6 kbps, as well as greater bit efficiency at 9.6 kbps, a different approach is taken, invoking long-term periodicity prediction and vector quantization of the excitation.
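Before turning to long-term prediction, the location/amplitude quantization just described can be sketched roughly as follows; the bit count, the single-gain normalization, and the handling of signed amplitudes are illustrative assumptions, not the allocation of any particular 9.6-kbps standard.

import numpy as np

def quantize_pulses(positions, amplitudes, n_bits_amp=4):
    order = np.argsort(positions)
    pos = np.asarray(positions)[order]
    amp = np.asarray(amplitudes, dtype=float)[order]
    deltas = np.diff(np.r_[0, pos])                     # differential locations: smaller dynamic range
    gain = np.max(np.abs(amp))                          # one gain absorbs the amplitude dynamic range
    norm = amp / gain                                   # normalized amplitudes, here kept signed in [-1, 1]
    levels = 2 ** n_bits_amp - 1
    codes = np.round((norm + 1.0) / 2.0 * levels).astype(int)   # uniform quantizer indices
    return deltas, gain, codes

def dequantize_amplitudes(codes, gain, n_bits_amp=4):
    levels = 2 ** n_bits_amp - 1
    return (codes / levels * 2.0 - 1.0) * gain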

12.7.2 Multi-Pulse Modeling with Long-Term Prediction

Primary pitch pulses, i.e., from the periodic glottal flow, can waste many of the available bits for the above 8 impulses/frame in multi-pulse speech coding. Therefore, one would like to represent the periodic correlation in the speech with a few parameters per frame, freeing up impulses in the remaining excitation. One strategy that avoids the coding of individual pitch impulses first performs pitch estimation and then introduces pitch impulses in the excitation. This eliminates the need for primary pitch impulses in the multi-pulse representation. The problem with this approach is that synchrony of the primary impulses with the original speech waveform periodicity is required; in addition, primary pitch pulses in speech vary in amplitude from frame-to-frame. An alternative is to introduce long-term prediction.

Long-term prediction is based on the observation that primary pitch pulses due to glottal pulses are correlated and predictable over consecutive pitch periods, so that

s[n] ≈ b s[n − P]

where P is the pitch period and b is a scale factor. In fact, we can consider the speech waveform as having a short-term and long-term correlation. As illustrated in Figure 12.27, the short-term correlation (with which we are already familiar from our linear prediction analysis of Chapter 5) occurs over the duration of the vocal tract response within a pitch period, while the long-term correlation occurs across consecutive pitch periods. The approach that we take, therefore, is to first remove short-term correlation by short-term prediction followed by removing long-term correlation by long-term prediction.

The short-term prediction-error filter is the pth-order polynomial Image, where p is typically in the range 10–16. The result of the short-term prediction error is a residual function u[n] that includes primary pitch pulses (long-term correlation). The long-term prediction-error filter is of the form

B(z) = 1 − bz^{−P}

where bz^{−P} is the long-term predictor in the z-domain. The output of the long-term prediction-error filter is a residual

υ[n] = u[n] − bu[n − P]

with fewer large (long-term correlated) pulses to code than in u[n] (Figure 12.28). After removing the long-term prediction contribution, the residual υ[n] forms the basis for an efficient coding of a multi-pulse excitation. Having the short-term predictor and long-term predictor, we can then invert the process and recover the original speech waveform as shown in Figure 12.29 where we assume knowledge of the residual υ[n] as well as inverse filters A(z) and B(z). In synthesis, with a frame-by-frame implementation, the memory hangover from a previous frame is added into the result of filtering with Image and Image on the current frame.
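A compact way to see the analysis/synthesis cascade of Figures 12.28 and 12.29 is to filter a signal through A(z) and then B(z), and to invert with 1/B(z) followed by 1/A(z). The sketch below operates on a whole signal, so the frame-by-frame memory-hangover bookkeeping mentioned above does not appear; the coefficient values, gain, and lag are stand-ins.

import numpy as np
from scipy.signal import lfilter

a = np.array([1.0, -1.2, 0.8])                  # A(z) = 1 - P(z): illustrative short-term inverse filter
b_gain, P = 0.9, 80                             # assumed long-term gain and pitch lag
b_coef = np.r_[1.0, np.zeros(P - 1), -b_gain]   # B(z) = 1 - b z^(-P)

s = np.random.randn(800)                        # stand-in for a speech segment

u = lfilter(a, [1.0], s)                        # short-term residual u[n] (FIR filtering by A(z))
v = lfilter(b_coef, [1.0], u)                   # long-term residual v[n] (FIR filtering by B(z))

u_hat = lfilter([1.0], b_coef, v)               # synthesis: 1/B(z)
s_hat = lfilter([1.0], a, u_hat)                # synthesis: 1/A(z)
print(np.allclose(s, s_hat))                    # exact inversion (up to numerical precision)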

In estimating the long-term predictor, we must estimate both the pitch period P and the scale factor b. The pitch period can be estimated independently with any pitch estimator. However, it is preferred to tie the estimation of P to the prediction problem because our goal is to remove pulses correlated over consecutive periods, reducing the prediction error. In the time domain, the long-term prediction-error filter B(z) = 1 − bz^{−P} is expressed by

b[n] = δ[n] − bδ[n − P].

Figure 12.27 Illustration of short- and long-term correlation in the speech waveform.

Image

Figure 12.28 Illustration of short- and long-term prediction. The residual sequence u[n] is the short-term prediction error and the residual υ[n] is the long-term prediction error.

Image

We then define the error criterion for the lth frame as

Image

where L is a frame duration. The objective is to minimize E[lL; P, b] with respect to P and b.

Suppose that we know the pitch period P. Then we differentiate with respect to the unknown b, i.e.,

Image

leading to

(12.29)

Image

Observe that this procedure corresponds to the covariance method of linear prediction because the fixed limits of summation imply that we have not truncated the data. Due to the long delay P, it is important to use the covariance method rather than the autocorrelation method; with the autocorrelation at long lags, we would obtain poor estimates of the correlation (Exercise 12.22).

Figure 12.29 Illustration of speech reconstruction from short- and long-term prediction filter outputs.

Image

Now substituting b* in the expression for E[lL; P, b], we obtain (Exercise 12.23)

(12.30)

Image

where

Image

We now want to maximize E′[lL; P] (which minimizes E[lL; P]). This maximization is similar to the operation required in the autocorrelation pitch estimator that we derived in Chapter 10. To do so, we compute E′[lL; P] for all integer values over some expected pitch period range. When speech is voiced, this results in local maxima at delays that correspond to the pitch period and its multiples.
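A direct, covariance-style implementation of this search over candidate lags is sketched below. The lag range corresponds to pitch periods typical of speech sampled at 8 kHz and, like the function name, is an assumption.

import numpy as np

def long_term_predictor(u, l, L, lag_range=(20, 147)):
    # u : short-term prediction residual (NumPy array); l, L : frame index and frame length
    n0 = l * L
    seg = u[n0:n0 + L]
    best_P, best_b, best_val = None, 0.0, -np.inf
    for P in range(lag_range[0], lag_range[1] + 1):
        if n0 - P < 0:
            continue                              # not enough history for this lag
        lagged = u[n0 - P:n0 - P + L]             # u[n - P] over the frame
        num = seg @ lagged                        # sum of u[n] u[n - P]
        den = lagged @ lagged                     # sum of u^2[n - P]
        if den == 0.0:
            continue
        val = num * num / den                     # E'[lL; P]: the quantity to maximize
        if val > best_val:                        # lag multiples also produce local maxima
            best_val, best_P, best_b = val, P, num / den   # b* = num / den
    return best_P, best_b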

The analysis and synthesis of Figures 12.28 and 12.29, respectively, form the basis for the multi-pulse residual estimation algorithm in Figure 12.30. Observe in Figure 12.30 that multi-pulse estimation is performed to attain a more accurate waveform match at the receiver. As in the previous section (12.7.1), we select impulse amplitudes and locations to minimize the mean-squared error between the original and synthetic speech where the frequency-domain error function is perceptually weighted. Now, however, impulses are selected to represent υ[n], the output of the long-term predictor, rather than u[n], the output of the short-term predictor. It is interesting to observe that, although long-term prediction intends to remove quasi-periodicity in the signal, multi-pulse modeling of the residual still results in impulses that are placed at or near the primary pitch pulses, i.e., some of the time multi-pulses lie near pitch pulses [37], as we indicated schematically by υ[n] in Figure 12.28. This is probably due to the inadequacy of the linear-prediction analysis in modeling a time-varying vocal tract that contains (among other contributions) zeros that can be modeled as impulses near the primary pitch pulses, as we alluded to earlier. Another interesting observation is the nature of the long-term predictor filter estimate used in synthesis, i.e.,

Image

which, when evaluated along the unit circle and magnitude-squared, results in

Image

This filter has a “comb” structure that imparts a quasi-harmonicity onto an estimate of the multi-pulse residual Image even when the residual υ[n] is not reproduced exactly [73].
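The comb structure is easy to verify numerically: with an assumed gain and lag, the squared magnitude of 1/B(z) on the unit circle is 1/(1 + b² − 2b cos(ωP)), which peaks at ω = 2πk/P.

import numpy as np
from scipy.signal import freqz

b_gain, P = 0.8, 80                              # assumed long-term gain and lag
den = np.r_[1.0, np.zeros(P - 1), -b_gain]       # B(z) = 1 - b z^(-P)
w, H = freqz([1.0], den, worN=8192)              # 1/B(z) evaluated on the upper unit circle
# |H(w)|^2 = 1 / (1 + b^2 - 2 b cos(w P)): comb "teeth" of height 1/(1-b)^2 at w = 2*pi*k/P
print(np.isclose(np.abs(H).max() ** 2, 1.0 / (1.0 - b_gain) ** 2))   # True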

Observe, as indicated in Figure 12.30, that we have solved for A(z) and B(z) “open-loop” in the sense that we are not doing waveform matching at these steps. In obtaining B(z), for example, the matching is performed on the predicted residual, rather than on the speech waveform itself. In contrast, we solve for the multi-pulse model of the residual υ[n] “closed-loop” because we are waveform-matching at this final stage of the optimization. Thus, we have performed the open-loop and closed-loop analyses sequentially; the multi-pulse approach (as is the CELP coder of the next section) is considered “analysis-by-synthesis,” but this is not strictly true, due to the sequential nature of the processing. On the other hand, we could have optimized all system components in closed-loop form, but this would be computationally infeasible.

Figure 12.30 Closed-loop/open-loop multi-pulse analysis/synthesis with short- and long-term prediction. The polynomials Â(z) and Image represent the short- and long-term predictor estimates with possibly quantized coefficients.

Image

With the analysis/synthesis scheme of Figure 12.30, we can obtain higher quality in the bit range 8–9.6 kbps than with the basic multi-pulse approach. In fact, a multi-pulse coder based on the configuration of Figure 12.30, including coding of the long-term predictor gain and delay, short-term predictor coefficients, and locations and amplitudes of residual impulses, forms the essence of a 9.6-kbps coder that was selected for public communication between commercial in-flight aircraft and ground, as well as a government standard secure telephone unit (STU–3) and an international standard for aeronautical mobile satellite telecommunications [37]. A variant of this multi-pulse scheme has also been selected for the Pan-European Digital Cellular Mobile Radio System (GSM) at 13 kbps. In this coder, referred to as regular-pulse-excited LPC (RPE-LPC), the residual from the long-term predictor is constrained to consist of equally spaced impulses of different amplitudes [19],[38]. Because the impulse locations are constrained, optimal selection of the impulse amplitudes becomes a linear problem (Exercise 12.21).

Although the multi-pulse approach with long-term prediction provides the basis for satisfactory coders near 9.6 kbps, achieving high-quality coding much below this rate requires a different way to represent and quantize the residual υ[n], i.e., the coding of individual impulse amplitudes and locations is too taxing for lower rate coders. In the next section, we present the fundamentals of the code-excited linear prediction (CELP) coder designed to more efficiently represent the residual based on vector quantization.

12.7.3 Code-Excited Linear Prediction (CELP)

Concept — The basic idea of code-excited linear prediction (CELP) is to represent the residual from long-term prediction on each frame by codewords from a VQ-generated codebook, rather than by multi-pulses. This codeword approach can be conceptualized by replacing the residual generator in Figure 12.30 by a codeword generator; on each frame, a codeword is chosen from a codebook of residuals so as to minimize the mean-squared error between the synthesized and original speech waveforms. The length of a codeword sequence is determined by the analysis frame length. For example, for a 10-ms frame interval split into 2 inner frames of 5 ms each, a codeword sequence is 40 samples in duration for an 8000-Hz sampling rate. The residual and long-term predictor are estimated with twice the time resolution (a 5-ms frame) of the short-term predictor (a 10-ms frame) because the excitation is more nonstationary than the vocal tract. As with multi-pulse linear prediction, a perceptual weighting is used in the selection of the codewords (Figure 12.30). The weighting function is chosen similarly to that in multi-pulse, and thus again yields a quantization error that is roughly proportional to the speech spectrum. In addition, a postfilter, similar to that described in previous sections, is introduced in synthesis for formant bandwidth sharpening. Two approaches to formation of the codebook have been considered: “deterministic” and “stochastic” codebooks. In either case, the number of codewords in the codebook is determined by the required bit rate and the portion of bits allocated to the residual υ[n]. A 10-bit codebook, for example, allows 2^{10} = 1024 codewords in the codebook.
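A stripped-down version of this codeword selection is sketched below: each candidate residual vector is synthesized through the quantized all-pole filter with zero state, its gain is computed in closed form, and the index/gain pair giving the smallest squared error to the target is kept. Perceptual weighting, the long-term (adaptive) contribution, and removal of the filter's zero-input response are omitted, and all names and sizes are illustrative.

import numpy as np
from scipy.signal import lfilter

def search_codebook(target, codebook, a_hat):
    # target   : speech (or weighted speech) for one inner frame, filter ringing removed (NumPy array)
    # codebook : (n_codewords, L) array of candidate residual vectors
    # a_hat    : quantized inverse-filter coefficients [1, a_1, ..., a_p]
    best_err, best_idx, best_gain = np.inf, -1, 0.0
    for idx, c in enumerate(codebook):
        y = lfilter([1.0], a_hat, c)             # zero-state synthesis of the candidate
        gain = (target @ y) / (y @ y)            # closed-form optimal gain for this codeword
        err = np.sum((target - gain * y) ** 2)
        if err < best_err:
            best_err, best_idx, best_gain = err, idx, gain
    return best_idx, best_gain                   # transmitted index and gain

# Example: a 10-bit codebook (1024 codewords) of 40-sample vectors (5-ms inner frame at 8 kHz)
rng = np.random.default_rng(0)
codebook = rng.standard_normal((1024, 40))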

A deterministic codebook is formed by applying the k-means clustering algorithm to a large set of residual training vectors. We saw earlier in this chapter that k-means clustering requires a distortion measure in training, as well as in selection of reconstruction levels in coding. For CELP, the natural distortion measure used for clustering in training is the waveform quantization error over each inner frame, i.e., Image, where L here denotes the inner frame length and Image is the quantized speech waveform. Successful codebooks of this type can be constructed using a large, representative set of speech training data, and have yielded good-quality speech [38], [76]. Nevertheless, CELP coders typically do not apply k-means training [38]. One reason is that often the training data and the speech being coded (the “test data”) are recorded over different channels, thus resulting in a channel mismatch condition when the VQ selection process occurs. Other reasons for not using trained codebooks are that trained codebooks may limit the coverage of the signal space and their lack of structure may make them unsuitable for fast codebook search algorithms.

One of the first alternative stochastic codebooks was motivated by the observation that the histogram of the residual from the long-term predictor follows roughly a Gaussian probability density function (pdf) [76]. This is roughly valid except at plosives and voiced/unvoiced transitions. In addition, the cumulative distributions (integral of the pdf) derived from histograms for actual residuals are nearly identical to those for white Gaussian random variables. As such, an alternative codebook is constructed of white Gaussian random numbers with unit variance.15 A gain must thus also be estimated for each codeword selection. Although suboptimal (in theory), this stochastic codebook offers a marked reduction in computation for codebook generation, with speech quality equivalent to that of trained codebooks [38], [76]. Codebooks are also used whose codeword vectors consist of residual impulses of ±1 amplitude at certain predetermined positions [75]. These are called algebraic codebooks because they are obtained at a receiver from the transmitted index using linear algebra rather than a table lookup, as described earlier in Section 12.4.3 for a typical VQ speech coding scenario. Algebraic codebooks have an advantage with respect to coding efficiency and search speed.

15 It appears that there should be no gain in coding efficiency with VQ used with vectors consisting of white Gaussian random variables because VQ was designed to exploit correlation among the vector elements. This dilemma is resolved by observing that for a fixed number of bits per vector, VQ allows fractional bits. Consider, for example, a 100-element vector at a 10-ms frame interval. Scalar quantization allows a minimum of one bit per vector element for 100 bits per frame or 10000 bits/s. On the other hand, VQ allows any number of bits per frame and thus fractional bits per element, e.g., 50 bits per frame correspond to 1/2 bit per vector element for 5000 bits/s. In addition, VQ gives flexibility in partitioning the vector space not possible with scalar quantization [44].
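The two codebook styles can be contrasted with a small sketch: a stored table of unit-variance Gaussian codewords versus an algebraic-style codeword built on the fly as a few ±1 pulses at predetermined positions. The pulse grid shown is purely illustrative and is not the track structure of any standardized algebraic codebook.

import numpy as np

L = 40                                              # 5-ms inner frame at 8 kHz
rng = np.random.default_rng(0)

# Stochastic codebook: table of unit-variance white Gaussian codewords (gain coded separately)
stochastic_codebook = rng.standard_normal((1024, L))

# Algebraic-style codeword: constructed from the transmitted index rather than stored in a table
def algebraic_codeword(positions, signs, length=L):
    c = np.zeros(length)
    c[np.asarray(positions)] = np.asarray(signs, dtype=float)   # a few +/-1 pulses
    return c

cw = algebraic_codeword([3, 11, 24, 38], [+1, -1, +1, -1])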

There have evolved many variations of the basic CELP concept, not only with respect to codebook design and search (lookup), but also with respect to the short- and long-term predictors. For example, in the multi-pulse scheme, we noted that the short-term and long-term predictors were estimated open-loop while the residual was estimated closed-loop. In various renditions of CELP, the short-term predictor is estimated in open-loop form, while the long-term predictor is estimated either in open-loop or closed-loop form and sometimes in both [8],[37],[75]. The open-loop long-term predictor estimation is performed, as in our previous discussion, whereby the long-term prediction bu[n − P] attempts to match the short-term prediction residual u[n], i.e., we minimized the mean-square of the error

υ[n] = u[n] − bu[n − P].

This minimization resulted in a solution for the long-term gain and delay (“pitch”) given by Equations (12.29) and (12.30), respectively. In the closed-loop estimation of the long-term predictor, rather than matching the short-term prediction error waveform, we match the speech waveform itself [37]. Suppose that we have estimated and quantized the short-term predictor polynomial Image; then we denote by Image the impulse response of Image. In order to determine the long-term predictor delay and gain closed-loop, one needs to simultaneously determine the optimal excitation codewords. Here the prediction error to be minimized seeks to match the speech waveform and is formed as

Image

where the synthetic waveform Image is given by

Image

where Image is the quantized excitation. Because solution by exhaustive search is computationally infeasible, various two-step (sub-optimal) procedures have been developed. The interested reader is referred to [37],[38]. The advantage of this closed-loop strategy is that the parameter estimation is performed with respect to the desired signal, rather than to an intermediary residual. As such, it can attain higher quality than the original CELP coder that performs estimation of the long-term predictor open-loop [37],[38].
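One common two-step realization of this closed-loop idea treats delayed copies of the past excitation as candidates (an “adaptive codebook”): each candidate lag proposes the past excitation delayed by P, this is filtered through 1/Â(z) with zero state, and the lag and gain minimizing the error against the (possibly weighted) speech-domain target are retained before the fixed codebook is searched. The sketch below is illustrative only; target preparation, weighting, and fractional lags are omitted.

import numpy as np
from scipy.signal import lfilter

def adaptive_codebook_search(target, past_excitation, a_hat, L, lag_range=(20, 147)):
    # target          : speech for the current subframe with filter ringing removed (NumPy array)
    # past_excitation : previously synthesized excitation samples (most recent last)
    # a_hat           : quantized inverse-filter coefficients [1, a_1, ..., a_p]
    N = len(past_excitation)
    best_err, best_P, best_gain = np.inf, None, 0.0
    for P in range(lag_range[0], lag_range[1] + 1):
        start = N - P
        if start < 0:
            continue
        cand = past_excitation[start:start + L]
        if len(cand) < L:                        # lag shorter than the subframe: repeat the segment
            cand = np.resize(cand, L)
        y = lfilter([1.0], a_hat, cand)          # contribution of this lag through 1/A_hat(z)
        denom = y @ y
        if denom == 0.0:
            continue
        gain = (target @ y) / denom
        err = np.sum((target - gain * y) ** 2)
        if err < best_err:
            best_err, best_P, best_gain = err, P, gain
    return best_P, best_gain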

CELP Coders — The CELP strategy has been widely used in a variety of government and international standard coders. One example is that of a CELP coder used in an early 1990s government standard for secure communications at 4.8 kbps at a 4000-Hz bandwidth (Fed-Std–1016) [8], one of three coders in a secure STU–3 telephone that invokes three bit rates: 9.6 kbps (multi-pulse), 4.8 kbps (CELP), and 2.4 kbps (LPC). In this 4.8-kbps standard, the short-term predictor is determined at a 30-ms frame interval and is coded with 34 bits per frame. The prediction coefficients of a 10th-order vocal tract spectrum are transformed to LSFs, a parameter set that we saw is more amenable to quantization than other all-pole representations. The 10 LSFs are coded using nonuniform quantization. The short-term and long-term predictors are estimated in open-loop form, while the residual codewords are determined in closed-loop form. The residual vectors and long-term pitch and gain are updated at a 7.5-ms inner frame and both are coded with VQ schemes (long-term pitch and gain as a separate vector). This entails a 512-element stochastic (algebraic) codebook for the residual, requiring 9 bits per inner frame, and an associated gain coded with 5-bit nonuniform scalar quantization per inner frame. The long-term pitch and gain use 256 codewords, requiring 28 bits per outer frame. The total number of bits/s is thus 4600 bps. The remaining 200 bps are used for frame synchronization at the receiver, for error protection, and for future expansion. Further details of this bit allocation are given in [8].

The CELP concept also forms the basis for many current international standards, for cellular, satellite, and Internet communications of limited bandwidth, with bit rates roughly in the range of 5–13 kbps. As does the 4.8-kbps STU–3 coder, these coders apply the residual models of this section and the quantization principles described throughout the chapter. Two such coders are G.729 and G.723.1. The two coders are based on a residual/LSF/postfilter analysis/synthesis, with the primary difference being the manner of coding the excitation residual. The G.729 coder runs at 8 kbps and was standardized by the ITU (International Telecommunication Union) for personal communication and satellite systems, and is based on an algebraic CELP residual coding scheme [30]. The gain and pitch from an open-loop estimation are refined with estimates from a closed-loop predictor. This coder also exploits time-differential quantization of the LSF coefficients, in particular, the finite-order moving average predictor approach of Figure 12.13, which reduces the propagation of channel bit errors over time. The G.723.1 coder is also based on a variation of the basic CELP concept, and is a current multimedia standard coder at 5.3 and 6.3 kbps [31].

12.8 Summary

In this chapter, we have described the principles of one-dimensional scalar quantization and its generalization to vector quantization that form the basis for waveform, model-based, and hybrid coding techniques. Not having the space to cover the wide array of all speech coders, we gave a representative set from each of the three major classes. The area of hybrid coders, in particular, is rapidly evolving and, itself, requires a complete text to do it justice. One important new class of hybrid coders, for example, is waveform-interpolation coders that view the speech waveform as a slowly evolving series of glottal cycles; this coder class combines a sinusoidal model with waveform representation methods [35]. This, in turn, has led to the development of a further decomposition of the sinewave representation of evolving glottal cycles into slowly varying and rapidly varying components [36]. By computing sinewave parameters at a relatively high frame rate (≈ 5 ms), matching the sinewave parameters from frame-to-frame, and applying complementary highpass and lowpass filters to the real and imaginary parts along each of the sinewave tracks, the rapidly varying and slowly varying components of the speech signal can be isolated. If the rapidly varying components are quantized crudely but often, and the slowly varying components are quantized accurately but infrequently, high-quality synthetic speech can be obtained at 2.4 kbps. This quantization technique exploits a form of perceptual masking where noise is hidden by spectrally dynamic (the rapidly varying) regions, and which we will again encounter in Chapter 13 for speech enhancement in noise.

Perceptual masking in speech coding, more generally, is an important area that we have only touched upon in this chapter. We have mentioned that, by working in the frequency domain, we are able to exploit perceptual masking principles to “hide” quantization noise under the speech signal in a particular frequency band. However, more general masking principles based on the masking of one tone by adjacent tones can also be exploited. Such principles have been applied successfully in wideband coding at high bit rates [29] and have also been investigated for use in sinewave coding [4],[21]. Finally, an important direction is the evolving area of speech coding based on nonlinear speech modeling. A number of new techniques are being developed based on the acoustic/nonacoustic modeling approaches of Chapter 11 in which parametric representations of speech modulations are exploited [67], as well as on nonlinear dynamical modeling approaches [40],[69].

Exercises

12.1 Show for the uniform scalar quantizer in Equation (12.1) that the quantizer step size Δ, defined as the spacing between two consecutive decision levels, is the same as the spacing between two consecutive reconstruction levels.

12.2 Assume quantization noise has a uniform probability density function (pdf) of the form Image for Image and zero elsewhere. Show that the variance of the quantization noise is given by Image, and thus that the noise variance of a uniform quantizer with step size Δ is given as in Equation (12.3).

12.3 Consider a 4-level quantizer. Suppose that values of a sequence x[n] fall within the range [0, 1] but rarely fall between Image and x4 = 1. Propose a nonuniform quantization scheme, not necessarily optimal in a least-squared-error sense, that reduces the least-squared error relative to that of a uniform quantization scheme.

12.4 Let x denote the signal sample whose pdf px(x) is given by

Image

(a) Suppose we assign only one reconstruction level to x. We denote the reconstruction level by Image. We want to minimize Image. Determine Image. How many bits are required to represent the reconstruction level?

(b) Suppose again we assign only one reconstruction level to x, but now the reconstruction level is set to unity, i.e., Image. Compute the signal-to-noise ratio (SNR) defined as

Image

i.e., the ratio of signal and quantization noise variances, where the quantization noise is given by

Image

(c) Suppose a uniform quantizer is applied to x with four reconstruction levels. Determine the reconstruction levels that minimize Image. Assign a codeword to each of the reconstruction levels.

12.5 For a random variable with a uniform pdf, show that the Max quantizer results in a set of uniform quantization levels.

12.6 In this problem, you consider the method of companding introduced in Section 12.3.4.

(a) Show that the transformation in Equation (12.10) results in the random process g[n] whose elements have a uniform pdf. Hint: First note that the transformation T gives the probability distribution of x (x denoting the value of signal sample x[n]), i.e.,

Prob{x ≤ xo} = Fx[xo]

where Fx denotes probability distribution. Then consider the probability distribution of g[n] itself.

(b) Show that logarithmic companding makes the SNR (i.e., signal-to-quantization noise ratio) resistant to signal characteristics. Specifically, show that the logarithm makes the SNR approximately independent of signal variance and dependent only upon quantization step size.

Figure 12.31 Symmetric probability density function of random variable x.

Image

12.7 Let x denote the random variable whose pdf px(x) is given in Figure 12.31.

(a) A symmetric quantizer is defined such that if it has a reconstruction level of r, then it also has a reconstruction level of −r. Given 1 bit to quantize the random variable x, determine the symmetric minimum mean-squared-error quantizer. What is the corresponding mean-squared error?

(b) It has been suggested that using a non-symmetric quantizer would result in lower mean-squared error than that of part (a). If you think that the above statement is true, then determine the non-symmetric quantizer and its corresponding mean-squared error. If it is false, then justify your answer.

12.8 Show that the expected value of the signal variance estimate in Equation (12.11) is equal to the true variance, Image, for a stationary x[n] and for an appropriate choice of the constant β. Discuss the time-frequency resolution tradeoffs in the estimate for adaptive quantization with a nonstationary signal.

12.9 Consider a signal x[n] that takes on values according to the pdf given in Figure 12.32.

(a) Derive the signal-to-noise ratio (SNR) defined as

Image

for a 2-bit quantizer where the reconstruction levels are uniformly-spaced over the interval [0, 1], as shown in Figure 12.32c. Assume the pdf of the quantization noise is uniform. Note that you should first find the range of the quantization error as a function of the quantization step size Δ.

(b) Now suppose we design a nonuniform quantizer that is illustrated in Figure 12.32d. Derive the SNR for this quantizer. Assume the pdf of the quantization error e[n] is uniform for x[n] in the interval [0,1/2], and also uniform for x[n] in the interval [1/2, 1], but within each interval a different quantization step size is applied, as shown. Hint: Use the relation:

Image

(c) Is the nonuniform quantizer of part (b) an optimal Max quantizer? Explain your reasoning.

Figure 12.32 Quantization conditions for Exercise 12.9: (a) sample function; (b) pdf of x[n]; (c) uniform quantizer; (d) nonuniform quantizer.

Image

12.10 Consider two random variables x1 and x2 with joint pdf px1,x2(x1, x2) where the separate pdf’s px1(x1) and px2(x2) are triangular, as illustrated in Figure 12.33, with range [−a, a] and with peak amplitude 1/a. In this problem, you investigate the advantage of vector quantization for correlated scalars.

(a) From Figure 12.33, we can deduce that the two scalars are linearly dependent, i.e., correlated. Show this property analytically for the variables, i.e., show that

E[x1x2] ≠ E[x1]E[x2].

(b) Consider quantizing x1 and x2 separately using scalar quantization with the Max quantizer (invoking the minimum-mean-squared-error criterion). Suppose, in particular, that we use two reconstruction levels for each scalar. Then find the optimal reconstruction levels. Sketch for the two quantized scalars, the four (2 × 2) resulting reconstruction levels in the two-dimensional space [x1, x2], corresponding to a 2-bit representation.

(c) From the joint pdf in Figure 12.33, we know two of the four reconstruction levels in part (b) are not possible; we have wasted two bits. Suppose you use only these two possible reconstruction levels in the 2-D space, requiring only a 1-bit codebook representation. Argue that the mean-squared error (MSE) in the two representations is the same. Therefore, by exploiting linear dependence, for the same MSE, we have saved 1 bit.

(d) Rotate the pdf in Figure 12.33 45° clockwise and argue that you have eliminated the linear dependence of the two new scalars y1 and y2. In particular, show that

E[y1y2] = E[y1]E[y2]

so that the advantage of VQ is removed. What might be an implication of such a decorrelating transformation for speech coding?

Figure 12.33 Joint probability density function px1,x2(x1, x2) and its marginals px1(x1) and px2(x2).

Image

12.11 Show that minimizing the subband coding distortion measure in Equation (12.14) results in the optimal bit assignment of Equation (12.15) and thus the flat quantization noise variance Image. Then derive the optimal frequency-weighted bit assignment rule of Equation (12.17).

12.12 We saw with subband coding that when the weight in the error function in Equation (12.16) across bands is inversely proportional to the variance of the subband signal, the optimal rule is to give each band the same number of bits, and this results in the same SNR for each band. Suppose that an octave band structure is invoked, e.g., quadrature mirror filter bandwidths decreasing by a factor of two with decreasing frequency. Then, as we have seen in Chapter 8, wider bands imply the need for a faster sampling rate (less decimation) to maintain the invertibility of the filter bank. Discuss the implications of this requirement of band-dependent decimation for bit allocation, given the above optimal uniform bit allocation per band.

12.13 Argue that a reasonable model of the spectral tilt used in postfiltering is a first-order all-pole function:

Image

and hence the spectrally flattened sinewave amplitude for the kth harmonic in Section 12.5.2, denoted by F(0), can be written as

Image

In the context of sinusoidal coding, determine the prediction coefficient ρ by applying linear prediction analysis to the synthetic speech waveform in Equation (12.20). Specifically, show that ρ = r1/r0, where r0 and r1 are the first two correlation coefficients of Image, and that

Image

12.14 The oldest form of speech coding, the channel vocoder, invented by Dudley [15], was illustrated in Figures 8.31 and 8.32. In this exercise, you are asked to consider the scalar quantization of channel parameter outputs determined in Exercise 8.11.

(a) Suppose each channel parameter bk spans a range of 0 to 8, and a 5-bit uniform quantizer is used. What is the mean-squared error (i.e., variance of the quantization noise) associated with the quantizer? Assume a uniform pdf for quantization noise.

(b) If 10 bits are needed for the pitch and voicing decision, what is the total bit rate in bits per second? Assume that the pitch and voicing are transmitted at the decimation rate you selected in part (b) of Exercise 8.11.

12.15 An alternative to the cubic phase interpolator used in sinewave synthesis was given in Figure 9.29 of Exercise 9.5, that shows one synthesis frame from t = 0 to t = T. The measurements θl, θl+1, Ωl, and Ωl+1 are given. In this interpolator, the frequency is assumed piecewise-constant over three equal intervals. The mid-frequency value Image is a free parameter that was selected so that the phase measurement θ(T) at the right boundary is met. In this problem, you are asked to design a sinewave phase coding scheme based on your results from parts (a) and (b) of Exercise 9.5.

(a) Suppose sinewave analysis/synthesis is performed every 5 ms. We want to determine a quantization scheme and corresponding bit rate for transmitting phase. Suppose we have transmitted up to frame l so the receiver has the values θl (assume unwrapped) and Ωl. On the (l + 1)st frame there are two parameters to be transmitted: θl+1 and Ωl+1, from which we use the results in parts (a) and (b) of Exercise 9.5 to obtain the interpolated phase on frame l + 1. Note that θl+1 is wrapped, thus falling in the interval [−π, π] and Ωl+1 falls in the interval [0, π]. Consider a uniform quantization scheme with 16 levels for each parameter. Calculate the bit rate required to transmit phase for 50 sinewaves per 5-ms frame. Discuss the feasibility of this scheme for low-rate speech coding.

(b) Comment on the effect of the phase quantization scheme of part (a) in maintaining the speech waveform shape in synthesis. That is, argue briefly either for or against the method’s capability to avoid waveform dispersion in synthesis. Suppose you choose the phase difference θl+1θl and frequency difference Ωl+1 − Ωl to quantize and code, rather than the phase and frequency themselves. Would such a quantization scheme help or hurt your ability to (1) reduce quantization noise and/or (2) maintain speech waveform shape at the receiver? A qualitative argument is sufficient.

12.16 Consider a signal x[n] consisting of two sinewaves with frequencies ω1 and ω2, phases φ1 and φ2, and amplitudes 2.0 and 1.0, respectively, i.e.,

x[n] = 2 cos(ω1n + φ1) + cos(ω2n + φ2).

Suppose that the pdf of x[n] is uniform, as given in Figure 12.34a.

(a) Suppose you quantize x[n] with a 9-bit optimal (in the mean-squared error sense of Max) uniform quantizer. What is the signal-to-noise ratio (SNR) in dB?

Figure 12.34 Illustrations for Exercise 12.16: (a) probability density function of x[n]; (b) two-band quantizer.

Image

(b) Suppose now you (subband) filter x[n] with two bandpass filters split roughly over the full-band interval [0, π], one centered at Image and the second at Image, as illustrated in Figure 12.34b. Assume H1(ω) passes only the sinewave at frequency ω1 exactly, while H2(ω) passes only the sinewave at frequency ω2 exactly. Assume the pdfs of x1[n] and x2[n] are uniform over their respective ranges. Given that you have 9 bits to allocate over the two bands (i.e., you can use only a total of 9 bits, so that B1 + B2 = 9, where B1 denotes the number of bits for the first band and B2 denotes the number of bits for the second band), design an optimal uniform quantizer for each band such that the SNR is the same in each band. In your design, specify the number of bits you allocate for each band, as well as the quantization step size. What is the SNR in each band? (Assume the quantization noise has a uniform pdf in each band. Also, the number of bits may be fractional.)

(c) To reconstruct the signal, you add the two quantized outputs. What is the SNR of the reconstructed signal? (Assume quantization errors and signals are independent.)

(d) Suppose that the absolute phase relation between two sinewave components of a signal is perceptually unimportant. (Generally, this is not true, but let’s assume so.) Suppose also that you are given a priori knowledge that only two sinewaves are present, each falling within a different subband of the system from part (b). Finally, suppose also that the magnitude and frequency of the sinewaves are varying slowly enough so that they can be sampled at 1/8 of the original input sampling rate. Modify the subband encoder of Figure 12.34b so that only the sinewave amplitude and the frequency per subband are required for encoding. If each parameter is allocated the same number of bits, given that you have 9 bits/sample at the input sampling rate, how many bits does each amplitude and frequency parameter receive? Explain qualitatively the steps involved in the design of the optimal nonuniform Max quantizer for the amplitude and frequency parameters. (Assume you have many sample functions and that you don’t know the pdfs and need to measure them.) Keep in mind that pdfs of the amplitude and frequency parameters are not generally uniform, even though we are assuming the pdfs of x1[n] and x2[n] are uniform. Hint: Use the principles of the phase vocoder.

12.17 From the frequency-domain view in Chapter 5 of linear prediction analysis, show that we can derive an all-pole fit to the SEEVOC envelope simply by replacing the autocorrelation coefficients of the short-time speech by the autocorrelation coefficients associated with the squared SEEVOC envelope.

12.18 Determine a means to recover the inverse filter A(z) from the roots of P(z) and Q(z) (i e., the LSF coefficients) in Equation (12.26).

12.19 Model-based vocoders that do not reproduce frequency-domain phase typically do not allow a time-domain minimum-mean-squared error performance measure. Give reasoning for this limitation. Then propose for such coders an alternative frequency-domain objective measure. Suppose now that a class of vocoders generates a minimum-phase reconstruction. Propose an objective performance measure as a function of the amplitudes, frequencies, and phases of sinusoidal representations of the original and coded waveforms.

12.20 Show that a zero in the vocal tract transfer function manifests itself as a “multi-pulse” close to the primary pitch impulses of the excitation when the original speech is filtered by its all-pole transfer function component.

12.21 In multi-pulse linear prediction analysis/synthesis, assume that the impulse positions nk in the excitation model [Equation (12.28)] are known and solve the minimum mean-squared error [Equation (12.27)] estimation of the unknown impulse amplitudes Ak. Hint: The solution is linear in the unknown amplitudes.

12.22 Consider estimation of the long-term predictor lag P in the long-term predictor b[n] = δ[n] − bδ[n − P] of Section 12.7.2. Argue why, at long lags, we might obtain poor estimates of the autocorrelation and, consequently, obtain a poor basis for the use of the autocorrelation method in obtaining the gain b required in the long-term predictor.

12.23 Derive the error expression in Equation (12.30) for the long-term predictor error.

12.24 (MATLAB) In this MATLAB exercise, use the speech waveform speech1_10k (at 10000 samples/s) in workspace ex12M1_mat located in companion website directory Chap_exercises/chapter12. This problem asks you to design a speech coder based on both homomorphic filtering and sinewave analysis/synthesis, and is quite open-ended, i.e., there is no one solution.

(a) Window speech1_10k with a 25-ms Hamming window and compute the real cepstrum of the short-time segment using a 1024-point FFT.

(b) Estimate the pitch (in DFT samples) using the real cepstrum. Quantize the pitch using a 7-bit uniform quantizer. Assume that the pitch of typical speakers can range between 50 Hz and 300 Hz, and map this range to DFT samples by rounding to the nearest integer.

(c) Quantize the first 28 cepstral coefficients c[n] for n = 0, 1, 2, 3, … 27. Observe that there is a very large dynamic range in the cepstral coefficients; i.e., the low-quefrency coefficients are much larger than the high-quefrency coefficients, implying that the quantizer should adapt to the cepstral number. Since the gain term c[0] swamps the cepstrum, remove this term and quantize it separately with a 5-bit quantizer. Assume that c[0] can range over twice the measured value. Then compute an average value of c[n] for n = 1, 2, 3, … 27 and assume that these coefficients can range over twice this average value. Select some reasonable quantization step size and quantize the 27 cepstral coefficients. Alternatively, you can assume the range of each c[n] is twice the given c[n]. This latter choice is more consistent with letting the quantizer adapt to the variance of the cepstral value. This ends the “transmitter” stage of the speech coder.

(d) At the receiver, form a minimum-phase reconstruction of the waveform from your quantized values. To do this, first apply the appropriate right-sided cepstral lifter to the quantized 28 cepstral coefficients and Fourier transform the result using a 1024-point FFT. The lifter multiplies c[0] by 1 and c[n] for n = 1, 2 … 27 by 2 for a minimum-phase reconstruction. These operations provide a log-magnitude and phase estimate of the vocal tract frequency response.

(e) Sample the log-magnitude and phase functions from part (d) at the harmonic frequencies. The harmonic frequencies can be generated from the quantized pitch. Note that you will need to select harmonics closest to DFT samples.

(f) Exponentiate the log-magnitude and phase harmonic samples. Then use the resulting amplitudes, frequencies, and phases to form a sinewave-based reconstruction of the speech waveform using the MATLAB function designed in Exercise 9.20. The synthesized waveform should be the same length as the original (i.e., 1000 samples). How does your reconstruction compare to the original waveform visually and aurally? You have now synthesized a minimum-phase, harmonic waveform at your “receiver.”

(g) Assuming that you code, reconstruct, and concatenate 1000-point short-time segments (as in this problem), i.e., the time decimation factor L = 1000, and that the waveform sampling rate is 10000 samples/s (in practice L = 100 for 10000 samples/s is more realistic), what is the bit rate, i.e., number of bits/s, of your speech coder? In answering this question, you need to consider the bits required in coding the pitch, gain (c[0]), and cepstral coefficients (c[n], 1 ≤ n ≤ 27).

Bibliography

[1] G. Aguilar, J. Chen, R.B. Dunn, R.J. McAulay, X. Sun, W. Wang, and R. Zopf, “An Embedded Sinusoidal Transform Codec with Measured Phases and Sampling Rate Scalability,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Proc., Istanbul, Turkey, vol. 2, pp. 1141–1144, June 2000.

[2] S. Ahmadi and A.S. Spanias, “A New Phase Model for Sinusoidal Transform Coding of Speech,” IEEE Trans. on Speech and Audio Processing, vol. 6, no. 5, pp. 495–501, Sept. 1998.

[3] L.B. Almeida and F.M. Silva, “Variable-Frequency Synthesis: An Improved Harmonic Coding Scheme,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, San Diego, CA, pp. 27.5.1–27.5.4, May 1984.

[4] D. Anderson, “Speech Analysis and Coding Using a Multi-Resolution Sinusoidal Transform,” Proc. IEEE Conf. on Acoustics, Speech, and Signal Processing, Atlanta, GA, vol. 2, pp. 1037–1040, May 1996.

[5] B.S. Atal and S.L. Hanauer, “Speech Analysis and Synthesis by Linear Prediction of the Speech Waveform,” J. Acoustical Society of America, vol. 50, pp. 637–655, 1971.

[6] B.S. Atal and J.R. Remde, “A New Model of LPC Excitation for Producing Natural-Sounding Speech at Low Bit Rates,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Paris, France, vol. 1, pp. 614–617, April 1982.

[7] N. Benvenuto et al., “The 32-kb/s ADPCM Coding Standard,” AT&T Technical J., vol. 65, pp. 12–22, Sept./Oct. 1990.

[8] J.P. Campbell, Jr., T.E. Tremain, and V.C. Welch, “The Federal Standard 1016 4800 bps CELP Voice Coder,” Digital Signal Processing, Academic Press, vol. 1, no. 3, pp. 145–155, 1991.

[9] J.H. Chen and A. Gersho, “Real-Time Vector APC Speech Coding at 4800 b/s with Adaptive Postfiltering,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Dallas, TX, vol. 4, pp. 2185–2188, May 1987.

[10] R.V. Cox, S.L. Gay, Y. Shoham, S.R. Quackenbush, N. Seshadri, and N. Jayant, “New Directions in Subband Coding,” IEEE J. Selected Areas in Communications, vol. 6, no. 2, pp. 391–409, Feb. 1988.

[11] R.E. Crochiere, S.A. Webber, and J.L. Flanagan, “Digital Coding of Speech in Sub-Bands,” Bell System Technical J., vol. 55, no. 8, pp. 1069–1085, Oct. 1976.

[12] I. Daubechies, Ten Lectures on Wavelets, SIAM, 1992.

[13] W.B. Davenport, “An Experimental Study of Speech-Wave Probability Distributions,” J. Acoustical Society of America, vol. 24, pp. 390–399, July 1952.

[14] L.D. Davisson, “Rate-Distortion Theory and Application,” Proc. IEEE, vol. 60, no. 7, pp. 800–808, July 1972.

[15] H. Dudley, R. Riesz, and S. Watkins, “A Synthetic Speaker,” J. Franklin Inst., vol. 227, no. 739, 1939.

[16] S. Dimolitsas, F.L. Corcoran, C. Ravishankar, R.S. Skaland, and A. Wong. “Evaluation of Voice Codec Performance for the INMARSAT Mini-M System,” Proc. 10th Int. Digital Satellite Conf., Brighton, England, May 1995.

[17] J.R. Deller, J.G. Proakis, and J.H.L. Hansen, Discrete-Time Processing of Speech, Macmillan Publishing Co., New York, NY, 1993.

[18] E.W. Forgy, “Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability of Classifications,” Biometrics, abstract, vol. 21, pp. 768–769, 1965.

[19] European Telecommunication Standards Institute, “European Digital Telecommunications System (Phase 2); Full Rate Speech Processing Functions (GSM 06.01),” ETSI, 1994.

[20] A. Gersho and R. Gray, Vector Quantization and Signal Compression, Kluwer Academic Publishers, Dordrecht, Holland, 1991.

[21] O. Ghitza, “Speech Analysis/Synthesis Based on Matching the Synthesized and Original Representations in the Auditory Nerve Level,” Proc. Int. Conf. Acoustics, Speech, and Signal Processing, pp. 1995–1998, Tokyo, Japan, 1986.

[22] B. Gold and C.M. Rader, “The Channel Vocoder,” IEEE Trans. on Audio and Electroacoustics, vol. AU–15, no. 4, Dec. 1967.

[23] B. Gold and N. Morgan, Speech and Audio Signal Processing, John Wiley and Sons, New York, NY, 2000.

[24] D. Griffin and J.S. Lim, “Multiband Excitation Vocoder,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–36, no. 8, pp. 1223–1235, 1988.

[25] J.C. Hardwick and J.S. Lim, “A 4800 bps Improved Multi-Band Excitation Speech Coder,” Proc. IEEE Workshop on Speech Coding for Telecommunications, Vancouver, B.C., Canada, Sept. 5–8, 1989.

[26] J.N. Holmes, “Formant Excitation Before and After Glottal Closure,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Philadelphia, PA, pp. 39–42, April 1976.

[27] J. Huang and F. Schultheiss, “Block Quantization of Correlated Gaussian Random Variables,” IEEE Trans. Communications Systems, vol. CS–11, pp. 289–296, Sept. 1963.

[28] “INMARSAT-M Voice Codec,” Proc. Thirty-Sixth Inmarsat Council Meeting, Appendix I, July 1990.

[29] J.D. Johnson, “Transform Coding of Audio Signals Using Perceptual Noise Criteria,” IEEE J. Selected Areas in Communications, vol. 6, no. 2, pp. 314–323, Feb. 1988.

[30] ITU-T Recommendation G.729, “Coding of Speech at 8 kb/s Using Conjugate-Structure Algebraic-Code-Excited Linear Prediction,” June 1995.

[31] ITU-T Recommendation G.723.1, “Dual Rate Speech Coder for Multimedia Communications Transmitting at 5.3 and 6.3 kb/s,” March 1996.

[32] “APCO/NASTD/Fed. Project 25 Vocoder Description,” Telecommunications Industry Association Specifications, 1992.

[33] N.S. Jayant and P. Noll, Digital Coding of Waveforms: Principles and Applications to Speech and Video, Prentice Hall, Englewood Cliffs, NJ, 1984.

[34] G.S. Kang and S.S. Everett, “Improvement of the Excitation Source in the Narrowband LPC Analysis,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–33, no. 2, pp. 377–386, April 1985.

[35] W.B. Kleijn, “Encoding Speech Using Prototype Waveforms,” IEEE Trans. on Speech and Audio Processing, vol. 1, no. 4, pp. 386–399, Oct. 1993.

[36] W.B. Kleijn and J. Haagen, “A Speech Coder Based on Decomposition of Characteristic Waveforms,” Proc. Int. Conf. Acoustics, Speech, and Signal Processing, Detroit, MI, pp. 508–511, May 1995.

[37] A. Kondoz, Digital Speech: Coding for Low Bit Rate Communication Systems, John Wiley & Sons, NY, 1994.

[38] P. Kroon and W.B. Kleijn, “Linear-Prediction Analysis-by-Synthesis Coding,” chapter in Speech Coding and Synthesis, W.B. Kleijn and K.K. Paliwal, eds., Elsevier, Amsterdam, the Netherlands, 1995.

[39] K.D. Kryter, “Methods for the Calculation and Use of the Articulation Index,” J. Acoustical Society of America, vol. 34, pp. 1689–1697, 1962.

[40] G. Kubin, “Nonlinear Processing of Speech,” chapter in Speech Coding and Synthesis, W.B. Kleijn and K.K. Paliwal, eds., Amsterdam, the Netherlands, Elsevier, 1995.

[41] J.S. Lim, Two-Dimensional Signal and Image Processing, Prentice Hall, Englewood Cliffs, NJ, 1990.

[42] Y. Linde, A. Buzo, and R.M. Gray, “An Algorithm for Vector Quantizer Design,” IEEE Trans. Communications, vol. COM–28, no. 1, pp. 84–95, Jan. 1980.

[43] S.P. Lloyd, “Least-Squares Quantization in PCM,” IEEE Trans. Information Theory, vol. IT–28, pp. 129–137, March 1982.

[44] J. Makhoul, S. Roucos, and H. Gish, “Vector Quantization in Speech Coding,” Proc. IEEE, vol. 73, pp. 1551–1588, Nov. 1985.

[45] J. Makhoul, R. Viswanathan, R. Schwartz, and A.W.F. Huggins, “A Mixed-Source Model for Speech Compression and Synthesis,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Tulsa, OK, pp. 163–166, 1978.

[46] J.D. Markel and A.H. Gray, Linear Prediction of Speech, Springer-Verlag, New York, NY, 1976.

[47] J.S. Marques and L.B. Almeida, “New Basis Functions for Sinusoidal Decomposition,” Proc. EUROCON, Stockholm, Sweden, 1988.

[48] J. Max, “Quantizing for Minimum Distortion,” IRE Trans. Information Theory, vol. IT–6, pp. 7–12, March 1960.

[49] R.J. McAulay and T.F. Quatieri, “Low-Rate Speech Coding Based on the Sinusoidal Model,” chapter in Advances in Speech Signal Processing, S. Furui and M.M. Sondhi, eds., Marcel Dekker, New York, NY, 1992.

[50] R.J. McAulay and T.F. Quatieri, “Sinusoidal Coding,” chapter in Speech Coding and Synthesis, W.B. Kleijn and K.K. Paliwal, eds., Amsterdam, the Netherlands, Elsevier, 1995.

[51] R.J. McAulay and T.F. Quatieri, “Speech Analysis-Synthesis Based on a Sinusoidal Representation,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–34, no. 4, pp. 744–754, 1986.

[52] R.J. McAulay, T.M. Parks, T.F. Quatieri, and M. Sabin, “Sinewave Amplitude Coding at Low Data Rates,” Proc. IEEE Workshop on Speech Coding, Vancouver, B.C., Canada, 1989.

[53] R.J. McAulay, T.F. Quatieri, and T.G. Champion, “Sinewave Amplitude Coding Using High-Order All-Pole Models,” Proc. EUSIPCO–94, Edinburgh, Scotland, U.K., pp. 395–398, Sept. 1994.

[54] R.J. McAulay and T.F. Quatieri, “Multirate Sinusoidal Transform Coding at Rates from 2.4 kb/s to 8 kb/s,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Dallas, TX, vol. 3, pp. 1645–1648, May 1987.

[55] A.V. McCree and T.P. Barnwell, “A Mixed Excitation LPC Vocoder Model for Low Bit Rate Speech Coding,” IEEE Trans. on Speech and Audio Processing, vol. 3, no. 4, pp. 242–250, July 1995.

[56] A.V. McCree, K. Truong, E.B. George, T.P. Barnwell, and V. Viswanathan, “A 2.4 kbit/s MELP Coder Candidate for the New US Federal Standard,” Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Atlanta, GA, vol. 1, pp. 200–203, May 1996.

[57] A.V. McCree, “A 4.8 kbit/s MELP Coder Candidate with Phase Alignment,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Istanbul, Turkey, vol. 3, pp. 1379–1382, June 2000.

[58] E. McLarnon, “A Method for Reducing the Frame Rate of a Channel Vocoder by Using Frame Interpolation,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Washington, D.C., pp. 458–461, 1978.

[59] M. Nishiguchi, J. Matsumoto, R. Wakatsuki, and S. Ono, “Vector Quantized MBE with Simplified V/UV Division at 3.0 kb/s,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Minneapolis, MN, vol. 2, pp. 151–154, April 1993.

[60] P. Noll, “A Comparative Study of Various Schemes for Speech Encoding,” Bell System Tech. J., vol. 54, no. 9, pp. 1597–1614, Nov. 1975.

[61] P. Noll, “Adaptive Quantizing in Speech Coding Systems,” Proc. 1974 Zurich Seminar on Digital Communications, Zurich, Switzerland, March 1974.

[62] A.V. Oppenheim and R.W. Schafer, Discrete-Time Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1989.

[63] A. Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw-Hill, New York, NY, 1965.

[64] D. Paul, “The Spectral Envelope Estimation Vocoder,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–29, no. 4, pp. 786–794, Aug. 1981.

[65] J.G. Proakis, Digital Communications, McGraw-Hill, New York, NY, 1983.

[66] M.D. Paez and T.H. Glisson, “Minimum Mean Squared-Error Quantization in Speech,” IEEE Trans. Communications, vol. COM–20, pp. 225–230, April 1972.

[67] A. Potamianos and P. Maragos, “Applications of Speech Processing Using an AM-FM Modulation Model and Energy Operators,” chapter in Signal Processing VII: Theories and Applications, M. Holt, C.F. Cowan, P.M. Grant, and W.A. Sandham, eds., vol. 3, pp. 1669–1672, Elsevier, Amsterdam, the Netherlands, Sept. 1994.

[68] T.F. Quatieri and R.J. McAulay, “Phase Coherence in Speech Reconstruction for Enhancement and Coding Applications,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 1, pp. 207–209, May 1989.

[69] T.F. Quatieri and E.M. Hofstetter, “Short-Time Signal Representation by Nonlinear Difference Equations,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Albuquerque, NM, vol. 3, pp. 1551–1554, April 1990.

[70] S.R. Quackenbush, T.P. Barnwell, and M.A. Clements, Objective Measures of Speech Quality, Prentice Hall, Englewood Cliffs, NJ, 1988.

[71] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, NJ, 1978.

[72] L.G. Roberts, “Picture Coding Using Pseudo-Random Noise,” IRE Trans. Information Theory, vol. 8, pp. 145–154, Feb. 1962.

[73] R.C. Rose, The Design and Performance of an Analysis-by-Synthesis Class of Predictive Speech Coders, Ph.D. Thesis, Georgia Institute of Technology, School of Electrical Engineering, April 1988.

[74] M.J. Sabin, “DPCM Coding of Spectral Amplitudes without Positive Slope Overload,” IEEE Trans. Signal Processing, vol. 39, no. 3, pp. 756–758, 1991.

[75] R. Salami, C. Laflamme, J.P. Adoul, A. Kataoka, S. Hayashi, T. Moriya, C. Lamblin, D. Massaloux, S. Proust, P. Kroon, and Y. Shoham, “Design and Description of CS-ACELP: A Toll Quality 8 kb/s Speech Coder,” IEEE Trans. on Speech and Audio Processing, vol. 6, no. 2, pp. 116–130, March 1998.

[76] M.R. Schroeder and B.S. Atal, “Code-Excited Linear Prediction (CELP): High-Quality Speech at Very Low Bit Rates,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 3, pp. 937–940, April 1985.

[77] M.J.T. Smith and T.P. Barnwell, “Exact Reconstruction Techniques for Tree-Structured Subband Coders,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–34, no. 3, pp. 434–441, June 1986.

[78] F.K. Soong and B.H. Juang, “Line Spectral Pair and Speech Compression,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, San Diego, CA, vol. 1, pp. 1.10.1–1.10.4, 1984.

[79] N. Sugamura and F. Itakura, “Speech Data Compression by LSP Analysis/Synthesis Technique,” Trans. of the Institute of Electronics, Information, and Computer Engineers, vol. J64-A, pp. 599–606, 1981.

[80] B. Smith, “Instantaneous Companding of Quantized Signals,” Bell System Tech. J., vol. 36, no. 3, pp. 653–709, May 1957.

[81] J.M. Tribolet and R.E. Crochiere, “Frequency Domain Coding of Speech,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–27, no. 5, pp. 512–530, Oct. 1979.

[82] D.Y. Wong, B.H. Juang, and A.H. Gray, Jr., “An 800 bit/s Vector Quantization LPC Vocoder,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–30, no. 5, pp. 770–779, Oct. 1982.

[83] S. Yeldener, A.M. Kondoz, and B.G. Evans, “High-Quality Multi-Band LPC Coding of Speech at 2.4 kb/s,” IEE Electronics Letters, vol. 27, no. 14, pp. 1287–1289, July 1991.
