Chapter 6
Homomorphic Signal Processing

6.1 Introduction

Signals that are added together and have disjoint spectral content can be separated by linear filtering. Often, however, signals are not additively combined. In particular, the source and system in the linear speech model are convolutionally combined and, consequently, these components cannot be separated by linear filtering. The speech signal itself may also be convolved with a system response such as when distorted by the impulse response of a transmission channel or by a flawed recording device. In addition, the speech signal may be multiplied by another signal as occurs, for example, with a time-varying fading channel or with an unwanted expansion of its dynamic range. In these cases, it is desired to separate the nonlinearly combined signals to extract the speech signal or its source and system components.

The linear prediction analysis methods of the previous chapter can be viewed as a process of deconvolution where the convolutionally combined source and system speech production components are separated. Linear prediction analysis first extracts the system component by inverse filtering then extracts the source component. This chapter describes an alternative means of deconvolution of the source and system components referred to as homomorphic filtering. In this approach, convolutionally combined signals are mapped to additively combined signals on which linear filtering is applied for signal separation. Unlike linear prediction analysis, which is a “parametric” (all-pole) approach to deconvolution, homomorphic filtering is “nonparametric” in that a specific model need not be imposed on the system transfer function in analysis.

We begin this chapter in Section 6.2 with the principles of homomorphic systems which form the basis for homomorphic filtering. Homomorphic systems for convolution are one of a number of homomorphic systems that map signals nonlinearly combined to signals combined by addition on which linear filtering can be applied for signal separation. As illustrated above, signals may also be combined by other nonlinear operations such as multiplication. Because our main focus in this chapter, however, is speech source and system deconvolution, Section 6.3 develops in detail homomorphic systems for convolution and, in particular, homomorphic systems that map convolution to addition through a logarithm operator applied to the Fourier transform of a sequence. Section 6.4 then analyzes the output of this homomorphic system for input sequences with rational z-transforms and for short-time impulse trains, the convolution of the two serving as a model for a voiced speech segment. Section 6.5 shows that homomorphic systems for convolution need not be based on the logarithm by introducing the spectral root homomorphic system, which relies on raising the Fourier transform of a sequence to a power. For some sequences, homomorphic root analysis can be of advantage over the use of the logarithm for signal separation. As a precursor to the analysis of real speech, in Section 6.6 we then look at the response of homomorphic systems to a windowed periodic waveform and its implications for homomorphic deconvolution. In particular, we address the difference between a windowed periodic waveform and an exact convolutional model. We will see that homomorphic analysis of windowed periodic waveforms benefits from numerous conditions on the analysis window and its location within a glottal cycle in the deconvolution of a mixed-phase system response (i.e., with both minimum- and maximum-phase components). 
Similar conditions on window duration and alignment for accurate system phase estimation appear in a variety of speech analysis/synthesis systems throughout the text, such as the phase vocoder and sinusoidal analysis/synthesis.

In the remainder of the chapter, we investigate the application of homomorphic systems to speech analysis and synthesis. The properties of homomorphic filtering for voiced and unvoiced speech are described first in Section 6.7. An important consideration in these systems is the phase one attaches to the speech transfer function estimate: zero, minimum, or mixed phase. We will see that the mixed-phase estimate requires a different and more complex method of analysis, alluded to above, from a minimum- or maximum-phase estimate when dealing with windowed periodic waveforms, and we explore the perceptual consequences of these different phase functions for synthesis in Section 6.8. Unlike linear prediction, homomorphic analysis allows for a mixed-phase estimate. This is one of a number of comparisons made of the two systems that leads in Section 6.9 to a homomorphic filtering scheme that serves as a preprocessor to linear prediction. This method, referred to as “homomorphic prediction,” can remove the waveform periodicity that renders linear prediction problematic for high-pitched speakers. Finally, a number of the exercises at the end of the chapter explore some applications of homomorphic filtering that fall outside the chapter’s main theme. This includes the restoration of old acoustic recordings and dynamic range compression for signal enhancement. The homomorphic filtering approach to these problems will be contrasted to alternative methods throughout the text.

6.2 Concept

An essential property of linear systems is that of superposition whereby the output of the system to an input of two additively combined sequences, x[n] = x1[n] + x2[n], is the sum of the individual outputs; in addition, a scaled input results in a scaled output. The superposition property can be expressed explicitly as

(6.1)

Image

where L represents a linear operator and α a scaling factor. A consequence of superposition is the capability of linear systems to separate, i.e., filter, signals that fall in disjoint frequency bands.

Example 6.1       Figure 6.1 shows the Fourier transform magnitude of a sequence consisting of two additive components that fall in nonoverlapping frequency bands; i.e., x[n] = x1[n] + x2[n] where X1(ω) and X2(ω) reside in the frequency bands Image and Image, respectively. Application of the highpass filter H(ω) separates out x2[n] and can be expressed as

Image

where h[n] is the inverse Fourier transform of H(ω). Image
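The separation in Example 6.1 can be sketched numerically. The following is a minimal illustration, assuming NumPy and an ideal highpass filter applied in the DFT domain; the two components are placed at exact DFT bins (a hypothetical choice) so the separation is essentially perfect:

```python
import numpy as np

# Two sinusoids in disjoint bands, placed at exact DFT bins (12 and 100 of 256)
n = np.arange(256)
x1 = np.cos(2 * np.pi * 12 * n / 256)    # low-frequency component
x2 = np.cos(2 * np.pi * 100 * n / 256)   # high-frequency component
x = x1 + x2

# Ideal highpass filter H(w) applied in the DFT domain
X = np.fft.fft(x)
f = np.fft.fftfreq(256)                  # normalized frequency in [-0.5, 0.5)
H = (np.abs(f) > 0.2).astype(float)      # passband |f| > 0.2
y = np.fft.ifft(H * X).real              # estimate of x2[n]
```

Because the bands are disjoint and the filter is ideal, y recovers x2[n] to within numerical precision.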

To allow the separation of signals that are nonlinearly combined, Oppenheim [9] introduced the concept of generalized superposition, which leads to the notion of generalized linear filtering. In formulating the generalized principle of superposition, consider two signals x1[n] and x2[n] that are combined by some rule which we denote by Image, i.e.,

(6.2)

Image

and consider a transformation on x[n] denoted by Image. In addition, we define a generalized multiplicative operator “:”. In generalizing the notion of superposition, we require Image to have the following two properties:

Figure 6.1 Signal with disjoint low- and high-frequency spectra X1(ω) and X2(ω).

Image

Figure 6.2 Homomorphic system Image

Image

(6.3)

Image

These can be viewed as an analogy to the linear superposition properties of Equation (6.1), where addition and scalar multiplication have been generalized. An even wider class of systems characterized by generalized superposition is given by the two properties

(6.4)

Image

and is illustrated in Figure 6.2. Systems that satisfy these two properties are referred to as homomorphic systems and are said to satisfy a generalized principle of superposition [9].1

1 This notation and terminology stem from the study of vector spaces, which is a framework for abstract linear algebra [9]. In this framework, a sequence is considered a vector in a vector space. A vector space is characterized by vector addition, denoted by Image, which is a rule for combining vectors in a vector space, and by scalar multiplication, denoted by :, which is a rule for combining vectors with scalars in a vector space. A linear transformation on a vector space, denoted by Image, maps the input vector space to an output vector space. The transformation is homomorphic if it satisfies the generalized principle of superposition of Equation 6.4 where Δ and Image define the output vector space.

Part of the practical importance of homomorphic systems for speech processing lies in their capability of transforming nonlinearly combined signals to additively combined signals so that linear filtering can be performed. This capability stems from the fact that homomorphic systems can be expressed as a cascade of three homomorphic sub-systems, which is referred to as the canonic representation of a homomorphic system [9]. The canonic representation of a homomorphic system Image is illustrated in Figure 6.3. The signals combined by the operation Image are transformed by the sub-system D Image to signals that are additively combined. Linear filtering is performed by the linear system L (mapping addition to addition), and the desired signal is then obtained by the inverse operation Image, which maps addition to the operation Δ.

Figure 6.3 Canonical formulation of a homomorphic system.

SOURCE: A.V. Oppenheim and R.W. Schafer, Discrete-Time Signal Processing [13]. ©1989, Pearson Education, Inc. Used by permission.

Image

Example 6.2       Consider two sequences x1[n] and x2[n] of low-frequency and high-frequency content that are nonlinearly combined by the operation Image. As in Example 6.1, the objective is to separate the high-frequency signal x2[n]. Applying the homomorphic system Image and using the canonical representation, we have

Image

that are linearly combined. If the operation Image is such that Image and Image have disjoint spectra, then the highpass component can be separated. If L denotes the highpass filter, then the output of L is given by

Image

and, therefore, when the operation Δ = Image

Image

thus extracting the high-frequency component. Image

Example 6.2 illustrates the use of homomorphic systems in performing generalized linear filtering. The system of Example 6.2 is considered a homomorphic filter having the property that the desired component passes through the system unaltered while the undesired component is removed. Two typical nonlinear operators are convolution and multiplication. As illustrated in the introduction, many problems arise where signals are so combined, such as in speech deconvolution, waveform distortion, and dynamic range compression. Since our primary focus is speech deconvolution, the next section looks in detail at homomorphic systems for convolution [13].
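For the multiplicative case just mentioned, a homomorphic filter can be sketched with the characteristic system chosen as the logarithm and its inverse as the exponential. The construction below is our own hypothetical illustration (not from the text): both factors are kept positive so the real logarithm suffices, and an ideal lowpass "filter" retains the slowly varying factor:

```python
import numpy as np

# x[n] = e[n] * s[n] (pointwise product): a slow positive envelope times a
# fast positive component; both factors are hypothetical illustrations
n = np.arange(1024)
e = 2.0 + np.cos(2 * np.pi * 2 * n / 1024)      # slow factor (2 cycles)
s = 1.5 + np.cos(2 * np.pi * 100 * n / 1024)    # fast factor (100 cycles)
x = e * s

# D = log maps the product to a sum; an ideal lowpass operation keeps the
# slow part; exp (the inverse characteristic system) maps back
L = np.fft.fft(np.log(x))
keep = np.zeros(1024)
keep[:50] = 1.0
keep[-49:] = 1.0                                # symmetric lowpass region
e_est = np.exp(np.fft.ifft(L * keep).real)

# e_est is proportional to e: the retained zero bin also holds the mean of
# log s, which appears as a constant gain
ratio = e_est / e
```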

6.3 Homomorphic Systems for Convolution

In homomorphic systems for convolution, the operation Image is convolution, i.e., Image = * and the resulting homomorphic system Image maps convolution to addition and the inverse system Image maps addition to convolution. This class of homomorphic systems is useful in speech analysis [11],[12] as demonstrated in the following example:

Example 6.3       Consider a sequence x[n] consisting of a system impulse response h[n] convolved with an impulse train Image (with P the impulse spacing), i.e., x[n] = h[n] * p[n]. The goal is to estimate the response h[n]. Applying the canonical representation for convolution, we have

Image

that contains additively combined sequences. Suppose that D* is such that Image remains a train of impulses with spacing P, and suppose that Image falls between impulses. Then, if L denotes the operation of multiplying by a rectangular window (for extracting Image), we have

Image

and, therefore,

Image

thus separating the impulse response. Image

An approach for finding the components of the canonical representation and, in particular, the elements D* and Image, is to note that if x[n] = x1[n] * x2[n], then the z-transform of x[n] is given by X(z) = X1(z)X2(z). Because we want the property that convolution maps to addition, i.e., D*(x1[n] * x2[n]) = D*(x1[n]) + D*(x2[n]), this motivates the use of the logarithm in the operators, i.e., D*[x] = log(Z[x]) and Image, where Z denotes the z-transform. However, if we want to represent sequences in the time domain, rather than in the z domain, then it is desirable to have the operations D* = Z^{−1}[log(Z)] and Image. The canonical system with the forward and inverse operators is summarized in Figure 6.4, showing that our selection of D* and Image gives the desired properties of mapping convolution to addition and addition back to convolution, respectively. However, in this construction of D* we have overlooked the definition of the logarithm of a complex z-transform, which we refer to henceforth as the “complex logarithm.” Because the complex logarithm is key to the canonical system, the existence of D* relies on the validity of log[X1(z)X2(z)] = log[X1(z)] + log[X2(z)], and this will depend on how we define the complex logarithm.2 For the trivial case of real and positive z-transforms, the logarithm, sometimes referred to as the “real logarithm,” of the product is the sum of the logarithms. Generally, however, this property is more difficult to obtain, as illustrated in the following example:

2 There is no such problem with the inverse exponential operator since e^{a+b} = e^a e^b. Thus, along with the forward and inverse z-transforms, addition is unambiguously mapped back to convolution.

Figure 6.4 Homomorphic system for convolution: (a) canonical formulation; (b) the subsystem D*; and (c) its inverse.

SOURCE: A.V. Oppenheim and R.W. Schafer, Discrete-Time Signal Processing [13]. ©1989, Pearson Education, Inc. Used by permission.

Image

Example 6.4       Consider two real and positive values a and b. Then log(ab) = log(a) + log(b). On the other hand, if b < 0, then log(ab) = log(a|b|e^{jkπ}) = log(a) + log(|b|) + jkπ, where k is an odd integer. Thus, the definition of log(ab) in this case is ambiguous. Image

Example 6.4 indicates that special consideration must be given to the definition of the logarithm operator for complex X(z) in order to make the logarithm of a product the sum of the logarithms [13]. Suppose for simplicity that we evaluate X(z) on the unit circle (z = e^{jω}), i.e., we evaluate the Fourier transform.3 Then we consider the real and imaginary parts of the complex logarithm by writing the logarithm in polar form as

3We assume the sequence x[n] is stable and thus that the region of convergence of X(z) includes the unit circle.

(6.5)

Image

Then, if X(ω) = X1(ω)X2(ω), we want the real and imaginary parts of the complex logarithm of the product to equal the sums of the respective real and imaginary parts of the individual logarithms. The real part is the logarithm of the magnitude and, for the product X1(ω)X2(ω), is given by

(6.6)

Image

provided that |X1(ω)| > 0 and |X2(ω)| > 0, which is satisfied when zeros and poles of X(z) do not fall on the unit circle. In this case, there is no problem with the uniqueness and “additivity” of the logarithms. The imaginary part of the logarithm is the phase of the Fourier transform and requires more careful consideration. As with the real part, we want the imaginary parts to add

(6.7)

Image

The relation in Equation (6.7), however, generally does not hold due to the ambiguity in the definition of phase, i.e., ∠X(ω) = PV[∠X(ω)] + 2πk, where k is any integer value, and where PV denotes the principal value of the phase, which falls in the interval [−π, π]. Since an arbitrary multiple of 2π can be added to the principal phase values of X1(ω) and X2(ω), the additivity property generally does not hold. One approach to obtain uniqueness is to force continuity within the definition of phase, i.e., select the integer k such that the function ∠X(ω) = PV[∠X(ω)] + 2πk is continuous (Figure 6.5). Continuity ensures not only uniqueness, but also guarantees that the phase of the product X1(ω)X2(ω) is the sum of the individual phase functions, i.e., that Equation (6.7) is satisfied (Exercise 6.1).

Figure 6.5 Fourier transform phase continuity: (a) typical continuous phase function; (b) its principal value.

SOURCE: A.V. Oppenheim and R.W. Schafer, Discrete-Time Signal Processing [13]. ©1989, Pearson Education, Inc. Used by permission.

Image
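The additivity of the continuous (unwrapped) phase can be checked numerically. This sketch assumes NumPy, whose np.unwrap implements the continuity convention described above, and uses two sequences whose zeros lie strictly inside the unit circle:

```python
import numpy as np

# Two sequences with zeros strictly inside the unit circle (phase is 0 at w = 0)
x1 = np.array([1.0, -0.9])      # X1(z) = 1 - 0.9 z^{-1}
x2 = np.array([1.0, 0.8])       # X2(z) = 1 + 0.8 z^{-1}
x12 = np.convolve(x1, x2)       # sequence with z-transform X1(z) X2(z)

N = 4096                        # dense grid so unwrapping is unambiguous
phase = lambda s: np.unwrap(np.angle(np.fft.fft(s, N)))

# continuity makes the phase of the product the sum of the phases
err = np.max(np.abs(phase(x12) - (phase(x1) + phase(x2))))
```

The same check with principal values in place of unwrapped phases fails whenever the summed phase leaves [−π, π].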

An alternative approach to resolving the ambiguity in the definition of phase is through the phase derivative,4 with which we define the phase as

4 For a rational z-transform, the phase derivative is a measure of the rate of change of the continuous angle accumulated over all poles and zeros.

Image

where the derivative of the phase with respect to Image, is uniquely defined through the real and imaginary parts of X(ω), Xr(ω) and Xi(ω), respectively, and is shown in Exercise 6.2 to be given by

(6.8)

Image

With this expression for Image, we can show (given that |X(ω)| ≠ 0) that ∠X(ω) is unique and that the additivity property of Equation (6.7) is satisfied (Exercise 6.1). The phase derivative eliminates any constant level in the phase, but this is simply π or −π, which is sign information.5 The two means of removing phase ambiguity are also useful in motivating phase unwrapping algorithms, which are described in Section 6.4.4.

5 Sign information, however, is important in synthesis if the sign varies over successive analysis frames.
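Equation (6.8) can be evaluated directly from DFT samples, using the fact that the derivative of X(ω) with respect to ω is the Fourier transform of −jn x[n]. A sketch assuming NumPy, cross-checked against the numerical slope of the unwrapped phase:

```python
import numpy as np

x = np.array([1.0, -1.4, 0.72])    # zeros at 0.7 +/- j0.48, inside the unit circle
N = 8192
n = np.arange(len(x))

X = np.fft.fft(x, N)
dX = np.fft.fft(-1j * n * x, N)    # dX/dw is the transform of -j n x[n]

# Equation (6.8): phase derivative from the real and imaginary parts of X(w)
dtheta = (X.real * dX.imag - X.imag * dX.real) / np.abs(X) ** 2

# cross-check: finite-difference slope of the unwrapped phase on the same grid
ref = np.gradient(np.unwrap(np.angle(X)), 2 * np.pi / N)
err = np.max(np.abs(dtheta - ref))
```

The residual err is dominated by the finite-difference approximation, not by Equation (6.8), which is exact on the grid.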

Observe that since x[n] is real, |X(ω)| and thus log(|X(ω)|) are real and even. Likewise, since x[n] is real, ∠X(ω) is odd and thus j∠X(ω) is imaginary and odd. Hence, the inverse Fourier transform of log[X(ω)] = log[|X(ω)|] + j∠X(ω) is a real sequence and is expressed as

Image

The sequence Image is referred to as the complex cepstrum. The even component of the complex cepstrum, denoted by c[n], is given by Image and is referred to as the real cepstrum because it is the inverse Fourier transform of the real logarithm, which is the real part of the complex logarithm, i.e., log(|X(ω)|) = Re{log[X(ω)]}. The primary difference between the complex and real cepstrum is that in the latter the phase has been discarded. Discarding the phase is useful, as we will see, when dealing with minimum-phase sequences or when the phase is difficult to compute. Observe that applying an inverse Fourier transform to a log-spectrum makes the real and complex cepstra functions of the time index n. This time index is sometimes referred to as “quefrency”; the motivation for this nomenclature will become clear shortly.
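In a NumPy-style sketch (the function names are our own), the complex and real cepstra, and the even-part relation between them, look as follows:

```python
import numpy as np

def complex_cepstrum(x, N=1024):
    """Complex cepstrum: inverse DFT of log|X| + j*(unwrapped phase)."""
    X = np.fft.fft(x, N)
    log_X = np.log(np.abs(X)) + 1j * np.unwrap(np.angle(X))
    return np.fft.ifft(log_X).real

def real_cepstrum(x, N=1024):
    """Real cepstrum: inverse DFT of the log magnitude (phase discarded)."""
    return np.fft.ifft(np.log(np.abs(np.fft.fft(x, N)))).real

x = np.array([1.0, 0.5])               # X(z) = 1 + 0.5 z^{-1}, minimum phase
xhat = complex_cepstrum(x)
c = real_cepstrum(x)

# c[n] equals the even part of xhat[n] (indices taken modulo the DFT length)
even = 0.5 * (xhat + np.roll(xhat[::-1], 1))
```

For this one-zero example the series expansion of log(1 + 0.5z^{-1}) gives xhat[1] = 0.5, which the sketch reproduces up to DFT aliasing.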

6.4 Complex Cepstrum of Speech-Like Sequences

Following the development of Oppenheim and Schafer [13], we investigate the complex cepstrum of two classes of sequences in preparation for deconvolving real speech signals: sequences with rational z-transforms and periodic impulse trains. Homomorphic filtering is introduced for separating sequences that convolutionally combine these two signal classes, and is given an alternative interpretation as a spectral smoothing process.

6.4.1 Sequences with Rational z-Transforms

Consider the class of sequences with rational z-transforms of the form

(6.9)

Image

where (1 − a_k z^{−1}) and (1 − c_k z^{−1}) are zeros and poles inside the unit circle and (1 − b_k z) and (1 − d_k z) are zeros and poles outside the unit circle, with |a_k|, |b_k|, |c_k|, |d_k| < 1 so that there are no zeros or poles on the unit circle. The term z^r represents a delay of the sequence with respect to the time origin, which we assume for the moment can be estimated and removed.6 The factor A is assumed positive; a negative A introduces a sign change which can be thought of as an additive factor of π in the phase of X(z) since −1 = e^{jπ}. Taking the complex logarithm then gives

6 In speech modeling, the delay often represents a shift of the vocal tract impulse response with respect to the time origin.

Image

Consider Image as a z-transform of a sequence Image. We want the inverse z-transform to be a stable sequence, i.e., absolutely summable, so that the region of convergence (ROC) for Image must include the unit circle (|z| = 1). This is equivalent to the condition that the Laurent power series Image is analytic on the unit circle. This condition implies that all components of Image, i.e., of the form log(1 −αz−1) and log(1 − βz) with |α|, |β| < 1, must represent z-transforms of sequences whose ROC includes the unit circle. With this property in mind, we write the following power series expansions for two generic terms:

(6.10)

Image

The ROC of the two series is illustrated in Figure 6.6a,b, the first converging for |z| > |α| and the second for |z| < |β−1|. The ROC of Image is therefore given by an annulus which borders on radii corresponding to the poles and zeros of X(z) closest to the unit circle in the z-plane and which includes the unit circle as shown in Figure 6.6c. From our z-transform properties reviewed in Chapter 2, the first z-transform corresponds to a right-sided sequence while the second corresponds to a left-sided sequence. The complex cepstrum associated with a rational X(z) can therefore be expressed as

(6.11)

Image

where u[n] is the unit step function. Therefore, the zeros and poles inside the unit circle contribute to the right side of the complex cepstrum, while the zeros and poles outside the unit circle contribute to the left side of the complex cepstrum; the value of Image at the origin is due to the gain term A (Figure 6.7). We see then that the complex cepstrum is generally two-sided and for positive or negative time is a sum of decaying exponentials that are scaled by Image.
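Equation (6.11) can be verified numerically for a simple mixed-phase example by building X(ω) directly on a frequency grid, which sidesteps any time-shift term; a sketch assuming NumPy:

```python
import numpy as np

a, b, c = 0.6, 0.5, 0.8     # zero inside, zero outside (1 - b z), pole inside
N = 2048
w = 2 * np.pi * np.arange(N) / N

# Build X(w) on the grid directly, so no delay term z^r enters
X = (1 - a * np.exp(-1j * w)) * (1 - b * np.exp(1j * w)) / (1 - c * np.exp(-1j * w))
xhat = np.fft.ifft(np.log(np.abs(X)) + 1j * np.unwrap(np.angle(X))).real

# Equation (6.11): poles/zeros inside give the right side, outside the left
n = np.arange(1, 20)
right = (c ** n - a ** n) / n            # n > 0
left = -(b ** n) / n                     # n < 0, listed for n = -1, -2, ...
err_right = np.max(np.abs(xhat[1:20] - right))
err_left = np.max(np.abs(xhat[-1:-20:-1] - left))
```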

Figure 6.6 Region of convergence (ROC) for (a) log(1 − αz^{−1}) and (b) log(1 − βz), while (c) shows the annular ROC for a typical rational X(z). For all cases, the ROC includes the unit circle; the unit circle is shown as a dashed line.

Image

It was noted earlier that the linear phase term zr is removed prior to determining the complex cepstrum. The following example illustrates the importance of removing this term:

Figure 6.7 Schematized example illustrating right- and left-side contributions to the complex cepstrum.

Image

Example 6.5       Consider the z-transform

(6.12)

Image

where a, b, and c are real and less than unity. The ROC of X(z) contains the unit circle so that x[n] is stable. A delay term z^r corresponds to a shift in the sequence, for example, a shift in the vocal tract impulse response relative to the time origin. The complex cepstrum is given by

Image

where the inverse z-transform, denoted by Z−1, of the shift term is given by [16]

Image

The inverse z-transform of the linear phase term is a Image (sinc) function which can swamp the complex cepstrum. On the unit circle, z^r = e^{jωr} contributes a linear ramp to the phase and thus, for a large shift r, dominates the phase representation and gives a large discontinuity at π and −π. By tracing the vector representation of the pole and zero components of X(z), one sees (Exercise 6.3) that the phase has zero value at ω = 0 and continuously evolves to a zero value at ω = π; furthermore, each phase component must lie within the bounds |θ(ω)| < π (Exercise 6.3). The phase of X(z) is given by the sum of these three phase components with the linear term ωr, which, for Image, will dominate the sum of the three pole-zero phase components, i.e.,

Image

for Image, where θpz(ω) denotes the sum of the nonzero pole-zero phase contributions. To illustrate these properties, the unwrapped phase of the pole component of X(z) is shown in Figure 6.8, along with an unwrapped linear phase contribution, and the sum of the phase from the pole and the linear phase. Image

Figure 6.8 Illustration of phase contributions of Example 6.5: (a) unwrapped phase of the linear phase contribution; (b) unwrapped phase of the pole component; (c) sum of (a) and (b). The value of r in z^r is negative (−2) and the pole occurs on the real z axis (0 Hz).

Image

If X(z) has no poles or zeros outside the unit circle, i.e., b_k = d_k = 0 so that x[n] is a minimum-phase sequence, then Image is right-sided Image, and if X(z) has no poles or zeros inside the unit circle, i.e., a_k = c_k = 0 so that x[n] is a maximum-phase sequence, then Image is left-sided Image. An implication is that if x[n] has a rational z-transform and is minimum-phase, then the complex cepstrum can be derived from log(|X(ω)|) (and thus from the real cepstrum) or from the phase ∠X(ω) (to within a scale factor). To show the former, recall that the even part of the complex cepstrum equals the real cepstrum and is given by

Image

whose z-transform is given by log(|X(ω)|). Therefore, because the complex cepstrum of a minimum-phase sequence with a rational z-transform is right-sided

(6.13)

Image

where

Image

A similar argument shows that Image can be determined from the magnitude of X(ω) for a maximum-phase sequence, and that the phase is sufficient to recover Image for minimum- or maximum-phase sequences to within a scale factor (Exercise 6.4).
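Equation (6.13) translates directly into a procedure for computing the complex cepstrum of a minimum-phase sequence from its log magnitude alone, avoiding phase unwrapping entirely; a sketch assuming NumPy, with the function name our own:

```python
import numpy as np

def cepstrum_from_magnitude(x, N=2048):
    """Eq. (6.13): complex cepstrum of a minimum-phase x[n] from log|X(w)| only."""
    c = np.fft.ifft(np.log(np.abs(np.fft.fft(x, N)))).real   # real cepstrum
    l = np.zeros(N)
    l[0] = 1.0
    l[1:N // 2] = 2.0                                        # right-sided weighting
    return c * l

x = np.array([1.0, -0.5, 0.06])     # zeros at 0.2 and 0.3: minimum phase
xhat = cepstrum_from_magnitude(x)

# reference: complex cepstrum computed with the unwrapped phase
X = np.fft.fft(x, 2048)
ref = np.fft.ifft(np.log(np.abs(X)) + 1j * np.unwrap(np.angle(X))).real
err = np.max(np.abs(xhat - ref))
```

The two results agree up to DFT aliasing, which here is negligible because the true cepstrum decays as 0.3^n/n.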

6.4.2 Impulse Trains Convolved with Rational z-Transform Sequences

The second class of sequences of interest in the speech context is a train of uniformly spaced unit samples with varying weights

Image

whose z-transform can be expressed as a polynomial in z^{−N} as

Image

P(z) can thus be expressed as a product of factors of the familiar form (1 − a_k µ^{−1}), where µ = z^N. Therefore, if p[n] is minimum-phase, assuming |a_k µ^{−1}| < 1 on the unit circle and using Equation (6.10), we can express log[P(z)] as

Image

and so the resulting complex cepstrum Image is an infinite right-sided sequence of unit samples spaced N samples apart. More generally, for non-minimum-phase sequences of this kind, the complex cepstrum is two-sided with uniformly spaced impulses.

We next look at a specific example of a synthetic speech waveform derived by convolving a periodic impulse train with a sequence with a rational z-transform.

Example 6.6       Consider a sequence x[n] = h[n] * p[n] where the z-transform of h[n] is given by

Image

where b, b* and c, c* are complex conjugate pairs, all with magnitude less than unity so that the zero pair is outside the unit circle and the pole pair is inside the unit circle. p[n] is a periodic impulse train windowed by a decaying exponential:

Image

thus having z-transform

Image

where β is selected so that p[n] is minimum-phase. The complex cepstrum Image is illustrated in Figure 6.9 (for b = 0.99e^{j0.12π} and c = −1.01e^{j0.12π}), showing that the two components Image and Image are approximately separated along the n axis. (The analytic expressions are left as an exercise.) Image

An important observation within this example is that the complex cepstrum allows for the possibility of separating or deconvolving the source and filter, which we investigate in the next section.

6.4.3 Homomorphic Filtering

We saw in the previous section that the complex cepstrum of speech-like sequences consists of the sum of a low-quefrency component (the term “quefrency” was introduced at the end of Section 6.3) due to the system response and a high-quefrency component due to the pulse train source. When the complex cepstrum of h[n] resides in a quefrency interval less than a pitch period, then the two components can be separated from each other [11],[12],[13].
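The separation can be illustrated with a crude synthetic example (our own construction, not real speech), using the real cepstrum and a low-quefrency lifter of duration less than the pitch period P:

```python
import numpy as np

# Synthetic "voiced" segment: a decaying resonance h[n] circularly convolved
# with a minimum-phase pulse train p[n] of period P (a sketch, not real speech)
Nfft, P = 1024, 128
n = np.arange(Nfft)
h = (0.9 ** n) * np.cos(2 * np.pi * 0.1 * n)           # system response
p = np.zeros(Nfft)
p[::P] = 0.9 ** np.arange(Nfft // P)                   # decaying pitch pulses
x = np.fft.ifft(np.fft.fft(h) * np.fft.fft(p)).real    # x = h (circularly) * p

# Real cepstrum of x: the pulse-train part lives at quefrencies 0, P, 2P, ...
c = np.fft.ifft(np.log(np.abs(np.fft.fft(x)))).real

# Symmetric low-quefrency lifter with cutoff below P keeps the system part
l = np.zeros(Nfft)
l[:P // 2] = 1.0
l[-(P // 2) + 1:] = 1.0
log_mag = np.fft.fft(c * l).real                       # smoothed log spectrum

# Up to a constant gain (the zero-quefrency bin mixes in a pulse-train term),
# the liftered log spectrum matches log|H(w)|
diff = log_mag - np.log(np.abs(np.fft.fft(h)))
```

The residual ripple in diff comes from the small tail of the system cepstrum that extends beyond the lifter cutoff.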

Figure 6.9 Complex cepstrum of x[n] = p[n] * h[n] of Example 6.6. The sequence p[n] is minimum-phase and h[n] is mixed phase (zeros inside and poles outside the unit circle).

Image

Further insight into this separation process can be gained with a spectral smoothing interpretation of homomorphic deconvolution. We begin by viewing log[X(ω)] as a “time signal” and suppose it consists of low-frequency and high-frequency contributions. Then one might lowpass or highpass filter this signal to separate the two frequency components. One implementation of the lowpass filtering is given in Figure 6.10a, which is simply the concatenation of our forward and inverse sub-systems D* and Image with a “filter” l[n] placed between the two operators. As illustrated in Figure 6.10b, the filtering operation on Image can be implemented by (inverse) Fourier transforming the signal log[X(ω)] to obtain the complex cepstrum, applying the filter l[n] to the complex cepstrum, and then (forward) Fourier transforming back to a desired signal Image. In this operation, we have interchanged the time and frequency domains by viewing the frequency-domain signal log[X(ω)] as a time signal to be filtered. This view originally led to the nomenclature [2] “cepstrum” since Image can be thought of as the “spectrum” of log[X(ω)]; correspondingly, the time-axis for Image is referred to as “quefrency,” and filter l[n] as the “lifter.” Rather than transforming to the quefrency domain, we could have directly convolved log[X(ω)] with the Fourier transform of the lifter l[n], denoted as L(ω). The three elements in the dotted lines of Figure 6.10b can then be replaced by L(ω), which can be viewed as a smoothing function:

Figure 6.10 Homomorphic filtering interpreted as a linear smoothing of log[X(ω)]: (a) quefrency-domain implementation; (b) expansion of the operations in (a); (c) frequency-domain interpretation.

Image

(6.14)

Image

which is illustrated in Figure 6.10c and where Image denotes circular convolution.

With this spectral smoothing perspective of homomorphic filtering, one is motivated to smooth X(ω) directly rather than through its logarithm. An advantage of smoothing the logarithm, however, is that the logarithm compresses the spectrum, thus reducing its dynamic range (i.e., its range of values) and giving a better estimate of low-energy spectrum regions after smoothing; without this “dynamic range compression,” the low-energy regions, e.g., high-frequency regions in voiced speech, may be distorted by leakage from high-energy regions, e.g., low-frequency regions in voiced speech (Figure 6.11). In a speech processing context, the low-energy resonances and harmonics can be distorted by leakage from the high-energy regions. The logarithm is simply one compressive operator. In a later section we explore spectral root homomorphic deconvolution that provides a generalization of the logarithm, and that is motivated by the spectral smoothing interpretation of homomorphic filtering.

Figure 6.11 Schematic of smoothing (a) a harmonic spectrum in contrast to (b) the logarithm of a harmonic spectrum.

Image

6.4.4 Discrete Complex Cepstrum

In previous sections, we determined analytically the complex cepstrum of a variety of classes of discrete-time sequences x[n] using the discrete-time Fourier transform X(ω) or z-transform X(z). In practice, however, the discrete Fourier transform (DFT) is applied to a sequence x[n] of finite length N. An N-point DFT is then used to compute the complex cepstrum as

Image

where Image is referred to as the discrete complex cepstrum. Two computational issues arise: (1) Aliasing in Image since Image is infinitely long, i.e., Image is an aliased version of Image, being of the form Image and (2) Phase unwrapping from samples of the principal phase values to ensure a continuous phase function. To avoid significant distortion from aliasing, the DFT must be “large enough.” Similar considerations hold for the real cepstrum. In Sections 6.7 and 6.8, we will see that in the context of speech analysis, a 512- to 1024-point DFT is adequate.
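The aliasing issue can be illustrated by comparing the discrete complex cepstrum, for a small and a large DFT, against the exact cepstrum of a one-zero sequence; a sketch assuming NumPy:

```python
import numpy as np

x = np.array([1.0, -0.8])    # X(z) = 1 - 0.8 z^{-1}; exact xhat[n] = -(0.8^n)/n

def dft_cepstrum(x, N):
    X = np.fft.fft(x, N)
    return np.fft.ifft(np.log(np.abs(X)) + 1j * np.unwrap(np.angle(X))).real

n = np.arange(1, 16)
exact = -(0.8 ** n) / n
err_small = np.max(np.abs(dft_cepstrum(x, 32)[1:16] - exact))     # visibly aliased
err_large = np.max(np.abs(dft_cepstrum(x, 1024)[1:16] - exact))   # negligible
```

Because the true cepstrum decays as 0.8^n/n, the alias terms shrink rapidly with the DFT length.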

In addressing the second computational issue, we want the unwrapped phase samples to equal our analytic definition of continuous phase. We earlier saw in Section 6.3 two frameworks in which a phase unwrapping algorithm can be developed: 7 (1) Phase continuity by adding appropriate multiples of 2π to the principal phase value, and (2) Continuity by integration of the phase derivative. In the latter, we can analytically obtain the continuous phase function by integrating the phase derivative by way of the real and imaginary parts. In practice, however, we have only samples of the principal phase and the real and imaginary parts of the Fourier transform. The two frameworks motivate the following algorithms for phase unwrapping:

7 Phase unwrapping appears throughout the text in a number of other contexts, such as in the phase vocoder and sinewave analysis/synthesis.

Modulo 2π Phase Unwrapper: This algorithm finds an integer multiple of 2π for each k, expressed as 2πr[k], to add to the principal phase function to yield a continuous phase [13]. That is, we find a phase function of the form

(6.15)

$$\angle X(k) = \mathrm{PV}[\angle X(k)] + 2\pi r[k]$$

such that ∠X(k) is continuous. Let r[0] = 0; then to find r[k] for k ≥ 1, we perform the following steps:

S1: If PV[∠X(k)] − PV[∠X(k − 1)] > 2π − ε (a positive jump of 2π is detected), then subtract 2π, i.e.,

r[k] = r[k − 1] − 1.

S2: If PV[∠X(k)] − PV[∠X(k − 1)] < −(2π − ε) (a negative jump of 2π is detected), then add 2π, i.e.,

r[k] = r[k − 1] + 1.

S3: Otherwise

r[k] = r[k − 1].

This approach to phase unwrapping yields the correct unwrapped phase whenever the frequency spacing 2π/N is small enough that the difference between any two adjacent samples of the unwrapped phase is less than the threshold ε. There exist, however, numerous cases where this "small enough" condition is not satisfied, as demonstrated in the following example:
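Steps S1–S3 can be transcribed directly. This is a minimal sketch; the jump threshold 2π − ε is parameterized here with ε = π as an assumed default, so that any inter-sample jump larger than π is treated as a wrap:

```python
import numpy as np

def unwrap_mod2pi(pv_phase, eps=np.pi):
    """Modulo-2pi phase unwrapper: add 2*pi*r[k] to the principal values pv_phase."""
    r = 0
    unwrapped = np.empty_like(pv_phase)
    unwrapped[0] = pv_phase[0]               # r[0] = 0
    for k in range(1, len(pv_phase)):
        jump = pv_phase[k] - pv_phase[k - 1]
        if jump > 2 * np.pi - eps:           # S1: positive jump of ~2*pi detected
            r -= 1
        elif jump < -(2 * np.pi - eps):      # S2: negative jump of ~2*pi detected
            r += 1
        unwrapped[k] = pv_phase[k] + 2 * np.pi * r   # S3 otherwise: r unchanged
    return unwrapped

# a smooth phase that winds through several multiples of 2*pi
theta = np.linspace(0.0, -12 * np.pi, 400)
wrapped = np.angle(np.exp(1j * theta))       # principal values in (-pi, pi]
print(np.max(np.abs(unwrap_mod2pi(wrapped) - theta)))  # ~0: unwrapping succeeds
```

When adjacent unwrapped-phase samples differ by more than the threshold, as happens near a zero close to the unit circle, the recursion fails; this is precisely the ambiguity illustrated in the example that follows.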

Example 6.7       Consider a sequence x[n] that has a zero very close to the unit circle, located midway between two DFT frequencies ω_{k−1} and ω_k. The phase will change by approximately +π between ω_{k−1} and ω_k if the zero is inside the unit circle and by approximately −π if the zero is outside the unit circle (Figure 6.12a).8 With the DFT spacing shown in Figure 6.12, the above phase unwrapping algorithm cannot distinguish between a natural discontinuity due to the closeness of the zero to the unit circle and an artificial discontinuity due to the wrapping (modulo 2π) of a smooth phase function (Figure 6.12b). This situation can occur even when the zeros and poles are not close to the unit circle, but are clustered so that phase changes accumulate to create the above ambiguity.

8 Consider a vector argument using the example 1 − αz^{−1} = z^{−1}(z − α).

As we saw in Example 6.7, zeros or poles close to the unit circle may cause natural phase discontinuities which are difficult to distinguish from modulo 2π jumps, requiring for accurate phase unwrapping a very large DFT that may be impractical. An alternative algorithm exploits the additional information one can obtain through the phase derivative, which gives the direction and rate at which the phase is changing.

Phase Derivative-Based Phase Unwrapper: An alternative phase unwrapping algorithm combines the information contained in both the phase derivative and principal value of the phase [19]. We saw earlier that the unwrapped phase can be obtained analytically as the integral of the phase derivative:

(6.16)

$$\angle X(\omega) = \int_{0}^{\omega} \dot{\theta}(\eta)\, d\eta$$

where the phase derivative $\dot{\theta}(\omega)$ is uniquely defined through the real and imaginary parts of X(ω), denoted Xr(ω) and Xi(ω), respectively, and is given in Equation (6.8). Although the unwrapped phase can be precisely defined through Equation (6.16), in general it cannot be implemented in discrete time. Nevertheless, using Equation (6.16), one can attempt to compute the unwrapped phase by numerical integration. The accuracy of this approach depends on the accuracy of the derivatives of the real and imaginary components, possibly approximated by first differences, and on the size of the integration step Δω, but this approach can lead to significant errors since the numerical error can accumulate (Exercise 6.5). In order to avoid the accumulation of error, the principal phase value can be used as a reference, as we now show [19].

Figure 6.12 Phase unwrapping ambiguity: (a) natural unwrapped phase change of π across two DFT frequencies, with zero close to unit circle; (b) two possible phase values in performing phase unwrapping.

SOURCE: J.M. Tribolet, “A New Phase Unwrapping Algorithm” [19]. ©1977, IEEE. Used by permission.

Image

The phase unwrapping problem can be restated as determining the integer value q(ωk) such that

∠X(ωk) = PV[∠X(ωk)] + 2πq(ωk)

gives the integrated phase derivative at ω_k. Assume the phase has been correctly unwrapped up to ω_{k−1} with value θ(ω_{k−1}). Then the unwrapped phase at ω_k is given by

$$\theta(\omega_k) = \theta(\omega_{k-1}) + \int_{\omega_{k-1}}^{\omega_k} \dot{\theta}(\eta)\, d\eta$$

where θ(ω_k) denotes the unwrapped phase ∠X(ω_k). As indicated in Figure 6.12b, we can then think of numerical integration as "predicting" the next phase value from the previous one. We then compare that predicted value against the candidate phase values PV[∠X(ω_k)] + 2πq(ω_k). For trapezoidal numerical integration, the unwrapped phase is estimated as

$$\tilde{\theta}(\omega_k) = \theta(\omega_{k-1}) + \frac{\Delta\omega}{2}\left[\dot{\theta}(\omega_{k-1}) + \dot{\theta}(\omega_k)\right]$$

which improves as the DFT length increases (frequency spacing Δω = 2π/N decreases). One then selects the value of q(ω_k) such that the difference between the predicted and candidate values is minimized, i.e., we minimize

(6.17)

$$E[q(\omega_k)] = \left|\tilde{\theta}(\omega_k) - \left(\mathrm{PV}[\angle X(\omega_k)] + 2\pi q(\omega_k)\right)\right|$$

over q(ωk). One can reduce this minimum error, and thus improve the accuracy of the phase unwrapper, by reducing the frequency spacing (by increasing the DFT length).
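This predictor–corrector scheme can be sketched in NumPy. Here the phase derivative is computed exactly through the real and imaginary parts via the DFT of n·x[n] (so that dX/dω = −j·DFT{n x[n]}), and the minimizing q(ωk) is found in closed form by rounding; the test sequence, a delayed impulse, is a hypothetical example whose unwrapped phase −3ω wraps many times:

```python
import numpy as np

def phase_derivative_unwrap(x, N):
    """Tribolet-style unwrapper: trapezoidal prediction plus principal-value correction."""
    X = np.fft.fft(x, N)
    dX = -1j * np.fft.fft(np.arange(len(x)) * x, N)     # dX/domega, computed exactly
    # phase derivative from the real and imaginary parts of X(omega)
    dtheta = (X.real * dX.imag - X.imag * dX.real) / np.abs(X) ** 2
    pv = np.angle(X)
    dw = 2 * np.pi / N
    theta = np.empty(N)
    theta[0] = pv[0]
    for k in range(1, N):
        # trapezoidal prediction of the next unwrapped phase value
        predicted = theta[k - 1] + 0.5 * dw * (dtheta[k - 1] + dtheta[k])
        # choose q to minimize |predicted - (pv + 2*pi*q)|
        q = np.round((predicted - pv[k]) / (2 * np.pi))
        theta[k] = pv[k] + 2 * np.pi * q
    return theta

N = 64
x = np.zeros(8); x[3] = 1.0        # X(omega) = e^{-j 3 omega}: unwrapped phase -3*omega
theta = phase_derivative_unwrap(x, N)
print(np.max(np.abs(theta + 3 * 2 * np.pi * np.arange(N) / N)))  # ~0
```

Because the phase derivative supplies the direction and rate of change, the wraps of the linear-phase term are resolved correctly even though the principal value jumps repeatedly.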

There have been numerous attempts to improve on these phase unwrapping algorithms, including a method of polynomial factoring [17] and another based on Chebyshev polynomials [6], both of which give closed-form solutions to the unwrapping problem rather than recursively using past values of the unwrapped phase as in the above methods. These closed-form solutions, however, appear to lack computational robustness because they require impractical numerical precision.

6.5 Spectral Root Homomorphic Filtering

A different homomorphic system for convolution is motivated by mapping x[n] = h[n] * p[n] to $\hat{x}_\gamma[n] = \hat{h}_\gamma[n] * \hat{p}_\gamma[n]$ such that $\hat{p}_\gamma[n]$ is a new pulse train with the same spacing as p[n], and where $\hat{h}_\gamma[n]$ is more time-limited than h[n]. If $\hat{h}_\gamma[n]$ is sufficiently compressed in time, it can be extracted by time-liftering in the vicinity of the origin. One such class of homomorphic systems replaces the logarithm with the γth power of the z-transform X(z), i.e., with the rooting operation X(z)^γ [5]. As with time-liftering the complex cepstrum, this alternate means of homomorphic filtering, referred to as "spectral root homomorphic filtering" [5], can also be considered a spectral smoother.

Spectral root homomorphic filtering, illustrated in Figure 6.13, is an analog of the log-based system; the difference is that the γ and 1/γ power operations replace the logarithmic and exponential operations. If we consider real-valued γ, then we define

(6.18)

$$X^\gamma(z) = \exp\{\gamma \log[X(z)]\}$$

As with the complex logarithm, in order to make our definition unique, the phase must be unambiguously defined, and this can be done through either of the approaches described in the previous section. Then, since x[n] is a real and stable sequence, X^γ(z) is a valid z-transform with an ROC that includes the unit circle. Under this condition, we define a sequence, analogous to the complex cepstrum, as the inverse Fourier transform of X^γ(ω), which we refer to as the "spectral root cepstrum":

$$\hat{x}_\gamma[n] = \frac{1}{2\pi}\int_{-\pi}^{\pi} X^\gamma(\omega)\, e^{j\omega n}\, d\omega$$

Figure 6.13 Spectral root homomorphic filtering.

SOURCE: J.S. Lim, “Spectral Root Homomorphic Deconvolution System” [5]. ©1979, IEEE. Used by permission.

Image

Because $|X(\omega)|^\gamma$ is even and γ∠X(ω) is odd, $\hat{x}_\gamma[n]$ is a real and stable sequence. As with liftering the complex cepstrum, liftering the spectral root cepstrum can be used to separate the fast- and slow-varying components of X^γ(ω).

As before, we consider a class of sequences with rational z-transforms of the form of Equation (6.9). Then X^γ(z) is expressed as

(6.19)

$$X^\gamma(z) = A^\gamma\, \frac{\prod_{k=1}^{M_i}(1 - a_k z^{-1})^\gamma \prod_{k=1}^{M_o}(1 - b_k z)^\gamma}{\prod_{k=1}^{N_i}(1 - c_k z^{-1})^\gamma \prod_{k=1}^{N_o}(1 - d_k z)^\gamma}$$

where (1 − a_k z^{−1}) and (1 − c_k z^{−1}) are zeros and poles inside the unit circle and (1 − b_k z) and (1 − d_k z) are zeros and poles outside the unit circle, with |a_k|, |b_k|, |c_k|, |d_k| < 1. The time-shift term z^r has been removed, assuming it can be estimated, and the factor A is assumed positive. Each factor in Equation (6.19) can be rewritten using the following power series expansion:

$$(1 - a_k z^{-1})^\gamma = \sum_{n=0}^{\infty} \binom{\gamma}{n} (-a_k)^n z^{-n}$$

where $\binom{\gamma}{n} = \frac{\gamma(\gamma-1)\cdots(\gamma-n+1)}{n!}$ with $\binom{\gamma}{0} = 1$. Thus the spectral root cepstrum of, for example, the kth zero inside the unit circle is given by

$$\binom{\gamma}{n}(-a_k)^n, \qquad n \ge 0$$

and the spectral root cepstra for the three remaining factors (1 − bz)^γ, (1 − cz^{−1})^γ, and (1 − dz)^γ are similarly derived (Exercise 6.7).

Many of the properties of the spectral root cepstrum are similar to those of the complex cepstrum because the former can be written in terms of the latter [5]. To see this, let $\hat{x}[n]$ denote the complex cepstrum, with transform $\hat{X}(z) = \log[X(z)]$. Then $X^\gamma(z)$ is related to $\hat{X}(z)$ by

$$X^\gamma(z) = e^{\gamma \hat{X}(z)}, \qquad \frac{d X^\gamma(z)}{dz} = \gamma\, X^\gamma(z)\, \frac{d\hat{X}(z)}{dz}$$

Using the inverse z-transform, the relation between $\hat{x}_\gamma[n]$ and $\hat{x}[n]$ is then given by

(6.20)

$$n\,\hat{x}_\gamma[n] = \gamma \sum_{k=-\infty}^{\infty} k\,\hat{x}[k]\,\hat{x}_\gamma[n-k]$$

From Equation (6.20), we see that if x[n] is minimum-phase, then $\hat{x}_\gamma[n]$ is right-sided, i.e., $\hat{x}_\gamma[n] = 0$ for n < 0, since its complex cepstrum $\hat{x}[n]$ is right-sided. A similar observation can be made for maximum-phase, left-sided sequences. As with the complex cepstrum, for minimum- and maximum-phase sequences, $\hat{x}_\gamma[n]$ can be obtained from |X(ω)| (Exercise 6.7). Finally, to complete the analogy with the complex cepstrum, if p[n] is a train of impulses with equal spacing N, then $\hat{p}_\gamma[n]$ remains an impulse train with the same spacing (Exercise 6.7).
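The relation between the two cepstra can be checked numerically. The sketch below, assuming a hypothetical single-zero minimum-phase sequence and γ = 0.5, verifies that n·x̂γ[n] = γ Σₖ k·x̂[k]·x̂γ[n−k], which follows from X^γ(ω) = exp(γ log X(ω)):

```python
import numpy as np

x = np.array([1.0, -0.5])                  # X(z) = 1 - 0.5 z^{-1} (minimum-phase)
M, gamma = 2048, 0.5
X = np.fft.fft(x, M)
log_X = np.log(np.abs(X)) + 1j * np.unwrap(np.angle(X))
x_hat = np.fft.ifft(log_X).real            # complex cepstrum
x_gamma = np.fft.ifft(np.exp(gamma * log_X)).real   # spectral root cepstrum of X^gamma

# verify n*xg[n] = gamma * sum_k (k*xhat[k]) * xg[n-k] on the first K points
K = 100
n = np.arange(K)
lhs = n * x_gamma[:K]
rhs = gamma * np.convolve(x_gamma[:K], n * x_hat[:K])[:K]
print(np.max(np.abs(lhs - rhs)))           # ~0: the two sides agree
```

Since x[n] is minimum-phase, both cepstra are right-sided here, so truncating the convolution to the first K points introduces no error.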

When the unwrapped phase is defined unambiguously, e.g., through the phase derivative, the spectral root cepstrum of a convolution of two sequences equals the convolution of their spectral root cepstra. That is, if x[n] = x1[n] * x2[n], then X(z) = X1(z)X2(z) so that X(z)^γ = X1(z)^γ X2(z)^γ, resulting in $\hat{x}_\gamma[n] = \hat{x}_{1,\gamma}[n] * \hat{x}_{2,\gamma}[n]$. This convolutional property is the basis for the spectral root deconvolution system, which is analogous to the complex cepstrum deconvolution system that maps convolution to addition, i.e., $\hat{x}[n] = \hat{x}_1[n] + \hat{x}_2[n]$. To see how the spectral root cepstrum can be used for deconvolution, we look at the following example, where the convolutional components are an impulse train x1[n] = p[n] and an all-pole response x2[n] = h[n]:

Example 6.8       Suppose h[n] is a minimum-phase all-pole sequence of order q. Consider a waveform x[n] constructed by convolving h[n] with a sequence p[n] where

p[n] = δ[n] + βδ[n − N], with β < 1

so that

x[n] = p[n] * h[n]

where q < N and where

P(z) = 1 + βz^{−N}.

Suppose we form the spectral root cepstrum of x[n] with γ = −1. Then, using the Taylor series expansion of $(1 + \beta z^{-N})^{-1}$, it is seen that the inverse z-transform of P^{−1}(z) is an impulse train with impulses spaced by N samples (Exercise 6.8). Also, H^{−1}(z) is all-zero, since H(z) is all-pole, so that its inverse z-transform is a q-point sequence. Because q < N, h[n] can be deconvolved from x[n] by inverting X(z) to obtain X^{−1}(z), and liftering h^{−1}[n], the inverse z-transform of H^{−1}(z), using a right-sided lifter of q samples (Exercise 6.8).
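Example 6.8 can be played out numerically. The sketch below uses hypothetical values (a second-order all-pole H(z) with poles at z = 0.5 and z = 0.4, β = 0.5, N = 64); for the integer power γ = −1, X(z)^γ is simply the reciprocal of the DFT, so no phase unwrapping is needed in this special case:

```python
import numpy as np

den = np.array([1.0, -0.9, 0.2])     # A(z): H(z) = 1/A(z), poles at 0.5 and 0.4 (q = 2)
beta, Np = 0.5, 64                   # source p[n] = delta[n] + beta*delta[n - Np]
L, M = 256, 512

# synthesize x[n] = p[n] * h[n] by running the all-pole recursion A(z)X(z) = P(z)
p = np.zeros(L); p[0] = 1.0; p[Np] = beta
x = np.zeros(L)
for n in range(L):
    x[n] = p[n] - sum(den[k] * x[n - k] for k in range(1, 3) if n - k >= 0)

# spectral root cepstrum with gamma = -1: inverse transform of X(z)^{-1}
x_inv = np.fft.ifft(1.0 / np.fft.fft(x, M)).real

# a right-sided low-time lifter recovers A(z) = H^{-1}(z), since the
# impulse-train part of X^{-1}(z) first contributes only at n = Np
a_est = x_inv[:3]
print(a_est)   # close to [1, -0.9, 0.2]; inverting A(z) then returns h[n]
```

The small residual error comes from circular aliasing of the β^m pulse train in the M-point inverse DFT; it shrinks as M grows.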

More generally, when p[n] is an impulse train of the form $p[n] = \sum_{k} a_k \delta[n - kN]$ and when $\hat{h}_\gamma[n]$ is sufficiently low-time limited, then low-time liftering of $\hat{x}_\gamma[n]$ yields an estimate of h[n] scaled by the value a_0 of the pulse train at the origin; as shown in Example 6.8, this estimate can be exact. In comparison, low-time liftering the complex cepstrum does not recover the response h[n] to within a scale factor, since $\hat{h}[n]$ is always infinite in extent. In general, however, the relative advantages are not clear-cut, since for a general pole-zero sequence $\hat{h}_\gamma[n]$ is also infinitely long. In this situation, we select γ to maximally compress $\hat{h}_\gamma[n]$ so that its energy "concentration" in the low-time region is greatest. One definition of energy concentration is the percentage of the energy of $\hat{h}_\gamma[n]$ falling in its first n points relative to its total energy [5]. For an all-pole sequence, as we saw in the previous example, γ = −1 gives a tight concentration, while for an all-zero sequence, γ = 1 is preferred. For pole-zero sequences, the selection of γ depends on the pole-zero arrangement; for example, for voiced speech dominated by poles, a γ closer to γ = −1 is optimal. Empirically, it has been found that as the number of poles increases relative to the number of zeros, γ should be made closer to −1, and vice-versa when zeros dominate, as illustrated in the following example:

Example 6.9       Consider a sequence of the form x[n] = p[n] * h[n], as in Example 6.8. Figure 6.14 illustrates an example of extracting an h[n] with ten poles and two zeros [5]. Figure 6.14a shows the logarithm of its spectral magnitude. The spectral log-magnitude of an estimate of h[n] derived from low-time liftering the real cepstrum is shown in Figure 6.14b. The same estimate derived from the real spectral root cepstrum is given in Figures 6.14c and 6.14d with γ = +0.5 and γ = −0.5, respectively. The negative value of γ results in better pole estimates, while the positive value results in better zero estimates. For this example, the poles dominate the spectrum and thus the negative γ value is preferred.

Figure 6.14 Example of spectral root homomorphic filtering on synthetic vocal tract impulse response: (a) log-magnitude spectrum of impulse response; (b) estimate of log-magnitude spectrum of h[n] derived from low-time gating real cepstrum; (c) log-magnitude spectral estimate derived from low-time gating spectral root cepstrum with γ = +0.5; (d) same as (c) with γ = −0.5.

SOURCE: J.S. Lim, “Spectral Root Homomorphic Deconvolution System” [5]. ©1979, IEEE. Used by permission.

Image

As with logarithmic cepstral-based analysis, a spectral smoothing interpretation of the spectral root cepstral-based system can be made. This interpretation shows, for example, that H(ω) in Example 6.9 is estimated by lowpass filtering X(ω)^γ. It is natural to ask, therefore, whether the speech spectrum may be smoothed based on some other transformation that is neither logarithmic nor a spectral root function. Finally, as with the complex cepstrum, computational considerations include aliasing due to insufficient DFT length and inaccuracy in phase unwrapping. Unlike the complex cepstrum, phase unwrapping is required in both the forward X(ω)^γ and the inverse X(ω)^{1/γ} transformations (Exercise 6.9).

6.6 Short-Time Homomorphic Analysis of Periodic Sequences

Up to now we have assumed that a model of voiced speech is an exact convolution of an impulse train and an impulse response, i.e., x[n] = p[n] * h[n], where the impulse train p[n], with impulse spacing P, is of finite extent. In practice, however, a periodic waveform is windowed by a finite-length sequence w[n] to obtain a short-time segment of voiced speech:

s[n] = w[n](p[n] * h[n]).

Ideally, we want s[n] to be close to the convolutional model $\tilde{s}[n] = (w[n]p[n]) * h[n]$. We expect this to hold under the condition that w[n] is smooth relative to h[n], i.e., under this smoothness condition, we expect9

9 Consider the special case where the impulse spacing P is very large so that there is no overlap among the sequences h[nkP]. Suppose also that the window is piecewise flat and constant over each h[nkP]. Then the approximation in Equation (6.21) becomes an equality.

$$s[n] = w[n]\sum_{k=-\infty}^{\infty} h[n - kP] \approx \sum_{k=-\infty}^{\infty} w[kP]\, h[n - kP] = (w[n]p[n]) * h[n]$$

so that

(6.21)

$$\hat{s}[n] \approx \hat{p}_w[n] + \hat{h}[n]$$

where $\hat{p}_w[n]$ is the complex cepstrum of the windowed impulse train w[n]p[n] and $\hat{h}[n]$ is the complex cepstrum of h[n]. It can be shown, however, that the complex cepstrum of the windowed sequence is given exactly by [21]

(6.22)

$$\hat{s}[n] = \hat{p}_w[n] + D[n] \sum_{k=-\infty}^{\infty} \hat{h}[n - kP]$$

where D[n] is a weighting function concentrated at n = 0 and dependent on the window w[n]. In this section, we do not prove this general case, but rather look at a simplifying case to provide insight into the representation for performing deconvolution. This leads to a set of conditions on the analysis window w[n] and the cepstral lifter l[n] under which $\hat{s}[n] \approx \hat{p}_w[n] + \hat{h}[n]$ in the context of deconvolution of speech-like periodic signals.

6.6.1 Quefrency-Domain Perspective

Consider first a voiced speech signal modeled as a perfectly periodic waveform x[n] = p[n] * h[n] with source $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$, where P is the pitch period, and with vocal tract impulse response h[n]; then samples of log[X(ω)] are defined only at multiples of the fundamental frequency ω0 = 2π/P, that is,

log[X(ωk)] = log[P(ωk)] + log[H(ωk)]

where ωk = kω0, and log[X(ω)] is undefined elsewhere because X(ω) = 0 for ω ≠ kω0. Suppose we define log[X(ω)] = 0 for ω ≠ kω0. Then a system component of the form $\sum_{k=-\infty}^{\infty} \hat{h}[n - kP]$ appears in the complex cepstrum, i.e., the system component consists of replicas of the desired complex cepstrum $\hat{h}[n]$. These replicas must occur at the pitch period rate because samples of the logarithm are available only at harmonics. Moreover, aliasing at the pitch period rate can occur as a consequence of the undersampling of the spectrum H(ω). From this perspective, we must account for the Nyquist sampling condition and the decay of $\hat{h}[n]$.

To further develop this concept analytically, let us now introduce a short-time window w[n] and define

s[n] = w[n](p[n] * h[n])

where, as before, $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$, and h[n] is the system impulse response. We have seen in Chapter 2 that

(6.23)

$$S(\omega) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(k\omega_0)\, W(\omega - k\omega_0)$$

We write s[n] as

(6.24)

$$s[n] = (w[n]p[n]) * g[n]$$

where g[n] is a sequence assumed “close to” h[n]. Then, in this form S(ω) can be expressed as

(6.25)

$$S(\omega) = G(\omega)\, \frac{1}{P} \sum_{k=-\infty}^{\infty} W(\omega - k\omega_0)$$

Therefore, taking the logarithm of the both sides of Equations (6.23) and (6.25) and solving for log[G(ω)], we have

(6.26)

$$\log[G(\omega)] = \log\left[\sum_{k=-\infty}^{\infty} H(k\omega_0)\, W(\omega - k\omega_0)\right] - \log\left[\sum_{k=-\infty}^{\infty} W(\omega - k\omega_0)\right]$$

To simplify the expression in Equation (6.26), consider a rectangular W(ω) where W(ω) = 1 for |ω| ≤ ω0/2 and is otherwise zero in the interval [−π, π] (and periodic with period 2π). In the time domain, W(ω) corresponds to the sinc function $\tilde{w}[n] = \frac{\sin(\pi n/P)}{\pi n/P}$ (to within a scale factor), whose zeros fall at nonzero integer multiples of P. The second term in Equation (6.26) then becomes zero and, with our choice of W(ω), the logarithm operator can be taken inside the summation of the first term, resulting in

(6.27)

$$\log[G(\omega)] = \sum_{k=-\infty}^{\infty} \log[H(k\omega_0)]\, W(\omega - k\omega_0)$$

Therefore, from Equations (6.24) and (6.27), we can write the complex cepstrum of s[n] as

(6.28)

$$\hat{s}[n] = \hat{p}_w[n] + \hat{g}[n]$$

where $\hat{p}_w[n]$ is the complex cepstrum of p[n]w[n] and where the complex cepstrum of g[n] is given by

(6.29)

$$\hat{g}[n] = \tilde{w}[n] \sum_{k=-\infty}^{\infty} \hat{h}[n - kP]$$

where $\hat{h}[n]$ is the complex cepstrum of h[n] and $\tilde{w}[n]$ is the inverse Fourier transform of the rectangular function W(ω). The result is illustrated in Figure 6.15. We see that Equation (6.29) is a special case of Equation (6.22) with D[n] = $\tilde{w}[n]$.

As with the purely convolutional model, i.e., $\hat{s}[n] = \hat{p}_w[n] + \hat{h}[n]$, the contributions of the windowed pulse train and impulse response are additively combined so that deconvolution is possible. Now, however, the impulse response contribution is repeated at the pitch period rate. This is a different sort of aliasing, dependent upon pitch, from the aliasing we saw earlier in Section 6.4.4 due to an insufficient DFT length. We also see that both the impulse response contribution and its replicas are weighted by a "distortion" function D[n] = $\tilde{w}[n]$. In the particular case described, D[n] is pitch-dependent because the window transform W(ω) is rectangular with width equal to the fundamental frequency. Furthermore, $\tilde{w}[n]$ has zeros passing through nonzero integer multiples of the pitch period, thus reducing the effect of the replicas of $\hat{h}[n]$. We have assumed in this derivation that the window w[n] is centered at the origin so that the window contains no linear-phase component. Likewise, we have assumed that the impulse response h[n] has no linear phase, thus avoiding the dominating effect of linear phase shown in Example 6.5. An implication of these two conditions is that the window and impulse response are "aligned." When the window is displaced from the origin, an expression similar to Equations (6.22) and (6.29) can be derived using the approach in [21]. The distortion term D[n] then becomes a function of the window shift r and thus is written D[n, r]; this distortion function becomes increasingly severe as r increases [21].

Figure 6.15 Schematic of complex cepstrum of windowed periodic sequence.

Image

We can now state conditions on our particular window w[n] under which s[n] ≈ (p[n]w[n]) * h[n] from the view of the quefrency domain [Equations (6.28) and (6.29) and, more generally, Equation (6.22)]. First, we should select the time-domain window w[n] to be long enough so that D[n] is "smooth" in the vicinity of the quefrency origin and over the extent of $\hat{h}[n]$. Second, the window w[n] should be short enough to reduce the effect of the replicas of $\hat{h}[n]$. Verhelst and Steenhaut [21] have shown that for typical windows (such as Hamming or Hanning), a compromise between these two conflicting constraints is a window w[n] whose duration is about 2 to 3 pitch periods. Finally, the window w[n] should be centered at the time origin and "aligned" with h[n], i.e., there is no linear phase contribution from h[n]. In the context of homomorphic deconvolution, in selecting the low-time lifter l[n], we also want to account for the general two-sidedness of $\hat{h}[n]$. Therefore, a low-time lifter width of half the pitch period should be used for deconvolution. Under these conditions, for |n| < P/2, the complex cepstrum is close to that derived from the conventional model $\hat{s}[n] = \hat{p}_w[n] + \hat{h}[n]$. Finally, note that there are now two reasons for low-pitched waveforms to be more amenable to cepstral analysis than high-pitched waveforms: with high-pitched speakers, there is a stronger presence of $\hat{p}_w[n]$ close to the origin, as noted earlier, but there is also more aliasing of $\hat{h}[n]$ (Exercise 6.11).
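The quefrency-domain picture above can be visualized with a short NumPy sketch (a hypothetical one-resonance h[n], pitch period P = 100, and a Hamming window of 2.5 pitch periods, all assumed values): the real cepstrum of the windowed segment shows a low-quefrency region dominated by the system and a strong peak at the pitch period contributed by the windowed pulse train.

```python
import numpy as np

P, M = 100, 4096
n = np.arange(M)
h = (0.95 ** n) * np.cos(0.25 * np.pi * n)     # hypothetical decaying resonance
p = np.zeros(M); p[:2000:P] = 1.0              # periodic impulse source
x = np.convolve(p, h)[:M]

Lw = int(2.5 * P)                              # window of 2.5 pitch periods
s = x[2 * P: 2 * P + Lw] * np.hamming(Lw)      # segment starting on a pulse

S = np.fft.fft(s, M)
# real cepstrum (magnitude floored to avoid log of a tiny value)
c = np.fft.ifft(np.log(np.maximum(np.abs(S), 1e-8))).real

peak = np.argmax(c[P // 2: 2 * P]) + P // 2    # search above the low-time region
print(peak)                                     # near P: the pitch-period peak
```

Liftering c below P/2 isolates the system contribution, and the cepstral peak near quefrency P is the basis of cepstral pitch estimation.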

In the next section, from the perspective of the frequency domain, we arrive at the above constraints on the analysis window w[n] in a more heuristic way than in this section. This gives us insights different from those we have seen in the quefrency domain.

6.6.2 Frequency-Domain Perspective

Consider again a voiced speech signal modeled as a perfectly periodic waveform, i.e., as x[n] = p[n] * h[n] where $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$, with P the pitch period and h[n] a vocal tract impulse response. This sequence corresponds in the frequency domain to impulses weighted by H(ωk) at multiples of the fundamental frequency (ωk = kω0). Windowing in the time domain can then be thought of as a form of interpolation across the harmonic samples X(ωk) = P(ωk)H(ωk) by the Fourier transform of the window, W(ω). We want to determine conditions on the window under which this interpolation results in the desired convolutional model, i.e., s[n] ≈ (w[n]p[n]) * h[n].

For a particular window, e.g., Hamming, one can use an empirical approach to selecting the window length and determining how well we achieve the above convolutional model. This entails looking at the difference between the desired convolutional model $\tilde{s}[n] = (w[n]p[n]) * h[n]$ and the actual windowed sequence s[n] = w[n](p[n] * h[n]). With $\tilde{S}(\omega)$ the Fourier transform of $\tilde{s}[n]$, we define a measure of spectral degradation with respect to the spectral magnitude as

$$D = \frac{1}{2\pi}\int_{-\pi}^{\pi} \left( \log|S(\omega)| - \log|\tilde{S}(\omega)| \right)^2 d\omega$$

Over a representative set of pitch periods and speech-like sequences, for a Hamming window this spectral distance measure was found empirically to be minimized for window length in the range of roughly 2 to 3 pitch periods [14]. An implication of this result is that the length of the analysis window should be adapted to the pitch period to make the windowed waveform as close as possible (in the above sense) to the desired convolutional model.

An empirical approach can also be taken for determining window conditions for a phase measurement congruent with the convolutional model [14]. We intuitively see a problem with measuring phase in the frequency domain where s[n] = w[n](p[n] * h[n]) is mapped to

$$S(\omega) = \frac{1}{P}\sum_{k=-\infty}^{\infty} H(k\omega_0)\, W(\omega - k\omega_0)$$

which is essentially the mainlobe of the window Fourier transform weighted and repeated at the harmonic frequencies. As we increase the window length beyond about 2 to 3 pitch periods, the phase of S(ω) between the main harmonic lobes becomes increasingly meaningless in light of the desired convolutional model. As we decrease the window length below 2 to 3 pitch periods, then the window transform mainlobes overlap one another and effectively provide an interpolation of the real and imaginary parts of S(ω) across the harmonics, resulting in a phase more consistent with the convolutional model [14],[15]. Using this viewpoint, a window length of roughly 2 to 3 pitch periods can be argued to be “optimal” for phase measurements to be consistent with the convolutional model (Exercise 6.12). A heuristic argument can also be made in the frequency domain for the requirement that the window be centered at the time origin under the condition that h[n] has no linear phase, i.e., the window center and h[n] are “aligned” (Exercise 6.12). These same conditions were established in the previous section using the more analytic quefrency-domain perspective.

Example 6.10       Figure 6.16 illustrates examples of sensitivity to window length and alignment (position) for a synthetic speech waveform with a system function consisting of two poles inside the unit circle at 292 Hz and 3500 Hz and a zero outside the unit circle at 2000 Hz, and a periodic impulse source of 100-Hz pitch [14]. Figure 6.16a shows the unwrapped phase of the system function's pole-zero configuration. In panels (b) and (c), the analysis window is the square of a sinc function, so that its Fourier transform is a triangle with length equal to twice the fundamental frequency. Figure 6.16b shows an unwrapped phase estimate with the window time-aligned with h[n], while Figure 6.16c shows the estimate from the window at a different position. Figures 6.16d and 6.16e show the sensitivity to the length of a Hamming window. In all cases, the unwrapped phase is computed with the modulo 2π phase unwrapper of Section 6.4.4. This example illustrates fundamental problems in the phase representation, being independent of the reliability of the phase unwrapper; it is further studied in Exercise 6.12.

Figure 6.16 Sensitivity of system phase estimate to the analysis window in Example 6.10: (a) unwrapped phase of artificial vocal tract impulse response; (b) unwrapped phase of a periodic waveform with a squared-sinc window, time-aligned; (c) same as (b) with window displacement; (d) same as (b) with Hamming window 2 pitch periods in length; (e) same as (d) with Hamming window 3.9 pitch periods in length.

SOURCE: T.F. Quatieri, “Minimum- and Mixed-Phase Speech Analysis/Synthesis by Adaptive Homomorphic Deconvolution” [14]. ©1979, IEEE. Used by permission.

Image

6.7 Short-Time Speech Analysis

6.7.1 Complex Cepstrum of Voiced Speech

Recall that the transfer function model from the glottal source to the lips output for voiced speech is given by

H(z) = AG(z)V(z)R(z)

where the glottal flow over a single cycle G(z), the source gain A, and the radiation load R(z) are all embedded within the system function. The corresponding voiced speech output in the time domain is given by

$$x[n] = A\,\big(p[n] * g[n] * \upsilon[n] * r_L[n]\big)$$

where p[n] is the idealized periodic impulse train, g[n] is the glottal volume velocity, υ[n] is the resonant (and anti-resonant) vocal tract impulse response, and rL[n] represents the lip radiation response. We assume a rational z-transform for the resonant vocal tract contribution which is stable (poles inside the unit circle) and which may have both minimum- and maximum-phase zeros, i.e.,

(6.30)

$$V(z) = \frac{\prod_{k=1}^{M_i}(1 - a_k z^{-1}) \prod_{k=1}^{M_o}(1 - b_k z)}{\prod_{k=1}^{N_i}(1 - c_k z^{-1})}$$

As we have seen, "pure" vowels contain only poles, while nasalized vowels and unvoiced sounds can have both poles and zeros. The radiation load is modeled by a single zero R(z) ≈ 1 − βz^{−1}, so that with a typical β ≈ 1, we obtain a high-frequency emphasis. The glottal flow waveform can be modeled by either a finite set of minimum- and maximum-phase zeros, i.e.,

$$G(z) = \prod_{k=1}^{L_i}(1 - \alpha_k z^{-1}) \prod_{k=1}^{L_o}(1 - \beta_k z)$$

or, more efficiently (but less generally), by two poles outside the unit circle:

$$G(z) = \frac{1}{(1 - \beta z)^2}, \qquad |\beta| < 1$$

which, in the time domain, is a waveform beginning with a slow attack and ending with a rapid decay, emulating a slow opening and an abrupt closure of the glottis. We have assumed that H(z) contains no linear phase term so that h[n] is “aligned” with the time origin.

One goal in homomorphic analysis of voiced speech is to separate h[n] and the source p[n] [11],[12]. The separation of h[n] into its components is treated later in the chapter. In order to perform analysis, we must first extract a short-time segment of speech with an analysis window w[n]. Let

s[n] = w[n](p[n] * h[n])

which is assumed to be approximately equal to the convolutional model of the form

$$\tilde{s}[n] = (w[n]p[n]) * h[n]$$

i.e., $s[n] \approx \tilde{s}[n]$, under the conditions on w[n] determined in the previous section: the window duration is 2 to 3 pitch periods and its center is aligned with h[n]; the latter condition is more important for good phase estimation of the transfer function than for good spectral magnitude estimation. The discrete complex cepstrum is then computed using an N-point DFT as

$$\hat{s}[n] = \frac{1}{N}\sum_{k=0}^{N-1} \log[S(k)]\, e^{j\frac{2\pi}{N}kn}$$

and a similar expression can be written for the real cepstrum. For a typical speaker, the duration of the short-time window lies in the range of 20 ms to 40 ms. We assume that the source and system components lie roughly in separate quefrency regions and, in particular, that negligible aliasing of the replicas of $\hat{h}[n]$ in Equation (6.22) occurs within half a pitch period of the origin. We also assume that the distortion function D[n] is "smooth" in this same region so that $\hat{h}[n]$ is not significantly distorted, and that D[n] attenuates the replicas of $\hat{h}[n]$ to make them negligible, as described in Section 6.6.1. We then apply a low-quefrency cepstral lifter, l[n] = 1 for |n| < P/2 and zero elsewhere, to separate $\hat{h}[n]$, and apply the complementary high-quefrency cepstral lifter to separate the input pulse train. The following example illustrates properties of this method:
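Before turning to real speech, the liftering recipe can be checked on an idealized exact-convolution case. The sketch below assumes a hypothetical minimum-phase system H(z) = (1 − 0.6z⁻¹)/(1 − 0.5z⁻¹), whose impulse response is h[0] = 1 and h[n] = −0.1(0.5)ⁿ⁻¹ for n ≥ 1, and a two-impulse source of spacing 32; since every component is minimum-phase, a right-sided low-quefrency lifter suffices:

```python
import numpy as np

Nfft, Np = 4096, 32
m = np.arange(512)
h = np.where(m == 0, 1.0, -0.1 * 0.5 ** (m - 1.0))   # impulse response of H(z)
p = np.zeros(512); p[0] = 1.0; p[Np] = 0.5           # source: delta[n] + 0.5*delta[n-32]
x = np.convolve(p, h)[:512]

# discrete complex cepstrum of x[n] = p[n] * h[n]
X = np.fft.fft(x, Nfft)
x_hat = np.fft.ifft(np.log(np.abs(X)) + 1j * np.unwrap(np.angle(X))).real

# low-quefrency lifter: keep 0 <= n < 32, below the first cepstral pulse of p[n]
h_hat = np.zeros(Nfft); h_hat[:Np] = x_hat[:Np]
h_est = np.fft.ifft(np.exp(np.fft.fft(h_hat))).real  # invert the characteristic system

print(h_est[:4])   # close to [1, -0.1, -0.05, -0.025]
```

With a finite-length window and a truly periodic source, the same recipe applies but only approximately, for the reasons developed in Section 6.6.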

Example 6.11       Figure 6.17 illustrates homomorphic filtering for a speech waveform from a female speaker with an average pitch period of about 5 ms. The continuous waveform was sampled at 10000 samples/s and is windowed by a 15-ms Hamming window. A 1024-point FFT is used in computing the discrete complex cepstrum. The window center is aligned roughly with h[n] and shifted to the time origin (n = 0) using the strategy to be described in Section 6.8.2. The figure shows the spectral log-magnitude and unwrapped phase estimates of h[n] superimposed on the short-time Fourier transform measurements, as well as the time-domain source and system estimates. Observe that the unwrapped phase of the short-time Fourier transform is approximately piecewise-flat, resulting from the symmetry of the analysis window and the removal of linear phase through the alignment strategy. (The reader should argue this property.) We will see another example of this piecewise-flat phase characteristic in Chapter 9 (Figure 9.32). A low-quefrency lifter of 6 ms duration (unity in the interval [−3, 3] ms and zero elsewhere) was applied to the cepstrum to obtain the h[n] estimate, and a complementary lifter (zero in the interval [−3, 3] ms and unity elsewhere) was applied to obtain the windowed pulse train source estimate. The deconvolved maximum-phase and minimum-phase response components are shown in Figure 6.18 (using a left-sided and right-sided 3-ms lifter, respectively). The convolution of the two components is identical to the sequence of Figure 6.17d. The figure also shows the log-magnitude spectra of the maximum-phase and minimum-phase components. The maximum-phase component and its log-magnitude spectrum resemble those of a typical glottal flow derivative sequence, indicating that, for this particular example, the radiation (derivative) contribution may be maximum-phase, unlike our earlier minimum-phase zero model. For reference, the glottal flow derivative derived from the pole/zero-estimation method of Section 5.7.2 is also shown.

Figure 6.17 Homomorphic filtering of voiced waveform from female speaker: (a) waveform (solid) and aligned analysis window (dashed); (b) complex cepstrum of windowed speech signal s[n] (solid) and low-quefrency lifter (dashed); (c) log-magnitude spectrum of s[n] (thin solid) and of the impulse response estimate (thick solid); (d) impulse response estimate from low-quefrency liftering; (e) spectral unwrapped phase of s[n] (thin solid) and of the impulse response estimate (thick solid) (The smooth estimate is displaced for clarity.); (f) estimate of windowed impulse train source from high-quefrency liftering.

Image

Figure 6.18 Deconvolved maximum-phase component (a) (solid) and minimum-phase component (b) in Example 6.11. The convolution of the two components (d) is identical to the sequence of Figure 6.17d. Panel (c) shows the log-magnitude spectra of the maximum-phase (dashed) and minimum-phase (solid) component. The maximum-phase component in panel (a) and its log-magnitude spectrum in panel (c) resemble those of a typical glottal flow derivative sequence. For reference, the dashed sequence in panel (a) is the glottal flow derivative derived from the pole/zero-estimation method of Section 5.7.2.

Image

In this example, our spectral smoothing interpretation of homomorphic deconvolution is seen in the low-quefrency and high-quefrency liftering that separates the slowly-varying and rapidly-varying components of the spectral log-magnitude and phase. We see that the resulting smooth spectral estimate does not necessarily traverse the spectral peaks at harmonics, in contrast to the linear prediction analysis of Chapter 5, which more closely preserves the harmonic amplitudes. In addition, because the mixed-phase estimate is obtained by applying a symmetric lifter, of extent roughly |n| < P/2, to the complex cepstrum, the lifter length decreases with the pitch period. Homomorphic filtering by low-quefrency liftering will, therefore, impart more spectral smoothing with increasing pitch. The resulting smoothing widens the formant bandwidths, which tends to result in a "muffled" quality in synthesis (Exercise 6.6). Therefore, the requirement that the duration of the low-quefrency lifter be half the pitch period leads to more artificial widening of the formants for females and children than for males.

As indicated earlier, depending on whether we use the real or complex cepstrum, the impulse response can be zero-, minimum-, or mixed-phase. To obtain the zero-phase estimate, the real cepstrum (or the even part of the complex cepstrum) is multiplied by a symmetric low-quefrency lifter l[n] that extends over the region |n| < P/2, with P the pitch period. The resulting impulse response estimate is symmetric about the origin (Exercise 6.14). On the other hand, the minimum-phase counterpart can be obtained by multiplying the real cepstrum by the right-sided lifter l[n] of Equation (6.13). The resulting impulse response is right-sided and its energy is compressed toward the origin, a property of minimum-phase sequences [13] that was reviewed in Chapter 2. When the vocal tract is indeed minimum-phase, we saw earlier, at the end of Section 6.4.1, that the liftering operation yields the correct phase function. When the vocal tract is mixed-phase, however, which can occur with back-cavity or nasal coupling, for example, then this phase function is only a rough approximation to the original; we have created a synthetic phase from the vocal tract spectral magnitude. In this case, for a rational z-transform, the effect is to flip all zeros that originally fall outside the unit circle to inside the unit circle. We now look at a “proof by example” of this property.
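The liftering options above can be sketched numerically. The following is a minimal sketch, not the text’s implementation: it computes the real cepstrum with an FFT and forms zero- and minimum-phase impulse response estimates with a symmetric and a right-sided low-quefrency lifter, respectively. The FFT length, lifter cutoff, and test sequence are illustrative assumptions.

```python
import numpy as np

def real_cepstrum(x, nfft=1024):
    """Real cepstrum: inverse DFT of the log spectral magnitude."""
    X = np.fft.fft(x, nfft)
    return np.real(np.fft.ifft(np.log(np.abs(X) + 1e-12)))

def min_phase_estimate(x, cutoff, nfft=1024):
    """Minimum-phase impulse response estimate: the real cepstrum is
    weighted by the right-sided lifter of Equation (6.13) (1 at n = 0,
    2 for n > 0), truncated here at a low-quefrency cutoff."""
    c = real_cepstrum(x, nfft)
    l = np.zeros(nfft)
    l[0] = 1.0
    l[1:cutoff] = 2.0              # right-sided low-quefrency lifter
    return np.real(np.fft.ifft(np.exp(np.fft.fft(c * l))))

def zero_phase_estimate(x, cutoff, nfft=1024):
    """Zero-phase counterpart: a symmetric low-quefrency lifter applied
    to the real cepstrum; the result is symmetric about the origin."""
    c = real_cepstrum(x, nfft)
    l = np.zeros(nfft)
    l[:cutoff] = 1.0
    l[-(cutoff - 1):] = 1.0        # symmetric about n = 0 (circularly)
    return np.real(np.fft.ifft(np.exp(np.fft.fft(c * l))))
```

For a sequence that is already minimum-phase, the minimum-phase construction should return (approximately) the sequence itself, while the zero-phase construction returns a symmetric response with the same spectral magnitude.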

Example 6.12       Consider the z-transform

X(z) = (1 − bz) ∏k (1 − ckz^−1) / (1 − az^−1)

where a, b, and the ck are less than unity in magnitude. The ROC of X(z) contains the unit circle so that x[n] is stable. There is one zero outside the unit circle giving a maximum-phase component. The real cepstrum is written as

c[n] = cmax[n] + cmin[n]

where cmax[n] and cmin[n] are the real cepstra of the maximum- and minimum-phase components, respectively. Using the right-sided lifter l[n] of Equation (6.13), we can then write the minimum-phase construction as

x̂mp[n] = l[n]c[n] = l[n]cmin[n] + l[n]cmax[n].

The component l[n]cmin[n] is obtained using Equation (6.13) that constructs the complex cepstrum of a minimum-phase sequence from its real cepstrum. We interpret the minimum-phase construction of the maximum-phase term, i.e., cmax[n]l[n], as follows. From our earlier discussion, the complex cepstrum of the maximum-phase component can be written as

x̂max[n] = b^−n/n,   n ≤ −1,   and zero elsewhere.

Then, because

cmax[n] = (1/2)[x̂max[n] + x̂max[−n]] = −b^n/(2n),   n ≥ 1,

it follows that the minimum-phase construction of the maximum-phase term is given by

cmax[n]l[n] = −b^n/n,   n ≥ 1,

which is the complex cepstrum of (1 − bz^−1), resulting in the zero (1 − bz) flipped inside the unit circle.

This argument can be generalized for multiple poles and zeros outside the unit circle, i.e., all poles and zeros outside the unit circle are flipped inside. The accuracy of the resulting minimum-phase function depends, then, on the presence of zeros and poles outside the unit circle. A similar argument can be used for the counterpart maximum-phase construction (Exercise 6.15). We will see examples of the various phase options on real speech when we describe alternatives for speech synthesis in Section 6.8.
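The zero-flipping property of Example 6.12 can be checked numerically. In this sketch (the values a = 0.5 and b = 0.6 are illustrative assumptions), a spectrum with one pole and one maximum-phase zero is built directly on an FFT grid, the right-sided lifter is applied to its real cepstrum, and the result is compared against the sequence whose z-transform has the zero flipped inside the unit circle:

```python
import numpy as np

nfft = 4096
a, b = 0.5, 0.6                        # pole at z = a; maximum-phase zero at z = 1/b
w = 2 * np.pi * np.arange(nfft) / nfft
X = (1 - b * np.exp(1j * w)) / (1 - a * np.exp(-1j * w))

# Real cepstrum from log|X|, then the right-sided lifter of Equation (6.13)
c = np.real(np.fft.ifft(np.log(np.abs(X))))
l = np.zeros(nfft)
l[0] = 1.0
l[1:nfft // 2] = 2.0
xmp = np.real(np.fft.ifft(np.exp(np.fft.fft(c * l))))

# Predicted minimum-phase construction: (1 - b z^-1)/(1 - a z^-1),
# i.e., the zero (1 - bz) flipped inside the unit circle
n = np.arange(50)
h = a ** n - b * a ** (n - 1.0)
h[0] = 1.0
print(np.max(np.abs(xmp[:50] - h)))    # very small residual
```

The constructed sequence matches the flipped-zero sequence, while by design its spectral magnitude equals |X(ω)|.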

6.7.2 Complex Cepstrum of Unvoiced Speech

Recall the transfer function model from the source to the lips output for unvoiced speech:

H(z) = AV(z)R(z)

where, in contrast to the voiced case, there is no glottal volume velocity contribution. The resulting speech waveform in the time domain is given by

x[n] = u[n] * h[n]

where u[n] is white noise representing turbulence at the glottal opening or at some other constriction within the oral cavity. As before, we assume a rational z-transform for the vocal tract contribution which is stable, i.e., poles inside the unit circle, and which may have both minimum- and maximum-phase zeros introduced by coupling of a cavity behind the vocal tract constriction. As before, the radiation lip load is modeled by a single zero.

In short-time analysis, we begin with the windowing of a speech segment

s[n] = w[n](u[n] * h[n]).

The duration of the analysis window w[n] is selected so that the formants of the unvoiced speech power spectral density are not significantly broadened.10 In addition, in order to enforce a convolutional model, we make the assumption that w[n] is “sufficiently smooth” so as to be seen as nearly constant over h[n]. Therefore, as with voiced speech, we assume the convolutional model:

10 The periodogram of a discrete-time random process is the magnitude squared of its short-time Fourier transform divided by the sequence length and fluctuates about the underlying power spectral density of the random process (Appendix 5.A). Since the window multiplies the random process, the window’s transform is convolved with, i.e., “smears,” its power spectral density. This bias in the periodogram estimate of the power spectral density is reduced by making the window long, but this lengthening can violate waveform stationarity and increases the variance of the periodogram. The window choice is particularly difficult because unvoiced speech may consist of very long segments, e.g., voiceless fricatives, or very short segments, e.g., plosives. Furthermore, unvoiced plosives, as we saw in Chapter 3, have an impulse-like component (i.e., the initial burst) followed by an aspiration component.

s[n] ≈ (u[n]w[n]) * h[n].

Defining the windowed white noise as q[n] = u[n]w[n], a discrete complex cepstrum is computed with an N-point DFT and expressed as

ŝ[n] = q̂[n] + ĥ[n].

A similar expression can be found for the real cepstrum. Unlike the voiced case, the two components overlap in the low-quefrency region, and so we cannot separate ĥ[n] from q̂[n] by low-quefrency liftering; noise is splattered throughout the entire quefrency axis. Nevertheless, the spectral smoothing interpretation of homomorphic deconvolution gives insight as to why homomorphic deconvolution can still be applicable. As we have seen, we can think of log[X(ω)] as a “time signal” that consists of low-frequency and high-frequency contributions. The spectral smoothing interpretation of low-quefrency liftering indicates that we may interpret it as smoothing the fluctuations in log[X(ω)] that perturb the underlying system function H(ω), i.e., fluctuations due to the random source component that excites the vocal tract. Although this interpretation has approximate validity for the spectral log-magnitude, the smoothing viewpoint gives little insight for phase estimation. In practice, sensitivity of the unwrapped phase to small perturbations in spectral nulls prohibits a meaningful computation; for stochastic sequences, the unwrapped phase can jump randomly from sample to sample in discrete frequency, and large trends can arise that do not reflect the underlying true phase function of H(ω). Moreover, the phase of the system function for unvoiced speech, excluding perhaps plosives, is not deemed to be perceptually important [3]. For these reasons, the real cepstrum has typically been used in practice for the unvoiced case,11 resulting in a zero- or minimum-phase counterpart to the mixed-phase system response. Finally, the excitation noise component is typically not extracted by high-quefrency liftering because it can be emulated in synthesis by a white-noise process.
Nevertheless, the deconvolved excitation may contain interesting fine source structure, e.g., with voiced fricatives, diplophonic and breathy voices, and nonacoustic-based distributed sources (Chapter 11) that contribute to the quality and distinction of the sound and speaker.
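The spectral smoothing interpretation for unvoiced speech can be illustrated with synthetic data. In this sketch, white noise drives an assumed two-coefficient all-pole system (the coefficients, window length, and lifter cutoff are illustrative values, not taken from the text), and low-quefrency liftering of the real cepstrum smooths the rapid source-induced fluctuations of the log spectrum:

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
nfft = 1024

# Synthetic unvoiced segment: white noise through an illustrative
# two-resonance all-pole system, then a Hamming window
a_coeffs = [1.0, -1.2, 0.8]                  # assumed denominator of H(z)
u = rng.standard_normal(400)
s = lfilter([1.0], a_coeffs, u) * np.hamming(400)

# Low-quefrency liftering of the real cepstrum smooths log|S(w)|,
# attenuating the rapid fluctuations contributed by the noise source
c = np.real(np.fft.ifft(np.log(np.abs(np.fft.fft(s, nfft)) + 1e-12)))
l = np.zeros(nfft)
l[:30] = 1.0
l[-29:] = 1.0                                 # symmetric (zero-phase) lifter
log_env = np.real(np.fft.fft(c * l))          # smoothed log-magnitude

# The smoothed envelope fluctuates far less bin-to-bin than the raw log spectrum
raw = np.log(np.abs(np.fft.fft(s, nfft)) + 1e-12)
print(np.std(np.diff(raw)), np.std(np.diff(log_env)))
```

No phase is estimated here, consistent with the discussion above: only the log-magnitude is smoothed.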

11 A cepstrum has been defined based on high-order moments of a random process [8]. Under certain conditions on the process, the phase of the vocal tract can then be estimated. Alternatively, smoothing of the real and imaginary parts of the complex Fourier transform may provide a useful phase estimate (Exercise 13.4).

6.8 Analysis/Synthesis Structures

In speech analysis, the underlying parameters of the speech model are estimated, and in synthesis, the waveform is reconstructed from the model parameter estimates. We have seen in short-time homomorphic deconvolution that, by liftering the low-quefrency region of the cepstrum, an estimate of the system impulse response is obtained. If we lifter the complementary high-quefrency region of the cepstrum, and invert with the inverse homomorphic system to obtain an excitation function, then convolution of the two resulting component estimates yields the original short-time segment exactly. With an overlap-add reconstruction from short-time segments, the entire waveform is recovered; we have, therefore, an identity system with no “information reduction.” (In linear prediction analysis/synthesis, this is analogous to reconstructing the waveform from the convolution of the all-pole filter and the output of its inverse filter.) In applications such as speech coding and speech modification, however, a more efficient representation is often desired. The complex or real cepstrum provides an approach to such a representation because pitch and voicing can be estimated from the peak (or lack of a peak) in the high-quefrency region of the cepstrum [7]. Other methods of pitch and voicing estimation will be described in Chapter 10. Assuming then that we have a succinct and accurate characterization of the speech production source, as with linear prediction-based analysis/synthesis, we are able to “efficiently” synthesize an estimate of the speech waveform. This synthesis can be performed with any one of several possible phase functions: zero-, minimum- (maximum-), or mixed-phase. The original homomorphic analysis/synthesis system, developed by Oppenheim [11], was based on zero- and minimum-phase impulse response estimates.
In this section, we describe this analysis/synthesis system, its generalization with a mixed-phase representation, and then an extension using spectral root deconvolution.

6.8.1 Zero- and Minimum-Phase Synthesis

The general framework for homomorphic analysis/synthesis is shown in Figure 6.19. In zero- or minimum-phase reconstruction, the analyzer consists of Fourier transforming a short-time speech segment, computing the real logarithm log |X(ω)|, and inverse transforming to generate the real cepstrum. A 1024-point DFT is sufficient to avoid DFT aliasing, and a typical frame interval is 10–20 ms. In the analysis stage, for voiced segments, a pitch-adaptive Hamming window of 2 to 3 pitch periods in length is first applied to the waveform. This choice of window length is needed to make the windowed periodic waveform approximately follow a convolutional model, as described in Section 6.6. The pitch and voicing estimates are obtained from the cepstrum or by some other means, as indicated by the waveform input to the pitch and voicing estimation module of Figure 6.19. The analysis stage yields a real cepstrum and an estimate of pitch and voicing. The cepstral lifter l[n] selects a zero- or minimum- (maximum-) phase cepstral representation. As discussed in Section 6.6, the cepstral lifter adapts to the pitch period and is of length less than the pitch period to avoid aliasing of replicas of the cepstrum of the system impulse response, ĥ[n].

In synthesis, the DFT of the liftered cepstral coefficients is followed by the complex exponential operator and inverse DFT. This yields either a zero- or minimum- (maximum-) phase estimate of the system impulse response. We denote this estimate by h[n, pL] for the pth frame and a frame interval of L samples. The excitation to the impulse response is generated on each frame in a manner similar to that in linear prediction synthesis (Figure 5.16). During voicing, the excitation consists of a train of unit impulses with spacing equal to the pitch period. This train is generated over a frame interval beginning at the last pitch pulse from the previous frame. During unvoicing, a white noise sequence is generated over the frame interval. We denote the excitation for the pth frame by u[n, pL]. One can then construct for each frame a short-time waveform estimate ŝ[n, pL] by direct convolution

Figure 6.19 General framework for homomorphic analysis/synthesis: (a) analysis; (b) synthesis. The alignment and correlation operations are used for mixed-phase reconstruction.

SOURCE: T.F. Quatieri, “Minimum- and Mixed-Phase Speech Analysis/Synthesis by Adaptive Homomorphic Deconvolution” [14]. ©1979, IEEE. Used by permission.


ŝ[n, pL] = u[n, pL] * h[n, pL]     (6.31)

and so the reconstructed waveform is given by

ŝ[n] = Σp ŝ[n, pL].

In order to avoid sudden discontinuities in pitch or spectral content, linear interpolation of the pitch period and impulse response is performed, with the convolution being updated at each pitch period. The pitch interpolation is performed by allowing each period to change according to its interpolated value at each pulse location. The impulse responses are constructed by linearly interpolating the response between h[n, pL] and h[n, (p − 1)L] at the interpolated pitch periods. This strategy leads to improved quality, in particular a less “rough” synthesis, over use of the frame-based convolution of Equation (6.31) [11],[14]. In comparing phase selections in synthesis, the minimum-phase construction was found in informal listening tests to be of higher quality than its zero-phase counterpart, with the maximum-phase version being least perceptually desirable; each has its own perceptual identity. The minimum-phase construction was considered most “natural”; the zero-phase rendition was considered most “muffled,” while the maximum-phase was most “rough” [11],[14].
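A stripped-down version of the frame-based synthesis of Equation (6.31) can be sketched as follows. The pitch-period and impulse-response interpolation described above is omitted, and the decaying-exponential impulse response and the frame parameters are illustrative assumptions:

```python
import numpy as np

def synthesize(h_frames, pitch_periods, L):
    """Simplified frame-based synthesis in the spirit of Equation (6.31):
    each frame's impulse-train excitation u[n, pL] (pulse spacing equal
    to that frame's pitch period, continued from the last pulse of the
    previous frame) is convolved with that frame's impulse response
    estimate h[n, pL], and the frame outputs are summed."""
    out = np.zeros(len(h_frames) * L + len(h_frames[0]))
    next_pulse = 0
    for p, (h, P) in enumerate(zip(h_frames, pitch_periods)):
        u = np.zeros(L)                       # excitation u[n, pL]
        while next_pulse < (p + 1) * L:
            u[next_pulse - p * L] = 1.0       # pulse carried across frames
            next_pulse += P
        seg = np.convolve(u, h)               # s[n, pL] = u[n, pL] * h[n, pL]
        out[p * L : p * L + len(seg)] += seg
    return out

# Usage: two frames with pitch periods of 50 and 60 samples and a
# decaying exponential standing in for the impulse response estimate
h = 0.8 ** np.arange(40)
y = synthesize([h, h], pitch_periods=[50, 60], L=100)
```

Carrying the pulse position across frames reproduces the text’s requirement that each frame’s impulse train begin at the last pitch pulse of the previous frame.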

We noted above that a Hamming window of the length of 2 to 3 pitch periods was applied to satisfy our constraint of Section 6.6 for a convolutional model. This window length was also found empirically to give good time-frequency resolution. It was observed that with a window much longer than 2 to 3 pitch periods, time resolution was compromised, resulting in a “slurred” synthesis because nonstationarities are smeared, blurring and mixing speech events. On the other hand, with an analysis window much shorter than 2 to 3 pitch periods, frequency resolution is compromised, resulting in a more “muffled” synthesis because speech formants are smeared, abnormally widening formant bandwidths (Exercise 6.6).

6.8.2 Mixed-Phase Synthesis

The analysis/synthesis framework based on mixed-phase deconvolution, using the complex cepstrum, is also encompassed in Figure 6.19. The primary difference from the zero- and minimum-phase systems is the pre- and post-alignment. As shown in Section 6.6, to make the windowed segment meet the conditions for a convolutional model, the window should be aligned with the system impulse response. This alignment is defined as centering the window at the time origin and removing any linear phase term z^r (in the z-domain) in the transfer function from the source to the lips output. The alignment removes any linear phase that manifests itself as a cepstral sinc-like function and can swamp the desired complex cepstrum (Example 6.5), corresponding to reducing distortion from D[n] in the exact cepstral representation of Equation (6.22).

An approximate, heuristic alignment can be performed by finding the maximum value of the short-time segment and locating the first zero crossing prior to this maximum value. Alternative, more rigorous, methods of alignment will be described in Chapters 9 and 10, and are also given in [14],[15]. Ideally, the alignment removes the linear phase term z^r; in practice, however, a small residual linear phase often remains after alignment. With low-quefrency liftering of the complex cepstrum, this residual linear phase appears as a time shift “jitter,” np, in the impulse response estimate over successive frames. The presence of jitter results in a small but audible change in pitch which can give a “hoarseness” to the reconstruction, as discussed in Chapter 3. If we assume the estimated transfer function is represented by a finite number of poles and zeros, we have on the pth frame

Ĥ(z, pL) = z^np H(z, pL)     (6.32)

where H(z, pL) is the rational pole/zero transfer function and z^np represents the time jitter remaining after alignment. One approach to remove np uses the unwrapped phase at π; in practice, however, this method is unreliable (Exercise 6.16). A more reliable approach is motivated by observing that only relative delays between successive impulse response estimates need be removed. The relative delay can be eliminated with a method of post-alignment in synthesis that invokes the cross-correlation function of two successive impulse response estimates [14],[15]. Given that system functions are slowly varying over successive analysis frames except for the random time shifts np, we can express two consecutive impulse response estimates as

h[n, pL] ≈ h[n − np]   and   h[n, (p + 1)L] ≈ h[n − np+1].

Their cross-correlation function, therefore, is given approximately by

r[n] ≈ rh[n − (np − np+1)]

where rh[n] is the autocorrelation function of h[n]. Therefore, the location of the peak in r[n] is taken as an estimate of the relative delay np − np+1. We then perform post-alignment by shifting the (p + 1)st impulse response estimate by np − np+1 points. When post-alignment is not performed, a “hoarse” quality is introduced in the reconstruction due to a change in the pitch period by the difference np − np+1 over successive frames [15].
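The post-alignment step can be sketched directly: the relative delay is taken as the lag of the cross-correlation peak between two successive impulse response estimates. The damped-cosine response below is an illustrative stand-in:

```python
import numpy as np

def relative_delay(h_p, h_p1):
    """Relative delay between two successive impulse response estimates,
    estimated as the lag of the peak of their cross-correlation."""
    r = np.correlate(h_p1, h_p, mode="full")
    return np.argmax(r) - (len(h_p) - 1)   # lag axis starts at -(len(h_p)-1)

# Usage: a damped-cosine stand-in response, delayed by 3 samples,
# should yield a relative delay of 3
h = 0.9 ** np.arange(60) * np.cos(0.3 * np.arange(60))
h_shift = np.concatenate([np.zeros(3), h[:-3]])
d = relative_delay(h, h_shift)
```

In synthesis, the later response would then be shifted back by the estimated delay before overlap with the previous frame.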

Example 6.13       Figure 6.20 shows a comparison of zero-, minimum-, and mixed-phase syntheses for a voiced speech segment from a male speaker. (This is the same segment used in Figure 5.18 to illustrate the minimum-phase reconstruction from linear prediction analysis/synthesis.) The three systems are identical except for the manner of introducing phase and the elimination of post-alignment in the minimum- and zero-phase versions. Specifications of the analysis/synthesis parameters are given in Section 6.8.1. Observe that the minimum-phase version differs from the minimum-phase reconstruction from linear prediction analysis because the homomorphic estimate does not impose an all-pole transfer function.

Figure 6.20 Homomorphic synthesis based on (b) mixed-phase, (c) zero-phase, and (d) minimum-phase analysis and synthesis. Panel (a) is the original.


For a database of five male and five female utterances (3–4 seconds in duration), in informal listening by ten experienced listeners, the mixed-phase system produced a small but audible improvement in quality over its minimum-phase counterpart [14],[15]. When a preference was expressed, the mixed-phase system was judged to reduce the “buzziness” of the minimum-phase reconstruction. Minimum-phase reconstructions are always more “peaky,” or less dispersive, than their mixed-phase counterparts because minimum-phase sequences have energy that is maximally compressed near the time origin [13], as reviewed in Chapter 2 and discussed in the context of linear prediction analysis in Chapter 5. This peakiness may explain the apparent buzziness one hears in minimum-phase reconstructions when compared to their mixed-phase counterparts. One also hears a “muffled” quality in the minimum-phase system relative to the mixed-phase system, due perhaps to the removal of accurate timing and fine structure of speech events when the original phase is replaced by its minimum-phase counterpart. These undesirable characteristics of the minimum-phase construction are further accentuated in the zero-phase construction. It is also interesting that differences in the perceived quality of zero-, minimum-, and mixed-phase reconstructions were found to be more pronounced for males than for females [15]. This finding is consistent with the human auditory system’s having less phase sensitivity to high-pitched harmonic waveforms, which is explained by the critical-band theory of auditory perception (described in Chapters 8 and 13) and which is supported by psychoacoustic experiments. The importance of phase will be further described in the context of other speech analysis/synthesis systems and auditory models throughout the text.

6.8.3 Spectral Root Deconvolution

We saw earlier that a sequence estimated from spectral root deconvolution approaches the sequence estimated from the complex cepstrum as the spectral root γ approaches zero (Exercise 6.9). Thus, under this condition, the two deconvolution methods perform similarly. On the other hand, for a periodic waveform, by selecting γ to give the greatest possible energy concentration in the transformed impulse response, we might expect better performance from spectral root deconvolution in speech analysis/synthesis. In particular, we expect that, because voiced speech is often dominated by poles, we can do better with the spectral root cepstrum than the complex cepstrum by selecting a fractional power close to −1. In fact, we saw earlier for an all-pole h[n] that perfect recovery can be obtained when the number of poles is less than the pitch period (Example 6.8); we also saw that, in general, spectral peaks (poles) are better preserved with smaller γ and spectral nulls (zeros) are better preserved with larger γ (Example 6.9). For real speech, similar results have been obtained [5]. As with cepstral deconvolution, in spectral root deconvolution we choose the analysis window according to the constraints of Section 6.6 so that a windowed periodic waveform approximately follows the desired convolutional model.
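A minimal sketch of the spectral root mapping follows. For simplicity, the principal-value phase is used in place of a full phase-unwrapping computation; this is adequate only for spectra, such as the single-pole example below, whose continuous phase never leaves the principal interval:

```python
import numpy as np

def spectral_root(x, gamma, nfft=4096):
    """Spectral root homomorphic transform: X(w) -> X(w)^gamma, computed
    as exp(gamma * log X(w)).  The principal-value phase is used here;
    a general implementation requires phase unwrapping."""
    X = np.fft.fft(x, nfft)
    logX = np.log(np.abs(X) + 1e-12) + 1j * np.angle(X)
    return np.real(np.fft.ifft(np.exp(gamma * logX)))

# For an all-pole h[n], gamma = -1 turns the poles into zeros: a
# single-pole response maps to the two-point sequence {1, -a}
a = 0.7
h = a ** np.arange(400)
y = spectral_root(h, -1.0)
print(np.round(y[:3], 6))   # approximately [1, -0.7, 0]
```

This illustrates why a fractional power near −1 concentrates the energy of a pole-dominated response, as exploited in the analysis/synthesis system described next.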

A minimum-phase spectral root homomorphic analysis/synthesis system, with γ = −1/3, was observed to preserve spectral formant peaks better than does the counterpart real-cepstrum analysis/synthesis [5]. For a few American English utterances, this spectral root system was judged in informal listening to give higher quality, in the sense of a more natural-sounding synthesis, than its real-cepstrum counterpart using the same window and the same pitch and voicing estimates. For γ close to zero, the quality of the two systems is similar, as predicted.

6.9 Contrasting Linear Prediction and Homomorphic Filtering

6.9.1 Properties

In the introduction to this chapter we stated that homomorphic filtering is an alternative to linear prediction for deconvolving the speech system impulse response and excitation input. It is of interest, therefore, to compare the properties of these two deconvolution techniques. Linear prediction, being parametric, tends to give sharp, smooth resonances corresponding to an all-pole model, while homomorphic filtering, being nonparametric, gives broader, less sharply defined resonances, consistent with the spectral smoothing interpretation of cepstral liftering. Linear prediction gives an all-pole representation, although a zero can be approximated by a large number of poles, while homomorphic filtering can represent both poles and zeros. Linear prediction is constrained to a minimum-phase response estimate, while homomorphic filtering can give a mixed-phase estimate by using the complex cepstrum. Speech analysis/synthesis by linear prediction is sometimes described as “crisper” but more “mechanical” than that by homomorphic filtering, which is sometimes perceived as giving a more “natural” but “muffled” waveform construction.

Nevertheless, many problems encountered by the two methods are similar. For example, both methods suffer more speech distortion with increasing pitch. Aliasing of the vocal tract impulse response at the pitch period repetition rate can occur in the cepstrum, as well as in the autocorrelation function, although aliasing can be better controlled in the complex cepstrum by appropriate time-domain windowing and quefrency liftering. In both cases, time-domain windowing can alter the assumed speech model. In the autocorrelation method of linear prediction, windowing results in the prediction of nonzero values of the waveform from zeros outside the window. In homomorphic filtering, windowing a periodic waveform distorts the convolutional model. Finally, there is the question of model order; in linear prediction, the number of poles is required, while in homomorphic filtering, the length of the low-quefrency lifter must be chosen. The best window and order selection in both methods is often a function of the pitch of the speaker. In the next section, we see that some of these problems can be alleviated by merging the two deconvolution methods.

6.9.2 Homomorphic Prediction

There are a number of speech analysis methods that rely on combining homomorphic filtering with linear prediction and are referred to collectively as homomorphic prediction. There are two primary advantages of merging these two analysis techniques: first, by reducing the effects of waveform periodicity, an all-pole estimate suffers less from high-pitch aliasing; second, by removing ambiguity in waveform alignment, zero estimation can be performed without requiring pitch-synchronous analysis.

Waveform Periodicity— Consider the autocorrelation method of linear prediction analysis. Recall from Chapter 5 that the autocorrelation function of a waveform consisting of the convolution of a short-time impulse train and an impulse response, i.e., x[n] = p[n] * h[n], equals the convolution of the autocorrelation function of the response and that of the impulse train

rx[τ] = rh[τ] * rp[τ].

Thus, as the spacing between impulses (the pitch period) decreases, the autocorrelation function of the impulse responses suffers from increasing distortion. If one can extract an estimate of the spectral magnitude of h[n] then linear prediction analysis can be performed with an estimate of rh[τ] free of the waveform periodicity. One approach is to first homomorphically deconvolve an estimate of h[n] by lowpass liftering the real or complex cepstrum of x[n]. The autocorrelation function of the resulting impulse response estimate can then be used by linear prediction analysis. The following example demonstrates the concept:

Example 6.14       Suppose h[n] is a minimum-phase all-pole sequence of order p. Consider a waveform x[n] constructed by convolving h[n] with a sequence p[n] where

p[n] = δ[n] + βδ[n − N],     with |β| < 1.

The complex cepstrum of x[n] is given by

x̂[n] = p̂[n] + ĥ[n]

where p̂[n] and ĥ[n] are the complex cepstra of p[n] and h[n], respectively, and which is of the form in Figure 6.9. The autocorrelation function, on the other hand, is given by

rx[τ] = (1 + β^2)rh[τ] + βrh[τ − N] + βrh[τ + N]

so that rh[τ] is distorted by its neighboring terms centered at τ = +N and τ = −N.

The example illustrates an important point: the first p coefficients of the real cepstrum of x[n] are undistorted (with a long-enough DFT length used in the computation), whereas the first p coefficients of the autocorrelation function of the waveform, rx[τ], are distorted by aliasing of autocorrelation replicas (regardless of the DFT length used in the computation). Therefore, a cepstral lowpass lifter of duration less than p extracts a smoothed, but unaliased, version of the spectrum. Moreover, the linear prediction coefficients can alternatively be obtained exactly through the recursive relation between the real cepstrum and the predictor coefficients of the all-pole model when h[n] is all-pole (Exercise 6.13) [1].

Nevertheless, it is important to note that we have not considered a windowed periodic waveform. That is, as seen in Section 6.6, the cepstrum of a windowed periodic waveform does indeed experience aliasing distortion, as does the autocorrelation function; this distortion, however, is minimized by appropriate selection of the function D[n].
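Setting aside the windowing caveat, the aliasing-free property of the cepstrum in Example 6.14 can be verified numerically, using the cepstrum-to-predictor recursion mentioned above (Exercise 6.13). The 2-pole model coefficients, pulse spacing N, and β below are illustrative assumptions:

```python
import numpy as np

# The predictor coefficients of an all-pole h[n] are recovered exactly
# from the first p real-cepstral coefficients of x[n] = h[n] * p[n],
# even though rx[tau] is aliased by the pulse train
nfft = 8192
a_true = np.array([1.0, -0.9, 0.5])            # assumed A(z), 2 poles
p_ord = 2
h = np.fft.ifft(1.0 / np.fft.fft(a_true, nfft)).real
N, beta = 40, 0.8
x = h + beta * np.roll(h, N)                   # h[n] * (d[n] + beta d[n-N])

# Real cepstrum; for 0 < n < N it equals the cepstrum of h[n] alone
c = np.real(np.fft.ifft(np.log(np.abs(np.fft.fft(x)) + 1e-15)))
hhat = 2.0 * c[1:p_ord + 1]                    # complex cepstrum of h, n = 1..p

# Cepstrum-to-predictor recursion: a[n] = -hhat[n] - sum (k/n) hhat[k] a[n-k]
a_est = np.zeros(p_ord + 1)
a_est[0] = 1.0
for n in range(1, p_ord + 1):
    acc = sum((k / n) * hhat[k - 1] * a_est[n - k] for k in range(1, n))
    a_est[n] = -hhat[n - 1] - acc

print(np.round(a_est, 6))                      # approximately [1, -0.9, 0.5]
```

The recovered coefficients match the model despite the periodicity in x[n], in contrast to an estimate drawn directly from rx[τ].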

Zero Estimation — Consider a transfer function of poles and zeros of the form

H(z) = A ∏k (1 − bkz^−1)(1 − ckz) / ∏k (1 − akz^−1)

and a sequence x[n] = h[n] * p[n] where p[n] is a periodic impulse train. Suppose that an estimate of h[n] is obtained through homomorphic filtering of x[n], and assume that the number of poles and zeros is known and that a linear-phase component z^−r has been removed. Then, following Kopec [4], we can estimate the poles of h[n] by using the covariance method of linear prediction (Chapter 5) with a prediction-error interval that is free of zeros. The Shanks method, or other methods described in Chapter 5, can then be applied to estimate the zeros. The following example illustrates the approach on a real speech segment:

Example 6.15       In this example, a rational z-transform consisting of 10 poles and 6 zeros is used to model a segment of the nasalized vowel /u/ in the word “moon” [4]. Homomorphic prediction was performed by first applying homomorphic filtering (low-quefrency liftering of the cepstrum) to obtain an impulse response estimate. The log-magnitude spectrum of this estimate is shown in Figure 6.21b. The covariance method of linear prediction analysis was then invoked to estimate the poles, and then the Shanks method was used to estimate the zeros. The method estimated a zero near 2700 Hz, which is typical for this class of nasalized vowels (Figure 6.21c).

Other zero estimation methods can also be combined with homomorphic filtering, such as the Steiglitz method (Exercise 5.6). In addition, forms of homomorphic prediction can be applied to deconvolve the glottal source from the vocal tract transfer function. Moreover, this synergism of homomorphic filtering and linear prediction analysis allows, under certain conditions, this source/system separation when both zeros and poles are present in the vocal tract system function (Exercise 6.22).

Figure 6.21 Homomorphic prediction applied to a nasalized vowel from /u/ in “moon”: (a) log-magnitude spectrum of speech signal; (b) log-magnitude spectrum obtained by homomorphic filtering (low-time liftering the real cepstrum); (c) log-magnitude spectrum of 10-pole/6-zero model with zeros from Shanks method.

SOURCE: G.E. Kopec, A.V. Oppenheim, and J.M. Tribolet, “Speech Analysis by Homomorphic Prediction” [4]. ©1977, IEEE. Used by permission.


6.10 Summary

In this chapter, we introduced homomorphic filtering with application to deconvolution of the speech production source and system components. The first half of the chapter was devoted to the theory of homomorphic systems and the analysis of homomorphic filtering for convolution. Both logarithmic and spectral root homomorphic systems were studied. The complex cepstrum was derived for an idealized voiced speech waveform as a short-time convolutional model consisting of a finite-length impulse train convolved with an impulse response having a rational z-transform. The complex cepstrum of a windowed periodic waveform was also derived and a set of conditions was established for the accuracy of the short-time convolutional model. The second half of the chapter applied the theory of homomorphic systems to real speech and developed a number of speech analysis/synthesis schemes based on zero-, minimum-, and mixed-phase estimates of the speech system response. Finally, the properties of linear prediction analysis were compared with those of homomorphic filtering for deconvolving speech waveforms, and the two methods were merged in homomorphic prediction.

Although this chapter focuses on homomorphic systems for convolution for the purpose of source/system separation, the approach is more general. For example, homomorphic systems have been designed to deconvolve multiplicatively combined signals to control dynamic range (Exercise 6.19). In dynamic range compression, a signal, e.g., speech or audio, is modeled as having an “envelope” that multiplies the signal “fine structure”; the envelope represents the time-varying volume and the fine structure represents the underlying signal. The goal is to separate the envelope and reduce its wide fluctuations. A second application not covered in this chapter is recovery of speech from degraded recordings. For example, old acoustic recordings suffer from convolutional distortion imparted by an acoustic horn that can be approximated by a linear resonant filter. The goal is to separate the speech or singing from its convolution with the distorting linear system. This problem is similar to the source/system separation problem, except that here we are not seeking the components of the speech waveform, but rather the speech itself. This problem can be even more difficult since the cepstra of the horn and speech overlap in quefrency. A solution of this problem, developed in Exercise 6.20, represents a fascinating application of homomorphic theory.

Other applications of homomorphic filtering will be seen throughout the remainder of the text. For example, the style of analysis in representing the speech system phase will be useful in developing different phase representations in the phase vocoder and sinewave analysis/synthesis. Other extensions of homomorphic theory will be introduced as they are needed. For example, in the application of speech coding (Chapter 12) and speaker recognition (Chapter 14) where the cepstrum plays a major role, we will find that speech passed over a telephone channel effectively loses spectral content outside of the band 200 Hz to 3400 Hz, so that the cepstrum is not defined outside of this range. For this condition, we will, in effect, design a new homomorphic system that in discrete time requires a mapping of the high spectral energy frequency range to the interval [0, π]. This modified homomorphic system was first introduced in the context of seismic signal processing where a similar spectral distortion arises [20]. Also in the application of speaker identification, we will expand our definition of the cepstrum to represent the Fourier transform of a constant-Q filtered log-spectrum, referred to as the mel-cepstrum. The mel-cepstrum is interesting for its own sake because it is hypothesized to approximate signal processing in the early stages of human auditory perception [22]. We will see in Chapter 13 that homomorphic filtering, applied along the temporal trajectories of the mel-cepstral coefficients, can be used to remove convolutional channel distortions even when the cepstrum of these distortions overlaps the cepstrum of speech. Two such methods that we will study are referred to as cepstral mean subtraction and RASTA processing.

EXERCISES

6.1 Consider the phase of the product X(z) = X1(z)X2(z). An ambiguity in the definition of phase arises since ∠X(z) = PV[∠X(z)] + 2πk, where k is any integer (PV = principal value).

(a) Argue that when k is selected to ensure phase continuity, then the phase of the product X1(z)X2(z) is the sum of the phases of the product components.

(b) Argue that the phase additivity property in part (a) can also be ensured when the phase is defined as the integral of the phase derivative of X(z) (see Exercise 6.2).
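A quick numerical check of this additivity (an illustrative sketch, not part of the exercise): with two minimum-phase factors, the continuity-based unwrapped phase of the product equals the sum of the unwrapped factor phases.

```python
import numpy as np

# Two minimum-phase factors with zeros at 0.5 and -0.3, inside the unit circle.
x1 = np.array([1.0, -0.5])
x2 = np.array([1.0, 0.3])
N = 1024
X1 = np.fft.fft(x1, N)
X2 = np.fft.fft(x2, N)

# Unwrapped (continuous) phase: principal value with 2*pi jumps removed.
ph1 = np.unwrap(np.angle(X1))
ph2 = np.unwrap(np.angle(X2))
ph_prod = np.unwrap(np.angle(X1 * X2))

# With continuity-based unwrapping, the phase of the product is additive.
print(np.max(np.abs(ph_prod - (ph1 + ph2))))  # ~0
```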

6.2 Show that the phase derivative θ′(ω) of the Fourier transform X(ω) = |X(ω)|e^{jθ(ω)} of a sequence x[n] can be obtained through the real and imaginary parts of X(ω), Xr(ω) and Xi(ω), respectively, as

θ′(ω) = [Xr(ω)Xi′(ω) − Xi(ω)Xr′(ω)] / |X(ω)|²

where |X(ω)|, the Fourier transform magnitude of x[n], is assumed non-zero.
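The identity can be checked numerically (an illustrative sketch; the sequence and frequency grid are arbitrary choices): evaluate the DTFT and its frequency derivative term by term, apply the formula, and compare against a numerical derivative of the unwrapped phase.

```python
import numpy as np

x = np.array([1.0, 0.6, 0.25, -0.1])   # arbitrary real sequence
w = np.linspace(0, np.pi, 2000)
n = np.arange(len(x))
E = np.exp(-1j * np.outer(w, n))       # DTFT matrix: X(w) = sum_n x[n] e^{-jwn}
X = E @ x
dX = E @ (-1j * n * x)                 # dX/dw, differentiated term by term

Xr, Xi = X.real, X.imag
theta_dot = (Xr * dX.imag - Xi * dX.real) / np.abs(X) ** 2

# Compare with a numerical derivative of the unwrapped phase.
theta_dot_num = np.gradient(np.unwrap(np.angle(X)), w)
err = np.max(np.abs(theta_dot - theta_dot_num)[5:-5])
print(err)    # small
```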

6.3 This problem addresses the nature of the phase function for poles and zeros of a rational z-transform in Equation (6.9).

(a) Show that the phase on the unit circle of the terms (1 − az−1), (1 − bz), (1 − cz−1), and (1 − dz) all begin at zero (at ω = 0) and end at zero (at ω = π).

(b) Show that the absolute value of the phase in (a) is less than π.

Hint: Use a vector argument.

6.4 Show that the complex cepstrum of a minimum- or maximum-phase sequence can be recovered within a scale factor from the phase of its Fourier transform. Hint: Recall that the phase of X(ω) is the imaginary part of log[X(ω)] that maps to the odd part of the complex cepstrum.

6.5 Suppose we are given samples of the phase derivative of the Fourier transform of a sequence. Consider a linear numerical integration scheme for computing the unwrapped phase in Equation (6.16) based on these samples. Argue why this scheme may be flawed when poles or zeros are located very close to the unit circle. How might this problem affect the phase derivative-based phase unwrapping algorithm of Section 6.4?

6.6 It has been observed that nasal formants have broader bandwidths than those of non-nasal voiced sounds. This is attributed to the greater viscous friction and thermal loss due to the large surface area of the nasal cavity. In linear prediction or homomorphic analysis of a short-time sequence s[n], the resulting vocal tract spectral estimates often have wider bandwidths than the true underlying spectrum. Consequently, when the spectral estimates are used in synthesizing speech, the resulting synthesized speech is characterized by a “nasalized” or “muffled” quality.

(a) Using the frequency-domain linear prediction error expression given in Equation (5.38) of Chapter 5, give a qualitative explanation of bandwidth broadening by linear prediction analysis. Hint: Consider the error contribution in low-energy regions of |S(ω)|2.

(b) We have seen that low-quefrency liftering in homomorphic analysis is equivalent to filtering the complex logarithm. Suppose that l[n] is the lifter with N-point discrete Fourier transform L(k), and that log[S(k)] is the logarithm of the N-point discrete Fourier transform of the speech sequence s[n]. Write an expression for the modified logarithm (in the discrete frequency variable k) after application of the lifter l[n]. How might this expression explain formant broadening by homomorphic analysis?

(c) Referring to parts (a) and (b), explain qualitatively why formant broadening can be more severe for female (and children) than for male speakers. A distinct answer should be given for each technique (i.e., linear prediction and homomorphic analysis).

(d) Suppose the vocal tract impulse response h[n] is all-pole and stable with system function

H(z) = A / (1 − Σ_{k=1}^{p} a_k z^{-k})

that can be rewritten as

H(z) = A / Π_{k=1}^{p} (1 − b_k z^{-1})

with poles z = bk that are located inside the unit circle. Formant bandwidth is related to the distance of a pole from the unit circle. As the pole approaches the unit circle, the bandwidth narrows. To compensate for formant bandwidth widening, in both linear prediction and homomorphic analysis, the poles are moved closer to the unit circle. Show that poles are moved closer to the unit circle by the transformation

(6.33)

H̃(z) = H(z/α) = A / Π_{k=1}^{p} (1 − α b_k z^{-1})

where α (real) is greater than unity and |αbk| < 1.
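Assuming Equation (6.33) is the pole-radius scaling bk → αbk (equivalently H(z/α), whose denominator coefficients are ak α^k), its effect on pole radii can be verified numerically (illustrative NumPy sketch):

```python
import numpy as np

# Complex-conjugate pole pair at radius 0.90; alpha scales each pole radius.
poles = 0.90 * np.exp(np.array([1j, -1j]) * 0.3 * np.pi)
a = np.real(np.poly(poles))          # denominator coefficients of H(z)

alpha = 1.05                         # alpha > 1 and |alpha * b_k| < 1
k = np.arange(len(a))
a_mod = a * alpha ** k               # denominator coefficients of H(z/alpha)

print(np.abs(np.roots(a)))           # radii 0.90: original poles
print(np.abs(np.roots(a_mod)))       # radii 0.945: moved toward unit circle
```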

(e) Suppose h[n] in part (d) has a z-transform that consists of a single pole of the form

H(z) = 1 / (1 − b_o z^{-1})

and that b_o is real. Sketch the real cepstrum associated with H(z) and with its transformed counterpart from part (d). For a general H(z), how is the complex cepstrum modified by the transformation in Equation (6.33)? Give an expression in terms of the original complex cepstrum.

6.7 In Section 6.5, we introduced the spectral root homomorphic system. In this problem you are asked to develop some of the properties of this system for sequences with rational z-transforms.

(a) Derive the spectral root cepstrum for zeros outside the unit circle, and for poles inside and outside the unit circle.

(b) Show that if x[n] is minimum phase, its spectral root cepstrum can be obtained from |X(ω)|.

(c) Let p[n] be a train of equally-spaced impulses, p[n] = Σ_k δ[n − kN]. Show that its spectral root cepstrum remains an impulse train with the same spacing N. Hint: Use the relation between the spectral root cepstrum and the complex cepstrum derived in Section 6.5.

6.8 Suppose h[n] is an all-pole sequence of order q. Consider a waveform x[n] created by convolving h[n] with a sequence p[n], where β < 1:

p[n] = δ[n] + βδ[n − N]

so that

x[n] = p[n] * h[n]

where q < N, and where

P(z) = 1 + βz^{-N}.

(a) Argue that P^{-1}(z) represents an impulse train with impulses spaced by N samples. Hint: Use the Taylor series expansion for (1 + βz^{-1})^{-1} and replace z by z^N. Do not give an explicit expression.
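Part (a) can be illustrated numerically (a sketch with arbitrary β and N): the truncated Taylor series of P^{-1}(z) is an impulse train at multiples of N, and convolving it with p[n] recovers approximately δ[n].

```python
import numpy as np

beta, N, K = 0.8, 10, 8              # K + 1 terms of the series kept

# Truncated expansion of 1/(1 + beta z^-N): impulses (-beta)^k at n = kN.
p_inv = np.zeros(K * N + 1)
p_inv[::N] = (-beta) ** np.arange(K + 1)

# Convolving with p[n] = delta[n] + beta delta[n-N] gives ~delta[n], up to
# a residual of size beta^(K+1) at the tail of the truncated series.
p = np.zeros(N + 1)
p[0], p[N] = 1.0, beta
r = np.convolve(p, p_inv)
print(r[0], np.max(np.abs(r[1:K * N])))   # ~1, ~0
```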

(b) Show that h[n] can be deconvolved from x[n] by inverting X(z) [i.e., forming X^{-1}(z)] and “liftering” h^{-1}[n], the inverse z-transform of H^{-1}(z). Sketch a block diagram of the deconvolution algorithm as a homomorphic filtering procedure. Hint: H^{-1}(z) is all-zero.

(c) Suppose we add noise to x[n] and suppose a resonance of h[n] is close to the unit circle. Argue that the inversion process is very sensitive under this condition and thus the method may not be useful in practice.

6.9 This problem contrasts the spectral root and complex cepstrum.

(a) For x[n] = p[n] * h[n], with p[n] a periodic impulse train, discuss the implications of p[n] not being aligned with the origin for both the complex and spectral root cepstrum.

(b) Using the relation between the spectral root cepstrum and the complex cepstrum, Equation (6.20), show that as γ approaches zero, the spectral root deconvolution system approaches the complex cepstral deconvolution.

(c) Show why it is necessary to perform two phase-unwrapping operations, one for the forward X(ω)^γ and one for the inverse X(ω)^{1/γ} transformations, when γ is an integer. How does this change when γ is not an integer?

6.10 In this problem, a phase unwrapping algorithm is derived which does not depend on finding 2π jumps. Consider the problem of computing the unwrapped phase of the Fourier transform of a sequence x[n]:

X(ω) = |X(ω)|e^{jθ(ω)}.

The corresponding all-pass function is given by

(6.34)

Xa(ω) = X(ω)/|X(ω)| = e^{jθ(ω)}

where xa[n] is the inverse Fourier transform of Xa(ω).

(a) By determining the phase derivative associated with Equation (6.34), show that

(6.35)

Image

(b) Why might the “direct” unwrapped-phase algorithm in Equation (6.35) be flawed even when the sequence x[n] is of finite extent? In answering this question, consider how you would compute the all-pass sequence xa[n].
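The computational issue in part (b) can be seen in a short NumPy sketch (illustrative values): even for a 3-sample x[n], the all-pass sequence xa[n], computed by dividing out |X(ω)| on a sampled frequency grid, is not of finite extent, so any DFT-based computation of xa[n] involves truncation and time aliasing.

```python
import numpy as np

# Finite-length x[n] with zeros at 0.7 and 0.4, inside the unit circle.
x = np.array([1.0, -1.1, 0.28])
N = 4096

X = np.fft.fft(x, N)
Xa = X / np.abs(X)                   # all-pass function: |Xa(w)| = 1
xa = np.real(np.fft.ifft(Xa))        # all-pass sequence

# Although x[n] has only 3 samples, xa[n] has energy well beyond them:
tail = np.sum(xa[3:] ** 2) / np.sum(xa ** 2)
print(tail)
```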

6.11 This problem further investigates in the quefrency domain the complex cepstrum of windowed periodic waveforms, which was described in Section 6.6.1. Assume W(ω) is rectangular with width equal to the fundamental frequency (pitch) as in Section 6.6.1.

(a) Under what conditions does s[n] ≈ (p[n]w[n]) * h[n]? When these conditions do not hold, describe the kinds of distortions introduced into the estimate of h[n] using homomorphic deconvolution. Why is the distortion more severe for high-pitched than for low-pitched speakers?

(b) In selecting a time-liftering window for deconvolving the vocal tract impulse response, how is its width influenced by minimum-, mixed-, and zero-phase sequences?

(c) Design an “optimal lifter” that compensates for the distortion function D[n]. What are some of the problems you encounter in an optimal design? How does the design change with pitch and with the sequence h[n] being minimum-, mixed-, or zero-phase?

6.12 This problem further investigates in the frequency domain the complex cepstrum of windowed periodic waveforms, which was described in Section 6.6.2. We have seen that to use homomorphic deconvolution to separate the components of the speech model, the speech signal, x[n] = h[n] * p[n], is multiplied by a window sequence, w[n], to obtain s[n] = x[n]w[n]. To simplify analysis, s[n] is approximated by:

s[n] ≈ (w[n]p[n]) * h[n]

Assume in the following that h[n] has no linear phase:

(a) Suppose that the window Fourier transform W(ω) is a triangle symmetric about ω = 0 so that w[n] is the square of a sinc function and that W(ω) is zero-phase, i.e., w[n] is centered at the time origin. Suppose that the width of W(ω) is much greater than the fundamental frequency. Discuss qualitatively the problem of extracting the unwrapped phase of h[n]. Repeat for the case where the width of W(ω) is much less than the fundamental frequency. Explain why a width of exactly twice the fundamental frequency provides an “optimal” interpolation across harmonic samples of X(ω) for extracting unwrapped phase.

(b) Consider W(ω) of part (a) but with a linear phase corresponding to a displacement of w[n] from the time origin. Discuss qualitatively the problem of extracting the unwrapped phase of h[n].

(c) Repeat parts (a) and (b) with a Hamming window whose width in the frequency domain is defined as the 3-dB width of its mainlobe.

6.13 This problem explores recursive relations between the complex cepstrum and a minimum-phase sequence.

(a) Derive a recursive relation for obtaining a minimum-phase sequence h[n] directly from its complex cepstrum ĥ[n]. In determining the value h[0], assume that the z-transform of h[n] is rational with positive gain factor A. Derive a similar recursion for obtaining the complex cepstrum ĥ[n] from h[n]. Hint: Derive a difference equation based on cross-multiplying in the relation H′(z) = Ĥ′(z)H(z), which follows from H(z) = exp[Ĥ(z)].
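One form of the recursion sought in part (a) is h[0] = exp(ĥ[0]) and h[n] = Σ_{k=1}^{n} (k/n) ĥ[k] h[n−k] for n > 0. It can be checked against a case with a known closed-form cepstrum (illustrative sketch):

```python
import numpy as np

# Minimum-phase example with a known closed form: h[n] = a^n for n >= 0,
# i.e., H(z) = 1/(1 - a z^-1), whose complex cepstrum is a^n/n for n >= 1
# and log A = 0 at n = 0.
a, L = 0.6, 30
hhat = np.zeros(L)
m = np.arange(1, L)
hhat[1:] = a ** m / m

# Recursion: h[0] = exp(hhat[0]);  h[n] = sum_{k=1}^{n} (k/n) hhat[k] h[n-k].
h = np.zeros(L)
h[0] = np.exp(hhat[0])
for n in range(1, L):
    k = np.arange(1, n + 1)
    h[n] = np.sum((k / n) * hhat[k] * h[n - k])

print(np.max(np.abs(h - a ** np.arange(L))))   # ~0
```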

(b) Suppose h[n] is all-pole (minimum-phase) of order p. Show that the all-pole (predictor) coefficients can be obtained recursively from the first p complex cepstral coefficients of h[n] (not including the value at n = 0).

(c) Argue that, when an all-pole h[n] is convolved with a pulse train p[n], the all-pole (predictor) coefficients can be obtained from the complex cepstrum of x[n] = h[n] * p[n], provided that p < P, where P is the pitch period in samples. Explain the “paradox” that truncating the complex cepstrum of x[n] at p coefficients and inverting via the inverse Fourier transform and the complex exponential operator smears the desired spectrum H(ω), even though the first p coefficients contain all information necessary for recovering the minimum-phase sequence.

6.14 Suppose that the real cepstrum c[n] is used to deconvolve a speech short-time segment x[n] into source and system components. Argue that the resulting zero-phase impulse response is an even function of time. Suppose the odd part of the complex cepstrum Image is low-quefrency liftered. Argue that an all-pass transfer function is obtained for the system estimate and that the corresponding impulse response is an odd function of time.

6.15 Derive the maximum-phase counterpart to Example 6.12. That is, show by example that the maximum-phase construction of a mixed-phase sequence with a rational z-transform flips poles and zeros inside the unit circle to outside the unit circle. Show that the maximum- and minimum-phase constructions are time reversals of each other.

6.16 It was stated in Section 6.8.2 that one approach to remove the jitter np in a speech system impulse response, corresponding to the linear-phase residual in Equation (6.32), is to measure the unwrapped phase at π.

(a) For a rational transfer function of the form in Equation (6.32), propose a method to determine np from H(z, pL) evaluated at z = e^{jπ} = −1.

(b) Argue why, in practice, this method is unreliable. Hint: Consider lowpass filtering prior to A/D conversion, as well as a mixed-source excitation, e.g., the presence of breathiness during voicing.

6.17 Consider a signal which consists of the convolution of a periodic impulse train p[n], which is minimum-phase, and a mixed-phase signal h[n] = hmin[n] * hmax[n] with poles configured in Figure 6.22.

(a) Using homomorphic processing on y[n] = h[n] * p[n], describe a method to create a signal

y′[n] = h′[n] * p[n]

where

h′[n] = hmin[n] * h̃max[n]

with h̃max[n] being the minimum-phase counterpart of hmax[n] (i.e., poles outside the unit circle are flipped to their conjugate-reciprocal locations inside the unit circle).

Figure 6.22 Pole configuration for mixed-phase signal of Exercise 6.17.

Image

(b) Suppose that the period of p[n] is long relative to the length of the real cepstrum of h[n]. Suppose also that homomorphic prediction (homomorphic filtering followed by linear prediction analysis) with the real cepstrum is performed on the signal y[n] = h[n] * p[n], using a 4-pole model. Sketch (roughly) the pole locations of the (homomorphic prediction) estimate of the minimum-phase counterpart to h[n].

6.18 In Exercise 5.20 of Chapter 5, you considered the analysis of a speech-like sequence consisting of the convolution of a glottal flow (modeled by two poles), a minimum-phase vocal tract impulse response (modeled by p poles), and a source of two impulses spaced by P samples. You are asked here to extend this problem to deconvolving the glottal flow and vocal tract impulse response using the complex and real cepstra. In working on this problem, please refer back to Exercise 5.20.

(a) Determine and sketch the complex cepstrum of s[n] of Exercise 5.20 [part (c)]. Describe a method for estimating the glottal pulse and the vocal tract response v[n] from the complex cepstrum. Assume that the tails of the complex cepstra of the glottal pulse and of v[n] are negligibly small for |n| > P.

(b) Now we use the real cepstrum to estimate the glottal pulse and v[n]. First determine and sketch the real cepstrum of s[n] of Equation (5.66). Next, suppose that you are given only the spectral magnitude |S(ω)|. Devise a procedure to recover both the glottal pulse and v[n]. Again assume that the tails of the real cepstra of the glottal pulse and of v[n] are negligibly small for |n| > P. Assume you know a priori that the glottal pulse is maximum-phase.

6.19 Consider a speech or audio signal x[n] = e[n]f[n] with “envelope” e[n] and “fine structure” f[n]. The envelope (assumed positive) represents a slowly time-varying volume fluctuation and the fine structure represents the underlying speech events (Figure 6.23).

(a) Design a homomorphic system for multiplicatively combined signals that maps the time-domain envelope and fine structure components of x[n] to additively combined signals. In your design, consider the presence of zero crossings in f[n], but assume that f[n] never equals zero exactly. Hint: Use the magnitude of x[n] and save the sign information.

Figure 6.23 Acoustic signal with time-varying envelope.

Image

(b) Suppose that e[n] has a wide dynamic range, i.e., the values of e[n] have wide fluctuations over time. Assume, however, that the fluctuations of e[n] occur slowly in time, representing slowly-varying volume changes. Design a homomorphic filtering scheme, based on part (a), that separates the envelope from x[n]. Assume that the spectral content of f[n] is high-frequency, while that of e[n] is low-frequency, and the two fall in roughly disjoint frequency bands. In addition, assume that the logarithm of these sequences (with non-overlapping spectra) yields sequences that are still spectrally non-overlapping. Design an inverse homomorphic system that reduces dynamic range in x[n]. Sketch a flow diagram of your system.
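A sketch of one possible system for parts (a) and (b) in NumPy (all signals and the moving-average lowpass are invented illustrative choices):

```python
import numpy as np

n = np.arange(8000)
e = 1.0 + 0.8 * np.sin(2 * np.pi * n / 4000)    # slow, positive envelope
f = np.sin(2 * np.pi * n / 20)                   # fast fine structure
x = e * f

# Homomorphic map for multiplication: log|x|, with the sign saved separately.
sign = np.sign(x)
logmag = np.log(np.abs(x) + 1e-12)               # guard against exact zeros

# Linear filtering on the log: a moving average keeps the low-frequency
# log-envelope and rejects the high-frequency log-fine-structure.
M = 401
log_env = np.convolve(logmag, np.ones(M) / M, mode='same')
e_est = np.exp(log_env)                          # envelope, to within a gain

# Inverse system for dynamic range compression: divide out part of the
# estimated envelope and restore the sign.
y = sign * np.abs(x) / np.sqrt(e_est)
```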

6.20 It has been of widespread interest to restore old acoustic recordings, e.g., Enrico Caruso recordings, that were made up to the 1920’s with recording horns of the form shown in Figure 6.24. The recording horns distorted the source by introducing undesirable resonances. This problem walks you through the steps in using homomorphic filtering for restoration.

Engineers in the 1920’s were excellent craftsmen and managed to avoid nonlinearities in the system of Figure 6.24. The signal v(t), representing the grooves in the record, can therefore be approximately expressed as a convolution of the operatic singing voice s(t) with a (resonant) linear acoustic horn h(t):

Figure 6.24 Configuration of old acoustic recordings.

Image

(6.36)

v(t) = s(t) * h(t).

Our goal is to recover s(t) without knowing h(t), a problem that is sometimes referred to as “blind deconvolution.” The approach in discrete time is to find an estimate ĥ[n] of the horn, invert its frequency response to form the inverse filter ĥ^{-1}[n], and then apply inverse filtering to obtain an estimate of s[n]. (The notation “hat” denotes an estimate, and not a cepstrum as in this chapter.) We will accomplish this through a homomorphic filtering scheme proposed by Stockham [18] and shown in Figure 6.25.

(a) Our first step is to window v[n] with a sliding window w[n] so that each windowed segment is given by

vi[n] = v[n + iL]w[n]

where we slide v[n] L samples at a time under the window w[n]. If w[n] is “long and smooth relative to s[n],” argue for and against the following convolutional approximation:

(6.37)

vi[n] ≈ si[n] * h[n]

where si[n] = w[n]s[n + iL] and where we have ignored any shift in h[n].

(b) Determine the complex cepstrum of vi[n] (from the approximation in Equation (6.37)) and argue why we cannot separate out s[n] using the homomorphic deconvolution method that requires liftering the low-quefrency region of the complex cepstrum.

(c) Suppose that we average the complex logarithm (the Fourier transform of the complex cepstrum) over many segments, i.e.,

Vavg(ω) = (1/I) Σ_{i=0}^{I−1} log[Vi(ω)].

Give an expression for Vavg in terms of Savg(ω) = (1/I) Σ_{i=0}^{I−1} log[Si(ω)] and log[H(ω)]. Suppose that Savg = constant. Describe a procedure for extracting H(ω) to within a gain factor.
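The averaging idea can be simulated (illustrative sketch: a white-noise stand-in for the singer, an arbitrary FIR stand-in for the horn): the average log-spectrum then tracks log|H(ω)| up to an additive constant.

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.array([1.0, 1.2, 0.7, 0.2])      # stand-in "horn" response (FIR)
Nfft, I = 512, 400

# Average the log-spectra of many distorted segments v_i = s_i * h.
V_sum = np.zeros(Nfft)
for _ in range(I):
    s = rng.standard_normal(256)         # "singer" segments: white noise, so
    v = np.convolve(s, h)                # that log|S_i| averages to ~constant
    V_sum += np.log(np.abs(np.fft.fft(v, Nfft)) + 1e-12)
V_avg = V_sum / I

# V_avg ~ log|H| + constant: compare the two shapes after removing means.
logH = np.log(np.abs(np.fft.fft(h, Nfft)))
d1 = V_avg - V_avg.mean()
d2 = logH - logH.mean()
print(np.corrcoef(d1, d2)[0, 1])        # close to 1
```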

Figure 6.25 Restoration of Caruso based on homomorphic filtering.

Image

(d) In reality, the operatic singing voice does not average to a constant, but to a spectrum with some resonant structure. Suppose we had formed an average log-spectrum from a contemporary opera singer with “similar” voice characteristics as Enrico Caruso. Describe a method for extracting H(ω) to within a gain factor (assuming no horn distortion on the contemporary opera singer). Devise a homomorphic filtering scheme for recovering Enrico Caruso’s voice.

(e) Write out your expression Vavg in part (c) in terms of the spectral log-magnitude and phase. Explain why in practice it is very difficult to estimate the phase component of Vavg. Consider spectral harmonic nulls, as well as the possibility that the frequency response of the singer and the acoustic horn may have a large dynamic range with deep spectral nulls, and that the spectrum of the horn may have little high-frequency energy. Derive the zero-phase and minimum-phase counterparts to your solutions in parts (c) and (d). Both of these methods avoid the phase estimation problem. After inverse filtering (via homomorphic filtering), you will be left with an estimate of the singer’s frequency response in terms of the actual singer’s frequency response and the phase of the acoustic horn filter (that you did not estimate). If the horn’s frequency response is given by

H(ω) = |H(ω)|e^{jθH(ω)},

and assuming you can estimate |H(ω)| exactly, then write the singer’s frequency response estimate Image in terms of the actual singer’s spectrum S(ω) and the phase of the horn θH(ω). Do this for both your zero-phase and minimum-phase solutions. Assume that the horn’s phase θH(ω) has both a minimum- and maximum-phase component:

θH(ω) = θH,min(ω) + θH,max(ω)

What, if any, may be the perceptual significance of any phase distortion in your solution?

(f) Denote an estimate of the horn’s inverse filter by g[n]. Using the result from Exercise 2.18, propose an iterative scheme for implementing g[n] and thus for restoring Enrico Caruso’s voice. Why might this be useful when the horn filter has little high-frequency energy?

6.21 (MATLAB) In this problem, you use the speech waveform speech1_10k (at 10000 samples/s) in the workspace ex6M1.mat located in the companion website directory Chap exercises/chapter6. The exercise works through a problem in homomorphic deconvolution, leading to the method of homomorphic prediction.

(a) Window speech1_10k with a 25-ms Hamming window. Using a 1024-point FFT, compute the real cepstrum of the windowed signal and plot. For a clear view of the real cepstrum, set the first cepstral value to zero (which is the DC component of the log-spectral magnitude) and plot only the first 256 cepstral values.

(b) Estimate the pitch period (in samples and in milliseconds) from the real cepstrum by locating a distinct peak in the quefrency region.

(c) Extract the first 50 low-quefrency real cepstral values using a lifter of the form

l[n] = 1, 0 ≤ n < 50;  l[n] = 0, otherwise

Then Fourier transform (using a 1024-point FFT) and plot the first 512 samples of the resulting log-magnitude and phase.

(d) Compute and plot the minimum-phase impulse response using your result from part (c). Plot just the first 200 samples to obtain a clear view. Does the impulse response resemble one period of the original waveform? If not, then why not?

(e) Use your estimate of the pitch period (in samples) from part (b) to form a periodic unit sample train, thus simulating an ideal glottal pulse train. Make the length of the pulse train 4 pitch periods. Convolve this pulse train with your (200-sample) impulse response estimate from part (d) and plot. You have now synthesized a minimum-phase counterpart to the (possibly mixed-phase) vowel speech1_10k. What are the differences between your construction and the original waveform?

(f) Compute and plot the autocorrelation function of the impulse response estimate from part (d). Then, using this autocorrelation function, repeat parts (b) through (d) of Exercise 5.23 of Chapter 5 to obtain an all-pole representation of the impulse response. Does the log-magnitude of the all-pole frequency response differ much from the one obtained in Exercise 5.23? You have now performed “homomorphic prediction” (i.e., homomorphic deconvolution followed by linear prediction), which attempts to reduce the effect of waveform periodicity in linear prediction analysis.
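For readers working outside MATLAB, parts (a)–(d) can be prototyped in NumPy; since speech1_10k is not reproduced here, a synthetic periodic waveform stands in for it (all parameter choices below are illustrative):

```python
import numpy as np

# Synthetic stand-in for speech1_10k: an impulse train with period 80
# samples (8 ms at 10 kHz) convolved with a decaying response, windowed.
P, L, Nfft = 80, 250, 1024
h = 0.9 ** np.arange(40)
p = np.zeros(400); p[::P] = 1.0
x = np.convolve(p, h)[:L] * np.hamming(L)

# Part (a): real cepstrum; zero the DC cepstral value for plotting clarity.
c = np.real(np.fft.ifft(np.log(np.abs(np.fft.fft(x, Nfft)) + 1e-10)))
c[0] = 0.0

# Part (b): pitch period from the distinct cepstral peak beyond low quefrency.
q = 40 + np.argmax(c[40:256])
print(q)   # ~80 samples, i.e., ~8 ms at 10 kHz

# Parts (c)-(d): keep 50 low-quefrency values, weighting positive quefrencies
# by 2 for a minimum-phase construction (gain normalized since c[0] = 0).
lift = np.zeros(Nfft)
lift[0], lift[1:50] = 1.0, 2.0
Hmin = np.exp(np.fft.fft(c * lift))
hmin = np.real(np.fft.ifft(Hmin))    # minimum-phase impulse response estimate
```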

6.22 (MATLAB) In Exercise 5.25 you considered separating glottal source and vocal tract components of a nasalized vowel where the nasal tract introduces zeros into the vocal tract system function. In this problem, you are asked to extend this problem to deconvolving the glottal flow and vocal tract impulse response using the complex cepstrum. In working this problem, please refer back to Exercise 5.25.

(a) We assume that the zeros of the glottal pulse are maximum phase and the zeros due to the vocal tract nasal branch are minimum phase. Using your result from parts (a) and (b) of Exercise 5.25, propose a method based on homomorphic deconvolution for separating the minimum-phase vocal tract zeros, the maximum-phase glottal pulse zeros, and the minimum-phase vocal tract poles.

(b) Implement part (a) in MATLAB using the synthetic speech waveform speech1_10k (at 10000 samples/s) in workspace ex6M2.mat located in the companion website directory Chap exercises/chapter6. Assume 4 vocal tract poles, 2 vocal tract zeros, a glottal pulse width of 20 samples, and a pitch period of 200 samples. You should compute the predictor coefficients associated with the vocal tract poles using results of part (d) of Exercise 5.25 and inverse filter the speech waveform. Then find the coefficients of the vocal tract numerator polynomial and the glottal pulse using homomorphic filtering. You will find in ex6M2.readme, located in Chap exercises/chapter6, a description of how the synthetic waveform was produced and some suggested MATLAB functions of possible use.

BIBLIOGRAPHY

[1] B.S. Atal, “Effectiveness of Linear Prediction Characteristics of the Speech Wave for Automatic Speaker Identification and Verification,” J. Acoustical Society of America, vol. 55, no. 6, pp. 1304–1312, June 1974.

[2] B. Bogert, M. Healy, and J. Tukey, “The Quefrency Analysis of Time Series for Echoes,” chapter in Proc. Symposium on Time Series Analysis, M. Rosenblatt, ed., pp. 209–243, John Wiley and Sons, New York, 1963.

[3] G. Kubin, B.S. Atal, and W.B. Kleijn, “Performance of Noise Excitation for Unvoiced Speech,” Proc. IEEE Workshop on Speech Coding for Telecommunications, Sainte-Adele, Quebec, pp. 35–36, 1993.

[4] G.E. Kopec, A.V. Oppenheim, and J.M. Tribolet, “Speech Analysis by Homomorphic Prediction,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–25, no. 1, pp. 40–49, Feb. 1977.

[5] J.S. Lim, “Spectral Root Homomorphic Deconvolution System,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–27, no. 3, pp. 223–232, June 1979.

[6] R. McGowan and R. Kuc, “A Direct Relation Between a Signal Time Series and its Unwrapped Phase,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–30, no. 5, pp. 719–724, Oct. 1982.

[7] P. Noll, “Cepstrum Pitch Determination,” J. Acoustical Society of America, vol. 41, pp. 293–309, Feb. 1967.

[8] C.L. Nikias and M.R. Raghuveer, “Bispectrum Estimation: A Digital Signal Processing Framework,” Proc. IEEE, vol. 75, no. 7, pp. 869–891, July 1987.

[9] A.V. Oppenheim, “Superposition in a Class of Nonlinear Systems,” Ph.D. Thesis, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, Cambridge, MA, 1965.

[10] A.V. Oppenheim, “Generalized Superposition,” Information Control, vol. 11, nos. 5–6, pp. 528–536, Nov.–Dec. 1967.

[11] A.V. Oppenheim, “Speech Analysis/Synthesis Based on Homomorphic Filtering,” J. Acoustical Society of America, vol. 45, pp. 458–465, Feb. 1969.

[12] A.V. Oppenheim and R.W. Schafer, “Homomorphic Analysis of Speech,” IEEE Trans. Audio and Electroacoustics, vol. AU–16, pp. 221–228, June 1968.

[13] A.V. Oppenheim and R.W. Schafer, Discrete-Time Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1989.

[14] T.F. Quatieri, “Minimum- and Mixed-Phase Speech Analysis/Synthesis by Adaptive Homomorphic Deconvolution,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–27, no. 4, pp. 328–335, Aug. 1979.

[15] T.F. Quatieri, “Phase Estimation with Application to Speech Analysis/Synthesis,” Sc.D. Thesis, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, Cambridge, MA, Nov. 1979.

[16] R.W. Schafer, “Echo Removal by Discrete Generalized Linear Filtering,” Ph.D. Thesis, Massachusetts Institute of Technology, Dept. of Electrical Engineering, Cambridge, MA, Feb. 1968.

[17] K. Steiglitz and B. Dickinson, “Computation of the Complex Cepstrum by Factorization of the z-Transform,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, pp. 723–726, May 1977.

[18] T.G. Stockham, T.M. Cannon, and R.B. Ingebretsen, “Blind Deconvolution through Digital Signal Processing,” Proc. IEEE, vol. 63, no. 4, pp. 678–692, April 1975.

[19] J.M. Tribolet, “A New Phase Unwrapping Algorithm,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–25, no. 2, pp. 170–177, April 1977.

[20] J.M. Tribolet, Seismic Applications of Homomorphic Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1979.

[21] W. Verhelst and O. Steenhaut, “A New Model for the Short-Time Complex Cepstrum of Voiced Speech,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–34, no. 1, pp. 43–51, Feb. 1986.

[22] K. Wang and S.A. Shamma, “Self-Normalization and Noise Robustness in Early Auditory Representations,” IEEE Trans. Speech and Audio Processing, vol. 2, no. 3, pp. 421–435, July 1994.
