Chapter 8
Filter-Bank Analysis/Synthesis

8.1 Introduction

In the previous chapter, we introduced the filter bank summation (FBS) and overlap-add (OLA) methods of speech analysis and synthesis. In this chapter, we focus on extensions of the FBS method, in particular, beginning in Section 8.2 with its additional properties and practical design considerations. In Section 8.3 we interpret the filter-bank outputs in the FBS method for speech signals. Specifically, a sinewave model of the filter-bank outputs is developed for quasi-periodic speech signals, a perspective that leads to the phase vocoder for speech analysis and synthesis. Although we provide interpretation of the filter-bank outputs with respect to speech signals, the approach remains largely non-model-based, in contrast to the model-based approaches to speech analysis/synthesis studied earlier, such as those using linear prediction and homomorphic filtering. The phase vocoder is shown to be applicable in a number of areas, including speech coding and time-scale modification. We also describe limitations of this approach in the context of these applications, including the problem of achieving phase coherence, i.e., preserving the phase relation across sinewave outputs in synthesis. Loss of phase coherence is known to give a reverberant quality in the phase vocoder synthesis. Such limitations lead to a need for a more explicit formulation of sinewave components of speech; this sinewave representation will be described in Chapter 9.

In Section 8.4, we continue to address the problem of phase coherence and describe an approach to controlling individual sinewave phases in the phase vocoder so as to reduce loss of coherence in synthesis. As an example, we show how the shape of transient sounds can be approximately preserved in time-scale modification with appropriate phase control. In Section 8.5, we then take the FBS and phase vocoder methods to a generalization of filter-bank analysis/synthesis that involves constant-Q filters. We describe this generalization in the framework of the wavelet transform. The wavelet transform can be thought of as an extension of the STFT that provides good frequency resolution but poor time resolution for low frequency regions, and good time resolution but poor frequency resolution for high frequency regions. This leads us into the final Section 8.6 that gives a brief look at the relation of the wavelet transform to a front-end auditory filter-bank model. An AM-FM sinewave interpretation of the filter-bank outputs is used to speculate on why the human auditory system is phase-sensitive, especially for low-pitched speakers. We also describe how auditory processing elements may enhance joint time and frequency resolution, as well as sensitivity to temporal and spectral change in a signal. These, as well as other principles of auditory signal processing in this section, provide a basis for auditory-motivated speech processing techniques later in the text.

8.2 Revisiting the FBS Method

Recall that an analysis/synthesis system based on a filter-bank representation of a signal x[n] can be derived from the time-dependent short-time Fourier transform (STFT)

(8.1)

X(n, ω) = Σ_m x[m] w[n − m] e^{−jωm}

Specifically, we saw in Chapter 7 that by replacing the expression n − m by m, Equation (8.1) becomes

(8.2)

X(n, ω) = e^{−jωn} { x[n] * ( w[n] e^{jωn} ) }

where * denotes convolution. Equation (8.2) can be viewed as first a modulation of the window w[n] to frequency ω, thus producing a bandpass filter w[n]e^{jωn}, followed by a filtering of x[n] through this bandpass filter. The output is then demodulated back down to baseband.
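To make the filtering interpretation concrete, the following minimal Python sketch (an illustration with arbitrarily chosen signal, window, and frequency, not a prescribed implementation) computes X(n, ω) at a single frequency both by the direct sum of Equation (8.1) and by the modulate-filter-demodulate route of Equation (8.2), and confirms that the two agree.

import numpy as np

def stft_direct(x, w, omega):
    # X(n, omega) = sum_m x[m] w[n - m] e^{-j omega m}   (Equation (8.1))
    X = np.zeros(len(x), dtype=complex)
    m = np.arange(len(x))
    for n in range(len(x)):
        win = np.zeros(len(x))
        idx = n - m
        valid = (idx >= 0) & (idx < len(w))
        win[valid] = w[idx[valid]]
        X[n] = np.sum(x * win * np.exp(-1j * omega * m))
    return X

def stft_filtering(x, w, omega):
    # Filtering view (Equation (8.2)): pass x[n] through the bandpass filter
    # w[n] e^{j omega n}, then demodulate the output back down to baseband.
    bandpass = w * np.exp(1j * omega * np.arange(len(w)))
    y = np.convolve(x, bandpass)[:len(x)]
    return np.exp(-1j * omega * np.arange(len(x))) * y

x = np.random.randn(200)
w = np.hamming(31)
omega = 0.3 * np.pi
print(np.allclose(stft_direct(x, w, omega), stft_filtering(x, w, omega)))   # True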

When the frequency is sampled uniformly to form a bank of filters, i.e., ωk = (2π/N)k, we can express each bandpass filter as

hk[n] = w[n] e^{jωk n}

where the analysis window (filter) w[n] is zero outside the interval 0 ≤ n < Nw and 2π/N is the frequency spacing between bandpass filters, N being the number of filters. The output of each filter hk[n] can be written as

(8.3)

yk[n] = x[n] * hk[n] = e^{jωk n} X(n, ωk)

which is Equation (8.2) without the final demodulation and which was illustrated in Figure 7.5 of Chapter 7. The discrete frequency samples ωk = (2π/N)k can be thought of as center frequencies for each of the N “channels” of the filter bank. As was shown in Figure 7.5, the FBS synthesis is then given by

(8.4)

y[n] = (1/(N w[0])) Σ_{k=0}^{N−1} X(n, k) e^{j(2π/N)kn} = x[n] * [ (1/(N w[0])) Σ_{k=0}^{N−1} w[n] e^{j(2π/N)kn} ]

where we have written X (n, ωk) as X (n, k), the discrete STFT. It is desirable to have the term to the right of the convolution sign equal to the unit sample δ[n] so that y[n] = x[n]. From Chapter 7, the resulting FBS constraint requires in the frequency domain that the composite frequency response be flat [Equation (7.13)]; the corresponding constraint in the time domain is that the duration of w[n], Nw, be less than N or, less strictly, that w[rN] be zero at r = − 1, + 1, − 2, + 2 … [Equation (7.12)].
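As a quick numerical check of the FBS constraint, consider the following Python sketch (an illustration assuming the composite response (1/(N w[0])) Σk hk[n] written above; the window and the number of filters are arbitrary choices).

import numpy as np

def fbs_composite(w, N):
    # Composite impulse response (1 / (N w[0])) sum_k w[n] e^{j (2 pi / N) k n}
    n = np.arange(len(w))
    h = np.zeros(len(w), dtype=complex)
    for k in range(N):
        h += w * np.exp(1j * 2 * np.pi * k * n / N)
    return h / (N * w[0])

N = 64
w = np.hamming(33)                  # N_w = 33 <= N, so w[rN] = 0 for r != 0
h = fbs_composite(w, N)
delta = np.zeros(len(w))
delta[0] = 1.0
print(np.allclose(h, delta))        # True: composite response is the unit sample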

Consider now the design of such a filter bank where we specify a desired frequency response of the analysis window w[n], which is sometimes called the prototype filter because all other filters in the filter bank are derived from it by modulation. For example, we may want w[n], and thus each filter hk[n], to take on a certain bandwidth to achieve a certain frequency resolution. For voiced speech, in particular, it is often desired that each filter Hk(ω) pass only one harmonic. (We will see the importance of this constraint in our description of the phase vocoder.) Therefore, we now have two constraints on the window: (1) w[n] must satisfy the FBS constraint of being short in the time domain, e.g., Nw ≤ N, and (2) w[n] must be narrow in the frequency domain. These are conflicting constraints that are difficult to meet simultaneously, as seen from the uncertainty principle described in Chapter 2.

In order to help relieve these constraints on w[n], consider a slightly modified filter bank in which each channel output is multiplied by a complex constant pk, as shown in Figure 8.1 [42]. The modified composite output becomes (dropping the constant scale factor for simplicity)

(8.5)

y[n] = Σ_{k=0}^{N−1} pk X(n, k) e^{j(2π/N)kn} = x[n] * [ Σ_{k=0}^{N−1} pk w[n] e^{j(2π/N)kn} ]

The factor pk provides a gain and phase adjustment for each channel in the bank. The effect of pk on the composite response can be seen by writing Image and expressing Equation (8.5) as

(8.6)

Image

where

Image

Figure 8.1 Phase adjustment factor of kth channel in FBS synthesis.

SOURCE: L.R.Rabiner and R.W.Schafer, Digital processing of Speech Signals [42]. ©1978, Pearson Education, Inc. Used by permission.

Image

with

Image

We have seen this result earlier in Chapter 7 (Section 7.3.2) in the context of multiplicative modification of the STFT, i.e., we can interpret the introduction of the adjustment factor pk as a multiplication of X(n, ω) by a function P(ω) at the uniformly spaced frequencies ωk = (2π/N)k. (The reader should carry through this interpretation.) The case pk = 1 takes us back to our FBS constraint that, for y[n] = x[n], w[rN] = 0 for r = ±1, ±2, …, or the more useful constraint that the duration of w[n], Nw, is less than or equal to N (Exercise 8.1).

As we mentioned earlier, the Fourier transform of w[n] is often selected to be lowpass and narrow, and so it is difficult to obtain a window w[n] of short length Nw < N. When the FBS constraint is not satisfied, FBS synthesis results in multiple copies of the input (Chapter 7 and Exercise 8.1), i.e.,

y[n] = (1/w[0]) Σ_r w[rN] x[n − rN],

which results in a reverberant quality to the reconstruction. In order to remove the reverberation, we might shorten the analysis window w[n], but this would broaden its Fourier transform and sacrifice frequency resolution. On the other hand, we might increase the DFT length, i.e., the number of filters in our filter bank; this, however, increases computation significantly. An alternative is to select the factor pk to allow control of the time shift in the impulse-train sequence p[n], as illustrated in the following example adapted from Rabiner and Schafer [42]:

Example 8.1        Consider a discrete STFT generated with a rectangular window w[n] = 1 for 0 ≤ n < Nw, whose length Nw = 60, and computed at discrete uniform frequencies ωk = (2π/N)k with the DFT length N = 50. The resulting composite impulse response associated with FBS synthesis is δ[n] + δ[n − 50], and so a second copy of the input signal x[n] is introduced in the output at a 50-sample delay. The resultant waveform is distinctly reverberant. On the other hand, by applying the adjustment factor with a linear phase, pk = e^{−j(2π/N)kno}, with no = 20, for example, we remove the reverberation. The composite filter can be shown to be (Exercise 8.1)

(8.7)

δ[n − no] = δ[n − 20],

which simply delays the input.Image
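The numbers in Example 8.1 can be verified with a short Python sketch (illustrative only; it assumes the composite response (1/(N w[0])) Σk pk hk[n] used above).

import numpy as np

Nw, N, no = 60, 50, 20
w = np.ones(Nw)                                  # rectangular analysis window
n = np.arange(Nw)

def composite(pk):
    # (1 / (N w[0])) sum_k p_k w[n] e^{j (2 pi / N) k n}
    h = sum(pk[k] * w * np.exp(1j * 2 * np.pi * k * n / N) for k in range(N))
    return h / (N * w[0])

h_plain = composite(np.ones(N))                                 # p_k = 1
h_adj = composite(np.exp(-1j * 2 * np.pi * np.arange(N) * no / N))

print(np.flatnonzero(np.abs(h_plain) > 1e-6))   # [ 0 50] -> echo at 50 samples
print(np.flatnonzero(np.abs(h_adj) > 1e-6))     # [20]    -> 20-sample delay only

Without the adjustment, the composite response has taps at n = 0 and n = 50, the source of the reverberant second copy; with the linear-phase pk, only the tap at n = 20 survives.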

We have introduced the notion of phase adjustment in FBS synthesis, first, because it helps illustrate the time-frequency resolution tradeoffs in FBS analysis and synthesis and, secondly, because this concept will be used more generally in a number of contexts in reducing forms of reverberation in other filter bank-based analysis/synthesis. One such system is the next focus of this chapter, the phase vocoder.

8.3 Phase Vocoder

A particular formulation of FBS analysis and synthesis capitalizes on the underlying harmonic speech spectrum during voicing. This filter-bank analysis approach, which exploits the harmonic nature of a signal, originated in the context of music sound processing as a Fourier series. The Fourier series was computed over a sliding window of a single pitch period duration and provided a measure of amplitude and frequency trajectories of the musical tones [26],[39]. This technique evolved into a filter-bank-based processor and, ultimately, to signal analysis/synthesis referred to as the phase vocoder for both speech and music processing [11],[26]. In this section, we describe the fundamentals of the phase vocoder analysis and synthesis and its use in a number of applications, including speech coding, to reduce information for transmission, and speech transformations such as time-scale modification.

8.3.1 Analysis/Synthesis of Quasi-Periodic Signals

We now show that, under certain conditions on the analysis window (filter) w[n], the output of each filter in the filter bank of the FBS method can be interpreted as discrete-time sinewaves that are both amplitude- and phase-modulated by the time-dependent Fourier transform [39],[42].

Consider a sequence x[n] passed through the discrete bank of filters hk[n] of the FBS method. We have seen that each filter is given by the modulated version of the baseband prototype filter w[n], i.e.,

(8.8)

hk[n] = w[n] e^{jωk n}

where w[n] is assumed to be zero outside the interval 0 ≤ n < Nw, and where the frequency sampling interval 2π/N, i.e., the frequency spacing between bandpass filters, is determined by the number of filters N. The output of each filter can be written as in Equation (8.3), which is Equation (8.2) without the final demodulation, evaluated at discrete frequency samples ωk = (2π/N)k that can be thought of as center frequencies for each of the N “channels.”

Because each filter impulse response hk[n] in Equation (8.3) is complex, each filter output yk[n] in Equation (8.3) is complex, so we can write the temporal envelope ak[n] and phase φk[n] of the output of the kth channel as

(8.9)

ak[n] = |yk[n]|,   φk[n] = ∠yk[n]

Thus, the output of each filter can be viewed as an amplitude- and phase-modulated complex sinewave (exponential)1

1 The amplitude and phase functions are not necessarily equal to those derived from an analytic signal formulation of the corresponding real filter output because the complex filter may contain negative frequencies.

(8.10)

yk[n] = ak[n] e^{jφk[n]}

and reconstruction of the signal (via the FBS method without the modulation factors in Figure 7.5) can be viewed as a sum of complex sinewaves

(8.11)

y[n] = (1/(N w[0])) Σ_{k=0}^{N−1} yk[n] = (1/(N w[0])) Σ_{k=0}^{N−1} ak[n] e^{jφk[n]}

with amplitude and phase components given by Equation (8.9). The resulting analysis/synthesis structure is referred to as the phase vocoder [11]. When the FBS constraint is satisfied by the analysis filter w[n], we have y[n] = x[n].
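A compact Python sketch of this decomposition and resynthesis (an illustration with my own variable names; the window length is chosen equal to N so that the FBS constraint holds):

import numpy as np

def channel_outputs(x, w, N):
    # Channel filters h_k[n] = w[n] e^{j (2 pi / N) k n}   (Equation (8.8))
    n_w = np.arange(len(w))
    Y = np.empty((N, len(x)), dtype=complex)
    for k in range(N):
        h_k = w * np.exp(1j * 2 * np.pi * k * n_w / N)
        Y[k] = np.convolve(x, h_k)[:len(x)]
    return Y

x = np.random.randn(400)
N = 32
w = np.hamming(N)                     # N_w = N, so the FBS constraint holds
Y = channel_outputs(x, w, N)
a = np.abs(Y)                         # temporal envelopes a_k[n]
phi = np.angle(Y)                     # channel phases (principal values)

# Synthesis as a sum of the complex channel sinewaves a_k[n] e^{j phi_k[n]}
y = np.real(np.sum(a * np.exp(1j * phi), axis=0)) / (N * w[0])
print(np.max(np.abs(y - x)))          # numerically tiny: y[n] = x[n]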

The amplitudes and phases in Equations (8.10) and (8.11) can correspond to physically meaningful parameters for quasi-periodic signals typical of voiced speech. In order to see this property, the STFT is first written as

(8.12)

X(n, ωk) = |X(n, ωk)| e^{jθ(n, ωk)}

where ωk = (2π/N)k is the center frequency of the kth channel. Then, from Equations (8.3) and (8.12), the output of the kth channel filter is expressed as

(8.13)

yk[n] = e^{jωk n} X(n, ωk) = |X(n, ωk)| e^{j[ωk n + θ(n, ωk)]}

and, therefore, from Equations (8.10) and (8.13), the temporal envelope ak[n] = |X(n, ωk)| and phase φk[n] = ωkn + θ(n, ωk).

Consider now filters that are symmetric about π so that ωN−k = 2π − ωk, where as before ωk = (2π/N)k, and assume for simplicity that N is even (Figure 8.2). Then it is straightforward to show that (Exercise 8.2)

(8.14)

X(n, ωN−k) = X*(n, ωk)

From Equations (8.13) and (8.14), the sum of two symmetric channels k and N − k can be written as (Exercise 8.2)

(8.15)

yk[n] + yN−k[n] = 2|X(n, ωk)| cos[ωk n + θ(n, ωk)],

which can be interpreted as a real sinewave that is amplitude- and phase-modulated by the STFT, the “carrier” of the latter being the k th filter’s center frequency. The changing, amplitude-modulated envelope (more strictly, amplitude modulation (AM) refers to a changing deviation about a steady amplitude component) is given by 2|X(n, ωk)| and the phase modulation (PM) about the carrier is given by θ (n, ωk).

Before investigating the response of the filter bank to a periodic input, we look at a useful interpretation of the output Image that comes from the concept of instantaneous frequency. To do so, we return for the moment to the analog realm where we write the STFT of a continuous time signal as

(8.16)

X(t, Ω) = ∫_{−∞}^{∞} w(t − τ) x(τ) e^{−jΩτ} dτ

Figure 8.2 Filters whose center frequencies are symmetric about π, i.e., ωN−k = 2π − ωk with ωk = (2π/N)k. In this example N = 10.

SOURCE: L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals [42]. ©1978, Pearson Education, Inc. Used by permission.

Image

which can be expressed as

Image

where

Image

As in discrete time, we can derive an amplitude/phase modulated sinewave representation for each channel output (again using two symmetrically-placed filters):

2|X(t, Ωk)| cos[Ωk t + θ(t, Ωk)]

and we then define the phase function of each filter output as

Φ(t, Ωk) = Ωkt + θ (t, Ωk).

Consider next the phase derivative

Φ̇(t, Ωk) = ∂Φ(t, Ωk)/∂t = Ωk + θ̇(t, Ωk)

where

θ̇(t, Ωk) = ∂θ(t, Ωk)/∂t

We refer to Φ̇(t, Ωk) as the instantaneous frequency at the output of each (kth) bandpass filter with center frequency Ωk. Observe that the phase can be recovered as

Φ(t, Ωk) = ∫_0^t Φ̇(τ, Ωk) dτ + Φ(0, Ωk)

where Φ(0, Ωk) is an initial condition. The signal 2|X(t, Ωk)| is likewise referred to as the instantaneous amplitude for each channel. The resulting filter-bank output is a sinewave with generally a time-varying amplitude and frequency modulation, as illustrated in Figure 8.3.

We can think of θ̇(t, Ωk) as the deviation of the instantaneous frequency from the center frequency Ωk of the kth filter, provided that θ̇(t, Ωk) is “slowly varying,” as under certain conditions that we will see shortly. This deviation is also called frequency modulation (FM) and is illustrated graphically in Figure 8.3a. We will see in a moment why we might expect this condition to hold for voiced speech. An alternative expression for the instantaneous frequency deviation, not requiring the explicit calculation of the phase derivative, is given by (Exercise 8.3)

(8.17)

θ̇(t, Ωk) = [a(t, Ωk) ḃ(t, Ωk) − b(t, Ωk) ȧ(t, Ωk)] / [a²(t, Ωk) + b²(t, Ωk)], with a(t, Ωk) and b(t, Ωk) the real and imaginary parts of X(t, Ωk),

which is the time-domain counterpart to the frequency-domain phase derivative that we have seen in the context of homomorphic processing [Equation (6.8)].

Figure 8.3 Interpretation of instantaneous frequency and amplitude in continuous time: (a) θ̇(t, Ωk) is the deviation of the instantaneous frequency from the center frequency Ωk of the kth filter (the frequency modulation) and Φ̇(t, Ωk) = Ωk + θ̇(t, Ωk) is the instantaneous frequency; (b) the instantaneous amplitude 2|X(t, Ωk)| and instantaneous frequency Φ̇(t, Ωk) characterize each filter bank sinewave output.

Image

Based on this formulation, we seek the instantaneous frequency in discrete time. To obtain this function, suppose that X(t, Ωk) as a function of time is bandlimited. Then we can sample the continuous-time STFT, with sampling interval T, to obtain the discrete-time STFT, i.e., [42]

Image

where we have used the continuous-/discrete-time sampling relation reviewed in Chapter 2. Likewise, the phase derivative associated with X(n, ωk) can be defined as a sampled version of θ̇(t, Ωk), i.e.,

Image

where we assume θ̇(t, Ωk) is also bandlimited. The continuous-time function θ̇(t, Ωk), however, is not available because the STFT is computed in discrete time. One possibility is to discretize the continuous-time phase derivative expression given in Equation (8.17), i.e.,

θ̇(n, ωk) = [a(n, ωk) ḃ(n, ωk) − b(n, ωk) ȧ(n, ωk)] / [a²(n, ωk) + b²(n, ωk)]

where the corresponding a(t, Ωk), b(t, Ωk), and their derivatives are assumed to be bandlimited. Although a(n, ωk) and b(n, ωk) are available through the discrete-time STFT, this is not the case for their derivatives. Nevertheless, estimates of ȧ(n, ωk) and ḃ(n, ωk) can be obtained by discrete-time filtering of a(n, ωk) and b(n, ωk), such as by first forward or backward differencing. (Note that we cannot obtain the phase derivative by explicit phase differencing methods without first performing the difficult task of unwrapping the phase in time from principal phase measurements.)
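A brief Python sketch of this estimate (illustrative; the single-sinewave test input, window length, and channel frequency are arbitrary choices): the real and imaginary parts a(n, ωk) and b(n, ωk) are backward-differenced and combined as in the discretized Equation (8.17), so no phase unwrapping is needed.

import numpy as np

def inst_freq_deviation(X_k):
    # X_k[n] = a(n, w_k) + j b(n, w_k): the k-th channel STFT values.
    a, b = X_k.real, X_k.imag
    da = np.diff(a, prepend=a[0])      # backward-difference estimates of the
    db = np.diff(b, prepend=b[0])      # derivatives of a and b
    return (a * db - b * da) / (a**2 + b**2 + 1e-12)

# Test: a single sinewave at w_p entering a channel centered at w_k
n = np.arange(2000)
w_p, w_k = 0.205 * np.pi, 0.2 * np.pi
x = np.cos(w_p * n)
w = np.hamming(401)
bandpass = w * np.exp(1j * w_k * np.arange(len(w)))
X_k = np.exp(-1j * w_k * n) * np.convolve(x, bandpass)[:len(n)]   # X(n, w_k)

dev = inst_freq_deviation(X_k)
print(np.median(dev[500:1500]), w_p - w_k)    # both approximately 0.0157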

We now return to the goal of gaining insight into the nature of the bandpass filter outputs for a speech input. In particular, we look at the response of our filter bank to exactly periodic and quasi-periodic inputs. We consider first a perfectly periodic input with stationary vocal tract and then a quasi-periodic input generated with a glottal excitation of slowly varying pitch and with a slowly changing vocal tract. The prototype filter of the filter bank is assumed to be narrowband and flat in the region in which each input sinewave component lies, as illustrated in Figure 8.4. Our goal is to compute the instantaneous amplitude and frequency of each filter output. We begin the analysis in continuous time because of the need to differentiate or integrate functions of phase.

Periodic Case — Consider the pth sinewave component of frequency Ωp, denoted by xp(t), that passes through the kth channel filter without distortion, as illustrated in Figure 8.4. For this periodic case, Ωp = pΩo, where Ωo is the fundamental frequency. The pth sinewave component can be written as

Image

The demodulated output of the kth filter, i.e., the STFT, can be expressed as

Image

Figure 8.4 Filter-bank response to perfectly periodic sequence. One harmonic component is passed through Hk(Ω).

Image

Then it can be seen that for the kth channel

(8.18)

Image

and thus the instantaneous frequency deviation is constant and given by

θ̇(t, Ωk) = Ωp − Ωk,

which is the deviation of the harmonic component from the center frequency. Observe that by integrating θ̇(t, Ωk) with the appropriate initial condition, we can recover θ(t, Ωk), i.e.,

θ(t, Ωk) = ∫_0^t θ̇(τ, Ωk) dτ + θ(0, Ωk)

with initial condition θ(0, Ωk). Thus, we can recover X(t, Ωk). Finally, the output of the kth real channel (the sum of two symmetric channels), after demodulation by e^{jΩkt}, is given by

(8.19)

Image

i.e., each signal component passes intact.

A similar analysis can be made for quasi-periodic signals which consist of a sum of sinewaves with slowly varying instantaneous amplitude and frequency, each of which is assumed to pass through a single filter.

Quasi-Periodic Case — We now look at the bandpass filter response to a sinewave with varying instantaneous frequency Ωp(t) and amplitude Ap(t), i.e.,

with

Image

Image

which corresponds to a glottal excitation function with slowly varying pitch and amplitude and to a slowly varying vocal tract.2 Suppose xp(t) is the input to a bandpass filter hk(t) = w(t)e^{jΩkt}, where w(t) represents the analysis window. We impose the following constraints on the amplitude and frequency functions, illustrated in Figure 8.5:

2 We show in Chapter 9 how a changing source and vocal tract contribute to these time-varying amplitude and frequency functions.

1. Ωp(t) ≈ Ωp(t′) over the duration of w(t), i.e., Ωp(t) remains at nearly its initial value over the interval [t′, t″].

2. Ap(t) ≈ Ap(t′) over the duration of w(t), i.e., Ap(t) remains at nearly its initial value over the interval [t′, t″].

Figure 8.5 A sinewave component corresponding to slowly varying pitch and vocal tract. The input to a bandpass filter is a single sinewave xp(t) with a slowly varying amplitude (envelope) and instantaneous frequency. The instantaneous amplitude Ap(t) and instantaneous frequency Ωp(t) are assumed to appear as constant under the analysis (filter) window w(t).

Image

Then xp(t) appears to the filter over the time interval [t′, t′′] as a steady sinewave, i.e.,

Image

and thus the input appears as an eigenfunction of the linear time-invariant (LTI) filter hk(t) (Chapter 2). It follows that the magnitude and phase derivative of the STFT, X (t, Ω), are given by

(8.20)

Image

Therefore, we can show that the output of the bandpass filter is of the form

(8.21)

Image

The reader is stepped through an informal proof of this result in Exercise 8.4.

Our general solution in the quasi-periodic case can be written in discrete time for a pth-harmonic input xp[n]. By time-sampling the continuous-time expression for the STFT in Equation (8.20), we have (Exercise 8.4)

(8.22)

Image

Figure 8.6 Analysis/synthesis structure in the phase vocoder.

Image

We can then synthesize x[n] given |X(n, ωk)|, θ̇(n, ωk), and θ(0, ωk) with appropriate numerical integration of the phase derivative, e.g., through a cumulative sum (Exercise 8.18). The complete analysis/synthesis scheme is shown in Figure 8.6. Observe that error incurred in this cumulative sum will cause phase drift across the sinewave outputs of adjacent channels and thus a loss of the original channel phase relations. We refer to the preservation of this phase relation as phase coherence. Loss of phase coherence results in a change in the shape of the original signal. (We alluded to a different, but related, phase distortion in our discussion of speech synthesis from the STFTM in Chapter 7.) The importance of phase drift warrants a more detailed discussion; in Section 8.4 we elaborate on the problem and present a scheme for maintaining phase coherence.
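The synthesis path of Figure 8.6 can be sketched as follows (a minimal Python illustration, assuming the channel magnitudes, phase-derivative samples, and initial phases come from an analysis such as the earlier sketches): the phase of each channel is rebuilt by a cumulative sum, the carrier ωk n is restored, and the channels are summed. Any error in the cumulative sum accumulates as exactly the phase drift discussed above.

import numpy as np

def synthesize(mag, dtheta, theta0, N, w0):
    # mag[k, n] = |X(n, w_k)|, dtheta[k, n] = phase-derivative samples,
    # theta0[k] = theta(0, w_k).  The phase is recovered by a cumulative
    # sum (numerical integration) and the carrier w_k n is added back.
    K, L = mag.shape
    n = np.arange(L)
    theta = theta0[:, None] + np.cumsum(dtheta, axis=1) - dtheta[:, :1]
    y = np.zeros(L, dtype=complex)
    for k in range(K):
        w_k = 2 * np.pi * k / N
        y += mag[k] * np.exp(1j * (w_k * n + theta[k]))
    return np.real(y) / (N * w0)

# In practice mag, dtheta, and theta0 would come from the analysis filter bank
# (possibly after decimation and interpolation); any integration error appears
# as a drift of each channel phase away from its original trajectory.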

As an aside, it is interesting to note that the approximations used in developing the phase vocoder analysis and synthesis for a sinewave with time-varying amplitude and frequency, as an input to a filter with frequency response flat over the instantaneous frequency, are a special case of a more general solution for an arbitrary filter. For an input of the form x[n] = A[n]e^{jφ[n]}, the output of an LTI filter H(ω) can be approximated as [2],[40]

(8.23)

y[n] ≈ A[n] e^{jφ[n]} H(φ̇[n])

where it is assumed that H(ω) = 0 for ω < 0, making the resulting signal analytic. If x[n] has neither AM nor FM, then the approximation of Equation (8.23) is exact, x[n] being an eigenfunction of H(ω). In the time domain, error bounds for the approximation have been derived by Bovik, Havlicek, and Desai [2]. Let z[n] be the exact output of an LTI filter with frequency response H(ω) and impulse response h[n] for input x[n], i.e., z[n] = x[n] * h[n], and let y[n] be given by the approximation of Equation (8.23). Then the error, defined as ε[n] = |z[n] − y[n]|, is bounded by

(8.24)

Image

where Amax is the maximum value of A[n], and A(υ) and φ̇(υ) are the continuous-time signals corresponding to A[n] and φ̇[n]. From the above upper bound, we see that we want the energy in the impulse response to be concentrated around n = 0. In addition, observe that AM, as well as FM, causes error in the approximation, and that this error increases with increasing modulation. An alternative frequency-domain approach to determining error bounds in the approximation has also been developed and shows different considerations [50]. These error bounds can give us a quantitative way to determine the accuracy of our approximations for specific filters hk[n] and amplitude and frequency modulating functions. Even when the approximation is accurate, however, Equation (8.23) reveals that the filter H(ω) can change the channel amplitude and phase when it deviates from a flat, zero-phase response in the region of the input frequency (Exercise 8.6).

Before proceeding to our discussion of phase coherence, we digress from theory for a moment to describe a number of fascinating applications of the phase vocoder.

8.3.2 Applications

In applying the phase vocoder, it is advantageous to express each analysis output in terms of the channel phase derivative θ̇(n, ωk) and initial phase offset θ(0, ωk). For a single sinewave xp[n] = Ap[n] cos(φp[n]) of a harmonic set (assumed to enter one filter of the filter bank), we have seen that these two quantities are given approximately by θ̇(n, ωk) = ωp[n] − ωk and θ(0, ωk) = φp[0], respectively. A phase function can then be obtained by integration of the phase derivative, which is added to the carrier phase ωkn. This approach makes the filter outputs amenable to speech coding, i.e., to bit-rate reduction through quantization,3 and also amenable to speech modification, e.g., time-scale modification.

3 Formal definitions of bit rate and quantization are given in Chapter 12. For the moment, we can think of a decrease in bit rate as allowing a decrease in channel transmission bandwidth, while quantization is the dividing of a signal value into “quanta.” Bit rate is reduced when decreasing the number of quanta to represent signal values (one bit corresponding to two quanta) and when increasing time decimation.

Speech Coding — An overview of a speech coder based on a filter-bank approach is given in Figure 8.7. In this scheme, the demodulated output of each filter is time-decimated at the transmitter and quantized in the encoder module. The quantized values are encoded into a bit stream and transmitted over a channel. At the receiver, the bit stream is decoded and the quantized values are interpolated. Finally, the filter-bank outputs are modulated and summed to form the synthesized received signal. To understand the limitations of this approach, and thus to motivate the use of the phase vocoder in speech coding, we revisit our discussion in Chapter 7 of time-frequency sampling requirements on the STFT that follow from its Fourier transform view, corresponding to OLA synthesis, and from its filtering view, corresponding to FBS synthesis. Consider the Fourier transform viewpoint. If the window is of duration Nw, then we require a frequency sampling interval no greater than 2π/Nw. From the filtering point of view, we require a time-sampling interval L that meets the Nyquist criterion based on the bandwidth of each bandpass filter. This implies that we sample at an interval determined by the filter bandwidth ωc to avoid frequency-domain aliasing of the time sequence X(n, ωk). Thus, there results a sampling rate larger than the original sampling rate, e.g., four times the original sampling rate for a Hamming window [42]. These window length and bandwidth constraints can, however, be relaxed according to the FBS and OLA constraints of Chapter 7 by allowing zeros in the window or its transform. Although this relaxation of the constraints on window duration and bandwidth may conceivably avoid this increase in the sampling requirement, it is not an effective way to reduce bits in speech coding systems. Consequently, the filter-bank scheme in Figure 8.7 is not an effective coder. Alternatively, when the signal of interest is speech, or any signal with harmonic structure such as certain music and biological signals, we can exploit the phase vocoder filter-bank output structure of the previous section.

Figure 8.7 Filter-bank-based speech coder overview. The time decimation is limited by the bandwidth of each analysis filter, according to the Nyquist criterion, to avoid aliasing in frequency. In addition, the frequency decimation is limited by the duration of the analysis filter, i.e., the number of filters must be large enough to avoid aliasing in time. The number of samples/s over all filter bank channels (in time and frequency) is consequently larger than the input waveform sampling rate.

SOURCE L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals [42]. ©1978, Pearson Education, Inc. Used by permission.

Image

The idea is based on the observation that the magnitude and phase derivative functions at the filter outputs can vary more slowly, and therefore are characterized by a smaller bandwidth, than the filter waveform output itself, i.e., |X(n, ωk)| and θ̇(n, ωk) vary more slowly than X(n, ωk) for each channel k. We assume that we can design the analysis filter w[n] to pass a single (pth) harmonic of voiced speech within the passband of the kth channel.4 Then, as we have seen, the filter output magnitude and phase derivative are given approximately as

4 This implies that for low pitch, the filter bandwidth is very small and therefore the window duration is very large, thus perhaps violating the FBS constraint. Nevertheless, if indeed one sinewave (harmonic) is passed by each filter, then signal synthesis is achieved. This apparent paradox is left for the reader to ponder.

(8.25)

Image

both of which vary slowly if the pitch and vocal tract of the speaker vary slowly in time. Consequently, we can significantly time-decimate |X(n, ωk)| and θ̇(n, ωk). Observe that we could have also sampled the corresponding phase obtained by integrating the phase derivative. This unwrapped phase, however, can grow without bound, and its principal phase value contains sharp discontinuities at 2π wrapping points, so both functions are difficult to quantize efficiently. At the extreme, when the pitch and vocal tract are time-invariant, so that the harmonic amplitudes and frequencies are fixed, only one sample of the amplitude and one of the phase derivative are required to represent the two functions. Therefore, only these parameters, along with the phase offset, are required in the representation of each filter output. This is not surprising because we are assuming that each filter output is a single sinewave.
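A hedged sketch of the time-decimation idea (illustrative only; the decimation factor and linear interpolation are arbitrary choices, and quantization of the kept samples is omitted):

import numpy as np

def decimate_and_interpolate(mag_k, dtheta_k, M):
    # Transmitter: keep every M-th sample of the slowly varying channel
    # magnitude and phase derivative.  Receiver: linearly interpolate back
    # to the original rate.
    n = np.arange(len(mag_k))
    keep = n[::M]
    mag_hat = np.interp(n, keep, mag_k[::M])
    dtheta_hat = np.interp(n, keep, dtheta_k[::M])
    return mag_hat, dtheta_hat

# For a steady voiced segment, mag_k and dtheta_k are nearly constant, so even
# a large decimation factor M reconstructs them closely.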

To obtain a flavor for bit-rate reduction by quantizing the slowly varying magnitude and phase derivative functions, consider an early 28-channel phase vocoder with a channel spacing of 100 Hz [4],[42]. The log-magnitude and phase derivative signals were quantized non-uniformly with fewer bits being allocated to the higher channels because, as we will discuss in later chapters, the human auditory system is less sensitive to noise due to quantization in the high-frequency region. In addition, more bits were allocated to the channel phase derivative than to the magnitude. Specifically, 60 samples/s were used to represent the channel magnitude and phase derivative signals. For magnitude, two bits were used for the lower channels and one bit for the higher channels. For the phase derivative, three bits were used for the lower channels and two bits for the higher channels, resulting in speech coded at 7200 bits/s judged to be “good” quality, but not quality transparent from the original. On the other hand, if the output waveform from each filter channel is quantized directly as in Figure 8.7, rather than through the magnitude and phase derivative, then about 16000 bits/s are required for “good” quality. These rates are in contrast to about 64000 bits/s required to represent the speech waveform itself with quality transparent from the original. In spite of the reduction in bit rate, these rates are relatively high when compared to the lower rates that can be achieved with approaches that use the speech production model more explicitly.

In this section, we have hopefully stirred the reader’s curiosity for ways in which dramatic reductions in bit rate can be achieved. Chapter 12 will provide a much more thorough look into this fascinating application area.

Speech Modification — The phase vocoder has been widely used in modification of speech and speech-like signals. In time-scale modification, for example, as we have seen in Chapter 7, the goal is to maintain the perceptual characteristics of the original signal and speaker (e.g., pitch and vocal tract spectrum) while changing the articulation rate of the speaker. Two approaches to performing time-scale modification with the phase vocoder are described here. The original approach [11],[42] combines time scaling with a method of compressing (or expanding) the speech spectrum along the frequency axis. With the filter-bank output magnitude and phase derivative functions as given in Equation (8.25), the modification steps, illustrated in Figure 8.8, are as follows:

S1: Frequency-compress (or expand) each channel by dividing the phase derivative θ̇(n, ωk) of each channel by the rate-change factor ρ, i.e.,

Image

Figure 8.8 Time-scale modification with the phase vocoder using frequency compression/expansion and fast/slow playback.

Image

This operation has the effect of compressing the spectrum for ρ < 1 and expanding the spectrum for ρ > 1 along the frequency axis because when one harmonic (the pth harmonic) enters each filter, we have

Image

where the modified channel center frequency is at ρωk.

S2: Integrate (using a running sum) the scaled phase derivative θ̇(n, ωk)/ρ and exponentiate the result.

S3: Apply amplitude modulation to form the complex sequence for each channel given approximately by

Image

where we represent the running sum of the changing frequency over time by the integration operation (implemented with numerical integration).

S4: Modulate the channel signal by the new channel center frequency to form

Image

S5: Sum all N channels to form

Image

where we assume one harmonic per channel. We then play back the resulting signal at ρ times the original rate. This final operation has the effect of restoring the correct frequency composition and modifying the time scale; it can be implemented either by changing the sampling rate of the output D/A converter or by playing back the analog rendition of the synthesized signal at a speed different from that of the original recording.

Examples of time-scale expansion and compression using the above phase vocoder-based technique are given in [11],[42], illustrating an accordion-like modification of the original speech spectrograms. Observe in Step 5 that if we do not modify the time scale, then the synthesized waveform is characterized by frequency compression or expansion which can be useful in its own right.

In a second approach to time-scale modification with the phase vocoder, the phase and amplitude of each channel are interpolated or decimated directly to a new time scale, in contrast to relying on a change in time scale during playback. A rate change by an arbitrary rational factor can be performed by combined interpolation and decimation. In one form of this technique, illustrated in Figure 8.9, demodulation by e^{−jωkn} does not occur and the phase of each filter output in Equations (8.10) and (8.11) is obtained by integrating the phase derivative, i.e., the instantaneous frequency of each channel. The channel amplitude and phase functions are then time-scaled by time-decimation and/or interpolation. With time-scale modification by a factor ρ, the modified filter output for each channel is given by

(8.26)

ỹk[n] = ãk[n] e^{jρφ̃k[n]}

where ãk[n] and φ̃k[n] are the decimated/interpolated amplitude and phase functions, respectively. The modified phase is scaled by ρ to maintain the original instantaneous frequency of each filter output. We can see the need for this scaling by writing

(8.27)

Image

which is the time-scaled instantaneous frequency.
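A minimal Python sketch of this direct approach for a single channel (my own rendition under the assumptions above; the channel amplitude and instantaneous frequency are taken as given): the phase is formed by a running sum of the instantaneous frequency, the amplitude and phase are resampled to the new time scale, and the resampled phase is scaled by ρ so that the instantaneous frequency is preserved.

import numpy as np

def time_scale_channel(a_k, inst_freq_k, rho):
    # a_k[n]: channel amplitude; inst_freq_k[n]: channel instantaneous
    # frequency (carrier plus deviation); rho: rate-change factor.
    phi = np.cumsum(inst_freq_k)                       # phase by running sum
    n_old = np.arange(len(a_k))
    n_new = np.arange(int(rho * (len(a_k) - 1)) + 1)   # stretched time axis
    a_tilde = np.interp(n_new / rho, n_old, a_k)       # decimate/interpolate
    phi_tilde = np.interp(n_new / rho, n_old, phi)
    return a_tilde * np.exp(1j * rho * phi_tilde)      # scale phase by rho

# A steady complex sinewave stretched by rho = 2 keeps its frequency:
y = time_scale_channel(np.ones(500), np.full(500, 0.1 * np.pi), 2.0)
print(np.allclose(np.diff(np.unwrap(np.angle(y))), 0.1 * np.pi))   # True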

Figure 8.9 A direct approach to time-scale modification with the phase vocoder. The instantaneous frequency is integrated to form a phase function, decimated and/or interpolated to the new time scale, and finally scaled by the rate change factor ρ to restore the correct frequency.

Image

In a variation of the technique, we unwrap the principal phase value along the time axis without computation of the phase derivative. We can do the unwrapping, for example, by detecting 2π jumps in the principal phase values over time, as in the algorithm described in Chapter 6 for phase unwrapping over frequency.
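A small sketch of this unwrapping step (illustrative; NumPy's unwrap applies the same 2π-jump detection along a chosen axis):

import numpy as np

def unwrap_in_time(principal_phase):
    # principal_phase[k, n]: principal value of the k-th channel phase at
    # frame n; 2*pi jumps between successive frames are removed.
    return np.unwrap(principal_phase, axis=-1)

# Equivalent explicit form for a single channel:
def unwrap_one_channel(p):
    d = np.diff(p)
    correction = -2 * np.pi * np.round(d / (2 * np.pi))    # undo each jump
    return p + np.concatenate(([0.0], np.cumsum(correction)))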

8.3.3 Motivation for a Sinewave Analysis/Synthesis

In spite of the many successes of the phase vocoder, it suffers from a number of limitations [31],[32],[39],[45]. Consider first the analyzer. In the applications of speech coding and time-scale modification, for example, it is assumed that only one sinewave enters each bandpass filter within the filter bank. When more than one sinewave enters a bandpass filter, our interpretation of a filter output as a sinewave with slowly varying amplitude and frequency modulation breaks down. In addition, a particular sinewave may not be adequately estimated when the filter response shape is non-flat, so that the factor H(φ̇[n]) in Equation (8.23) contributes additional AM, or when the filter bandwidth and the sinewave modulation result in a large upper bound in Equation (8.24). Likewise, the estimation may fail when a sinewave frequency falls between two adjacent filters of the filter bank. In addition, sinewaves with rapidly varying frequency due to large vibrato or fast pitch change are difficult to track as their frequencies move across multiple filters. Although these measurement problems may be resolved by an appropriate combining of adjacent filter-bank outputs, such solutions for a speech input are likely to be cumbersome [45]. An example of a filter-bank structure with a non-flat prototype filter for a single FM-sinewave input is given in Exercise 8.17.

Consider now the synthesizer stage of the phase vocoder. In speech coding, for example, if we are to quantize samples of the phase derivative, then we need to recover the phase from these samples. In doing so, we encounter the problem of obtaining the phase from the phase derivative samples. As we saw, we can do this by numerical integration, but the resulting estimate deviates from the absolute phase because the initial phase offset may not be available and because numerical integration introduces error, thus causing a drift from the original phase even if the initial phase offset were known. This phase drift results in an incorrect phase relation across bandpass filters, i.e., loss of phase coherence, and thus a change in the waveform shape, sometimes referred to as “dispersion.” Furthermore, the phase relation across channels changes with time because phase error is continuously being introduced. A consequence of this dispersion is an annoying reverberant effect, also referred to as a “choral” effect. A similar waveform dispersion problem occurs in time-scale modification when using samples of the phase derivative. Integration of the phase derivative, as well as scaling of the resulting phase function, results in a loss of the original phase relation among sinewaves, thus giving a similar objectionable reverberant characteristic to the synthesis. This problem also occurs with the alternative direct implementation of time scaling where the unwrapped phase is obtained from the principal phase value (Exercise 8.5). We will describe in the following section a method for reducing dispersion in the phase vocoder. In Chapter 9, a method in a similar spirit is described in the different context of sinewave analysis/synthesis. Finally, we note that the phase vocoder was formulated for discrete sinewaves and hence was not designed for the representation of noise components of a sound. For this class of inputs, the filter output approximation of Equation (8.25) is not meaningful.

A number of refinements of the phase vocoder have addressed these problems [10],[20],[21],[31],[39]. For example, the assumption that only one sinewave passes through each filter motivates a filter bank with filter spacing equal to the fundamental frequency, thus allowing one harmonic to pass through each filter [20]. An alternative is to oversample in frequency, i.e., to increase the number of filters, using filters of very narrow bandwidth in the hope that only one harmonic passes through each filter. One approach to preventing waveform dispersion is to use an overlap-add rendition of the synthesis with windows of a length such that the overlap is always in phase [21]. Other ways to prevent dispersion are given in the following section. Another refinement of the phase vocoder was developed by Portnoff, who represented each sinewave component by a source and a vocal tract filter contribution, thus introducing some control on the phase in synthesis [31], although the phase functions are still computed via a cumulative sum on phase derivatives and more than one sinewave can enter a filter. The sinewave frequencies in this model are constrained to be harmonically related. Portnoff also provided a rigorous analysis of the stochastic properties of the phase vocoder for a noise-like input.

The analysis stage of the original phase vocoder and its refinements views sinewave components as outputs of a bank of uniformly-spaced bandpass filters. Rather than relying on a filter bank to extract the underlying sinewave parameters, an alternate approach is to explicitly model and estimate time-varying parameters of sinewave components by way of spectral peaks in the short-time Fourier transform [25],[39]. It will be shown in Chapter 9 that this approach lends itself to sinewave tracking through frequency matching, phase coherence through a source and vocal tract filter phase model, and estimation of a stochastic component by use of an additive model of deterministic and stochastic signal components. As a consequence, the resulting sinewave analysis/synthesis scheme resolves many of the problems encountered by the phase vocoder, and provides a useful framework for a large range of speech and audio signal processing applications.

8.4 Phase Coherence in the Phase Vocoder

In this section, we describe approaches to achieve phase coherence in the phase vocoder. We previously defined phase coherence as the preservation of the original sinewave phase relations in the synthesized speech and saw that loss of phase coherence resulted in a reverberant or “choral” effect. One approach to achieving phase coherence is to apply a phase offset correction to each sinewave phase that attempts to make phase relations in the modified signal at t correspond to those in the original signal at that time or, with time-scale modification, at a time t' that maps back to the original time scale. Phase coherence can be achieved at specific signal event times or at regular intervals over time. This concept was first introduced in the context of sinewave analysis/synthesis [38] and a specific method of achieving phase coherence at uniformly spaced frame boundaries will be described in this context in Chapter 9. For the phase vocoder, on the other hand, we first illustrate the principle of achieving phase coherence, by example, with time-scale modification of signals consisting of successive short-duration decaying sinewaves. By preserving phase coherence at specific event times, we can approximately maintain in the time-scaled signal the shape of the original temporal envelope [33],[34],[37] which may play an important role in auditory discrimination of such sounds. We then briefly describe a related approach by Puckette [32] and Laroche and Dolson [18] for reducing the reverberant quality of quasi-periodic waveforms in the phase vocoder.

8.4.1 Preservation of Temporal Envelope

We saw in Chapter 2 that the temporal envelope of a signal is sometimes defined, typically in the context of bandpass signals, as the magnitude of the analytic signal representation [29]. We also saw in the previous section that the temporal envelope was defined as the magnitude of the complex filter bank output. Other definitions of temporal envelope have been proposed based on estimates of attack and release dynamics [1]. The quality of a sound is sometimes associated with its temporal envelope. We saw, for example, in previous chapters that the conversion of a mixed-phase vocal cord/vocal tract impulse response to its minimum-phase counterpart can decrease its attack time and increase its peakiness, thus significantly altering the signal’s temporal envelope and giving the synthesized speech a “buzzy” quality during voicing. More generally, assigning different Fourier transform phase functions to a given Fourier transform magnitude results in a large variety of temporal envelopes.

To further illustrate the relation between the frequency-domain phase and temporal structure of a signal, consider the following thought experiment. Suppose we are given the temporal envelope and the Fourier transform magnitude of a signal. We want to generate a time-scaled signal that has the given spectral magnitude and, with an appropriate selection of the Fourier transform phase, has a time-scaled version of the original temporal envelope. Although iterative methods can be applied to attempt to meet these time-frequency constraints [34],[53], a close match to both the spectral magnitude and the modified temporal envelope may not be consistent with the relationship between a sequence and its Fourier transform (Exercise 8.7). Nevertheless, we have proposed this thought experiment because it exemplifies the general approach of altering phase to preserve the temporal structure of a signal.

Consider now an analogous strategy in the context of the phase vocoder and, specifically, consider time-scale modification as given by Equation (8.26). Here we are given the desired time-scaled sinewave instantaneous amplitudes Image. We also have the desired time-scaled sinewave instantaneous frequencies by way of the phase derivatives Image. The time-scaled phase functions Image, however, typically result in a modified speech waveform whose temporal envelope is very different from that of the original waveform. Although Equation (8.26) maintains amplitude and phase derivative (frequency) relations, it does not maintain absolute phase relations, i.e., it loses phase coherence. Our approach is to attempt to preserve the envelope by applying a phase offset correction to each channel. Clearly, however, a single phase correction to each function Image cannot preserve the phase relations over all time (Exercise 8.5). Within the framework of the phase vocoder, therefore, rather than attempting to maintain the temporal envelope over all time, a different approach is to maintain the channel phase relations at time instants that are associated with distinctive features of the envelope [33],[34],[37],[39]. As a stepping stone to the approach, the notion of instantaneous invariance is introduced.

Instantaneous Invariance — It is assumed that the temporal envelope of a waveform near a particular time instant n = no is determined by the amplitude and phase of its channel components at that time (i.e., ak[no] and φk[no]), and by the time rate of change of these amplitude and phase functions. Suppose that we want to time-scale modify the sequence by a rate-change factor ρ. To preserve the temporal envelope in the new time scale near n = ρno, these amplitude and phase relations are maintained at that time. The phase relations can be maintained by adding an offset to each channel’s phase, guaranteeing that the resulting phase trajectory takes on the desired phase at the specified time n = ρno. In other words, a phase correction is introduced in each channel that sets the phase of the modified filter output Image at n = ρno to the phase at n = no in the original time scale. Denoting the phase correction by Image, the modified channel signal becomes

(8.28)

Image

where Image and where Image and Image are the interpolated versions of the original amplitude and phase functions. An inconsistency arises, however, when preservation of the temporal envelope is desired at more than one time instant. One approach to resolving this inconsistency is to allow specific groups of channel components to contribute to different instants of time at which invariance is desired [33],[34],[37],[39].

The approach to invariance can be described by time-expanding the signal in Figure 8.10a that has a high- and low-frequency component, each with a different starting time. If all channels are “phase-synchronized” (also referred to as “phase-locked”), as above, near the low-frequency event, the phase relations at the high-frequency event are changed and vice versa. For this signal, with two events of different frequency content, it is preferable to distribute the phase synchronization over the two events; the high-frequency channels being phase-synchronized at the first event and the low-frequency channels being phase-synchronized at the second event. Equation (8.28) can then be applied to each channel group using the time instant for the respective event, thus phase-locking phases of channels that most contribute to each event.

One approach to assigning channels to time instants uses the individual envelopes of the filter-bank outputs [33]. Accordingly, the filter bank is designed with a short prototype filter such that each filter output reflects distinctive events that characterize the temporal envelope of the input signal. Channels are then clustered according to their similarity in envelope across frequency. The onset time of an event is defined within each channel as the location of the maximum of the channel envelope ak[n] and is denoted by no(k). It is assumed that the signal is of short duration with no more than two events and that only one onset time is assigned to each channel; more generally, multiple onset times would be required. A histogram of onset times is formed, and the average values within each of the two highest bins are selected as the event times. These times are denoted by Image and Image, and each of the k channels is assigned to Image or Image based on the minimum distance between no(k) and the two possible event times. The distance is given by Image where p = 1, 2. The resulting two clusters of channels are denoted by Image, p = 1, 2, where for each p, kp runs over a subset of the total number of bands. (For simplicity, the subscript p on k will henceforth be dropped.)
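A simplified Python sketch of this channel-to-event assignment (my own rendition of the procedure in [33]; the number of histogram bins is an arbitrary choice, and the two-event assumption stated above is taken for granted):

import numpy as np

def assign_channels_to_events(envelopes, n_bins=16):
    # envelopes[k, n]: envelope a_k[n] of the k-th channel output.
    onset = np.argmax(envelopes, axis=1)           # one onset time per channel
    counts, edges = np.histogram(onset, bins=n_bins)
    top2 = np.argsort(counts)[-2:]                 # two most populated bins
    events = []
    for b in top2:                                 # average onset within each bin
        in_bin = onset[(onset >= edges[b]) & (onset <= edges[b + 1])]
        events.append(in_bin.mean())               # assumes the bin is populated
    events = sorted(events)
    # Assign each channel to the nearer of the two event times.
    cluster = np.array([0 if abs(t - events[0]) <= abs(t - events[1]) else 1
                        for t in onset])
    return events, cluster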

Finally, based on the channel assignment, a phase correction is introduced in each channel, making the phase of the modified filter output Image at time Image equal to the phase at the event time Image in the original time scale. Denoting the phase correction for each cluster by Image, the modified channel signal becomes

Figure 8.10 Time-scale expansion (×2) using channel phase correction: (a) original; (b) expansion with phase correction at 5 ms; (c) expansion with phase correction in clustered channels; (d) expansion without phase correction.

SOURCE: T.F. Quatieri, R.B. Dunn, and T.E. Hanna, “A Subband Approach to Time-Scale Modification of Complex Acoustic Signals” [33]. ©1995, IEEE. Used by permission.

Image

(8.29)

Image

where Image and where p refers to the first or second cluster.

Short-Time Processing — To process a waveform over successive frames, we extract a signal segment every L samples, perform time-scale modification and phase correction on each segment using the phase vocoder, and then overlap and add the modified segments to give the final synthesis [33],[34],[37],[39]. We can thus think of this sequence of operations as combining the OLA and FBS synthesis on the modified STFT. Specifically, the filter-bank (satisfying the FBS constraint) modification is first applied to each windowed segment ƒmL[n] = w[mL − n]x[n]. The frame length L is set to half the window length, i.e., Nw = 2L, and the window w[n] is chosen such that Σm w[mL − n] = 1, i.e., the overlapping windows form an identity (thus satisfying the OLA constraint). The two event times for the mth frame are selected as above, and saved. The procedure is repeated for frame m + 1. However, if the most recent event from frame m falls at least L/4 samples inside the current frame m + 1, then this event is designated the first event of frame5 m + 1. With this condition, the second event time is found via the maximum of the histogram of the channel event onset times on frame m + 1 (excluding the previously chosen event time). Each channel is then assigned to a time instant based on the minimum distance between the two event times and the measured onset time no(k). In addition, a frame is allowed to have no events by setting a histogram bin threshold below which a no-event condition is declared. In this case, channel phase offsets are selected to make the channel phases continuous across frame boundaries, i.e., the phase is allowed to “coast” from the previous frame.

5 This approach is similar in style to the pitch-synchronous overlap-add method of time-scale modification introduced in, Chapter 7. However, as described in Chapter 7, methods to achieve this synchrony rely on cross-correlating adjacent frames or determining consistent time instants within a glottal cycle, and not on synchronizing the phases of a filter-bank decomposition.
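A skeleton of this short-time processing in Python (illustrative; the per-segment modification is left as a placeholder, and a triangular window is used as one choice that satisfies the half-overlap identity above, as in Example 8.2 below):

import numpy as np

def short_time_process(x, L, modify_segment):
    # Frame length L, window length N_w = 2L; the triangular window below
    # satisfies w[m] + w[m + L] = 1, so half-overlapped copies sum to one.
    Nw = 2 * L
    w = np.bartlett(Nw + 1)[:Nw]
    y = np.zeros(len(x))
    for start in range(0, len(x) - Nw + 1, L):
        seg = w * x[start:start + Nw]               # windowed segment
        y[start:start + Nw] += modify_segment(seg)  # e.g., filter-bank analysis,
    return y                                        # time scaling, phase fix

# Identity check: with no per-segment modification, overlap-add returns x[n]
x = np.random.randn(1000)
y = short_time_process(x, L=50, modify_segment=lambda s: s)
print(np.allclose(y[100:-100], x[100:-100]))        # True away from the edges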

Example 8.2        Time-scale expansion can result in improved audibility of closely-spaced components for a variety of complex acoustic signals consisting of sums of rapidly damped sinewaves such as the sounds from mechanical impacts. An example of time-scale expansion of a sequence of transients from a closing stapler is shown in Figure 8.11, demonstrating the temporal and spectral fidelity in the time-scaled reconstruction [33],[39]. The above short-time processing approach was applied with a triangular window w[n] of duration 10 ms and a 5-ms frame update, satisfying the OLA constraint. Each short-time segment was passed through a filter bank with 21 uniformly spaced filters hk[n], designed using a 2-ms prototype filter with Gaussian shape, satisfying the FBS constraint. Two event times were estimated on each frame using the histogram analysis described above. A goal in this example is to preserve the spectral envelope while time-expanding the temporal envelope of the signal. Although for the signal illustrated, the original spectrum was approximately preserved in the time-scaled signal, an observed difference is the narrowing of resonant bandwidth, a change which is consistent with stretching the temporal envelope. Image

Figure 8.11 Time-scale expansion (×2) of a closing stapler using filter-bank/overlap-add modification: (a) original and time-expanded waveform; (b) spectrograms of part (a).

SOURCE: T.F. Quatieri, R.B. Dunn, and T.E. Hanna, “A Subband Approach to Time-Scale Modification of Complex Acoustic Signals” [33]. ©1995, IEEE. Used by permission.

Image

8.4.2 Phase Coherence of Quasi-Periodic Signals

We saw in Example 8.2 that our time-scale modification technique increases the distance between short-duration events in a signal, event time instants being estimated from a very short (2 ms) duration analysis filter in the filter bank. Consequently, applying this technique to speech will modify the pitch of the speaker as well as the articulation rate during quasi-periodic voiced regions (Exercise 8.8). Nevertheless, the technique can be extended to time-scale quasi-periodic signals by lengthening the analysis filter and by phase locking within every frame.

Puckette [32] and Laroche and Dolson [18] observed that for quasi-periodic signals, reduced reverberance in synthesis can be obtained by achieving phase coherence across filter-bank channels that correspond to channel clusters. Each cluster is defined by dominant spectral regions for successive short-time segments. Specifically, for each short-time segment, a dominant spectral region is given by a “peak channel,” which is a channel whose amplitude, |X(nL, k)|, is larger than its four nearest neighbors with respect to the frequency variable k. Channel clusters are then formed around each peak channel according to the (smallest) distance of a channel center frequency from each peak. For each cluster, the channel phases are then locked to the phase of the peak. Phase locking in this case means that the original phase relations, i.e., the difference between the phase of the peak channel and channels within its cluster in the original time scale, are preserved in the new time scale. For an analysis filter length of about a few pitch periods, e.g., about 20 ms, a channel peak often occurs near high-amplitude harmonics and channel clusters are formed around these dominant spectral values. In the context of the above short-time processing approach, consider the short-time segment ƒmL[n] = w[mL − n]x[n]. Let the phase of the kth filter-bank output of this segment be denoted by Image and the corresponding phases of the pth cluster be denoted by Image. The phases of the pth cluster of the time-scaled version of this segment are then given by Image. Denote the peak channel of the pth cluster by the index Image. Then phase coherence of the modified signal can be achieved in the pth cluster by applying a phase offset correction Image such that for the kth channel within the pth cluster

Image

where n is taken at the center of the analysis and synthesis frames. The phase differences are thus preserved across channels within the pth cluster (but only at the center of the short-time segment). Because the phase of the dominant channel Image of each cluster is preserved, time-scaled short-time segments add essentially coherently across consecutive frames.
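A hedged Python sketch of this peak-channel phase locking (a simplified rendition of the idea in [18],[32]; here X_orig and X_mod denote the complex channel values of one short-time segment, taken at the frame center, before and after time scaling):

import numpy as np

def lock_phases_to_peaks(X_orig, X_mod):
    # X_orig[k], X_mod[k]: complex channel values of one frame before and
    # after modification.  Channels are clustered around local magnitude
    # peaks of |X_orig|, and within each cluster the original phase
    # differences relative to the peak channel are restored.
    mag = np.abs(X_orig)
    peaks = np.array([k for k in range(2, len(mag) - 2)
                      if mag[k] == np.max(mag[k - 2:k + 3])])
    if peaks.size == 0:
        return X_mod.copy()
    # Each channel joins the cluster of its nearest peak channel.
    owner = peaks[np.argmin(np.abs(np.arange(len(mag))[:, None] - peaks), axis=1)]
    X_locked = X_mod.copy()
    for k in range(len(mag)):
        p = owner[k]
        offset = (np.angle(X_orig[k]) - np.angle(X_orig[p])) \
                 - (np.angle(X_mod[k]) - np.angle(X_mod[p]))
        X_locked[k] = X_mod[k] * np.exp(1j * offset)
    return X_locked

Because the peak channel of each cluster receives a zero offset, its modified phase is preserved, while the other channels in the cluster recover their original phase differences with respect to that peak.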

Another approach to accomplish phase coherence in synthesis of quasi-periodic sequences by the phase vocoder was proposed by Sylvestre and Kabal [17],[49]. In time-scale modification with short-time processing, the phase of a channel is reset at the onset of each frame to its value in the original time scale; this, however, results in a discontinuity in the unwrapped phase at each frame boundary that is not necessarily a multiple of 2π. To remove the discontinuity, a fixed phase offset is first added to the channel phase, followed by a perturbation of the instantaneous frequency to enforce phase continuity (Exercise 8.5). The desired sequence is then obtained by concatenating successive modified short-time segments rather than by overlap-add synthesis.

8.5 Constant-Q Analysis/Synthesis

We have seen repeatedly throughout the text the time-frequency resolution tradeoffs associated with the short-time Fourier transform (STFT). That is, according to the uncertainty principle reviewed in Chapter 2, we cannot obtain arbitrarily good time and frequency resolution simultaneously. A short analysis window w[n] giving excellent temporal resolution implies poor frequency resolution and vice versa. Therefore, we select one “reasonable” time-frequency resolution tradeoff that is used in analyzing an entire speech waveform regardless of the speech event. Stationary sounds, e.g., steady vowels, and nonstationary sounds, e.g., plosive-to-vowel transitions, are processed with the same analysis window, typically about 20 ms in duration. This limitation can result in excessive smearing of transitions and transient sounds.

Ideally, we desire a time-frequency representation whose resolution can be adapted to the time and frequency characteristics of the sound, for example, giving good temporal resolution with rapidly changing and short-lived events and good frequency resolution in spectrally sharp regions. In this section, we do not attempt this general representation, but rather present an alternative time-frequency distribution called the wavelet transform which is one step toward more flexible time-frequency resolution. The wavelet transform achieves constant-Q resolution whereby time resolution increases and frequency resolution decreases with increasing frequency. In the context of speech processing, the importance of this transform lies in its providing a model of front-end auditory filter analysis. The goal of this section is twofold: first to describe the essential theory of the wavelet transform and compare it to the STFT from a filter bank perspective, and then to look briefly at a few of its applications to speech processing.

We begin this section by revisiting a problem in time-frequency analysis that motivates the wavelet transform. This leads to the wavelet transform approach and its theory for continuous time. We then describe a discrete-time filter-bank implementation of the wavelet transform that ties us back to the theme of this chapter. Finally, we take a brief look at its application to speech processing, including time-scale modification, pitch estimation, and coding of speech; through these applications we show the link to auditory filter-bank front-end models. This leads us to an enticing glimpse of auditory modeling in Section 8.6, the final topic of this chapter.

8.5.1 Motivation

The problem that motivates the wavelet transform is the measurement of local frequency content in nonstationary signals. To illustrate the problem, consider the signal of Figure 8.12a consisting of two high-frequency short-duration tones superimposed on the sum of two low-frequency tones. The limitation of standard Fourier analysis on signals of the type in Figure 8.12a is that it measures the frequency content of the entire signal and so does not characterize the change in frequency content with time. This observation motivated the short-time Fourier transform (STFT) of Chapter 7 given in continuous time by

$$X_a(\tau, \omega) = \int_{-\infty}^{\infty} x(t)\, w(t - \tau)\, e^{-j\omega t}\, dt$$

Figure 8.12 Low-frequency signal with superimposed high-frequency bursts: (a) waveform; (b) wideband spectrogram of (a).

Image

whose squared magnitude |Xa(τ, ω)|2 we have referred to as the spectrogram. [For later convenience, we have made a slight notational change from Equation (8.16).] This involves, as we have seen, looking at the signal through a sliding window and computing the Fourier transform under the window for each window shift τ. A limitation of the STFT is the fixed duration of the window w(t); with w(t), we cannot always simultaneously resolve short-lived events and closely-spaced long-duration tones, as illustrated in the spectrogram of Figure 8.12b. We have seen this tradeoff more quantitatively stated in the uncertainty principle that limits time-frequency resolution such that the product of duration D(x) and bandwidth B(x) of a signal x(t) must exceed a constant (Chapter 2), i.e.,

(8.30)

$$D(x)\, B(x) \geq C$$

where C is a fixed constant whose value depends on the definitions of duration and bandwidth used in Chapter 2.

In speech processing, such a limitation is of importance in identifying, for example, closely-spaced harmonic frequencies simultaneously with occurrence times of glottal pulses and short-duration plosive events (Exercise 8.12).

In defining an “ideal” time-frequency transform, acknowledging that we cannot defeat the uncertainty principle in the context of the Fourier transform,6 we seek to minimize its limitations for time-frequency localization. We also require that the transform be invertible (as with the invertibility of the STFT) and be a useful representation for numerous applications.

6 We will see in Chapter 11 that uncertainty is a function of framework (e.g., time or scale [inverse of frequency]) and definition (e.g., local or global quantities). Indeed, with a different framework and quantities of interest, the uncertainty principle need not exist or may be characterized by a different resolution constraint.

8.5.2 Wavelet Transform

An approach to dealing with the uncertainty principle, if not beating it, is to compute many spectrograms with different analysis window durations. As illustrated in Figure 8.13, the wavelet transform can be thought of as a collage of pieces of spectrograms based on different analysis windows and thus different time-frequency resolutions. Specifically, short windows are used at high frequency for good time resolution, and long windows are used at low frequency for good frequency resolution. Figure 8.14 shows how the wavelet transform time-frequency resolution cell changes with frequency, in contrast to the fixed time-frequency resolution of the STFT. We see from this perspective that we don’t defeat the uncertainty principle, i.e., the area of the cell is fixed, but rather its relative dimensions change as we move around the time-frequency plane. This concept of analysis with different resolutions was developed independently in many fields including speech, image, and seismic processing, as well as quantum mechanics. The mathematics of this multi-resolution analysis was developed in the early 1980s and gave a strong foundation for the wavelet transform, as well as provided a unifying framework for the ideas formulated in the context of different applications [8],[22].

The continuous wavelet transform (CWT) formalizes the notion of adapting time resolution to frequency. In beginning the mathematical development, we define a set of functions as the time-scaled and shifted versions of a prototype h(t), i.e.,

$$h_{\tau,a}(t) = \frac{1}{\sqrt{a}}\, h\!\left(\frac{t - \tau}{a}\right)$$

Figure 8.13 The wavelet transform as a collage of spectrograms for the multi-component signal of Figure 8.12a. A short window at high frequency gives good time resolution, while a long window at low frequency gives good frequency resolution. The lower right panel is obtained by piecing together regions of the upper four panels.

Image

where h(t) is the basic wavelet, hτ,a(t) are the associated wavelets, τ is the time shift, and a is the scaling factor, as illustrated in Figure 8.15. We then define the continuous wavelet transform (CWT) as (where * here denotes complex conjugation)

(8.31)

$$X_w(\tau, a) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, h^{*}\!\left(\frac{t - \tau}{a}\right) dt$$

which we can think of as a measure of “similarity” of x(t) with h(t) at different scales and time shifts. The factor Image is present to normalize the energy in the wavelets. With a scale factor a < 1, the basic wavelet is contracted and the resulting Image is shifted past the signal x(t), multiplied, and integrated, i.e., convolved with x(t). We can think of the wavelet transform Xw(τ, a) from a filtering viewpoint (as we did with the STFT), i.e.,

$$X_w(\tau, a) = x(\tau) * \left[\frac{1}{\sqrt{a}}\, h^{*}\!\left(\frac{-\tau}{a}\right)\right]$$

Figure 8.14 Adaptation of window size to frequency in the wavelet transform (left panel) in contrast to a fixed window in the STFT (right panel).

Image

Figure 8.15 Schematic of a basic wavelet and its associated wavelets at different scales.

Image

where * denotes convolution and where a smaller scale corresponds to wider-bandwidth filters. We refer to |Xw(τ, a)|2 as the scalogram,7 in contrast to the spectrogram |X(τ, ω)|2. An alternative interpretation of the wavelet transform is as a “zoom lens” at different time scales, i.e., we can rewrite Equation (8.31) as (Exercise 8.15)

7 The motivation for the squaring operation will become clear in Chapter 11, where we interpret both the spectrogram and scalogram as energy densities that are part of a larger class of time-frequency distributions.

(8.32)

$$X_w(\tau, a) = \sqrt{a} \int_{-\infty}^{\infty} x(at)\, h^{*}\!\left(t - \frac{\tau}{a}\right) dt$$

where now we keep the filter h(t) unscaled and scale the signal x(t). The concept of scale relates in general to time scale, but plays a role analogous to the inverse of frequency, as illustrated in the following example:

Example 8.3        Consider a basic wavelet in the form of a modulated window, i.e.,

$$h(t) = w(t)\, e^{j\omega_o t}.$$

Then the CWT becomes

$$X_w(\tau, a) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, w\!\left(\frac{t - \tau}{a}\right) e^{-j\omega_o (t - \tau)/a}\, dt$$

For each scale a, this expression can be thought of as an STFT at the frequency ωo/a with a sliding window w(t/a). The CWT adapts the window size to frequency. At a small scale a, corresponding to a high frequency ωo/a, the wavelet is narrow in time and so has a wide bandwidth. At a large scale a, corresponding to a low frequency ωo/a, the wavelet is wide in time and so has a narrow bandwidth. As we increase scale (decrease frequency) the wavelet transform analyzes the signal with a window of decreasing bandwidth, thus giving good frequency resolution for low frequency and good time resolution for high frequency. Therefore, loosely speaking, as we have discussed, we can think of the scalogram as a collage of pieces of different spectrograms. We also see from this example that scale varies inversely with frequency. A comparison of the spectrogram and the scalogram, using a 10-ms Hamming window w[n], is illustrated in Figure 8.16 for the signal of Figure 8.12 consisting of two low-frequency tones and two high-frequency clicks (bursts). We see that the scalogram reveals both the clicks and the tones, while the spectrogram, in resolving the clicks, is forced to merge the two tones. Image

Figure 8.16 Comparison of the spectrogram |X(τ, ω)|2 and scalogram |Xw(τ, a)|2 for the multi-component signal of Figure 8.12.

Image
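The following Python sketch evaluates Equation (8.31) directly for a handful of scales, using a modulated-Gaussian (Morlet-like) basic wavelet; the wavelet parameters, scale values, and test signal are arbitrary illustrative choices, meant only to show how large scales give narrow bandwidth and small scales give fine time resolution.

import numpy as np

def cwt_scalogram(x, fs, scales, f0=800.0, sig=0.004):
    """
    Direct evaluation of Equation (8.31) on a dense time grid (up to a factor
    of the sampling interval):
      X_w(tau, a) = (1/sqrt(a)) * integral of x(t) * conj(h((t - tau)/a)) dt,
    with an illustrative modulated-Gaussian basic wavelet
      h(t) = exp(-t^2 / (2*sig^2)) * exp(j*2*pi*f0*t).
    """
    t = np.arange(-0.04, 0.04, 1.0 / fs)          # finite support for the wavelet
    rows = []
    for a in scales:
        # Scaled wavelet h(t/a) with the 1/sqrt(a) energy normalization;
        # its Gaussian width is a*sig and its center frequency is f0/a.
        h_a = (1.0 / np.sqrt(a)) * np.exp(-0.5 * (t / (a * sig)) ** 2) \
              * np.exp(1j * 2 * np.pi * (f0 / a) * t)
        # Correlation with x = convolution with the conjugated, time-reversed wavelet.
        rows.append(np.convolve(x, np.conj(h_a[::-1]), mode='same'))
    return np.abs(np.array(rows)) ** 2             # scalogram |X_w(tau, a)|^2

# Two closely spaced low-frequency tones plus a short high-frequency burst.
fs = 8000
tt = np.arange(0, 0.5, 1.0 / fs)
x = np.cos(2 * np.pi * 200 * tt) + np.cos(2 * np.pi * 230 * tt)
x[2000:2040] += 3.0 * np.cos(2 * np.pi * 3000 * tt[2000:2040])
S = cwt_scalogram(x, fs, scales=[0.25, 0.5, 1.0, 2.0, 4.0])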

As with the STFT, we want to consider the problem of the invertibility of the wavelet transform, i.e., the conditions under which we can recover a signal x(t) from its wavelet transform. It can be shown that for a large class of basic wavelets, h(t), x(t) can be recovered from a superposition of wavelets hτ,a(t), i.e., the inverse continuous wavelet transform (ICWT)

(8.33)

$$x(t) = \frac{1}{c_h} \int_{0}^{\infty} \int_{-\infty}^{\infty} X_w(\tau, a)\, h_{\tau,a}(t)\, d\tau\, \frac{da}{a^2}$$

under the condition

$$c_h = \int_{-\infty}^{\infty} \frac{|H(\omega)|^2}{|\omega|}\, d\omega < \infty$$

which is called the admissibility condition (Exercise 8.13). This condition implies that h(t) has zero mean, i.e., H(0) = ∫ h(t) dt = 0, because at ω = 0 the denominator in the above condition is zero; having zero mean, h(t) must “wiggle” [8]. In addition, the admissibility condition requires that H(ω) not decay “too slowly” in frequency and thus that H(ω) have a bandpass characteristic; furthermore, it can be shown (from Parseval’s theorem) that the time response h(t) also cannot decay “too slowly” [8]. Therefore, we see the motivation for the nomenclature wavelet. Equation (8.33) implies that x(t) can be written as a superposition of shifted and dilated wavelets.
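As a quick numerical illustration of the admissibility condition, the Python sketch below checks that a modulated-Gaussian wavelet (an arbitrary illustrative choice, not a wavelet used in the text) has essentially zero mean and evaluates the admissibility integral on a discrete frequency grid, excluding the ω = 0 bin.

import numpy as np

fs = 16000.0
t = np.arange(-0.05, 0.05, 1.0 / fs)
sig, f0 = 0.002, 1000.0
# Real modulated-Gaussian wavelet (illustrative basic wavelet).
h = np.exp(-0.5 * (t / sig) ** 2) * np.cos(2 * np.pi * f0 * t)

# Zero-mean check: H(0) = integral of h(t) dt should be (very nearly) zero.
print("mean (H(0)):", np.trapz(h, t))

# Discrete approximation of the admissibility integral
#   c_h = integral over omega of |H(omega)|^2 / |omega| d omega,
# which must be finite; the omega = 0 bin is excluded on the discrete grid,
# and the factor of 2 accounts for negative frequencies of the real h(t).
H = np.fft.rfft(h) / fs
f = np.fft.rfftfreq(len(h), d=1.0 / fs)
omega = 2 * np.pi * f[1:]
c_h = 2 * np.trapz(np.abs(H[1:]) ** 2 / omega, omega)
print("approximate admissibility constant c_h:", c_h)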

The relation in Equation (8.33) also lends itself to a basis function interpretation whereby the continuous wavelet transform Xw(τ, a) measures the projection of x(t) onto the basis hτ,a(t), i.e., Xw(τ, a) is the inner product of x(t) and hτ,a(t). (In this chapter, we use the term “basis” loosely. Strictly, a function set is a basis for a vector space of signals if it provides a unique representation of each signal in the space [3].) The above invertibility result holds even though the elements of the wavelet basis hτ,a(t) are generally not orthogonal;8 to be orthogonal, the inner product of any two different wavelets must be zero.

8 A similar synthesis formula can be found for the STFT which is different from our inversion formula Equation (7.8) of Chapter 7 [8]. The basis here is the shifted and modulated window w(t), fixed in its time/frequency resolution. As with the continuous wavelet transform, the STFT basis is not orthogonal.

8.5.3 Discrete Wavelet Transform

Implementation of the CWT and ICWT requires discretizing the scale a, shift τ, and time t. As with the discrete STFT, we also face the issue of synthesis from the discretized CWT. We begin by discretizing shift and scale. We can think of sampling the CWT in scale a and shift τ to form a set of wavelet coefficients cn,m called the discrete wavelet transform of x(t):

$$c_{n,m} = \int_{-\infty}^{\infty} x(t)\, h^{*}_{n,m}(t)\, dt$$

where

$$h_{n,m}(t) = \frac{1}{\sqrt{a_m}}\, h\!\left(\frac{t - \tau_n}{a_m}\right)$$

The wavelet coefficients cn,m represent the inner product of x(t) with the discretized wavelet basis hn,m(t), which are the original wavelets sampled in scale and in shift. The wavelet coefficients are analogous to the coefficients of the discrete STFT, X(n, k). Under certain conditions the discretized basis is orthogonal and complete9 and so reconstruction of x(t), i.e., the inverse discrete wavelet transform, is given by (Exercise 8.13)

9 A set of wavelets defines a complete basis if any signal can be reconstructed from linear combinations of the wavelet basis.

(8.34)

$$x(t) = \sum_{n} \sum_{m} c_{n,m}\, h_{n,m}(t)$$

where the orthogonality condition is expressed as

$$\int_{-\infty}^{\infty} h_{n,m}(t)\, h^{*}_{p,q}(t)\, dt = 0 \quad \text{for } (n, m) \neq (p, q)$$

i.e., the inner product of hn,m(t) and hp,q(t) is zero when the wavelets are different. The sampling requirements on shift and scale for the wavelet basis hn,m(t) to satisfy orthogonality, and thus invertibility through Equation (8.34), are very different than for invertibility with the STFT because the (standard) STFT requires a uniform sampling in frequency and a uniform sampling in time (time decimation) (Figure 8.18). In general, it is not easy to find a basis (derived from the basic wavelet), together with a sampling strategy, to meet the orthogonality condition. Nevertheless, under certain conditions, it is possible to invert the discrete wavelet transform even when the basis is not orthogonal [8]. Orthogonality, however, makes the inversion process straightforward [via Equation (8.34)] and leads to an efficient filtering implementation of the discrete wavelet transform and its inverse [22].

If the wavelet basis hn,m(t) constitutes what is referred to as a frame, the reconstruction of a signal x(t) is always possible [8],[22]. The basis hn,m(t) is a frame if there exists some A > 0 and B > 0 such that for all x(t)

(8.35)

$$A\, \|x\|^2 \leq \sum_{n} \sum_{m} \left| \langle x, h_{n,m} \rangle \right|^2 \leq B\, \|x\|^2$$

where the inner product

$$\langle x, h_{n,m} \rangle = \int_{-\infty}^{\infty} x(t)\, h^{*}_{n,m}(t)\, dt$$

and the norm

$$\|x\|^2 = \int_{-\infty}^{\infty} |x(t)|^2\, dt$$

The values of A and B reflect the degree of redundancy of the basis. When A = B = 1 (and the wavelets have unit norm), we have a tight, non-redundant frame; the wavelet basis can then be shown to be orthogonal and thus to have the reconstruction formula given by Equation (8.34). When the frame condition Equation (8.35) holds, but orthogonality does not, the basis is redundant and a more involved reconstruction formula is required [8],[22].

One particularly natural sampling in scale and shift is dyadic (or octave) sampling, where the scales are am = 2^m for m = 1, 2, 3, … and the shifts at each scale are τn = n·am for n = 1, 2, 3, …. This is considered a “natural” sampling because as the scale increases by a factor of two (i.e., the bandwidth of the wavelet decreases by a factor of two), the sampling rate of the shift decreases by a factor of two (half the bandwidth requires half the sampling rate). In other words, the wavelets are partitioned in octave bands and the time (shift) sampling is commensurate with bandwidth. A view of this dyadic sampling is shown in Figure 8.17. The dyadic wavelet basis is then given by

$$h_{n,m}(t) = 2^{-m/2}\, h\!\left(\frac{t - n 2^m}{2^m}\right) = 2^{-m/2}\, h\!\left(2^{-m} t - n\right)$$

Figure 8.17 Sampling of scale and shift in a dyadic wavelet basis. The wavelets are partitioned in octave bands and the time shift is commensurate with bandwidth: am = 1, 2, 4, …, 2^m, … and τn = n·am.

Image

From a signal processing perspective, the dyadic discrete wavelet transform can be considered as the output of a filter bank with constant-Q, octave-band, bandpass filters with impulse responses 2^{−m/2} h(2^{−m}t − n). A comparison of the increasing filter bandwidths of dyadic wavelets and the uniform filter bandwidths of the discrete STFT (discrete in frequency and continuous in time), as well as the corresponding filter impulse responses, is shown in Figure 8.18. In part because of the efficient implementation and the auditory- and visual-cortex-like time-frequency properties of dyadic wavelets, a large part of wavelet theory has involved finding dyadic wavelet bases that are orthogonal and that are useful in a variety of applications [22].

Figure 8.18 Comparison of the sampling requirements for the discrete STFT and the discrete dyadic wavelet transform from a filtering perspective. Panels (a) and (d) show the required filters in frequency, while (b) and (e) show their counterparts in time. The discrete STFT filters have constant bandwidth while the discrete dyadic wavelets have constant-Q bandwidth. Panels (c) and (f) give the respective time-frequency “tiles” that represent the essential concentration of the basis in the time-frequency plane.

SOURCE: O. Rioul and M. Vetterli, “Wavelets and Signal Processing” [43]. ©1991, IEEE. Used by permission.

Image

Example 8.4        One basic wavelet that has found widespread use is a function that approximates a differentiator, illustrated in Figure 8.19a. A fascinating observation by Mallat [22] is that this wavelet (among others)10 satisfies certain properties that allow a signal to be reconstructed from local maxima (with respect to shift τ) of |Xw(τ, a)| over a sampled scale, i.e., from the maxima of the discrete set of filter-bank outputs. The maxima of the wavelet channel outputs indicate fast variations in the signal. An example of Xw(τ, a) sampled at a dyadic scale and the resulting maxima is shown in Figure 8.19. The original signal and its superimposed reconstruction are also given. The reconstruction uses an iterative algorithm similar in style to the STFT magnitude-only iterative reconstruction algorithm of Chapter 7. As stated, the wavelet filter maxima correspond to sharp changes and transients in a signal (as can be seen in the figure) and as such are useful in a variety of contexts, such as pitch and glottal closure estimation described in the following section, and in modeling aural data reduction via sampling the auditory front-end filter-bank outputs that are described later in this chapter. Image

10 It is sufficient that the basic wavelet be the derivative of a function with energy concentrated near the frequency origin [22].

Figure 8.19 Signal representation by maxima of the discrete wavelet transform with respect to shift: (a) a wavelet chosen to approximate a differentiator; (b) a signal and its superimposed (essentially indistinguishable) reconstruction from wavelet maxima; (c) wavelet transform outputs Xw(τ, a) sampled at a dyadic scale; (d) points of wavelet maxima of (c), i.e., max |Xw(τ, a)| with respect to τ.

SOURCE: S. Mallat and W.L. Hwang, “Singularity Detection and Processing with Wavelets” [23]. ©1992, IEEE. Used by permission.

Image

Our last sampling consideration is that of continuous- to discrete-time conversion of the wavelet analysis and synthesis. For a discrete-time sequence x[n], the wavelet transform, discrete in time as well as shift and scale, is denoted here by Xw(n, m) (analogous to the discrete STFT):

$$X_w(n, m) = \sum_{k=-\infty}^{\infty} x[k]\, h^{*}_{n,m}[k]$$

We give without proof a filtering implementation for the particular discrete-time dyadic wavelet transform. Consider a discrete-time signal x[n] derived by sampling x(t). The filtering implementation of the forward transform (Figure 8.20) is given by an iterative cascade of identical stages, each stage consisting of a lowpass [by P(ω)] and highpass [by Q(ω)] decomposition of the signal followed by two-to-one downsampling. The sequences p[n] and q[n] are derived from the basic discrete-time wavelet h[n]; specifically, p[n] = (−1)^n h[−n + 1] and q[n] = h[n] [8]. It is remarkable that downsampling the highpass filter outputs at each stage (each stage representing different scales) gives the wavelet coefficients at different scales of the original continuous-time function x(t) [8]. This operation can be thought of as half-band splitting of the signal at each stage and thus is equivalent to a constant-Q, octave-band analysis, as illustrated in Figure 8.18d. A similar iterative structure can be used for inverting the wavelet transform from the wavelet coefficients. The condition for invertibility on the basic discrete-time wavelet h[n], when dyadically sampled in shift and scale, is intimately related to the “perfect reconstruction” constraint on quadrature mirror and conjugate mirror digital filters (i.e., the output of the filter bank configuration equals the input) [7],[48]. The equivalence of the constraint for invertibility of a dyadic wavelet basis and for perfect reconstruction of quadrature and conjugate mirror digital filter banks gives an important relation of wavelet theory to more traditional concepts in digital signal processing [3],[8],[22].

Figure 8.20 Iterative filtering implementation of discrete wavelet transform with dyadic orthogonal wavelets. Each filter in 8.18d represents the output of the highpass filter at each stage prior to downsampling. A similar iterative structure exists for the discrete inverse wavelet transform.

Image
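The analysis cascade of Figure 8.20 can be sketched in a few lines of Python. The Haar pair is used below purely for simplicity: q[n] is the discrete highpass wavelet and p[n] the matching lowpass filter (equal, up to an overall sign, to (−1)^n q[−n + 1]); the depth, test signal, and convolution phase conventions are illustrative assumptions.

import numpy as np

# Haar analysis pair, chosen only for simplicity.
q = np.array([1.0, -1.0]) / np.sqrt(2.0)   # highpass Q(omega), the discrete wavelet
p = np.array([1.0,  1.0]) / np.sqrt(2.0)   # lowpass  P(omega)

def dwt_analysis(x, depth):
    """
    Iterative cascade in the style of Figure 8.20: at each stage the signal is
    split into lowpass and highpass branches and downsampled by two; the
    downsampled highpass outputs are the wavelet coefficients at that scale.
    """
    coeffs = []
    approx = np.asarray(x, dtype=float)
    for _ in range(depth):
        detail = np.convolve(approx, q, mode='full')[::2]   # highpass, keep every other sample
        approx = np.convolve(approx, p, mode='full')[::2]   # lowpass, keep every other sample
        coeffs.append(detail)                               # wavelet coefficients at this scale
    coeffs.append(approx)                                   # final lowpass residual
    return coeffs

# Usage: three-stage (octave-band) decomposition of a short test signal.
x = np.sin(2 * np.pi * 0.3 * np.arange(64)) + 0.1 * np.random.randn(64)
c = dwt_analysis(x, depth=3)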

8.5.4 Applications

In this section, we briefly describe a few applications of the wavelet transform, beginning with a counterpart to the magnitude-only STFT speech modification algorithm developed in Chapter 7.

Time-Scale Modification — We will see in the following section that a simple model of front-end auditory processing is that of a wavelet transform along the basilar membrane which is located within the cochlear duct of the inner ear. The cochlear duct is a coiled tube, consisting of hard bone, and is filled with fluid. The basilar membrane is coiled within the cochlear duct and vibrates with an input stimulus. Each place along the membrane is characterized by a different resonant-like frequency response with the peak frequency and bandwidth decreasing as we move away from the opening of the cochlear duct, the auditory filters being approximately constant-Q. The reader may want to look ahead to Figures 8.24 and 8.25 for illustrations of the cochlear anatomy and associated filter frequency responses.

We have seen that the wavelet transform requires wavelet filters derived by scaling a basic wavelet. Irino and Kawahara [14],[15] used the above cochlear model for selecting a basic “auditory wavelet.” The basic auditory wavelet is selected as the impulse response along the basilar membrane of a cochlear filter with peak frequency about 873 Hz, which is almost the center of the audible range on a logarithmic frequency scale. This function satisfies the admissibility condition and thus provides for a continuous wavelet basis for signal representation. This basic wavelet is then scaled11 to form a discrete wavelet basis consisting of 128 channels from 55 Hz to 15 kHz. In the Irino-Kawahara auditory wavelet basis, the shift occurs every 0.5 ms. The resulting wavelet basis is not orthogonal, being highly redundant. The transform based on this wavelet filter set is referred to as the auditory wavelet transform (AWT), and the wavelet coefficients obtained from the AWT analysis of the signal are referred to as the AWT coefficients [14],[15]. Figure 8.21 shows frequency responses of the AWT on a logarithmic frequency scale over eight octaves compared against that of a cochlear filter model based on physical principles [15]. The amplitudes of the AWT were selected to match those of the cochlear filters at peak frequency. It is seen in Figure 8.21 that the essential characteristics of the cochlear filters are captured by the AWT. The analytic signal of each filter output is computed via the Hilbert transform (Chapter 2) and the resulting magnitude and phase are used for signal analysis. A possible advantage of the AWT over the STFT is that it may better represent information in auditory channels used in human perception.

11 Because the scaling values are not necessarily rational, simple sampling rate conversion is not used to obtain the scaled wavelets. A cubic spline interpolation is invoked instead. Details of this scaling technique are given in [14],[15].
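A rough Python sketch of building such a scaled, log-spaced wavelet filter bank follows. The stand-in prototype wavelet, the reduced channel count (16 instead of 128), the use of scipy.signal.resample in place of the cubic-spline scaling of [14],[15], and the energy normalization are all assumptions for illustration.

import numpy as np
from scipy.signal import resample, hilbert

fs = 32000
f0 = 873.0                                   # peak frequency of the basic wavelet (Hz)
t = np.arange(-0.01, 0.01, 1.0 / fs)
# Stand-in basic "auditory" wavelet: a damped, modulated envelope (an assumption;
# the actual basic wavelet is a cochlear-filter impulse response).
proto = (np.abs(t) ** 3) * np.exp(-2 * np.pi * 120 * np.abs(t)) * np.cos(2 * np.pi * f0 * t)

# Log-spaced channel center frequencies (16 shown here instead of 128).
centers = np.geomspace(55.0, 15000.0, 16)

filters = []
for fc in centers:
    new_len = max(8, int(round(len(proto) * f0 / fc)))   # dilate/compress in time
    h_c = resample(proto, new_len) * np.sqrt(fc / f0)    # simple energy normalization
    filters.append(h_c)

def awt_channel(x, h_c):
    """One channel: filter, then take the analytic signal for magnitude/phase."""
    y = np.convolve(x, h_c, mode='same')
    return hilbert(y)                                    # complex analytic output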

Irino and Kawahara developed algorithms for reconstructing a signal from the modified wavelet transform for the purpose of time-scale modification. Suppose we are given a discrete auditory wavelet transform Xw(nL, m), where nL refers to the uniform shift with L being the number of samples in 0.5 ms. One approach to obtain a modified transform for time-scale modification is to pretend that Xw(nL, m) was determined with a different time shift M and to form the desired modified auditory wavelet transform

Yw(nM, m) = Xw(nL, m).

Figure 8.21 Frequency response at each octave of (b) the AWT on a log-linear scale compared against that of (a) a cochlear filter model based on physical principles [14],[15]. The frequency responses of the cochlear model are specified by locations along the basilar membrane relative to the cochlear duct opening. The amplitudes of the AWT were selected to match those of the cochlear filter model at peak frequency.

SOURCES: T. Irino and H. Kawahara, “Signal Reconstruction from Modified Auditory Wavelet Transformation” [14]. ©1993, IEEE; “Signal Reconstruction from Modified Wavelet Transformation—An Application to Auditory Signal Modeling” [15]. ©1992, IEEE. Used by permission.

Image

This formulation is analogous to the approach we took in Chapter 7 for time-scale modification from the STFT. Also similar to the approach in Chapter 7, we then form a mean-squared-error distance metric

(8.36)

$$D = \sum_{n} \sum_{m} \left| Y_w(nM, m) - X_w^{e}(nM, m) \right|^2$$

that is minimized with respect to the unknown signal xe[n] embedded within its AWT X_w^e(nM, m). Unlike the counterpart signal estimation from the STFT, the resulting inverse problem is essentially impossible to solve due to the changing window (filter) associated with each wavelet channel (Exercise 8.16). In addition, as with the STFT, the phase of the modified AWT is such that it results in inconsistent discrete-wavelet-transform slices at successive time shifts. Thus, a magnitude-only AWT reconstruction algorithm was proposed [14],[15], similar to that for magnitude-only STFT iterative reconstruction in Chapter 7, for a least-squared-error solution (Exercise 8.16). The following example illustrates the technique:

Example 8.5        The AWT magnitude |Xw(nL, m)| with a 0.5-ms time shift was computed for the speech utterance “kanojowa.” For a time-scale compression by a factor of two, the modified magnitude function |Yw(nM, m)| was formed with M = L/2 so that the time shift is assumed to be half the original. The initial starting sequence in the magnitude-only (least-squared-error) iteration was white random noise. Figure 8.22 shows an example of the original and time-compressed AWT magnitude along time, as well as the AWT magnitude of the reconstruction after 20 iterations. The modified waveform was judged to be smooth and free of artifacts [15]. Image

Figure 8.22 Time-scale modification with iterative least-squared-error estimation from a modified AWT magnitude: (a) original AWT magnitude for 12 channels with a 0.5-ms time shift; (b) modified AWT magnitude with an assumed 0.25-ms time shift; (c) AWT magnitude of the least-squared-error estimate of (b) for 20 iterations.

SOURCES: T. Irino and H. Kawahara, “Signal Reconstruction from Modified Auditory Wavelet Transformation” [14]. ©1993, IEEE; “Signal Reconstruction from Modified Wavelet Transformation—An Application to Auditory Signal Modeling” [15]. ©1992, IEEE. Used by permission.

Image
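The magnitude-only iteration used in Example 8.5 has the same structure as the STFT iteration of Chapter 7 and can be sketched as below; here awt and iawt are hypothetical placeholders for a forward auditory wavelet analysis and an approximate least-squared-error synthesis, which are not reproduced here.

import numpy as np

def magnitude_only_reconstruction(Y_mag, awt, iawt, n_samples, n_iter=20, seed=0):
    """
    Estimate a waveform whose AWT magnitude matches a target |Y_w(nM, m)|,
    in the style of the magnitude-only STFT iteration of Chapter 7.
    `awt(x)` -> complex coefficient array and `iawt(C)` -> waveform are assumed,
    hypothetical analysis/synthesis routines supplied by the caller.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)            # white-noise initial sequence
    for _ in range(n_iter):
        C = awt(x)                                # analyze the current estimate
        C = Y_mag * np.exp(1j * np.angle(C))      # keep its phase, impose target magnitude
        x = iawt(C)                               # resynthesize
    return x

Each pass keeps the phase of the current estimate and substitutes the target magnitude, mirroring the magnitude-substitution step of the Chapter 7 iteration.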

Pitch Estimation — Kadambe and Boudreaux-Bartels [16] have applied a dyadic wavelet transform to the problem of estimating pitch and glottal closure time. We saw earlier in Example 8.4 that, with an appropriate selection of a basic wavelet, i.e., the derivative of a function with energy concentrated at low frequencies, the local maxima of the transform indicate abrupt changes in a signal, and these local maxima are manifested across several scales.12 We have seen in Chapters 3 and 5 that vocal cords can close abruptly, thus introducing a sharp negative pulse in the glottal flow derivative at the time of glottal closure. Kadambe and Boudreaux-Bartels found that this abrupt change manifests itself as local maxima in a dyadic wavelet basis across several dyadic scales, and that these local maxima can be exploited for estimation of glottal closure time and pitch. This concept is consistent with certain models of auditory processing [5],[9],[46],[47],[51], as we will see later in this chapter, and also consistent with one of the earliest successful pitch estimators by Gold and Rabiner [12].

12 This same property has been found empirically for certain uniformly-spaced filter banks and was used (implicitly) earlier in Section 8.4.1 in determining event times for phase locking in the phase vocoder.

Using a cubic spline function [22] as the basic wavelet (centered at 8000 Hz), Kadambe and Boudreaux-Bartels determined maxima at five different scales, a = 2^1, 2^2, …, 2^5, with wavelet channel outputs sampled at 0.1-ms time shifts. If the locations of thresholded maxima agree to within a small difference across two scales, then an average of these locations is said to be a time of glottal closure. The pitch period is then estimated as the time difference between two glottal closure instants. Kadambe and Boudreaux-Bartels compared pitch estimation using the cubic spline wavelet with two standard pitch estimators: (1) the cepstral-based pitch estimator described in Chapter 6, and (2) the autocorrelation-based pitch estimator that we will describe in Chapter 10. Both of these classic pitch estimators require stationarity of pitch and vocal tract over an analysis window of about 2–3 pitch periods in duration, in contrast to the wavelet-based approach that uses local maxima of the wavelet transform magnitude as its feature. As such, the wavelet-based approach was able to more accurately track a time-varying pitch, while performing comparably for stationary pitch [16]. This property is illustrated in the following example:

Example 8.6        Using the basic wavelet of Figure 8.19a (i.e., an approximate differentiator), wavelet channel outputs are computed at five different scales, a = 2^1, 2^2, …, 2^5, with 0.01-ms time shifts. The input is a nasal speech sound with rapid pitch modulation. Figure 8.23 shows the maxima of the wavelet transform magnitude for the five channels. We see a correlation of maxima across these scales that corresponds to the rapidly varying pitch period. Image
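The cross-scale coincidence rule of [16] can be sketched as follows in Python; the peak threshold, the coincidence tolerance, and the use of scipy.signal.find_peaks are illustrative choices rather than the exact procedure.

import numpy as np
from scipy.signal import find_peaks

def glottal_closures_from_scales(Xw_mag, fs, tol_ms=1.0, rel_thresh=0.5):
    """
    Xw_mag : list of |X_w(tau, a)| arrays, one per dyadic scale (same length).
    Returns estimated glottal closure instants (samples): locations where
    thresholded maxima agree across two consecutive scales within tol_ms,
    plus the pitch periods implied by consecutive closures.
    """
    tol = int(tol_ms * 1e-3 * fs)
    peaks_per_scale = []
    for mag in Xw_mag:
        pk, _ = find_peaks(mag, height=rel_thresh * mag.max())
        peaks_per_scale.append(pk)

    gci = []
    for pk_a, pk_b in zip(peaks_per_scale[:-1], peaks_per_scale[1:]):
        for t0 in pk_a:
            close = pk_b[np.abs(pk_b - t0) <= tol]
            if close.size:
                gci.append(int(round((t0 + close[0]) / 2.0)))   # average location
    gci = np.unique(gci)
    periods = np.diff(gci) / fs          # pitch periods between closures (seconds)
    return gci, periods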

Subband Speech Coding — We saw in Section 8.5.3 that the discrete wavelet transform, using a dyadic wavelet basis, performs constant-Q analysis to obtain the wavelet coefficients, and can be implemented with an iterative lowpass- and highpass-band splitting of the waveform. Almost two decades prior to the wide use of the wavelet transform, a similar analysis and implementation was performed with subband filtering for the application of speech coding using the quadrature mirror and conjugate mirror filters and conditions for perfect reconstruction alluded to earlier [7],[22],[48]. We return to subband coding in Chapter 12.

Figure 8.23 Pitch period tracking using a basic wavelet approximating a differentiator (Figure 8.19a) for a nasal with rapid pitch modulation: (a) maxima of wavelet transform magnitude for five different scales a = 2^1, 2^2, …, 2^5 with 0.01-ms time shifts; (b) voiced nasal speech waveform corresponding to (a). Maxima are given by sample values greater than two adjacent neighboring sample values.

Image

8.6 Auditory Modeling

The major divisions of the peripheral auditory system—the outer ear, middle ear, and inner ear—are shown in Figure 8.24 [13],[27],[30]. Sound first enters the outer ear through the pinna, which is situated external to the head and helps to localize sounds. Sound then travels down the auditory canal and results in vibration of the eardrum, which is the component of the outer ear that connects to the middle ear. The middle ear consists of three bones—the malleus, incus, and stapes—that act as a transformer to efficiently transport the vibrations of the eardrum to the inner ear. The middle ear is connected to the inner ear by way of the oval window. From the perspective of aural perception, the major component of the inner ear is the cochlea, which is a coiled tube having the appearance of a snail and is filled with fluid. A schematic of the uncoiled cochlea is shown in Figure 8.25a. Running about midway along the length of the cochlea and within the cochlear fluid is the basilar membrane that is held to the cochlea by bone.

Vibrations of the eardrum result in movement of the oval window that generates a compression sound wave in the cochlear fluid. This compression wave, in turn, causes a vertical vibration of the basilar membrane. Along the basilar membrane are located nearly 10000 inner hair cells in a regular geometric pattern. Embedded within the membrane are the cell bodies, above which protrude short stiff hairs (Figure 8.25b) that deflect when the basilar membrane vibrates. This deflection causes a chemical reaction within the cell bodies which finally leads to a “firing” of short-duration electrical (voltage) pulses in the nerve fibers that connect to the bottom of each inner hair cell. The nerve fibers from all inner hair cells are bunched together to form the auditory nerve (Figure 8.24). Electrical pulses run along the auditory nerve and ultimately reach the higher levels of auditory processing in the brain, where they are perceived as sound.

Figure 8.24 Primary anatomical components of the peripheral auditory system are the outer, middle, and inner ear.

SOURCE: P.H. Lindsay and D.A. Norman, Human Information Processing: An Introduction to Psychology, Academic Press, New York, NY, 1972. ©1972, Harcourt, Inc. Reproduced by permission of the publisher.

Image

Figure 8.25 Schematic of front-end auditory processing and its model as a wavelet transform: (a) the uncoiled cochlea; (b) the transduction to neural firings of the deflection of hairs that protrude from the inner hair cells along the basilar membrane; (c) a signal processing abstraction of the cochlear filters along the basilar membrane. The filter tuning curves, i.e., frequency responses, are roughly constant-Q with bandwidth decreasing logarithmically from the oval window to the scala vestibuli.

SOURCE FOR PANEL (a): D.M. Green, An Introduction to Hearing, [13]. ©1976, D.M. Green. Used by permission.

Image

When the ear is excited by an input stimulus, different regions of the basilar membrane respond maximally to different frequencies, i.e., a frequency “tuning” occurs along the membrane. We can therefore think of the response patterns as due to a bank of cochlear filters along the basilar membrane (Figure 8.25c). Measurements show a roughly logarithmic increase in bandwidth of these filters, i.e., the filters are approximately13 constant-Q in their frequency response, with bandwidth decreasing as we move away from the cochlear opening at the oval window toward the scala vestibuli at the end of the basilar membrane. We saw earlier a set of modeled cochlear filters (Figure 8.21a) over an eight-octave range from 55 Hz to 15 kHz. The filters are characterized by an asymmetric frequency response with a steeper fall-off to the right of their peak frequency than to the left (even on a linear frequency scale). The peak frequency at which maximum response occurs is referred to as the characteristic frequency of the cochlear filter. As we saw in our discussion of time-scale modification in Section 8.5.4, a simple model of the inner-ear front-end auditory processing, therefore, is that of a wavelet transform along the vertically oscillating basilar membrane. This wavelet representation of the cochlear filters was introduced by Yang, Wang, and Shamma [51]. The constant-Q cochlear filter bank thus provides a range of analysis window (filter) durations and bandwidths with which to analyze the signal at different frequencies (Figure 8.25c). Rapidly varying signal components and events (e.g., the attack of the glottal pulse and plosives) are better analyzed with shorter windows than are slower components (e.g., low-frequency harmonics of vowels). As with our characterization of the filter-bank outputs in the phase vocoder, the output of each cochlear filter can be thought of as an amplitude- and frequency-modulated sinewave. The envelope of each output and its instantaneous frequency and phase are important features that are responsible for exciting particular neural firing patterns used by the higher levels of auditory processing in the perception of speech.

13 Below about 800 Hz, the cochlear filters have almost equal bandwidth.

Our goal in this section is to give three different perspectives of signal processing in the auditory system that relate to the filter-bank analysis theme of this chapter. We look first at the output of the cochlear filter bank as amplitude- (AM) and frequency-(FM) modulated sinewaves and describe a simple model as to how this AM and FM generally influences firing rate of the inner hair cells. We next add to this model auditory processing elements that give improved joint time and frequency resolution, as well as enhanced sensitivity to temporal and spectral change in a signal. Finally, we investigate how different rates of change in signals within the AM-FM sinewave representation may be used and processed by different auditory components within the auditory pathway. These different perspectives are useful in motivating certain speech signal processing techniques to follow, and also lay the groundwork for auditory principles, such as critical band theory and perceptual masking, that will be described and used as needed in other parts of the text.

8.6.1 AM-FM Model of Auditory Processing

Consider a quasi-periodic waveform, in continuous time, consisting of a sum of sinewaves with slowly varying amplitude Ap(t) and frequency Ωp(t), each sinewave of the form xp(t) = Ap(t) cos[θp(t)] with θp(t) the running integral of the instantaneous frequency Ωp(t). If a single sinewave were to enter each cochlear filter, then the amplitude and phase of each filter output are governed by the single sinewave input. Let us suppose that one objective of auditory processing is to measure (and perceive) the amplitude and frequency of the sinewave. The nerve fibers attached to the bottom of each inner hair cell fire when the envelope of the output is large enough, i.e., if Ap(t) > T (T being a threshold of firing); furthermore, an increase in the envelope gives an increase in the firing rate. The average firing rate associated with a particular inner hair cell is obtained by integrating the firing rates over many nerve fibers that are attached to each hair cell. This property implies that the spectral content of the input sound is traced out by measuring the average firing rate of the fiber bundles along the basilar membrane because the amplitude of a sinewave Ap(t) entering each cochlear filter is determined by samples of the vocal tract spectrum (Figure 8.26). This model is called the place theory of hearing because the input spectrum is reflected in the average firing rate associated with each cochlear filter tuned to a particular frequency along the basilar membrane [9],[13],[27],[30]. This simple model is appropriate for high-pitched speakers because the harmonic spacing is large enough so that one sinewave may indeed effectively enter only one cochlear filter. If there is no interaction at higher auditory levels across cochlear channels, then sinewaves in the input are viewed independently of one another.
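As a rough caricature of this place-theory picture (a signal-processing toy, not a physiological model), the sketch below filters a signal into roughly constant-Q channels, takes each channel envelope Ap(t) from the analytic signal, and maps the portion of the envelope above a firing threshold into an average rate per channel; the filter Q, threshold, and rate constant are arbitrary assumptions.

import numpy as np
from scipy.signal import butter, lfilter, hilbert

def place_theory_rates(x, fs, centers_hz, T=0.05, max_rate=200.0):
    """
    Average 'firing rate' per channel: proportional to how much the channel
    envelope A_p(t) exceeds a firing threshold T (an illustrative rule).
    """
    rates = []
    for fc in centers_hz:
        # Roughly constant-Q bandpass filter (about +/- 12.5% of fc, assumed).
        lo, hi = fc * 0.875, fc * 1.125
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype='band')
        y = lfilter(b, a, x)
        env = np.abs(hilbert(y))                 # envelope A_p(t)
        drive = np.clip(env - T, 0.0, None)      # fire only when envelope > T
        rates.append(max_rate * drive.mean())    # average rate over the channel
    return np.array(rates)

# Example: a synthetic vowel-like sum of harmonics roughly traces its
# spectrum across the channel center frequencies.
fs = 8000
t = np.arange(0, 0.2, 1.0 / fs)
x = sum(A * np.cos(2 * np.pi * k * 125 * t)
        for k, A in zip(range(1, 17), np.hanning(32)[16:]))
centers = np.linspace(200, 3400, 17)
print(place_theory_rates(x, fs, centers))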

For lower-pitched speakers, on the other hand, more than one sinewave will enter a cochlear filter. (This is more likely to occur at high frequency than at low frequency where the filters are more narrow.) In this case, the envelope is determined by a sum of sinewaves over the cochlear filter bandwidth. Nevertheless, the envelope still reflects the average energy in the input spectrum over each cochlear band and thus the vocal tract spectrum. However, with more than one sinewave entering a cochlear filter, the output envelope becomes pitch-dependent, especially for high-frequency cochlear filters of wide bandwidth (Figure 8.27). Specifically, the envelope will peak more at the glottal pulse instant, particularly for high-frequency cochlear filters, which give better time resolution than low-frequency filters because of their short impulse responses. This abrupt change in the envelope corresponds to a (temporally) local increase in nerve firings and thus the possible encoding of pitch in these nerve firings. Observe that the relative phase of the input sinewaves will influence the shape of the envelope in this case and this is more likely to occur with decreasing pitch and with cochlear filters of increasing characteristic frequency. The effect of phase change on firing rate has been demonstrated using a discrete-time cochlear model [19]. In speech signal processing, therefore, changing of the phase may influence auditory perception, especially of low-pitched sounds. For example, we saw in the phase vocoder that original harmonic phase relations can be lost in synthesis and result in a reverberant character to the speech. This loss of phase coherence can change the firing rate pattern (Figure 8.27). For example, the coherence loss can result in a less peaky envelope at the output of a cochlear filter and thus a “blurred” firing rate as seen by higher auditory levels. Preservation of phase may also be important for the second auditory processing mechanism we now describe.

Figure 8.26 Auditory processing of a single slowly varying AM-FM sinewave as input to a cochlear filter, according to the place theory of hearing. The average firing rate is obtained by integrating the firing rate over many nerve fibers for a particular inner hair cell (IHC) and is roughly proportional to the input sinewave amplitude Ap(t).

Image

This second mechanism is called phase synchrony. Until now we have ignored the absolute phase of the filter output and used only its envelope. The absolute phase, however, is also used in auditory processing because for low-frequency filters (below about 1000 Hz), nerve fibers fire in synchrony with sinewave peaks, the distance between firings therefore providing an estimate of the frequency of an input dominated by a single sinewave. If a single harmonic enters the cochlear filter, then the synchrony can result in a measure of the harmonic frequency. This is called the temporal theory of hearing [9],[13],[27],[30]. When multiple sinewaves enter the cochlear filter, there is evidence that, in the speech context, the synchrony occurs on the formant frequency dominating the cochlear band, and not necessarily on harmonic frequencies, or occurs on harmonic frequencies closest to the dominant formant frequency. Phase synchrony, therefore, can give measurements of either harmonic or formant frequency, depending on the number of sinewaves entering a cochlear filter and on the cochlear filter bandwidth. Observe, however, that although neural firings occur on peaks of low-frequency sinewaves, this firing will not occur on every sinewave peak. Thus one needs to look across multiple nerve fibers to make the appropriate measurement; the integration of neural firing patterns across adjacent fibers can provide the desired frequency measurement. Furthermore, phase synchrony across cochlear filter channels (in contrast to across nerve fibers within a cochlear channel) may also be exploited. This phenomenon gives further motivation to preserve sinewave phase relations in speech processing.
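A toy Python demonstration of frequency measurement by phase synchrony follows: firings are placed on peaks of a low-frequency channel output, each simulated fiber fires on only a random subset of those peaks, and the pooled interspike intervals recover the input frequency. The firing probability, fiber count, and peak-picking rule are illustrative assumptions.

import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(1)
fs = 16000
t = np.arange(0, 0.1, 1.0 / fs)
f_in = 440.0                                   # single low-frequency sinewave in one channel
y = np.cos(2 * np.pi * f_in * t)               # idealized cochlear-filter output

# Candidate firing times: peaks of the sinewave (phase synchrony).
peaks, _ = find_peaks(y)

# Each fiber fires on only a random subset of peaks; pool several fibers.
n_fibers = 8
spikes = np.sort(np.concatenate(
    [peaks[rng.random(len(peaks)) < 0.3] for _ in range(n_fibers)]))

# Pooled interspike intervals cluster near integer multiples of the period;
# the smallest positive interval estimates 1/f_in.
isi = np.diff(spikes) / fs
isi = isi[isi > 0]
f_est = 1.0 / isi.min()
print("estimated frequency:", f_est, "Hz")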

In summary, it appears then that sinewave phase relations are important both at high frequencies and at low frequencies, but for different reasons. At high frequencies, the envelope of cochlear filter outputs, and thus corresponding nerve firing patterns, are more prone to distortion with a change in sinewave phase relations. At low frequencies, a change in sinewave phase relations can degrade frequency measurements that rely on phase synchrony and the resulting correlation of firing patterns within and across cochlear channels.

Figure 8.27 Auditory processing of multiple slowly varying AM-FM sinewaves as input to a cochlear filter, likely to occur with low pitch and at cochlear filters of high characteristic frequency. In this case, a change in the phase relations of the input can alter the envelope shape and firing patterns of inner hair cell (IHC) nerve fibers, and thus, perhaps, perception of the sound.

Image

8.6.2 Auditory Spectral Model

The auditory system has a unique sensitivity to rapid signal variation and can maintain this sensitivity under harsh conditions [30]. We look now at a mathematical model of the auditory processing functions that may be responsible for this capability. From above, the front-end stage of the auditory system can be modeled by a filter bank of the form

y(t, s) = h(t, s) * x(t)

where h(t, s) represents the impulse responses of the (roughly) constant-Q cochlear filters at location s along the basilar membrane and where * denotes convolution. Because the cochlear filters are nearly constant-Q and, as we have seen, are related by a simple dilation along the basilar membrane, the output y(t, s) can be approximated by a wavelet transform of the input signal x(t).

In this auditory model of Yang, Wang, and Shamma [51], the differentials of y(t, s), both in time and frequency, are represented in higher levels of auditory processing. The differential in time at the output of each filter is introduced by the inner hair cell transduction, and is followed by a compressive nonlinearity and lowpass filtering (also introduced by the inner hair cell transduction):

z(t, s) = g[∂t y(t, s)] * w(t)

where w(t) denotes the lowpass filter, g(t) is the nonlinearity, and for simplicity, ∂t denotes the partial derivative with respect to time t. The nonlinearity and lowpass filtering operations can be approximated by half-wave rectification and smoothing to form an envelope of the temporal derivative of each filter output.14 The next operation is a differential in frequency, across cochlear filters, introduced by lateral inhibition [13],[30].

14 Observe that when y(t, s) is a slowly varying modulated sinewave, then the envelope of ∂t y(t, s) is approximately the product of the AM and the FM. For y(t, s) = A cos(ωt), ∂t y(t, s) = −Aω sin(ωt), so that z(t, s) = g[∂t y(t, s)] * w(t) ≈ Aω to within a constant scale factor set by the rectifier and the smoothing filter.

Lateral inhibition is a mechanism implemented by neural networks found in many biological systems. In vision, such a network exists in the retina and highlights fast transitions in an image as with edges. Likewise, in the auditory pathway (specifically, in the anteroventral cochlear nucleus), it has been shown that lateral inhibition enhances discontinuities of a signal’s spectral content along the frequency (place) axis. Numerous lateral inhibitory networks exist, the simplest of which consists of mutual inhibition (subtraction) of the activity of two nearest-neighbor neural fibers. Mutual inhibition is implemented by neural shunting circuits that provide ratio processing of their inputs, which is a typical neural process throughout the body. We will look briefly at such networks in Chapter 11 in a different perspective on AM-FM estimation by the auditory system.

We approximate the lateral inhibition (subtraction) operation by differentiation with respect to location along the basilar membrane, which can be thought of as enhancing spatial changes across the basilar membrane channels. In the model of Yang, Wang, and Shamma, this differentiation is followed by smoothing in frequency by a filter υ(s), so that the next stage becomes

(8.37)

$$\partial_s z(t, s) * \upsilon(s) = g'[\partial_t y(t, s)]\, \partial_{st} y(t, s) * w(t, s)$$

where g′(·) denotes the derivative of g(t) with respect to its argument, and the two-dimensional smoothing function w(t, s) = w(t) * υ(s). This pattern is referred to as the auditory spectrum, which may enhance perceptually useful features of the input waveform [51]. Under certain conditions, the factor g′[∂t y(t, s)] acts approximately as a sampling function, resulting in the sampling of ∂t y(t, s) at its extrema (i.e., maxima and minima) scaled by the mixed partial derivative ∂st y(t, s) and smoothed by w(t, s). Yang, Wang, and Shamma [51] and Slaney [47] have shown that, under certain conditions, a signal can be recovered from these samples (i.e., at extrema) or related features using an iterative approach similar to that alluded to in Example 8.4 of Section 8.5.3. (Nevertheless, there is no evidence that the auditory pathway requires this reconstruction.)
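The processing chain just described can be sketched numerically as below; the Butterworth stand-ins for the cochlear filters, the half-wave rectifier standing in for g(·), the smoothing cutoffs, and the three-point channel smoother for υ(s) are simplified assumptions, not the model of [51].

import numpy as np
from scipy.signal import butter, lfilter

def auditory_spectrum(x, fs, centers_hz):
    """
    Simplified chain: constant-Q filtering y(t, s), time derivative,
    half-wave rectification (standing in for g), lowpass smoothing w(t),
    difference across channels (lateral inhibition, the derivative in s),
    and a final smoothing across channels v(s).
    """
    # Stage 1: cochlear-like constant-Q filter bank (about +/- 12.5% of fc).
    Y = []
    for fc in centers_hz:
        b, a = butter(2, [fc * 0.875 / (fs / 2), fc * 1.125 / (fs / 2)], btype='band')
        Y.append(lfilter(b, a, x))
    Y = np.array(Y)                                   # y(t, s): channels x time

    # Stage 2: inner-hair-cell transduction: d/dt, rectify, lowpass smooth.
    dY = np.diff(Y, axis=1) * fs                      # partial derivative in time
    rect = np.maximum(dY, 0.0)                        # half-wave rectifier for g
    b_lp, a_lp = butter(2, 400.0 / (fs / 2))          # ~400-Hz smoothing filter w(t)
    Z = lfilter(b_lp, a_lp, rect, axis=1)             # z(t, s)

    # Stage 3: lateral inhibition: difference across adjacent channels,
    # then smooth along the channel (place) axis.
    dZ = np.diff(Z, axis=0)                           # derivative in s
    kernel = np.array([0.25, 0.5, 0.25])              # channel smoother v(s)
    P = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode='same'), 0, dZ)
    return P                                          # auditory spectrum estimate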

The auditory spectrum was analyzed with and without the compressive nonlinearity g(t). From Equation (8.37), removing the nonlinearity results in the time-differentiated input ∂t x(t) being passed through the differential filters ∂s h(t, s), i.e., the derivative of h(t, s) with respect to frequency (place) along the basilar membrane. The procedure is equivalent to a filter bank much more narrowly tuned than the wideband cochlear filters, with filters centered around the characteristic frequencies. An example, taken from Wang and Shamma [52], is given in Figure 8.28. An interesting hypothesis put forth by Wang and Shamma is that the differential cochlear filter bank operates in parallel with the original cochlear filter bank, thus providing simultaneously a narrowband and a wideband time-frequency representation to higher levels of the auditory pathway at each place along the basilar membrane.

Figure 8.28 Example of differential filter. The solid line is the wideband cochlear filter, while the dotted line is the corresponding narrowband differential filter.

SOURCE: K. Wang and S.A. Shamma, “Self-Normalization and Noise-Robustness in Early Auditory Representations” [52]. ©1994, IEEE. Used by permission.

Image

When the hair cell nonlinearity is taken into account, the resulting auditory spectrum is a function of both the original cochlear filters and differential cochlear filters, and, in particular, the output reflects the ratio of the energy of its narrowband differential filter to that of its corresponding wideband cochlear filter and is implemented by the neural shunting circuits discussed above. This self-normalization has the effect of enhancing a spectral line above a background noise level, and may in part be responsible for the superior robustness of the auditory system in noise [52]. The enhancement is greater when the cochlear bandwidth increases relative to the bandwidth of its corresponding differential filter. We will explore the robustness of the auditory system to noise from different perspectives in Chapter 13.

8.6.3 Phasic/Tonic View of Auditory Neural Processing

In this section, basic principles of neural processing are described with respect to “fast” and “slow” sound components. In particular, we describe Chistovich’s [5] auditory model which, in the various low- and high-level stages of auditory processing, consists of phasic and tonic systems that operate in parallel and respond, respectively, to fast and slow sound types within an auditory channel. Phasic systems respond primarily to dynamic features such as changing temporal envelope characteristics; they detect, for example, sound onsets and offsets in frequency bands. Tonic systems respond primarily to the slowly varying spectral content of a signal within a band, such as spectral patterns and formant transitions. Our formulation here is based in part on an exposition by Delgutte [9] on phasic and tonic neural mechanisms for the perception of speech. For each phasic and tonic mechanism, we look at both low-level auditory nerve and high-level cochlear nucleus neural processing. There is a great deal more interconnection and parallelism in the cochlear nucleus than in the auditory nerve fibers which, by contrast, are simply a two-dimensional array of fibers organized according to characteristic frequency (CF).

Phasic Systems— Auditory Nerve: There are often rapid changes in the amplitude envelopes of filtered sound events, and their temporal relations provide important information. A plosive consonant, for example, has an abrupt increase in high frequency spectral energy in the release of the burst, followed by an abrupt increase in low-frequency energy during the onset of the vowel; the time difference between the events was referred to in Chapter 3 as the voice onset time. Different voice onset times cue different plosive consonants. The degree of change in the amplitude envelope in a single frequency band is also important; for example, affricate consonants (e.g., “ch” in “chop”) have an abrupt temporal envelope, while fricative consonants (e.g., “sh” in “shop”) have a gradual envelope in high-frequency bands. In Chistovich’s model, these different dynamic envelope characteristics lead to distinct patterns in neural firings of the phasic system component of the auditory nerve. The burst onset and vowel onset of a plosive, for example, have a large and abrupt discharge of nerve firings, at high-CF and low-CF phasic fibers, respectively. In the affricate and fricative consonants, an abrupt amplitude envelope yields a larger neural discharge peak than a gradual one in the same high-CF phasic fiber.

Following a large increase of nerve firings is a gradual decay in the firing rate called adaptation [13],[30]. During adaptation, the response of the auditory nerve to future stimuli is much suppressed, especially to steady spectral components following the initial firing increase. Adaptation can occur on different time scales, ranging from a few milliseconds to several seconds to several minutes. An important property of adaptation is that it enhances change in a spectrum over time: a steady spectrum maintains the adaptation and contributes little to the discharge pattern, giving less responsiveness over time, while spectral contrast will revitalize nerve firings. One simple model of this contrast enhancement is that the past input to the nerve fiber is subtracted from the present so that only the change remains [27]. During adaptation, auditory nerves need change to respond. Steady or slowly changing spectral components are thus largely unseen by the phasic neural components of the auditory nerve, their measurement being made by the tonic system components. A simplified view of the phasic response to a schematized plosive/vowel transition is shown in Figure 8.29. Observe that neural discharge also occurs at the modulation rate of an amplitude-modulated envelope, e.g., the pitch period of a speech waveform. Although we show in Figure 8.29 the discharge near peaks in the AM, the firing may occur only sporadically at these times. The reader should be able to justify these neural response patterns based on the bandwidth of low-CF and high-CF cochlear filters.
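The “subtract the past from the present” picture of adaptation can be captured in a few lines of Python; the leaky-average time constant and the half-wave rectification are illustrative assumptions.

import numpy as np

def adapted_response(drive, fs, tau_ms=50.0):
    """
    Simple contrast-enhancement model of adaptation: the firing drive is the
    present input minus a leaky running average of its past, half-wave
    rectified, so steady inputs fade while changes revitalize the response.
    """
    alpha = np.exp(-1.0 / (tau_ms * 1e-3 * fs))   # leaky-integrator pole
    past = 0.0
    out = np.zeros_like(drive, dtype=float)
    for i, d in enumerate(drive):
        out[i] = max(d - past, 0.0)               # respond only to change
        past = alpha * past + (1.0 - alpha) * d   # update the running average
    return out

# Usage: a step of constant drive produces a strong onset response that
# decays (adaptation), while a later increase restores the response.
fs = 1000
drive = np.concatenate([np.zeros(100), 0.5 * np.ones(300), 1.0 * np.ones(300)])
r = adapted_response(drive, fs)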

Cochlear Nucleus: As firing patterns from acoustic transients travel up the auditory pathway, they are further enhanced by certain neural cells in the cochlear nucleus which reside in the lower part of the brain beyond the auditory nerve; these cells respond primarily to the onset envelope of a stimulus, giving little or no sustained response thereafter. Such cells are called onset cells and are found at almost all stages in the auditory pathway beginning at the cochlear nucleus. There is evidence that onset cells also tend to discharge at the modulation rate of the envelope of an amplitude-modulated tone (e.g., the pitch period of a speech waveform), and, in particular, synchronize to every peak in the AM envelope cycle, integrating information from closely-spaced neural channels.15 Onset cells are also capable of a “double onset” mechanism, as with firing at both the burst release and the onset of a following vowel, thus perhaps directly encoding the voice onset time of a plosive consonant in their response pattern. These cells thus can respond to information from very different frequency bands, as well as neighboring channels.

15 An onset cell does not necessarily discharge on the peak of a single-channel envelope; its synchronization to peaks in the envelope therefore requires integration of neighboring neural channels. Observe also that the onset cell, synchronizing to the amplitude envelope, does not phase-lock to peaks in the underlying sinewave of a cochlear-filter output, as we described occurs at the auditory nerve level.

Figure 8.29 Schematized view of the phasic response of neural fibers in the auditory nerve to a plosive/vowel transition. Average firing rates (AFR) are illustrated for (b) a low-CF and (c) a high-CF channel of the auditory nerve for the speech waveform in panel (a). There is an average background discharge rate due to spontaneous emission of firings when no stimulus is present. In this schematic, the low-CF fiber responds to the vowel onset, while the high-CF fiber responds to the plosive and glottal pulse onsets. Observe that steady spectral components are suppressed in a phasic response.

Image

We have seen that at both the auditory nerve and cochlear nucleus levels, the shape of the amplitude envelope of an auditory channel determines the phasic response of the neural fibers and cells. In speech signal processing, therefore, we must preserve the amplitude envelope shape within frequency bands by maintaining phase relations at the waveform level, as we saw earlier in Section 8.6.1; for example, blurring the attack or modifying the AM of an envelope alters firing patterns along an auditory channel. The envelope relation between channels (a different kind of phase relation) must also be preserved because onset cells, which respond to the envelope, gather information from both adjacent and distant frequency bands.

Tonic Systems— Auditory Nerve: There are a number of ways in which the auditory pathway may represent the slowly varying spectral content of a signal over time. We introduced earlier one approach in the place theory of hearing, in which the input spectrum is reflected in the average firing rate associated with each cochlear filter tuned to a particular frequency along the basilar membrane, the average firing rate tracing out the formant energy across CF. The place theory is applicable to voiced as well as unvoiced sounds. There are many fascinating subtleties to this model [9]. The dynamic range and level of the input, for example, play a major role in the representation. Nerve fibers with a high spontaneous firing rate (i.e., the firing rate with no stimulus present) are activated at a low threshold and over a small dynamic range (20–40 dB), while nerve fibers with a low spontaneous firing rate are activated at a higher threshold and over a larger dynamic range (40–80 dB). High-threshold fibers provide a place representation of the formant pattern for moderate and high stimulus levels, while low-threshold fibers provide the formant pattern for low stimulus levels. Another important feature of the place theory is that, through feedback from the brainstem, which provides stimuli to the outer hair cells (the second large group of hair cells along the basilar membrane), the dynamic range of the auditory nerve fibers of some inner hair cells is shifted by about 15–30 dB in favor of higher-intensity stimuli. In addition, when this feedback is activated, the frequency resolution of an auditory channel is improved. Such activation occurs particularly in noise and for specific channels associated with spectral bands that have low signal-to-noise ratio.

The above representation gives an average firing rate at each CF that is proportional to the stimulus level within spectral bands, and thus the spectrum is traced out across CF. Alternatively, in the temporal theory of hearing that we introduced earlier, phase synchrony may be used to measure the frequency of the stimulus (< 1 kHz). The tonic system thus takes on two forms at the auditory nerve level. Interspike intervals can be used to make very precise frequency measurements. We saw earlier that, with phase synchrony, either a formant or a harmonic (close to a dominant formant) frequency of the input signal can be measured, depending on the complexity (e.g., number of harmonics) of the input and the bandwidth of the channel. Remarkably, this information is coded in processors that are almost instantaneous in the sense that their temporal resolution equals the inverse of the formant or harmonic frequency, at least up to about 3000 Hz, i.e., on the order of less than 1 ms [9]. Because at the auditory nerve level the phase synchrony occurs at sinewave peaks, fine changes in the formant or harmonic frequencies can be tracked (Figure 8.30).
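
As a simple numerical illustration of frequency measurement from interspike intervals, the following MATLAB fragment estimates a harmonic frequency from synthetic firing times; the 800-Hz harmonic, the jitter level, and the use of the median interval are arbitrary choices for the example, not a model of actual nerve data.

% Sketch: a precise frequency estimate from interspike intervals.
f0 = 800;                                   % assumed harmonic frequency (Hz)
spikeTimes = (0:20)/f0 + 5e-5*randn(1,21);  % synthetic firings near successive sinewave peaks (s)
isi  = diff(spikeTimes);                    % interspike intervals (s)
fEst = 1/median(isi);                       % frequency estimate; resolution on the order of one period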

This high resolution of the auditory system has an important implication for speech signal processing. Typical short-time Fourier transform analysis, with analysis windows of duration 10–20 ms or greater, cannot achieve such temporal resolution. It follows that analysis tools with greater time resolution than the short-time Fourier transform are needed to exploit temporal patterns that are essential in human perception. We return to this problem in Chapter 11. We have already encountered a second important implication: preservation of channel phase relations is required to maintain the ability of the auditory nerve to provide phase synchrony across channels for formant and harmonic frequency estimation.

Cochlear Nucleus: The tonic component of the cochlear nucleus is characterized by two important cell classes: the primary unit cells and the chopper unit cells. Their place and temporal functions are similar in style to those of the tonic component of the auditory nerve fibers. The primary unit cells provide the better time resolution, with very precise phase locking to harmonic frequencies or band-dominant formant frequencies (or harmonics closest to dominant formant frequencies), while integrating across many auditory nerve fibers. Primary cells have dynamic range problems similar to those of low-threshold auditory nerve fibers. Chopper unit cells, on the other hand, have better dynamic range properties than primary unit cells (a 40-dB range of stimulus levels), in spite of the limited dynamic range of their input. This property is probably due to the fact that they integrate information from many auditory nerve fibers, while perhaps also invoking mutual inhibition among these inputs. Chopper cells are, however, characterized by poorer temporal resolution, phase locking only below about 1 kHz. Thus, primary unit and chopper cells complement each other: primary unit cells give a precise representation of frequencies by their temporal discharge pattern, while chopper cells provide a place representation of spectra over a wide dynamic range. The implications for speech signal processing are similar to those for the tonic response patterns of auditory nerve fibers. Because at this higher auditory level the cells integrate information from different auditory nerve fibers, timing and spectral content representations at the auditory nerve fiber level should be controlled with the higher levels in mind.

Figure 8.30 Schematic of average firing rate (AFR) [panel (b)] in estimation of changing formant frequency [panel (a)] by tonic auditory nerve component. Phase synchrony with respect to formant peaks allows fine time resolution.

Image

8.7 Summary

In this chapter, we developed a sinewave model of the filter-bank outputs of the FBS method for quasi-periodic speech signals. This channel model led to the phase vocoder for speech analysis and synthesis. In this context, we introduced the concept of phase coherence across sinewave outputs in synthesis. Loss of phase coherence is known to give a reverberant quality to phase vocoder synthesis, and a number of techniques were described for preserving phase coherence, particularly in the application of time-scale modification. Various limitations of the phase vocoder lead to a speech analysis/synthesis method based on explicit modeling and estimation of sinewave components, to be described in Chapter 9.

In this chapter we also introduced the wavelet transform, a generalization of filter-bank analysis/synthesis that involves constant-Q filters. We saw that the wavelet transform can serve as an approximate model of the front-end auditory cochlear filter bank, and we used this model to speculate on the cause of aural phase sensitivity. Moreover, this model provided a stepping stone to more elaborate models of auditory neural processing at both the low-level auditory nerve fiber and the higher-level cochlear nucleus. For example, we introduced Chistovich’s hypothesis of phasic and tonic neural response patterns to fast and slow sound components, respectively, of the speech waveform. These auditory models provide a framework for the various auditory-motivated speech signal processing techniques appearing throughout the remainder of the text.

EXERCISES

8.1 In this exercise you are asked to explore the effect of the channel phase adjustment of Equation (8.5) in FBS synthesis.

(a) Show that setting pk = 1 results in the FBS constraint that, for Image, i.e., Image, we require w[rN] = 0 for r = ±1, ±2, … , or the more useful constraint that the duration Nw of the analysis window w[n] be less than N.

(b) Show that with pk = 1, multiple copies of the input signal arise when the FBS constraint is not satisfied, specifically,

Image

(c) Derive the expression in Equation (8.7) of Example 8.1 that results from applying the linear-phase adjustment factor Image with no = 20, removing multiple copies of x[n] and thus reverberation in synthesis.

8.2 Consider filters of a filter bank (Section 8.3, Figure 8.2) that are symmetric about π so that ωN−k = 2π − ωk, where Image, and assume for simplicity that N is even. Show Equation (8.14), i.e.,

X(n, ωk) = X*(n, ωN − k).

From Equations (8.13) and (8.14), show that the sum of two symmetric channels k and N − k can be written as [Equation (8.15)]

Image

8.3 Show that the derivative of the continuous-time STFT phase θ(t, Ωk) with respect to time can be expressed as

Image

where a(t, Ωk) and b(t, Ωk) are the real and imaginary parts of the STFT X(t, Ωk), respectively. Hint: Recall the analogous frequency-domain expression of Equation (6.8) from Chapter 6.

8.4 In this exercise we derive the output of the bandpass filter hk(t) = w(t)ejΩkt (of Section 8.3.1) in response to a sinewave with frequency Ω(t) and amplitude A(t), i.e.,

x(t) = A(t)cos[Image(t)]

with

Image

We think of x(t) as one sinewave component of a quasi-periodic signal. The kth filter output is given by

Image

where t′ is the starting non-zero point of w(t − τ). We assume that the amplitude and frequency functions satisfy the slowly varying conditions of Section 8.3.1 over the time interval [t′, t″] given for the quasi-periodic case and illustrated in Figure 8.5.

(a) Writing Image(t) as

Image

so that

Image

given the slowly varying conditions of Section 8.3.1 for quasi-periodicity, show that

Image

(b) Assume that the frequency response of hk(t), Hk(Ω) = W(Ω − Ωk), is flat with unity response near Ω = Ω(t′). Given your result in part (a), show that yk(t) can be written as

Image

or

Image

where Image.

(c) Given your result in part (b), show that the STFT of x(t) at Ω = Ωk is given by

Image

The phase of X(t, Ωk) is then given by

Image

and the phase derivative by

Image

Noting that t′ can be replaced by t under the assumption that Ω(t) changes negligibly over the duration of the analysis window w(t), argue therefore that

(8.38)

Image

and thus the kth channel output is given by Equation (8.21) when the pth slowly varying harmonic component of a quasi-periodic input falls in the bandwidth of the kth bandpass filter.

(d) Argue for the discrete-time counterpart of Equation (8.38) given in Equation (8.22).

8.5 We described in Section 8.3 the loss of phase coherence that occurs in time-scale modification when using samples of the phase derivative in the phase vocoder implementation of Figure 8.8, in which the frequency-compressed (or expanded) signal is time-scaled during playback. Consider now the alternative “direct” implementation of time-scale modification with the phase vocoder given in Figure 8.9, in which the time-scaling occurs by decimation/interpolation of the sinewave amplitude and frequency functions.

(a) Explain why the system of Figure 8.9 results in a loss of the original phase relations among sinewaves, thus giving an objectionable reverberant quality to the synthesis. Argue that a loss of phase coherence occurs even when the unwrapped phase is obtained from the principal phase value rather than from the phase derivative as in Figure 8.9. Hint: Consider Equation (8.27).

(b) Argue that a single phase offset correction to each function Image in Equation (8.26) cannot preserve the original phase relations in the channel outputs over all time.

(c) Given the original principal phase function, Image, for each channel, propose an approach to correct the time-scaled phase function Image in Equation (8.26) so that loss in phase coherence does not occur, i.e., so that the original phase relations are maintained. Hint: Consider adding a phase offset and frequency perturbation to each channel. Also, consider that there may be no solution to this exercise if you attempt to maintain the original phase relations at all time samples.

8.6 Suppose that the amplitude of the filter frequency response H(ω) in Equation (8.23) follows a Gaussian function and that the filter phase is zero. Give an expression for the amplitude and phase of the sequence y[n] in Equation (8.23) for a sinewave input with constant amplitude A and linearly changing frequency ω[n] = αn. How does your result affect analysis and synthesis in the phase vocoder? What are the implications for the phase vocoder when the phase of H(ω) is nonzero?

8.7 Suppose we are given the temporal envelope and the Fourier transform magnitude of a signal. We want to generate a time-scaled signal that has the given spectral magnitude and, with an appropriate selection of the Fourier transform phase, has a time-scaled version of the original temporal envelope.

(a) Consider a single decaying sinewave of the form

x[n] = a^n cos(ωn)u[n]

with u[n] the unit step, and suppose we stretch the temporal envelope a^n u[n] so that it becomes a^(n/2) u[n]. Explain why this altered temporal envelope is not “consistent” with the Fourier transform magnitude of the original signal.

(b) Generalize your argument in part (a) for signals consisting of sums of decaying sinewaves, thus further explaining why methods that attempt to achieve a close match to both a spectral magnitude and a modified temporal envelope by Fourier transform phase selection may not be consistent with the relationship between a sequence and its Fourier transform.

8.8 Explain why the time-scale modification technique used in Example 8.2, when applied on speech, will modify the pitch of the speaker as well as the articulation rate. Consider the speech events that are detected by the 2-ms analysis filter. (See also Exercise 7.18.)

8.9 Argue qualitatively that excitation-driven techniques such as those based on linear prediction or homomorphic deconvolution, provided that the source and vocal tract impulse response are estimated accurately, do not suffer from loss of phase coherence in time-scale or pitch modification.

8.10 The oldest form of speech coding device is the channel vocoder, which was invented by Homer Dudley [6]. Figures 8.31 and 8.32 show the analysis/synthesis structure for the channel vocoder.

(a) Suppose the input x(t) is a slowly varying steady-state voiced sound, that the bandpass filters are complex and ideal, each with a bandwidth of 200 Hz, and that, taken together, they cover the input speech bandwidth of 5000 Hz. Assuming that only one harmonic of the input passes through each bandpass filter, roughly sketch the real part of the output of a bandpass filter located near 2500 Hz. (The specific number, 2500 Hz, is not important.)

(b) The cascade of the magnitude operator and lowpass filter yields an estimate of the amplitude envelope of the output of each bandpass filter. Suppose the lowpass filter is ideal and has a bandwidth of about 100 Hz. Roughly sketch the output of the lowpass filter with input being the magnitude of the complex bandpass output from part (a). What is the required decimation rate, i.e., samples per second, at the bandpass output? What is the total number of channel filter parameters bk required per second from all of the filters?

(c) Denote each bandpass filter by hk(t) and the source in the synthesis by e(t). Derive an expression for the output of the synthesizer, assuming the magnitude signals bk are held constant. (In practice, interpolation of bk will be required.) Suppose each bandpass filter is zero-phase. Sketch the real part of the output of the synthesizer, assuming a voiced input.

(d) Describe conceptually the difference between the channel vocoder and the phase vocoder.

Figure 8.31 Block diagram of channel vocoder analyzer.

SOURCE: L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals [42]. ©1978, Pearson Education, Inc. Used by permission.

Image

Figure 8.32 Block diagram of channel vocoder synthesizer.

SOURCE: L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals [42]. ©1978, Pearson Education, Inc. Used by permission.

Image

8.11 Consider a discrete-time signal x[n] passed through a bank of filters hk[n] where each filter is given by a modulated version of a baseband prototype filter h[n], i.e.,

Image

where h[n], a Hamming window, lies over a duration 0 ≤ n < Nw, and Image is the frequency sampling interval. In this exercise, you are asked to time-scale expand some simple input signals by time-scale expanding the filter-bank outputs.
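
As a concrete starting point, the modulated filter bank described above can be generated directly from its definition; in the sketch below the values Nw = 200 and N = 256 are illustrative choices only, not the answer to part (a).

% Sketch: modulated filter bank hk[n] = h[n]exp(j*2*pi*k*n/N) built from a
% Hamming prototype; Nw and N are illustrative choices only.
Nw = 200;                                  % prototype (Hamming window) length
N  = 256;                                  % number of channels; frequency spacing 2*pi/N
n  = 0:Nw-1;
h  = hamming(Nw).';                        % baseband prototype as a row vector
hk = zeros(N, Nw);                         % one complex bandpass impulse response per row
for k = 0:N-1
    hk(k+1,:) = h .* exp(1j*2*pi*k*n/N);   % modulate the prototype to channel frequency 2*pi*k/N
end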

(a) State the perfect reconstruction constraint, i.e., the output equals the input, with respect to the values Nw and N, when the filter-bank outputs are summed.

(b) If the input to the filter bank is the unit sample δ[n], then the output of each filter is a complex exponential with envelope ak[n] = h[n] and phase Image. Suppose that each complex exponential output is time-expanded by a factor of two by interpolation of its envelope and phase (i.e., insert a zero between adjacent time samples and lowpass filter to create a signal of twice the original length). Derive a new perfect reconstruction constraint, with respect to the values Nw and N, so that the summed filter-bank outputs equal δ[n].

(c) Suppose now that the filter-bank input equals

x[n] = δ[n] + δ[n − no],

and that the filter-bank outputs are time-expanded as in part (b). Derive a sufficient condition on Nw, N, and no so that the summed filter-bank output is given by

y[n] = δ[n] + δ[n − 2no],

i.e., the unit samples are separated by 2no samples rather than no samples.

8.12 In this exercise you investigate some of the properties of the continuous wavelet transform and its discrete counterpart on speech signals.

(a) Suppose you are given a signal consisting of two sinewaves at 3500 Hz and 3600 Hz, added to two impulses spaced apart by 5 milliseconds. Design a discrete STFT whose magnitude (spectrogram) reveals all four signal components. Then design a discrete wavelet transform that reveals all four signal components. Describe your STFT and wavelet transform designs qualitatively, considering, for example, the duration and shape of the analysis window and basic wavelet, respectively, for the required time-frequency resolution. In each case, sketch the approximate two-dimensional function (in time-frequency for the STFT and time-scale for the wavelet transform). Discuss the relative advantages for each two-dimensional representation.

(b) Repeat part (a) for the speech signal of Figure 8.33 consisting of a high-frequency voiced plosive having a low-frequency voice bar, followed by a low-pitched (100 Hz) vowel. Assume the voice bar consists of two tones at 100 Hz and 200 Hz. Design your STFT and wavelet transform to resolve the harmonics of the vowel and of the voice bar, while also resolving the onset of the plosive and onset of voicing (i.e., the onset of the vowel).

Figure 8.33 Speech waveform components for Exercise 8.12.

Image

8.13 In this exercise you show the invertibility of the continuous wavelet transform and its discrete counterpart.

(a) Prove the invertibility of the continuous wavelet transform under the admissibility condition on the basic wavelet given in Section 8.5.2. Hint: First determine the Fourier transform of hτ,a(t), Hτ,a(ω). Then use the generalized Parseval’s theorem:

Image

Finally, substitute the expression for the wavelet transform Xw(τ,a) into the continuous inverse wavelet transform Equation (8.33), assume that interchanging the resulting integrals is valid, and then work out the algebra.

(b) Prove the invertibility of the discrete wavelet transform under the orthogonality condition on the discretized wavelet basis given in Section 8.5.3.

8.14 Argue whether it is possible to design a wavelet-like transform that has good time resolution at low frequencies and good frequency resolution at high frequencies. If so, then propose such a wavelet-like basis. Hint: Consider scaling a wideband low-frequency “basic wavelet” up to high-frequency “wavelets.”

8.15 Show that the wavelet transform can be expressed as

Image

so that the transform acts as a “zoom lens,” expanding or contracting the signal relative to the basic wavelet.

8.16 This exercise addresses the construction of a time-scaled sequence from a modified wavelet transform as described in Section 8.5.

(a) Set up equations for a closed form solution that minimizes the distance metric Equation (8.36) between the modified wavelet transform and the wavelet transform of the time-scaled signal estimate. Explain why these equations are difficult, if not impossible, to solve in terms of the changing wavelet window (filter) for each channel.

(b) Propose a magnitude-only iterative solution to estimate a time-scaled sequence from a modified wavelet transform. The method should attempt to minimize the distance metric of the form of Equation (8.36), but between a modified wavelet transform magnitude and the wavelet transform magnitude of the signal estimate. Hint: Consider the iterative solution for time-scale modification obtained from a modified STFT magnitude in Chapter 7.

8.17 In this exercise, we explore an approach to time-scale modification that relies on a filter-bank signal representation in the phase vocoder method. The output of each filter is viewed as an amplitude- and phase-modulated sinewave, the amplitude and unwrapped phases of which are interpolated to perform time-scale modification. Unlike in our phase vocoder of Section 8.3, however, the bandpass filters Hk(ω) do not have a flat response in the frequency vicinity of a sinewave input. As before, with time-scale modification by a factor ρ, the time-scale modified filter output is given by

(8.39)

Images

where Images is the channel envelope and Images is the unwrapped phase, both interpolated/decimated by the factor ρ. In this exercise you show that for a single AM-FM sinewave input, this filter-bank transformation results in a new AM-FM sinewave for which the AM and FM have been stretched. An example of using this method to time-expand a sinewave with flat amplitude and linear FM is illustrated in Figure 8.34, where the bandpass filters are Gaussian in shape and satisfy the FBS constraint. The FM begins at 1000 Hz and has a 15000 Hz/s sweep rate.

(a) Show that scaling the interpolated/decimated phase function in Equation (8.39) by ρ maintains the original instantaneous frequency, i.e., the phase derivative, of each filter output, but stretched or compressed in time.

(b) Consider now an AM-FM sinewave input of the form x[n] = a[n]e[n], where Image. Argue that under certain “slowly varying” conditions on a[n] and ω[n], x[n] is approximately an eigenfunction of a linear time-invariant system, and thus of each filter in the filter bank. Argue that this approximation results in

Image

where the FM (ω[n]) of the signal has been transduced to an AM (|Hk(ω[n])|) within each channel.

Figure 8.34 Time-scale modification of FM sinewave using a Gaussian-based filterbank: original waveform and time-scaled version (upper pair); spectrograms of original waveform and time-scaled version (lower pair).

SOURCE: T.F. Quatieri and T.E. Hanna, “Perfect Reconstruction Time-Scaling Filterbanks” [41]. ©1999, IEEE. Used by permission.

Image

(c) Assume now that each filter in our filter bank is zero phase, i.e., ∠Hk(ω) = 0. Also assume that the filter bank satisfies the FBS constraint, i.e., Image. With the filter amplitude and phase outputs time-scaled as in Equation (8.39), show that the sum of the modified filter outputs is given by

Image

that is, the amplitude and frequency of the FM signal are approximately time-scaled without distortion. The duration of the sinewave has thus been increased while slowing the rate of change of its AM and FM.

(d) Consider a complex filter bank designed with a prototype filter and with some specified filter spacing over a 5000-Hz bandwidth. The prototype filter is selected to meet the FBS constraint (i.e., the output equals the input) when no modification is applied. If the number of filters over the full 5000-Hz band is N = 25, what is the maximum filter length for your result in part (c) to hold? (The example in Figure 8.34 uses this filter bank with a Gaussian prototype filter.)

8.18 (MATLAB) In this MATLAB exercise, use the workspace ex8M1.mat, as well as the function uniform_bankx.m located in the companion website directory Chap_exercises/chapter8. This exercise steps you through a design of the phase vocoder and explores the issue of phase coherence, as well as time-scale modification, for a speech waveform.

(a) The first step is to design a filter bank that satisfies the FBS constraint, specifically with filter spacing of 50 Hz. We assume a sampling rate of 10000 samples/s and design a prototype 20 ms Hamming analysis filter. The function uniform_bankx.m can be used to compute the desired filter bank using the code:

1. prototype = hamming(200)

2. fbank = uniform_bankx(50, prototype', 100, 10000, 2).

Observe that the 100 bandpass filters are complex and that the last argument in uniform_bankx.m selects for plotting either the sum of the resulting filter-bank impulse responses (2) or the frequency response magnitude of this sum (1). Now repeat the above design with a filter spacing of 500 Hz and plot the filter-bank response sum. Why is this larger filter spacing not appropriate for the phase vocoder? With the filter bank in the 2-D array fbank, write a MATLAB function to perform analysis and synthesis. Use the speech waveforms mtea_10k and ftea_10k in the workspace ex8M1.mat as input to the filter bank. Perform this operation with both a 50 Hz and 500 Hz filter spacing and describe differences after listening to the reconstructions (using the MATLAB sound.m function).
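
A minimal sketch of the analysis/synthesis function requested above is given below. It assumes that each row of the array fbank holds one complex bandpass impulse response (the exact layout and scaling produced by uniform_bankx.m should be checked against the companion code); the function name fbs_synth is a hypothetical choice.

function y = fbs_synth(fbank, x)
% Sketch of FBS analysis/synthesis: filter the input through each complex
% bandpass filter (one per row of fbank) and sum the channel outputs.
x = x(:).';                         % work with a row vector
y = zeros(size(x));
for k = 1:size(fbank,1)
    ck = conv(fbank(k,:), x);       % k-th channel output
    y  = y + ck(1:length(x));       % accumulate, truncating the convolution tail
end
y = real(y);                        % keep the real part of the summed complex channels
end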

(b) In this part use the complex filter bank with 50-Hz spacing that resides in the array fbank from part (a). Write a MATLAB function to compute the amplitude envelopes and phase derivatives of the filter-bank outputs, numerically integrate the phase derivative, and perform synthesis. One approach to computing the phase derivative is to first unwrap the phase of each channel (measured modulo 2π) and then compute the first difference. Define the first phase difference as the initial phase value so that a running sum of the phase differences will reproduce the original (unwrapped) phase exactly. You have now designed a phase vocoder. Use the speech waveforms mtea_10k and ftea_10k in the workspace ex8M1.mat as input to the filter bank. How do the reconstructions differ from those of part (a), both visually and aurally? Has phase coherence been maintained? Now modify the initial phase offset of each channel using the MATLAB function rand.m or randn.m, with normalization to the interval [−π, π], and comment on the reconstruction, again visually and aurally. Has phase coherence been maintained? Do you find any perceptual differences in the reconstructions from male and female speech?
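
For a single complex channel output ck (a row vector) from part (a), the envelope, phase-derivative, and running-sum steps described above might look as follows; this is a per-channel sketch under that assumption, not the full vocoder.

amp   = abs(ck);                  % amplitude envelope of the channel
ph    = unwrap(angle(ck));        % unwrapped channel phase
dph   = diff(ph);                 % phase derivative as a first difference
dph   = [ph(1) dph];              % define the first difference as the initial phase value
phRec = cumsum(dph);              % running sum reproduces the unwrapped phase exactly
ykRec = amp .* exp(1j*phRec);     % resynthesized channel; sum over channels and take the real part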

(c) With your phase vocoder from part (b), design a time-scale modification system for an integer rate-change factor using the “direct” approach described in Section 8.3.2. Synthesize both time-compressed and time-expanded versions of the waveforms mtea_10k and ftea_10k for a variety of integer rate-change factors, and comment on the reconstruction quality. Has phase coherence been preserved?
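
One possible per-channel form of the “direct” time-scale modification for an integer expansion factor is sketched below, using the channel envelope amp and phase derivative dph from part (b); the factor rho = 2 and the use of linear interpolation are illustrative assumptions.

rho  = 2;                                  % integer rate-change (expansion) factor, illustrative
n    = 1:length(amp);
nI   = 1:1/rho:length(amp);                % time grid expanded by rho
ampI = interp1(n, amp, nI, 'linear');      % interpolated amplitude envelope
dphI = interp1(n, dph, nI, 'linear');      % interpolated phase derivative (channel frequency)
phI  = cumsum(dphI);                       % rebuild the channel phase by a running sum
ykMod = ampI .* exp(1j*phI);               % time-expanded channel; sum over channels, take the real part

Time compression by an integer factor follows the same pattern with decimation in place of interpolation.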

8.19 (MATLAB) In this MATLAB exercise, you use the function aud_transform.m, the workspace ex8M2.mat, and the auditory filter responses located in the companion website directory Chap_exercises/chapter8. This exercise investigates the channel outputs for a filter bank consisting of measured cochlear filter responses.

(a) The subdirectory Auditory_filters contains frequency responses measured along an actual basilar membrane of a cat (similar to that of a human). The suffix on each file name gives the characteristic frequency in Hertz, spanning an 8000-Hz range. The function aud_transform.m plots the frequency response from a desired file and converts the frequency response into a minimum- or zero-phase impulse response. Plot frequency responses and create minimum-phase impulse responses for a variety of cochlear filters with low and high characteristic frequencies. (Auditory filters are said to be minimum phase.)

(b) Now use the speech waveforms mtea_10k and ftea_10k (at 10000 samples/s) in ex8M2.mat as input to your selected filters and plot the amplitude envelopes of the filter outputs. Comment on the envelope characteristics of the different filters for different speech sound classes (e.g., plosives vs. voiced sounds) and for the male and female speech. Describe how these envelopes might affect neural firing patterns in the auditory nerve and in the cochlear nucleus.
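
A sketch of the envelope computation for part (b) is given below, assuming hImp is a real impulse response returned by aud_transform.m and x is one of the speech waveforms; the variable names are placeholders.

y   = conv(hImp(:).', x(:).');          % output of the selected cochlear filter
env = abs(hilbert(y));                  % amplitude envelope via the analytic signal
t   = (0:length(env)-1)/10000;          % time axis in seconds at 10000 samples/s
plot(t, env);                           % envelope to compare across sound classes and speakers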

(c) Generate the minimum-phase impulse response for the cochlear filter with the lowest characteristic frequency. Consider the fact that the cochlear filters are approximately constant-Q down to about 800 Hz, below which the bandwidths remain roughly constant. Describe the implication for the time-frequency resolution of an auditory filter bank whose filter bandwidths were to continue to decrease logarithmically below 800 Hz.

BIBLIOGRAPHY

[1] B.A. Blesser, “Audio Dynamic Range Compression for Minimum Perceived Distortion,” IEEE Trans. Audio and Electro. Acoustics, vol. AU–17, no. 1, pp. 22–32, March 1969.

[2] A.C. Bovik, J.P. Havlicek, and M.D. Desai, “Theorems for Discrete Filtered Modulated Signals,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 3, pp. 153–156, Minneapolis, MN, April 1993.

[3] C.S. Burrus, R.A. Gopinath, and H. Guo, Introduction to Wavelets and Wavelet Transforms, Prentice Hall, Englewood Cliffs, NJ, 1998.

[4] J.P. Carlson, “Digitized Phase Vocoder,” Proc. Conf. on Speech Communications and Processing, Boston, MA, Nov. 1967.

[5] L.A. Chistovich, V.V. Lubinskaya, T.G. Malinnikova, E.A. Ogorodnikova, E.I. Stoljarova, and S.J.S. Zhukov, “Temporal Processing of Peripheral Auditory Patterns of Speech,” chapter in The Representation of Speech in the Peripheral Auditory System, R. Carlson and B. Granstrom, eds., pp. 165–180, Elsevier, Amsterdam, 1982.

[6] H. Dudley, R. Riesz, and S. Watkins, “A Synthetic Speaker,” J. Franklin Inst., vol. 227, pp. 739–764, 1939.

[7] A. Croisier, D. Esteban, and C. Garland, “Perfect Channel Splitting by Use of Interpolation/Decimation-Tree Decomposition Techniques,” Int. Conf. Information Sciences and Systems, pp. 443–446, Patras, Greece, Aug. 1976.

[8] I. Daubechies, Ten Lectures on Wavelets, SIAM, 1992.

[9] B. Delgutte, Auditory Neural Processing of Speech, chapter in The Handbook of Phonetic Sciences, W.J. Hardcastle and J. Laver, eds., Blackwell (Oxford), 1997.

[10] M. Dolson, “The Phase Vocoder: A Tutorial,” Computer Music Journal, vol. 10, no. 4, pp. 14–27, Winter 1986.

[11] J.L. Flanagan and R.M. Golden, “Phase Vocoder,” Bell System Technical Journal, vol. 45, no. 9, pp. 1493–1509, Nov. 1966.

[12] B. Gold and L.R. Rabiner, “Parallel Processing Techniques for Estimating Pitch Periods of Speech in the Time Domain,” J. Acoustical Society of America, vol. 34, no. 7, pp. 916–921, 1962.

[13] D.M. Green, An Introduction to Hearing, John Wiley and Sons, New York, NY, 1976.

[14] T. Irino and H. Kawahara, “Signal Reconstruction from Modified Auditory Wavelet Transform,” IEEE Trans. Signal Processing, vol. 41, no. 12, pp. 3549–3554, Dec. 1993.

[15] T. Irino and H. Kawahara, “Signal Reconstruction from Modified Wavelet Transform—An Application to Auditory Signal Modeling,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 1, pp. 85–88, Minneapolis, MN, March 1992.

[16] S. Kadambe and G.F. Boudreaux-Bartels, “Application of the Wavelet Transform for Pitch Detection of Speech Signals,” IEEE Trans. on Information Theory, vol. 38, no. 2, pt. 2, pp. 917–924, March 1992.

[17] J. Laroche, “Time and Pitch Scale Modification of Audio Signals,” chapter in Applications of Digital Signal Processing to Audio and Acoustics, M. Kahrs and K. Brandenburg, eds., Kluwer Academic Publishers, Boston, MA, 1998.

[18] J. Laroche and M. Dolson, “Improved Phase Vocoder Time-Scale Modification of Audio,” IEEE Trans. Speech and Audio Processing, vol. 7. no. 3, pp. 323–332, May 1999.

[19] E. Lindemann and J.M. Kates, “Phase Relationships and Amplitude Envelopes in Auditory Perception,” Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Mohonk Mountain House, New Paltz, NY, Oct. 1999.

[20] D. Malah, “Time-Domain Algorithms for Harmonic Bandwidth Reduction and Time Scaling of Speech Signals,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 121–133, April 1979.

[21] D. Malah and J.L. Flanagan, “Frequency Scaling of Speech Signals by Transform Techniques,” Bell System Technical Journal, vol. 60, no. 9, pp. 2107–2156, Nov. 1981.

[22] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, New York, NY, 1998.

[23] S. Mallat and W.L. Hwang, “Singularity Detection and Processing with Wavelets,” IEEE Transactions on Information Theory, vol. 38, no. 2, pp. 617–643, March 1992.

[24] P. Maragos, J.F. Kaiser, and T.F. Quatieri, “Energy Separation in Signal Modulations with Application to Speech Analysis,” IEEE Trans. Signal Processing, vol. 41, no. 10, pp. 3024–3051, Oct. 1993.

[25] R.J. McAulay and T.F. Quatieri, “Speech Analysis-Synthesis Based on a Sinusoidal Representation,” IEEE Trans. Acoustics, Speech, Signal Processing, vol. ASSP–34, no. 4, pp. 744–754, Aug. 1986.

[26] J.A. Moorer, “Signal Processing Aspects of Computer Music,” Proc. IEEE, vol. 65, no. 8, pp. 1108–1137, Aug. 1977.

[27] B.C.J. Moore, An Introduction to the Psychology of Hearing, 2nd Edition, Academic Press, Boston, MA, 1988.

[28] H. Nawab and T.F. Quatieri, “Short-Time Fourier Transform,” chapter in Advanced Topics in Signal Processing, J.S. Lim and A.V. Oppenheim, eds., Prentice Hall, Englewood Cliffs, NJ, 1988.

[29] A.V. Oppenheim and R.W. Schafer, Digital Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1975.

[30] A. Pickles, An Introduction to Auditory Physiology, Academic Press, 2nd Edition, New York, NY, 1988.

[31] M.R. Portnoff, “Time-Scale Modification of Speech Based on Short-Time Fourier Analysis,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 29, no. 3, pp. 374–390, June 1981.

[32] M. Puckette, “Phase-Locked Vocoder,” Proc. IEEE 1995 ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, Mohonk Mountain House, New Paltz, NY, Oct. 1995.

[33] T.F. Quatieri, R.B. Dunn, and T.E. Hanna, “A Subband Approach to Time-Scale Modification of Complex Acoustic Signals,” IEEE Trans. Speech and Audio Processing, vol. 3, no. 6, pp. 515–519, Nov. 1995.

[34] T.F. Quatieri, R.B. Dunn, and T.E. Hanna, “Time-Scale Modification with Temporal Envelope Invariance,” Proc. IEEE 1991 Workshop on Applications of Signal Processing to Audio and Acoustics, Mohonk Mountain House, New Paltz, NY, Oct. 1993.

[35] T.F. Quatieri, R.B. Dunn, and R.J. McAulay, “Signal Enhancement in AM–FM Interference,” Technical Report TR–993, Lincoln Laboratory, Massachusetts Institute of Technology, May 1994.

[36] T.F. Quatieri, R.B. Dunn, R.J. McAulay, and T.E. Hanna, “Underwater Signal Enhancement Using a Sinewave Representation,” Proc. IEEE Oceans92, pp. 449–452, Newport, RI, Oct. 1992.

[37] T.F. Quatieri, R.B. Dunn, R.J. McAulay, and T.E. Hanna, “Time-Scale Modification of Complex Acoustic Signals in Noise,” Technical Report TR–990, Lincoln Laboratory, Massachusetts Institute of Technology, Jan. 1994.

[38] T.F. Quatieri and R.J. McAulay, “Phase Coherence in Speech Reconstruction for Enhancement and Coding Applications,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, pp. 207–210, Glasgow, Scotland, May 1989.

[39] T.F. Quatieri and R.J. McAulay, “Audio Signal Processing Based on Sinusoidal Analysis/Synthesis,” chapter in Applications of Digital Signal Processing to Audio and Acoustics, M. Kahrs and K. Brandenburg, eds., Kluwer Academic Publishers, Boston, MA, 1998.

[40] T.F. Quatieri, T.E. Hanna, and G.C. O’Leary, “AM–FM Separation Using Auditory-Motivated Filters,” IEEE Trans. Speech and Audio Processing, vol. 5, no. 5, pp. 465–480, Sept. 1997. Also in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Atlanta, GA, vol. 2, pp. 977–980, May 1996.

[41] T.F. Quatieri and T.E. Hanna, “Perfect Reconstruction Time-Scaling Filterbanks,” IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 2, pp. 495–498, Phoenix, AZ, March 1999.

[42] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, NJ, 1978.

[43] O. Rioul and M. Vetterli, “Wavelets and Signal Processing,” IEEE Signal Processing Magazine, vol. 8, no. 4, pp. 14–38, Oct. 1991.

[44] S. Roucos and A.M. Wilgus, “High-Quality Time-Scale Modification of Speech,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, pp. 490–493, Tampa, FL, March 1985.

[45] X. Serra, “A System for Sound Analysis/Transformation/Synthesis Based on a Deterministic Plus Stochastic Decomposition,” Ph.D. Thesis, CCRMA, Department of Music, Stanford University, 1989.

[46] S. Seneff, “Pitch and Spectral Estimation of Speech Based on Auditory Synchrony Model,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, pp. 36.2.1–36.2.4, San Diego, CA, March 1984.

[47] M. Slaney, D. Naar, and R.F. Lyon, “Auditory Model Inversion for Signal Separation,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 2, pp. 77–80, Adelaide, Australia, April 1994.

[48] M.J. Smith and T.P. Barnwell, “Exact Reconstruction for Tree-Structured Subband Coders,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–34, no. 3, pp. 431–441, June 1986.

[49] B. Sylvestre and P. Kabal, “Time-Scale Modification of Speech Using an Incremental Time-Frequency Approach with Waveform Structure Compensation,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 1, pp. 81–84, San Francisco, CA, 1992.

[50] W. Torres and T.F. Quatieri, “Estimation of Modulation Based on FM-to-AM Transduction: Two-Sinusoidal Case,” IEEE Trans. Signal Processing, vol. 47, no. 11, pp. 3084–3097, Nov. 1999.

[51] X. Yang, K. Wang, and S.A. Shamma, “Auditory Representations of Acoustic Signals,” IEEE Trans. Information Theory, vol. 38, no. 2, pp. 824–839, March 1992.

[52] K. Wang and S.A. Shamma, “Self-Normalization and Noise-Robustness in Early Auditory Representations,” IEEE Trans. Speech and Audio, vol. 2, no. 3, pp. 421–435, July 1994.

[53] S.G. Zauner, “Deterministic Signal Reconstruction from the Temporal and Spectral Correlation Functions,” Thesis, Technischen Universität Wien, Sept. 1997.
