Chapter 11
Nonlinear Measurement and Modeling Techniques

11.1 Introduction

Throughout the text, we have seen or alluded to examples of speech events that occur on a fine time scale, for example, on the order of about a pitch period or less. Such speech events, which we will loosely refer to as “fine structure” in the speech waveform, arise from either time-varying linear or nonlinear elements of the speech production mechanism and have importance in distinguishing sounds and voice types. In the context of the linear model, rapid transitions occur across speech phones such as unvoiced-to-voiced transitions, abrupt onsets of sounds such as the attack of a plosive or a voiced sound, and events that are short-lived, closely spaced, or even overlapping such as the burst and aspiration components of a plosive. In this linear context, rapid formant and pitch modulation also occurs, as with formant movement through a diphthong and with vibrato in the singing voice. We introduced mathematical models for many of these events in previous chapters.

Nonlinear aspects of speech production also contribute to speech fine structure. We saw in Chapter 5 that nonlinear coupling between the vocal tract pressure and glottal flow velocity is responsible for rapid formant and bandwidth modulation within a pitch period; such coupling is assumed to be nonexistent in the linear source/filter model. In particular, first formant modulation within a pitch period is manifested by a rapidly increasing bandwidth and the truncation effect that can characterize the speech waveform. Also, nonlinear modes of vocal cord vibration can occur that are responsible for secondary glottal pulses within a glottal cycle (the diplophonic voice), erratic glottal pulses across successive pitch periods (the creaky voice) (Figure 10.15), and small glottal pulse time variations, referred to as pitch jitter (the hoarse voice). Modulations and secondary sources may also arise from the interaction of vortices with vocal tract boundaries such as the epiglottis (false vocal folds), teeth, or occlusions in the oral tract. As an example, a “vortex ring” might modulate the effective area of the vocal tract and thus modulate the response pattern. In a second example, a vortex, traveling along the oral tract, may be converted to a secondary acoustic sound source. In these latter nonlinear models, vortices are generated by a fast-moving air jet from the glottis, quite different from the small compression and rarefaction perturbations associated with the static (i.e., without net displacement) acoustic sound wave in the tract. We alluded to this nonacoustic phenomenon in Chapters 3 and 4 and will more formally describe it in this chapter.

Speech fine structure is often difficult to measure with signal processing techniques such as the short-time Fourier transform and the wavelet transform that are characterized by time-frequency resolution tradeoffs. That is, according to the uncertainty principle reviewed in Chapter 2, we cannot obtain unlimited time and frequency resolution simultaneously, which (to some extent) is often required in the representation of speech fine structure. A purpose of this chapter is to describe alternative high-resolution measurement techniques that aim to “beat” the uncertainty principle and reveal fine structure not seen by the standard approaches.

In Section 11.2, we first review the limitation of the short-time Fourier transform and wavelet transform associated with the uncertainty principle, and discuss some generalizations of these transforms in addressing this limitation. We next move beyond the STFT and wavelet transform in Section 11.3 to more advanced time-frequency analysis methods, including the Wigner distribution and its variations referred to as bilinear time-frequency distributions. These distributions, which seek to undermine the uncertainty principle, have historically been refined and applied in estimating fine structure speech events, largely in the context of linear speech models. In the latter half of this chapter, we introduce a second approach to analysis of fine structure motivated by nonlinear aeroacoustic models for spatially distributed sound sources and modulations induced by nonacoustic fluid motion. Section 11.4 presents the aeroacoustic modeling approach which provides the impetus for the high-resolution Teager energy operator described in Section 11.5. This operator is characterized by a time resolution that can track rapid signal energy changes within a glottal cycle and is the basis for a method of high-resolution estimation of amplitude and frequency modulation in speech formants.

11.2 The STFT and Wavelet Transform Revisited

In Chapter 7, we saw conditions under which an STFT analysis window, typically 20–30 ms in duration, can result in a poor representation of nonstationary resonances or harmonics.1 Formal studies also have been made of the inadequacy of the STFT in representing speech dynamics. In a study by Smits [45], for example, a 25-ms STFT analysis window was shown to result in several undesirable effects, including staircase formant tracks of rapidly varying formants and a flattening-off of formants close to voicing onset. Likewise, in Chapter 8 (also Exercise 8.12), we saw that the wavelet transform, being constrained to good time resolution for high frequencies or good frequency resolution for low frequencies, is inadequate for the representation of many speech sounds.

1 In all analysis methods considered thus far, speech parameters are considered stationary under the analysis window, typically about 25 ms in duration. Only in certain special cases have we considered or alluded to explicit nonstationary models. These include a polynomial model for linear prediction coefficients (Exercise 5.13) and a Gaussian model for time-varying harmonic components (Section 9.4.5). The time-varying linear prediction analysis has been shown to lead to formant transitions that tightly cluster for the three plosive categories /p/, /t/, and /k/; with a stationary assumption, a more diffuse clustering occurs [35].

It behooves us, therefore, to seek alternative analysis methods that provide a time-frequency resolution with which fine structure speech parameters can be represented. This objective takes us to an approach based on nonlinear signal analysis methods that we refer to as bilinear time-frequency distributions. Before giving this more advanced time-frequency analysis, however, we present a common perspective on the STFT and wavelet transform analysis methods using a basis representation, thus setting the stage for more advanced time-frequency analysis.

11.2.1 Basis Representations

We saw in Chapter 8 that in the mathematical development of the wavelet transform, in continuous time, we define a set of functions as the time-scaled and shifted versions of a prototype h(t), i.e.,

hτ,a(t) = (1/√a) h((t − τ)/a)

where h(t) is the basic wavelet, hτ,a(t) are the associated wavelets, τ is the time shift, and a is the scaling factor. We then define the continuous wavelet transform of a signal x(t) as

(11.1)

Xw(τ, a) = ∫ x(t) h*τ,a(t) dt = (1/√a) ∫ x(t) h*((t − τ)/a) dt

which we can think of as a measure of “similarity” or correlation of x(t) with h(t) at different scales and time shifts, as was illustrated in Figure 8.15. We think of the functions hτ,a(t) as basis functions in our representation that are concentrated at time τ and scale a. We saw in Chapter 8 that the wavelet transform is invertible under the admissibility condition in Equation (8.33). Also under this condition, we can show a relation analogous to that of Parseval’s Theorem:

(11.2)

∫ |x(t)|² dt = (1/Ch) ∫∫ |Xw(τ, a)|² (1/a²) dτ da

[with Ch defined in Equation (8.33)] where we think of |Xw(τ, a)|² as an energy density in time and scale (Exercise 11.1). Equation (11.2) states that the total energy in the signal is the integral of the energy density over time and scale with the appropriate scale factor.

A basis representation can also be formulated for the STFT. In continuous time, we define the STFT basis function as

hτ,Ω(t) = w(t − τ) e^(jΩt)

which is concentrated near time τ and frequency Ω. Then the continuous-time STFT is written as

(11.3)

X(τ, Ω) = ∫ x(t) h*τ,Ω(t) dt = ∫ x(t) w(t − τ) e^(−jΩt) dt

In this case, the total energy in the signal is the integral of the energy density over time and frequency, provided that the energy in the analysis window is unity (Exercise 11.1). Both basis representations can be discretized, as we have seen with the discrete STFT and the discrete wavelet transform in Chapters 7 and 8, respectively.
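The energy identity for the STFT can be checked numerically. The following sketch (an illustrative NumPy example, not part of the text) slides a unit-energy window one sample at a time and sums the resulting time-frequency energy density; with this normalization the total matches the signal energy.

```python
import numpy as np

N = 64
win = np.hanning(N)
win /= np.linalg.norm(win)              # unit-energy window: sum(win**2) = 1

rng = np.random.default_rng(0)
x = rng.standard_normal(256)

xp = np.pad(x, (N, N))                  # pad so every sample meets every window shift
E_tf = 0.0
for n in range(len(xp) - N):            # one-sample hop
    X = np.fft.fft(xp[n:n + N] * win)
    E_tf += np.sum(np.abs(X) ** 2) / N  # per-frame Parseval
E_sig = np.sum(x ** 2)
```

With a one-sample hop, every signal sample is weighted by every window sample exactly once, so the unit window energy makes the two totals agree.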

11.2.2 Minimum Uncertainty

For the STFT and the wavelet transform, the basis functions define the time-frequency resolving power of the transform. In both cases, this resolution is limited by the uncertainty principle, expressed, as in Chapter 2, through the variance about the mean along each dimension, i.e., frequency and time or scale and time. For frequency and time, the variances of a short-time segment are governed by those of the short-time analysis window w(t), assumed here to have unit energy. Specifically, we are interested in the variances

(11.4)

σt² = ∫ (t − ⟨t⟩)² |w(t)|² dt,  σΩ² = (1/2π) ∫ (Ω − ⟨Ω⟩)² |W(Ω)|² dΩ

where the mean values are

(11.5)

⟨t⟩ = ∫ t |w(t)|² dt,  ⟨Ω⟩ = (1/2π) ∫ Ω |W(Ω)|² dΩ

The product of the variances must not be less than the lower limit of the uncertainty principle:

σt σΩ ≥ 1/2

For the STFT, the variances, as well as their product, remain fixed as hτ,Ω(t) = w(t − τ) e^(jΩt) moves about in the time-frequency plane. Because hτ,Ω(t) defines the resolving power of the STFT, we refer to hτ,Ω(t) as a “cell” or “atom” in the time-frequency plane. It is of interest to consider the window that gives the minimum uncertainty. For the STFT, this optimal window can be shown to equal a Gaussian function modulated by a complex sinusoid. In other words,

(11.6)

σt σΩ = 1/2

for the basis functions of the form

hτ,Ω(t) = α e^(−b(t − τ)²) e^(jΩt)

which are concentrated at time τ and frequency Ω.
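As a numerical illustration (a sketch with arbitrary sampling and window widths, not from the text), the time-bandwidth product built from the variance definitions of Equations (11.4)–(11.5) can be estimated for a Gaussian window and compared against a Hanning window; only the Gaussian reaches the lower limit, a variance product of (1/2)² = 1/4.

```python
import numpy as np

def time_bandwidth(w, dt):
    # variance product sigma_t^2 * sigma_Omega^2 per Eqs. (11.4)-(11.5)
    t = np.arange(len(w)) * dt
    pt = np.abs(w) ** 2
    pt /= pt.sum() * dt
    mt = (t * pt).sum() * dt
    var_t = ((t - mt) ** 2 * pt).sum() * dt

    Nfft = 1 << 17                      # dense frequency grid for the spectrum
    W = np.fft.fftshift(np.fft.fft(w, Nfft)) * dt
    om = 2 * np.pi * np.fft.fftshift(np.fft.fftfreq(Nfft, dt))
    pw = np.abs(W) ** 2
    dom = om[1] - om[0]
    pw /= pw.sum() * dom
    mo = (om * pw).sum() * dom
    var_om = ((om - mo) ** 2 * pw).sum() * dom
    return var_t * var_om

dt = 1e-3
t = np.arange(-0.5, 0.5, dt)
gauss = np.exp(-t ** 2 / (2 * 0.05 ** 2))   # Gaussian window
hann = np.hanning(len(t))                   # for comparison

tb_gauss = time_bandwidth(gauss, dt)        # ~0.25, the minimum
tb_hann = time_bandwidth(hann, dt)          # strictly larger
```

The Gaussian is the unique minimizer, so any other window produces a strictly larger product.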

An analogous result holds for time and scale with the wavelet transform. The uncertainty principle for time and scale is given by

(11.7)

Image

where ⟨t⟩ equals the mean time of the wavelet hτ,a(t) as defined in Equation (11.5) [10],[17]. The intuition behind this changing limit is as follows. Suppose that we stretch out a basic wavelet in time to form a new wavelet, thus increasing its time variance σt². Because scale can be thought of as the reciprocal of frequency (Example 8.3), the scale variance σa² must also increase, and so does the product σt²σa². In stretching out the basic wavelet, the mean time (i.e., location) ⟨t⟩ of the resulting wavelet likewise increases. Equation (11.7) thus provides a constraint among these three increasing quantities.2

2 Nevertheless, the uncertainty relations for frequency and scale are essentially the same because normalizing the scale variance by the mean time in the uncertainty relation for time and scale, we obtain for Equation (11.7) Image. This is consistent with Figure 8.14 which shows that, although the time-frequency atom of the wavelet transform changes with frequency, in contrast to the fixed time-frequency resolution of the STFT, the Heisenberg constant remains fixed in the time-frequency plane.

For the wavelet transform, the modulated Gaussian is not the function that achieves the lower bound; rather, for the wavelet transform the gammachirp function achieves it [17]. The gammachirp function is of the form

g(t) = α t^b e^(−ct) e^(j[Ωt + d log(t)])

where the envelope α t^b e^(−ct) is a gamma distribution function and where the complex exponential e^(j[Ωt + d log(t)]) has instantaneous frequency decreasing with time, Ω + d/t, and thus the name “gammachirp.” The difference in the functions that achieve the Heisenberg limit for the STFT and the wavelet transform, corresponding to different basis representations, further illustrates that the formulation of an uncertainty principle depends on the particular framework in which it is derived.
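A gammachirp can be synthesized directly from this form to verify the decreasing instantaneous frequency; the parameter values below are illustrative assumptions, not values from the text.

```python
import numpy as np

dt = 1e-5
t = np.arange(100, 10100) * dt                    # t > 0 so that log(t) is defined
b, c, Om, d = 3.0, 40.0, 2 * np.pi * 500, 0.01    # illustrative parameters
g = t ** b * np.exp(-c * t) * np.exp(1j * (Om * t + d * np.log(t)))

# instantaneous frequency = phase derivative, approximately Om + d/t (rad/s)
inst_f = np.diff(np.unwrap(np.angle(g))) / dt
analytic = Om + d / (t[:-1] + dt / 2)             # Om + d/t at the midpoints
```

The numerical phase derivative matches Ω + d/t and decays monotonically toward Ω, the downward chirp that gives the function its name.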

In the frequency domain, the gammachirp filter has an asymmetric spectral magnitude characteristic. Moreover, it is fascinating that its particular shape has been found to provide a good fit to a variety of auditory filters, including the cochlear filters along the basilar membrane that we studied in Chapter 8 and critical band masking filters that we will study in Chapter 13 [17],[27]. It is also fascinating to observe that the optimality (in the sense of giving minimum uncertainty) for gammachirp filters in aural perception is analogous to that of modulated Gaussian functions in visual perception. Indeed, measurements in a cat’s visual cortex have found the presence of modulated Gaussian filters with respect to spatial resolution [11]. In these biological systems, the modulated Gaussian and the gammachirp filters represent a highly redundant and overcomplete basis.3 This redundancy, however, may be desirable in that the loss of certain filters still allows a representation from which a signal may be recovered.

3 This is in contrast to a complete representation where there are as many representation coefficients as there are sample values of a signal or an undercomplete representation for which some signals cannot be recovered.

In both the STFT and the wavelet transform, we have enforced an atom (or “cell”) with a particular time-frequency (scale) resolution on the signal. We can, on the other hand, consider adapting the time-frequency resolution of each atom to account for local signal properties [28]. Adaptive bases can be formulated to meet certain desired conditions such as completeness and invertibility. One adaptive basis is the local cosine basis which nonuniformly divides the time axis while uniformly segmenting frequency to adapt to local temporal properties. The dual to the local cosine basis is the wavelet packet basis that uniformly shifts the analysis window in time and nonuniformly divides the frequency dimension, adapting to the local frequency properties of the signal. A comparison of the “atomic tilings” of a wavelet packet basis and local cosine basis is shown in Figure 11.1. The particular cosine or wavelet packet basis can be selected with algorithms referred to as “best basis” searches that adapt to the particular time-frequency characteristics of the signal [28]. The basis is chosen to represent a signal, in a mean-squared-error sense, with a set number of time-frequency atoms. These techniques result in basis coefficients that highlight certain properties of a signal and have found usefulness in speech coding and noise reduction. Recall that we saw an example in Chapter 8 of how wavelet coefficients (albeit, not a “best basis”) can represent signal singularities from which the signal can be reconstructed.

Figure 11.1 Comparison of (a) wavelet packet and (b) local cosine bases. With the wavelet packet basis, for each nonuniform frequency division there is one time resolution. With the local cosine basis, for each nonuniform time division there is one frequency resolution.

SOURCE: S. Mallat, A Wavelet Tour of Signal Processing [28] (Figures 1.4 and 1.5 of Chapter 1). ©1998, Academic Press. Used by permission of the publisher.

Image

11.2.3 Tracking Instantaneous Frequency

An important illustration of the properties of the STFT and the wavelet transform is through the problem of estimating a time-varying frequency in a complex exponential signal of the form x(t) = a(t)e^(jθ(t)). For this signal, we have defined the instantaneous frequency as the derivative of the phase, i.e., dθ(t)/dt. Recall also that we have considered the magnitude squared of the signal’s Fourier transform, |X(Ω)|², as an energy density in frequency and the magnitude squared of the signal, |x(t)|², as an energy density in time. With these signal characteristics, it can then be shown that average and instantaneous frequency are related by (Exercise 11.2)

(11.8)

⟨Ω⟩ = (1/2π) ∫ Ω |X(Ω)|² dΩ = ∫ (dθ(t)/dt) a²(t) dt

This is an intriguing result relating the average frequency, as defined in the frequency domain, to average instantaneous frequency, as defined in the time domain using our definition of instantaneous frequency. Consider now the average bandwidth of a signal as defined in Equation (11.4). We saw in Chapter 2 (Exercise 2.8) that this average bandwidth for a continuous-time signal of the form x(t) = a(t)e^(jθ(t)) can be expressed as

(11.9)

σΩ² = ∫ (da(t)/dt)² dt + ∫ (dθ(t)/dt − ⟨Ω⟩)² a²(t) dt

The first term in Equation (11.9) is the average rate of change of the signal envelope a(t) and thus reflects how fast the general signal shape is changing, as well as the signal duration, in that a shorter signal gives a more rapidly changing envelope. We will see shortly that the quantity |(da(t)/dt)/a(t)| can be interpreted as an instantaneous bandwidth of the signal. The second term is the average deviation of the instantaneous frequency dθ(t)/dt around the average frequency ⟨Ω⟩ and thus is a measure of frequency spread.
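Both relations can be verified numerically for a simple test signal. The sketch below (illustrative NumPy code; the Gaussian envelope and linear-FM phase are arbitrary choices, not from the text) computes the average frequency of Equation (11.8) and the bandwidth of Equation (11.9) in both the time and frequency domains.

```python
import numpy as np

fs = 4000.0
dt = 1 / fs
t = np.arange(-2, 2, dt)
a = np.exp(-t ** 2 / (2 * 0.2 ** 2))           # Gaussian envelope (arbitrary)
theta = 2 * np.pi * (100 * t + 25 * t ** 2)    # so theta'(t) = 2*pi*(100 + 50 t)
x = a * np.exp(1j * theta)
E = np.sum(np.abs(x) ** 2) * dt                # normalize to unit energy
a = a / np.sqrt(E)
x = x / np.sqrt(E)

dth = 2 * np.pi * (100 + 50 * t)               # instantaneous frequency (rad/s)
da = np.gradient(a, dt)

# time-domain side of Eq. (11.8) and right side of Eq. (11.9)
mean_om_t = np.sum(dth * a ** 2) * dt
band_t = np.sum(da ** 2) * dt + np.sum((dth - mean_om_t) ** 2 * a ** 2) * dt

# frequency-domain averages computed from |X(Omega)|^2
Nfft = 1 << 17
X = np.fft.fftshift(np.fft.fft(x, Nfft)) * dt
om = 2 * np.pi * np.fft.fftshift(np.fft.fftfreq(Nfft, dt))
P = np.abs(X) ** 2 / (2 * np.pi)
dom = om[1] - om[0]
mean_om_f = np.sum(om * P) * dom
band_f = np.sum((om - mean_om_f) ** 2 * P) * dom
```

For the symmetric envelope, both routes give an average frequency of 2π·100 rad/s, and the two bandwidth computations agree, splitting the spread between the envelope term and the frequency-deviation term.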

One application of the STFT and wavelet transform is the estimation of the instantaneous frequency of a complex sinewave x(t) = a(t)e^(jθ(t)). Indeed, we have seen this application in Chapter 9 in sinewave analysis where peak-picking the STFT magnitude was applied to estimate sinewave amplitudes, frequencies, and phases. It can be shown that the sinewave frequency dθ(t)/dt corresponds to the location of the maximum of the spectrogram |X(t, Ω)|², provided that the instantaneous bandwidth |(da(t)/dt)/a(t)| and instantaneous frequency dθ(t)/dt have small variation over the duration of the analysis window [28]. This “ridge tracking” [28] method of instantaneous frequency estimation is expressed formally as

Image

and

Image

provided the instantaneous frequency and bandwidth are slowly varying over the analysis window.
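A minimal ridge-tracking sketch (illustrative, not from the text): peak-pick each short-time spectrum of a slowly swept chirp and compare against the true instantaneous frequency.

```python
import numpy as np

fs = 8000
t = np.arange(0, 1, 1 / fs)
f0, f1 = 500, 1500
x = np.cos(2 * np.pi * (f0 * t + 0.5 * (f1 - f0) * t ** 2))  # 500 -> 1500 Hz chirp
inst_f = f0 + (f1 - f0) * t                                  # true inst. frequency

N, hop, Nfft = 256, 128, 4096
win = np.hanning(N)
est, ref = [], []
for n in range(0, len(x) - N, hop):
    X = np.fft.rfft(x[n:n + N] * win, Nfft)                  # zero-padded spectrum
    est.append(np.argmax(np.abs(X)) * fs / Nfft)             # spectrogram ridge
    ref.append(inst_f[n + N // 2])                           # truth at frame center
err = np.max(np.abs(np.array(est) - np.array(ref)))
```

Because the sweep is slow relative to the 32-ms window, the ridge stays within a few hertz of the instantaneous frequency at each frame center; a faster sweep degrades this, as discussed next.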

In ridge tracking, a problem arises with rapidly changing frequency, as occurs, for example, with rapidly changing pitch; in this case, our assumption of slow variation in instantaneous frequency and bandwidth breaks down and the ridges become less distinct and meaningful. We saw in Figure 7.8 an example of this signal characteristic for a single frequency-modulated sinewave. An example of this property for a harmonic complex of sinewaves, comparing constant pitch and linearly changing pitch, is shown in Figure 11.2. We see that the spectral slice in the latter case is ragged at high frequencies, which across time gives no distinct ridge or harmonic structure. Exercise 11.3 asks the reader to investigate this condition for ridge structure in more detail.

The wavelet transform also provides instantaneous frequency estimation by using ridges of the scalogram, defined in Chapter 8 as |Xw(t, a)|². In particular, with time-varying frequency, a scalogram ridge analysis can more accurately track rapid changes at high frequencies by virtue of its increased temporal resolution. As with the STFT, under certain conditions on how fast the sinewave amplitude and frequency change, the wavelet ridges correspond to the sinewave instantaneous frequency, i.e.,

Image

and

Image

where c is an appropriate constant that relates scale to frequency. As an example, for a single sinewave, an improvement in frequency estimation over the STFT has been demonstrated for a hyperbolic chirp frequency, a signal well-suited to wavelet analysis because the rate of frequency change increases with frequency [28]. In this case, a staircase frequency track arises using the spectrogram, while the scalogram ridges approximately follow the hyperbolic chirp [28].

Figure 11.2 Short-time spectral comparison of two voiced synthetic waveforms: (a) waveform with constant pitch; (b) log-spectral magnitude of (a); (c) waveform with linearly changing pitch; (d) log-spectral magnitude of (c). The waveforms are obtained by exciting a linear system with a pulse train. The linear system consists of resonances at 500 Hz and 1500 Hz. The period of the pulse train is fixed at 5 ms in (a), while it is increasing by 0.2 ms per period in (c). The Hamming window used in the short-time analysis is superimposed on the waveforms.

Image

The wavelet transform, as well as its extension to adaptive bases, can also help in the representation of speech signal components, for example, transient and rapidly varying sinewaves, as was noted in Chapter 9. Nevertheless, this resolution gain is limited. With harmonic spectra, for example, there is a limit to frequency resolution at high frequencies, especially when frequencies are closely spaced (Exercises 8.12 and 11.3). The problem is that, regardless of the configuration of time-frequency atoms, these approaches are constrained by the time-frequency uncertainty principle, both in the representation and in the estimation of speech features; the transforms are obtained by correlating the signal with a time-frequency atom that is characterized by a limited time-frequency resolution. As pointed out by Mallat [28], ideally, we want a time-frequency distribution that does not spread energy in time and frequency. Our next section moves in this direction with a nonlinear signal representation, in contrast to our analysis thus far that has been linear.

11.3 Bilinear Time-Frequency Distributions

In this section, we describe a number of time-frequency representations that possess many of the properties lacking in the spectrogram and scalogram. We find, however, that these advantages may be traded off for a loss in some of the desirable properties of the conventional representations.

11.3.1 Properties of a Proper Time-Frequency Distribution

The spectrogram |X(t, Ω)|2 and the scalogram |Xw(t, a)|2 are two examples of an energy density that describes how energy is distributed in time and frequency (or scale); hence we use the nomenclature “time-frequency distribution” interchangeably with the term “energy density.” Often the energy density is viewed similarly to that of a probability density function. As such, means, variances, and integrals along one or two dimensions should behave similarly. Based on this relation to a probability function, we first define what is referred to as a proper time-frequency distribution [10].

One property that must be satisfied by a proper time-frequency distribution P(t, Ω) is positivity (more precisely, nonnegativity), i.e.,

P(t, Ω) ≥ 0

because we want P(t, Ω) to represent energy density. A second property is that the marginals must be satisfied. By this we mean

(11.10)

(1/2π) ∫ P(t, Ω) dΩ = P(t),  ∫ P(t, Ω) dt = P(Ω)

That is, integration of the distribution over all frequencies at a particular time yields the energy density at that time, while integration of the distribution over all time at a particular frequency yields the energy density at that frequency. As shown above, we denote these purely time and frequency energy densities by P(t) and P(Ω), respectively. A third desired property is that the total energy of the distribution sum to one:

E = (1/2π) ∫∫ P(t, Ω) dΩ dt = 1

Then, from the marginal property in Equation (11.10), the total energy E equals the energy in the signal, i.e.,

E = ∫ P(t) dt = ∫ |x(t)|² dt

which equals unity. This unity energy constraint simply implies that we have properly normalized the signal in order that the notion of density be valid. Another property of a proper density function is that the density function has finite support in one of the variables, given that the signal itself or its spectrum has finite support in that variable. That is, if a signal is zero outside the range [t1, t2], then we want the distribution P(t, Ω) to be zero outside this range. We call this a weak finite support condition [10]. A similar condition can be made along the frequency variable, i.e., for a bandlimited signal. Observe that the energy density cannot have weak finite support in both time and frequency because this would violate the uncertainty principle. An even stronger condition, the strong finite support condition, is that strips of zero energy over specific time or frequency intervals map to strips of zero energy in the time-frequency distribution. This finite support constraint is loosely related to a property that preserves the number of components; for example, the distribution of a two-component signal should reflect only these two components and not additional unwanted terms, sometimes referred to as cross terms, due to nonlinearity of a distribution operator. This notion of cross terms will become clearer as we proceed.

Based on this density concept and the above properties, we can now give certain global and conditional averages and variances of the time and frequency variables. We have already encountered the global averages and variances in our earlier discussion of the uncertainty principle. These are given in Equations (11.4) and (11.5), respectively, representing the spread of time (duration), the spread of frequency (bandwidth), and the mean time and frequency. We are also interested in the conditional quantities of the mean frequency at a certain time and the dual quantity of the mean time at a certain frequency. We often also refer to these conditionals as “local” quantities. Motivated by our analogy with a probability density function, we write P(t, Ω) = P(Ω|t) P(t) and, likewise, P(t, Ω) = P(t|Ω) P(Ω), and thus define conditional mean frequency and time, respectively, as

(11.11)

⟨Ω⟩t = (1/(2π P(t))) ∫ Ω P(t, Ω) dΩ,  ⟨t⟩Ω = (1/P(Ω)) ∫ t P(t, Ω) dt

Furthermore, when the marginals are satisfied, we have from Equation (11.10)

P(t) = |x(t)|² and P(Ω) = |X(Ω)|²

Likewise, the corresponding conditional variances are given by

(11.12)

σ²Ω|t = (1/(2π P(t))) ∫ (Ω − ⟨Ω⟩t)² P(t, Ω) dΩ,  σ²t|Ω = (1/P(Ω)) ∫ (t − ⟨t⟩Ω)² P(t, Ω) dt

where, when the marginals are satisfied by P(t, Ω), then, as with the mean expressions, we can replace P(t) and P(Ω) by |x(t)|2 and |X(Ω)|2, respectively.

Finally, we are interested in the relation between the global and local means and variances. Here again, we are motivated by our probability density function analogy. For the means, we can show that

(11.13)

⟨Ω⟩ = ∫ ⟨Ω⟩t P(t) dt,  ⟨t⟩ = (1/2π) ∫ ⟨t⟩Ω P(Ω) dΩ

For the variances, Cohen [10] has shown that for an arbitrary proper time-frequency distribution, the global variances of frequency and time are given by

(11.14)

σΩ² = ∫ σ²Ω|t P(t) dt + ∫ (⟨Ω⟩t − ⟨Ω⟩)² P(t) dt

σt² = (1/2π) ∫ σ²t|Ω P(Ω) dΩ + (1/2π) ∫ (⟨t⟩Ω − ⟨t⟩)² P(Ω) dΩ

We see, then, that there are two contributions to the global variances. For the frequency variable, for example, we have (1) the average over all time of the conditional variance in frequency and (2) the average over all time of the squared deviation of the conditional mean frequency about its global mean. An analogous result holds for the time variable. Exercise 11.4 asks you to work through the derivation of these results.

To gain intuition for these concepts, we consider the specific case of a complex sinewave of the form x(t) = a(t)e^(jθ(t)). For this signal, we saw in Equation (11.8) the relation between the global frequency average and the instantaneous frequency. This relation supports the notion of the phase derivative dθ(t)/dt as the instantaneous frequency, i.e., the frequency at each time of the complex sinewave, because the relation reveals the global frequency average as the average of dθ(t)/dt weighted by a²(t) at each time. Moreover, when the marginals are satisfied, it follows from Equation (11.13) that the (conditional) mean frequency at a particular time is the instantaneous frequency, i.e., ⟨Ω⟩t = dθ(t)/dt (Exercise 11.5). The dual result for the global time average is ⟨t⟩ = −(1/2π) ∫ (dθ(Ω)/dΩ) |X(Ω)|² dΩ, where θ(Ω) = ∠X(Ω). We call −dθ(Ω)/dΩ the group delay which, likewise, when the marginals are satisfied can be shown to equal the (conditional) mean time at a particular frequency. When the marginals are satisfied, we thus have the equation pair:

(11.15)

⟨Ω⟩t = dθ(t)/dt,  ⟨t⟩Ω = −dθ(Ω)/dΩ

It is important to note that the marginals are not necessarily satisfied for an arbitrary time-frequency distribution P(t, Ω), and thus instantaneous frequency and group delay cannot always be computed as above from the distribution. We will see shortly that the spectrogram is one such distribution.

Consider now the global and conditional variances for the complex sinewave x(t) = a(t)e^(jθ(t)). We have seen above that the instantaneous frequency is the (conditional) average of frequencies that exist at a particular time. Therefore, we can think of the conditional variance as the spread in frequency about this average at each time. We have also seen that the global variance of frequency for the complex sinewave is given by Equation (11.9). To gain further insight into global variance, we rewrite Equation (11.9) as

(11.16)

σΩ² = ∫ [(da(t)/dt)/a(t)]² a²(t) dt + ∫ (dθ(t)/dt − ⟨Ω⟩)² a²(t) dt

A dual expression can be obtained for the global variance of time. By comparing Equations (11.14) and (11.16), Cohen has argued that under the condition that the marginals are satisfied and that the conditional variance is positive (a plausible constraint), then the conditional variance of frequency at a particular time is given by

(11.17)

σ²Ω|t = [(da(t)/dt)/a(t)]²

Because |(da(t)/dt)/a(t)| is the spread of frequency at a particular time, it can be thought of as an “instantaneous bandwidth” [10], depending only on the amplitude and not the phase of the signal. For a constant-amplitude, frequency-modulated sinewave, the instantaneous bandwidth is zero, there being one frequency at each time instant, i.e., the derivative of the phase [10]. In addition, in this case, we see in Equation (11.16) that the only contribution to global bandwidth is the spread of instantaneous frequency about the global mean frequency.
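A small numerical illustration (a sketch with arbitrary parameters, not from the text): the instantaneous bandwidth |(da(t)/dt)/a(t)| vanishes for a constant envelope regardless of the FM phase, while a decaying envelope yields a nonzero value set by the envelope alone.

```python
import numpy as np

dt = 1e-4
t = np.arange(0, 1, dt)
phase = 2 * np.pi * (100 * t + 40 * t ** 2)     # an arbitrary FM phase

def inst_bw(a, dt):
    # instantaneous bandwidth |a'(t)/a(t)| of Eq. (11.17)
    return np.abs(np.gradient(a, dt) / a)

a_decay = np.exp(-3 * t)                        # decaying envelope
x_const = np.exp(1j * phase)                    # constant-amplitude FM
x_decay = a_decay * np.exp(1j * phase)

bw_const = inst_bw(np.abs(x_const), dt)         # ~0 everywhere
bw_decay = inst_bw(np.abs(x_decay), dt)         # ~3 (1/s) everywhere
```

The phase drops out entirely: both bandwidths are computed from the signal envelope, and only the decaying case shows a nonzero spread.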

Consider now applying the above concepts to signals with more than one complex sinewave component, as, for example, voiced speech signals. The value of the instantaneous bandwidth of each component determines whether a time-frequency distribution is capable of revealing the components. For example, for a two-component signal x(t) = a1(t)e^(jθ1(t)) + a2(t)e^(jθ2(t)), the two instantaneous frequencies at a time instant t are revealed as separate ridges in the distribution, provided that (Exercise 11.7) [10]

(11.18)

|dθ1(t)/dt − dθ2(t)/dt| > |(da1(t)/dt)/a1(t)| + |(da2(t)/dt)/a2(t)|

Likewise, the separation of signal components in time at a particular frequency is determined by the time spread of each component at that frequency relative to its group delay (Exercise 11.7). These constraints are meaningful only when the marginals of a time-frequency distribution are satisfied and thus Equations (11.15) and (11.17) hold.

The properties of a proper time-frequency distribution are summarized as:

P1: The distribution is positive.

P2: The total energy integrates to one.

P3: The marginals are satisfied.

P4: The conditional mean frequency and its variance are positive and, for a complex sinewave x(t) = a(t)e^(jθ(t)), give the instantaneous frequency dθ(t)/dt and squared instantaneous bandwidth [(da(t)/dt)/a(t)]², respectively (when property P3 holds). The dual relations for conditional mean time and its variance also hold.

P5: The distribution has strong finite support and there are no cross terms.

We now look at the properties of a number of specific distributions.

11.3.2 Spectrogram as a Time-Frequency Distribution

Two energy densities that we have studied thus far are the spectrogram, based on the STFT, and the scalogram, based on the wavelet transform. In general, these energy densities do not satisfy many of the properties of our proper time-frequency distribution. In fact, typically, an arbitrary energy density violates some of our desired properties. For illustration, we focus in this section on the spectrogram. Analogous properties of the scalogram can be found in [10],[28].

We begin with the total energy for the spectrogram. Denoting the analysis window by w(t), it is possible to show that the total energy E is given by (Exercise 11.8)

E = (1/2π) ∫∫ |X(t, Ω)|² dΩ dt = [∫ |x(t)|² dt][∫ |w(t)|² dt]

Therefore, if the energy of the window function is unity, then the energy in the spectrogram equals the energy in the signal. The marginals, i.e., integrating out time and frequency from the spectrogram, however, are not the desired time and frequency densities, |x(t)|² and |X(Ω)|², respectively (Exercise 11.8). A consequence of this is that the conditional mean frequency and its variance, although positive, do not in general equal the instantaneous frequency and instantaneous bandwidth, respectively, for a complex sinewave of the form x(t) = a(t)e^(jθ(t)). Specifically,

⟨Ω⟩t ≠ dθ(t)/dt and σ²Ω|t ≠ [(da(t)/dt)/a(t)]²

Nevertheless, it can be shown that as the analysis window duration goes to zero, approaching an impulse, the conditional mean approaches the instantaneous frequency [10]. In this case, however, the conditional variance σ²Ω|t goes to infinity, making the measurement of instantaneous frequency difficult [10]. The dual relations for conditional mean time and its variance also do not generally hold in the measurement of group delay.

Another violation of a desired property is that of finite support. A signal with finite support does not maintain this property in the spectrogram, as with, for example, a sinewave of finite length; the onset and offset are smeared in time in the time-frequency distribution. Moreover, the spectrogram has neither weak nor strong finite support. These limitations are a reflection of the analysis window being “entangled” with the signal within the distribution [10]; we see in Equation (11.3) that the time-frequency resolution of the spectrogram is limited by the window’s time-frequency resolution that is constrained by the uncertainty principle. Finally, we note that the spectrogram can introduce cross terms when more than one signal component is present. We earlier defined cross terms to mean terms that are not present in the original signal and that are created because the distribution is nonlinear in the signal. In the spectrogram, however, these cross terms are often hidden by the original components in the distribution [10].
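The smearing of finite support by the spectrogram is easy to demonstrate (an illustrative sketch, not from the text): for an impulse, the time marginal is not a spike |x(t)|² but a copy of the squared window spread over the window duration, even though total energy is preserved.

```python
import numpy as np

N = 64
win = np.hanning(N)
win = win / np.linalg.norm(win)      # unit-energy analysis window
x = np.zeros(512)
x[256] = 1.0                         # an impulse: |x(t)|^2 is a single spike

Nfft = 128
marg = []
for n in range(len(x) - N):
    X = np.fft.fft(x[n:n + N] * win, Nfft)
    marg.append(np.sum(np.abs(X) ** 2) / Nfft)   # integrate out frequency
marg = np.array(marg)
support = int(np.sum(marg > 1e-12))  # ~window length, not a single sample
```

The marginal sums to the signal energy (the unit-energy window preserves totals), but its support is on the order of the window length, illustrating the loss of both weak and strong finite support.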

In the next section, we explore the Wigner distribution that preserves many of the properties that the spectrogram violates. We will see, however, that certain desirable properties of the spectrogram are not held by the Wigner distribution. We then investigate variations of the Wigner distribution that merge the good properties of the spectrogram and the Wigner distribution. Finally, we illustrate the use of such distributions in speech analysis.

11.3.3 Wigner Distribution

The Wigner distribution was originally developed in the context of quantum mechanics by Wigner [53] and was introduced into signal processing in 1948 by Ville [52], who was motivated by the measurement of instantaneous frequency. Since the work of Ville, further development of the Wigner distribution in the context of signal processing has been made, most notably in a series of papers by Claasen and Mecklenbräuker [7],[8],[9]. In this and the following few sections, we describe the Wigner distribution and its variations in continuous time. Extensions of the Wigner distribution to discrete time have been developed in [8],[18].

The Wigner distribution is given by

(11.19)

$$W(t, \Omega) = \frac{1}{2\pi} \int_{-\infty}^{\infty} x^*\!\left(t - \frac{\tau}{2}\right) x\!\left(t + \frac{\tau}{2}\right) e^{-j\Omega\tau}\, d\tau$$

which can also be written in terms of the signal spectrum, i.e., (Exercise 11.11)

(11.20)

$$W(t, \Omega) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X^*\!\left(\Omega + \frac{\theta}{2}\right) X\!\left(\Omega - \frac{\theta}{2}\right) e^{-j\theta t}\, d\theta$$

The Wigner distribution is one of a class of bilinear distributions, where the signal x(t) enters twice into the calculation of the transform. We can view the Wigner distribution as correlating a signal with itself translated both in time and in frequency. A signal that is time- or frequency-limited remains so over the same time-frequency intervals in the Wigner representation, but zero gaps in a signal or its spectrum are not necessarily maintained in the distribution (Exercise 11.11). The Wigner distribution thus has weak, but not strong finite support. A consequence of bilinearity, however, is that the Wigner distribution typically goes negative and thus its meaning as an energy density is sometimes questioned.
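A minimal discrete sketch may help make Equation (11.19) concrete. The discretization conventions and chirp parameters below are our own assumptions, not the book's; the slice uses the common discrete pseudo-Wigner form, whose frequency axis runs over half the usual band because of the factor of two in the lag. For a linear chirp, the slice at time n concentrates at the instantaneous frequency:

```python
import numpy as np

def wigner_slice(x, n, M, omegas):
    """One time slice of a discrete pseudo-Wigner distribution,
    W(n, w) = sum_{|m|<=M} x[n+m] x*[n-m] e^{-j 2 w m} (real-valued)."""
    m = np.arange(-M, M + 1)
    r = x[n + m] * np.conj(x[n - m])
    return np.real(np.exp(-2j * np.outer(omegas, m)) @ r)

# Linear chirp: instantaneous frequency 0.2*pi + 0.6*pi*(n/N) rad/sample
N = 1024
n = np.arange(N)
phase = 0.2 * np.pi * n + 0.3 * np.pi * n**2 / N
x = np.exp(1j * phase)

omegas = np.linspace(0, np.pi, 512, endpoint=False)
n0 = 512
W = wigner_slice(x, n0, M=200, omegas=omegas)

peak = omegas[np.argmax(W)]                     # location of the slice maximum
inst_freq = 0.2 * np.pi + 0.6 * np.pi * n0 / N  # instantaneous frequency at n0
print(peak, inst_freq)
```

For a quadratic phase the lag product x[n+m]x*[n−m] is exactly a complex exponential in m at twice the instantaneous frequency, which is why the slice is so sharply concentrated along the chirp trajectory.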

Another important property of the Wigner distribution is that, unlike the spectrogram, it satisfies the time-frequency marginals (Exercise 11.11):

(11.21)

$$\int_{-\infty}^{\infty} W(t, \Omega)\, d\Omega = |x(t)|^2 \qquad \text{and} \qquad \int_{-\infty}^{\infty} W(t, \Omega)\, dt = |X(\Omega)|^2$$

and because the marginals are satisfied, the desired energy relation follows:

(11.22)

$$\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} W(t, \Omega)\, dt\, d\Omega = \int_{-\infty}^{\infty} |x(t)|^2\, dt = E$$

In addition, when the marginals are satisfied, the conditional mean frequency and time are correctly calculated [10],[28], i.e.,

$$\langle \Omega \rangle_t = \dot{\phi}(t) \qquad \text{and} \qquad \langle t \rangle_\Omega = -\dot{\psi}(\Omega),$$

where $x(t) = A(t)e^{j\phi(t)}$ and $X(\Omega) = B(\Omega)e^{j\psi(\Omega)}$.

The conditional variances, however, generally give less plausible results and can even go negative [10].
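The marginal property can be checked numerically in the discrete setting. The sketch below uses our own discrete conventions and an arbitrary AM-FM test signal (neither is from the text): averaging a pseudo-Wigner slice over its aliasing-free frequency band recovers the instantaneous power |x[n]|², the discrete counterpart of the time marginal:

```python
import numpy as np

def wigner_slice(x, n, M, omegas):
    """Discrete pseudo-Wigner slice W(n, w) = sum_m x[n+m] x*[n-m] e^{-j2wm}."""
    m = np.arange(-M, M + 1)
    r = x[n + m] * np.conj(x[n - m])
    return np.real(np.exp(-2j * np.outer(omegas, m)) @ r)

# An amplitude- and frequency-modulated test signal
N = 1024
n = np.arange(N)
x = (1 + 0.5 * np.cos(2 * np.pi * n / N)) * \
    np.exp(1j * (0.3 * np.pi * n + 20 * np.sin(2 * np.pi * n / N)))

omegas = np.linspace(0, np.pi, 512, endpoint=False)
n0 = 400
W = wigner_slice(x, n0, M=150, omegas=omegas)

# Averaging the slice over the frequency grid isolates the zero-lag term
# x[n0] x*[n0], i.e., the instantaneous power at time n0
marginal = W.mean()
power = np.abs(x[n0]) ** 2
print(marginal, power)
```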

For certain signals, the Wigner distribution provides a representation more discernible than the spectrogram. For a chirp signal, i.e., a signal of the form $x(t) = e^{j(\Omega_0 t + \beta t^2/2)}$ with constant amplitude and linearly changing instantaneous frequency over all time, the Wigner distribution is a two-dimensional delta function concentrated along the instantaneous frequency, i.e., the Wigner distribution is non-zero only along the trajectory $\Omega = \Omega_0 + \beta t$. As such, for this signal the conditional variance (and hence instantaneous bandwidth) is zero:

$$\sigma^2_{\Omega|t} = 0.$$

We see then that for this particular signal class, because the distribution is a two-dimensional delta function, we have, in this conditional framework, “beaten” the uncertainty principle: the conditional, or what we also think of as the local, time-bandwidth product is zero, i.e., [25]

$$\sigma_{t|\Omega}\,\sigma_{\Omega|t} = 0.$$

Observe, however, that the global time-bandwidth product $\sigma_t \sigma_\Omega$ is still constrained by the uncertainty principle.

On the other hand, because of the presence of the analysis window, the “beating” of the uncertainty principle by the local time-bandwidth product can never occur with the spectrogram. (Exercise 11.12 asks you to derive bounds on the conditional time-bandwidth product for an arbitrary distribution, compare differences in the local and global time-bandwidth products, and consider the conditional time-bandwidth product for the spectrogram.) For a finite-duration chirp, moreover, the Wigner distribution is spread in frequency [10]. For a cubic phase, i.e., quadratic instantaneous frequency, the Wigner distribution is concentrated along the instantaneous frequency trajectory, but it is also not a delta function. Higher-order instantaneous frequency functions can yield quite complicated Wigner distributions, even though the conditional mean frequency always gives the instantaneous frequency function; indeed, the spread about this function can make the time-frequency presentation nebulous.

In spite of these attractive features of the Wigner distribution for signals consisting of one amplitude- and frequency-modulated sinewave, its major difficulty arises with more realistic signals, such as speech, consisting of sums of amplitude- and frequency-modulated resonances or harmonics. Let us consider a signal with two additive components:

x(t) = x1(t) + x2(t).

It is straightforward to show that the Wigner distribution of x(t) consists of the Wigner distribution of the signal components, but also some additional terms:

(11.23)

$$W(t, \Omega) = W_{11}(t, \Omega) + W_{22}(t, \Omega) + W_{12}(t, \Omega) + W_{21}(t, \Omega)$$

where

$$W_{12}(t, \Omega) = \frac{1}{2\pi} \int_{-\infty}^{\infty} x_1^*\!\left(t - \frac{\tau}{2}\right) x_2\!\left(t + \frac{\tau}{2}\right) e^{-j\Omega\tau}\, d\tau$$

and similarly for W21(t, Ω). The components W12(t, Ω) and W21(t, Ω) are cross terms we alluded to earlier. These cross terms come about due to the nonlinearity of the distribution, i.e., the signal is multiplied by itself shifted in time. One must be careful, however, in thinking that nonlinearity, and thus cross terms, are peculiar to the Wigner distribution. The spectrogram can also be considered as nonlinear in the signal x(t) due to the magnitude squaring operation. The difference, as we see in the example below, is that the cross terms introduced by the spectrogram are often hidden under the self terms, while the cross terms from the Wigner distribution are typically distinct from the self terms.

Example 11.1       An example illustrating the effect of cross terms placed between signal components in time and frequency is the sum of a complex sinewave and an impulse [10] expressed as

Image

The Wigner distribution of this signal consists of three components: an impulse in frequency (the Wigner distribution of $e^{j\Omega_0 t}$), an impulse in time (the Wigner distribution of the impulse component), and a sinusoidal cross term, i.e.,

Image

Figure 11.3a shows the Wigner distribution of the sum of the two signals. The cross-term contribution from the Wigner distribution is clearly distinct (a mild case in contrast to the cross terms from other multi-component signals). For comparison, Figure 11.3b shows the corresponding spectrogram that gives a smeared representation of the two components, in contrast to the ideal resolution from the Wigner distribution. The cross terms of the spectrogram, however, are hidden under its self-terms [10]. Image
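The mechanics of cross terms can also be seen in a small numerical sketch (the tone frequencies and discrete conventions below are assumptions for illustration, not the book's example). For two complex sinewaves at 0.3π and 0.7π rad/sample, a pseudo-Wigner slice shows, besides the two self terms, a third component midway at 0.5π that oscillates in time with the difference frequency and even drives the distribution negative:

```python
import numpy as np

def wigner_slice(x, n, M, omegas):
    """Discrete pseudo-Wigner slice W(n, w) = sum_m x[n+m] x*[n-m] e^{-j2wm}."""
    m = np.arange(-M, M + 1)
    r = x[n + m] * np.conj(x[n - m])
    return np.real(np.exp(-2j * np.outer(omegas, m)) @ r)

# Two complex sinewaves at 0.3*pi and 0.7*pi rad/sample
N = 2048
n = np.arange(N)
x = np.exp(1j * 0.3 * np.pi * n) + np.exp(1j * 0.7 * np.pi * n)

M = 200
omegas = np.array([0.3 * np.pi, 0.5 * np.pi, 0.7 * np.pi])  # self, cross, self

# The cross term sits midway between the tones and oscillates in time as
# cos((w2 - w1) n) -- here with period 5 samples, so its sign flips quickly
W_a = wigner_slice(x, 1000, M, omegas)
W_b = wigner_slice(x, 1002, M, omegas)
print(W_a)
print(W_b)
```

Note that at n = 1000 the midpoint term is comparable in size to the self terms, while two samples later it has gone strongly negative, illustrating both the prominence of Wigner cross terms and the loss of positivity.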

As with the STFT and the spectrogram (Chapter 7), the Wigner distribution is invertible, but only up to a scale factor. To show this, we can use an approach similar to that used in showing the invertibility of the STFT. Because the Wigner distribution is, with respect to τ, the Fourier transform of the product $x^*\!\left(t - \frac{\tau}{2}\right) x\!\left(t + \frac{\tau}{2}\right)$, we can invert it to obtain

$$x^*\!\left(t - \frac{\tau}{2}\right) x\!\left(t + \frac{\tau}{2}\right) = \int_{-\infty}^{\infty} W(t, \Omega)\, e^{j\Omega\tau}\, d\Omega.$$

We then evaluate this expression at time $t/2$ and set τ = t, giving

$$x(t)\, x^*(0) = \int_{-\infty}^{\infty} W\!\left(\frac{t}{2}, \Omega\right) e^{j\Omega t}\, d\Omega,$$

which is known up to the constant $x^*(0)$. Also, as with the STFT and spectrogram, not every two-dimensional function of time and frequency is a Wigner distribution [10]. In particular, when a legitimate Wigner distribution is modified in a desired way, e.g., to reduce interference or noise, the resulting time-frequency function is no longer necessarily a legitimate Wigner distribution. In this case, one typically uses least-squared-error methods for signal recovery, similar to those we developed in Chapter 7 for recovering a signal from a modified STFT or STFT magnitude [5].

An interesting property of the Wigner distribution is its relation to the spectrogram. One can show that for a window function w(t), the spectrogram is the convolution of the Wigner distribution of the signal, denoted here by Wx(t, Ω), and the Wigner distribution of the window, Ww(t, Ω) [10]:

(11.24)

$$S(t, \Omega) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} W_x(\tau, \theta)\, W_w(t - \tau, \Omega - \theta)\, d\tau\, d\theta,$$

Figure 11.3 Wigner distribution (a) and spectrogram (b) of the sum of a complex sinewave and impulse. The ideal resolution of the Wigner distribution for these components is compromised by its cross terms. The spectrogram, on the other hand, has a relative loss in resolution, but has cross terms that are hidden.

SOURCE: L. Cohen, Time-Frequency Analysis [10]. ©1995, Pearson Education, Inc. Used by permission.

Image

revealing a loss of time-frequency resolution in the spectrogram relative to that of the Wigner distribution. Nevertheless, this smoothing can have the positive effect of reducing the presence of unwanted cross terms in the Wigner distribution.

Ultimately, we want a time-frequency distribution that is useful in the visualization and estimation of speech events that are hidden in the spectrogram and scalogram energy densities. Cross terms and negativity in the Wigner distribution can prevent our reaching this objective. One approach to reduce these unwanted characteristics is to apply a window function to the product $x^*\!\left(t - \frac{\tau}{2}\right) x\!\left(t + \frac{\tau}{2}\right)$ [10]. This operation, however, also reduces time-frequency resolution. In the next section, we look at variations of the Wigner distribution that combine the desirable resolution of the Wigner distribution with the positivity and reduced cross terms of the spectrogram.
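The cross-term-reducing effect of smoothing can be checked numerically. The sketch below (assumed parameters, not from the text) averages pseudo-Wigner slices over a short time interval, a crude stand-in for the smoothing that convolution with the window's Wigner distribution performs: the oscillating cross term averages toward zero while the self term survives:

```python
import numpy as np

def wigner_slice(x, n, M, omegas):
    """Discrete pseudo-Wigner slice W(n, w) = sum_m x[n+m] x*[n-m] e^{-j2wm}."""
    m = np.arange(-M, M + 1)
    r = x[n + m] * np.conj(x[n - m])
    return np.real(np.exp(-2j * np.outer(omegas, m)) @ r)

# Two complex sinewaves whose cross term oscillates with period 5 samples
N = 4096
n = np.arange(N)
x = np.exp(1j * 0.3 * np.pi * n) + np.exp(1j * 0.7 * np.pi * n)

M = 200
omegas = np.array([0.3 * np.pi, 0.5 * np.pi])   # a self term and the cross term

raw = wigner_slice(x, 2000, M, omegas)
# Average 20 adjacent time slices: the oscillating cross term cancels,
# while the (time-invariant) self term is unaffected
smooth = np.mean([wigner_slice(x, 2000 + k, M, omegas) for k in range(20)],
                 axis=0)
print(raw, smooth)
```

This is the tradeoff stated above in miniature: the smoothing that suppresses the cross term is exactly what costs the spectrogram its resolution.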

11.3.4 Variations on the Wigner Distribution

Cohen has proposed a class of time-frequency distributions that unite the spectrogram and Wigner distributions and, in addition, provide for the creation of new distributions [10]. This broad class of time-frequency distributions is written as

(11.25)

$$C(t, \Omega) = \frac{1}{4\pi^2} \int\!\!\int\!\!\int x^*\!\left(u - \frac{\tau}{2}\right) x\!\left(u + \frac{\tau}{2}\right) k(\theta, \tau)\, e^{-j\theta t - j\tau\Omega + j\theta u}\, du\, d\tau\, d\theta,$$

which can also be written in terms of the signal spectrum, in a fashion analogous to that of the frequency version of the Wigner distribution. The two-dimensional function k(θ, τ) is referred to as the kernel. The kernel is selected to obtain certain time-frequency distributions with desired properties. For example, a kernel of unity gives the Wigner distribution, while the kernel for the spectrogram is given by4 Image, where w(t) is the analysis window [10]. The time-frequency marginals for C(t, Ω) are satisfied if (Exercise 11.14)

4 The function Image is called the ambiguity function and is often used in radar signal processing.

$$k(\theta, 0) = 1 \qquad \text{and} \qquad k(0, \tau) = 1.$$

Other conditions on the kernel guarantee properties such as total energy conservation, preservation of local and global averages, and (important for speech analysis) reduced interference from cross terms with multi-component signals [10]. For low-level cross terms in C(t, Ω), the kernel must be peaked at its origin, i.e.,

$$k(0, 0) = 1 \qquad \text{and} \qquad k(\theta, \tau) \ll 1 \quad \text{for} \quad \theta\tau \gg 0.$$

The kernel for the Wigner distribution does not satisfy this property. The kernel for the spectrogram, on the other hand, satisfies this property for particular windows; e.g., a Hamming or Gaussian window results in good cross-term behavior [10]. Conditions for weak and strong finite support, important again in the speech context, can also be imposed through the kernel [10],[26].

An important distribution derived from the above canonic formulation is the Choi-Williams distribution, which has good cross-term behavior. The kernel for the Choi-Williams distribution is given by $k(\theta, \tau) = e^{-\theta^2\tau^2/\sigma}$, where the parameter σ trades cross-term suppression against smoothing. This kernel gives not only reduced cross terms, but also a distribution that satisfies the marginals; thus, local averages result in instantaneous frequency and group delay for a single sinewave. An example provided by Williams and presented by Cohen [10], illustrating how the Choi-Williams distribution reduces cross terms relative to the Wigner distribution, is shown in Figure 11.4.
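The two kernel conditions discussed above are easy to verify directly for the Choi-Williams kernel. A minimal sketch (σ = 1.0 is an arbitrary choice for illustration):

```python
import numpy as np

def cw_kernel(theta, tau, sigma=1.0):
    """Choi-Williams kernel k(theta, tau) = exp(-theta^2 tau^2 / sigma)."""
    return np.exp(-theta**2 * tau**2 / sigma)

# Marginal-preserving conditions: k(theta, 0) = k(0, tau) = 1 along the axes
on_axes = [cw_kernel(th, 0.0) for th in (0.5, 2.0, 10.0)] + \
          [cw_kernel(0.0, ta) for ta in (0.5, 2.0, 10.0)]

# Cross-term suppression: the kernel decays rapidly away from the axes,
# which is where cross-term energy concentrates in the (theta, tau) plane
off_axis = cw_kernel(3.0, 3.0)
print(on_axes, off_axis)
```

Because the kernel is identically one along both axes, the marginals survive for any σ; shrinking σ suppresses cross terms more aggressively at the cost of more smoothing of the self terms.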

11.3.5 Application to Speech Analysis

We return now to the topic at hand: speech analysis. One motivation for investigating time-frequency distributions is to exceed the time and frequency resolution capabilities of the spectrogram for the representation of speech fine structure. In particular, we desire a distribution that can show simultaneously transient events such as plosives with good temporal resolution and tonal events such as harmonics with good frequency resolution; furthermore, we hope that such a distribution will uncover speech events not revealed by conventional approaches. The Wigner distribution and its extensions, such as the Choi-Williams distribution for reducing cross terms, are attempts to obtain this joint time-frequency resolution. A limitation of these methods is that, although they have better resolution properties than the spectrogram, they can result in negative values and residual cross terms that make their interpretation as an energy density difficult [10], particularly for speech signals with multiple components closely spaced in time and frequency.

Figure 11.4 Comparison of the (a) Wigner distribution, (b) Choi-Williams distribution, and (c) spectrogram for a signal consisting of the sum of a linear FM sinewave and a sinusoidal FM sinewave [10]. The Choi-Williams distribution suppresses cross terms of the Wigner distribution and improves on the FM tracking by the spectrogram.

SOURCE: L. Cohen, Time-Frequency Analysis [10]. ©1995, W.J. Williams. Used by permission of W.J. Williams.

Image

Nevertheless, positive time-frequency distributions have been developed that hold promise in speech representations in that they can simultaneously give good temporal and spectral resolution while maintaining many desirable properties of a proper distribution. Development of these methods is beyond our scope, so we give the reader merely a flavor of the approach. These positive time-frequency distributions have been developed by Pitton, Atlas, and Loughlin by working with a class of distributions of the form [37]:

P(t, Ω) = |x(t)|² |X(Ω)|² Λ(t, Ω, x(t)).

This relation shows that the energy density is low when the signal is low in either time or frequency, thus ensuring strong finite support in time and frequency. The function Λ(t, Ω, x(t)) is constrained to be positive and such that P(t, Ω) satisfies the marginals. Under these constraints, a function Λ(t, Ω, x(t)) is then selected that “maximizes the entropy” of the resulting distribution by using a nonlinear iterative estimation scheme [37]. Based on this formulation, Pitton, Atlas, and Loughlin have developed a number of positive time-frequency distributions that satisfy the marginals, have low cross-term interference, and give good time-frequency resolution. The next two examples show first a synthetic and then a real speech signal case.
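Before turning to those examples, the role of Λ can be seen in a toy case. This is a sketch of the formulation only, not the Pitton-Atlas-Loughlin algorithm: with the trivial (and entirely our own) choice of a constant Λ equal to the reciprocal of the signal energy, the separable distribution below is positive and satisfies both discrete marginals, yet carries no joint time-frequency structure at all, which is precisely why a signal-dependent Λ must be estimated:

```python
import numpy as np

# A windowed complex tone as the test signal (arbitrary choice)
N = 256
n = np.arange(N)
x = np.hanning(N) * np.exp(1j * 0.25 * np.pi * n)
X = np.fft.fft(x)
E = np.sum(np.abs(x) ** 2)            # signal energy

# Trivial separable positive "distribution": Lambda = 1 / (N * E)
P = np.outer(np.abs(x) ** 2, np.abs(X) ** 2) / (N * E)

time_marginal = P.sum(axis=1)         # should equal |x[n]|^2
freq_marginal = P.sum(axis=0)         # should equal |X[k]|^2 / N (Parseval)
print(np.allclose(time_marginal, np.abs(x) ** 2))
print(np.allclose(freq_marginal, np.abs(X) ** 2 / N))
```

Satisfying positivity and the marginals is therefore not enough by itself; the entropy-maximizing choice of Λ is what injects meaningful joint time-frequency information.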

Example 11.2       Consider a simulated speech resonance with linearly changing center frequency, driven by a periodic impulse train.5 Although the driving function is periodic, the signal is not periodic due to the time-varying filter coefficients. Figure 11.5 gives, for the synthetic speech signal, a comparison of a wideband spectrogram, a narrowband spectrogram, and a positive time-frequency distribution by Pitton, Atlas, and Loughlin [37]. One sees that, unlike either spectrogram, the positive time-frequency distribution shows simultaneously the harmonics of the excitation, resonances of the time-varying filter, and onset times of the excitation. An additional, interesting signal property, revealed by the positive time-frequency distribution, is noted by Pitton [37]: “… although the spectral harmonics of the positive time-frequency distribution are spaced 125 Hz about the center chirping frequency, the harmonics are not located at integer multiples of 125 Hz because the signal is not periodic and cannot be described by a Fourier series with fundamental frequency 125 Hz.” Image

5 This time-varying resonance is reminiscent of that given in Example 4.2 and Exercise 4.5, where an empty bottle is filled with running water.

Example 11.3       Figure 11.6 gives a comparison of the wideband spectrogram, narrowband spectrogram, and a positive time-frequency distribution by Pitton, Atlas, and Loughlin [37] for a real speech waveform. As with the synthetic case, one sees that, unlike either spectrogram, the positive time-frequency distribution shows simultaneously the harmonics of the excitation, resonances of the time-varying filter, and onset times of the excitation, such as at the start of plosives. Image

Positive time-frequency distributions of the type illustrated in the previous examples present some important directions for speech analysis. Nevertheless, it has been argued by Pitton, Wang, and Juang [38] that these distributions have not yet “fundamentally altered” approaches to speech processing or speech and speaker recognition tasks. The expectation here is that such time-frequency distributions would replace the spectrogram in providing a means to a clearer picture of speech production and its associated parameters. The problem, however, is that with greater resolution come greater variability of speech parameters and a larger number of parameters required in the representation. It is important in many speech processing applications that we attain a small set of reliable parameters that preserve the essential dynamics of the speech signal [38]. For example, in the speaker recognition task that we study in Chapter 14, an increase in the feature vector dimension implies the need for more training data, and greater variability with increased parameter resolution can blur speaker differences. A challenge is to reduce statistical variability while preserving time-frequency resolution. Nevertheless, the progress made in time-frequency distributions for speech analysis is an important step toward improved signal processing representations of the fine structure and dynamics of the speech waveform.

Figure 11.5 Comparison of the wideband/narrowband spectrograms and a positive time-frequency distribution for a simulated time-varying speech resonance: (a) waveform; (b) wideband spectrogram; (c) narrowband spectrogram; (d) positive time-frequency distribution. The positive time-frequency distribution reveals the harmonics moving along the path of the time-varying resonance.

SOURCE: J.W. Pitton, L.E. Atlas, and P.J. Loughlin, “Applications of Positive Time-Frequency Distributions to Speech Processing” [37]. ©1994, IEEE. Used by permission.

Image

In the second half of this chapter, we look at alternative approaches to measuring speech fine structure and dynamics. As we noted in this chapter’s introduction, these approaches have been motivated by a nonlinear, aeroacoustic view of speech production. We therefore take a short but fascinating and important detour before returning to methods of high-resolution speech analysis.

Figure 11.6 Comparison of the wideband/narrowband spectrograms and a positive time-frequency distribution for the speech utterance, “… credit card, and we use that …” from a female speaker: (a) wideband spectrogram; (b) narrowband spectrogram; (c) positive time-frequency distribution. The positive time-frequency distribution simultaneously resolves the plosives (e.g., at 5.3 s), harmonics, and the time-varying formants (e.g., at 5.8 s).

SOURCE: J.W. Pitton, L.E. Atlas, and P.J. Loughlin, “Applications of Positive Time-Frequency Distributions to Speech Processing” [37]. ©1994, IEEE. Used by permission.

Image

11.4 Aeroacoustic Flow in the Vocal Tract

The classical approach in speech science has been to reduce the problem of sound generation to one of purely acoustic motion where still air in the vocal tract is perturbed only by small one-dimensional planar compressions and expansions that comprise the sound field. In Chapter 4, this assumption led to the linear wave equation for describing the pressure/volume velocity relation within and at the input and output of the vocal tract. In this model, the source is independent of the vocal tract system. We have referred to this as the linear source/filter theory. While these approximations have allowed a great deal of progress to be made in understanding how speech sounds are produced and how to analyze, modify, synthesize, and recognize sounds, they have also led to limitations. In reality, acoustic motion is not the only kind of air motion involved. The air in the vocal system is not static, but moves from the lungs out of the mouth, carrying the sound field along with it, i.e., it contains a nonacoustic component. This nonacoustic phenomenon, a departure from the linear source/filter theory, can affect the fine structure in the speech waveform and thus how speech is processed.

In this section, we outline some of the history and current thinking on aeroacoustic contributions to sound generation from the vocal tract, i.e., contributions to sound due to nonacoustic fluid motion. We noted in this chapter’s introduction that these contributions may take the form of secondary sources and modulations that arise from vortices traveling along the vocal tract. The multiple sources we refer to here are not a result of the conventional linear paradigm as with, for example, voiced fricatives, where a periodic glottal source occurs simultaneously with a noise source at a vocal tract constriction. Rather, we refer to distributed sources due to a nonlinear conversion of kinetic energy in nonacoustic fluid motion into propagating sound. Likewise, the modulations we refer to are not a result of the nonlinear coupling between the glottal source and the vocal tract system that we described in Chapter 5, but rather are due to aeroacoustic phenomena.

11.4.1 Preliminaries

Conventional theories of speech production are based on a linearization of pressure and volume velocity relations. Furthermore, these variables are assumed constant within a given cross-section of the vocal tract, i.e., a one-dimensional planar wave assumption. The linear assumption neglects the influence of any nonacoustic motion of the fluid medium, or “flow,” and leads to the classical wave equation for describing the pressure/volume velocity relation. In this model, the output acoustic pressure wave at the lips is due solely to energy from an injection of air mass at the glottis. It is known that in this process, only a small fraction (on the order of 1%) of the kinetic energy in the flow at the glottis6 is converted to acoustic energy propagated by compression and rarefaction waves [15],[19]. The vocal tract acts as a passive acoustic filter, selectively amplifying some bands while attenuating others.

6 Thus far, we have used the term “flow” or “airflow,” e.g., the glottal flow waveform, in an acoustic context of local velocity perturbations. In this chapter, the term “flow” takes on a different meaning when used in a nonacoustic context of net movement of air.

This linear, one-dimensional acoustic model has been suggested to be too tightly constrained to accurately model many characteristics of human speech. In particular, there is an increasing collection of evidence suggesting that nonacoustic fluid motion can significantly influence the sound field. For example, measurements by the Teagers [46],[47],[48] reveal the presence of separated flow within the vocal tract. Separated flow occurs when a region of fast-moving fluid—a jet—detaches from regions of relatively stagnant fluid. When this occurs, viscous forces (neglected by linear models) create a tendency for the fluid to “roll up” into rotational fluid structures commonly referred to as vortices (Figure 11.7). The vortices can convect downstream at speeds much slower (90% slower or more) than acoustic propagation speed. Jet flow and associated vortices thus fall in the category of nonacoustic behavior.

There has been much effort in the confirmation of the existence and importance of these vortices for sound generation, encompassing theories, measurements, mechanical models, numerical solutions to the Navier Stokes equations (describing the complete nonlinear fluid dynamics), and more practical computational models [2],[15],[19],[33],[44],[48],[49]. Here we focus on three representative works: (1) the original motivating measurements of the Teagers [47] and their insightful interpretation by Kaiser [19], (2) the mechanical model of Barney, Shadle, and Davies [2],[43], and (3) the computational model of Sinder [44]. In all three cases, speech spectral measurements are explained by the presence of both an acoustic source and a source generated from nonacoustic motion.

Figure 11.7 Jet flow through the glottis and the formation of a sequence of vortices.

Image

11.4.2 Early Measurements and Hypotheses of Aeroacoustic Flow in the Vocal Tract

The Teagers [47] speculated on the presence of nonacoustic phenomena in the vocal tract, with a “flow not traveling by compression.” They supported this speculation with extensive measurements made in the oral cavity with a hot-wire anemometer (a device that measures airflow velocity from changes in temperature at the surface of a material). With a steadily held voiced sound, it was found that the airflow velocity is not uniform across the cross section. In one experiment, the flow velocity at the tongue differed significantly from the flow velocity at the roof of the mouth (the palate) at the same location along the oral tract. Flow velocity measurements in this experiment were made simultaneously using an array of hot-wire anemometers. The measurements also indicated that the jet from the glottis attached itself to the cavity walls, switching back and forth (“flapping”) between the tongue and the palate at approximately the first formant frequency. With hot-wire anemometers that could measure normal flows, as well as radial and axial flows, the Teagers also observed vortices of both an axial and a radial type. These measurements are clearly inconsistent with the planar one-dimensional acoustic flow velocity predicted from the linear wave equation. A schematic comparison of the acoustic and nonacoustic models is illustrated in Figure 11.8. Note that the hot-wire anemometer measurement captures total flow velocity. Therefore, because the acoustic velocity is very small relative to the nonacoustic flow velocity, the Teagers’ measurements are probably dominated by the nonacoustic contribution. An implication is that the measurements do not rule out acoustic propagation and nonacoustic flow being present simultaneously. We return to this paradigm in our discussion of the work of Barney, Shadle, and Davies [2].

In order to understand where a discrepancy from conventional linear theory might arise, we return to our derivation of the wave equation in Chapter 4. The reader may recall a sleight-of-hand in arriving at the final form in Equation (4.1); the total derivative $\frac{d\upsilon}{dt}$ in Newton’s 2nd Law was replaced with the partial derivative $\frac{\partial \upsilon}{\partial t}$. Because the particle velocity υ(x, t) is a function of space x as well as time t, we noted that the true acceleration of the air particles is given by [34],[48]

Figure 11.8 Comparison of (a) acoustic and (b) aeroacoustic models of speech production.

SOURCE: J.F. Kaiser, “Some Observations on Vocal Tract Operation from a Fluid Flow Point of View” [19]. ©1983, The Denver Center for the Performing Arts, Inc. Used by permission.

Image

$$\frac{d\upsilon}{dt} = \frac{\partial \upsilon}{\partial t} + \upsilon \frac{\partial \upsilon}{\partial x}$$

and therefore Equation (4.1) is correctly written as

(11.26)

$$\frac{\partial p}{\partial x} = -\rho \left( \frac{\partial \upsilon}{\partial t} + \upsilon \frac{\partial \upsilon}{\partial x} \right),$$

which is a nonlinear equation because the fluid particle velocity υ multiplies $\frac{\partial \upsilon}{\partial x}$, consequently making it difficult to determine a general solution. The linear approximation to Equation (11.26) is accurate when the correction term $\upsilon \frac{\partial \upsilon}{\partial x}$ is small relative to $\frac{\partial \upsilon}{\partial t}$. For sound waves, the correction term vanishes because there is no net flow, only very small velocity changes from local compressions and rarefactions. When the air is moving as a fast-moving jet, however, the air velocity may not be small and thus the nonlinear correction term can grow in importance [34]. This is one of numerous contributions that would arise from solution of the nonlinear Navier Stokes equation that describes the full complexity of velocity/pressure relations in the vocal tract and glottis.
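A back-of-envelope comparison makes the scaling concrete. For a disturbance of radian frequency ω and velocity amplitude U, the local acceleration scales as ωU while the convective correction scales as U²(ω/c), so their ratio is the Mach number U/c. The velocity values below are illustrative assumptions, not measurements from the text:

```python
# Ratio of the convective correction u * du/dx to the local acceleration
# du/dt is, in order of magnitude, the Mach number U/c (illustrative numbers)
c = 343.0            # speed of sound in air, m/s

U_acoustic = 0.01    # assumed acoustic particle velocity, m/s (ordinary speech)
U_jet = 30.0         # assumed glottal jet speed, m/s

ratio_acoustic = U_acoustic / c   # tiny: linearization is excellent
ratio_jet = U_jet / c             # nearly 0.1: correction no longer negligible
print(ratio_acoustic, ratio_jet)
```

Under these assumed numbers the correction term is roughly three orders of magnitude more important for the glottal jet than for the acoustic field, consistent with the argument above.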

The Teagers studied a variety of vowel sounds and found a unique flow pattern for each sound [46],[47],[48]. These flow patterns included a jet attached along the inside wall of the vocal tract and vortices “entrapped” in small cavities along the vocal tract such as due to the epiglottis (false vocal folds) (Figure 11.8b). They called the latter pattern the “whistle” because the jet velocity was observed to be modulated by the vortex that expands and contracts within the small cavity.7 Another observed pattern is a fast-moving jet with large enough flow velocity to cause a train of vortices to be shed downstream (as we saw in Figure 11.7). The Teagers suggested that the presence of these traveling “smoke ring” vortices could result in additional acoustic sources throughout the vocal tract through the conversion of the nonacoustic flow to an acoustic propagation. Yet another observed pattern is a jet flapping across wall cavities, as noted earlier. For the various sounds, the Teagers found combinations of these flow patterns and a variety of interactions between them.

7 In the policeman’s whistle, a jet is deflected into the whistle cavity and a vortex builds up inside the cavity. This vortex amplitude modulates the jet at a rate determined by the vortex’s precession rate.

Motivated by the measurements of the Teagers, Kaiser re-examined the source/filter theory in light of aeroacoustic theories, giving further credence to a “jet-cavity flow paradigm” in the vocal tract [19]. Kaiser hypothesized that the interaction of the jet flow and vortices with the vocal tract cavity is responsible for much of the speech fine structure, particularly at high formant frequencies. This contribution is in addition to the glottal flow associated with the conventional linear source/filter theory, and the fine structure from the jet-cavity interactions is time-synchronized with this bulk glottal flow. He also hypothesized that the relative timing of the different flow interaction events and their effect on instantaneous formant frequencies within a glottal cycle are more important in speech processing and recognition than absolute spectral shape. Furthermore, these complex interactions may cause the appearance and disappearance of higher formants in the spectra over successive glottal cycles, with different excitations exciting different formants. Kaiser also outlined tasks needed to take the jet-cavity paradigm from the qualitative to the quantitative and functional realm, including flow measurements in mechanical models of the vocal tract, a mathematical model of acoustic wave generation from energy exchange in flow interaction, and computational models. Finally, he proposed the need for time-frequency analysis methods with greater resolution than the STFT for measuring fine structure within a glottal cycle due to the jet-cavity interaction paradigm, especially rapidly appearing and disappearing frequency components; the methods ideally should be insensitive to nonlinear transformations of the speech spectrum.8

8 Kaiser argues that because speech that is passed through a no-memory nonlinearity loses little intelligibility, the timing of jet-cavity interaction events and instantaneous formant frequencies may be more important than the absolute spectral shape [19].

Since the early work of the Teagers and Kaiser, aeroacousticians such as McGowan [33] and Hirschberg [15] have examined more theoretical aspects of the nonacoustic velocity component within the vocal tract that results in acoustic sound sources through interaction with the vocal tract boundaries at area discontinuities. In addition, two-dimensional numerical simulations of the Navier Stokes equation describing fluid flow in the vocal tract with appropriate boundary conditions further suggest the existence of an unsteady jet at the glottal exit, from which shed vortices can evolve downstream from the glottis [23],[49]. More recently, Barney, Shadle, and Davies [2] have taken a different approach using a dynamical mechanical model, and Sinder [44] has developed computational models of unvoiced speech, both showing that a significant and measurable portion of sound at the lips can be explained by the presence of nonacoustic vortical flow. We now briefly describe these two developments.

11.4.3 Aeroacoustic Mechanical Model

Barney, Shadle, and Davies [2] have measured the acoustic contribution from vortices using a dynamical mechanical model (DMM) with simple geometry consistent with typical average male vocal tract measurements for voiced speech. The model consists of a duct 28 cm in length and 1.7 cm in diameter. A vibration generator with moving shutters provided a sinusoidal driving function at 80 Hz that modulated three volume velocities of 200, 300, and 400 cm³/s at the entrance to the duct.9

9 The self-noise component of the mechanism due to the shutter movement was estimated by measuring sound at the duct exit without airflow at the shutter entrance. With the proper relative waveform phase, this estimate was then subtracted from measurements.

Using the linear acoustic theory of Chapter 4, Barney, Shadle, and Davies predicted the acoustic spectral power at a 45° angle and 60 cm from the duct exit for all three flow velocities. The prediction also invoked a model of a piston in an infinite baffle (Chapter 5) and accounted for a directivity factor because of the 45° off-axis measurement. In addition, a spectral measurement of the duct output pressure, made with a pressure-sensitive electret microphone, was obtained by averaging 13 short-time power spectra computed with a 4096-point FFT and a Hanning window with 15% overlap. Spectral predictions and measurements were plotted at harmonics of the shutter speed. The result is shown in Figure 11.9a. Except for the 4th, 5th, and 6th harmonics, the measured acoustic power is underpredicted by at least 10 dB, indicating the presence of a nonacoustic contribution.
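The spectral averaging procedure described above can be sketched in code. The FFT length, number of averages, window, and overlap follow the text; the 80-Hz harmonic series standing in for the duct pressure signal is purely illustrative:

```python
import numpy as np

def averaged_power_spectrum(x, fs, nfft=4096, n_avg=13, overlap=0.15):
    """Average n_avg short-time power spectra (Hanning window, 15% overlap)."""
    hop = int(round(nfft * (1.0 - overlap)))   # step between window starts
    win = np.hanning(nfft)
    acc = np.zeros(nfft // 2 + 1)
    for k in range(n_avg):
        seg = x[k * hop : k * hop + nfft] * win
        acc += np.abs(np.fft.rfft(seg)) ** 2   # short-time power spectrum
    freqs = np.arange(nfft // 2 + 1) * fs / nfft
    return freqs, acc / n_avg

# Illustrative stand-in for the duct pressure: an 80-Hz harmonic series in weak noise
rng = np.random.default_rng(0)
fs = 48000
t = np.arange(60000) / fs
x = sum(np.cos(2 * np.pi * 80 * h * t) / h for h in range(1, 6))
x = x + 0.01 * rng.standard_normal(len(t))

freqs, S = averaged_power_spectrum(x, fs)
```

The averaging trades frequency detail for a lower-variance estimate of the power at each shutter harmonic, which is what the prediction is compared against.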

To explore this inconsistency, Barney, Shadle, and Davies made hot-wire anemometer velocity flow measurements in the duct. Hot-wire anemometers were inserted along the duct through a small slot. Their measurements revealed a jet flow at the opening with shear layers that roll up to form a succession of regularly spaced vortices downstream from the shutter opening. The flow velocity measurements indicated symmetric “smoke ring” vortices traveling down the duct from about 4 cm downstream of the moving shutters. It was also found that the mean velocity was never zero, even when the shutters were closed. They concluded that the velocity field in the duct consists of an acoustic standing wave due to the fluctuating mass at the exit to the shutters and a train of vortices, one generated during each shutter cycle, drifting down the duct at the mean flow speed. A model for the total velocity disturbance in the duct is therefore given by

υ(t) = ῡ + υa(t) + υn(t)

where ῡ is the mean velocity of the air particles, υa(t) is the acoustic particle velocity, and υn(t) is the nonacoustic particle velocity due to the vortex transport. The number of vortices in the duct can be calculated at any particular time from the shutter speed and mean flow velocity (Exercise 11.17).
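Under the model of one vortex shed per shutter cycle and drifting at the mean flow speed, the instantaneous vortex count follows directly from the duct geometry. The calculation below uses the DMM dimensions and the 300 cm³/s case from the text; it is an illustrative sketch in the spirit of Exercise 11.17, not the exercise's official solution:

```python
import math

# DMM parameters taken from the text
L = 28.0          # duct length, cm
d = 1.7           # duct diameter, cm
f_shutter = 80.0  # shutter (glottal) frequency, Hz
U = 300.0         # mean volume velocity, cm^3/s

area = math.pi * (d / 2.0) ** 2   # duct cross-sectional area, cm^2
v_mean = U / area                 # mean flow speed, cm/s (~132 cm/s)
spacing = v_mean / f_shutter      # one vortex per cycle -> spacing between vortices, cm
n_vortices = L / spacing          # vortices present in the duct at any instant
```

With these numbers the vortices are spaced roughly 1.7 cm apart, so on the order of seventeen are in flight in the duct at once.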

Figure 11.9 Comparison of far-field acoustic spectral power measurements (60 cm from the duct exit plane at a 45° angle to the duct center axis) and predictions for three volume velocity inputs: (a) acoustic model; (b) combined acoustic and nonacoustic models. The solid line represents the measurement and the dashed line represents the prediction. SPL denotes the sound pressure level with respect to the ambient pressure.

SOURCE: A. Barney, C.H. Shadle, and P.O.A.L. Davies, “Fluid Flow in a Dynamical Mechanical Model of the Vocal Folds and Tract, 1: Measurements and Theory” [2]. ©1999, Acoustical Society of America. Used by permission.

Image

Although vortices themselves are not effective sound radiators, when interacting with changing solid boundaries, they can provide significant acoustic sources that can excite resonators. For the DMM, the open end of the duct provides such a boundary; the nonacoustic flow is transferred to pressure and velocity fluctuations that can generate radiated sound at the duct exit. Davies has developed a model by which this transfer can occur and has found an expression for the generated acoustic power associated with the vortices under the infinite baffle assumption at the duct output [12]. This model was used to predict the spectral component from nonacoustic sources outside the DMM at the 45° angle and 60 cm from the duct exit for the three flow velocities given above. Under the assumption that the acoustic and nonacoustic contributions are independent, the sum of the two terms far more accurately predicts the sound pressure level of each harmonic than the acoustic contribution alone (Figure 11.9b). Discrepancies may be due to certain simplifying assumptions including a symmetric jet from the duct opening (thus not allowing wall attachment of the jet), the nondispersive nature of the vortices, and independence of the acoustic and nonacoustic components.

The question arises whether the DMM extends to actual vocal tract geometries, i.e., whether vortices exist in the vocal tract, and, if so, whether the vortical train provides additional acoustic sources at boundary discontinuities. To answer this question, Shadle, Barney, and Davies [43] compared hot-wire anemometer measurements for the DMM with those of an actual vocal tract during phonation of the vowel /A/. The vowel /A/ was selected because it is most similar to the geometry of the DMM. Velocity field measurements in the vocal tract within a glottal cycle were found to be similar in shape to, but often different in magnitude from, those of the DMM. The flow velocity typically consisted of a set of primary pulses at the pitch rate and secondary pulses between the primary pulses. Differences from the vocal tract were probably due to the sinusoidal shutter movement, in contrast to the skewed sawtooth-like glottal flow of humans with unequal open and closed phases, and to the interaction of the vortex train with a multitude of tract boundaries, unlike the single DMM duct exit. From the similarities and explained differences, it was concluded that the DMM acts qualitatively like a voiced vocal tract and thus suggests sound generation by a vortex train in actual vocal tracts. As with the DMM, the vortex train interacts with boundaries of the vocal cavity, producing acoustic sources.

Shadle, Barney, and Davies [43] also investigated implications of the presence of the nonacoustic, vortical source for glottal waveform estimation by inverse filtering, and noted inconsistencies with inverse filtering based on the conventional linear source/filter theory. For example, at the shutter opening different glottal flow waveforms are measured for different hot-wire positions within the cross-sectional plane. Perhaps the most serious inconsistency, however, is that multiple acoustic and nonacoustic sources may exist in the vocal tract cavity, not all located at the opening of the glottis, but where the vortical train meets boundary discontinuities. Acoustic and nonacoustic sources have different travel times and excite different regions of the vocal tract cavity, i.e., different sources can excite different formants, thus extending the notion of a single inverse filter. This notion is consistent with Kaiser’s speculation of formants that disappear and reappear over successive glottal cycles [19]. The following example illustrates a speech measurement consistent with this phenomenon, but also shows that care must be taken in how we interpret our observations:

Example 11.4       Multi-source, multi-response models of production within a glottal cycle suggest as a simplification a convolutional model of speech at the lip output. This model involves more than one source with corresponding different resonant system functions. We have observed a variety of speech signals where different vocal tract impulse responses appear within a glottal cycle. It is easiest to see these events for very low-pitched speakers. An example of this phenomenon is shown in Figure 11.10a, which shows secondary responses about two thirds into the glottal cycle. Figure 11.10b zooms in on one period, illustrating that the secondary response is different in shape from the primary response, a property further confirmed by the spectra in Figure 11.10c. A 14th-order all-pole spectral fit (obtained using the autocorrelation method of linear prediction) is shown for each response. We see that, in the secondary component, the first formant (as well as the last formant) is noticeably absent, and so the two responses have different spectral characteristics. We can also see this property in time in the isolated glottal cycle (Figure 11.10b) where the low- and high-frequency components of the primary response are not present in the secondary response. Without further analysis, however, we cannot definitively give the cause of this change; different formants may arise, for example, because the secondary glottal source has a different shape from the primary pulse or because a secondary flow-induced source occurs downstream from the glottis as, for example, when a sound source is created by energy exchange from a vortex at a location along the vocal tract such as at the epiglottis. 
An important consequence of this difference in primary and secondary components is that the low-order autocorrelation coefficients used typically in speech analysis, e.g., in linear prediction analysis, consist of the sum of the autocorrelation coefficients from two different impulse responses (Exercise 11.18). Image

Figure 11.10 Example of secondary pulses with different spectrum from that of the primary pulses: (a) speech waveform; (b) one glottal cycle; (c) spectra (14th-order all-pole fit) of primary and secondary responses.

Image

We seek signal processing tools that can reveal the type of fine structure in the previous example, even when such fine structure is hidden, for example, when secondary sources occur with high-pitched speakers. Ideally, our signal processing tools should also help in the very difficult task of distinguishing between acoustic and nonacoustic sources.

11.4.4 Aeroacoustic Computational Model

A functional model that incorporates the nonlinear effects of nonacoustic motion allows for improved prediction and analysis of speech signals. In this paradigm, vortices in the flow lead to additional acoustic sources within the vocal tract. Speech output at the lips is computed as the sum of vocal-tract-filtered responses to the glottal source and aeroacoustic flow sources. We have seen that, although vortices themselves are not efficient sound radiators, when interacting with a non-uniform vocal tract they can be a source of sound. In this manner the vocal tract resonator can be excited, for example, at the epiglottis, velum, palate, teeth, and lips, as well as at other locations with a changing cross section. A computational model requires that we know the spectrum, strength (gain), and location of these secondary sound sources.

Sinder [44] has applied aeroacoustic theory to produce a computationally efficient model of vortex sound generation. In this model, the characteristics of the source—spectrum, strength, and spatial distribution—are automatically computed as a function of vocal tract shape and lung pressure. The essential result from aeroacoustic theory incorporated into this work is that of Howe [16]. Howe’s result relates the motion of vortices through a duct of changing cross-section to the plane wave acoustic sound field generated by that motion. This relation is used to compute the strength and location of acoustic pressure sources in the duct due to traveling vortices generated by jet flow. The model is applicable to either voiced or unvoiced speech with vortices originating from a jet formed at or downstream from the glottis. The computational model used by Sinder determines automatically the location of jet formation (i.e., flow separation boundaries), the time of vortex creation, and vortex motion based simply upon the vocal tract geometry and the local air flow speed. For voiced speech, the jet is created at the glottal exit and can contribute a distributed aspiration source during voicing. For unvoiced fricatives, the jet forms near the tongue-palate constriction. For unvoiced plosives, a glottal jet follows, in time, a jet formed at a tongue-palate constriction, thus allowing for the aspiration component of the plosive. A jet is formed only when local particle velocity exceeds a threshold. For voiced fricatives and plosives, jet formation occurs in a way similar to that of unvoiced sounds, except that in these cases the valving of the glottis modulates the jet at a constriction and thus modulates the vortex formation; the noise source due to the vortices is therefore pitch synchronous. 
Sinder has applied this computational model for improved distributed source creation of unvoiced sounds in speech synthesis [44] for which point (non-distributed) sources are conventionally used.

11.5 Instantaneous Teager Energy Operator

We have noted that speech fine structure, as, for example, secondary sources and modulations within a glottal cycle, is often not “seen” by conventional Fourier analysis methods that look over multiple glottal cycles. This limitation motivated the Teagers to propose an operator characterized by a time resolution that can track rapid signal energy changes within a glottal cycle [20],[21]. The Teager energy operator is based on a definition of energy that accounts for the energy in the system that generated the signal. An extension of this operator also allows for a simple algorithm for separating sinewave amplitude modulation (AM) and frequency modulation (FM) components from the Teager energy with excellent time resolution [29],[30],[31]. We refer to this algorithm as the energy separation algorithm and show its application to estimating AM and FM in speech formants.

11.5.1 Motivation

As a means to motivate the Teager energy operator and the energy separation algorithm for AM-FM estimation, we first consider the different perspective on energy introduced by the Teagers and extended by Kaiser [20],[21]. In the conventional view of “energy,” we compute the sum of squared signal elements, i.e., an average of energy density. This means that tones at 10 Hz and at 1000 Hz with the same amplitude have the same energy. Teager observed, however, that the energy required to generate the signal at 1000 Hz is much greater than that at 10 Hz. This alternative notion of energy can be understood with the sinusoidal oscillation that occurs with a simple harmonic oscillator. In this case, the total energy in the system, i.e., the sum of the potential and kinetic energy, is proportional to the product of frequency and amplitude squared.

For a harmonic oscillator, applying Newton’s law of motion to a mass m suspended by a spring with force constant k gives the 2nd-order differential equation

m d²x(t)/dt² + kx(t) = 0

whose solution is the harmonic motion x(t) = A cos(Ωt + φ) with Ω = √(k/m), where A is the amplitude and Ω the frequency of oscillation. The total energy is the sum of the potential energy (in the spring) and the kinetic energy (of the mass) and is

(11.27)

E = (1/2)kx²(t) + (1/2)m[dx(t)/dt]²

Substituting for x(t) our harmonic solution, we obtain

E = (1/2)kA² cos²(Ωt + φ) + (1/2)mΩ²A² sin²(Ωt + φ) = (1/2)mΩ²A²

or

E ∝ A²Ω²

which is the “true” energy in the harmonic system, i.e., the energy required to generate the signal.
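This proportionality is easy to verify numerically: the sum of potential and kinetic energy is constant in time and equals (1/2)mΩ²A². A minimal check with illustrative mass and spring values:

```python
import numpy as np

m, k = 0.5, 200.0          # mass (kg) and spring constant (N/m); illustrative values
Omega = np.sqrt(k / m)     # oscillation frequency, rad/s
A = 0.01                   # amplitude, m

t = np.linspace(0.0, 1.0, 10000)
x = A * np.cos(Omega * t)            # harmonic solution
v = -A * Omega * np.sin(Omega * t)   # dx/dt

# potential (spring) + kinetic (mass) energy at every instant
E = 0.5 * k * x**2 + 0.5 * m * v**2
# E is constant and equals (1/2) m A^2 Omega^2, i.e., proportional to A^2 Omega^2
```

Doubling the frequency at fixed amplitude quadruples E, which is precisely the intuition behind the Teager notion of energy.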

11.5.2 Energy Measurement

Consider the energy operator introduced by Kaiser [21] and defined by

(11.28)

Ψc[x(t)] = [ẋ(t)]² − x(t)ẍ(t)

where ẋ(t) = dx(t)/dt and ẍ(t) = d²x(t)/dt². When Ψc is applied to signals produced by simple harmonic oscillators, without requiring the values of the mass m and spring constant k, the operator can track the oscillator’s energy (per half unit mass), which is equal to the squared product of the oscillation amplitude and frequency (Exercise 11.25):

Ψc[x(t)] = A²Ω²

for a signal of the form

x(t) = A cos(Ωt + θ).

To discretize this continuous-time operator, we can select a variety of sampling strategies. One possibility is to replace t with nT (T is the sampling period), x(t) by x(nT), or simply x[n], ẋ(t) by its first backward difference (x[n] − x[n − 1])/T, and ẍ(t) by the backward difference applied twice, (x[n] − 2x[n − 1] + x[n − 2])/T². Then, we have (Exercise 11.25)

(11.29)

Ψc[x(t)]|t=nT ≈ (1/T²)Ψd(x[n − 1])

where Ψd(x[n]) is the discrete-time energy operator (the counterpart to the continuous-time energy operator Ψc[x(t)]) defined as

(11.30)

Ψd(x[n]) = x²[n] − x[n − 1]x[n + 1]

for discrete-time signals x[n]. Analogous to the case of a continuous-time sinewave, we can show (Exercise 11.25)

Ψd(x[n]) = A² sin²(ω)

for a signal of the form

x[n] = A cos(ωn + θ).

This result of applying Ψd(x[n]) to a discrete-time sinewave is exact, provided that 0 < ω < π/2. Furthermore, for small values of ω, because sin(ω) ≈ ω, it follows that

Ψd(x[n]) = x²[n] − x[n − 1]x[n + 1] ≈ A²ω²

and so, as with the continuous-time operator, the discrete-time operator can track the energy in a discrete-time sinewave. These energy operators were developed by the Teagers in their work on modeling speech production [47],[48] and were first introduced by Kaiser [20],[21].
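The operator of Equation (11.30) is a three-sample computation, and its behavior on a sinewave can be checked directly; the parameter values below are arbitrary:

```python
import numpy as np

def teager(x):
    """Discrete Teager energy: Psi(x[n]) = x[n]^2 - x[n-1]*x[n+1]."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

A, omega, theta = 2.0, 0.3, 0.7        # arbitrary sinewave parameters
n = np.arange(200)
x = A * np.cos(omega * n + theta)

psi = teager(x)
# every sample of psi equals A^2 sin^2(omega); for small omega this is ~ A^2 omega^2
```

Note that the phase theta drops out entirely, so the operator returns a constant for a fixed-amplitude, fixed-frequency sinewave.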

The energy operator is also very useful for analyzing oscillatory signals with time-varying amplitude and frequency. In continuous time, such real-valued AM-FM signals of the form

(11.31)

x(t) = a(t) cos[φ(t)] = a(t) cos[Ωct + Ωm∫₀ᵗ q(τ)dτ + θ]

can model time-varying amplitude and frequency patterns in speech resonances [29]. The signal x(t) is a cosine of carrier frequency Ωc with a time-varying amplitude signal a(t) and a time-varying instantaneous frequency signal

Ωi(t) = dφ(t)/dt = Ωc + Ωmq(t)

where |q(t)| ≤ 1, Ωm ∈ [0, Ωc] is the maximum frequency deviation, and θ is a constant phase offset. Specifically, it has been shown [29],[30] that Ψc applied to an AM-FM signal can approximately estimate the squared product of the amplitude a(t) and instantaneous frequency Ωi(t) signals, i.e.,

(11.32)

Ψc[x(t)] ≈ a²(t)Ωi²(t)

assuming that the signals a(t) and Ωi(t) do not vary too fast (time rate of change of value) or too greatly (range of value) in time compared to the carrier frequency Ωc. Likewise, a discrete-time version of this result can be obtained using the discrete-time energy operator Ψd(x[n]) = x2[n] − x[n − 1]x[n + 1]. Specifically, consider an AM-FM discrete-time sinewave of the form

(11.33)

x[n] = a[n] cos[ωcn + ωmΣₖ₌₀ⁿ q[k] + θ]

with instantaneous frequency

ωi[n] = ωc + ωmq[n]

where |q[n]| ≤ 1, ωm ∈ [0, ωc] is the maximum frequency deviation, and θ is a constant phase offset. Note that the continuous-time frequencies Ωc, Ωm, and Ωi have been replaced by their discrete-time counterparts ωc, ωm, and ωi. All discrete-time frequencies are assumed to be in the interval [0, π]. Then, we can show [29],[30]

Ψd(x[n]) ≈ a²[n] sin²(ωi[n])

again provided that the signals a[n] and ωi[n] do not vary too fast (time rate of change of value) or too greatly (range of value) in time compared to the carrier frequency ωc. More quantitative constraints for these approximations to hold in both continuous- and discrete-time can be found in [29],[30]. Specific cases of an exponentially decaying sinewave and a linear chirp signal are explored in Exercise 11.25.
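A numerical sketch of this tracking property, with illustrative slowly varying AM and FM signals (the discrete phase is accumulated as a running sum of q[n]):

```python
import numpy as np

def teager(x):
    """Discrete Teager energy operator."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

n = np.arange(2000)
wc, wm = 0.4, 0.04                    # carrier and max deviation (wm << wc)
a = 1.0 + 0.2 * np.cos(0.005 * n)     # slow amplitude modulation
q = np.cos(0.003 * n)                 # slow frequency-modulating signal
wi = wc + wm * q                      # instantaneous frequency, rad/sample

phase = wc * n + wm * np.cumsum(q)    # running sum plays the role of the integral
x = a * np.cos(phase)

psi = teager(x)
target = (a**2 * np.sin(wi)**2)[1:-1]
# psi closely tracks a^2[n] sin^2(w_i[n]) because the AM and FM are slow relative to wc
```

If the modulation rates are pushed toward the carrier frequency, the approximation visibly degrades, which is exactly the constraint stated above.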

As with other time-frequency representations we have studied, the energy operator warrants a close look with the presence of more than one sinewave component. Consider, for example, a sequence with two components

x[n] = x1[n] + x2[n]

The output of the Teager energy operator can be shown to be (Exercise 11.21)

(11.34)

Ψd(x[n]) = Ψd(x1[n]) + Ψd(x2[n]) + Ψ̃d(x1[n], x2[n]) + Ψ̃d(x2[n], x1[n])

where Ψ̃d(x1[n], x2[n]) = x1[n]x2[n] − x1[n − 1]x2[n + 1] and Ψ̃d(x2[n], x1[n]) = x2[n]x1[n] − x2[n − 1]x1[n + 1] are “cross Teager energies” which can be thought of as cross-correlation-type terms between the two sequences. The following example by Kaiser [20] gives some insight into the use of the Teager energy on a two-component signal:

Example 11.5       Consider a discrete-time signal consisting of two sinewaves

x[n] = A1 cos(ω1n + θ1) + A2 cos(ω2n + θ2).

For this case, it can be shown that the Teager energy is the sum not only of the individual energy components but also an additional cross term at the difference frequency of the two frequency components (Exercise 11.21). Figure 11.11 illustrates the energy measure for the two-component signal where the effect of the cross term is similar to that of beating between two sinewaves. For this example Image, and Image. Kaiser observes [20] “… the output of the algorithm is maximum when the composite signal is at its peak and minimum when at zero; it is as if the algorithm is able to extract the envelope function of the signal.” Image

Figure 11.11 Teager energy measure for the two-component signal of Example 11.5: (a) sum of two sinewaves; (b) Teager energy of (a).

SOURCE: J.F. Kaiser, “On a Simple Algorithm to Calculate the ‘Energy’ of a Signal,” [20]. ©1990, IEEE. Used by permission.

Image

We see then that the cross-correlation may contain useful information (as with other time-frequency representations). Nevertheless, we must separate the components prior to using the energy operator when only one component is of interest.
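The beating behavior of the cross term can be reproduced with a short sketch. The two frequencies below are illustrative, not the (unstated) values used in Figure 11.11:

```python
import numpy as np

def teager(x):
    """Discrete Teager energy operator."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

n = np.arange(4000)
w1, w2 = 0.30, 0.36                  # two close frequencies -> visible "beating"
x1 = np.cos(w1 * n)
x2 = np.cos(w2 * n)

psi_sum = teager(x1 + x2)
cross = psi_sum - teager(x1) - teager(x2)   # the cross-energy terms of Eq. (11.34)
# 'cross' is dominated by an oscillation at the difference frequency w2 - w1
```

A small component at the sum frequency is also present, but for closely spaced sinewaves the difference-frequency term dominates, producing the envelope-like behavior Kaiser describes.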

An important property of the Teager energy operator in discrete time, Equation (11.30), is that it is nearly instantaneous given that only three samples are required in the energy computation at each time instant: x[n − 1], x[n], and x[n + 1]. This excellent time resolution provided the Teagers with the capability to capture energy fluctuations (in the sense of squared product of amplitude and frequency) within a glottal cycle. In the context of the above harmonic oscillator formulation, the Teagers loosely referred to a “resonance” as an oscillator system formed by local cavities of the vocal tract emphasizing certain frequencies and de-emphasizing others during speech production. We saw earlier that the Teagers’ experimental work provided evidence that speech resonances can change rapidly within a single pitch period,10 possibly due to the rapidly varying and separated speech airflow and modulating vortices in the vocal tract [46],[47],[48]. In addition, vortices that shed from the jet and move downstream can interact with the vocal tract cavity, forming distributed sources due to a nonlinear exchange of kinetic energy in nonacoustic fluid motion into propagating sound. We noted that these distributed sources may reveal themselves as secondary sources within a glottal cycle. We also noted that Kaiser hypothesized that these complex interactions may cause the appearance and disappearance of higher formants in the spectra over successive glottal cycles with different excitations exciting different formants.

10 Time variations of the elements of simple harmonic oscillators can result in amplitude or frequency modulation of the oscillator’s cosine response [51].

The Teagers’ approach to revealing the result of these complex interactions from a speech pressure measurement is the following [48]. First, after locating the resonance center frequencies of the vocal tract cavity, each resonance was bandpass-filtered (with a modulated Gaussian filter) around the resonance center frequency. This filtering approximately separates out the resonant components. Each bandpass-filter output was then processed with the discrete energy operator of Equation (11.30). The energy traces for six consecutive pitch periods from the vowel sound /i/ of the Teagers’ original work are shown in Figure 11.12. The Teagers reasoned that if the formants were simply damped sinewaves, as predicted by linear acoustic theory, then the energy traces would be exponentially decaying over each pitch period (Exercise 11.25). The Teagers hypothesized that the bumps observed in Figure 11.12 within a glottal cycle, on the other hand, reflect the aeroacoustic behavior, in particular “pulsatile vortical flow interactions” with boundaries of the vocal tract; these interactions are dependent on the physical locations and types of vortices present in the vocal tract, perhaps “in three separate locations, working from the glottis to the tongue” [48]. Alternatively, as we will see shortly, one can interpret these observations as the result of amplitude and frequency modulations of cavity resonances.
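The two-step procedure (Gaussian bandpass filtering around a formant, then the energy operator) can be sketched as follows. The filter parameters and the synthetic damped-sinewave "formant" are illustrative stand-ins for real speech; for this linear-model signal the energy trace decays within each cycle, which is the baseline against which the Teagers judged the bumps of Figure 11.12:

```python
import numpy as np

def teager(x):
    """Discrete Teager energy operator."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def gaussian_bandpass(x, fc, fs, alpha=1000.0, half_dur=0.01):
    """Modulated-Gaussian bandpass filter centered on a formant at fc Hz."""
    t = np.arange(-half_dur, half_dur, 1.0 / fs)
    h = np.exp(-(alpha * t) ** 2) * np.cos(2 * np.pi * fc * t)
    return np.convolve(x, h, mode="same")

# Synthetic "formant": a damped sinewave restarted every pitch period
fs, f0, fc = 10000, 100, 2000
period = fs // f0                     # 100 samples per pitch period
n = np.arange(5 * period)
x = 0.98 ** (n % period) * np.cos(2 * np.pi * fc / fs * (n % period))

y = gaussian_bandpass(x, fc, fs)
psi = teager(y)   # energy trace: decays over each cycle for this linear model
```

On real speech, the interesting observation is precisely when the trace does not decay monotonically but shows extra bumps within the cycle.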

Finally, given that our framework in this chapter is high resolution time-frequency analysis, it is of interest to consider the Teager energy as the basis or component of a time-frequency distribution. A detailed study is beyond the scope of this chapter. It suffices to say that a time-frequency distribution has been defined based on the Teager energy function [24]. Specifically, it has been shown that a time-frequency distribution can be defined as a modified Wigner distribution, with the property that its frequency marginal yields the Teager energy. A connection between the Teager energy operator and the ambiguity function, as well as other modifications of the Wigner distribution, has also been established for a generalized Teager operator for complex-valued signals [14]. [The ambiguity function, as we noted earlier in this chapter (footnote number 4), is related to the Wigner and spectrogram distributions.] This has led to the use of the Teager energy operator in measuring various moments of a signal and its spectrum [14].

Figure 11.12 Traces of the Teager energy operator output for the vowel sound /i/: (a) speech waveform; (b)−(d) bandpass-filtered first, second, and third formants and their energy traces.

SOURCE: H.M. Teager and S.M. Teager, “Evidence for Nonlinear Sound Production Mechanisms in the Vocal Tract” [48] (Figure 5). ©1990, Kluwer Academic Publishers, The Netherlands. Used by kind permission from Kluwer Academic Publishers.

Image

11.5.3 Energy Separation

We have seen that the Teager energy operator, when applied to an AM-FM sinewave, yields the squared product of the AM and FM components (more strictly speaking, amplitude envelope and instantaneous frequency). We now describe an approach to separate the time-varying amplitude envelope a(t) and instantaneous frequency Ωi(t) of an arbitrary AM-FM signal, based on the energy operator. First, we present a continuous-time solution for exact estimation of the constant amplitude and frequency of a sinewave and then show that the same equations approximately apply to an AM-FM signal with time-varying amplitude and frequency. We call these algorithms energy separation algorithms because an oscillator’s energy depends on the product of amplitude and frequency [31]. Next, we develop counterpart separation algorithms for discrete-time signals. We then apply discrete-time energy separation to search for modulations in speech formants that we model using AM-FM sinewaves. To simplify notation, we henceforth drop the subscripts from the continuous and discrete energy operator symbols and use Ψ for both.

Consider a cosine x(t) with constant amplitude A and frequency Ωc and its derivative

ẋ(t) = −AΩc sin(Ωct + θ)

Then applying Ψ yields

Ψ[x(t)] = A²Ωc²  and  Ψ[ẋ(t)] = A²Ωc⁴

Hence, the constant frequency and the amplitude of the cosine can be obtained from the equations

Ωc = √(Ψ[ẋ(t)]/Ψ[x(t)])  and  A = Ψ[x(t)]/√(Ψ[ẋ(t)])

Now consider the more general AM-FM sinewave of the form x(t) = a(t) cos[φ(t)] in Equation (11.31). Its derivative is

ẋ(t) = ȧ(t) cos[φ(t)] − a(t)Ωi(t) sin[φ(t)] = y1(t) + y2(t)

To make Equation (11.32) a valid approximation, we henceforth assume the two constraints:

C1: The functions a(t) and q(t) are bandlimited with highest frequencies Ωa and Ωq, respectively, and Ωa, Ωq ≪ Ωc

C2: Image

Further, if we define the order of magnitude of a signal z(t) to be O(z) = O(zmax), where zmax = max|z(t)|, then O(y1) ≈ O(aΩa) and O(y2) ≈ O(aΩi) (see [30],[31] for details and proofs). Since Ωa ≪ Ωc and O(Ωi) = O(Ωc), we have O(y1) ≪ O(y2), and we can obtain the approximation [30],[31]

(11.35)

Ψ[ẋ(t)] ≈ a²(t)Ωi⁴(t)

By combining Equations (11.32) and (11.35), we obtain

Ωi(t) ≈ √(Ψ[ẋ(t)]/Ψ[x(t)])  and  a(t) ≈ Ψ[x(t)]/√(Ψ[ẋ(t)])

At each time instant, this algorithm estimates the instantaneous frequency and the amplitude envelope by using only the two instantaneous output values from the energy operator applied to the signal and its derivative. Although we derived this algorithm by assuming bandlimited modulating signals, there are also other special cases of AM-FM signals where the algorithm yields approximately correct solutions. Examples include chirp signals with linear FM, i.e., Image, whose amplitude and linear instantaneous frequency can be estimated via the above continuous-time energy separation algorithm, provided that Image.

For the validity of the continuous-time energy separation algorithm, it is assumed that Ψ[x(t)] ≥ 0 and Ψ[ẋ(t)] ≥ 0. There are large classes of signals satisfying this condition [30], e.g., all AM-FM signals whose modulation amounts do not exceed 50% and whose Ωa and Ωq are much smaller than Ωc. In discrete simulations of the continuous energy separation algorithm on noise-free AM-FM and bandpass-filtered speech signals, negative Ψ(x) values are rarely encountered; when they do occur, they appear to be due to round-off errors. Note also that at times t0 when Ψ[x(t0)] = 0, we have a(t0) ≈ 0 if Ωi(t) is assumed to be always positive. At such rare time instants, we need additional information to estimate Ωi(t0). For example, in discrete simulations, we can interpolate Ωi(t0) from its immediate neighbors.

Consider now the discrete-time counterpart to the continuous energy separation algorithm for AM-FM sequences of the form of Equation (11.33). Our objective is to estimate their instantaneous frequency

ωi[n] = ωc + ωmq[n]

and to estimate their amplitude envelope a[n] from the discrete-time energy operator output, Ψ(x[n]) = x2[n] − x[n − 1]x[n + 1], and from the operator applied to differences (that approximate the signal derivative). A number of different discrete-time energy separation algorithms have been derived using different approximating derivatives and different assumptions on the form of the AM and FM functions [31]. One algorithm is based on symmetric differences of the form

y[n] = (x[n + 1] − x[n − 1])/2

that looks one sample ahead and one sample behind in time.11 Assuming that the instantaneous frequency ωi[n] is a finite sum of cosine functions and assuming bandwidth constraints on a[n] and q[n] of the types C1 and C2 above, then using Ψ(x[n]) and Ψ(y[n]) we obtain the following formulas for estimating the time-varying frequency and amplitude envelope:

11 Alternative discrete algorithms result if we replace derivatives with backward and forward differences x[n] − x[n − 1] and x[n + 1] − x[n], respectively.

(11.36)

ωi[n] ≈ arcsin(√(Ψ(y[n])/Ψ(x[n]))),    a[n] ≈ Ψ(x[n])/√(Ψ(y[n]))

The frequency estimation part assumes that ωi[n] ≤ π/2 because the arcsin(u) function is implemented for 0 ≤ u ≤ 1, returning values in [0, π/2]. Thus, this discrete energy separation algorithm can be used to estimate instantaneous frequencies up to 1/4 the sampling frequency. Therefore, doubling the original sampling frequency ƒs allows the algorithm to estimate frequencies up to ƒs/2. Note that, if x[n] = A cos(ωcn + θ), then the AM-FM separation formulas yield the exact constant frequency and amplitude (Exercise 11.22). In [31] the (mean absolute and rms) errors of this and other discrete energy separation algorithms are obtained in estimating the amplitude and frequency of synthetic AM-FM signals. On the average (for AM and FM amounts of 5%–50%), the algorithms yield very small errors on the order of 1% or less. In the presence of added white Gaussian noise at a 30-dB SNR, the algorithm yields errors on the order of 10% or less; this low error is probably achieved because the two estimation equations above are the quotients of two functions, each of which is approximately a low-bandwidth function.
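One way to realize this separation in code is sketched below, assuming an arcsin-based form of Equation (11.36) with symmetric differences y[n] = (x[n+1] − x[n−1])/2; the index alignment between Ψ(x) and Ψ(y) is handled explicitly:

```python
import numpy as np

def teager(x):
    """Discrete Teager energy operator."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa(x):
    """Discrete energy separation using symmetric differences (a sketch)."""
    y = (x[2:] - x[:-2]) / 2.0          # symmetric difference, one sample each side
    px = teager(x)[1:-1]                # Psi(x[n]), trimmed to align with Psi(y[n])
    py = teager(y)                      # Psi(y[n])
    wi = np.arcsin(np.sqrt(py / px))    # instantaneous frequency, rad/sample
    a = px / np.sqrt(py)                # amplitude envelope
    return a, wi

# For a constant-amplitude, constant-frequency sinewave the estimates are exact
n = np.arange(200)
x = 1.5 * np.cos(0.6 * n + 0.2)
a, wi = desa(x)
```

On real speech, Ψ can pass near zero at amplitude valleys, so in practice the quotients need guarding (e.g., interpolation from neighbors or median smoothing of the frequency track, as described in the text).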

Example 11.6       Consider the AM-FM sinewave with a sinusoidal FM and decaying sinusoidal AM:

Image

where

Image

Figure 11.13 shows the Teager energy and the estimated amplitude envelope and instantaneous frequency using the discrete energy separation algorithm. Because the AM is much larger than the FM, the FM is hidden in the energy function. Image

An important property of the discrete energy separation algorithm in Equation (11.36) is that the short “window” (five samples) required by the algorithm implies excellent temporal resolution (Exercises 11.19 and 11.26) and thus it is useful in instantaneously adapting during speech transitions and within a glottal cycle. This time resolution is often not achieved by other AM-FM estimation methods such as the Hilbert transform12 (Exercise 11.19).

12 Although experiments on speech signals show that the energy separation algorithm yields similar AM-FM estimates, in a mean-squared error sense, to that of the Hilbert transform, the extremely short window allows the separation algorithm to instantaneously adapt. In addition, the separation algorithm requires less computation [39].

With speech fine structure due to secondary sources and modulations of the nonlinear models described earlier as motivation, a model of a speech resonance within a pitch period by an exponentially-damped AM-FM signal Image was proposed [29],[31]. In this model, the instantaneous frequency ωi[n] = ωc + ωmq[n] models the deviation of a time-varying formant from its center value ωc and r ∈ (0, 1) is related to the rate of energy dissipation. By assuming that the time-varying amplitude and frequency modulating signals can be modeled by a sum of slowly varying sinusoids within a pitch period, i.e., the AM and FM do not vary too fast in time or too greatly compared to the carrier ωc, it follows that the energy separation algorithm can be applied to estimate the AM and FM variations within a glottal cycle. The following example illustrates the energy separation approach on a speech signal:

Figure 11.13 AM-FM estimation using the energy separation algorithm on a synthetic signal: (a) AM−FM signal of Example 11.6; (b) Teager energy; (c) estimated amplitude envelope; (d) estimated instantaneous frequency. A 10000 samples/s sampling rate is assumed.

Image

Example 11.7       Figure 11.14a shows a filtered segment of a speech vowel /e/ sampled at 30 kHz, extracted around a formant at ƒc = 3400 Hz by convolving the speech signal with a bandpass Gaussian filter’s impulse response h(t) = exp(−α²t²) cos(2πƒct), properly discretized and truncated [29],[30] with α = 1000. The corresponding Teager energy is shown in Figure 11.14b. The observed multiple pulses are similar to the “energy pulses” [48] found by the Teagers (Figure 11.12). The observations provide evidence that speech resonances exhibit a structure that does not originate from a linear time-invariant resonator model because, as we noted earlier, applying Ψ to an impulse response x[n] = Arⁿ cos(ωcn + θ) yields a decaying exponential Ψ(x[n]) = A²r²ⁿ sin²(ωc) without any multi-pulse structure. Figure 11.14c shows the estimated (via the energy separation algorithm) amplitude signal, where we see a strong AM modulation and two pulses per pitch period. The estimated instantaneous frequency in Figure 11.14d oscillates around its center value with a deviation that reaches 200 Hz. It contains some isolated narrow spikes, which are usually caused either by amplitude valleys or by the onset of a new pitch pulse. These spikes can be eliminated by post-smoothing the frequency signal using a median filter [31]. Excluding these narrow spikes, in vowels the instantaneous frequency and amplitude envelope profiles follow simple oscillatory, roughly sinusoidal, patterns. Typically, two-to-four amplitude pulses and formant frequency oscillations are seen within a pitch period in experiments with signals from speech vowels, particularly in higher formants. Consistent with the Teagers’ original observations, for signals from low formants of vowels it is typical to observe one energy pulse per pitch period. This may be partially explained by a large amount of damping, or by a low amount of modulation for the specific speaker/sound/formant combination. Image

Figure 11.14 Estimation of AM–FM in a speech resonance (around 1500 Hz) using the energy separation algorithm: (a) signal from filtered speech vowel; (b) Teager energy of (a); (c) estimated amplitude envelope; (d) estimated instantaneous frequency.

Image
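The formant-isolation step of Example 11.7 can be sketched as follows; this is a minimal Python/NumPy illustration, not the code of [29],[30]. The truncation interval, the unity-gain normalization, and the synthetic two-resonance test signal are our own assumptions; only the filter form h(t) = exp(−α²t²) cos(2πƒct) with α = 1000 comes from the example.

```python
import numpy as np

fs = 30000.0     # sampling rate of the vowel in Example 11.7
alpha = 1000.0   # Gaussian bandwidth parameter from the example
fc = 3400.0      # formant center frequency (Hz)

# Discretize and truncate h(t) = exp(-alpha^2 t^2) cos(2 pi fc t);
# the Gaussian envelope is negligible beyond |t| = 4/alpha.
t = np.arange(-4.0 / alpha, 4.0 / alpha, 1.0 / fs)
h = np.exp(-(alpha * t) ** 2) * np.cos(2.0 * np.pi * fc * t)
h /= np.sum(h * np.cos(2.0 * np.pi * fc * t))   # ~unity gain at fc

# Synthetic two-resonance input: only the 3400-Hz component
# should survive the bandpass filter.
n = np.arange(2000)
x = np.cos(2 * np.pi * 800.0 / fs * n) + 0.5 * np.cos(2 * np.pi * 3400.0 / fs * n)
y = np.convolve(x, h, mode="same")
```

With these parameters the filter’s magnitude response is essentially Gaussian with a standard deviation of a few hundred hertz around ƒc, so the 800-Hz component is suppressed by many orders of magnitude; the filtered output y is then suitable input for the energy separation analysis of the example.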

Finally, we note that the AM–FM model of a single resonance does not explicitly take into consideration that actual speech vowels are quasi-periodic and usually consist of multiple resonances. Both of these phenomena may affect the AM-FM estimates. We mentioned above that the pitch periodicity induces narrow spikes in the amplitude and (mainly) the frequency signal around the onset of each glottal pulse. In addition, the effect of neighboring formants that have not been completely rejected by the bandpass filter is to cause “parasitic” FM and AM modulations, which have a smaller order of magnitude than the main observed modulations [31]. Various approaches have been proposed for jointly estimating the multiple AM–FM sinewave components of speech signals [22],[50].

11.6 Summary

We have seen that speech fine structure in the speech waveform can arise from either time-varying linear or nonlinear elements of the speech production mechanism. These speech events typically occur on the order of a glottal cycle and are thus difficult to reveal and measure with signal processing techniques such as the short-time Fourier transform and the wavelet transform that are characterized by time-frequency resolution tradeoffs. In this chapter, we described alternative high-resolution measurement techniques that aim to undermine the uncertainty principle and reveal fine structure not seen by the standard approaches. These more advanced time-frequency analysis methods included the Wigner distribution and its variations, referred to as bilinear time-frequency distributions, that have historically been developed in the context of estimating fine structure speech events associated largely with linear models. We also introduced a second approach to the analysis of fine structure motivated by the nonlinear aeroacoustic models for spatially distributed sound sources and modulations induced by nonacoustic fluid motion. We described qualitatively this modeling approach which then provided the motivation for the high-resolution Teager energy operator. Although significant progress has been made in nonlinear measurement and modeling techniques for speech analysis and synthesis, as we indicated throughout this chapter, the speech research community has only brushed the surface of these challenging and important areas.

An underlying assumption in this chapter is that a human listener perceives the fine structure that we are attempting to measure. It behooves us then to close with a few words on the time-frequency resolution of the auditory system. We do this briefly, and speculatively, from the perspectives of the two primary analysis tools of this chapter: (1) advanced time-frequency representations and (2) the Teager energy operator. From the perspective of advanced time-frequency analysis, we saw numerous speculations in Chapter 8. For example, an interesting possibility that arose from the auditory model of Yang, Wang, and Shamma [54] is that the auditory system performs jointly wideband and narrowband analysis. In this model, good joint time-frequency resolution is obtained from the lateral inhibition that occurs across adjacent auditory filters, e.g., in the form of differences across adjacent channels.

The second analysis tool of this chapter, the Teager energy operator, is consistent with the hypothesis that the auditory system exploits an FM transduction mechanism. Psychoacoustic experiments by Saberi and Hafter [41] provide evidence that FM and AM sounds must be transformed into a common neural code before binaural convergence in the brain stem. They showed that listeners can accurately determine if the phase of an FM signal presented to one ear is leading or lagging the phase of an AM signal presented to the other ear, with peak performance occurring when the signals are 180° out of phase. Saberi and Hafter’s explanation of their experiments is based on FM-to-AM transduction in which the instantaneous frequency of an FM signal sweeps through the passband of a cochlear filter and the change in frequency is transformed into a change in amplitude. Motivated by this hypothesis, an approach to AM–FM estimation was developed based on the transduction of FM into AM by linear filters and on using the quotient of the amplitude envelopes of two overlapping filters (Exercise 11.23) [32],[40]. An auditory neural model was then proposed for how AM-FM separation may be achieved from pairs of overlapping auditory filter envelope outputs [3]. Specifically, it was shown that the transduction approach can be realized as a bank of auditory-like bandpass filters followed by envelope detectors and laterally inhibiting shunting neural networks that perform ratio processing, i.e., quotients between signal levels.13 The resulting nonlinear dynamical system is capable of robust AM–FM estimation in noisy environments. The relation of this auditory processing model with the Teager energy operator lies in the observation that the transduced auditory filter output envelope is to a first approximation the product of AM and FM in the output (more strictly, the product of instantaneous amplitude and frequency) [40]. 
(This interpretation can be argued when the transduction filters have linear shape.) An interesting speculation is that the amplitude envelope of an auditory filter’s output, being thus related to the “energy” required to generate the input sinewave, may provide a robust representation for further auditory stages.
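The transduction scheme can be illustrated numerically. The sketch below is our own toy construction, not the algorithm of [32],[40] or the neural model of [3]: it assumes two overlapping analytic filters with Gaussian magnitude response (the center frequencies, bandwidth, and test signal are arbitrary choices), for which the log ratio of the two output envelopes is linear in the instantaneous frequency.

```python
import numpy as np

fs = 10000.0
n = np.arange(4096)

# AM-free FM test signal: 2000-Hz carrier with +/-150-Hz sinusoidal FM.
inst_f = 2000.0 + 150.0 * np.cos(2.0 * np.pi * 80.0 * n / fs)
x = np.cos(2.0 * np.pi * np.cumsum(inst_f) / fs)

# Complex (analytic) filter with Gaussian magnitude response
# |H(f)| = exp(-(f - f0)^2 / (2 sigma_f^2)), normalized to unity at f0.
sigma_f = 400.0
def envelope(x, f0):
    t = np.arange(-512, 513) / fs
    h = np.exp(-2.0 * (np.pi * sigma_f * t) ** 2) * np.exp(2j * np.pi * f0 * t)
    h /= np.abs(np.sum(h * np.exp(-2j * np.pi * f0 * t)))
    return np.abs(np.convolve(x, h, mode="same"))   # FM transduced into AM

f1, f2 = 1800.0, 2200.0
e1, e2 = envelope(x, f1), envelope(x, f2)

# For Gaussian filter shapes, ln(e2/e1) is linear in the instantaneous
# frequency, so the FM is recovered from the envelope ratio alone.
f_est = (f1 + f2) / 2.0 + sigma_f ** 2 * np.log(e2 / e1) / (f2 - f1)
```

Away from the edges of the convolution, f_est tracks inst_f closely; any common amplitude modulation A(t) cancels in the quotient e2/e1, which is the point of the ratio processing described above.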

13 Shunting neural networks have been used to explain a range of biological phenomena through ratio processing. For example, these networks have been shown to provide dynamic range compression and contrast enhancement in human vision [13]. Since shunting neural networks provide ratio processing, it is natural to implement the transduction approach with a feedforward, two-neuron, shunting circuit in which one neuron provides an FM estimate and the other provides an AM estimate. The coupled equations for neuronal activity can be written as ordinary differential equations with time-varying coefficients [3].

In spite of these intriguing possibilities, it is important to emphasize that such auditory models are speculative. Even if this sort of signal processing is exploited, the mechanism may not be a linear one. For example, the models assume that the filter shapes are constant. In the cochlea, however, there is a fast-acting automatic gain control (AGC) that provides compression in the main part of the filter passband (the tip of the filter), while leaving the gain in the low-frequency portion of the filter (the tail) unaffected. The compression may be the result of an almost instantaneous saturating nonlinearity in the cochlear mechanics [42], and could both introduce fluctuations in the system output for a low-frequency sinusoid with constant amplitude and reduce the fluctuations in an AM-FM modulated tone. Incorporating the influence of such a nonlinearity and other mechanisms, such as those described in Chapter 8, is essential in understanding the full complexity of signal processing in the auditory system.

Exercises

11.1 Show in continuous time for the wavelet transform that the total energy in a signal equals the scaled energy in the scalogram, i.e., the relation in Equation (11.2) is satisfied. Then show in continuous time for the STFT that the total energy in a signal equals the energy in the spectrogram:

Image

provided that the energy in the analysis window is unity.

11.2 Consider the complex exponential signal Image. For this signal we have defined the instantaneous frequency as the derivative of the phase, i.e., Image. Show that the average and instantaneous frequency are related by

Image

11.3 Figure 11.2 shows an example of short-time spectra of two synthetic speech waveforms, both from a linear time-invariant (steady) vocal tract with a transfer function H(Ω). The difference in the two cases lies in the excitation. For Figure 11.2a, the underlying excitation is a pure impulse train with pitch Ωo, while for Figure 11.2b, the pitch changes linearly as

Ωo(t) = Ωo + αt.

(a) Write the expression for a sinewave model for the speech waveform corresponding to each spectrum in Figure 11.2 in terms of the vocal tract transfer function H(Ω) and pitch (Ωo or Ωo + αt).

(b) Using Figure 7.8, explain the observed spectra in terms of short-time Fourier analysis. What is the implication of your reasoning for the use of spectral peak-picking in Chapter 9 for accurate estimation of sinewave parameters?

(c) Under the condition that the pitch is changing rapidly under the short-time analysis window, propose an analysis scheme that may result in more accurate estimates of sinewave amplitude and frequency than simple peak-picking. Consider methods such as the wavelet transform and the Teager energy-based AM-FM separation method. If there appears to be no simple way to improve performance with these methods, then explain your reasoning.

11.4 Derive the expressions in Equations (11.11) and (11.12) for the conditional mean and variance of time and frequency. Derive also the expressions in Equations (11.13) and (11.14) for the relation between conditional and global quantities. You will need the relations Image and, likewise, Image.

11.5 From the expressions in Equation (11.13) for the global frequency average and its time-domain dual, argue that the instantaneous frequency and group delay are given by the conditional mean frequencyImage and conditional mean time Image, i.e., Equation (11.15), provided the marginals are satisfied.

11.6 Consider a signal consisting of a complex sinewave of the form x(t) = a(t)e^(jΩt). Compute the instantaneous bandwidth of the following special cases:

(a) x(t) = Ae^(jΩt)

(b) x(t) = Ae^(jΩt)u(t)

(c) x(t) = Ae^(−αt)e^(jΩt)

(d) x(t) = Ae^(−αt)e^(jΩt)u(t)

where u(t) is the unit step. What special characteristics arise with the unit step signal u(t)?

11.7 In this problem, you consider the separability of signal components for time-frequency distributions.

(a) Suppose that for a particular time-frequency distribution, the instantaneous frequency and bandwidth are given by Equations (11.15) and (11.17). Argue that Equation (11.18) must hold for the “ridges” of the distribution to reveal the components in frequency at a particular time instant.

(b) Consider now the dual problem of the separation of signal components in time at a particular frequency as revealed in a time-frequency distribution. Argue that for a two-component signal spectrum of the form

X(Ω) = M1(Ω)e^(jθ1(Ω)) + M2(Ω)e^(jθ2(Ω))

the spectral magnitudes M1(Ω), M2(Ω) and group delays θ1(Ω), θ2(Ω) are constrained by

Image

when the marginals of the distribution are satisfied.

11.8 In this problem, you consider properties of the spectrogram |X(t, Ω)|² of a signal x(t), i.e., the squared STFT magnitude, as a time-frequency distribution.

(a) Show that with analysis window w(t)

Image

and thus that if the energy of the window function is unity, then the energy in the spectrogram equals the energy in the signal.

(b) Show that integrating the spectrogram over frequency gives

Image

and thus the time marginal is not satisfied. Show the dual relation when integrating over time.

11.9 Define the time spread of a signal, x(t), as

Image

where Image is the temporal mean of the signal. Similarly, define the frequency spread or bandwidth of x(t) as

Image

(a) Compute the time-bandwidth product, Image, for the following signal:

x1(t) = e^(−a(t − t1)²) cos(Ω1t).

(b) Compute the time-bandwidth product for the following signal:

x2(t) = e^(−b(t − t2)²) e^(j[ct² + m sin(Ωmt) + Ω2t]).

(c) Can x2(t) be designed to have the same time-bandwidth product as x1(t)? If so, what are the constraints on the signal parameters b, c, m, t2, Ωm, and Ω2?

11.10 One approach to achieving good joint time-frequency resolution is through a geometric or additive mean of the narrowband and wideband spectrograms, given by [|X1(n, ω)|²|X2(n, ω)|²]^(1/2) and [|X1(n, ω)|² + |X2(n, ω)|²]/2, respectively, where |X1(n, ω)|² denotes the narrowband spectrogram and |X2(n, ω)|² the wideband spectrogram. Describe which of the properties P1–P5 of Section 11.3.1, which characterize a proper time-frequency distribution, hold for each of these mean distributions.

11.11 In this problem you investigate various properties of the Wigner distribution.

(a) Show that the Wigner distribution can be written in terms of its spectrum as in Equation (11.20).

(b) Show, using Equations (11.19) and (11.20), that a signal compact in time or in frequency remains compact in these dimensions within the Wigner distribution, thus avoiding loss in time-frequency resolution. Show, however, that zero gaps in a signal or its spectrum are not preserved in the distribution. Therefore, the Wigner distribution has weak, but not strong finite support.

(c) Show that the marginals, Equation (11.21), and thus the energy equivalence relation, Equation (11.22), are satisfied for the Wigner distribution. Why are the marginals for the Wigner distribution more desirable than those for the spectrogram?

11.12 In this problem, you consider the local and global time-bandwidth products for a distribution.

(a) Starting with the relation between a joint time-frequency density P(t, Ω) and the conditional densities P(Ω|t) and P(t|Ω), derive expressions for the conditional (local) time and frequency variances Image and Image. How is the product Image bounded?

(b) Compare differences in the local and global time-bandwidth products, i.e., between Image and Image.

(c) Because of the presence of the analysis window, explain why “beating” of the uncertainty principle by the local time-bandwidth product can never occur with the spectrogram. This property is therefore in contrast to that of the Wigner distribution.

11.13 Determine the Wigner distribution, W(t, Ω), for the signal

x(t) = e^(j(at² + bt + c)).

How is the conditional mean frequency for the Wigner distribution related to the instantaneous frequency for this signal?

11.14 Show that if

Image

then the time-frequency marginals of the canonic distribution in Equation (11.25) are satisfied.

11.15 Consider a voiced speech waveform where the pitch is linearly changing and the vocal tract is steady. Design a waveform time-warping procedure resulting in a steady pitch. What is the effect of time-warping on the vocal tract impulse response? Discuss the implications of your time-warping operation on the STFT time-frequency resolution when the pitch is rapidly varying. How might you exploit the modified resolution in sinusoidal analysis/synthesis? Hint: Consider the stationarity of the resulting sinewave amplitudes and frequencies.

11.16 Consider the following psychoacoustic experiment. Suppose we add two synthetically generated vowels, both having steady pitch and steady vocal tract. Human listening experiments indicate that the two vowels are not perceived as separate, i.e., we do not hear two distinct vowels. However, when we add vibrato, i.e., sinusoidal modulation of the pitch of one vowel [so that its pitch becomes Ωo + α sin(Ωmt)], a listener is more likely to perceive two separate vowel sounds, indicating the importance of modulation in human perception. Given the constant-Q filter-bank model of auditory cochlear front-end filters of Chapter 8 and the filter transduction approach to AM-FM separation described in Section 11.6, propose an auditory mechanism for sound separation based on the presence of modulation.

11.17 Consider the dynamical mechanical model of Barney, Shadle, and Davies [2], described in Section 11.4.3. Suppose that the shutter speed is 80 Hz, the mean flow rate Image is 80 cm/s, and the duct length is 17 cm. Assuming a train of equally-spaced vortices moving at the mean flow rate, compute the number of vortices in the duct at any time.

11.18 We have observed that multiple acoustic and nonacoustic sources may exist in the vocal tract cavity, not all located at the opening of the glottis, especially where the vortical train meets boundary discontinuities. Consider now determining a solution approach for estimating the frequency response of each contribution from a measured correlation function of a very low-pitched speaker.

(a) Consider an example where a primary source pulse occurs at the glottis due to a conventional acoustic source and a secondary source pulse occurs downstream, e.g., behind the teeth due to a traveling vortex, i.e., the speech waveform is given by

s[n] = δ[n] * h1[n] + δ[n − n0] * h2[n]

where h1[n] and h2[n] are responses to the primary and secondary source pulses, respectively. Show that the autocorrelation function of s[n] consists of two components, one near the origin and the second near n = n0, i.e.,

rs[τ] = r11[τ] + r22[τ] + r12[τ − n0] + r12[−τ − n0]

where

Image

An important observation here is that the low-order correlation coefficients that are typically used in speech analysis, e.g., linear prediction, consist of the sum of correlation coefficients from two different impulse responses. The higher-order coefficients near n = n0 are due primarily to the cross-correlation between these two responses. Argue why the correlation coefficients around n = n0 are not symmetric for the conditions of this problem.

(b) To do accurate modeling, our goal becomes to separate the correlation functions r11[τ] and r22[τ]. In general, this is a difficult problem. To simplify, we constrain n0 to be large enough to separate the sum near the origin, r11[τ] + r22[τ], and the cross-correlation function, r12[τ + n0]. Show that we can then formulate the following quadratic equation

X²(ω) + A(ω)X(ω) + B(ω) = 0

with

Image

where |S11(ω)|², |S22(ω)|², and |S12(ω)|² are the Fourier transforms of r11[τ], r22[τ], and r12[τ], respectively. Show that the two solutions to this quadratic equation are |S11(ω)|² and |S22(ω)|².

11.19 Figure 11.15 shows (a) a voiced speech segment, (b) its bandpass-filtered version (with a modulated Gaussian filter centered at the first formant), and (c) the envelope of the bandpass-filtered waveform. The envelope is the amplitude of the bandpass-filtered waveform obtained from the energy separation algorithm of Section 11.5. Assume that the bandpass filter separates the first formant exactly (i.e., there is no leakage from neighboring formants) and assume that the vocal tract and vocal cords are in an approximate steady-state.

(a) Explain why the envelope measurement of Figure 11.15c violates the simple voiced speech model of a linear system driven by a periodic pulse train. Suggest possible “anomalies” in speech production that may describe the observed envelope.

(b) How well do you think the STFT (computed with a window of duration 2.5 times a pitch period) would capture the fine structure in the observed envelope?

Figure 11.15 Voiced speech waveform with “anomaly”: (a) waveform; (b) bandpass-filtered waveform; (c) envelope of bandpass-filtered waveform.

Image

(c) Suppose that you attempt to estimate the AM and FM of a resonance where more than a cycle of modulation occurs over a very short-time duration, say, less than a pitch period. Comment on problems you might have in estimating these modulations by ridge tracking with the STFT (the spectrogram). Does the “instantaneous” nature of the Teager energy-based methods help avoid these problems? Explain your reasoning.

(d) Compare the temporal resolution of AM-FM estimation based on the Teager operator and based on the Hilbert transform for a finite data segment with rapidly varying modulation. Hint: Consider the Hilbert transform response of Equation (2.10).

11.20 We showed in Section 11.5 that the Teager energy operator Ψ(x) output for a sinewave of the form

x(t) = A cos(Ωt)

is given by

Ψ(x) = A²Ω².

Also we derived in Section 11.5 a method to separate the amplitude A and frequency Ω, given both the Teager energy output of the signal and its derivative, i.e., Ψ(x) and Image. Suppose, on the other hand, you are given x(t) and its integral Image. Derive a method to separate the amplitude A and frequency Ω from the Teager energy of x(t) and y(t), i.e., from Ψ(x) and Ψ(y). Why might this algorithm have better frequency resolution than the derivative form for very low-frequency sinewaves?

11.21 In this problem, we consider the Teager energy for the sum of two sequences:

x[n] = x1[n] + x2[n]

(a) Show that the output of the discrete-time Teager energy operator can be expressed as

Image

where Image and Image are the Teager energy cross terms.

(b) For the sum of two sinewaves

x[n] = A1 cos(ω1n + θ1) + A2 cos(ω2n + θ2)

show that the Teager energy is the sum not only of the individual energy components but also an additional cross term at the difference frequency of the two frequency components.

11.22 Show that when x[n] = A cos(ωcn + θ), then the AM-FM separation formula of Equation (11.36) yields the exact constant frequency and amplitude.

11.23 In this problem, you consider the FM-to-AM transduction approach to AM-FM sinewave estimation. Consider an AM-FM signal of the form Image. The AM component is A(t). The FM component is considered as the instantaneous frequency, Image. Suppose the output of a front-end auditory filter, y(t), in response to the input signal, is the convolution of the impulse response of the filter, h(t), with the input signal, i.e. y(t) = x(t) * h(t). We define the amplitude envelope of the filter output as the rectified signal |y(t)|. (We assume h(t) is in analytic form.) Transduction consists of sweeping an FM signal through a nonuniform bandpass filter; the change in frequency induces a corresponding change in the amplitude envelope of the filter output.

(a) Show that under a quasi-stationary assumption on the sinewave AM and FM, an FM-to-AM transduction approximation holds such that the amplitude envelope of the filter output is given by |y(t)| ≈ A(t)|H[Ω(t)]|, where H[Ω] is the frequency response of the filter.

(b) The AM-FM estimation problem can be solved by taking the ratio of the output of two overlapping auditory filters, H1[Ω(t)] and H2[Ω(t)]. Specifically, under the transduction approximation, show that

Image

so that when the filters have a Gaussian shape, one can solve for Ω(t) and recover the FM component under certain mild constraints. Then show that the recovered FM component can be used to recover the AM component. What general class of overlapping bandpass filters allows AM and FM estimation through transduction?

11.24 (MATLAB) In this problem, you explore the analysis of frequency-modulated sinewaves using two different time-frequency distributions: the short-time Fourier transform (STFT) magnitude and the Wigner distribution. There are three functions you will use: cross.m, stftm.m, and wigner.m located in companion website directory Chap_exercises/chapter11.

      Run cross.m in your MATLAB workspace by typing cross. This will generate (among other items) two signals x and x1 that you will use. x1 is a single FM sinewave and x consists of the sum of two crossing FM sinewaves.

(a) Run stftm.m (with a 256-point window) on x1 and also run wigner.m on x1. Compare and explain your results. The function stftm.m, in addition to a spectrogram, also displays an estimate of instantaneous frequency equal to the location of the spectrogram ridge corresponding to the peak in the STFT magnitude at each time instant.

(b) Repeat part (a) for the signal x with crossing frequency trajectories. Include discussion of the phenomena at the trajectory intersection. Also explain any peculiarities of the Wigner distribution. What difficulties might you encounter in doing frequency estimation by ridge tracking on the Wigner distribution?

(c) One way to achieve better temporal resolution with the STFT magnitude is to shorten the analysis window. Run stftm.m with window lengths 128, 64, and 32, achieving increasingly better time resolution. Explain your observations.

(d) Load the speech segment speech1_10k (at 10000 samples/s) of workspace ex11M1.mat located in companion website directory Chap_exercises/chapter11 and run wigner.m on it. Compare with the spectrogram with regard to the difficulty in interpreting spurious vertical and horizontal striations that come about from cross terms. How does this affect your ability to perceive speech fine structure?

Note: To run wigner.m successfully, you must truncate the input speech signal length to the nearest power of 2. (This is why the signals in cross.m are of length 1024.)

11.25 (MATLAB) In this problem we explore the Teager energy operator in continuous and discrete time.

(a) Show that for a simple continuous-time harmonic oscillation of the form

x(t) = A cos(Ωt + θ)

the continuous-time energy operator Equation (11.28) gives the energy (within a scale factor) required to generate the signal, i.e., A²Ω².

(b) Show that discretizing Image and Image by first backward differences in the continuous-time energy operator leads to the time-shifted and scaled discrete-time energy operator in Equation (11.29).

(c) Show that for a simple discrete-time harmonic oscillation of the form

x[n] = A cos(ωn + θ),

the discrete-time energy operator Equation (11.30) gives Ψd(x[n]) = A² sin²(ω). Explain why your result is exact provided that the frequency ω < π/2, i.e., less than one quarter of the sampling frequency. Then show that the operator gives approximately the energy as the squared product of frequency and amplitude, ω²A², for small values of ω. Suppose that you limit the frequency to ω < π/8. What percent relative error do you entail in measuring the energy A²ω²?

(d) Show that the discrete-time Teager energy for an exponentially-damped sinewave of the form

Image

is given by

Ψd(x[n]) = A²r²ⁿ sin²(ω).

How can you approximate the resulting expression for small frequency? Write MATLAB code for the discrete-time energy operator Equation (11.30) and apply it to a damped sinewave with r = 1, r = .99, and Image. Let the frequency change from 100 Hz to 2000 Hz in steps of 500 Hz (for a sampling rate of 10000 Hz). What observations can you make?

(e) Using the discrete-time energy operator Equation (11.30), derive the Teager energy for a chirp signal with linear frequency sweep

Image

i.e., for a signal of the form

Image

Assume that the rate of change in frequency is small enough so that Image and Image. Use your MATLAB code for the discrete-time energy operator from part (d) and apply it to the chirped sinewave with ωo = 500 Hz. Let the sweep rate change from Δω = 100 Hz to 2000 Hz in steps of 500 Hz (for a sampling rate of 10000 Hz). What observations can you make?

11.26 (MATLAB) In this problem you investigate the temporal resolution of the discrete energy separation algorithm of Equation (11.36).

(a) Explain why the algorithm in Equation (11.36) requires only five samples to perform AM–FM estimation.

(b) Concatenate two sinewaves of the form A cos(2πƒt), each of 0.5 s duration, the first with A = 1 and ƒ = 500 Hz, and the second with A = 1.5 and ƒ = 700 Hz. (In discrete form, assume a sampling rate of 10000 samples/s.) Use the provided energy separation function (amfm_sep.m in companion website directory Chap_exercises/chapter11) to estimate the AM and FM of the signal. Comment on the temporal resolution in comparison with other AM-FM estimators (from time-frequency distributions) we have studied in this chapter.

11.27 (MATLAB) You will find in the companion website directory Chap_exercises/chapter11 functions amfm_sep.m, gaussian_filt.m, specgram_speech.m and a workspace ex11M2.mat. In this problem, you are asked to analyze a speech waveform to determine certain underlying modulations of the speech.

(a) Estimate the center frequency of the second formant of the waveform speech_10k (at 10000 samples/s) in ex11M2.mat and apply a Gaussian bandpass filter (using gaussian_filt.m) to approximately filter out this second formant. Plot the resulting amplitude- (AM) and frequency-modulated (FM) filtered signal over the signal range [501:1000]. (Use any technique you have learned thus far to estimate the center frequency of the second formant.)

(b) Now apply amfm_sep.m to the filtered signal from part (a) and plot the AM and FM estimates and Teager energy estimate over the signal range [501:1000]. The function amfm_sep.m implements the AM-FM energy separation algorithm. Plot also the conventional energy measure (signal squared); how does it differ from the Teager energy?

(c) Use the function specgram_speech.m to find the spectrogram of the entire signal speech_10k and observe the global trend on the second formant. Then redo parts (a) and (b) to find the AM and FM estimates over the signal sample range [501:3000] (in contrast to [501:1000]). In the former, you will see the detail in the AM and FM over a long time interval, as well as the general trends. In the latter, you will see the detail over a short time interval.

(d) Propose an explanation for your observations in parts (b) and (c). Think about the AM and FM within a glottal cycle; speculate on how the speech source and the vocal tract activity might contribute to what you observe. Finally, might the AM–FM energy separation algorithm have certain limitations? Can you propose an alternative AM–FM estimator that might address these limitations?
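One standard alternative estimator, compared with the energy operator approach by Potamianos and Maragos [39], is Hilbert-transform demodulation: form the analytic signal, take its magnitude as the AM, and differentiate its phase for the FM. A minimal Python sketch on the concatenated-sinewave signal of the previous exercise (illustrative, not companion code):

```python
import numpy as np

def hilbert_amfm(x, fs):
    # AM-FM demodulation via the analytic signal (FFT-based Hilbert transform)
    N = len(x)
    X = np.fft.fft(x)
    # analytic signal: zero negative frequencies, double the positive ones
    H = np.zeros(N)
    H[0] = 1.0
    H[1:(N + 1) // 2] = 2.0
    if N % 2 == 0:
        H[N // 2] = 1.0
    z = np.fft.ifft(X * H)
    amp = np.abs(z)                                   # AM envelope
    phase = np.unwrap(np.angle(z))
    freq = np.diff(phase) * fs / (2.0 * np.pi)        # FM in Hz
    return amp, freq

fs = 10000
t = np.arange(int(0.5 * fs)) / fs
x = np.concatenate([np.cos(2 * np.pi * 500 * t),
                    1.5 * np.cos(2 * np.pi * 700 * t)])
amp, freq = hilbert_amfm(x, fs)
n1 = len(t)
```

Unlike the energy operator, which uses only a three-sample neighborhood, the Hilbert transform is a global (long) operator, so the two estimators trade temporal locality against smoothness; this contrast is one starting point for part (d).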

Bibliography

[1] J.D. Anderson, Jr., Modern Compressible Flow, McGraw-Hill, New York, NY, 1982.

[2] A. Barney, C.H. Shadle, and P.O.A.L. Davies, “Fluid Flow in a Dynamical Mechanical Model of the Vocal Folds and Tract, 1: Measurements and Theory,” J. Acoustical Society of America, vol. 105, no. 1, pp. 444–455, Jan. 1999.

[3] R. Baxter and T.F. Quatieri, “Shunting Networks for Multi-Band AM-FM Decomposition,” Proc. of IEEE 1999 Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, Oct. 1999.

[4] B. Boashash, “Estimating and Interpreting the Instantaneous Frequency of a Signal,” Proc. IEEE, vol. 80, pp. 520–538, April 1992.

[5] G.F. Boudreaux-Bartels and T.W. Parks, “Time-Varying Filtering and Signal Estimation Using Wigner Distribution Synthesis Techniques,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 34, pp. 442–451, 1986.

[6] A. Bovik, P. Maragos, and T.F. Quatieri, “AM-FM Energy Detection and Separation in Noise Using Multiband Energy Operators,” IEEE Trans. Signal Processing, special issue on wavelets, vol. 41, no. 12, pp. 3245–3265, Dec. 1993.

[7] T.A.C.M. Claasen and W.F.G. Mecklenbräuker, “The Wigner Distribution—A Tool for Time-Frequency Analysis—Part 1: Continuous Time Signals,” Philips Jour. Research, vol. 35, pp. 217–250, 1980.

[8] T.A.C.M. Claasen and W.F.G. Mecklenbräuker, “The Wigner Distribution—A Tool for Time-Frequency Analysis—Part 2: Discrete-Time Signals,” Philips Jour. Research, vol. 35, pp. 276–300, 1980.

[9] T.A.C.M. Claasen and W.F.G. Mecklenbräuker, “The Wigner Distribution—A Tool for Time-Frequency Analysis—Part 3: Relations with Other Time-Frequency Signal Transformations,” Philips Jour. Research, vol. 35, pp. 372–389, 1980.

[10] L. Cohen, Time-Frequency Analysis, Prentice Hall, Englewood Cliffs, NJ, 1995.

[11] J.G. Daugman, “Uncertainty Relation for Resolution in Space, Spatial Frequency, and Orientation Optimized by Two-Dimensional Visual Cortical Filters,” J. Opt. Soc. Am. A, vol. 2, no. 7, pp. 1160–1169, July 1985.

[12] P.O.A.L. Davies, “Aeroacoustics and Time-Varying Systems,” J. Sound Vibration, vol. 190, pp. 345–362, 1996.

[13] S. Grossberg, “The Quantized Geometry of Visual Space: The Coherent Computation of Depth, Form, and Lightness,” The Brain and Behavioral Sciences, vol. 6, pp. 625–692, 1983.

[14] R. Hamila, J. Astola, F. Alaya Cheikh, M. Gabbouj, and M. Renfors, “Teager Energy and the Ambiguity Function,” IEEE Trans. on Signal Processing, vol. 47, no. 1, pp. 260–262, Jan. 1999.

[15] A. Hirschberg, “Some Fluid Dynamic Aspects of Speech,” Bulletin de la Communication Parlee, vol. 2, pp. 7–30, 1992.

[16] M.S. Howe, “Contributions to the Theory of Aerodynamic Sound, with Application to Excess Jet Noise and the Theory of the Flute,” J. Fluid Mechanics, vol. 67, part 3, pp. 625–673, 1975.

[17] T. Irino and R.D. Patterson, “A Time-Domain, Level-Dependent Auditory Filter: The Gammachirp,” J. Acoustical Society of America, vol. 101, no. 1, pp. 412–419, Jan. 1997.

[18] J. Jeong and W.J. Williams. “Alias-Free Generalized Discrete-Time Time-Frequency Distributions,” IEEE Trans. on Signal Processing, vol. 40, no. 11, pp. 2757–2765, Nov. 1992.

[19] J.F. Kaiser, “Some Observations on Vocal Tract Operation from a Fluid Flow Point of View,” chapter in Vocal Fold Physiology: Biomechanics, Acoustics, and Phonatory Control, I.R. Titze and R.C. Scherer, eds., The Denver Center for the Performing Arts, pp. 358–386, May 1983.

[20] J.F. Kaiser, “On a Simple Algorithm to Calculate the ‘Energy’ of a Signal,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 1, Albuquerque, NM, pp. 381–384, April 1990.

[21] J.F. Kaiser, “On Teager’s Energy Algorithm and its Generalization to Continuous Signals,” in Proc. 4th IEEE Digital Signal Processing Workshop, Mohonk (New Paltz), NY, Sept. 1990.

[22] R. Kumaresan, A.G. Sadasiv, C.S. Ramalingam, and J.F. Kaiser, “Instantaneous Non-Linear Operators for Tracking Multicomponent Signal Parameters,” Proceedings Sixth SP Workshop on Statistical Signal & Array Processing, Victoria, British Columbia, Canada, pp. 404–407, Oct. 1992.

[23] J. Liljencrants, “Numerical Simulations of Glottal Flow,” chapter in Vocal Fold Physiology, J. Gauffin and B. Fritzell, eds., Raven Press, 1990.

[24] W. Lin and P. Chitrapu, “Time-Frequency Distributions Based on Teager-Kaiser Energy Function,” Proc. Int. Conf. Acoustics, Speech, and Signal Processing, Atlanta, GA, vol. 3, pp. 1818–1821, May 1996.

[25] P.J. Loughlin and K.L. Davidson, “Positive Local Variances of Time-Frequency Distributions and Local Uncertainty,” Proc. IEEE-SP Int. Symp. on Time-Frequency and Time-Scale Analysis, pp. 541–544, Oct. 1998.

[26] P.J. Loughlin, J. Pitton, and L.E. Atlas, “Bilinear Time-Frequency Representations: New Insights and Properties,” IEEE Trans. Signal Processing, vol. 41, pp. 750–767, no. 2, Feb. 1993.

[27] R.F. Lyon, “The All-Pole Gammatone Filter and Auditory Models,” in Proc. Forum Acusticum 96, Antwerp, Belgium, 1996.

[28] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, New York, NY, 1998.

[29] P. Maragos, T. F. Quatieri, and J. F. Kaiser, “Speech Nonlinearities, Modulations, and Energy Operators,” Proc. Int. Conf. Acoustics, Speech, and Signal Processing, Toronto, Canada, pp. 421–424, May 1991.

[30] P. Maragos, J. F. Kaiser, and T. F. Quatieri, “On Amplitude and Frequency Demodulation Using Energy Operators,” IEEE Trans. Signal Processing, vol. 41, no. 4, pp. 1532–1550, April 1993.

[31] P. Maragos, J. F. Kaiser, and T. F. Quatieri, “Energy Separation in Signal Modulations with Application to Speech Analysis,” IEEE Trans. Signal Processing, vol. 41, no. 10, pp. 3024–3051, Oct. 1993.

[32] R. McEachern, “How the Ear Really Works,” Proc. IEEE-SP Int. Symp. on Time-Frequency and Time-Scale Analysis, pp. 437–440, Oct. 1992.

[33] R.S. McGowan, “An Aeroacoustic Approach to Phonation,” J. Acoustical Society of America, vol. 83, pp. 696–704, 1988.

[34] P.M. Morse and K.U. Ingard, Theoretical Acoustics, Princeton University Press, Princeton, NJ, 1986.

[35] K. Nathan and H. Silverman, “Time-Varying Feature Selection and Classification of Unvoiced Stop Consonants,” IEEE Trans. Speech and Audio Processing, vol. 2, no. 3, pp. 395–405, 1991.

[36] A. Papoulis, The Fourier Integral and its Applications, McGraw-Hill, New York, NY, 1962.

[37] J.W. Pitton, L.E. Atlas, and P.J. Loughlin, “Applications of Positive Time-Frequency Distributions to Speech Processing,” IEEE Trans. Speech and Audio Processing, vol. 2, no. 4, pp. 554–566, Oct. 1994.

[38] J.W. Pitton, K. Wang, and B.H. Juang, “Time-Frequency Analysis and Auditory Modeling for Automatic Recognition of Speech,” Proc. IEEE, vol. 84, no. 9, pp. 1199–1215, Sept. 1996.

[39] A. Potamianos and P. Maragos, “A Comparison of the Energy Operator and the Hilbert Transform Approach to Signal and Speech Demodulation,” Signal Processing, vol. 37, pp. 95–120, May 1994.

[40] T.F. Quatieri, T.E. Hanna, and G.C. O’Leary, “AM-FM Separation Using Auditory-Motivated Filters,” IEEE Trans. Speech and Audio Processing, vol. 5, no. 5, pp. 465–480, Sept. 1997.

[41] K. Saberi and E.R. Hafter, “A Common Neural Code for Frequency- and Amplitude-Modulated Sounds,” Nature, vol. 374, pp. 537–539, April 1995.

[42] P.M. Sellick, R. Patuzzi, and B.M. Johnstone, “Measurement of Basilar Membrane Motion in the Guinea Pig Using the Mössbauer Technique,” J. Acoustical Society of America, vol. 72, no. 1, pp. 1788–1803, July 1982.

[43] C.H. Shadle, A. Barney, and P.O.A.L. Davies, “Fluid Flow in a Dynamical Mechanical Model of the Vocal Folds and Tract, 2: Implications for Speech Production Studies,” J. Acoustical Society of America, vol. 105, no. 1, pp. 456–466, Jan. 1999.

[44] D.J. Sinder, Speech Synthesis Using an Aeroacoustic Fricative Model, Ph.D. Thesis, Rutgers University, New Brunswick, NJ, Oct. 1999.

[45] R. Smits, “Accuracy of Quasistationary Analysis of Highly Dynamic Speech Signals,” J. Acoustical Society of America, vol. 96, no. 6, pp. 3401–3415, Dec. 1994.

[46] H.M. Teager, “Some Observations on Oral Air Flow During Phonation,” IEEE Trans. Acoustics, Speech, Signal Processing, vol. ASSP–28, no. 5, pp. 599–601, Oct. 1980.

[47] H.M. Teager and S.M. Teager, “A Phenomenological Model for Vowel Production in the Vocal Tract,” chapter in Speech Science: Recent Advances, R.G. Daniloff, ed., College-Hill Press, San Diego, CA, pp. 73–109, 1985.

[48] H.M. Teager and S.M. Teager, “Evidence for Nonlinear Sound Production Mechanisms in the Vocal Tract,” chapter in Speech Production and Speech Modelling, W.J. Hardcastle and A. Marchal, eds., NATO Adv. Study Inst. Series D, vol. 55, Bonas, France; Kluwer Academic Publishers, Boston, MA, pp. 241–262, 1990.

[49] T.J. Thomas, “A Finite Element Model of Fluid Flow in the Vocal Tract,” Computer Speech and Language, vol. 1, pp. 131–151, 1986.

[50] W.P. Torres and T.F. Quatieri, “Estimation of Modulation Based on FM-to-AM Transduction: Two-Sinusoid Case,” IEEE Trans. Signal Processing, vol. 47, no. 11, pp. 3084–3097, Nov. 1999.

[51] B. van der Pol, “Frequency Modulation,” Proc. IRE, vol. 18, pp. 1194–1205, July 1930.

[52] J. Ville, “Theorie et Applications de la Notion de Signal Analytique,” Cables et Transmissions, vol. 2A, pp. 61–74, 1948. Translated from French by I. Selin, “Theory and Applications of the Notion of Complex Signal,” RAND Corporation Technical Report T–92, Santa Monica, CA, 1958.

[53] E.P. Wigner, “On the Quantum Correction for Thermodynamic Equilibrium,” Physical Review, vol. 40, pp. 749–759, 1932.

[54] X. Yang, K. Wang, and S.A. Shamma, “Auditory Representations of Acoustic Signals,” IEEE Trans. Information Theory, vol. 38, no. 2, pp. 824–839, March 1992.
