Chapter 9
Sinusoidal Analysis/Synthesis

9.1 Introduction

We have seen in Chapters 5 and 6 that one approach to the analysis and synthesis of speech signals is to use the speech production model in which speech is viewed as the result of passing a source excitation waveform through a time-varying linear filter that models the resonant cavities of the vocal tract. In certain applications, it suffices to assume that the source function can be in one of two possible states corresponding to voiced or unvoiced speech. We have referred to these approaches of analysis/synthesis as "model-based." In Chapters 7 and 8, on the other hand, we introduced methods of analysis and synthesis that depend on Fourier transform and filter-bank representations and are less model-based in the sense of having less dependence on the state of the source and system. In the filter-bank-based phase vocoder, in particular, filter-bank outputs were assumed to be AM-FM sinewaves regardless of the source, although more meaningful mathematical models of filter outputs were derived under the assumption that one input sinewave falls within a filter. The analysis in the phase vocoder does not explicitly model and estimate the sinewave components, but rather views them as outputs of a bank of uniformly-spaced bandpass filters. It was observed that the conditions under which this sinewave assumption holds were often violated for voiced speech, and a mathematical characterization of unvoiced speech in this framework was complex, thus motivating a more explicit representation and estimation of sinewaves in speech signals.

The goal of this chapter is the description of such a sinewave representation that is valid irrespective of the source state. This model, originally introduced in [30], is composed of sinusoidal components of arbitrary amplitudes, frequencies, and phases, and it results in an analysis/synthesis system based on explicit sinewave estimation. Figure 9.1 shows a schematic of the sinewave model frequency tracks that are the frequency trajectories of sinewaves from their onset to their termination. Each sinewave is represented by a time-varying envelope and a phase equal to the integral of a time-varying frequency track. In voiced speech, such as vowels, the sinewave frequency tracks are roughly harmonically related and linger over long durations. Noise-like and transient sounds such as fricatives and plosives, although aharmonic, can nevertheless also be represented approximately by a sum of sinewaves, but generally without coherent phase structure; the model sinewave frequencies have arbitrary values, their tracks coming and going randomly in time over shorter durations.

Figure 9.1 The sinewave model encompasses both voiced and unvoiced speech. Frequency tracks of voiced sinewaves are approximately harmonically related, while unvoiced tracks typically have no such relation and come and go randomly over short durations.


A number of other approaches to analysis/synthesis that are based on explicit sinewave models, in addition to the approach of this chapter, have been discussed in the literature. Hedelin [16] proposed a sinewave model for use in quantizing the baseband signal for speech coding. The amplitudes and phases of the underlying sinewaves are explicitly estimated using Kalman filtering techniques, and each sinewave phase is defined to be the integral of the associated instantaneous frequency. As in the original phase vocoder, absolute phase information is lost. Another sinewave speech coding system has been developed by Almeida and Silva [1]. In contrast to Hedelin’s approach, their system uses a pitch estimate during voiced speech to establish a harmonic set of sinewaves. The sinewave phases are computed at harmonic frequencies from the short-time Fourier transform. In another implementation of this system, Marques and Almeida model unvoiced speech using a set of narrowband basis functions [25]. Yet another approach to modeling unvoiced speech in the context of the sinewave model is to explicitly generate noise via the linear filtering of white noise whenever unvoiced speech components are detected in different frequency bands. This approach, developed by Griffin and Lim [14], is referred to as the “multi-band excitation vocoder” and uses overlap-add reconstruction for synthesizing speech in unvoiced bands. For voiced bands the system uses a sinewave analysis/synthesis similar to that described in this chapter.

In this chapter, we describe an approach to sinewave modeling and analysis/synthesis that lends itself to determining sinewave frequency tracks through frequency matching [30], phase coherence during voicing through a source and vocal tract filter phase model [39],[40], and estimation of noise-like components by the use of an additive model of deterministic and stochastic signal contributions [50],[51]. In Section 9.2, the sinewave model is developed. In Section 9.3, in the analysis stage, the amplitudes, frequencies, and phases of the sinewave model are estimated, while in Section 9.4, in the synthesis stage, these parameters are processed to produce a synthetic waveform that is essentially perceptually indistinguishable from the original. This sinewave analysis/synthesis baseline system forms the foundation for the remainder of the chapter. In Section 9.4, some applications of this baseline system, including signal modification, splicing, and estimation of vibrato, are also presented. This section ends with an overview of time-frequency resolution considerations for sinewave analysis that arise in these applications, including a wavelet-based front-end analysis. Although the baseline sinewave analysis/synthesis is applicable to arbitrary signals, tailoring the system to a specific class can improve performance. In Section 9.5, a source/filter phase model for quasi-periodic signals is introduced within the sinewave representation. This model is important in numerous applications, including signal modification, such as time-scale modification, and reducing the peakiness (i.e., "peak-to-rms" value) of a signal. The source/filter phase model provides phase coherence, i.e., preservation of phase relations among sinewaves, which is essential for high quality in these applications. Finally, in Section 9.7, an additive model of deterministic and stochastic components is introduced within the sinewave representation. This two-component model is particularly important for representing speech sounds with simultaneous harmonic and aharmonic contributions.

9.2 Sinusoidal Speech Model

In the linear speech production model, the continuous-time1 speech waveform s(t) is assumed to be the output of passing a source excitation waveform u(t) through a linear time-varying filter with impulse response h(t, τ) that models the characteristics of the vocal tract (Figure 9.2). Mathematically, the speech signal is expressed as

1 As in Chapter 8, a continuous-time model helps in the development of phase representations because of the need to differentiate and integrate.

$$s(t) = \int_{-\infty}^{\infty} h(t, \tau)\, u(t - \tau)\, d\tau$$

where the excitation is convolved with a different impulse response at each time t. It is proposed that the excitation u(t) be represented by a sum of sinusoids of various amplitudes, frequencies and phases:2

2 For clarity, in this chapter, we often replace the complex exponential notation $e^{j\theta}$ with the notation $\exp[j\theta]$.

(9.1)

$$u(t) = \mathrm{Re}\sum_{k=1}^{K(t)} a_k(t)\, \exp[\,j\phi_k(t)\,]$$

where the phase function is given by

$$\phi_k(t) = \int_0^t \Omega_k(\sigma)\, d\sigma + \phi_k$$

Figure 9.2 Speech waveform s(t) modeled as the output of a linear time-varying vocal tract with source excitation u(t).


and K(t) is the number of sinewave components at time t. Our sinewave model represents an arbitrary source and is not constrained to a periodic, impulsive, or white noise form. For the kth sinewave component, ak(t) and Ωk(t) represent the time-varying amplitude and frequency, and $\phi_k$ is the fixed phase offset to account for the fact that at time t = 0, the sinewaves are generally not in phase.3 The representation of sinewaves as the real part ("Re") of complex exponentials is used because it simplifies the analysis that will be required for derivation of the sinewave phase model.

3 Sinewaves are in phase at a time t = to when the sinewave peaks all occur at this time, i.e., the phase of each sinewave is a multiple of 2π at time t = to.

The vocal tract transfer function in terms of its time-varying magnitude M(t, Ω) and phase Φ(t, Ω) components is written as

(9.2)

$$H(t, \Omega) = M(t, \Omega)\, \exp[\,j\Phi(t, \Omega)\,]$$

Then if the parameters of the excitation, ak(t) and Ωk(t), are constant over the duration of the impulse response of the vocal tract filter, the speech waveform can be written as

(9.3)

$$s(t) = \mathrm{Re}\sum_{k=1}^{K(t)} a_k(t)\, M(t, \Omega_k(t))\, \exp\{\,j\,[\phi_k(t) + \Phi(t, \Omega_k(t))]\,\}$$

where each sinewave appears at time t as an eigenfunction of the system h(t, τ) (Appendix 9.A). By combining the excitation and vocal tract amplitudes and phases, the representation4 can be written more concisely as

4 In the following discussion, we often remove the “Re” notation and work with the complex version of s(t). This quadrature representation is approximately equal, as we saw in Chapter 2, to its analytic signal representation that can be obtained through the Hilbert transform.

(9.4)

$$s(t) = \mathrm{Re}\sum_{k=1}^{K(t)} A_k(t)\, \exp[\,j\theta_k(t)\,]$$

where

(9.5)

$$A_k(t) = a_k(t)\, M(t, \Omega_k(t)), \qquad \theta_k(t) = \phi_k(t) + \Phi(t, \Omega_k(t))$$

represent the amplitude and phase of the kth sinewave along the frequency trajectory Ωk(t) (referred to as an instantaneous frequency in Chapter 8). To simplify notation, the system amplitude and phase along each frequency trajectory Ωk(t) are occasionally written as

(9.6)

$$M_k(t) = M(t, \Omega_k(t)), \qquad \Phi_k(t) = \Phi(t, \Omega_k(t))$$

Equation (9.4) is the basic sinewave model that can be thought of as speech-independent, i.e., the model can be applied to any signal. The decomposition in Equation (9.5), on the other hand, is speech-dependent and will be useful for a number of applications where we require the contributions separately. The next step is to develop a robust procedure for extracting the amplitudes, frequencies, and phases of the component sinewaves from the speech waveform. Before doing so, however, we look at a few examples.

Example 9.1       Consider an idealized voiced speech signal where the excitation frequencies are harmonically related with fundamental frequency Ωo, in phase ($\phi_k = 0$), and have unity amplitude (with ak(t) = 1); in addition, the vocal tract is time-invariant. Then we can write

$$\Omega_k(t) = k\Omega_o$$

and the excitation phase as

$$\phi_k(t) = k\Omega_o t$$

The vocal tract magnitude and phase along the (fixed) excitation frequency trajectories are given by

$$M_k(t) = M(k\Omega_o), \qquad \Phi_k(t) = \Phi(k\Omega_o)$$

so that

$$s(t) = \mathrm{Re}\sum_{k=1}^{K} M(k\Omega_o)\, \exp\{\,j\,[k\Omega_o t + \Phi(k\Omega_o)]\,\}$$

We see then that each excitation sinewave has been shaped by the gain and phase of the system function, i.e., each input sinewave is shifted by the phase of H(Ω) and multiplied by the magnitude of H(Ω). ▪

Example 9.2       Consider the idealized voiced speech signal of Example 9.1 where the excitation frequencies are harmonically related and the vocal tract is time-invariant, but now suppose that the pitch is linearly changing in time, i.e.,

$$\Omega_k(t) = k(\Omega_o + ct)$$

where c is a constant that determines the rate of change of pitch. Then the excitation phase is expressed as

$$\phi_k(t) = \int_0^t k(\Omega_o + c\sigma)\, d\sigma = k\left(\Omega_o t + \frac{c}{2}t^2\right)$$

which is a quadratically changing phase function. Therefore,

$$s(t) = \mathrm{Re}\sum_{k=1}^{K} M[\Omega_k(t)]\, \exp\left\{ j\left[ k\left(\Omega_o t + \frac{c}{2}t^2\right) + \Phi[\Omega_k(t)] \right] \right\}$$

where the system function is sampled along the time-varying frequencies of the excitation sinewaves. This model is valid only if Ωk(t) is "slowly varying" over the duration of h(t, τ) (Appendix 9.A). ▪
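
The model is easily exercised numerically. The following is a minimal sketch in Python of Examples 9.1 and 9.2: a harmonic source with linearly changing pitch passed through a fixed vocal tract, with the system sampled along the time-varying frequency tracks as in Equation (9.3). The single-resonance vocal tract and all numerical values are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Sketch of Examples 9.1-9.2: harmonic source with linearly changing
# pitch through a fixed (time-invariant) vocal tract.  All values are
# illustrative.

fs = 10000.0                        # sampling rate (Hz)
t = np.arange(0, 0.5, 1.0 / fs)     # half a second of time
Omega0 = 2 * np.pi * 120.0          # fundamental frequency (rad/s)
c = 2 * np.pi * 40.0                # pitch-change rate (rad/s per second)
K = 10                              # number of harmonics

# Hypothetical vocal tract: a single resonance near 500 Hz.
pole = 0.97 * np.exp(1j * 2 * np.pi * 500.0 / fs)

def H(Omega):
    """Frequency response at Omega (rad/s) of the one-pole-pair filter."""
    z = np.exp(1j * Omega / fs)
    return 1.0 / ((1 - pole / z) * (1 - np.conj(pole) / z))

s = np.zeros_like(t)
for k in range(1, K + 1):
    Omega_k = k * (Omega0 + c * t)               # frequency track (Ex. 9.2)
    phi_k = k * (Omega0 * t + 0.5 * c * t**2)    # quadratic excitation phase
    Hk = H(Omega_k)                              # system sampled on the track
    s += np.abs(Hk) * np.cos(phi_k + np.angle(Hk))
```

Each harmonic is shaped by the magnitude and phase of H evaluated along its own track, which is valid only under the slow-variation condition noted above.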

9.3 Estimation of Sinewave Parameters

To obtain a flavor for the analysis and synthesis problem, we begin with a simple example of analyzing and synthesizing a single sinewave.

Example 9.3       Consider the discrete-time counterpart to our continuous-time sinewave model. In particular, consider a single discrete-time sinewave of the form

$$x[n] = A\cos(\omega_o n + \phi), \qquad \omega_o = 2\pi\left(\frac{500}{10000}\right) = 0.1\pi$$

derived from a 500-Hz continuous-time signal at 10000 samples/s. We analyze x[n] with the discrete STFT at a frame interval of L = 200 samples (20 ms), with an analysis window of the same length Nw = L. From Figure 9.3, which shows one slice of the STFT magnitude of the signal, we are motivated to form an estimate of the amplitude and frequency of the sinewave as the amplitude and frequency of the spectral maximum. Denote these estimates for the lth frame by $\hat{A}^l$ and $\hat{\omega}^l$, respectively. We can then synthesize the sinewave over the lth frame as

$$\hat{x}^l[n] = \hat{A}^l \cos(\hat{\omega}^l n)$$

which represents an estimate of x[n] to within a phase offset. Therefore, if we perform this analysis and synthesis over successive frames, we obtain a waveform discontinuity at frame boundaries. This discontinuity would yield an annoying 50-Hz "buzz" corresponding to the frame rate. Alternatively, we can retain an estimate of the original (unwrapped) phase function, $\hat{\theta}^l[n]$, by a cumulative sum of the frequency estimate over successive frames, i.e.,

(9.7)

$$\hat{\theta}^l[n] = \hat{\theta}^{l-1}[L] + \hat{\omega}^l\, n, \qquad 0 \le n < L$$

Figure 9.3 Spectral analysis of a single sinewave: (a) waveform; (b) spectral magnitude of the short-time segment in (a). A 25-ms Hamming window was applied.


where l denotes the frame number and where $\hat{\theta}^{l-1}[L]$ is the phase at the right boundary of the previous frame. Our new estimate for the lth frame is then given by

$$\hat{x}^l[n] = \hat{A}^l \cos(\hat{\theta}^l[n])$$

This reconstruction is inaccurate by an absolute phase offset (the offset φ in x[n]) when we initialize the process with zero phase. An alternative is to estimate the phase offset on each frame as the phase of the spectral peak. This method, however, does not ensure waveform continuity because of the limited frequency accuracy obtained with the DFT. Moreover, when sinewave parameters are time-varying, an even more acute waveform discontinuity arises (Exercise 9.2).

Observe that we can extend the approach in Equation (9.7) to multiple stationary sinewaves. With voiced speech, for example, the magnitude peaks of the STFT, with a long enough analysis window, provide estimates of all the harmonically related sinewave amplitudes and frequencies. In the reconstruction of each sinewave, however, we lose not only absolute phase, but also the phase relations across sinewaves when the initial sinewave phases are set to zero (Exercise 9.20). Recall that we encountered a similar loss of phase coherence in the phase vocoder synthesis and thus a change in the waveform shape that we earlier in the text referred to as waveform "dispersion." Even when synthesis is performed with the correct phase offsets, inaccuracies in the frequency measurements $\hat{\omega}^l$ made with the DFT result in dispersion. ▪
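
A sketch of the analysis/synthesis of Example 9.3 is given below, assuming a rectangular analysis window, DFT peak-picking, and the cumulative phase of Equation (9.7); the scaling and looping conventions are our own illustrative choices. Because the DFT quantizes frequency, the synthesized waveform drifts slowly in phase relative to the original, which is the absolute phase offset discussed above.

```python
import numpy as np

fs, f0 = 10000, 500.0            # sampling rate and sinewave frequency (Hz)
L = 200                          # frame interval = window length (20 ms)
n = np.arange(40 * L)
x = np.cos(2 * np.pi * f0 / fs * n)

N = 1024                         # zero-padded DFT length
phase = 0.0                      # running unwrapped phase, Eq. (9.7)
y = np.zeros_like(x)
for l in range(40):
    seg = x[l * L:(l + 1) * L]
    X = np.fft.rfft(seg, N)
    k = np.argmax(np.abs(X))                  # spectral maximum
    A = 2 * np.abs(X[k]) / L                  # amplitude estimate (real signal)
    w = 2 * np.pi * k / N                     # frequency estimate (rad/sample)
    m = np.arange(L)
    y[l * L:(l + 1) * L] = A * np.cos(phase + w * m)
    phase += w * L                            # carry phase across the boundary
```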

The general analysis/synthesis problem is to take a speech waveform, extract the parameters that represent a quasi-stationary portion of that waveform, and use those parameters to reconstruct an approximation that is "as close as possible" to the original speech. The general estimation problem in which the speech signal is to be represented by a sum of sinewaves is a difficult one to solve analytically; hence the approach taken here is pragmatic in the sense that an estimator is derived based on a set of idealized assumptions; then, once the structure of the ideal estimator is understood, modifications are made as the assumptions are relaxed to better match the real speech waveform. Looking ahead to Figure 9.7, we see that the speech waveform is short-time Fourier analyzed to obtain amplitudes, frequencies, and phases at multiple spectral peaks. The waveform is then reconstructed by interpolating these parameters across successive frames and modulating the sinewaves with the resulting functions.

As a first step, and as indicated in Example 9.3, the continuous-time axis is subdivided into a contiguous sequence of frames analyzed with a time window of length T. The center of the analysis window for the lth frame occurs at time tl. We assume that the source excitation and vocal tract parameters are constant over the window duration and so are constant over the interval $I = [t_l - T/2,\; t_l + T/2]$.5 Then we can write the kth amplitude and frequency function of our sinewave model in Equations (9.4) and (9.5) as constant for the lth frame, i.e.,

5 In the speech context, this implies that the source and vocal tract parameters are constant over a time that includes the duration of the analysis window and the duration of the vocal tract impulse response.

$$A_k(t) = A_k^l, \qquad \Omega_k(t) = \Omega_k^l, \qquad t \in I$$

Then the sinewave phase in Equation (9.5) can be written as

(9.8)

$$\theta_k(t) = \Omega_k^l\,(t - t_l) + \theta_k^l$$

where $\theta_k^l$ denotes the phase of the kth sinewave at the center of the lth frame at time $t = t_l$ (Figure 9.4a)6 and is a function of the source phase offset $\phi_k$ and the vocal tract phase $\Phi(t, \Omega)$ at time $t_l$, i.e., $\theta_k^l = \phi_k(t_l) + \Phi(t_l, \Omega_k^l)$. For frame l, over the time interval I, we then write the waveform as

6 This point may cause confusion. Observe that the phase of the short-time Fourier transform at spectral peaks is a measurement of the phase offset of each sinewave relative to the frame center tl.

$$s(t) = \mathrm{Re}\sum_{k=1}^{K_l} A_k^l \exp\{\,j\,[\Omega_k^l\,(t - t_l) + \theta_k^l]\,\}, \qquad t \in I$$

where Kl = K(tl) and where, because we assume the frequency is constant over the analysis interval, the phase is linear with slope $\Omega_k^l$.

To arrive at a discrete-time formulation, we perform the following two operations:

1. Shift the time interval to the origin, i.e., with t′ = t − tl, we have

$$s(t' + t_l) = \mathrm{Re}\sum_{k=1}^{K_l} A_k^l \exp\{\,j\,[\Omega_k^l\, t' + \theta_k^l]\,\}$$

2. Convert to discrete time (eliminating prime notation)

(9.9)

$$s[n] = \mathrm{Re}\sum_{k=1}^{K_l} A_k^l \exp\{\,j\,[\omega_k^l\, n + \theta_k^l]\,\}, \qquad -\frac{N_w - 1}{2} \le n \le \frac{N_w - 1}{2}$$

where Nw is the discrete-time window duration (assumed odd for simplicity of presentation) and $\omega_k^l$ denotes the discrete-time frequency of the kth sinewave. Equation (9.9) is a discrete-time stationary model for analysis with which the goal is to estimate the sinewave parameters $A_k^l$, $\omega_k^l$, and $\theta_k^l$ on each frame. Finally, Equation (9.9) leads to an expression for the synthetic speech waveform over frame l as

(9.10)

$$s[n] = \mathrm{Re}\sum_{k=1}^{K_l} \gamma_k^l\, \exp(j\omega_k^l n)$$

where $\gamma_k^l = A_k^l \exp(j\theta_k^l)$ represents the complex amplitude of the kth component of the $K_l = K(t_l)$ sinewaves, and can be thought of as a measurement referenced to continuous time tl, the center of the analysis window. Figure 9.4 illustrates the entire process of going from continuous time to discrete time.

The problem now is to fit the synthetic speech waveform in Equation (9.10) to the original measured waveform, denoted here by y[n]. In particular, and to summarize the above idealization, our goal in analysis is to estimate the sinewave parameters that we have assumed constant over each frame. Our stationary assumption corresponds to amplitude, frequency, and phase functions in discrete time given by

$$A_k[n] = A_k^l, \qquad \omega_k[n] = \omega_k^l, \qquad \theta_k[n] = \omega_k^l\, n + \theta_k^l$$

where $\theta_k^l$ is the phase offset measured at the center of the lth frame. The parameters $A_k^l$, $\omega_k^l$, and $\theta_k^l$ are the unknown parameters in the discrete-time model of Equation (9.10). In order to specify a criterion for the goodness of fit of the sinewave model s[n] to the speech measurement y[n], it is necessary to address whether the speech source is quasi-periodic, noisy, or impulsive, which are the three basic source classes introduced in Chapter 3.

9.3.1 Voiced Speech

A useful criterion for judging the goodness of fit is the mean-squared error (MSE) defined as7

7 In this section, because the sinewaves in the model s[n] are considered in complex quadrature form (approximately analytic signals), y[n] is also considered in this complex form. As we saw in Chapter 2, an analytic signal representation is obtained by removing negative frequencies from a signal.

(9.11)

$$\epsilon^l = \sum_{n=-(N_w-1)/2}^{(N_w-1)/2} \left| y[n] - s[n] \right|^2$$

Figure 9.4 Discrete-time conversion of speech segment used in sinewave analysis: (a) original waveform over the analysis window duration. The segment center occurs at time tl; (b) short-time segment shifted by tl; (c) discrete-time version of (b) with window duration Nw assumed odd.


where s[n] is the sinewave model, y[n] is the measurement, and the superscript in $\epsilon^l$ refers to the lth frame. Substituting the speech model of Equation (9.10) into Equation (9.11) leads to the following expression for the MSE:

(9.12)

$$\epsilon^l = \sum_{n} |y[n]|^2 - 2\,\mathrm{Re}\sum_{k=1}^{K_l} \gamma_k^{l*}\, Y(\omega_k^l) + N_w \sum_{k=1}^{K_l}\sum_{m=1}^{K_l} \gamma_k^l\, \gamma_m^{l*}\, \mathrm{sinc}(\omega_k^l - \omega_m^l)$$

where

$$\mathrm{sinc}(\omega) = \frac{\sin(\omega N_w/2)}{N_w \sin(\omega/2)}$$

is the Dirichlet kernel normalized so that sinc(0) = 1, and Y(ω) is the transform of the measurement defined in Equation (9.14) below.

The problem now is to try to identify a set of sinewaves that minimizes Equation (9.12); this identification problem, in general, is difficult to solve because the MSE $\epsilon^l$ is nonlinear in the desired sinewave parameters. Insights into the development of a suitable estimator can be obtained by restricting the class of input speech signals to perfectly voiced speech, in which case s[n] is not only deterministic but also harmonic. In this ideal case, we derive an estimator and then use this estimator in non-ideal cases, for example, nearly harmonic voiced speech and aharmonic unvoiced speech. We finally modify the estimator (if needed) to handle real speech; here we rely on our intuition and experience.

In the ideal voiced case, Equation (9.10) can be written as

$$s[n] = \mathrm{Re}\sum_{k=1}^{K_l} \gamma_k^l\, \exp(jk\omega_0^l n)$$

where $\omega_0^l = 2\pi / P_l$ and where Pl is the pitch period, which is assumed to be constant over the analysis window duration of the lth frame. For the purpose of establishing the structure of the ideal estimator, it is further assumed that the pitch period is known and that the width of the analysis window is a multiple of Pl. Under these idealized conditions, $\omega_k^l = k\omega_0^l$, and the sinc(·) function in the last term in Equation (9.12) reduces to

$$\mathrm{sinc}(\omega_k^l - \omega_m^l) = \mathrm{sinc}[(k - m)\omega_0^l] = \delta[k - m]$$

where δ[k − m] = 1 if k = m and δ[k − m] = 0 if k ≠ m. Then the expression for the MSE becomes

(9.13)

$$\epsilon^l = \sum_{n} |y[n]|^2 - 2\,\mathrm{Re}\sum_{k=1}^{K_l} \gamma_k^{l*}\, Y(k\omega_0^l) + N_w \sum_{k=1}^{K_l} |\gamma_k^l|^2$$

where

(9.14)

$$Y(\omega) = \sum_{n=-(N_w-1)/2}^{(N_w-1)/2} y[n]\, e^{-j\omega n}$$

that can be thought of as one time slice of the short-time Fourier transform (STFT) of the speech measurement for a rectangular analysis window centered at the time origin. By completing the square in Equation (9.13), the MSE can be written as

$$\epsilon^l = \sum_{n} |y[n]|^2 - \frac{1}{N_w}\sum_{k=1}^{K_l} \left| Y(k\omega_0^l) \right|^2 + N_w \sum_{k=1}^{K_l} \left| \gamma_k^l - \frac{1}{N_w} Y(k\omega_0^l) \right|^2$$

which can be minimized by choosing as the estimate for the complex amplitudes,

(9.15)

$$\hat{\gamma}_k^l = \frac{1}{N_w}\, Y(k\omega_0^l)$$

which reduces the MSE to

$$\epsilon^l = \sum_{n} |y[n]|^2 - \frac{1}{N_w}\sum_{k=1}^{K_l} \left| Y(k\omega_0^l) \right|^2$$

From this it follows that the error is minimized by selecting all of the available harmonic frequencies in the speech bandwidth, i.e., all harmonics $k\omega_0^l \le \pi$.

Equations (9.14) and (9.15) completely specify the structure of the ideal estimator and show that the optimum estimator is obtained through the discrete-time Fourier transform. Although these results are equivalent to a Fourier series representation of a periodic waveform, the above equations lend themselves to an intuitive generalization for the practical case. To see this, consider the squared STFT magnitude, denoted by |Y(ω)|². For the perfectly voiced speech case, this function is pulse-like in nature, with peaks occurring at all of the harmonics $k\omega_0^l$. Therefore, the frequencies of the underlying sinewaves correspond to the locations of the peaks of |Y(ω)|², and the estimates of the amplitudes and phases are obtained by evaluating the real and imaginary parts of the STFT at the frequencies of the peaks. The advantage of this latter interpretation of the estimator structure is that it can be applied when the ideal voiced speech assumption is no longer valid, namely when the sinewaves are aharmonic but with constant frequencies. To support this assumption, consider the STFT for the general sinusoidal speech model in Equation (9.10). In this case, the STFT (for the one time slice) is simply

$$Y(\omega) = N_w \sum_{k=1}^{K_l} \gamma_k^l\, \mathrm{sinc}(\omega - \omega_k^l)$$

Provided the analysis window is “wide enough” so that

(9.16)

$$\mathrm{sinc}(\omega_k^l - \omega_m^l) \approx 0 \qquad \text{for } k \ne m$$

then the Fourier transform magnitude squared can be written as

$$|Y(\omega)|^2 \approx N_w^2 \sum_{k=1}^{K_l} |\gamma_k^l|^2\, \mathrm{sinc}^2(\omega - \omega_k^l)$$

As before, the location of the peaks of the spectral magnitude corresponds to the sinewave frequencies, and the samples of the STFT at these frequencies correspond to the complex amplitudes. Therefore, the structure of the ideal estimator applies to a more general class of speech waveforms as long as Equation (9.16) holds. Since, during steady voicing, neighboring frequencies are separated by the fundamental frequency, Equation (9.16) suggests that the desired resolution can be achieved “most of the time” by requiring that the analysis window be at least two pitch periods wide.

Example 9.4       Figure 9.5a shows an example of the spectral magnitude of a voiced speech waveform over a 25-ms analysis window. In this example, the spectral peaks, estimated as DFT values larger than their two adjacent neighbors, occur primarily near the harmonics. However, beyond about 3000 Hz, peak locations become aharmonic due to the presence of aspiration noise. Peaks can also occur between harmonics when a noise component is present simultaneously with voicing. The sinewave representation of noise is addressed in the following section. ▪

Figure 9.5 Typical STFT magnitude of voiced and unvoiced (fricative) speech: (a) voiced with an aspiration component; (b) unvoiced. Spectral peaks, whose locations are denoted by the crosses, determine which frequencies are selected to represent the speech waveform.


9.3.2 Unvoiced Speech

Speech generated with stochastic (noisy) and impulsive sources makes up the unvoiced speech category. We have seen in Chapter 3 that stochastic inputs arise, for example, with aspiration at the glottis and frication at a vocal tract constriction. In these cases, the waveform is treated as a sample function of a random process. We saw in Chapter 5 that the STFT magnitude squared, i.e., the periodogram, fluctuates about the power spectrum underlying a random process (Appendix 5.A). If we are to apply the same analysis procedure used in developing the sinewave representation for voiced speech, then we must show that sinewaves corresponding to the sample peaks of our periodogram sum to a random process whose power spectrum adequately approximates the underlying power spectrum of the unvoiced sound.8 An argument that indeed shows conditions for this property exploits the Karhunen-Loève expansion, which allows constructing a random process over a finite interval from a series expansion of harmonic sinusoids with uncorrelated complex amplitudes (thus characterized by magnitude and phase) [55].

8 An alternate approach, developed in a speech coding context and described in Chapter 12, uses a harmonically-dependent set of sinewaves with a random phase modulation [33],[34]. Yet another technique represents the signal by a sum of adjacent narrowband sinewaves of uniformly spaced center frequency with random amplitude and frequency modulation [27].

A mathematical analysis based on the Karhunen-Loève expansion shows that a sinusoidal representation is valid, and that we can use the same analysis as in the voiced speech case above, provided the frequencies are “close enough” so that the power spectrum changes slowly over consecutive frequencies. If the window width is constrained to be at least 20 ms, there will be “on the average” a set of periodogram peaks that are approximately 100 Hz apart. This condition has been shown to provide a sufficiently dense sampling to satisfy the necessary constraints while also providing samples that are roughly uncorrelated, i.e., the Karhunen-Loève expansion constraint can be ensured by not allowing samples to fall more than 100 Hz apart during unvoiced speech segments.

Example 9.5       Figure 9.5b shows an example of a typical periodogram for a frame of unvoiced speech along with the amplitudes and frequencies that are estimated using the above peak-picking procedure. The analysis window is 20 ms in duration and the periodogram peaks are "on the average" no more than about 100 Hz apart in frequency, thus roughly satisfying the Karhunen-Loève expansion constraint for representing the underlying power spectrum. Synthesis experiments using the Karhunen-Loève expansion constraints will later be described that show empirically that unvoiced speech can be modeled as a sum of sinewaves derived from periodogram peaks. ▪

Finally, we must address the representation and analysis of transient sounds, such as plosives, generated with impulsive sources which are neither quasi-periodic nor random. Here the justification of a sinewave model and the use of peaks in the STFT magnitude as an estimator for the model parameters are more empirical. The validity of the approach for this signal class is based on the observation that peak picking the STFT magnitude captures most of the spectral energy so that, together with the corresponding STFT phase, the short-time waveform character is approximately preserved. This interpretation will become more clear when one sees in the following sections that sinewave synthesis is roughly equivalent to an overlap-add of triangular-weighted short-time segments derived from the STFT peaks. A problem arises, however, in selecting an analysis window that provides adequate time-frequency resolution for both periodic and impulsive sounds. Later in this chapter, we revisit this resolution issue and also describe alternative approaches to a sinewave representation and analysis of transient sounds, such as replacing the STFT with a constant-Q wavelet transform.

9.3.3 Analysis System

This section details the specific methods used to implement the sinusoidal analyzer which are common to all of the applications discussed later in the chapter. The analysis in the preceding section implicitly assumed that the STFT was computed using a rectangular window. Returning to the standard STFT notation, we write the STFT of y[n] as

$$Y(n, \omega) = \sum_{m=-\infty}^{\infty} w[n - m]\, y[m]\, e^{-j\omega m}$$

where w[n] represents the analysis window. Since the poor sidelobe performance of the rectangular window compromises the performance of the estimator, in all of the experiments described in this chapter the Hamming window is used to weight the measured data. While this results in a satisfactory sidelobe structure, it does so at the expense of broadening the main lobes of the spectral estimator. Therefore, in order to maintain the resolution properties that are needed to justify the optimality properties of the spectral processor, the constraint implied by Equation (9.16) is revised to require that the window width, Nw, be two and one-half times the average pitch period or 20 ms, whichever is the larger. Then the pitch-adaptive Hamming window is computed using

$$w[n] = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N_w - 1}\right), \qquad 0 \le n \le N_w - 1$$

and normalized according to

$$w[n] \leftarrow \frac{w[n]}{\displaystyle\sum_{m=0}^{N_w - 1} w[m]}$$

so that the spectral peak estimate $\hat{A}_k = |Y(\hat{\omega}_k)|$ yields the amplitude of an underlying sinewave (Exercise 9.1). The Hamming window is then applied to the speech segment, and the windowed segment is zero-padded to the specified length of the DFT. For high-quality analysis/synthesis, a 1024-point DFT is found adequate.

It should be noted that the placement of the analysis window w[n] relative to the time origin is important for computing the phases. Typically, in frame-sequential processing, the window of length Nw lies in the interval 0 ≤ n < Nw and is symmetric about (Nw − 1)/2, a placement and shape that give it a linear Fourier transform phase equal to −ω(Nw − 1)/2. Since Nw is on the order of 200–500 discrete-time samples, any error in the estimated frequencies results in a large random phase error and consequent distortion, perceived as hoarseness, in the reconstruction. An error of one DFT sample, for example, results in a phase error of $\pi(N_w - 1)/N$ (where N is the DFT length), which can be on the order of π. Experiments have shown that the ear is very sensitive to short-term phase jitter whenever the phase error is greater than ~ π/16 radians. To eliminate the linear-phase term, the windowed speech must be circularly shifted before the DFT is taken so that its center is at n = 0. To do this, the Hamming window is placed symmetric relative to an origin defined as the center of the current analysis frame; hence the window takes on values over the interval −(Nw − 1)/2 ≤ n ≤ (Nw − 1)/2, rather than the more common interval 0 ≤ n < Nw. As shown in Figure 9.6, the speech values at negative values of n are wrapped around to the end of the FFT buffer.
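
A sketch of this circular shift is given below (Python with numpy); the helper name and buffer conventions are illustrative, but the wraparound follows Figure 9.6.

```python
import numpy as np

def centered_windowed_segment(x, center, window, Nfft):
    """Window a segment of x symmetrically about `center` and circularly
    shift it so the window midpoint lands at n = 0 of the FFT buffer,
    removing the linear-phase term -omega*(Nw - 1)/2 (cf. Figure 9.6)."""
    Nw = len(window)                 # assumed odd
    half = (Nw - 1) // 2
    seg = x[center - half:center + half + 1] * window
    buf = np.zeros(Nfft)
    buf[:half + 1] = seg[half:]      # frame-center sample goes to n = 0
    buf[-half:] = seg[:half]         # negative-time samples wrap to the end
    return buf
```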

Figure 9.6 Circularly shifting the windowed speech segment of length Nw to reduce linear phase error due to DFT frequency sampling. The DFT length is denoted by N.


Figure 9.7 Block diagram of the baseline sinusoidal analysis/synthesis system.

SOURCE: R.J. McAulay and T.F. Quatieri, “Speech Analysis-Synthesis Based on a Sinusoidal Representation” [30]. ©1986, IEEE. Used by permission.


After the circular shift, the practical version of the idealized estimator obtains the frequencies of the underlying sinewaves as the locations of the peaks of $|Y(\omega)|$, i.e., the DFT samples at which the slope changes from positive to negative. Denoting these frequency estimates by $\hat{\omega}_k$, the corresponding amplitudes and phases are given by

$$\hat{A}_k = |Y(\hat{\omega}_k)|, \qquad \hat{\theta}_k = \angle Y(\hat{\omega}_k)$$

Henceforth, we drop the “hat” notation, for simplicity, unless needed. A block diagram of the sinewave analysis system is shown in Figure 9.7.
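
Putting the pieces of this section together, one analysis frame might be sketched as below. This is a simplified illustration, not the exact system of [30]: it reuses the centered_windowed_segment helper from the previous sketch, applies the window-length rule of this section, and uses the peak-picking rule (a DFT magnitude sample larger than both neighbors). The factor of two in the amplitude accounts for working with a real rather than analytic signal.

```python
import numpy as np

def sinewave_analysis(x, center, pitch_period, fs, Nfft=1024):
    """Estimate sinewave amplitudes, frequencies (rad/sample), and phases
    for the frame centered at `center` (a sketch of Figure 9.7)."""
    # pitch-adaptive window: 2.5 x average pitch period or 20 ms
    Nw = max(int(2.5 * pitch_period), int(0.02 * fs))
    Nw += 1 - (Nw % 2)                       # force odd length
    w = np.hamming(Nw)
    w /= w.sum()                             # normalize so peak ~ amplitude
    buf = centered_windowed_segment(x, center, w, Nfft)
    Y = np.fft.rfft(buf)
    mag = np.abs(Y)
    # interior local maxima: slope changes from positive to negative
    pk = np.where((mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:]))[0] + 1
    freqs = 2 * np.pi * pk / Nfft
    amps = 2 * mag[pk]                       # x2 for a real (not analytic) signal
    phases = np.angle(Y[pk])
    return amps, freqs, phases
```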

9.3.4 Frame-to-Frame Peak Matching

The above analysis provides a heuristic justification for the representation of the speech waveform in terms of the amplitudes, frequencies, and phases of a set of sinewaves that applies to one analysis frame. As speech evolves from frame to frame, different sets of these parameters will be obtained. The next problem to address is the association of amplitudes, frequencies, and phases measured on one frame with those that are obtained on a successive frame in order to define a set of sinewaves that will be continuously evolving in time. If the number of peaks were constant from frame to frame, the problem of matching the parameters estimated on one frame with those on a successive frame would simply require a frequency-ordered assignment of peaks. In practice, however, the locations of the peaks change as the pitch changes, and there are rapid changes in both the location and the number of peaks during rapidly varying speech, such as at voiced/unvoiced transitions.

Figure 9.8 Problem of frequency matching: (a) slowly varying pitch; (b) rapidly varying pitch; (c) rapid voiced/unvoiced transition.


To illustrate the problem of peak matching, suppose that three spectral peaks are found on two consecutive frames l and l+1 in a slowly varying voiced region with a small change in pitch, as shown in Figure 9.8a. As indicated, the matching requires a simple ordered alignment of the frequency estimates $\omega_k^l$ with $\omega_k^{l+1}$. The matching of the amplitude and phase values implicitly follows. In practice, however, peak locations change as the pitch and spectrum change. For example, suppose that the pitch quickly decreases over two consecutive frames. Then, as shown in Figure 9.8b, the number of peaks abruptly increases. A rapid change in not only the number of peaks but also their location can occur in rapidly varying regions such as voiced to unvoiced transitions, as shown in Figure 9.8c.

In order to account for such rapid movements in the spectral peaks and an unequal number of peaks from frame to frame, the concepts of birth and death of sinusoidal components are introduced. The problem of matching spectral peaks in some optimal sense, while allowing for this birth/death process, is generally a difficult problem. One method, which has proven to be successful for speech reconstruction, is to define sinewave tracks for frequencies9 that are successively nearest neighbors, conditioned on each frequency in the current frame falling within a matching interval, [− Δ, Δ], of its matched frequency on the previous frame (Figure 9.9). The matching procedure is made dynamic by allowing for tracks to begin at any frame (a birth) and to terminate at any frame (a death); these events are determined when successive frequencies do not fall within the matching interval. The algorithm, although straightforward, is a rather tedious exercise in rule-based programming, and the reader is referred to [30] for a detailed understanding of the algorithm for actual software development. An illustration of the matching algorithm showing how the birth/death procedure accounts for rapidly varying peak locations is given in Figure 9.10.

9 A more sophisticated matcher may use amplitude and phase, as well as frequency, information.

Figure 9.9 The matching interval condition used in nearest-neighbor sinewave frequency matching.


Figure 9.10 Different modes used in the birth/death frequency-matching process for determining frequency tracks. Note the death of two tracks during frames one and three and the birth of a track during the second frame.

SOURCE: R.J. McAulay and T.F. Quatieri, “Low Rate Speech Coding Based on the Sinusoidal Speech Model,” chapter in Advances in Speech Signal Processing [34]. ©1992, Marcel Dekker, Inc. Courtesy of Marcel Dekker, Inc.


Figure 9.11 Typical frequency tracks for real speech.

SOURCE: R.J. McAulay and T.F. Quatieri, “Speech Analysis-Synthesis Based on a Sinusoidal Representation” [30]. ©1986, IEEE. Used by permission.


Example 9.6       The result of applying the frequency matcher to a segment of real speech is shown in Figure 9.11, which illustrates the ability of the matcher to adapt quickly through transitory speech behavior such as voiced/unvoiced transitions and mixed voiced/unvoiced regions. The frame interval here is 10 ms and the frequency-matching interval width 2Δ is set at 100 Hz. ▪

Example 9.7       Suppose there are P frequencies (peaks) in frame l and Q frequencies (peaks) in frame l + 1. Suppose also that no frequency pairs meet the frequency-matching interval constraint. Then as many as P + Q frequency tracks can exist across any given frame. For example, if P = Q = 70, then a maximum of 140 sinewaves can exist in any one frame. ▪
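
A simplified version of the nearest-neighbor matcher is sketched below. The full rule-based algorithm of [30] includes refinements (e.g., letting a contested peak go to the better of two candidate tracks) that this greedy version omits.

```python
def match_peaks(freqs_prev, freqs_cur, delta):
    """Greedy nearest-neighbor frequency matching with birth/death.
    Returns continued-track index pairs plus the born and dead peaks."""
    pairs, used = [], set()
    for i, f in enumerate(freqs_prev):
        # candidates in the current frame within the matching interval
        cands = [j for j in range(len(freqs_cur))
                 if j not in used and abs(freqs_cur[j] - f) <= delta]
        if cands:
            j = min(cands, key=lambda j: abs(freqs_cur[j] - f))
            pairs.append((i, j))             # track continues
            used.add(j)
    matched_prev = {i for i, _ in pairs}
    deaths = [i for i in range(len(freqs_prev)) if i not in matched_prev]
    births = [j for j in range(len(freqs_cur)) if j not in used]
    return pairs, births, deaths
```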

9.4 Synthesis

Since a set of amplitudes, frequencies, and phases is estimated for each frame, it might seem reasonable to estimate the original speech waveform on the lth frame by generating synthetic speech using the equation

$$\hat{s}^l[n] = \sum_{k=1}^{K_l} A_k^l \cos(\omega_k^l\, n + \theta_k^l), \qquad 0 \le n < L$$

where L is the length of the synthesis frame. As we indicated in Example 9.3, due to the time-varying nature of the parameters, however, this straightforward approach leads to discontinuities at the frame boundaries which seriously degrade the quality of the synthetic speech. Therefore, some provision must be made for smoothly interpolating the parameters measured from one frame to those that are obtained on the next. Several methods have been developed to accomplish the necessary interpolation.

9.4.1 Cubic Phase Interpolation

As a result of the frequency-matching algorithm described in the previous section, all of the parameters measured for an arbitrary frame l have been associated with a corresponding set of parameters for frame l + 1. Letting $\{A_k^l, \omega_k^l, \theta_k^l\}$ and $\{A_k^{l+1}, \omega_k^{l+1}, \theta_k^{l+1}\}$ denote the successive sets of parameters for the kth frequency track, an obvious solution to the amplitude interpolation problem is to take

(9.17)

$$\hat{A}_k[n] = A_k^l + \left(\frac{A_k^{l+1} - A_k^l}{L}\right) n, \qquad 0 \le n < L$$

where the analysis time origin is the center of the analysis frame so that in Equation (9.17) we are reconstructing half of each of the lth and (l + 1 )st frames.

Unfortunately, such a simple approach cannot be used to interpolate the phase because the phases θl and θl+1 are measured modulo 2π. We illustrate this problem with the following example:

Example 9.8       Consider a time-invariant vocal tract driven by a source consisting of sinewaves with fixed frequencies and amplitudes so that (in continuous time) Ωk(t) = Ωk. Then the phase of the speech sinewaves is expressed as

$$\theta_k(t) = \Omega_k\, t + \phi_k + \Phi_k$$

where $\Phi_k = \Phi(\Omega_k)$ is the time-invariant vocal tract phase component of the phase offset. Figure 9.12 shows that when Ωk is large, the sinewave phase θk(t) moves rapidly; therefore, the phase over successive frames moves through multiples of 2π, which results in principal phase measurements with 2π discontinuities. Interpolation cannot be performed with samples of such a wrapped phase function. ▪

Hence, phase unwrapping must be performed jointly with interpolation to ensure that the frequency trajectories are meaningful across frame boundaries.

The first step in solving this problem is to postulate a phase interpolation function that is a cubic polynomial.10 For simplicity, we omit the sinewave index k and for convenience, we write the phase as a continuous function of the time variable t, with t = 0 corresponding to the center of frame l and t = T corresponding to the center of frame l + 1:

10 The idea of applying a cubic polynomial to interpolate the phase between frame boundaries was independently proposed in [30] for synthesis of sinewaves of arbitrary frequency and in [1] for harmonic sinewave synthesis.

(9.18)

$$\theta(t) = \zeta + \gamma t + \alpha t^2 + \beta t^3$$

Figure 9.12 Phase functions for Example 9.8: (a) phase moves quickly when Ωk is large; (b) samples of wrapped phase computed modulo 2π. The thick dashed line in panel (b) represents interpolation from wrapped phase samples that would result in an incorrect phase derivative and thus an incorrect frequency trajectory.


The choice of a cubic polynomial is motivated by four constraints that must be satisfied on each frame.11 Before describing these constraints, we observe that when the vocal tract is slowly varying, the derivative of the sinewave phase is approximately the excitation frequency that we have measured from the location of spectral peaks. We can see this by expressing the sinewave phase function as

11 Other interpolation schemes satisfying these constraints also exist. Exercises 9.4 and 9.5 work through, respectively, piecewise quadratic and piecewise linear interpolating functions over a frame.

$$\theta(t) = \int_0^t \Omega(\sigma)\, d\sigma + \phi + \Phi[t, \Omega(t)]$$

so that its derivative under the assumption of a slowly varying Φ[t, Ω(t)] becomes

$$\dot{\theta}(t) \approx \Omega(t)$$

It follows that, at the center of the lth and (l + 1)st frames, the phase derivatives are given by the frequency measures on these consecutive frames, i.e.,

$$\dot{\theta}(0) = \Omega_l, \qquad \dot{\theta}(T) = \Omega_{l+1}$$

We can now formulate the four constraints on the phase polynomial.

Because the derivative of the phase is the frequency, it is necessary that the cubic phase function and its derivative equal the phase and frequency measured at the frame l. Therefore, it follows that at t = 0

$$\theta(0) = \zeta = \theta_l, \qquad \dot{\theta}(0) = \gamma = \Omega_l$$

and, as a result, Equation (9.18) can be written as

$$\theta(t) = \theta_l + \Omega_l\, t + \alpha t^2 + \beta t^3$$

Similarly, the cubic phase function and its derivative must equal the phase and frequency measured at the frame l + 1. Therefore, it follows that at t = T

(9.19)

$$\theta(T) = \theta_l + \Omega_l T + \alpha T^2 + \beta T^3 = \theta_{l+1} + 2\pi M$$
$$\dot{\theta}(T) = \Omega_l + 2\alpha T + 3\beta T^2 = \Omega_{l+1}$$

Because the terminal phase θl+1 is measured modulo 2π, it is necessary to augment it by the term 2πM where M is an unknown integer. This results in two equations in three unknowns: α, β, and M. Nevertheless, suppose we had the correct value of M. Then we can solve for α and β. In other words, at this point M is unknown, but for each value of M, Equation (9.19) can be solved for α (M) and β (M), the dependence on M being made explicit. The solution is shown to satisfy the matrix equation (Appendix 9.B)

(9.20)

$$\begin{bmatrix} \alpha(M) \\ \beta(M) \end{bmatrix} = \begin{bmatrix} \dfrac{3}{T^2} & -\dfrac{1}{T} \\ -\dfrac{2}{T^3} & \dfrac{1}{T^2} \end{bmatrix} \begin{bmatrix} \theta_{l+1} - \theta_l - \Omega_l T + 2\pi M \\ \Omega_{l+1} - \Omega_l \end{bmatrix}$$

In order to determine M and, ultimately, the solution to the phase unwrapping problem, an additional constraint needs to be imposed. Our final constraint is to make the resulting frequency function "maximally smooth," a concept that is now quantified. Figure 9.13 illustrates a typical set of cubic phase interpolation functions for a number of values of M. It seems clear on intuitive grounds that the best phase function to pick is the one that would have the least variation. This is what is meant by a maximally smooth frequency trajectory. In fact, if the frequencies were constant and the vocal tract were stationary, the true phase would be linear, its phase derivative constant, i.e., $\dot{\theta}(t) = \Omega$, and so its second derivative would be zero, i.e., $\ddot{\theta}(t) = 0$. Therefore, a reasonable criterion for smoothness is to choose M such that

$$f(M) = \int_0^T \left[ \ddot{\theta}(t; M) \right]^2 dt$$

is a minimum, where $\ddot{\theta}(t; M)$ denotes the second derivative of θ(t; M) with respect to the time variable t. Although M is integer valued, since ƒ(M) is quadratic in M, the problem is most easily solved by minimizing ƒ(x) with respect to the continuous variable x and then choosing M to be the integer closest to x. After straightforward but tedious algebra, it can be shown that the minimizing value of x is (Appendix 9.B)

$$x^* = \frac{1}{2\pi} \left[ (\theta_l + \Omega_l T - \theta_{l+1}) + (\Omega_{l+1} - \Omega_l)\frac{T}{2} \right]$$

Figure 9.13 Typical set of cubic phase interpolation functions. The phase function for M = 2 is “maximally smooth.”

SOURCE: R.J. McAulay and T.F. Quatieri, “Speech Analysis-Synthesis Based on a Sinusoidal Representation” [30]. ©1986, IEEE. Used by permission.


from which M* is determined as the nearest integer to x*. M* is then used in Equation (9.20) to compute α(M*) and β(M*) and, in turn, the unwrapped phase interpolation function is denoted by

(9.21)

$$\hat{\theta}(t) = \theta_l + \Omega_l\, t + \alpha(M^*)\, t^2 + \beta(M^*)\, t^3$$

This phase function not only satisfies all of the measured phase and frequency endpoint constraints, but also unwraps the phase in such a way that $\hat{\theta}(t)$ is maximally smooth.
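
The closed-form solution is compact enough to state directly in code. The sketch below computes x* from the expression above, rounds to the nearest integer M*, solves Equation (9.20) for α(M*) and β(M*), and evaluates the unwrapped phase of Equation (9.21); frequencies are in radians per sample and T is the frame length in samples.

```python
import numpy as np

def cubic_phase(theta_l, omega_l, theta_l1, omega_l1, T, t):
    """Maximally smooth cubic phase interpolation, Eqs. (9.18)-(9.21)."""
    # closed-form minimizer of f(x), then the nearest integer M*
    x_star = ((theta_l + omega_l * T - theta_l1)
              + (omega_l1 - omega_l) * T / 2) / (2 * np.pi)
    M = round(x_star)
    # solve Eq. (9.20) for alpha(M*) and beta(M*)
    b1 = theta_l1 + 2 * np.pi * M - theta_l - omega_l * T
    b2 = omega_l1 - omega_l
    alpha = 3 * b1 / T**2 - b2 / T
    beta = -2 * b1 / T**3 + b2 / T**2
    return theta_l + omega_l * t + alpha * t**2 + beta * t**3

# evaluate one track's phase over a 10-ms frame (T = 100 samples)
t = np.arange(100)
theta = cubic_phase(0.0, 0.30, 28.5, 0.32, 100, t)
```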

Because the above analysis began with the assumption of an initial unwrapped phase θl corresponding to frequency Ωl at the start of frame l, it is necessary to specify the initialization of the frame interpolation procedure. This is done by noting that, at some point in time, the track under study was born. When this event occurred, an amplitude, frequency, and phase were measured at frame l + 1, and the parameters at frame l to which these measurements correspond are defined by setting the amplitude to zero (i.e., Al = 0) while maintaining the same frequency (i.e., Ωl = Ωl+1). In order to ensure that the phase interpolation constraints are satisfied initially, the unwrapped phase is defined to be the measured phase θl+1 and the startup phase is defined to be

θl = θl+1 − Ωl+1T

where T (in discrete time) is the number of samples traversed in going from frame l + 1 back to frame l. This startup procedure is consistent with assuming the frequency is constant over the initial frame interval and so the cubic polynomial reduces to a linear-phase trajectory over this interval. Observe that in this case, M* = 0 because the measured phase becomes the desired unwrapped phase. We summarize the conditions at the birth of a track as12

12 It is interesting to observe that one can obtain a faster attack and decay than a frame length by beginning or ending the amplitude interpolation within a frame duration.

$$A_l = 0, \qquad \Omega_l = \Omega_{l+1}, \qquad \theta_l = \theta_{l+1} - \Omega_{l+1} T, \qquad M^* = 0$$

A similar procedure is used for terminating a frequency track.

As a result of the above phase unwrapping procedure, each frequency track has associated with it an instantaneous unwrapped phase, which accounts for both the rapid phase changes due to the source and the slowly varying phase changes due to the shape of the glottal flow and the vocal tract transfer function. Returning to discrete time, the final synthetic waveform for the lth frame is given by

$$\hat{s}^l[n] = \sum_{k=1}^{K_l} \hat{A}_k[n]\, \cos(\hat{\theta}_k[n])$$

where, for the kth sinewave, $\hat{A}_k[n]$ is given by Equation (9.17), $\hat{\theta}_k[n]$ is the sampled data version of Equation (9.21), and Kl is the number of sinewaves. A block diagram description of the sinewave synthesis system is shown in Figure 9.7. We henceforth refer to this as the baseline sinusoidal analysis/synthesis system.

9.4.2 Overlap-Add Interpolation

While the cubic phase interpolation system is the recommended synthesis technique in that it produces very high-quality synthetic speech, it is computationally expensive to implement due to the fact that every component sinewave must be synthesized on a per sample basis. An alternate procedure which has proven to be satisfactory in some speech applications is to overlap and add weighted, i.e., windowed, segments of the reconstructed waveform from one frame to the next [35].

The sinewave parameters estimated on frame l can be used to generate the waveform

(9.22)

$$s^l[n] = \sum_{k=1}^{K_l} A_k^l \cos(\omega_k^l\, n + \theta_k^l)$$

over the interval 0 ≤ n < L. Another representation of the waveform on this interval can be obtained using the sinewave parameters measured at frame l + 1:

(9.23)

$$s^{l+1}[n] = \sum_{k=1}^{K_{l+1}} A_k^{l+1} \cos[\,\omega_k^{l+1}\,(n - L) + \theta_k^{l+1}\,]$$

If the frame interval L is "short enough," sl[n] and sl+1[n] will be similar, and smoothly interpolated synthetic speech can be generated by weighting the above waveforms by a triangular window. As shown in Figure 9.14, the trailing portion (downward slope of the triangular window) of the waveform generated during frame l is overlapped and added to the leading portion (upward slope of the triangular window) of the waveform from frame l + 1. This operation can be expressed as

$$\hat{s}[n] = \left(1 - \frac{n}{L}\right) s^l[n] + \frac{n}{L}\, s^{l+1}[n], \qquad 0 \le n < L$$

Figure 9.14 Overlap and adding segments in sinusoidal synthesis.


Since the component sinewaves are now of constant frequency, FFT techniques can be used to synthesize the component waveforms in Equation (9.22) and Equation (9.23), which results in reduced computational complexity [35]. Provided that the FFT frequency quantization is small enough (~ 30 Hz) and the frame rate is high enough (~ 100 Hz), this procedure can produce very high-quality output speech. It should be noted that the method can be implemented without recourse to the frame-to-frame peak matching algorithm, which further reduces the computational complexity.
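
A direct (non-FFT) sketch of this triangular-window overlap-add follows; the FFT-based synthesis of [35] replaces the inner sinewave sums with inverse transforms but produces the same frame output.

```python
import numpy as np

def overlap_add_frame(amps0, freqs0, phases0, amps1, freqs1, phases1, L):
    """One synthesis frame from Eqs. (9.22)-(9.23): frame-l parameters are
    referenced to n = 0 and frame-(l+1) parameters to n = L."""
    n = np.arange(L)
    s0 = sum(A * np.cos(w * n + p)
             for A, w, p in zip(amps0, freqs0, phases0))
    s1 = sum(A * np.cos(w * (n - L) + p)
             for A, w, p in zip(amps1, freqs1, phases1))
    return (1 - n / L) * s0 + (n / L) * s1   # triangular cross-fade
```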

It is interesting to observe that the above overlap-add operation is also implemented by setting the frequency-matching window to zero in the matching algorithm. This is because, under this condition, sinewaves are born and die on every two consecutive frames, and over these two frames the birth/death process holds the frequency constant. Since the amplitude interpolation is linear, beginning and ending at zero for a birth and death, respectively, a triangular window of length twice the frame interval is effectively applied to each steady sinewave. This is an interesting alternative to direct overlap-add; although it is clearly not computationally efficient, it may serve to help explain the ability of the sinewave analysis/synthesis to represent short-lived and rapidly varying speech events such as plosives and unvoiced-to-voiced transitions (Exercise 9.8).

Finally, observe also that the overlap-add method can be interpreted as synthesizing a signal from a modified STFT. As a result, we might consider the least-squared error approach of Chapter 7, which resulted in a weighting of the short-time frame by the window and a normalization by the summed squared, shifted windows (Exercise 9.15).

Unfortunately, there are situations in practice when the constraint on the frame rate cannot be met and straightforward application of the overlap-add procedure leads to synthetic speech that is quite “rough.” This scenario occurs in the design of low-rate sinewave speech coders that quantize the measured phases. Since many bits must be allocated to the phase-coding procedure, the frame rate must be ~ 50 Hz, which is too slow to ensure effective interpolation of the sinewave phases. The reason for this is that the sinewave frequencies are implicitly assumed constant over the extent of the triangular window. At the 100-Hz frame rate, the triangular window is 20 ms wide, which is an interval over which the sinewave parameters can reasonably be assumed to be stationary. If the frame rate is less than 100 Hz, however, the triangular window exceeds 20 ms and the stationarity assumption begins to break down. This is not to say that sinewave synthesis cannot be performed at frame rates less than 100 Hz; quality can be regained by using the frame-to-frame peak-matching algorithm and the cubic phase interpolation technique. This suggests that there exists an interpolated set of sinewave parameters that can be used with the STFT overlap-add synthesizer and operate at an effective 100-Hz frame rate using the desired 20-ms triangular windows (Exercise 9.6) [35].

9.4.3 Examples

Non-real-time floating-point simulations have demonstrated the effectiveness of the sinusoidal approach in modeling real speech. In one simulation, the speech is lowpass-filtered at 5 kHz, digitized at 10 kHz, and analyzed at 10-ms frame intervals [30]. The speech segments are weighted using a fixed or pitch-adaptive Hamming window, circularly rotated (Figure 9.6), and the STFT is computed using a 1024-point FFT. The sinewave amplitudes and frequencies are estimated by locating the peaks of the magnitude of the STFT. The phases are then computed using the corresponding real and imaginary parts. The frequency-matching and linear amplitude and cubic phase interpolation techniques are used for synthesis. A large speech database was processed with this system, and it was found that the synthetic speech is perceived to be essentially indistinguishable from the original. Visual examination of many of the reconstructed passages shows that the waveform structure is essentially preserved, as seen in the following examples:

Example 9.9       An example of the reconstruction fidelity of sinewave analysis/synthesis is shown in Figure 9.15, which compares the waveforms for the original speech and the reconstructed speech during a vowel and two plosives ("go to"). In this case of a female speaker, the window duration is 15 ms and the frame interval is 10 ms. Some slight temporal smearing is seen at the onset of the two plosives; the first is the voiced plosive "g" in the word "go" and the second is the unvoiced plosive "t" in the word "to." ▪

Figure 9.15 Reconstruction (b) of speech waveform (a) from female speaker using sinusoidal analysis/synthesis with 15-ms window and 10-ms frame.


Example 9.10       A second example of the reconstruction fidelity is shown in Figure 9.16, which compares the waveforms for the original speech and the reconstructed speech for the "z" at the end of the word "jazz" and the word "hour" (from the phrase "jazz hour"). In this case of a very low-pitched male speaker, the window duration is 60 ms and the frame interval is 10 ms. Some slight temporal smearing is seen at the onset of the vowel "o" in the word "hour." The waveform is well-reconstructed both visually and aurally in spite of the diplophonic behavior. The diplophonia exhibits itself in the measured phase of the harmonics, as well as in amplitude modulation of the harmonics (Exercise 3.3 in Chapter 3) that is captured by the peak-picking in sinewave analysis. ▪

The fidelity of the reconstructions suggests that the quasi-stationarity conditions seem to be satisfactorily met and that the use of the parametric model based on the amplitudes, frequencies, and phases of a set of sinewave components appears to be justifiable for both voiced and unvoiced speech. Although the sinusoidal model was originally designed in the speech context for a single speaker, it can represent almost any waveform. Successful reconstruction is obtained for multi-speaker waveforms, complex musical pieces, and biologic signals such as bird and whale sounds. Other signals tested include complex acoustic signals from mechanical impacts such as a bouncing can, a slamming book, and a closing stapler [41]. These signals were selected to have a variety of time envelopes, spectral resonances, and attack and decay dynamics. In each case, the window length and frame interval are tailored to the signal, and the reconstruction is both visually and aurally nearly indistinguishable from the original; small discrepancies are found primarily at transitions and nonstationary regions where temporal resolution is limited due to the analysis window extent, as we saw with plosives and vowel onsets in the previous two examples. In addition, numerous synthetic and real background signals, including random signals (e.g., synthetic colored noise or an ocean squall) and AM-FM tonal interference (e.g., a blaring siren) were tested. The synthesized waveforms are essentially perceptually indistinguishable from the originals with little modification of the background [31],[44].

Figure 9.16 Reconstruction (b) of speech waveform (a) from a low-pitched male speaker using sinusoidal analysis/synthesis with 60-ms window and 10-ms frame.


Although high-quality analysis/synthesis of speech has been demonstrated using amplitudes, frequencies, and phases at the spectral peaks of the STFT, it is often argued that the ear is insensitive to phase. The folklore about phase insensitivity dates back to as early as von Helmholtz [17]. This proposition, however, is not consistent with the experiments in Chapter 8 that show aural sensitivity to phase modification by the phase vocoder. The proposition is also inconsistent with auditory models of perception described in that chapter. Phase measurements have also been shown to be essential to high-quality sinewave synthesis. This property can be demonstrated by performing "magnitude-only" reconstruction, replacing the cubic phase function in Equation (9.21) by a phase that is simply the integral of the instantaneous frequency, analogous to certain experiments performed with the phase vocoder. One way to do this is to make the instantaneous frequency be the linear interpolation of the frequencies measured at the frame boundaries and then perform the integration. Alternately, one can simply use the quadratic frequency derived from the cubic phase by initiating the cubic phase offset at zero upon the birth of a frequency track. A waveform synthesized by the magnitude-only system for the low-pitched speaker of Example 9.10 is given in Figure 9.17. While the speech from the magnitude-only synthesis is very intelligible and free of artifacts, it is quite different not only visually, i.e., dispersed, but also aurally because of the failure to maintain the true sinewave phases; the synthetic speech is reverberant during voicing and tonal during unvoiced speech, corresponding to the loss of phase coherence. We defined phase coherence in Chapter 8 as the preservation of the original component phase relations. For voiced speech, the aural sensitivity to loss in phase coherence is greater for low-pitched than high-pitched speech, consistent with the front-end auditory filter-bank model described in Chapter 8 in which filter output envelopes are modified more with decreasing pitch and increasing cochlear filter characteristic frequency.

Figure 9.17 Magnitude-only reconstruction of speech (b) is compared against original (a) from a low-pitched male speaker. The synthesized waveform is dispersed compared to that of the synthesis with measured phases shown in Figure 9.16.

Image

9.4.4 Applications

We now briefly describe a number of applications of the baseline sinewave analysis/synthesis that illustrate the generality of the approach.

Time-Scale Modification — In time-scale modification, the magnitudes, frequencies, and phases of the sinewave components are modified to expand the time scale of a signal without changing its spectral or frequency (pitch) characteristics. Consider a time-scale modification by a rate-change factor ρ. By time-warping the sinewave frequency tracks, i.e., forming Image, the instantaneous frequency locations are preserved while modifying their rate of change in time [38]. Since Image, this modification can be represented by a phase change in Equation (9.4) and thus the time-scaled signal can be expressed as

(9.24)

Image

where the amplitude functions are also time-warped. (Recall similar operations on filter-bank outputs in the phase vocoder.) Suppose that in the discrete-time baseline analysis/synthesis system, the analysis and synthesis frame intervals are L samples. In the discrete-time analysis/synthesis system based on the model in Equation (9.24), the synthesis interval is mapped to L′ = ρL samples. L′ is constrained to be an integer value since the synthesis frame requires an integer number of discrete samples. The modified cubic phase and linear amplitude functions, which are derived for each sinewave component, are then sampled over this longer frame interval. This modification technique has been successful in time-scaling of speech, as well as of a larger class of signals such as music, biological, and mechanical-impact signals [38]. Nevertheless, a problem arises from the inability of the system to maintain the original sinewave phase relations through θk(tρ−1); some signals can suffer from the reverberance typical of other modification systems, such as the phase vocoder and the “magnitude-only” reconstruction described in Section 7.6.1 of Chapter 7. An approach that preserves phase coherence, and thus improves quality, imposes a source/filter phase model on the sinewave components; it is described in Section 9.5. In spite of its limitations, however, the technique of Equation (9.24) and its variations remain the most general for sinewave-based time-scale modification [38]. Similar approaches, using the baseline sinewave analysis/synthesis, have also been used for frequency transformations, including frequency compression and pitch modification [38].
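To make the discrete-time bookkeeping concrete, the following sketch (Python/NumPy; the array names are hypothetical, and this is not a transcription of [38]) time-scales a single sinewave track. For brevity it forms the phase as the running integral of the interpolated frequency rather than by the cubic interpolation of Section 9.4.1, so it exhibits exactly the loss of phase coherence discussed above.

```python
import numpy as np

def time_scale_track(amps, freqs, L, rho):
    """Time-scale one sinewave track by a rate-change factor rho.

    amps, freqs : per-frame amplitude and frequency (rad/sample)
                  measured at the analysis frame boundaries.
    L           : analysis frame interval in samples.
    rho         : rate-change factor (>1 expands, <1 compresses).
    """
    Lp = int(round(rho * L))                  # synthesis frame interval L'
    out, phase = [], 0.0
    for l in range(len(amps) - 1):
        # Linear interpolation of amplitude and frequency across L' samples
        a = np.linspace(amps[l], amps[l + 1], Lp, endpoint=False)
        w = np.linspace(freqs[l], freqs[l + 1], Lp, endpoint=False)
        ph = phase + np.cumsum(w)             # phase = integral of frequency
        phase = ph[-1]
        out.append(a * np.cos(ph))
    return np.concatenate(out)
```

Summing such tracks gives the modified waveform; because the measured phase offsets are never reimposed, the sketch shares the dispersion of the magnitude-only synthesis, which is what the source/filter model of Section 9.5 is designed to repair.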

Other Applications — Other applications of the baseline system are sound splicing and morphing [41],[50],[51] using the linear amplitude and cubic phase interpolators of Section 9.4.1. Splicing the plosive of one speech segment onto the sustained portion of a vowel from a different speech segment, for example, can test the relative importance of the temporal components in characterizing the sound. This is performed by matching frequencies and then interpolating the amplitudes and phases of the two synthesized sounds at a splice point. In sound morphing, in contrast to splicing temporal segments of sounds, entire frequency tracks are blended together to transform one sound into another; the functional form for amplitude, frequency, and phase gives a natural means for moving from one sound or voice to another. A new frequency track is created as the interpolation of tracks ω1[n] and ω2[n] from the two sounds, represented by Image, over a time interval [0, N]. A similar operation is performed on the corresponding amplitude functions [53]. Such a time-varying blend of different signals can also be performed in the framework of the phase vocoder [36].
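A minimal sketch of such track blending follows (Python; it assumes the two tracks have already been matched and trimmed to a common length, and the linear blend schedule is just one simple choice):

```python
import numpy as np

def morph_tracks(w1, w2, a1, a2):
    """Blend two matched sinewave tracks over their common length.

    w1, w2 : frequency tracks (rad/sample) from the two sounds.
    a1, a2 : the corresponding amplitude tracks.
    """
    N = min(len(w1), len(w2))
    alpha = np.linspace(0.0, 1.0, N)       # 0 -> sound 1, 1 -> sound 2
    w = (1.0 - alpha) * w1[:N] + alpha * w2[:N]
    a = (1.0 - alpha) * a1[:N] + alpha * a2[:N]
    return a * np.cos(np.cumsum(w))        # phase = integral of blended frequency
```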

Another interesting application of sinewave analysis is the analysis of vibrato in the singing voice. Vibrato is the quasi-sinusoidal fluctuation of pitch and is crucial for the richness and naturalness of the sound [26],[41]. With vibrato, the resonant character of the vocal tract remains approximately fixed while the excitation from the vocal folds changes frequency in a quasi-sinusoidal manner. The output spectrum of the singing voice shows frequency modulation due to the activity of the vocal folds and amplitude modulation, i.e., tremolo, due to the source harmonics being swept back and forth through the vocal tract resonances13 [26]. For quasi-periodic waveforms with time-varying pitch, each harmonic frequency varies synchronously. Because each harmonic is a multiple of the time-varying fundamental, with vibrato, higher harmonics have a larger bandwidth than lower harmonics. With rapid pitch vibrato, the temporal resolution required for frequency estimation increases with harmonic number. The sinusoidal analysis is useful (although, as we will see shortly, limited by a fixed analysis window) in tracking such harmonic frequency and amplitude modulation and can have an advantage over the phase vocoder, which requires the modulated frequency to reside in a single channel [26]. The presence of vibrato in the analyzed tone may cause unwanted “cross-talk” between the bandpass filters of the phase vocoder; in other words, a sinewave may appear in the passband of two or more analysis filters during one vibrato cycle. Sinewave analysis, on the other hand, was found by Maher and Beauchamp [26] to provide an improvement over fixed filter-bank methods for the analysis of vibrato since it is possible to track changing frequencies and thereby avoid the cross-talk problem.

13 McAdams [29] hypothesizes that the tracing of resonant amplitude by frequency modulation contributes to the distinctness of the sound in the presence of competing sound sources.

9.4.5 Time-Frequency Resolution

For some signal processing applications, it is important that the sinewave analysis parameters represent the actual signal components. Although a wide variety of sounds have been successfully analyzed and synthesized based on the sinusoidal representation, constraints on the analysis window and assumptions of signal stationarity do not allow accurate estimation of the underlying components for some signal classes. For example, with sinewave analysis of signals with closely spaced frequencies such as a low-pitched male voice, it is difficult to achieve adequate temporal resolution with a window selected for adequate frequency resolution. On the other hand, for signals with rapid modulation or signals with short duration and/or a sharp attack, such as with rapid pitch or formant transitions and plosives, it is difficult to attain adequate frequency resolution with a window selected for adequate temporal resolution. A long window can result in temporal smearing of transient or rapidly varying signal components and may be perceived as a mild dulling or distortion of the sound. A short window, on the other hand, may prevent accurate representation of low frequencies and closely spaced sinewaves. These problems reflect the tradeoff of time-frequency resolution that arises in the uncertainty principle (Chapter 2).

In the presence of pitch modulation, for example, to obtain equivalent time resolution in following each frequency trajectory through its spectral peak locations, we would need to decrease the window duration with increasing frequency. By “equivalent” time resolution, we mean (loosely) the same frequency change under each analysis window. (Also keep in mind that the time span of one sinewave cycle decreases with increasing frequency.) With plosives, likewise, we would need to decrease the window length to give better time resolution of the primarily high-frequency energy. Sinewave analysis may, therefore, benefit from multi-resolution windowing. Such a multi-resolution sinewave analysis can be provided by the wavelet transform that we described in Chapter 8. The basic approach, introduced in [8],[9], is to first decompose the signal by a wavelet transform. Each wavelet filter output is then represented by a sum of sinewaves. A number of formal ways of representing the subband outputs in terms of sinewaves were developed by Goodwin [13], who modeled each output using progressively shorter windows with increasing frequency, as illustrated in Figure 9.18. Specifically, each window duration decreases by a factor of two for each frequency octave increase. Goodwin also introduced a more general “atomic decomposition” in which subband bandwidths and analysis window durations adapt to the signal’s time-frequency characteristics [13]. Approaches in this style have been developed for a number of specific applications. Anderson [2], Goodwin [13], and Levine, Verma, and Smith [22] have developed multi-resolution approaches to sinewave analysis, particularly useful in preserving signal transients and transitions in signal modification and wideband coding applications. Hamdy, Ali, and Tewfik [15] have shown that the wavelet transform can be combined with a “residual” sinewave model that we describe later in this chapter, also for the wideband coding application.
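As a concrete illustration of such a window schedule, the sketch below assigns each octave band a window duration that halves with each octave increase; the band edges and the base duration are hypothetical design choices, not the filter designs of [13].

```python
def octave_window_lengths(fs, n_octaves, base_ms=64.0):
    """Per-octave analysis window lengths, halved for each octave up.

    Returns (lo_hz, hi_hz, window_samples) per band; the lowest band
    uses base_ms, and the top band ends at fs/2.
    """
    bands = []
    for b in range(n_octaves):
        hi = fs / 2.0 ** (n_octaves - 1 - b)   # upper band edge (Hz)
        lo = hi / 2.0                          # octave-wide band
        win = int(fs * (base_ms / 2.0 ** b) / 1000.0)
        bands.append((lo, hi, win))
    return bands
```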

Similar approaches exploit auditory perception principles. Although a short window for temporal tracking reduces frequency resolution, the human ear may not require as high a frequency resolution in perceiving high frequencies as for low frequencies. Ellis used this property of auditory perception in developing a constant-Q wavelet analysis within the sinewave framework for improving the perception of fine structure at signal onsets [8],[9]. Ghitza [12] and Anderson [2] have exploited auditory spectral masking (to be described in Chapters 12 and 13) with constant-Q filters to reduce the number of sinewaves required in sinewave synthesis for speech coding.

Figure 9.18 An octave-band filter bank for sinewave decomposition. Each filter-bank output is sinewave-analyzed with a window whose duration decreases by a factor of two for each frequency octave increase [13].

SOURCE: M.M. Goodwin, Adaptive Signal Models: Theory, Algorithms, and Audio Applications [13] (Figure 13.9). ©1992, Kluwer Academic Publishers. Used by permission.

Image

A number of approaches have also been proposed for modeling the nonstationarity of sinewave parameters over an analysis window.14 One technique relies on a time-varying frequency model with a linear evolution of frequency. For a Gaussian analysis window, Marques and Almeida [28] have shown that the windowed signal can be written in complex form as

14 Observe that there is an inconsistency in assumptions made in baseline analysis and synthesis. In analysis, we assume the sinewave amplitudes and frequencies are fixed (stationary), while in synthesis we assume they are time-varying (nonstationary). However, we have faced this inconsistency in all parametric analysis/synthesis systems of the text thus far (see Exercise 5.13 for an alternative model).

Image

with

(9.25)

Image

where the center frequency for each component is ωk, the frequency slope is 2Δk, and the Gaussian envelope is characterized by the parameters μk and λk. The Fourier transform of s(t) in Equation (9.25) is given by

Image

where, for a Gaussian window, Ok(ω) can be evaluated analytically and also takes on a Gaussian form. This convenient form allows estimation of the unknown parameters by iterative least-squared-error minimization using a log spectral error. To make the joint estimation of multiple sinewaves tractable, estimation is performed one sinusoid at a time, successively subtracting each estimate from the signal’s spectrum.15 Improvement in average reconstruction error over multiple frames was observed for speech signals whose pitch varies rapidly.

15 An iterative approach was also developed by George [11] for improving the estimation of low-level sinewaves in wideband spectra of large dynamic range. A least-squared-error minimization of sinewave parameters was formulated as an analysis-by-synthesis procedure in which sinewave estimates are successively subtracted from the signal; at each iteration, the component with the largest magnitude is subtracted and the estimation is carried out on the remainder.
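The successive-subtraction idea can be sketched as follows (Python; frequencies are quantized to DFT bins here, whereas [11] refines the estimates between bins, so this illustrates the loop structure rather than the full method):

```python
import numpy as np

def greedy_sinewave_analysis(x, n_sines, nfft=None):
    """Analysis-by-synthesis sinewave estimation in the spirit of [11]:
    estimate the largest remaining spectral peak, subtract the
    corresponding sinewave from the residual, and repeat."""
    nfft = nfft or 4 * len(x)                 # zero-pad for finer bin spacing
    w = np.hanning(len(x))
    n = np.arange(len(x))
    resid = np.asarray(x, dtype=float).copy()
    params = []
    for _ in range(n_sines):
        X = np.fft.rfft(resid * w, nfft)
        k = np.argmax(np.abs(X))
        omega = 2.0 * np.pi * k / nfft        # rad/sample at the peak bin
        amp = 2.0 * np.abs(X[k]) / w.sum()    # undo the window's coherent gain
        phi = np.angle(X[k])                  # phase referenced to n = 0
        resid -= amp * np.cos(omega * n + phi)
        params.append((amp, omega, phi))
    return params, resid
```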

This problem of representing frequency variation is particularly severe for high-frequency sinewaves. We have seen that, because each harmonic frequency of a periodic waveform is an integer multiple of the fundamental frequency, higher frequencies experience greater variation than low frequencies. We mentioned earlier that with rapid pitch vibrato in the singing voice, for example, the temporal resolution required for frequency estimation increases with harmonic number. This high-frequency variation can be so great that the short-time signal spectrum can appear noise-like in high-frequency regions, thus reducing the efficacy of spectral peak-picking. (Look ahead to Figure 11.2.) This observation motivated an alternative approach, due to Ramalho, to addressing sinewave frequency variations [46]. In this approach, the waveform is temporally warped according to an evolving pitch estimate, resulting in a nearly monotone-pitch waveform. A fixed analysis window is then selected for a desired frequency resolution; dewarping the frequency estimate yields an estimate of the desired sinewave frequency trajectory.
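The pitch-driven warping can be sketched as follows (Python; it assumes a per-sample pitch estimate is available, and linear-interpolation resampling stands in for whatever interpolator one prefers — an illustration of the idea, not Ramalho's implementation [46]):

```python
import numpy as np

def warp_to_monotone_pitch(x, f0, fs):
    """Temporally warp x so that its pitch is nearly constant.

    f0 : per-sample pitch estimate in Hz (same length as x, all > 0).
    """
    cycles = np.cumsum(f0) / fs              # elapsed pitch cycles vs. time
    f0_mean = f0.mean()
    n_out = int(cycles[-1] * fs / f0_mean)   # output length at the mean rate
    target = np.arange(n_out) * f0_mean / fs # uniform grid in cycle domain
    # Invert the warp: sample times at which the target cycle counts occur
    t = np.interp(target, cycles, np.arange(len(x)))
    return np.interp(t, np.arange(len(x)), x)
```

After sinewave analysis of the warped signal with a fixed window, the estimated frequency tracks are dewarped by the same mapping to recover the original trajectories.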

9.5 Source/Filter Phase Model

The baseline sinewave model and analysis/synthesis system are applicable to arbitrary signals. For signals represented approximately by the output of a linear system driven by periodic, impulsive, or noise excitations, however, we saw in Section 9.2 that the baseline sinewave model can be refined by imposing a source/filter representation on the sinewave components. Within this framework, we now introduce the concept of sinewave phase coherence, which becomes the basis for a number of applications, including time-scale modification and peak-to-rms reduction.

9.5.1 Signal Model

We begin by reviewing the sinewave representation derived earlier for a source/filter speech model. We saw in Section 9.2 that we can write the speech excitation model as a sum of sinewaves of the form in Equation (9.1), i.e.,

(9.26)

Image

where K(t) represents the number of sinewaves at time t and ak(t) is the time-varying amplitude associated with each sinewave. For the kth sinewave, the excitation phase Image is the integral of the time-varying frequency ωk(t) or

Image

where Image is a fixed phase offset which accounts for the fact that the sinewaves are generally not in phase. Since the system impulse response is also time-varying, the system transfer function is written in terms of its time-varying amplitude and phase as

(9.27)

Image

To simplify notation, the system amplitude and phase along each frequency trajectory Ωk(t) are written as

(9.28)

Image

Passing the excitation described in Equation (9.26) through the linear time-varying system of Equation (9.27) results in the sinusoidal representation for the waveform:

Image

where

Ak(t) = ak(t)Mk(t)

and

(9.29)

Image

represent the amplitude and phase of each sinewave component along the frequency trajectory Ωk(t). We saw that the accuracy of this representation is subject to the caveat that the parameters are slowly varying relative to the duration of the system impulse response.

In developing a source/filter phase model, we initially assume voiced speech and that the glottal flow contribution is embedded within the vocal tract impulse response. For a periodic voiced segment, the excitation function therefore reduces to a periodic impulse train. The excitation phase representation in Equation (9.26) can then be simplified by introducing a parameter representing the time at which an impulse occurs; we refer to this as an onset time of the periodic excitation. In the context of the sinewave model, an onset time corresponds to the time at which sinewaves are “in phase,” by which we mean that they all reach their peak amplitude value, each sinewave having a multiple of 2π phase value. The excitation waveform is modeled over the analysis window duration as

(9.30)

Image

where to is an onset time of the source and where the excitation frequency Ωk is assumed constant over the duration of the analysis window. Comparison of Equation (9.26) with Equation (9.30) shows that here the excitation phase Image is linear with respect to frequency. With this representation of the excitation, the excitation phase can be written in terms of the onset time to as

Image

According to Equation (9.29), the system phase for each sinewave frequency is given by the phase function obtained when the linear excitation phase (tto) Ωk is subtracted from the composite phase θk(t), which consists of both excitation and system components, i.e.,

Φk(t) = θk(t) − (t − to)Ωk.

Similarly, from Equation (9.29), the system amplitude is obtained by dividing the excitation amplitude ak(t) into the composite amplitude Ak(t). Alternatively, excitation components are obtained by removing system components from composite values.
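In code, the decomposition amounts to subtracting the linear excitation phase and dividing out the excitation amplitude. A minimal sketch, taking unit excitation amplitude as for an impulse train:

```python
import numpy as np

def split_source_filter(theta_k, A_k, omega_k, t, t_o, a_k=1.0):
    """Recover system amplitude and phase from composite sinewave
    measurements, per Equations (9.29)-(9.30).

    theta_k, A_k : composite phase (rad) and amplitude at time t.
    omega_k      : sinewave frequency.
    t, t_o       : measurement time and excitation onset time.
    a_k          : excitation amplitude (unity for an impulse train).
    """
    phi_sys = theta_k - (t - t_o) * omega_k    # remove linear excitation phase
    phi_sys = np.angle(np.exp(1j * phi_sys))   # wrap to (-pi, pi]
    M_sys = A_k / a_k                          # system magnitude
    return M_sys, phi_sys
```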

9.5.2 Applications

We saw in Chapter 8 an auditory model that offers an explanation for perceptual phase sensitivity. Motivated by this model and empirical observations of phase sensitivity, in this section, we describe a method for preserving phase coherence (i.e., the original sinewave phase relations) or a desired phase relation (other than the original) in the context of sinewave analysis/synthesis for a number of applications.

Time-Scale Modification — The simplified linear model of speech production predicts that a time-scale-modified waveform takes on the appearance of the original except for a change in time scale, e.g., we simply increase or decrease the number of pitch periods during voicing, as we saw in the time-scale modification model of Chapter 7. This section develops a time-scale modification system that preserves this shape-invariance property for quasi-periodic signals by preserving the phase coherence across sinewaves in modification [39]. A similar approach can be applied to pitch modification [39] (Exercise 9.11).

Source/Filter Model: For a uniform change in the time scale, the time to corresponding to the original articulation rate is mapped to the transformed time Image through the mapping

(9.31)

Image

The case ρ > 1 corresponds to slowing down the rate of articulation by means of a time-scale expansion, while the case ρ < 1 corresponds to speeding up the rate of articulation by means of a time-scale compression. Events which take place at a time Image according to the new time scale will have occurred at Image in the original time scale.

In an idealized sinewave model for time-scale modification, the “events” which are modified are the amplitudes and phases of the system and excitation components of each underlying sinewave. The rate of change of these events is a function of how fast the system moves and how fast the excitation characteristics change. In this simplified model, a change in the rate at which the system moves corresponds to a time scaling of the amplitude M(t, Ω) and the phase Φ(t, Ω). The excitation parameters must be modified so that frequency trajectories are stretched and compressed while maintaining pitch. While the excitation amplitudes ak(t) can be time-scaled, a simple time scaling of the excitation phase Image will alter pitch. Alternatively, the transformation given by Image maintains the pitch but results in waveform dispersion. As in the baseline sinewave modification system of Equation (9.24), the phase relation among sinewaves is continuously being altered. A different approach to modeling the modification of the excitation phase function relies on the representation of the excitation in terms of its impulse locations, i.e., the onset times introduced in the previous section. In time-scale modification, the excitation onset times extend over longer or shorter time durations relative to the original time scale. This representation of the time-scaled modified excitation function is a primary difference from time-scale modification using the baseline system of Equation (9.24). The model for time-scale modification is illustrated in Figure 9.19.

Figure 9.19 Onset-time model for time-scale modification where t′ = tρ with ρ the rate-change factor. For a periodic speech waveform with pitch period P, the excitation is given by a train of periodically spaced impulses, i.e., Image, where to is a displacement of the pitch impulse train from the origin. This can also be written as a sum of complex sinewaves of the form Image.

SOURCE: T.F. Quatieri and R.J. McAulay, “Shape-Invariant Time-Scale and Pitch Modification of Speech” [39]. ©1992, IEEE. Used by permission.

Image

Equations (9.26)-(9.31) form the basis for a mathematical model for time-scale modification [38],[39],[40]. To develop the model for the modified excitation function, suppose that the time-varying pitch period P(t) is time-scaled according to the parameter ρ. Then the time-scaled pitch period is given by

Image

from which a set of new onset times can be determined. The model of the modified excitation function is then given by

(9.32)

Image

where

(9.33)

Image

and where Image is the modified onset time. The excitation amplitude in the new time scale is the time-scaled version of the original excitation amplitude function ak(t) and is given by

Image

The system function in the new time scale is a time-scaled version of the original system function so that the magnitude and phase are given by

Image

where Mk (t) and Φk(t) are given in Equation (9.28). The model of the time-scaled waveform is completed as

Image

where

(9.34)

Image

which represent the amplitude and phase of each sinewave component.

The above time-scale modification model was developed for a periodic source and is applicable to quasi-periodic waveforms with slowly varying pitch and for which sinewave frequencies are approximately harmonic. There also exist, of course, many sounds that are aharmonic. Consider plosive consonants generated with an impulsive source. In these cases, a change in articulation rate is less desirable, and so the model should include a rate change that adapts to speech events [39]. This is because, as we saw in Chapter 3 (Section 3.5), a natural change in articulation rate incurs less modification for this sound class; compression or expansion of a plosive is likely to alter its basic character. A smaller degree of change was also observed in Chapter 3 to occur naturally with fricative consonants.

A change in the articulation rate of noisy aharmonic sounds may be useful, however, in accentuating the sound. In this case, the spectral and phase characteristics of the original waveform, and therefore the noise character of the sound, are roughly preserved in an analysis/synthesis system based on the rate-change model in Equation (9.34), as long as the synthesis interval is 10 ms or less to guarantee sufficient decorrelation of sinewaves in time from frame to frame, and as long as the analysis window is 20 ms or more to guarantee approximate decorrelation in frequency of adjacent sinewaves [30],[39]. For time-scale expansion, this noise characteristic, however, is only approximate. Some slight “tonality” is sometimes perceived due to the temporal stretching of the sinewave amplitude and phase functions. In this case, and when the 10-ms synthesis frame condition is not computationally feasible due to a very expanded time scale (e.g., ρ greater than 2 with an analysis frame no less than 5 ms), frequency and phase dithering models can be used to satisfy the decorrelation requirements [23],[39]. One extended model adds a random phase to the system phase in Equation (9.34) in only those spectral regions considered aharmonic.16 For the kth frequency, the phase model is expressed as

16 In adding synthetic harmonic and aharmonic components, it is important that the two components “fuse” perceptually [18], i.e., that the two components are perceived as emanating from the same sound source. Informal listening tests suggest that sinewave phase randomization yields a noise component that fuses with the harmonic component of the signal.

(9.35)

Image

where bk(Ωc) is a binary weighting function which takes on a value of unity for a frequency declared “aharmonic” and a value of zero for a “harmonic” frequency:

(9.36)

Image

where Ωk are sinewave frequencies estimated on each frame and φk(t′) is a phase trajectory derived from interpolating random phase values over each frame, and differently for each sinewave; the phase values are selected from a uniformly distributed random variable on [−π, π]. The cutoff frequency Ωc is the harmonic/aharmonic cutoff for each frame and varies with a “degree of harmonicity” measure (or “probability of voicing”) Pυ:

(9.37)

Image

over a bandwidth B and where the harmonicity measure Pυ, falling in the interval [0, 1] (the value 1 meaning most harmonic), must be obtained. One approach to obtaining Pυ uses a sinewave-based pitch estimator [32] that will be described in Chapter 10. Figure 9.20 illustrates an example of frequency track designations in a speech voiced/unvoiced transition. This phase-dithering model can provide not only a basis for certain modification schemes, but also a basis for the speech coding that we study in Chapter 12.
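A frame-level sketch of this phase assignment follows (Python). It assumes the proportional cutoff Ωc = Pυ·B suggested by Equation (9.37), and it applies a single random phase per sinewave per frame, whereas the model above interpolates the random values across frames.

```python
import numpy as np

def dither_system_phase(phi_sys, omegas, Pv, B, rng=None):
    """Add random phase to sinewaves above the harmonic/aharmonic cutoff,
    in the spirit of Equations (9.35)-(9.36).

    phi_sys : system phases of the K sinewaves on this frame.
    omegas  : sinewave frequencies (same units as B).
    Pv      : degree of harmonicity ("probability of voicing") in [0, 1].
    B       : analysis bandwidth; the cutoff is assumed to be Pv * B.
    """
    rng = rng or np.random.default_rng()
    omega_c = Pv * B                             # assumed proportional cutoff
    b_k = np.asarray(omegas) > omega_c           # True = "aharmonic"
    dither = rng.uniform(-np.pi, np.pi, len(omegas))
    return np.asarray(phi_sys) + b_k * dither
```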

Analysis/Synthesis: With estimates of excitation and system sinewave amplitudes and phases at the center of the new time-scaled synthesis frame, the synthesis procedure becomes identical to that of the baseline system of Section 9.4. The goal then is to obtain estimates of the amplitudes Image and phases Image in Equation (9.34) on the synthesis frame of length L′ = ρL where L is the analysis frame length. In the time-scale modification model, because the vocal tract system and its excitation amplitudes are simply time-scaled, we see from Equation (9.34) that the composite amplitude need not be separated and therefore the required time-scaled amplitude can be obtained from the sinewave amplitudes measured on each frame l. The amplitudes for each frame in the new time scale are thus given by (re-introducing frame index notation that we earlier implicitly removed)

Figure 9.20 Transitional properties of frequency tracks with adaptive cutoff. Solid lines denote voiced frequencies, while dashed lines denote unvoiced frequencies.

SOURCE: T.F. Quatieri and R.J. McAulay, “Peak-to-rms Reduction of Speech Based on a Sinusoidal Model” [42]. ©1991, IEEE. Used by permission.

Image

Image

In discrete time, the time-scaled sinewave amplitudes are then obtained by linearly interpolating the values Image and Image over the synthesis frame duration L′, identical to the scheme in Equation (9.17). The system and excitation phases, however, must be separated from the measured phases because the components of the composite phase Image in Equation (9.34) are manipulated in different ways.

We first estimate the required system phase in the original time scale, i.e., relative to the original analysis frame, by subtracting the measured excitation phase from the measured composite phase, according to Equation (9.29). The initial step in estimating the excitation phase is to obtain the onset time with respect to the center of the lth frame, denoted in discrete time by no(l). Determining this absolute onset time17 is not an easy task, and a number of methods have been developed [33],[52]. One method of estimating the absolute onset times is based on a least-squared-error approach to finding the unknown no(l) [33] and will be described in Chapter 10. Although this method can yield onset times to within a few samples, this slight inaccuracy is enough to degrade the quality of the synthesis. Specifically, the measured absolute onset time can introduce pitch jitter, rendering the synthetic modified speech hoarse. (Recall this association in Chapter 3.)

17 Recall that for voiced speech we have lumped the glottal flow function with the vocal tract impulse response. The resulting sequence without a linear phase term is concentrated at the time origin. One can think of the absolute onset time as the time displacement of this sequence relative to the center of a frame. Thus, in estimating the absolute onset time, we are in effect estimating the linear phase term of the system function. Other definitions of absolute onset time are with respect to glottal flow characteristics such as the time of glottal opening or the sharp, negative-going glottal pulse.

An alternative perspective of onset time, however, leads to a way to avoid this hoarseness. Because the function of the onset time is to bring the sinewaves into phase at times corresponding to the sequence of excitation impulses, it is possible to achieve the same effect simply by keeping a record of successive onset times generated by a succession of pitch periods. If no(l − 1) is the onset time for frame l − 1 and if Pl is the pitch period estimated for frame l, then a succession of onset times can be specified by

qo(l; j) = no(l − 1) + jPl,         j = 1, 2, 3, …

If qo(l; J) is the onset time closest to the center of frame l, then the onset time for frame l is defined by

(9.38)

Image

We call this sequence the relative onset times. An example of a typical sequence of onset times is shown in Figure 9.21a. The figure implies that, in general, there can be more than one onset time per analysis frame. Although any one of the onset times can be used, in the face of computational errors due to discrete Fourier transform (DFT) quantization effects, it is best to choose the onset time nearest the center of the frame, since then the resulting phase errors will be minimized. This procedure determines a relative onset time, in contrast to finding the absolute onset time at which excitation impulses actually occur. Because the relative onset time is obtained from the pitch period, pitch estimation is required. Given the importance of avoiding pitch jitter, fractional pitch periods are desired. One approach, consistent with a sinewave representation and described in Chapter 10, is based on fitting a set of harmonic sinewaves to the speech waveform; this allows a fractional pitch estimate and thus yields accurate relative onset times [32].
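The onset-time recursion of Equation (9.38) is simple enough to state directly; the sketch below accepts a fractional pitch period, as recommended above.

```python
def relative_onset_time(n_prev, P_l, frame_center):
    """Advance onsets from the previous frame's onset in steps of the
    pitch period and return the one nearest this frame's center.

    n_prev       : onset time chosen for frame l-1 (samples).
    P_l          : pitch period estimate for frame l (may be fractional).
    frame_center : center time of frame l (samples).
    """
    q = n_prev + P_l                        # q_o(l; 1)
    best = q
    while q <= frame_center + P_l:          # stop once past the center
        q += P_l
        if abs(q - frame_center) < abs(best - frame_center):
            best = q
    return best
```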

Figure 9.21 Estimation of onset times for time-scale modification: (a) onset times for system phase; (b) onset times for excitation phase.

SOURCE: T.F. Quatieri and R.J. McAulay, “Shape-Invariant Time-Scale and Pitch Modification of Speech” [39]. ©1992, IEEE. Used by permission.

Image

Having an onset time estimate, from Section 9.5.1, the excitation phase for the lth frame is then given in discrete time by

Image

where we assume a constant frequency Image and where no(l) takes on the meaning of the onset time closest to the frame center time. To avoid growing frame and onset times, we shift each frame to time n = 0. This can be done by subtracting and adding lL (disregarding a half-frame length) in the above expression for Image to obtain

Image

where Image is the onset time relative to the lth frame center shifted to time n = 0. The excitation phase at the frame center is therefore given by

Image

Finally, an estimate of the system phase at the measured frequencies is computed by subtracting the estimate of the excitation phase from the measured phase at the sinewave frequencies, i.e.,

Image

When the excitation phase, derived from the relative onset time, is subtracted, some residual linear phase will be present in the system phase estimate because the relative linear phase is not equal to the absolute linear phase. This linear phase residual, however, is consistent at harmonics over successive frames and therefore does not pose a problem to the reconstruction, since the ear is not sensitive to a linear phase shift. To see this property, consider a periodic voiced waveform with period P, suppose that the absolute onset time on the lth frame is ml, and suppose the relative onset is given by JlP (relative to time n = 0), where P is the pitch period and Jl is an integer referring to the lth frame. (For convenience, we have slightly changed our notation from above.) The linear phase difference at harmonics, denoted by Image, is given by

Image

And for the following frame, the linear phase residual is given by

Image

where M is the number of pitch periods traversed in going from frame l to frame l + 1. We see, therefore, that the phase residual remains constant (modulo 2π) over the periodic waveform. Furthermore, with small changes in pitch and with aharmonicity, the change in linear phase is much reduced because Image
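The modulo-2π bookkeeping can be made explicit; the following is a sketch under the stated periodicity assumption, with M′ denoting the integer number of periods by which the relative onset advances between frames.

```latex
\begin{align*}
\Delta\Phi_k(l) &= (m_l - J_l P)\,\omega_k, \qquad \omega_k = \frac{2\pi k}{P},\\
m_{l+1} &= m_l + M P, \qquad J_{l+1} = J_l + M' \quad (M,\,M' \text{ integers}),\\
\Delta\Phi_k(l+1) - \Delta\Phi_k(l) &= (M - M')\,P\,\frac{2\pi k}{P}
  = 2\pi k\,(M - M') \;\equiv\; 0 \pmod{2\pi}.
\end{align*}
```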

The remaining step is to compute the excitation phase relative to the new synthesis interval of L′ samples. As illustrated in Figure 9.21b, the pitch periods are accumulated until a pulse closest to the center of the lth synthesis frame is achieved. The location of this pulse is the onset time with respect to the new synthesis frame and can be written as

Image

where J′ corresponds to the first pulse closest to the center of the synthesis frame of duration L′. The phase of the modified excitation Image, at the center of the lth synthesis frame, is then given by

Image

Finally, in the synthesizer the sinewave amplitudes over two consecutive frames, Image and Image, are linearly interpolated over the frame interval L′, as described above. The excitation and system phase components are summed and the resulting sinewave phases, Image and Image, are interpolated across the duration L′ using the cubic polynomial interpolator. A block diagram of the complete analysis/synthesis system is given in Figure 9.22. Finally, we note that similar techniques can be applied to frequency compression and pitch modification, as well as to such operations jointly (Exercises 9.10 and 9.11).

An important feature of the sinewave-based modification system is its straightforward extension to time-varying rate change, details of which are beyond the scope of this chapter [39]. Consider, for example, listening to books from audio recordings. Here we want time-varying control of the articulation rate, i.e., a “knob” that slows down the articulation rate in important or difficult-to-understand passages and speeds up speech in unimportant or uninteresting regions. In addition, we mentioned earlier that unvoiced sounds are naturally modified less than voiced sounds. As a consequence, the corresponding analysis/synthesis system can be made to adapt to the events in the waveform, i.e., the degree of voicing, which better emulates natural speech modification mechanisms, as discussed in the previous section. We saw that one way to achieve a measure of voicing, and thus the desired adaptivity, is through a measure of “harmonicity” Pυ [39]. This event-adaptive rate change can be superimposed on a context-dependent rate change as given, for example, in the above application. The reconstructions are generally of high quality, maintaining the naturalness of the original, and are free of artifacts. Interfering backgrounds are also reproduced at faster and slower speeds. Although the phase model is pitch-driven, this robustness property is probably due to using the original sinewave amplitudes, frequencies, and phases in the synthesis rather than forcing a harmonic structure onto the waveform.

Figure 9.22 Sinusoidal analysis/synthesis for time-scale modification.

SOURCE: T.F. Quatieri and R.J. McAulay, “Shape-Invariant Time-Scale and Pitch Modification of Speech” [39]. ©1992, IEEE. Used by permission.

Image

Example 9.11       An example of time-scale expansion of a speech waveform is shown in Figure 9.23, where the time scale has been expanded by a “target” (i.e., desired maximum) factor of 1.7 and has been made to adapt according to the harmonicity measure Pυ referred to in Equation (9.37); specifically, the rate change is given by ρ = [1 − Pυ] + 1.7 Pυ. The analysis frame interval is 5 ms so that the synthesis frame interval is no greater than 8.5 ms. Observe that the waveform shape is preserved during voicing, as a result of maintaining phase coherence, and that the unvoiced segment is effectively untouched because ρ ≈ 1 in this region, although some temporal smearing occurs at the unvoiced/voiced transition. Spectrogram displays of time-scaled signals show preservation of formant structure and pitch variations [39].

Example 9.12       An example of time-scale modification of speech using sinewave analysis/synthesis is shown in Figure 9.24, where the rate change is controlled to oscillate between a compression with ρ = 0.5 and an expansion with ρ = 1.5. The figure shows that details of the temporal structure of the original waveform are maintained in the reconstruction despite the time-varying rate change; waveform dispersion, characteristic of the time-scale modification with the baseline sinewave analysis/synthesis, does not occur.Image

Peak-to-rms Reduction — In reducing the ratio of the peak value of a signal to its average power, the concern of sinewave phase coherence again arises. Here, however, the goal is not to preserve the original sinewave phase relations, but rather to intentionally modify them to yield a response with minimum “peakiness,” relying on a phase design technique derived in a radar signal context [40],[42],[43].

Figure 9.23 Example of sinewave-based time-scale modification of speech waveform “cha” in the word “change”: (a) original; (b) adaptive expansion with target ρ = 1.7.

Image

Figure 9.24 Example of time-varying time-scale modification of speech waveform using sinusoidal analysis/synthesis: (a) original; (b) time-scaled with rate change factor ρ changing from 0.5 (compression) to 1.5 (expansion).

SOURCE: T.F. Quatieri and R.J. McAulay, “Shape-Invariant Time-Scale and Pitch Modification of Speech” [39]. ©1992, IEEE. Used by permission.

Image

Key-Fowle-Haggarty Phase Design: In the radar signal design problem, the signal is given as the output of a transmit filter whose input consists of impulses. The spectral magnitude of the filter’s transfer function is specified, and its phase is typically chosen so that the response over its duration is flat. This design allows the waveform to have maximum average power given a peak-power limit on the radar transmitter. The basic unit of the radar waveform is the impulse response h[n] of the transmit filter. Following the development in [41],[42], it is expedient to view this response in the discrete-time domain as an FM chirp signal with envelope a[n], phase Image, and length N:

(9.39)

Image

that has a Fourier transform H(ω) with magnitude M(ω) and phase ψ (ω), i.e.,

(9.40)

Image

By exploiting the analytic signal representation of h[n], Key, Fowle, and Haggarty [20] have shown that, under a constraint of a large product of signal time duration and bandwidth, i.e., a large “time-bandwidth product,” specifying the two amplitude components, a[n] and M(ω), in Equation pair (9.39) and (9.40) is sufficient to approximately determine the remaining two phase components. How large the time-bandwidth product must be for these relations to hold accurately depends on the shape of the functions a[n] and M(ω) [6],[10].

Ideally, for the minimum ratio of the signal peak to the square root of the signal average energy, referred to as the peak-to-rms ratio in the radar signal, the time envelope a[n] should be flat over the duration N of the impulse response. With this and the additional constraint that the spectral magnitude is specified (a flat magnitude is usually used in the radar signal design problem), Key, Fowle, and Haggarty’s (KFH) general relation among the envelope and phase components of h[n] and its Fourier transform H(ω) reduces to an expression for the unknown phase ψ(ω):

(9.41)

Image

where the “hat” indicates that the magnitude has been normalized by its energy, i.e.,

Image

The accuracy of the approximation in Equation (9.41) increases with increasing time-bandwidth product [42].

Equation (9.41) shows that the resulting phase ψ(ω) depends only on the normalized spectral magnitude Image and the impulse response duration N. It can be shown that the envelope level of the resulting waveform can be determined, with the application of appropriate energy constraints, from the unnormalized spectrum and duration [42]. Specifically, if the envelope of h[n] is constant over its duration N and zero elsewhere, the envelope constant has the value

(9.42)

Image

The amplitude and phase relation in Equations (9.41) and (9.42) will be used to develop the sinewave-based approach to peak-to-rms reduction.
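Equation (9.41) is not reproduced here, but its content can be sketched numerically under a stationary-phase reading of the KFH result: the group delay at frequency ω is taken as N times the cumulative normalized energy below ω, so that a flat-envelope chirp of duration N sweeps the band. The form used below is an assumption consistent with this description, not a transcription of Equation (9.41).

```python
import numpy as np

def kfh_phase(mag, N):
    """Sketch of a KFH-style dispersive phase from a spectral magnitude.

    mag : spectral magnitude samples uniformly spaced on [0, pi).
    N   : desired impulse-response duration (samples).
    """
    dw = np.pi / len(mag)
    energy = mag.astype(float) ** 2
    energy /= energy.sum() * dw            # normalize so it integrates to 1
    tau = N * np.cumsum(energy) * dw       # group delay: -d(psi)/d(omega)
    psi = -np.cumsum(tau) * dw             # integrate once more for the phase
    return psi
```

Sampling psi at the sinewave frequencies in place of the measured system phase then yields the flat, chirp-like response described next.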

Waveform Dispersion: In the above radar signal design context, the spectral magnitude was assumed known and a phase characteristic was estimated from the magnitude. Alternatively, a filter impulse response with some arbitrary magnitude or phase is given and the objective is to disperse the impulse response to be maximally flat over some desired duration N. This requires first removing the phase of the filter and then replacing it with a phase characteristic from the KFH calculation. This same approach can be used to optimally disperse a voiced speech waveform.18 The goal is to transform a system impulse response with arbitrary spectral magnitude and phase into an FM chirp response which is flat over the duration of a pitch period. The basis for this optimal dispersion is the sinewave source/filter phase model of Section 9.5.1. The sinewave analysis/synthesis system first separates the excitation and system phase components from the composite phase of the sinewaves that make up the waveform, as was done for time-scale modification. The system phase component is then removed, and the new KFH phase replaces the natural phase dispersion to produce the optimally dispersed waveform.

18 Schroeder has derived a different method to optimally flatten a harmonic series by appropriate selection of harmonic phases [48],[49]. This solution can be shown to approximately follow from the KFH dispersion formula.

Applying the KFH phase to dispersion requires estimation of the spectral magnitude M(lL, ω) of the system impulse response and the pitch period of the excitation Pl on each frame l. The duration of the synthetic impulse response is set close to the pitch period Pl so that the resulting waveform is as dense as possible. Estimation of the spectral envelope M(lL, ω) can be performed with a straight-line spectral smoothing technique that uses a linear interpolation across sinewave amplitudes (Chapter 10) [37]. The synthetic system phase on each frame is derived from M(lL, ω) using the KFH solution in Equation (9.41) and is denoted by ψkfh(lL, ω), where the subscript “kfh” refers to the KFH phase.

Applying the KFH phase dispersion solution in the synthesis requires that the synthetic system phase, ψkfh(lL, ω), a continuous function of frequency, be sampled at the time-varying sinewave frequencies Image. We write the resulting sampled phase function as

(9.43)

Image

where the subscript “k, kfh” and superscript “l” denote the KFH phase sampled at the kth frequency on the lth analysis frame. The phase solution in Equation (9.43) is used only where an approximate periodicity assumption holds, whereas in aharmonic regions the original system phase is maintained. Therefore, the KFH phase is assigned only to those frequencies designated “harmonic.” The assignment can be made using the same approach applied in the phase model of Equation (9.35), where a frequency cutoff ωc adapts to a measure of the degree of voicing. Use of the original system phase during aharmonic (noise or impulsive) regions does not change the preprocessor’s effectiveness in reducing the peak-to-rms ratio, since these regions contribute negligibly to this measure. Moreover, the preservation of as much of the original waveform as possible helps to preserve the original quality. Thus, the phase assignment for the kth sinewave on the lth frame is given by

(9.44)

Image

where bk(ωc), given in Equation (9.36), takes on a value of zero for a harmonic frequency (below ωc) and unity for an aharmonic frequency (above ωc), and where Image is the excitation phase, Image is the original system phase, and Image is the synthetic system phase.

Example 9.13       An example of dispersing a synthetic periodic waveform with a fixed pitch and fixed system spectral envelope is illustrated in Figure 9.25. Estimation of the spectral envelope uses a piecewise-linear spectral smoothing technique of Chapter 10 [37] and the pitch estimation is performed with a sinewave-based pitch estimator also described in Chapter 10. For the same peak level as the original waveform, the processed waveform has a larger rms value and so has a lower peak-to-rms ratio. The simulated vocal tract phase is modified significantly, as illustrated in Figures 9.25b and 9.25c. In Figure 9.25d the magnitude of the dispersed waveform is compared with the original magnitude and the agreement is very close, a property that is important to maintaining intelligibility and minimizing perceived distortion in speech.Image

With speech waveforms, a smoothed KFH phase solution (Exercise 9.18) that reduces the short-time peakiness within a pitch period can be combined with conventional dynamic range compression techniques that reduce long-time envelope fluctuations [4]. The result is a processed speech waveform with significant peak-to-rms reduction (as great as 8 dB) and good quality [42],[43]. The KFH solution introduces about a 3 dB contribution to this peak-to-rms reduction over that of dynamic range compression alone. The effect is a waveform with much greater average power but the same peak level, and thus a waveform that is louder than the original under the peak constraint. This property is useful in applications such as speech transmission (e.g., AM radio transmitters [43]) and speech enhancement (e.g., devices for the hearing-impaired).

Figure 9.25 KFH dispersion using the sinewave preprocessor: (a) waveforms; (b) original phase; (c) modified phase; (d) spectral magnitudes.

SOURCE: T.F. Quatieri and R.J. McAulay, “Peak-to-rms Reduction of Speech Based on a Sinusoidal Model” [42]. ©1991, IEEE. Used by permission.

Image

9.6 Additive Deterministic-Stochastic Model

We have seen in Chapter 3 that there are many aharmonic contributions to speech, including the numerous noise components (e.g., aspiration at the glottis and frication at an oral tract constriction) of unvoiced and voiced fricatives and plosives. Even during sustained vowels, aharmonicity can be present, both in distinct spectral bands and in harmonic spectral regions, due to aspiration (generated at the glottis) that is a subtle but essential part of the sound; this aharmonic component during sustained portions is different from the aharmonic sounds from unvoiced fricatives and plosives. Aharmonic contributions may also be due to sound transitions or from modulations and sources that arise from nonlinearities in the production mechanism such as vortical airflow. Speech analysis and synthesis are often deficient with regard to the accurate representation of these aharmonic components. In the context of sinewave analysis/synthesis, in particular, the harmonic and aharmonic components are difficult to distinguish and separate. One approach to separating these components, given in the previous section, assumes that they fall in two separate time-varying bands. Although the adaptive harmonicity measure is effective in specifying a split-band cutoff frequency (as well as generalizations to multi-bands as discussed in Chapter 10 [14]), it is, however, overly simple when the harmonic and aharmonic components are simultaneously present over the full band. An alternative representation, developed by Serra and Smith [50],[51], assumes the two components are additive over the full speech band. This approach, referred to as the deterministic-stochastic sinewave representation, is the focus of this section.

9.6.1 Signal Model

Following the formulation in [41], we express the Serra/Smith deterministic-stochastic model [50],[51] in continuous time as

s(t) = d(t) + e(t)

where d(t) and e(t) are the deterministic and stochastic components, respectively. The deterministic component d(t) is of the form

Image

where the phase is given by the integral of the instantaneous frequency Ωk(t):

Image

where the frequency trajectories Ωk(t) are not necessarily harmonic and correspond to sustained sinusoidal components that are relatively long in duration with slowly varying amplitude and frequency. The deterministic component19 is, therefore, defined in the same way as in the baseline sinusoidal model except with one caveat: the sinewaves are restricted to be “sustained.” In the baseline sinusoidal model, the spectral peaks need not correspond to such long-term frequency trajectories. A mechanism for determining these sustained sinewave components is described in the following section.20

19 Referring to sustained sinewave components as “deterministic” pertains more to our intuition rather than our strict mathematical sense.

20 As an alternative deterministic-stochastic separation scheme, Therrien, Cristi, and Allison [54] have introduced an adaptive pole-zero model for sample-by-sample tracking of sinewave amplitudes and frequencies of the deterministic signal. This component is subtracted from the original signal and parameters of the resulting residual are also adaptively estimated using a pole-zero representation. This representation has been used to synthetically increase limited training data for a signal classification task [54].

The stochastic component, sometimes referred to as a “residual,” is defined as the difference between the speech waveform and its deterministic part, i.e., e(t) = s(t) − d(t), and can be thought of as anything not deterministic. It is modeled as the output of a linear time-varying system h(t, τ) with a white-noise input u(t), i.e.,

(9.45)

Image

This is a different approach to modeling a stochastic component from that taken in the baseline sinewave model where noise is represented as a sum of sinewaves. A limitation with this stochastic representation, as discussed further below, is that not all aharmonic signals are appropriately modeled by this stochastic signal class; for example, a sharp attack at a vowel onset or a plosive may be better represented by a sum of short-duration coherent sinewaves or the output of an impulse-driven linear system, respectively. Another limitation of this representation is that harmonic and aharmonic components may be nonlinearly combined. Nevertheless, this simplification leads to a useful representation for a variety of applications.

9.6.2 Analysis/Synthesis

The analysis/synthesis system for the deterministic-stochastic model is similar to the baseline sinewave analysis/synthesis system. The primary differences lie in the frequency matching stage for extraction of the deterministic component, in the subtraction operation to obtain the residual (stochastic component), and in the synthesis of the stochastic component.

Although the matching algorithm of Section 9.3.4 can be used to determine frequency tracks (i.e., frequency trajectories from beginning to end of sinewaves), this algorithm does not necessarily extract the sustained sinewave components; rather, all sinewaves are extracted. In order to obtain these sustained components, Serra [50] developed a matching algorithm based on prediction of frequencies into the future, as well as extrapolation of past frequencies, over multiple frames. In this algorithm, frequency guides, a generalization of the frequency-matching window of Section 9.4.3, advance in time through spectral peaks, looking for slowly varying frequencies according to constraint rules. When the signal is known to be harmonic, the matcher is assisted by constraining each frequency guide to search for a specific harmonic number. A unique feature of the algorithm is the generalization of the birth and death process by allowing each frequency track to enter a “sleep” state and then reappear as part of a single frequency track. This peak continuation algorithm is described in detail in [50]. The algorithm helps to prevent the artificial breaking up of tracks, to eliminate spurious peaks, and to generate sustained sinewave trajectories, which is important in representing the time evolution of true frequency tracks.

With matched frequencies from the peak continuation algorithm, the deterministic component can be constructed using the linear amplitude and cubic phase interpolation of Section 9.4.1. The interpolators use the peak amplitudes from the peak continuation algorithm and the measured phases at the matched frequencies. The residual component can then be obtained by subtraction of the synthesized deterministic signal from the measured signal.

The residual is simplified by assuming it to be stochastic, represented by the output of a time-varying linear system with white-noise input as in Equation (9.45), a model useful in a variety of applications. In order to obtain a functional form for this stochastic process, the power spectrum of the residual must be estimated on each frame from the periodogram of the residual, i.e., the normalized squared STFT magnitude of the residual, Image. This power spectrum estimate can be obtained, for example, with linear predictive (all-pole) modeling (Chapter 5) [50],[51]. We denote the power spectral estimate on each frame by Image. A synthetic version of the process is then obtained by passing a white-noise sequence through a time-varying linear filter with the square root of the residual power spectral estimate Image as the frequency response. A frame-based implementation of this time-varying linear filtering is to filter windowed blocks of white noise and to overlap-add the outputs over consecutive frames. The time-varying impulse response of the linear system is given by

Image

which is a zero-phase response.21 The synthetic stochastic signal over the lth frame is then given by

21 It is left to the reader to show that a minimum-phase version of the filter can also be constructed through the homomorphic filtering methods of Chapter 6 and that its output has a power spectrum no different from the zero-phase counterpart response.

Image

where u[n], a white-noise input, is multiplied by the sliding analysis window with a frame interval of L samples. Because the window w[n] and frame interval L are designed so that Image, the overlapping sequences Image can be summed to form the synthesized residual

Image
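A frame-based sketch of this overlap-add noise synthesis follows (Python). FFT filtering is circular, so nfft is assumed comfortably larger than the window length for wrap-around effects to be negligible.

```python
import numpy as np

def synth_stochastic(psd_frames, win, L, rng=None):
    """Overlap-add synthesis of the stochastic component: windowed blocks
    of white noise are shaped by the zero-phase response sqrt(PSD) and
    the filtered blocks are overlap-added.

    psd_frames : per-frame one-sided PSD estimates, length nfft//2 + 1.
    win        : window; win and L are assumed chosen so that the
                 shifted windows sum to unity.
    L          : frame interval in samples.
    """
    rng = rng or np.random.default_rng()
    nfft = 2 * (len(psd_frames[0]) - 1)
    out = np.zeros(L * (len(psd_frames) - 1) + nfft)
    for l, psd in enumerate(psd_frames):
        u = rng.standard_normal(len(win)) * win     # windowed white noise
        U = np.fft.rfft(u, nfft)
        y = np.fft.irfft(U * np.sqrt(psd), nfft)    # zero-phase shaping
        out[l * L : l * L + nfft] += y              # overlap-add
    return out
```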

Although, as we will show in the application to speech modification, the stochastic model can provide an important alternative to a sinewave representation of noise-like sounds, it has its limitations. The residual which results from the deterministic-stochastic model generally contains everything that is not sustained sinewaves. One of the more challenging unsolved problems is the representation of transient events that reside in the residual; examples include plosives and transitions in speech, and other sounds that are neither quasi-periodic nor random. Nevertheless, the deterministic-stochastic analysis/synthesis has the ability to separate out subtleties in the sound that the baseline sinewave representation may not reveal. For example, the residual can include low-level aspiration and transitions, as well as the stronger fricatives and plosives in speech [50]. The following example illustrates this property of the residual, as well as the generality of the technique, for a musical sound:

Example 9.14       The deterministic-stochastic analysis/synthesis reveals the strong presence of non-tonal components in a piano sound [50]. The residual is an important component of the sound and includes the transient attack and noise produced by the piano action. An example of the decomposition of the attack (with noise) and initial sustained portion of a piano tone is illustrated in Figure 9.26. Synthesizing only the deterministic component, without the residual, results in a lifeless sound with a distorted attack and no noise component.Image

Figure 9.26 Deterministic-stochastic decomposition of a piano sound: (a) beginning segment of a piano tone; (b) deterministic component; (c) residual component, including the attack and noise.

SOURCE: X. Serra, A System for Sound Analysis/Transformation/Synthesis Based on a Deterministic Plus Stochastic Decomposition [50]. ©1989, X. Serra. Used by permission.

Image

Generally, treating the residual as the output of a white-noise-driven system when it contains transient events, as in the previous example, can alter the sound quality. One approach to improving the quality of the stochastic synthesis is to introduce a second layer of decomposition in which transient events are separated from the residual and represented by a method tailored to such events. One such method performs a wavelet analysis on the residual to estimate and remove transients in the signal [15]; the remainder is processed separately, an approach useful in applications such as speech coding or modification.

9.6.3 Application to Signal Modification

The decomposition approach has been applied successfully to speech and music modification where modification is performed differently on the two deterministic-stochastic components [50],[51]. Consider, for example, time-scale modification. With the deterministic component, the modification is performed as with the baseline system; using Equation (9.24), sustained sinewaves are compressed or stretched. For the aharmonic component, the white-noise input in the stochastic synthesis lingers over longer or shorter time intervals and is matched to per-frame impulse responses that vary more slowly or more quickly in time.

In one approach to implementing synthesis of the modified stochastic component, the window length and frame interval are modified according to the rate change. A new window w′[n] and frame interval L′ are selected such that Σl w′[n − lL′] = 1, and the factor L′/L equals the desired rate-change factor ρ, which is assumed rational. The resulting time-scaled stochastic waveform is given by

ê′[n] = Σl hl[n] ∗ (w′[n − lL′] u′[n])

where u′[n] is the white-noise input generated on the new time scale.
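Under the same assumptions as the earlier sketch, a twofold slowing of the stochastic component might be set up as follows; the per-frame spectral estimates are reused at the wider frame spacing with fresh white noise, and overlapping triangular windows shifted by half their length sum to one. Again, all names are illustrative.

% Time-scale the stochastic component by a rational factor rho = Lp/L.
rho = 2;                      % slow down by a factor of two
L   = 80;                     % original frame interval (10 ms at 8 kHz)
Lp  = rho * L;                % new frame interval on the modified scale
wp  = triang(2 * Lp);         % window twice the frame interval; shifted
                              % triangular windows sum to one
e_mod = stochastic_synthesis(P_hat, wp, Lp);   % sketch function above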

The advantage of separating out the additive stochastic component is that the character of a noise component is not modified; in particular, the noise may be stretched without the “tonality” that occurs in very large stretching of sinewaves. On the other hand, we noted earlier that the quality of transient aharmonic sounds may be altered. In addition, component separation and synthesis may suffer from a lack of “fusion,” which is not a problem in sinewave modification because all components are modeled similarly. One approach to improve fusion of the two components is to exploit the property that for many sounds the stochastic component is in “synchrony” with the deterministic component. In speech, for example, the amplitude of the noise component is known to be modulated by the glottal air flow, as we saw in Example 3.4 of Chapter 3. Improved fusion can thus be obtained by temporal shaping of the stochastic component with the temporal envelope of the glottal air flow [21]. Indeed, more generally, further development of the deterministic-stochastic model will require accounting for harmonic/aharmonic components that are nonlinearly combined.

9.7 Summary

In this chapter, a sinusoidal model for the speech waveform has been developed. In sinewave analysis, the amplitudes, frequencies, and phases of the component sinewaves are extracted from the short-time Fourier transform of the speech. In order to account for effects due to a time-varying source and vocal tract events, the sinewaves are allowed to come and go in accordance with a birth-death frequency tracking algorithm. Once contiguous frequencies are matched, a cubic phase interpolation function is obtained that is consistent with all the frequency and phase measurements and that performs maximally smooth phase unwrapping. Synthesis is performed by applying this phase function to a sinewave generator that is amplitude-modulated by a linear interpolation of successive matched sinewave amplitudes. The process is repeated for all sinewave components, the final speech being the sum of the component sinewaves. In some respects, the basic model has similarities to the filter-bank representation used in the phase vocoder of Chapter 8 [13],[41]. Although the sinusoidal analysis/synthesis system is based on the discrete Fourier transform (DFT), which can be interpreted as a filter bank (Chapter 7), the use of the DFT in combination with peak-picking renders a highly adaptive filter bank, since only a subset of all of the DFT filters is used on any one frame. It is the use of the frequency tracker and cubic phase interpolator that allows the filter bank to move with highly resolved speech components.

The sinewave model for an arbitrary signal class results in a sinewave analysis/synthesis framework applicable to a variety of problems in speech and audio sound processing, including, for example, sound modification, morphing, and peak-to-rms reduction. Tailoring the sinewave representation, however, to specific signal classes can improve performance. A source/filter phase model for quasi-periodic signals leads to a means to preserve sinewave phase coherence through a model of onset times of the excitation impulses. This approach to phase coherence is similar in style to that introduced into the phase vocoder in Chapter 8. In addition, the sinewave analysis/synthesis was tailored to signals with additive harmonic and aharmonic components by introducing a deterministic-stochastic model. It was also shown that, in some applications, computational advantages can be achieved by performing sinewave synthesis using the FFT and an overlap-and-add procedure.

There are many extensions, refinements, and applications of the sinusoidal approach that we were not able to cover in this chapter. For example, we merely touched upon the many considerations of time-frequency resolution and led the reader to some important work in combining the baseline sinewave model, and its deterministic-stochastic extension, with a wavelet representation for multi-resolution analysis and synthesis. The problem of the separation of two voices from a single recording is another area that we introduced only briefly through Exercise 9.17. Although the sinewave paradigm has been useful in signal separation, the problem remains largely unsolved [41],[45]. Multi-speaker pitch estimation [41],[45] and the synchrony of movement of the time-varying sinewave parameters within a voice [5] may provide keys to solving this challenging separation problem. In addition, there remain a variety of other application areas not addressed within this chapter, including speech coding (Chapter 12) and enhancement for the hearing impaired [19],[47]. Other applications exploit the capability of sinewave analysis/synthesis to blend signal operations, an example being joint time-scale and pitch modification (Exercise 9.11) for prosody manipulation in concatenative speech synthesis [3],[24].

Appendix 9.A: Derivation of the Sinewave Model

Consider the speech production model in continuous time which is illustrated in Figure 9.2, where h(t, τ) is the time-varying vocal tract impulse response, i.e., the response observed at time t to an impulse applied τ time units earlier, at time t − τ (Chapter 2). The frequency response of the system at time t is given by the Fourier transform of h(t, τ) with respect to τ and can be written in polar form as

H(t, Ω) = M(t, Ω) exp[jΦ(t, Ω)].

The source u(t) is modeled as a sum of sinewaves (with, for simplicity, the number of sinewaves K assumed fixed):

u(t) = Re Σ_{k=1}^{K} ak(t) exp[jθk(t)]

which represents an arbitrary source and thus is not constrained to a periodic, impulsive, or white-noise form. The functions ak(t) are the time-varying sinewave amplitudes and

θk(t) = ∫_{0}^{t} Ωk(σ) dσ

are the sinewave phase functions where, for convenience, we henceforth eliminate the “Re” notation.

The speech waveform s(t) can be written as a time-varying convolution

s(t) = ∫_{−∞}^{∞} h(t, t − τ) u(τ) dτ

which can be written as

s(t) = Σ_{k=1}^{K} ∫_{−∞}^{∞} h(t, t − τ) ak(τ) exp[jθk(τ)] dτ.

Interchanging the ∫ and Σ above, we have

s(t) = Σ_{k=1}^{K} ∫_{t′}^{t} h(t, t − τ) ak(τ) exp[jθk(τ)] dτ

where t′ is the effective starting time of h(t, t − τ). If we assume that the excitation amplitude and frequency are constant over the effective duration of h(t, t − τ) (Figure 9.27a), we have

ak(τ) ≈ ak(t)  and  θk(τ) ≈ θk(t) − Ωk(t)(t − τ),  for t′ ≤ τ ≤ t,

the phase approximation being a first-order Taylor expansion of θk(τ) about τ = t with dθk(t)/dt = Ωk(t).

Then exp[jθk(τ)] can be written as

exp[jθk(τ)] ≈ exp[jθk(t)] exp[−jΩk(t)(t − τ)].

Therefore, s(t) can be written as

s(t) = Σ_{k=1}^{K} ak(t) exp[jθk(t)] ∫_{t′}^{t} h(t, t − τ) exp[−jΩk(t)(t − τ)] dτ

where exp[−jΩk(t)(t − τ)] is considered as an eigenfunction of h(t, τ) at each time instant t. With the change of variables σ = t − τ, the above equation can therefore be further rewritten as

s(t) = Σ_{k=1}^{K} ak(t) exp[jθk(t)] ∫_{0}^{t−t′} h(t, σ) exp[−jΩk(t)σ] dσ

Figure 9.27 Effective duration of time-varying vocal tract impulse response h(t, τ) relative to the source u(τ): (a) an impulsive source overlapping h(t, τ); (b) a causal impulsive source not overlapping h(t, τ). In panel (a) the region of overlap of u(τ) and h(t, τ) in the time-varying convolution provides sinewave eigenfunctions as input to h(t, τ), but in panel (b) the eigenfunction assumption breaks down.

∫_{0}^{t−t′} h(t, σ) exp[−jΩk(t)σ] dσ ≈ H[t, Ωk(t)]

and thus

s(t) = Σ_{k=1}^{K} ak(t) H[t, Ωk(t)] exp[jθk(t)].

Expressing the system function in terms of its magnitude and phase, namely

H[t, Ωk(t)] = M[t, Ωk(t)] exp{jΦ[t, Ωk(t)]},

we can express the speech waveform model as

s(t) = Σ_{k=1}^{K} ak(t) M[t, Ωk(t)] exp(j{θk(t) + Φ[t, Ωk(t)]}).

Combining amplitude and phase terms, we have

s(t) = Σ_{k=1}^{K} Ak(t) exp[jΘk(t)]

where

Ak(t) = ak(t) M[t, Ωk(t)]

with

Θk(t) = θk(t) + Φ[t, Ωk(t)].

Finally, it is important to observe in this derivation that at rapid transitions the eigenfunction condition does not hold. Consider, for example, the onset of an initial vowel whose source is illustrated in Figure 9.27b as a causal impulse train. If we model this function u(t) as a sum of sinewaves, each with a unit-step amplitude function, ak(t) = u(t), then, as h(t, τ) slides over the beginning of the excitation function, the amplitudes of the excitation sinewaves are not constant over the duration of h(t, τ) (Exercise 9.3).
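This breakdown at onsets is easy to observe numerically. The following MATLAB sketch passes a suddenly switched-on sinewave through an arbitrary LTI filter standing in for the vocal tract and compares the output with the steady-state eigenfunction prediction; the filter choice and all names are illustrative.

% Eigenfunction approximation at an abrupt onset (illustrative).
fs = 8000; n = (0:399)'; Om = 2*pi*500/fs;    % 500-Hz sinewave
x  = cos(Om*n) .* (n >= 100);                 % unit-step amplitude onset
[b, a] = butter(4, 0.3);                      % stand-in LTI "vocal tract"
y  = filter(b, a, x);                         % true filter output
H  = freqz(b, a, [Om Om]); H = H(1);          % frequency response at Om
y_eig = abs(H) * cos(Om*n + angle(H)) .* (n >= 100);  % eigenfunction model
% During the filter's transient just after n = 100, y and y_eig differ
% markedly; well after the onset they agree, consistent with Figure 9.27.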

Appendix 9.B: Derivation of Optimal Cubic Phase Parameters

The continuous-time cubic phase and frequency model for a sinewave over an analysis time interval [0, T] is given by (where l is the frame index)

(9.46)

θ(t) = θl + Ωl t + αt² + βt³,  with instantaneous frequency dθ(t)/dt = Ωl + 2αt + 3βt²,

where we assume the initial phase θ(0) = θl and the initial frequency dθ/dt(0) = Ωl are known. We also assume a known endpoint phase (modulo 2π) and frequency, so that at t = T

θ(T) = θl+1 + 2πM  and  dθ/dt(T) = Ωl+1,  where M is an integer.

Therefore,

αT² + βT³ = θl+1 − θl − ΩlT + 2πM
2αT + 3βT² = Ωl+1 − Ωl,

which can be written in matrix form as

[ T²   T³ ] [ α ]   [ θl+1 − θl − ΩlT + 2πM ]
[ 2T  3T² ] [ β ] = [ Ωl+1 − Ωl              ]

giving the solution for the unknown cubic parameters

[ α(M) ]   [  3/T²   −1/T  ] [ θl+1 − θl − ΩlT + 2πM ]
[ β(M) ] = [ −2/T³    1/T² ] [ Ωl+1 − Ωl              ]

Now let

b1(M) = θl+1 − θl − ΩlT + 2πM  and  b2 = Ωl+1 − Ωl.

We can then write

(9.47)

α(M) = (3/T²) b1(M) − (1/T) b2
β(M) = −(2/T³) b1(M) + (1/T²) b2.

Suppose the frequency is constant. Then Ωl+1 = Ωl and

b2 = 0, while for the value of M satisfying 2πM = θl + ΩlT − θl+1 we have b1(M) = 0, so that α(M) = β(M) = 0 and θ(t) = θl + Ωl t, i.e., a linear phase,

which motivates, for slowly changing frequencies, selection of the value of M that minimizes

f(M) = ∫_{0}^{T} [d²θ(t; M)/dt²]² dt

which, using Equation (9.46), can be expressed as

f(M) = ∫_{0}^{T} [2α(M) + 6β(M)t]² dt.

Now, from Equation (9.47),

Image

and letting

Image

we have

(9.48)

Image

Then

(9.49)

Image

Let g(M) equal the bracketed expression in Equation (9.49). Then, to minimize f(M) we must minimize g(M) with respect to the integer value M. To simplify this difficult nonlinear optimization problem, we assume a continuous argument x for g(x), minimize g(x) with respect to x, and then pick the closest integer to the optimizing solution. Therefore, we want

(9.50)

dg(x)/dx |_{x = x*} = 0.

Substituting Equation (9.48) into Equation (9.50), it can then be shown that the optimizing value x* is given by

Image

To reduce x* further, the denominator and numerator of x* are written as, respectively (with appropriate substitutions from the above expressions),

Image

from which we have

Image

Finally, substituting for b1 and b2

x* = (1/2π)[(θl + ΩlT − θl+1) + (Ωl+1 − Ωl)(T/2)]

from which the nearest integer M* is chosen.
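The complete computation is compact enough to summarize in a short MATLAB sketch based on the equations reconstructed above; the variable names are illustrative and the routine is a sketch rather than a reference implementation.

function [alpha, beta, Mstar] = cubic_phase(theta_l, Omega_l, theta_lp1, Omega_lp1, T)
% Maximally smooth cubic phase interpolation between two frames:
% pick the unwrapping integer M* and solve for the cubic coefficients.
xstar = ((theta_l + Omega_l*T - theta_lp1) + (Omega_lp1 - Omega_l)*T/2) / (2*pi);
Mstar = round(xstar);                        % nearest integer to x*
b1 = theta_lp1 - theta_l - Omega_l*T + 2*pi*Mstar;
b2 = Omega_lp1 - Omega_l;
alpha =  (3/T^2)*b1 - (1/T)*b2;              % Equation (9.47) at M = M*
beta  = -(2/T^3)*b1 + (1/T^2)*b2;
end
% The interpolated phase over the frame is then
% theta(t) = theta_l + Omega_l*t + alpha*t.^2 + beta*t.^3,  0 <= t <= T.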

EXERCISES

9.1 Consider analyzing with the STFT a sequence x[n] consisting of a single sinewave, i.e.,

x[n] = A cos (ωn).

Show that normalizing the analysis window w[n] used in the STFT according to

Σn w[n] = 1

yields a spectral peak of value A/2, i.e., half the amplitude of the underlying sinewave. Hint: Use the windowing (multiplication) theorem that was reviewed in Chapter 2.
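As a quick numerical check of this claim (not a substitute for the requested derivation), the following MATLAB fragment normalizes a Hamming window to unit sum and verifies that the spectral peak is approximately A/2; the parameter values are arbitrary.

A = 1.7; omega = 2*pi*0.1; Nw = 201; n = (0:Nw-1)';
w = hamming(Nw); w = w / sum(w);     % normalize so that sum(w) = 1
x = A * cos(omega * n);
X = fft(w .* x, 4096);               % zero-padded DFT of windowed sinewave
fprintf('peak = %.4f, A/2 = %.4f\n', max(abs(X)), A/2);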

9.2 Consider reconstruction of a single sinewave in Example 9.3 from spectral peaks of the STFT.

(a) Describe an approach to short-time synthesis by estimating the phase at the spectral peaks over successive frames. What problem in sinewave frequency and phase estimation do you encounter with a DFT implementation of the STFT? Hint: Use the result that the STFT phase at the spectral peaks represents the phase offset of the sinewave relative to the center of the analysis window. Assume each short-time segment is shifted to the region 0 ≤ n < Nw, where Nw is the window length. Also consider that the sinewave frequency can be estimated no more accurately than half the distance between two consecutive DFT samples.

(b) Suppose now that the sinewave parameters are time-varying. Explain why an even more acute waveform discontinuity arises with your approach from part (a).

9.3 Propose a source/filter speech model that addresses the eigenfunction approximation problem at speech transitions described in Appendix 9.A.

9.4 One approach to phase interpolation, as described in Section 9.4.1, is through a cubic polynomial. An alternative is to use a piecewise quadratic interpolator, as illustrated in Figure 9.28, which shows the phase function in continuous time for one frame from t = 0 to t = T. In this interpolator, the frequency is linear over each half segment with the midpoint frequency being unknown. Assume the phase has been unwrapped up to time t = 0. The midpoint frequency is a free parameter which will be selected so the phase measurement θ(T), at the right boundary, is met. On two consecutive frames, the measurements θl (unwrapped), θl+1 (modulo 2π), Ωl, and Ωl+1 are known.

(a) Suppose for the moment that 2πM is also known. Derive an expression for θ(T) in terms of θl, Ωl, Ωl+1, and the unknown midpoint frequency Image. Then determine Image, assuming again that M is known.

Figure 9.28 Piecewise quadratic phase interpolator over one synthesis frame: (a) phase; (b) frequency [derivative of (a)].

Image

(b) From your answer in part (a), θ(t) is now a function only of the unknown integer M. Propose an optimization criterion for selecting M such that the resulting unwrapped phase over the interval [0, T] is “maximally smooth.” Recall the derivation of the cubic phase interpolation. Do not solve.

9.5 An alternative to the cubic (Section 9.4.1) and quadratic (Exercise 9.4) phase interpolator used in sinewave synthesis is given in Figure 9.29, which shows the continuous-time phase over one synthesis frame from t = 0 to t = T. On two consecutive frames, the measurements θl, θl+1, Ωl, and Ωl+1 are given. In this interpolator, the frequency is assumed piecewise constant over three equal intervals. Assume the phase θ(0) = θl has been unwrapped up to time t = 0, and the measurement θl+1 is made modulo 2π. The mid-frequency value Image is a free parameter that will be selected so that the phase measurement θ(T) at the right boundary is met.

(a) Suppose for the moment that 2πM is known. Derive an expression for Image in terms of the known quantities so that the phase θ(T) = θl+1 + 2πM is achieved.

(b) Give an expression for the phase derivative Image over the interval [0, T] in terms of Image, Ωk, and Ωk+1. In order to select the unknown M, one can minimize the following criterion:

Image

where ƒ(t) is a linearly interpolated frequency between the known frequency measurements Ωl and Ωl+1:

Image

Figure 9.29 Piecewise linear-phase interpolator: (a) phase; (b) frequency [derivative of (a)].

Image

Find the value of M (the closest integer to the continuous minimizer) that minimizes E(M). You are not asked to solve the equations that result from the minimization. Explain why this is a reasonable criterion for selecting M.

(c) Do you think the step discontinuities in frequency in Figure 9.29 would result in an annoying perceptual effect? Explain your reasoning.

9.6 In Section 9.4.2, we described an overlap-add approach to sinewave synthesis, one which does not give high-quality synthesis for a long frame interval. Consequently, overlap-add interpolation with the generation of an artificial mid-frame sinewave to reduce discontinuities in the resynthesized waveform was suggested. Consider the problem of finding an interpolated sinewave parameter set for a pair of matched sinewaves when a 20-ms frame interval is too long for high-quality synthesis. At frame l a sinewave is generated with amplitude Al, frequency Ωl, and phase θl, and a 20-ms triangular window is applied. The process is repeated 20 ms later for the sinewave at frame l + 1. The problem is to find the amplitude, frequency, and phase of a sinewave to apply at the midpoint between frames l and l + 1 such that overlapping and adding it with the sinewaves at frames l and l + 1 results in a “best fit.” If Image, Image and Image represent the mid-point amplitude, frequency, and phase, then a reasonable choice for the mid-point amplitude is simply the average amplitude

Image

Suppose the time evolution of the phase is described by the cubic phase interpolator, Equation (9.21), i.e.,

Image

where α(M*) and β(M*) are computed from Equation (9.20). Propose reasonable estimates for the midpoint frequency and phase, considering how well the new sinewaves match their overlapping neighbors. Determine the computational complexity, including the specific operations required, of this scheme relative to simply decreasing the analysis frame interval to 10 ms.

9.7 Explain, conceptually, the differences between the baseline sinewave analysis/synthesis and the phase vocoder. Compare the two methods for time-scale modification with respect to waveform dispersion and sinewave frequency estimation and resolution. Compare also refinements to the two approaches that have been described in this and the previous chapter in achieving phase coherence with time-scale modification.

9.8 Consider the sinewave frequency matching algorithm in Section 9.3.4. Suppose that the frequency matching interval Δ is zero so that, in effect, no matching is performed.

(a) Over how many synthesis frames does each sinewave persist? Hint: The resulting sinewave frequencies are constant over this duration. Also, the sinewave birth and death process of Section 9.3.4 occurs very often.

(b) Consider a frequency set for two consecutive analysis frames l and l + 1: {ωk}l and {ωk}l+1. Show that when the matching interval Δ is zero, the sinewave synthesis over these two consecutive frames is equivalent to an overlap-and-add (OLA) procedure where the window weight is triangular and has duration equal to twice the synthesis frame interval.

(c) Argue why the overlap-add view of synthesis of parts (a) and (b) helps explain the ability of the sinewave analysis/synthesis to represent short-lived and rapidly varying speech events such as plosives and unvoiced-to-voiced transitions.

9.9 We saw in Chapter 7 that digital filtering of a sequence x[n] could be performed with the short-time Fourier transform by way of the operation

Y (n, ω) = X (n, ω) H (n, ω)

where H (n, ω) is a time-varying multiplicative modification. For FBS synthesis, the corresponding time-domain modification h[n, m] is weighted by the analysis window with respect to the variable m prior to convolution with x[n]; in OLA synthesis, the time-domain modification is convolved with the window with respect to the variable n prior to convolution with x[n]. This problem asks you to consider the effect of spectral modifications in sinewave analysis/synthesis.

(a) Given the sinewave parameter sets {Ak}, {ωk}, and {θk} on frame l, consider the multiplicative modifier H(lL, ωk). Letting

Image

we do sinewave synthesis from the modified parameter sets Image, Image and {ωk}. Suppose H (lL, ωk) is time-invariant, i.e.,

(9.51)

H(lL, ωk) = H(ωk)  for all frames l.

Write an approximate expression for the modified time-domain sequence in terms of the original sequence x[n] and the time-domain filter h[n]. Assume we know the sinewave frequencies ωk exactly.

(b) Now write an approximate expression for the modified time-domain sequence in terms of the original sequence x[n] and the time-domain time-varying filter h[n, m]. Assume we know the sinewave frequencies ωk exactly.

(c) Assume the sinewave frequencies are obtained by peak-picking the amplitude of the DFT samples. How does this affect your results from part (a) and part (b)?

9.10 Given a transmission channel bandwidth constraint, it is sometimes desirable to compress a signal’s spectrum, a procedure referred to as “frequency compression.” This problem explores a method of frequency compression based on sinewave analysis/synthesis. Suppose we have performed sinewave analysis in continuous time. We can then synthesize a signal as

s(t) = Σ_{k=1}^{K} ak(t) cos[θk(t)]

where the phase for the kth sinewave, θk(t), is the integral of the kth sinewave’s instantaneous frequency. Over each synthesis frame, the phase is assumed to have a cubic trajectory with some initial phase offset at the birth of a sinewave.

(a) Suppose that in sinewave synthesis, we scale the phase function by a scale factor B, i.e.,

θ̃k(t) = B θk(t)

so that

Image

Show that the frequency trajectories of the resulting sinewaves are scaled by B. What is the resulting pitch if the original sinewaves are harmonic with fundamental frequency Ωo?

Figure 9.30 Spectral and waveform illustrations for Exercise 9.10: (a) vocal tract spectrum and harmonic lines; (b) periodic speech waveform with rectangular window.

Image

(b) Suppose s(t) is perfectly voiced with vocal tract spectrum and harmonic lines as shown in Figure 9.30a. Also, suppose a rectangular window, of duration twice the pitch period, is applied to s(t) as illustrated in Figure 9.30b. Sketch the magnitude of the STFT of s(t). Assume the scale factor B = 1/2. Then sketch the magnitude of the STFT of Image [in contrast to s(t)], assuming the window duration is twice the new pitch period. Comment on the shape of Image in comparison to s(t), considering waveform dispersion over a single pitch period.

(c) Suppose we transmit the waveform Image from part (b) over a channel with half the bandwidth of s(t) and we want to reconstruct the original full-bandwidth waveform s(t) at a receiver. One approach to recover s(t) from Image is to invert the frequency compression, i.e., perform “frequency expansion,” using sinewave analysis/synthesis and modification with a scale factor of B = 2 on the cubic phase derived from Image. Assuming steady voiced speech, what is the duration of the rectangular analysis window (relative to a pitch period) required at the receiver to obtain sinewave amplitudes, phases, and frequencies from Image? Explain your reasoning.

(d) Do you think the method of recovering s(t) from Image in part (c) suffers from waveform dispersion? Briefly explain your reasoning. Given the time-varying nature of speech, what fundamental limitation might our bandwidth reduction scheme suffer from? Are we really getting “something for nothing”?

9.11 In pitch modification, it is desired to modify the pitch of a speaker while preserving the vocal tract transfer function. For voiced speech, the pitch is modified by multiplying the sinewave frequencies by a factor β prior to sinewave synthesis, i.e., Ω′k = βΩk, with an associated change in the pitch period P given by

P′ = P/β.

The change in pitch period corresponds to modification of the onset times of an assumed impulse train input (Section 9.5.1).

(a) Propose a sinewave model for the pitch-modified excitation function in terms of onset times and sinewave frequencies for voiced speech. Hint: This model is a simple variation of the excitation model in Equation (9.30) and involves a scaling of the original onset times and frequencies.

(b) Consider now the magnitude and phase of the system function H(t, Ω) = M(t, Ω) exp[jΦ(t, Ω)]. Propose a model for the system magnitude and phase contribution to a sinewave model for the pitch-modified voiced speech.

(c) Propose a sinewave model of pitch-modified unvoiced speech as a variation of the original sinewave model for unvoiced speech. Hint: Consider the changes that might occur in natural speech in generating noise-like and impulsive sources with pitch modification. Allow for adaptivity to the degree of voicing in your model and, in particular, assume a two-band spectrum with the upper band “voiced” and the lower band “unvoiced.”

(d) In implementing sinewave analysis/synthesis for pitch modification we must estimate the amplitudes and phases of the excitation and system components at the scaled frequencies Ω′k = βΩk from the sinewave amplitudes and phases at the original sinewave frequencies; these parameters are then interpolated over consecutive frames prior to sinewave generation. The excitation and system amplitude functions are first considered. To preserve the shape of the magnitude of the vocal tract transfer function, the system amplitudes must be obtained at the new frequency locations, βΩk. Propose a means to estimate the required excitation and system amplitude values from amplitudes at the measured sinewave frequencies. Hint: You are not allowed to simply shift the amplitude values of the original sinewaves obtained by peak-picking because this will modify the original formant structure and not simply the pitch.

(e) Consider now sinewave phases in implementing the pitch modification model. Propose first a method to implement the onset-time model from part (a) for the modified excitation, similar in spirit to that used for time-scale modification. Next, propose a means to estimate the system phase values from phases at the measured sinewave frequencies. As with the system amplitude, to preserve the shape of the phase of the vocal tract transfer function, the system phases must be computed at the new frequency locations, βΩk. Again, you are not allowed to simply shift the phase values of the original sinewaves obtained by peak-picking because this will modify the original formant phase structure and not simply the pitch. Also, keep in mind that this is a more difficult problem than the corresponding one for the system amplitude because measured phase values are not unwrapped with respect to frequency.

(f) Using parts (a)-(e), propose a sinewave model and analysis/synthesis system for performing pitch and time-scale modification simultaneously.

9.12 Consider applying sinewave analysis/synthesis to slowing down the articulation rate of a speaker by a factor of two. The goal is to perform the modification without loss in phase coherence. This problem tests your understanding of the steps we have described in Section 9.5.2.

(a) Using the baseline analysis/synthesis system, if analysis is performed with a 10-ms frame interval, what is the corresponding synthesis frame interval over which amplitude and phase interpolation is performed?

(b) Explain your method of doing sinewave amplitude interpolation across the center of two consecutive frames.

(c) In generating the time-scaled phase, it is necessary to decompose the excitation and vocal tract phase functions. How do you perform this decomposition in the original time scale? The resulting system phase can be used in the new time scale. The excitation phase, however, must be obtained relative to the new time scale. Explain how, from the excitation onset times in the new time scale, you can obtain the linear excitation phase.

(d) Finally, give the composite phase at each frame center and describe the method of interpolation.

(e) Observe that we could have formed (in the modified time scale) an excitation phase Image (why does this preserve pitch?), added this phase to the decomposed system phase at a frame center, and performed interpolation across frame boundaries in the new time scale. What is wrong with this approach for achieving phase coherence?

9.13 Suppose a 20-ms analysis window and a 10-ms frame interval are used in sinewave analysis/synthesis during unvoiced speech.

(a) Argue qualitatively that estimating sinewave amplitudes, frequencies, and phases by peak-picking the STFT results in a high-quality, noise-like synthesized waveform during unvoiced speech. Do not use the argument given in Section 9.3.2. Create your own argument.

(b) In doing time-scale expansion with sinewave analysis/synthesis, consider the preservation of the noise-like quality of unvoiced sounds. What problems arise in excessive stretching of sinewaves and how would you remedy these problems?

9.14 In the estimation of the sinewave model parameters (i.e., amplitudes, frequencies, and phases), the amplitudes and frequencies are held constant over the duration of the analysis frame. In synthesis, however, these parameters are assumed to vary over the synthesis interval.

(a) Suggest an approach to remove this inconsistency between analysis and synthesis.

(b) Consider now all-pole (linear prediction) analysis/synthesis of Chapter 5. Describe an inconsistency between all-pole analysis and synthesis similar in style to that of the sinewave analysis/synthesis. Propose a method to remove this inconsistency. Hint: See Exercise 5.13 in Chapter 5.

9.15 Observe that the overlap and add method of Section 9.4.2 can be interpreted as synthesizing a signal from a modified STFT corresponding to a sum of triangularly windowed sinewaves. Thus, we might consider the least-squared-error approach of Chapter 7 which results in a weighting of a short-time segment by the window and a normalization by the summed squared, shifted windows. Discuss the possible advantage of this method in the sinewave context. Note that you are implicitly comparing the overlap-and-add (OLA) and least-squared-error (LSE) methods of Chapter 7. Hint: Consider overlapping regions of low amplitude near the window edges.

9.16 We have seen that often it is desirable to do time-varying time-scale modification. Consider, for example, time-scale modification of a plosive followed by a vowel, where it may be desirable to maintain the original time-scale of the plosive and modify the time scale of the vowel. Another example is taken from music modification where it is often imperative to maintain the attack and release portion of the note, while lengthening or shortening the steady portion of the note; a change in the attack or release can radically alter the character of the sound.22 An idealization of such a modification of a musical note is shown in Figure 9.31a. Suppose that the original tone is, as illustrated, a pure (complex) tone of the form

22 For example, it has been said that replacing the attack of a flute with the attack of a clarinet can turn a flute into a clarinet.

x(t) = a(t) exp[jθ(t)]

where

θ(t) = Ωt,

Ω being the frequency of the tone. Our goal is to stretch by a factor of two the steady portion of the tone, while maintaining phase and frequency continuity at its edges.

(a) One approach is to simply stretch out the amplitude and phase functions to form a(t/2) and θ(t/2) during the steady portion and keep these functions intact during the attack and release. Stretching the phase in this way is shown in Figure 9.31b. Show that this operation on the phase lowers the frequency by a factor of two during the steady region.

(b) Show that one approach to preserve frequency during the steady portion is to scale θ(t/2) by 2. To preserve phase continuity at the attack, we also displace the resulting 2θ(t/2) by a constant k, as illustrated in Figure 9.31c. (Assume you know the value of k.) Unfortunately, this implies, as shown, generally a phase discontinuity at the release. Explain why this discontinuity may occur. Henceforth, we denote the three phase segments in the new time scale by Image, Image, and Image, as depicted in Figure 9.31c. One way to reduce the discontinuity between Image and Image is to add a 2π multiple to the third phase segment Image so that

Image

However, there is no guarantee of phase and frequency continuity at the boundary. We can, of course, shift Image so that it matches Image at the boundary, but this generally changes the principal phase value (i.e., the phase calculated modulo 2π) of the release which we want to preserve. Why is this principal phase value modified?

(c) An alternate approach to preserve both phase and frequency continuity at the boundaries of the attack and release, henceforth indicated by times t = 0 and t = T, respectively, is to add a “perturbation function,” as in Figure 9.31d. Denote the perturbation function by Image and also make it a cubic polynomial of the form

Image

Then the new phase function over the stretched interval [0, T] is given by

Image

For boundary phase and frequency continuity, there are four constraints that θ2(t) must satisfy:

1. Image

2. Image

3. Image

4. Image

Justify the four constraints.

Figure 9.31 Illustrations of time-varying time-scale expansion in Exercise 9.16: (a) idealized stretching of musical tone; (b) simple stretching of phase of steady region only; (c) stretching with frequency preservation and phase continuity at the attack; (d) adding a phase perturbation function.

Image

(d) Suppose for the moment that the integer M is known. Solve for the unknown cubic parameters a, b, c, and d such that all four constraints are satisfied. The solution will be a function of the known phase and frequency boundary values as well as the integer value M.

(e) We must now determine the value M. Propose an error criterion which is a function of M and which, when minimized with respect to M, gives a small frequency perturbation over the interval [0, T]. Do not solve the minimization.

9.17 An unsolved problem in speech processing is speaker separation, i.e., the separation of two speech waveforms that are additively combined; e.g., in speech enhancement, a low-level speaker may be sought in the presence of a loud interfering talker. In this problem, you use a sinewave representation of speech to solve a simplified version of the speaker separation problem.

A speech waveform generated by two simultaneous talkers can be represented by a sum of two sets of sinewaves, each with time-varying amplitudes, frequencies, and phases. In the steady-state case where the source and vocal tract characteristics are assumed fixed over an analysis time interval, the sinewave model is given by

(9.52)

x[n] = xa[n] + xb[n]

with

xa[n] = Σ_{k=1}^{Ka} ak cos(ωa,k n + θa,k)  and  xb[n] = Σ_{k=1}^{Kb} bk cos(ωb,k n + θb,k)

where the sequences xa[n] and xb[n] denote the speech of Speaker A and the speech of Speaker B, respectively. The amplitudes and frequencies associated with Speaker A are denoted by ak and ωa,k, and the phase offsets by θa,k. A similar parameter set is associated with Speaker B.

(a) Let s[n] represent a windowed speech segment extracted from the sum of two sequences

s[n] = w[n](xa[n] + xb[n])

where the analysis window w[n] is non-zero over a duration Nw and is assumed centered and symmetric about the time origin. With the model Equation (9.52), show that the Fourier transform of the windowed summed waveforms, S(ω), for ω ≥ 0, is given by the summation of scaled and shifted versions of the transform of the analysis window W(ω):

(9.53)

S(ω) = Σ_{k=1}^{Ka} ak exp(jθa,k) W(ω − ωa,k) + Σ_{k=1}^{Kb} bk exp(jθb,k) W(ω − ωb,k)

For the sake of simplicity, assume the negative frequency contribution is negligible and any scale factors have been embedded within the window.

(b) Consider the sum of two voiced segments. Figure 9.32 shows an example of the short-time spectral magnitude and phase of two actual voiced segments and of their sum. Assuming a Hamming window about two pitch periods in duration, justify the approximately piecewise-flat phase functions in panels (a) and (b) of the figure (Hint: Recall that w[n] is assumed to be centered at the time origin.) and justify the spectral magnitude and phase characteristics you see in panel (c).

Figure 9.32 Properties of the STFT of the sum of two vowels of different speakers, x[n] = xa[n]+xb[n]: (a) STFT magnitude and phase of xa[n]; (b) STFT magnitude and phase of xb[n]; (c) STFT magnitude and phase of xa[n] + xb[n]. The vowels have roughly equal intensity and belong to two voices with dissimilar fundamental frequency. The duration of the short-time window is 25 ms.

SOURCE: T.F. Quatieri and R.J. McAulay, “Audio Signal Processing Based on Sinusoidal Analysis/Synthesis,” chapter in Applications of Digital Signal Processing to Audio and Acoustics [41] (Figure 9.23). ©1998, Kluwer Academic Publishers. Used by permission.

Image

(c) Assume now that you are given the frequencies ωa, k and ωb, k of Speakers A and B, respectively, and that you know which frequencies belong to each speaker. (The frequencies might be estimated by joint pitch estimation.) Explain why picking peaks of the spectral magnitude of the composite waveform in panel (c) of Figure 9.32 near the harmonics of each speaker is the basis for a flawed signal separation scheme. Suppose, on the other hand, that the spacing between the shifted versions of W(ω) in Equation (9.53) is such that the main lobes do not overlap. Briefly describe a sinewave-based analysis/synthesis scheme that allows you to separate the two waveforms xa[n] and xb[n].

(d) Because the frequencies of Speaker A and B may be arbitrarily close, and because the analysis window cannot be made arbitrarily long due to the time-varying nature of speech, the constraint of part (c) that window main lobes do not overlap is typically not met. One approach to address this problem is to find a least-squared error fit to the measured two-speaker waveform:

(9.54)

ε = Σn (y[n] − {xa[n] + xb[n]})²

where xa[n] and xb[n] are the sinewave models in Equation (9.52), where y[n] is the measured summed waveform, and where the minimization takes place with respect to the unknown sinewave parameters ak, bk, θa,k, θb,k, ωa,k, and ωb,k. We transform this nonlinear problem of forming a least-squared-error solution for the sinewave amplitudes, phases, and frequencies into a linear problem. We accomplish this by assuming, as above, that the sinewave frequencies are known, and by solving for the real and imaginary components of the quadrature representation of the sinewaves, rather than solving for the sinewave amplitudes and phases. Part (a) suggests that these parameters can be obtained by exploiting the linear dependence of S(ω) on scaled and shifted versions of the Fourier transform of the analysis window.

Suppose that Speaker A’s waveform consists of a single (complex) sinewave with parameters A1, ω1, and θ1, namely, xa[n] = A1 exp[j(ω1n + θ1)], and that Speaker B’s waveform also consists of a single (complex) sinewave with parameters A2, ω2, and θ2, namely, xb[n] = A2 exp[j(ω2n + θ2)]. Figure 9.33 illustrates S(ω) of Equation (9.53) for this simple case as the sum of the Fourier transforms of the windowed xa[n] and xb[n], given by Xa(ω) and Xb(ω), respectively.

Using the spectral interpretation in Figure 9.33, solve the least-squared-error minimization of Equation (9.54) by solving for the samples of the Fourier transform of the separate waveforms at the known frequencies ω1 and ω2. Assume the Fourier transform of the analysis window, W(ω), is normalized so that W(0) = 1. Do not solve; just set up the equations in matrix form. (Hint: Consider the use of two linear equations in two unknowns.) From Xa(ω1) and Xb(ω2), describe a method for reconstruction of xa[n] and xb[n].

(e) Generalize the approach of part (d) to approximately solve the minimization problem in Equation (9.54) for the case of two arbitrary summed speech waveforms. Assume again that you know the frequencies of each speaker. Do not explicitly solve for the unknown sinewave parameters; rather, describe your approach and develop a set of linear equations in Ka + Kb unknowns in matrix form. Describe the limitation of this approach as frequencies of voice A come arbitrarily close to those of voice B. Propose a scheme to address this problem by exploiting separation solutions from neighboring frames and frequencies.

Figure 9.33 Least-squared-error solution for two sinewaves. S(ω) is the STFT of the composite waveform, while Xa(ω) and Xb(ω) are the STFTs of the component waveforms.

SOURCE: T.F. Quatieri and R.G. Danisewicz, “An Approach to Co-Channel Talker Interference Suppression Using a Sinusoidal Model for Speech” [45]. ©1990, IEEE. Used by permission.

Image

9.18 As illustrated in Figure 9.25, the KFH phase typically traverses a very large range (e.g., from 0 to 300 radians over a bandwidth of 5000 Hz). This phase calculation can thus be sensitive to small measurement errors in pitch or spectrum.

(a) For unity spectral magnitude and a one-sample error in the pitch period, derive the resulting change in the phase at ω = π. Explain how this sensitivity might affect synthesis quality.

(b) To reduce large frame-to-frame fluctuations in the KFH phase, both the pitch and the spectral envelope, used by the KFH solution, are smoothed in time over successive analysis frames. The strategy for adapting the degree of smoothing to signal characteristics is important for maintaining dispersion through rapidly changing speech events and for preserving the original phase in unvoiced regions where dispersion is unwanted. Suppose you are given a measure of the degree of voicing, and spectral and pitch derivatives which reflect the rate at which these parameters are changing in time. Propose a technique of smoothing the KFH phase over successive frames in time that reduces quality degradation due to the phase sensitivity shown in part (a).

9.19 Suppose you are given a set of sinewave frequencies ω1, ω2, … ωK and amplitudes A1, A2, … AK and assume the corresponding phases are zero. Working only in the frequency domain, the objective of this problem is to generate a windowed sum of sinewaves of the form

x[n] = w[n] Σ_{k=1}^{K} Ak cos(ωk n)

where w[n] is an analysis window. The frequencies do not necessarily fall on DFT coefficients.

(a) Create a discrete Fourier transform (DFT), using the DFT of the window and the given frequencies and amplitudes, which when inverted yields the above windowed sum of sinewaves. Keep in mind that the frequencies do not necessarily fall on DFT coefficients, i.e., the frequencies do not necessarily fall at (2π/N)k, where N is the DFT length.

(b) Suppose the frequencies are multiples of a fundamental frequency. Describe the shape of the waveform x[n]. How will the shape change when the sinewave phases are made random, i.e., fall in the interval [−π, π] with uniform probability density?

9.20 (MATLAB) In this MATLAB exercise, use workspace ex9M1.mat as well as the function peak_pick.m located in companion website directory Chap_exercises/chapter9. This exercise leads you through a (simplified) sinewave analysis/synthesis of a steady-state vowel.

(a) Plot the speech waveform speech1_10k (located in ex9M1.mat) and apply a 25-ms Hamming window to its center. (Think of this center point as the time origin n = 0.) Compute the DFT of the windowed waveform with a 1024-point FFT and display its log-magnitude spectrum (over 512 points).

(b) Do a help command on peak_pick.m and apply peak_pick.m to the FFT magnitude of part (a) (only the first 512 points). Sample the (complex) FFT at the peak locations, which is an output of peak_pick.m, and save these locations and corresponding FFT magnitudes and phases in a 3-D array. Superimpose the log-magnitude of the FFT peaks onto the original log-magnitude from part (a). One way to do this is to create a vector which is zero except at the peak locations at which the vector has values equal to the peak log-magnitudes. When plotted, this new vector will have a harmonic “line” structure.

(c) Write a MATLAB function to generate the sum of steady sinewaves with magnitudes and frequencies from part (b). The phase offsets of each sinewave should be such that at the time origin, n = 0 (center of original waveform), the sinewaves take on the measured phases from part (b), i.e., the synthesized waveform should be of the form

x̂[n] = Σ_{k=1}^{K} Ak cos(ωk n + θk)

where Ak, ωk, and θk are the measured magnitudes, frequencies, and phases at the original waveform center (n = 0). Plot the synthesized waveform over a duration equal to the length of speech1_10k. How well does the synthesized waveform match the original? Explain any deviation from the original. Note that in analysis, from Section 9.3.3, you need to circularly shift the analysis window to the time origin (with respect to the DFT) to avoid large phase quantization errors. Otherwise, you will not achieve an adequate waveform match.

(d) Repeat part (c) with the phase offsets for all sinewaves set to zero, i.e.,

x̂[n] = Σ_{k=1}^{K} Ak cos(ωk n)

How does the waveform shape and its aural perception (using sound.m) change relative to your result in part (c)? Try also a random set of phase offsets, using the MATLAB function rand.m or randn.m, with normalization to the interval [−π, π]. In both cases, justify your observations.

BIBLIOGRAPHY

[1] L.B. Almeida and F.M. Silva, “Variable-Frequency Synthesis: An Improved Harmonic Coding Scheme,” Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, San Diego, CA, pp. 27.5.1–27.5.4, 1984.

[2] D. Anderson, “Speech Analysis and Coding Using a Multi-Resolution Sinusoidal Transform,” Proc. IEEE Conf. Acoustics, Speech, and Signal Processing, Atlanta, GA, May 1996.

[3] E.R. Banga and C. Garcia-Mateo, “Shape-Invariant Pitch-Synchronous Text-to-Speech Conversion,” Proc. IEEE Int. Conf. Acoustics Speech, and Signal Processing, Detroit, MI, vol. 4, pp. 656–659, April 1995.

[4] B.A. Blesser, “Audio Dynamic Range Compression for Minimum Perceived Distortion,” IEEE Trans. Audio and Electroacoustics, vol. AU–17, no. 1, pp. 22–32, March 1969.

[5] A.S. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound, The MIT Press, Cambridge, MA, 1990.

[6] C.E. Cook and M. Bernfeld, Radar Signals, Academic Press, New York, NY, 1967.

[7] R.G. Danisewicz, “Speaker Separation of Steady State Vowels,” Masters Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, June 1987.

[8] D.P. Ellis, A Perceptual Representation of Sound, Masters Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Feb. 1992.

[9] D.P. Ellis, B.L. Vercoe, and T.F. Quatieri, “A Perceptual Representation of Audio for Co-Channel Source Separation,” Proc. 1991 Workshop on Applications of Signal Processing to Audio and Acoustics, Mohonk Mountain House, New Paltz, NY, 1991.

[10] E.N. Fowle, “The Design of FM Pulse-Compression Signals,” IEEE Trans. Information Theory, vol. IT–10, no. 10, pp. 61–67, Jan. 1964.

[11] E.B. George, An Analysis-by-Synthesis Approach to Sinusoidal Modeling Applied to Speech and Music Signal Processing, Ph.D. Thesis, Georgia Institute of Technology, Nov. 1991.

[12] O. Ghitza, “Speech Analysis/Synthesis Based on Matching the Synthesized and the Original Representations in the Auditory Nerve Level,” Proc. of IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Tokyo, Japan, pp. 1995–1998, 1986.

[13] M.M. Goodwin, Adaptive Signal Models: Theory, Algorithms, and Audio Applications, Kluwer Academic Publishers, Boston, MA, 1992.

[14] D. Griffin and J.S. Lim, “Multi-band Excitation Vocoder,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–36, no. 8, pp. 1223–1235, 1988.

[15] K.N. Hamdy, M. Ali, and A.H. Tewfik, “Low Bit Rate High Quality Audio Coding with Combined Harmonic and Wavelet Representations,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Atlanta, GA, vol. 2, pp. 1045–1048, May 1996.

[16] P. Hedelin, “A Tone-Oriented Voice-Excited Vocoder,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Atlanta, GA, pp. 205–208, 1981.

[17] H.L.F. von Helmholtz, On the Sensations of Tone, translated by A.J. Ellis for Longmans & Co., 1885, Dover Publications, Inc., New York, NY, 1952.

[18] D.J. Hermes, “Synthesis of Breathy Vowels: Some Research Methods,” Speech Communications, vol. 10, pp. 497–502, 1991.

[19] J.M. Kates, “Speech Enhancement Based on a Sinusoidal Model,” J. Speech and Hearing Research, vol. 37, pp. 449–464, April 1994.

[20] E.L. Key, E.N. Fowle, and R.D. Haggarty, “A Method of Pulse Compression Employing Nonlinear Frequency Modulation,” Technical Report 207, DDC 312903, Lincoln Laboratory, Massachusetts Institute of Technology, Aug. 1959.

[21] J. Laroche, Y. Stylianou, and E. Moulines, “HNS: Speech Modification Based on a Harmonic+Noise Model,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Minneapolis, MN, vol. 2, pp. 550–553, April 1993.

[22] S.N. Levine, T.S. Verma, and J.O. Smith, “Alias-Free, Multiresolution Sinusoidal Modeling for Polyphonic, Wideband Audio,” Proc. 1997 Workshop on Applications of Signal Processing to Audio and Acoustics, Mohonk Mountain House, New Paltz, NY, 1997.

[23] M.W. Macon and M.A. Clements, “Sinusoidal Modeling and Modification of Unvoiced Speech,” IEEE Trans. Speech and Audio Processing, vol. 5, no. 6, pp. 557–560, Nov. 1997.

[24] M.W. Macon and M.A. Clements, “Speech Concatenation and Synthesis Using an Overlap-Add Sinusoidal Model,” Proc. IEEE Conf. Acoustics, Speech, and Signal Processing, Atlanta, GA, vol. 1, pp. 361–364, May 1996.

[25] J.S. Marques and L.B. Almeida, “New Basis Functions for Sinusoidal Decomposition,” Proc. EUROCON, Stockholm, Sweden, 1988.

[26] R. Maher and J. Beauchamp, “An Investigation of Vocal Vibrato for Synthesis,” Applied Acoustics, vol. 30, pp. 219–245, 1990.

[27] J.S. Marques and L.B. Almeida, “Sinusoidal Modeling of Speech: Representation of Unvoiced Sounds with Narrowband Basis Functions,” Proc. EUSIPCO, 1988.

[28] J.S. Marques and L.B. Almeida, “Frequency-Varying Sinusoidal Modeling of Speech,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 37, no. 5, pp. 763–765, May 1989.

[29] S. McAdams, “Spectral Fusion, Spectral Parsing, and the Formation of Auditory Images,” Report no. STAN-M–22, CCRMA, Department of Music, Stanford, CA, May 1984.

[30] R.J. McAulay and T.F. Quatieri, “Speech Analysis-Synthesis Based on a Sinusoidal Representation,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–34, no. 4, pp. 744–754, Aug. 1986.

[31] R.J. McAulay and T.F. Quatieri, “Speech Analysis/Synthesis Based on a Sinusoidal Representation,” Technical Report 693, Lincoln Laboratory, Massachusetts Institute of Technology, May 17, 1985.

[32] R.J. McAulay and T.F. Quatieri, “Pitch Estimation and Voicing Detection Based on a Sinusoidal Speech Model,” IEEE Proc. Int. Conf. Acoustics, Speech, and Signal Processing, Albuquerque, NM, vol. 2, pp. 249–252, April 1990.

[33] R.J. McAulay and T.F. Quatieri, “Phase Modeling and Its Application to Sinusoidal Transform Coding,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Tokyo, Japan, pp. 1713–1715, April 1986.

[34] R.J. McAulay and T.F. Quatieri, “Low Rate Speech Coding Based on the Sinusoidal Speech Model,” chapter in Advances in Speech Signal Processing, S. Furui and M.M. Sondhi, eds., Marcel Dekker, 1992.

[35] R.J. McAulay and T.F. Quatieri, “Computationally Efficient Sinewave Synthesis and its Application to Sinusoidal Transform Coding,” IEEE Int. Conf. Acoustics, Speech, and Signal Processing, New York, NY, pp. 370–373, April 1988.

[36] J.A. Moorer, “Signal Processing Aspects of Computer Music,” Proc. IEEE, vol. 65, no. 8, pp. 1108–1137, Aug. 1977.

[37] D. Paul, “The Spectral Envelope Estimation Vocoder,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–29, no. 4, pp. 786–794, Aug. 1981.

[38] T.F. Quatieri and R.J. McAulay, “Speech Transformations Based on a Sinusoidal Representation,” IEEE Trans. Acoustics, Speech, Signal Processing, vol. ASSP–34, no. 6, pp. 1449–1464, Dec. 1986.

[39] T.F. Quatieri and R.J. McAulay, “Shape-Invariant Time-Scale and Pitch Modification of Speech,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 40, no. 3, pp. 497–510, March 1992.

[40] T.F. Quatieri and R.J. McAulay, “Phase Coherence in Speech Reconstruction for Enhancement and Coding Applications,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Glasgow, Scotland, May 1989.

[41] T.F. Quatieri and R.J. McAulay, “Audio Signal Processing Based on Sinusoidal Analysis/ Synthesis,” chapter in Applications of Digital Signal Processing to Audio and Acoustics, M. Kahrs and K. Brandenburg, eds., Kluwer Academic Publishers, Boston, MA, 1998.

[42] T.F. Quatieri and R.J. McAulay, “Peak-to-rms Reduction of Speech Based on a Sinusoidal Model,” IEEE Trans. Signal Processing, vol. 39, no. 2, pp. 273–288, Feb. 1991.

[43] T.F. Quatieri, J. Lynch, M.L. Malpass, R.J. McAulay, and C. Weinstein, “Speech Processing for AM Radio Broadcasting,” Technical Report 681, Lincoln Laboratory, Massachusetts Institute of Technology, Nov. 1991.

[44] T.F. Quatieri, R.B. Dunn, and R.J. McAulay, “Signal Enhancement in AM–FM Interference,” Technical Report 993, Lincoln Laboratory, Massachusetts Institute of Technology, May 1994.

[45] T.F. Quatieri and R.G. Danisewicz, “An Approach to Co-Channel Talker Interference Suppression using a Sinusoidal Model for Speech,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 38, no. 1, pp. 56–69, Jan. 1990.

[46] M.A. Ramalho, The Pitch Mode Modulation Model with Applications in Speech Processing, Ph.D. Thesis, Department of Electrical Engineering, Rutgers University, Jan. 1994.

[47] J.C. Rutledge, Time-Varying, Frequency-Dependent Compensation for Recruitment of Loudness, Ph.D. Thesis, Georgia Institute of Technology, Dec. 1989.

[48] M.R. Schroeder, “Synthesis of Low-Peak-Factor Signals and Binary Sequences with Low Autocorrelation,” IEEE Trans. Information Theory, vol. IT–16, pp. 85–89, Jan. 1970.

[49] M.R. Schroeder, Number Theory in Science and Communication, Springer-Verlag, New York, NY, 2nd Enlarged Edition, 1986.

[50] X. Serra, A System for Sound Analysis/Transformation/Synthesis Based on a Deterministic Plus Stochastic Decomposition, Ph.D. Thesis, CCRMA, Department of Music, Stanford University, 1989.

[51] X. Serra and J.O. Smith, III, “Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic Plus Stochastic Decomposition,” Computer Music Journal, vol. 14, no. 4, pp. 12–24, Winter 1990.

[52] R. Smits and B. Yegnanarayana, “Determination of Instants of Significant Excitation in Speech Using Group Delay Function,” IEEE Trans. Speech and Audio Processing, vol. 3, no. 5, pp. 325–333, Sept. 1995.

[53] E. Tellman, L. Haken, and B. Holloway, “Timbre Morphing of Sounds with Unequal Number of Features,” J. Audio Engineering Society, vol. 43, no. 9, pp. 678–689, Sept. 1995.

[54] C.W. Therrien, R. Cristi, and D.E. Allison, “Methods for Acoustic Data Synthesis,” Proc. IEEE 1994 Digital Signal Processing Workshop, Yosemite National Park, Oct. 1994.

[55] H. Van Trees, Detection, Estimation and Modulation Theory, Part I, John Wiley and Sons, New York, NY, 1968.
