Chapter 13
Speech Enhancement

13.1 Introduction

Throughout the text, we have introduced a number of speech enhancement techniques, including homomorphic deconvolution in Chapter 6 for removing convolutional distortion and a spectral magnitude-only reconstruction approach in Chapter 7 for reducing additive noise. In this chapter, we further develop speech enhancement methods for addressing these two types of distortion. We cannot hope to cover all such methods; rather, we illustrate how certain analysis/synthesis schemes described in the text are used as a basis for enhancement, focusing on signal processing principles.

In this chapter, we judge enhancement techniques in part by the extent to which they reduce additive noise or convolutional distortion, by the severity of artifacts in the remaining (or “residual”) disturbance, and by the degree of distortion in the desired speech signal. Because they typically rely on average speech characteristics, such as average short-time spectra, enhancement algorithms tend to degrade transient and dynamic speech components—for example, plosive fine structure and formant modulation—that contribute significantly to the quality of the speech signal. Therefore, with regard to speech distortion, we are interested in preserving not only slowly varying short-time spectral characteristics, but also instantaneous temporal properties such as signal attack, decay, and modulation. In addition, we judge the enhanced speech by using the subjective and objective measures of quality that were outlined in the introduction of Chapter 12. In subjective quality evaluation, which is ultimately the deciding evaluation in human listening, we consider speech naturalness, intelligibility, and speaker identifiability. Speech enhancement, however, is not always performed for the human listener, but often is performed for the machine recognizer. As an example, in Chapter 14, we will see how many of the techniques in this chapter are applied to improve automatic speaker recognition performance when convolutional and additive disturbances are present.

We begin in Section 13.2 with the Fourier transform and filtering perspectives of the short-time Fourier transform (STFT) as bases for approaches to the problems of reducing additive noise and convolutional distortion. We describe, as a foundation, spectral subtraction that operates on STFT short-time sections for additive noise reduction, and cepstral mean subtraction that operates on STFT bandpass filter outputs for removing stationary convolutional distortion. In Section 13.3, the Wiener filter and its adaptive renditions for additive noise removal are then developed. A variety of Wiener-filter-based approaches are given, including a method that adapts to spectral change in a signal to help preserve signal nonstationarity while exploiting auditory temporal masking, and also a stochastic-theoretical approach to obtain a mean-squared error estimate of the desired spectral magnitude. Sections 13.4 and 13.5 then develop all-pole model-based and further auditory-based approaches to additive noise reduction, respectively. The auditory-based methods use frequency-domain perceptual masking principles to conceal annoying residual noise under the spectral components of interest. Section 13.6 next generalizes cepstral mean subtraction (CMS), introduced in Section 13.2, to reduce time-varying convolutional distortion. This generalization, commonly referred to as RASTA, (along with CMS) can be viewed as homomorphic filtering along the time dimension of the STFT, rather than with respect to its frequency dimension. These approaches represent a fascinating application of homomorphic filtering theory in a domain different from that studied in Chapter 6. Moreover, we will see that CMS and RASTA can be viewed as members of a larger class of enhancement algorithms that filter nonlinearly transformed temporal envelopes of STFT filter outputs.

13.2 Preliminaries

In this section, we first formulate the additive noise and convolutional distortion problems in the context of the STFT, from both the Fourier transform and filtering viewpoints introduced in Chapter 7. With this framework, we then develop the method of spectral subtraction for additive noise suppression and cepstral mean subtraction for reducing a stationary convolutional distortion.

13.2.1 Problem Formulation

Additive Noise — Let y[n] be a discrete-time noisy sequence

(13.1)

y[n] = x[n] + b[n]

where x[n] is the desired signal, which we also refer to as the “object,” and b[n] is the unwanted background noise. For the moment, we assume x[n] and b[n] to be wide-sense stationary, uncorrelated random processes with power spectral density functions (Appendix 5.A) denoted by Sx(ω) and Sb(ω), respectively. One approach to recovering the desired signal x[n] relies on the additivity of the power spectra, i.e.,

(13.2)

Sy(ω) = Sx(ω) + Sb(ω).

With STFT analysis, however, we work with the short-time segments given by

ypL[n] = w[pL − n](x[n] + b[n])

where L is the frame length and p is an integer, which in the frequency domain is expressed as

Y (pL, ω) = X (pL, ω) + B (pL, ω)

where X (pL, ω), B (pL, ω), and Y (pL,ω) are the STFTs of the object x[n], the background noise b[n], and the measurement y[n], respectively, computed at frame interval L. The STFT magnitude squared of y[n] is thus given by

(13.3)

|Y(pL, ω)|² ≈ |X(pL, ω)|² + |B(pL, ω)|²

from which our objective is to obtain an estimate of |X(pL, ω)|2. We think of the relation in Equation (13.3) as the “instantaneous” counterpart to the stochastic Equation (13.2).

In the above approach to signal estimation, we do not estimate the STFT phase.1 Therefore, the best we can do for each short-time segment is an estimate of the form

1 We noted in Chapter 7 (Section 7.4) that STFT phase estimation for speech signals in noise is a more difficult problem than STFT magnitude estimation. This is in part due to the difficulty in characterizing phase in low-energy regions of the spectrum, and in part due to the use of only second-order, statistical averages, e.g., the autocorrelation function and its corresponding power spectrum, in standard noise reduction algorithms. One approach to estimate phase is through the STFT magnitude as described in Sections 7.5.3 and 7.6.2. A different approach is proposed in Exercise 13.4.

(13.4)

X̂(pL, ω) = |X(pL, ω)| exp[j∠Y(pL, ω)]

i.e., the ideal STFT estimate consists of the clean STFT magnitude and noisy measured STFT phase. We refer to this as the theoretical limit in estimating the original STFT when only the STFT magnitude is estimated [53]. By considering the threshold of perception of phase deviation due to additive noise, it has been shown that speech degradation is not perceived with an average short-time “segmental” signal-to-noise ratio2 (SNR) greater than 6 dB for the theoretical limit in Equation (13.4). When this SNR falls considerably below 6 dB, however, a roughness in the reconstruction is perceived [53].

2 The average short-time SNR is the ratio (in dB) of the energy in the short-time clean speech and short-time noise disturbance averaged over all frames. This ratio is sometimes referred to as the segmental signal-to-noise ratio, a nomenclature that was introduced in Chapter 12.
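To make the estimate in Equation (13.4) concrete, the following is a minimal sketch that pairs the clean STFT magnitude with the noisy measured phase and resynthesizes by overlap-add. It assumes SciPy’s STFT/ISTFT pair (default Hann window); the function and signal names are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

def theoretical_limit(x_clean, y_noisy, fs, nperseg=256):
    """Equation (13.4): clean STFT magnitude combined with the noisy
    measured STFT phase, then resynthesized by overlap-add."""
    _, _, X = stft(x_clean, fs=fs, nperseg=nperseg)  # clean STFT
    _, _, Y = stft(y_noisy, fs=fs, nperseg=nperseg)  # noisy STFT
    X_hat = np.abs(X) * np.exp(1j * np.angle(Y))     # |X| with phase of Y
    _, x_hat = istft(X_hat, fs=fs, nperseg=nperseg)
    return x_hat
```

Listening to the output as the input SNR is lowered is a simple way to probe the roughly 6-dB threshold reported in [53].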

Convolutional Distortion — Consider now a sequence x[n] that has passed through a linear time-invariant distortion g[n] resulting in a sequence y[n] = x[n] * g[n]. Our objective is to recover x[n] from y[n] without a priori knowledge of g[n]. This problem is sometimes referred to as blind deconvolution. We saw an example of blind deconvolution in Exercise 6.20 where homomorphic filtering reduced the effect of the distorting impulse response g[n] of an acoustic recording horn. As in Exercise 6.20, we assume that in short-time analysis the window w[n] is long and smooth relative to the distortion g[n], so that a short-time segment of y[n] can be written as

ypL[n] ≈ xpL[n] * g[n],  where xpL[n] = w[pL − n]x[n].

Then we can write the STFT of the degraded signal as [2] (Exercise 13.9 gives a formal argument)

(13.5)

Y(pL, ω) ≈ X(pL, ω)G(ω)

Because the Fourier transform of the distortion, G(ω), is a multiplicative modification of the desired signal’s STFT, one is tempted to apply the homomorphic filtering described in Chapter 6 whereby g[m] is removed from each short-time section ypL[m] using cepstral liftering. Our objective, however, is to obtain x[n] even when the cepstra of the signal and the distortion are not necessarily disjoint in the quefrency domain.

13.2.2 Spectral Subtraction

We return now to the problem of recovering an object sequence x[n] from the noisy sequence y[n] of Equation (13.1). We assume that we are given an estimate of the power spectrum of the noise, denoted by Ŝb(ω), that is typically obtained by averaging over multiple frames of a known noise segment. We also assume that the noise and object sequences are uncorrelated. Then with short-time analysis, an estimate of the object’s short-time squared spectral magnitude is suggested from Equation (13.2) as [5]

(13.6)

|X̂(pL, ω)|² = |Y(pL, ω)|² − Ŝb(ω),  if |Y(pL, ω)|² − Ŝb(ω) > 0
|X̂(pL, ω)|² = 0,  otherwise.

When we combine this magnitude estimate with the measured phase, we then have the STFT estimate

X̂(pL, ω) = |X̂(pL, ω)| exp[j∠Y(pL, ω)].

An object signal estimate can then be formed with any of the synthesis techniques described in Chapter 7 including the overlap-add (OLA), filter-bank summation (FBS), or least-squared-error (LSE) synthesis.3 This noise reduction method is a specific case of a more general technique given by Weiss, Aschkenasy, and Parsons [57] and extended by Berouti, Schwartz, and Makhoul [4] that we introduce in Section 13.5.3. This generalization, Equation (13.19), allows in Equation (13.6) a compression of the spectrum and over- (or under-) estimation of the noise contribution.

3 In spectral subtraction, as well as in any noise reduction scheme based on a modified STFT, synthesis occurs from a discrete STFT with N uniformly spaced frequencies ωk = 2πk/N. The DFT length N, therefore, must be sufficiently long to account for the inverse Fourier transform of |X̂(pL, ωk)|² being possibly longer than that of the original short-time segment.
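The following is a minimal sketch of spectral subtraction with OLA resynthesis, assuming the leading portion of the input is noise-only so that Ŝb(ω) can be estimated there; the frame length, DFT size, and noise-segment duration are illustrative choices.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(y, fs, noise_seconds=0.1, nperseg=256):
    """Equation (13.6): subtract an averaged noise power spectrum from
    |Y(pL, w)|^2, floor the result at zero, and reuse the noisy phase."""
    _, _, Y = stft(y, fs=fs, nperseg=nperseg)            # (freqs, frames)
    hop = nperseg // 2                                   # default overlap
    n_frames = max(1, int(noise_seconds * fs / hop))
    Sb_hat = np.mean(np.abs(Y[:, :n_frames]) ** 2, axis=1, keepdims=True)
    X2_hat = np.maximum(np.abs(Y) ** 2 - Sb_hat, 0.0)    # thresholding
    X_hat = np.sqrt(X2_hat) * np.exp(1j * np.angle(Y))   # noisy phase
    _, x_hat = istft(X_hat, fs=fs, nperseg=nperseg)
    return x_hat
```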

It is interesting to observe that spectral subtraction can be viewed as a filtering operation where high SNR regions of the measured spectrum are attenuated less than low SNR regions. This formulation can be given in terms of an “instantaneous” SNR defined as

(13.7)

R(pL, ω) = [|Y(pL, ω)|² − Ŝb(ω)]/Ŝb(ω)

resulting in a spectral magnitude estimate

|X̂(pL, ω)| = |Y(pL, ω)|√{R(pL, ω)/[1 + R(pL, ω)]}

where we have used the approximation |X(pL, ω)|² ≈ |Y(pL, ω)|² − Ŝb(ω). The time-varying suppression filter applied to the STFT measurement is therefore given approximately by

(13.8)

Hs(pL, ω) = √{R(pL, ω)/[1 + R(pL, ω)]}.

The filter attenuation is given in Figure 13.1 as a function of R(pL, ω), illustrating that low SNR signals are attenuated more than high SNR signals.

STFT magnitude estimation has also been more formally posed in the context of stochastic estimation theory that leads to a modification of spectral subtraction. Specifically, a maximum-likelihood (ML) estimation (Appendix 13.A) of the desired STFT magnitude was proposed by McAulay and Malpass [37]. The resulting ML estimate of |X(pL, ω)| is expressed as

|X̂(pL, ω)| = (1/2)|Y(pL, ω)| + (1/2)√[|Y(pL, ω)|² − Ŝb(ω)]

where it is assumed that the noise, mapped to the frequency domain from the time domain, is Gaussian at each frequency. As with spectral subtraction, the ML solution can be formulated as a suppression filter with the instantaneous SNR as a variable and has been shown to give a more gradual attenuation than spectral subtraction [37]. A further extension by McAulay and Malpass [37] modifies the ML estimate by the a priori probability of the presence of speech; when the probability of speech being present is estimated to be small, then noise attenuation in the above ML estimate is further increased.

Figure 13.1 Comparison of suppression curves for spectral subtraction (solid line) and the Wiener filter (dashed line) as a function of the instantaneous SNR.


An important property of noise suppression by spectral subtraction, as well as other STFT-based suppression techniques, is that attenuation characteristics change with the length of the analysis window. Because we are ultimately interested in speech signals, we look next at an example (adapted from Cappe and Laroche [8]) of enhancing a sinewave, a basic building block of the speech signal, in noise. The missing steps in this example are carried through in Exercise 13.1.

Example 13.1      Consider a sinewave x[n] = A cos(ωon) in stationary white noise b[n] with variance σ² and analyzed by a short-time window w[n] of length Nw. When the sinewave frequency ωo is larger than the width of the main lobe of the Fourier transform of the analysis window, W(ω), then it follows that the STFT magnitude of x[n] at frequency ωo is given approximately by |X(pL, ωo)| ≈ (A/2)W(0), where W(0) = Σn w[n]. Then, denoting by E the expectation operator, the average short-time signal power at frequency ωo is given by

Ŝx(ωo) = E{|X(pL, ωo)|²} ≈ (A²/4)W²(0).

Likewise, it is possible to show that the average power of the windowed noise is constant over all ω and given by

Ŝb(ω) = E{|B(pL, ω)|²} = σ²Σn w²[n].

In the above expressions, Ŝx(ω) and Ŝb(ω) denote estimates of the underlying power spectra Sx(ω) and Sb(ω), respectively (which in this example are unchanging in time). Using the property that x[n] and b[n] are uncorrelated, we then form the following ratio at the frequency ωo:

(13.9)

Ŝy(ωo)/Ŝb(ωo) = 1 + (A²/4)/[Sb(ωo)Bw]

where Sb(ω) = σ2 and

Bw = Σn w²[n]/W²(0)

which can be shown to be the 3-dB bandwidth of the analysis window main lobe. It follows that Sb(ωo)Bw is the approximate power in a band of noise in the STFT centered at frequency ωo. Consequently, the second term in Equation (13.9) is the ratio of the (half) power in the sinewave, i.e., A²/4, to the power in the noise band (equal to the window bandwidth) centered at frequency ωo, i.e., Sb(ωo)Bw. As we increase the window length, we increase this SNR at frequency ωo because the window bandwidth decreases, thus decreasing the power in the noise relative to that of the signal.

As the window length is decreased, we see from our result that, for a sinewave at frequency ωo, the SNR decreases and thus, in spectral subtraction, the sinewave is more likely to be attenuated and perhaps removed because of the thresholding operation in Equation (13.6). Therefore, to preserve a sinewave after suppression, we must use an analysis window of adequate length. However, a long window conflicts with the need to preserve transient and changing components of a signal such as plosives, fast attacks, and modulations.
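The window-length dependence derived in Example 13.1 is easy to verify numerically. The sketch below measures the average STFT power of a 1000-Hz sinewave and of unit-variance white noise at the sinewave’s frequency bin for a short and a long Hann window; all parameter values are illustrative.

```python
import numpy as np

# Numerical check of Example 13.1: the STFT-domain SNR of a sinewave
# grows with analysis window length.
rng = np.random.default_rng(0)
fs, f0, A, sigma = 8000, 1000.0, 1.0, 1.0
n = np.arange(2 * fs)
x = A * np.cos(2 * np.pi * f0 / fs * n)        # clean sinewave
b = sigma * rng.standard_normal(n.size)        # white background noise

for Nw in (32, 256):                           # short vs. long window
    w = np.hanning(Nw)
    hop = Nw // 2
    k0 = int(round(f0 / fs * Nw))              # DFT bin nearest f0
    power = {}
    for name, s in (("x", x), ("b", b)):
        vals = [np.abs(np.fft.rfft(w * s[m:m + Nw])[k0]) ** 2
                for m in range(0, n.size - Nw, hop)]
        power[name] = np.mean(vals)
    snr_db = 10 * np.log10(power["x"] / power["b"])
    print(f"Nw = {Nw:3d}: SNR at f0 ~ {snr_db:.1f} dB")
```

Since the SNR at ωo grows in proportion to the window length, the measured value should rise by roughly 9 dB when the window is lengthened by a factor of eight, consistent with Equation (13.9).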

The previous example illustrates the time-frequency tradeoff encountered in enhancing a simple signal in noise using spectral subtraction. In addition, we must consider the perceptual consequences of what appears to be a benign filtering scheme. In particular, a limitation of spectral subtraction is the aural artifact of “musicality” that results from the rapid coming and going of sinewaves over successive frames [32]. We can understand how musicality arises in spectral subtraction when we consider that the random fluctuations of the periodogram of noise, as well as possibly the desired sinewave itself (as in the previous example), rise above and below the spectral subtraction threshold level over time at different frequencies. Consequently, numerous smoothing techniques have been applied to reduce such annoying fluctuations.4 We discuss some of these smoothing techniques in a following section in the context of Wiener filtering, where spectral smoothing also helps reduce musicality.

4 To date, no useful intelligibility improvements (i.e., a better understanding of words or content) have been reported with current noise reduction systems, even with smoothing techniques that remove musicality and give perceived noise reduction. Such “discrepancies” are essentially always found in formal evaluation of these noise reduction systems: the perceived noise level or an objective error is reduced but intelligibility is not improved or is reduced [34]. This is probably because intelligibility is dependent on short, transient speech components, such as subtle differences between voiced and unvoiced consonants and formant modulations, which are not enhanced or are degraded in the noise reduction process. Nevertheless, the processed speech can be more pleasing and less fatiguing to the listener, as well as more easily transcribed [34].

13.2.3 Cepstral Mean Subtraction

Consider now the problem of recovering a sequence x[n] from the convolution y[n] = x[n] * g[n]. Motivated by Equation (13.5), we apply the nonlinear logarithm operator to the STFT of y[n] to obtain

log[Y(pL, ω)] ≈ log[X(pL, ω)] + log[G(ω)].

Because the distortion g[n] is time-invariant, the STFT views log[G(ω)] at each frequency as fixed along the time index variable p. If we assume that the speech component log[X(pL, ω)] has zero mean in the time dimension, then we can remove the convolutional distortion g[n] while keeping the speech contribution intact. This can be accomplished in the quefrency domain by computing cepstra, along each STFT time trajectory, of the form

cy(n, ω) = Fp−1{log[Y(pL, ω)]}

where Fp−1 denotes the inverse Fourier transform of sequences along the time dimension p. Applying a cepstral lifter, we then have:

ĉx(n, ω) = l[n]cy(n, ω)

where l[n] = 0 at n = 0 and unity elsewhere. Because the 0th value of the cepstrum equals the mean of log[Y(pL, ω)] (along the time dimension) for each ω, the method is called cepstral mean subtraction (CMS). Although this approach is limited due to the strictness of the assumption of a zero-mean speech contribution, it has significant advantages in feature estimation for recognition applications (Chapter 14). (Historically, CMS was first developed in the context of speech recognition as described in [23].) In these applications, often the mean of the log[Y(pL, ω)] is computed and subtracted directly.5 Because, in practice, the mean is computed over a finite number of frames, we can think of CMS as a highpass, non-causal FIR filtering operation [2],[23].

5 In practice, only the STFT magnitude is used in recognition applications and the 0th cepstral value is obtained by computing the mean of log |X(pL, ω)| along p, rather than computing an explicit inverse Fourier transform. Often, however, in recognition applications, the cepstrum of log |X(pL, ω)| is computed with respect to ω to obtain a cepstral feature vector for each frame. Equivalently, one can then subtract the mean cepstrum (across frames) to remove the distortion component, and hence we have an alternative and, perhaps, more legitimate motivation for the nomenclature cepstral mean subtraction.
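Because CMS amounts to removing the temporal mean of the log-magnitude at each frequency, it takes only a few lines. The sketch below operates on a precomputed log-magnitude STFT with frames along the first axis; the toy channel check at the end is illustrative.

```python
import numpy as np

def cepstral_mean_subtraction(log_mag):
    """CMS along the time dimension of a log-magnitude STFT: removing
    the per-frequency temporal mean of log|Y(pL, w)| cancels a fixed
    log|G(w)|. `log_mag` has shape (n_frames, n_freqs)."""
    return log_mag - log_mag.mean(axis=0, keepdims=True)

# Toy check: a fixed channel log|G(w)| added to every frame vanishes.
rng = np.random.default_rng(1)
log_X = rng.standard_normal((100, 129))     # stand-in for log|X(pL, w)|
log_G = np.linspace(-1.0, 1.0, 129)         # time-invariant distortion
diff = (cepstral_mean_subtraction(log_X + log_G)
        - cepstral_mean_subtraction(log_X))
print(np.max(np.abs(diff)))                 # ~1e-15: channel removed
```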

13.3 Wiener Filtering

An alternative to spectral subtraction for recovering an object sequence x[n] corrupted by additive noise b[n], i.e., from a sequence y[n] = x[n] + b[n], is to find a linear filter h[n] such that the sequence x̂[n] = h[n] * y[n] minimizes the expected value of (x[n] − x̂[n])². Under the condition that the signals x[n] and b[n] are uncorrelated and stationary, the frequency-domain solution to this stochastic optimization problem is given by the suppression filter (Exercise 13.2)

(13.10)

H(ω) = Sx(ω)/[Sx(ω) + Sb(ω)]

which is referred to as the Wiener filter [32]. When the signals x[n] and b[n] meet the conditions under which the Wiener filter is derived, i.e., uncorrelated and stationary object and background, the Wiener filter provides noise suppression without considerable distortion in the object estimate and background residual. The required power spectra, Sx(ω) and Sb(ω), can be estimated by averaging over multiple frames when sample functions of x[n] and b[n] are provided. Typically, however, the desired signal and background are nonstationary in the sense that their power spectra change over time, i.e., they can be expressed as time-varying functions Sx(n, ω) and Sb(n, ω). Thus, ideally, each frame of the STFT is processed by a different Wiener filter. For the simplifying case of a stationary background, we can express the time-varying Wiener filter as

Hs(pL, ω) = Ŝx(pL, ω)/[Ŝx(pL, ω) + Ŝb(ω)]

where Ŝx(pL, ω) is an estimate of the time-varying power spectrum of x[n], Sx(n, ω), on each frame, and Ŝb(ω) is an estimate of the power spectrum of a stationary background, Sb(ω). The time-varying Wiener filter can also be expressed as (Exercise 13.2)

(13.11)

Hs(pL, ω) = R(pL, ω)/[1 + R(pL, ω)]

with a signal-to-noise ratio R(pL, ω) = Ŝx(pL, ω)/Ŝb(ω). A comparison of the suppression curves for spectral subtraction and Wiener filtering is shown in Figure 13.1, where we see the attenuation of low SNR regions relative to the high SNR regions to be somewhat stronger for the Wiener filter, consistent with the filter in Equation (13.8) being a compressed (square-rooted) form of that in Equation (13.11). A second important difference from spectral subtraction is that the Wiener filter does not invoke an absolute thresholding. Finally, as with spectral subtraction, an enhanced waveform is recovered from the modified STFT, X̂(pL, ω) = Hs(pL, ω)Y(pL, ω), by any of the synthesis methods of Chapter 7. Observe that the Wiener filter is zero-phase so that the original phase of Y(pL, ω) is again used in synthesis.

In forming an estimate Ŝx(pL, ω) of the time-varying object power spectrum, we typically must use very short-time and local measurements. This is because when the desired signal is on the order of a few milliseconds in duration, and when its change is rapid, as with some plosives, its spectrum is difficult to measure, requiring an estimate to be made essentially “instantaneously.” In the remainder of this section, we study a variety of spectral estimation methods and then apply these methods to speech enhancement. Representations for binaural presentation are also considered, indicating that further enhancement can be obtained with stereo aural displays of combinations of the object estimate and the signal not passed by the Wiener filter, i.e., the result of subtracting the object estimate from the original signal.

13.3.1 Basic Approaches to Estimating the Object Spectrum

Suppose a signal y[n] is short-time processed at frame interval L samples and we have available an estimate of the Wiener filter on frame p − 1, denoted by Hs((p − 1)L, ω). We assume, as before, that the background b[n] is stationary and that its power spectrum, Sb(ω), is estimated by averaging spectra over a known background region. For a nonstationary object signal x[n], one approach to obtain an estimate of its time-varying power spectrum on the pth frame uses the past Wiener filter Hs((p − 1)L, ω) to enhance the current frame [32]. This operation yields an enhanced STFT on the pth frame:

(13.12)

X̂(pL, ω) = Hs((p − 1)L, ω)Y(pL, ω)

which is then used to update the Wiener filter:

(13.13)

Hs(pL, ω) = |X̂(pL, ω)|²/[|X̂(pL, ω)|² + Ŝb(ω)].

The estimate of the time-varying object power spectrum, Ŝx(pL, ω) = |X̂(pL, ω)|², may be initialized with, for example, the raw spectral measurement or a spectrum derived from spectral subtraction. There is, however, little control over how rapidly the object power spectrum estimate changes6 in Equation (13.13). Because the filter in Equation (13.13) can vary rapidly from frame to frame, as with spectral subtraction, the result is a noise residual with fluctuating artifacts. These fluctuations again are perceived as annoying musicality because of peaks in the periodogram |Y(pL, ω)|² that influence the object estimate X̂(pL, ω) in Equation (13.12) and thus the filter estimate in Equation (13.13).

6 In the examples to follow in this section, we use overlap-add (OLA) synthesis. We saw in Chapter 7 (Section 7.5.1) that the effect of OLA synthesis, from a multiplicatively modified STFT (as occurs with Wiener filtering), is a time-varying linear filter smoothed in time by the window. Thus, the window bandwidth constrains how fast the time-domain Wiener filter can change. For a very short window, however, the resulting large bandwidth implies little smoothing.

Figure 13.2 Classic Wiener filter with smoothing of the object spectrum estimate.


One approach to slow down the rapid frame-to-frame movement of the object power spectrum estimate, and thus reduce annoying fluctuations in the residual, is to apply temporal smoothing to the object spectrum of Equation (13.12). Denote the object power spectrum estimate on the pth frame by Ŝx(pL, ω) = |X̂(pL, ω)|². Then the smooth power spectrum estimate is obtained as

(13.14)

S̄x(pL, ω) = τS̄x((p − 1)L, ω) + (1 − τ)Ŝx(pL, ω)

where τ is the smoothing constant. S̄x(pL, ω) then replaces |X̂(pL, ω)|² within the Wiener filter of Equation (13.13) (Figure 13.2). The smoothing constant of Equation (13.14) controls how fast we adapt to a nonstationary object spectrum. A fast adaptation, with a small smoothing constant, implies improved time resolution, but more noise in the spectral estimate, and thus more musicality in the synthesis. A large smoothing constant improves the spectral estimate in regions of stationarity, but it smears onsets and other rapid events. A minimal sketch of this recursion is given below, after which we look at an example of the time-frequency resolution tradeoff inherent in this approach.
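The sketch assumes a precomputed background power spectrum estimate and uses SciPy’s STFT/ISTFT for analysis and OLA synthesis; the all-pass initialization of the filter and the small regularizing constant are simplifications.

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_smoothed(y, fs, Sb_hat, tau=0.85, nperseg=256):
    """Equations (13.12)-(13.14): filter each frame with the previous
    frame's Wiener filter, then update and smooth the object power
    spectrum. `Sb_hat` holds one background power value per frequency."""
    _, _, Y = stft(y, fs=fs, nperseg=nperseg)      # (freqs, frames)
    H = np.ones(Y.shape[0])                        # initial all-pass filter
    Sx_bar = np.zeros(Y.shape[0])                  # smoothed object spectrum
    X_hat = np.zeros_like(Y)
    for p in range(Y.shape[1]):
        X_hat[:, p] = H * Y[:, p]                  # Eq. (13.12)
        Sx = np.abs(X_hat[:, p]) ** 2              # raw object spectrum
        Sx_bar = tau * Sx_bar + (1.0 - tau) * Sx   # Eq. (13.14)
        H = Sx_bar / (Sx_bar + Sb_hat + 1e-12)     # Eq. (13.13), smoothed
    _, x_hat = istft(X_hat, fs=fs, nperseg=nperseg)
    return x_hat
```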

Example 13.2       Figure 13.3b shows an example of a synthetic train of rapidly-decaying sinewaves in white Gaussian noise; the original 1000-Hz sinewave pulses of Figure 13.3a are uniformly spaced by 2 ms. In noise reduction with a Wiener filter, a 4-ms triangular analysis window, a 1-ms frame interval, and overlap-add (OLA) synthesis are applied. The particular analysis window and frame interval ensure that the OLA constraint is satisfied.7 A Wiener filter was derived using the spectral smoothing in Equation (13.14) with τ = 0.85. The background power spectrum estimate was obtained by averaging the squared STFT magnitude over the first 0.08 seconds of y[n], and the initial object power spectrum estimate Ŝx(0, ω) was obtained by applying spectral subtraction to |Y(0, ω)|². Panel (c) illustrates the result of applying the Wiener filter. An advantage of the spectral smoothing is that it has removed musicality in the noise residual. However, the initial attack of the signal is reduced, resulting in an aural “dulling” of the sound, i.e., it is perceived as less “crisp” than that of its original noiseless counterpart. In addition, although the Wiener filter adapts to the object spectrum, the effect of this adaptation lingers beyond the object, thus preventing noise reduction for some time thereafter and resulting in a perceived “hiss.” In this example, the smoothing constant τ is selected to give substantial noise reduction while reducing musicality, but at the expense of slowness in the filter adaptation, resulting in the smeared object attack and the trailing hiss. This sluggish adaptivity can be avoided by reducing the smoothing constant, but at the expense of less noise suppression and the return of residual musicality due to a fluctuating object power spectrum estimate.

7 The short window corresponds to a large window bandwidth and thus ensures that OLA synthesis does not inherently impose a strong temporal smoothing of the Wiener filter, as seen in Chapter 7 (Section 7.5.1).

Figure 13.3 Enhancement by adaptive Wiener filtering of a train of closely-spaced decaying sinewaves in 10 dB of additive white Gaussian noise: (a) original clean object signal; (b) original noisy signal; (c) enhanced signal without use of spectral change; (d) enhanced signal with use of spectral change; (e) enhanced signal using spectral change, the iterative filter estimate (2 iterations), and background adaptation.


13.3.2 Adaptive Smoothing Based on Spectral Change

In this section, we present an adaptive approach to smoothing the object spectrum estimate with emphasis on preserving the object nonstationarity, while avoiding perceived musicality and hiss in the noise residual. The essence of the enhancement technique is a Wiener filter that uses an object signal power spectrum whose estimator adapts to the “degree of stationarity” of the measured signal [46]. The degree of stationarity is derived from a short-time spectral derivative measurement that is motivated by the sensitivity of biological systems to spectral change, as we have seen in Chapter 8 in the phasic/tonic auditory principle, but also by evidence that noise is perceptually masked by rapid spectral changes,8 as has been demonstrated with quantization noise in the context of speech coding [30]. The approach, therefore, preserves dynamic regions important for perception, as we described in Chapter 8 (Section 8.6.3), but also temporally shapes the noise residual to coincide with these dynamic regions according to this particular perceptual masking criterion. Additional evidence for the perceptual importance of signal dynamics has been given by Moore who, in his book on the psychology of hearing [41] (p. 191), states: “The auditory system seems particularly well-suited to the analysis of changes in sensory input. The perceptual effect of a change in a stimulus can be roughly described by saying that the preceding stimulus is subtracted from the present one, so what remains is the change. The changed aspect stands out perceptually from the rest … a powerful demonstration of this effect may be obtained by listening to a stimulus with a particular spectral structure and then switching rapidly to a white noise stimulus…. The noise sounds colored, and the coloration corresponds to the inverse of the spectrum of the preceding sound.”

8 This aural property is analogous to a visual property. With noise reduction in images, noise in a stationary region, such as a table top, is more perceptible than noise in a nonstationary region, such as a table edge [35].

Our goal is to make the adaptive Wiener filter more responsive to the presence of the desired signal without sacrificing the filter’s capability to suppress noise. We can accomplish this by making the smoothing constant of the recursive smoother in Equation (13.14) adapt to the spectrum of the measurement. In particular, a time-varying smoothing constant is selected to reflect the degree of stationarity of the waveform whereby, when the spectrum is changing rapidly, little temporal smoothing is introduced resulting in a near instantaneous object spectrum estimate used in the Wiener filter. On the other hand, when the measurement spectrum is stationary, as in background or in steady object regions, an increased smoothing improves the object spectral estimate. Although this filter adaptation results in relatively more noise in nonstationary regions, as observed earlier, there is evidence that, perceptually, noise is masked by rapid spectral changes and accentuated in stationary regions.

One measure of the degree of stationarity is obtained through a spectral derivative measure defined for the pth frame as

(13.15)

Δ[p] = Σk ||Y(pL, ωk)| − |Y((p − 1)L, ωk)||,  with ωk the discrete STFT frequencies.

Because this measure is itself erratic across successive frames, it, too, is temporally smoothed as Δ̄[p] = ƒΔ[p] * Δ[p], where ƒΔ[p] is a noncausal linear filter. The smooth spectral derivative measure is then mapped to a time-varying smoothing constant as

τ[p] = Q{2(Δ̄[p] − Δ̄b)}

where

Q{x} = 1 − x for 0 ≤ x ≤ 1, with Q{x} = 1 for x < 0 and Q{x} = 0 for x > 1,

and where Δ̄b is the average spectral derivative over the known background region. Subtraction of Δ̄b and multiplication by 2 in the argument of Q are found empirically to normalize τ[p] to fall roughly between zero and unity [46]. The resulting smooth object spectrum is given by

(13.16)

S̄x(pL, ω) = τ[p]S̄x((p − 1)L, ω) + (1 − τ[p])Ŝx(pL, ω).

The resulting enhancement system is illustrated in Figure 13.4, where the time-varying, rather than fixed, smoothing constant controls the estimation of the object power spectrum and is derived from the spectral derivative. As we will see shortly through an example, this refined Wiener filter improves the object attack and reduces the residual hiss artifact in synthesis.
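The sketch below computes per-frame smoothing constants in the spirit of Equations (13.15) and (13.16). The particular derivative measure (a summed first difference of the magnitude spectrum), the rectangular smoother standing in for ƒΔ[p], and the clipped linear map standing in for Q are plausible stand-ins, not the exact forms used in [46].

```python
import numpy as np

def adaptive_tau(Y_mag, delta_b, smooth_len=3):
    """Per-frame smoothing constants tau[p] from a spectral derivative
    measure. `Y_mag` is |Y(pL, w)| with shape (n_freqs, n_frames);
    `delta_b` is the average derivative over a known background region."""
    # Spectral derivative across successive frames (compare Eq. (13.15))
    delta = np.abs(np.diff(Y_mag, axis=1)).sum(axis=0)
    delta = np.concatenate(([delta[0]], delta))
    # Noncausal smoothing, a rectangular stand-in for f_Delta[p]
    kernel = np.ones(smooth_len) / smooth_len
    delta_bar = np.convolve(delta, kernel, mode="same")
    # Clipped linear map standing in for Q: rapid change -> small tau
    return np.clip(1.0 - 2.0 * (delta_bar - delta_b), 0.0, 1.0)
```

The returned τ[p] then simply replaces the fixed τ in the recursive smoother of Equation (13.14).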

Figure 13.4 Noise-reduction Wiener filtering based on spectral change. The time-varying smoothing constant τ(p) controls the estimation of the object power spectrum, and is derived from the spectral derivative.


One approach to further recover the initial attack is to iterate the Wiener filtering on each frame; the iteration attempts to progressively improve the Wiener filter by looping back the filter output, i.e., iteratively updating the Wiener filter with a new object spectrum estimate derived from the enhanced signal. This iterative approach is indicated in Figure 13.4 by the clockwise arrow. Such an iterative process better captures the initial attack through a refined object spectrum estimate obtained from the enhanced signal and by reducing the effect of smoothing delay. Only a few iterations, however, are possible on each frame because with an increasing number of iterations, resonant bandwidths of the object spectral estimate are found empirically to become unnaturally narrow.

Example 13.3      In this example, we improve on the object signal enhancement of Example 13.2 using the adaptive and iterative Wiener filter described above. As before, a 4-ms triangular analysis window, a 1-ms frame interval, and OLA synthesis are applied. In addition to providing good temporal resolution, the short 4-ms analysis window prevents the inherent smoothing by the OLA process from controlling the Wiener filter dynamics. A 2-ms rectangular filter ƒΔ[p] is selected for smoothing the spectral derivative. Figure 13.3d shows that the use of spectral change in the Wiener filter adaptation [Equation (13.16)] helps both to reduce the residual hiss artifact and to improve attack fidelity, giving a cleaner and crisper synthesis. Nevertheless, the first few object components are still reduced in amplitude: although the spectral derivative rises and the resulting smoothing constant falls in the object region, these measures possess neither the resolution nor the predictive capability necessary to track the individual object components. The iterative Wiener filter (with 2 iterations) helps to further improve the attack, as illustrated in Figure 13.3e. In addition, the uniform background residual was achieved in Figure 13.3e by allowing the background spectrum to adapt during frames declared to be background, based on an energy-based detection of the presence of speech in an analysis frame [46].

Example 13.4       Figure 13.5 shows a frequency-domain perspective of the performance of the adaptive Wiener filter based on spectral change and with iterative refinement (2 iterations). In this example, a synthetic signal consists of two FM chirps crossing one another and repeated, and the background is white Gaussian noise. As in the previous example, the analysis window duration is 4 ms, the frame interval is 1 ms, and OLA synthesis is applied. Likewise, a 2-ms rectangular filter ƒΔ[p] is selected for smoothing the spectral derivative. The ability of the Wiener filter to track the FM is illustrated by spectrographic views, as well as by snapshots of the adaptive filter at three different signal transition time instants.

Figure 13.5 Frequency-domain illustration of adaptive Wiener filter for crossing chirp signals: (a) spectrogram of original noisy signal; (b) spectrogram of enhanced signal; (c) adaptive Wiener filters at three signal transition time instants.


13.3.3 Application to Speech

In the speech enhancement application, a short analysis window and frame interval, e.g., a 4-ms triangular window and a 1-ms frame interval, and OLA synthesis, as in the previous examples, provide good temporal resolution of sharp attacks, transitions, and modulations. This temporal resolution is obtained without the loss of frequency resolution and of low-level speech components (Example 13.1) that is typical of processing with a short window. (We cannot necessarily project results from simple signals, such as a sinewave in white noise, to more complex signals.) Some insight into this property is found in revisiting the difference between the narrowband and wideband spectrograms as described in Chapter 3, although the reader is left to further ponder this intriguing observation.

For voiced speech, we approximate the speech waveform x[n] as the output of a linear time-invariant system with impulse response h[n] (with an embedded glottal airflow) and with an impulse train input p[n] = Σk δ[n − kP], where P is the pitch period. This results in the windowed speech waveform

xn[m] = w[n − m](p[m] * h[m]).

The difference between the narrowband and wideband spectrograms is the length of the window w[n]. For the narrowband spectrogram, we use a long window with a duration of at least two pitch periods; as seen in Chapter 3 (Section 3.3), the short-time section xn[m] maps in the frequency domain to

X(n, ω) ≈ (1/P)Σk H(ωk)W(n, ω − ωk)

where H(ω) is the glottal flow spectrum/vocal tract frequency response, W(n, ω) is the Fourier transform of the shifted window w[n − m] with respect to m, and where ωk = 2πk/P. Thus spectral slices in the narrowband spectrogram of voiced speech consist of a set of narrow “harmonic lines,” whose width is determined by the Fourier transform of the window, shaped by the magnitude of the product of the glottal airflow spectrum and vocal tract frequency response. With the use of a long window, the Wiener filter would appear as a “comb” following the underlying harmonic speech structure,9 reducing noise in harmonic nulls. A consequence of a long window, however, is that rapid events are smeared by the Wiener filtering process.

9 In a voiced speech segment, if we make the analysis window duration multiple pitch periods, the Wiener filter becomes a comb, accentuating the harmonics. Alternatively, given a pitch estimate, it is straightforward to explicitly design a comb filter [32]. Many problems are encountered with this approach, however, including the need for very accurate pitch (the effect of pitch error increasing with increasing harmonic frequency), the presence of mixed voicing states, and the changing of pitch over a frame duration, as well as the lack of a single pitch over the speech bandwidth due to nonlinear production phenomena.

For the wideband spectrogram, we assume a short window with a duration of less than a single pitch period (e.g., 4 ms), and this provides a different view of the speech waveform. Because the window length is less than a pitch period, as the window slides in time, it essentially “sees” pieces of the periodically occurring vocal tract/glottal flow response h[n] (assuming tails of previous responses have died away), and, as in Chapter 3 (Section 3.3), this can be expressed as

|X(pL, ω)|² ≈ |H(ω)|²E[pL]

where E [pL] is the energy in the waveform under the sliding window at time pL. In this case, the spectrogram shows the formant frequencies of the vocal tract along the frequency dimension, but also gives vertical striations at the pitch rate in time, rather than the harmonic horizontal striations as in the narrowband spectrogram because, for a sufficiently small frame interval L, the short window is sliding through fluctuating energy regions of the speech waveform. The Wiener filter, therefore, follows the resonant structure of the waveform, rather than the harmonic structure, providing reduction of noise in formant nulls rather than harmonic nulls. Nevertheless, noise reduction does occur at the fundamental frequency rate, i.e., within a glottal cycle, via the short window and frame interval.

In our description of the narrowband and wideband spectrograms, we have used the example of voiced speech. For unvoiced sound classes (e.g., fricatives and plosives), either spectrogram shows greater intensity at formants of the vocal tract; neither shows horizontal or vertical pitch-related striations because periodicity is not present except when the vocal cords are vibrating simultaneously with unvoiced noise-like or impulse-like sounds. For plosive sounds in particular, the wideband spectrogram is preferred because it gives better temporal resolution of the sound’s fine structure, particularly when the plosive is closely followed by a vowel. A short 4-ms window used in the Wiener filter estimation is thus consistent with this property. Informal listening shows the approach to yield considerable noise reduction and good speech quality without residual musicality. Formal evaluations, however, have not been performed. We demonstrate in Example 13.5 the performance of the adaptive Wiener filter with an example of speech in a white noise background.

Figure 13.6 Reduction of additive white noise corrupting a speech waveform from a female speaker: (a) excerpt of original waveform; (b) enhancement of (a) by adaptive Wiener filtering; (c)-(d) spectrograms of the full waveforms corresponding to panels (a) and (b).


Example 13.5       In this example, the speech of a female speaker is corrupted by white Gaussian noise at a 9-dB SNR. As in the previous examples, a 4-ms triangular window, 1-ms frame interval, and OLA synthesis are used. A Wiener filter was designed using adaptation to the spectral derivative, two iterative updates, and background adaptation, as described in the previous section. Likewise, a 2-ms rectangular filter ƒΔ[p] was selected for smoothing the spectral derivative. Figure 13.6 illustrates the algorithm’s performance in enhancing the speech waveform. Good temporal resolution is achieved by the algorithm, while maintaining formant and harmonic trajectories. The speech synthesis is subjectively judged informally to be of high quality and without background residual musicality. It is important to note, however, that the speech quality and degree of residual musicality can be controlled by the length of the noncausal smoothing filter ƒΔ[p] used to achieve a smooth spectral derivative measure, i.e., Δ̄[p]. For example, an excessively long ƒΔ[p] can smear speech dynamics, while an ƒΔ[p] too short causes some musicality to return.

13.3.4 Optimal Spectral Magnitude Estimation

Motivated by the maximum-likelihood (ML) estimation method of McAulay and Malpass [37], Ephraim and Malah [13] proposed a different stochastic-theoretic approach to an optimal spectral magnitude estimator. Specifically, they obtained the least-squared-error estimate of |X(pL, ω)|, given the noisy observation y[n] = x[n] + b[n], which is its expected value given y[n], i.e., E{|X(pL, ω)| | y[n]}. In contrast, neither spectral subtraction, ML estimation, nor Wiener filtering solves this optimization problem. Its solution (with its derivation beyond the scope of our presentation) results in significant noise suppression with negligible musical residual noise. For each analysis frame, the suppression filter involves measures of both a priori and a posteriori SNRs that we denote by γpr(pL, ω) and γpo(pL, ω), respectively [7],[13],[17]:

Hs(pL, ω) = (√π/2)[√v(pL, ω)/γpo(pL, ω)]G[v(pL, ω)],  v(pL, ω) = γpo(pL, ω)γpr(pL, ω)/[1 + γpr(pL, ω)]

with

G[x] = exp(−x/2)[(1 + x)I0(x/2) + xI1(x/2)]

where I0(x) and I1(x) are the modified Bessel functions of the 0th and 1st order, respectively. The term a priori is used because it is an estimate of the SNR on the current frame based in part on an object estimate from the previous frame. The term a posteriori is used because it is an estimate of the SNR in the current frame based on an object estimate from the current frame. The a priori and a posteriori SNRs are written specifically in the form [7],[13],[17]

γpo(pL, ω) = |Y(pL, ω)|²/Ŝb(ω)
γpr(pL, ω) = (1 − α)P[γpo(pL, ω) − 1] + α|Hs((p − 1)L, ω)Y((p − 1)L, ω)|²/Ŝb(ω)

where P(x) = x for x ≥ 0 and P(x) = 0 for x < 0, which ensures that the SNRs are positive. The constant α is a weighting factor satisfying |α| < 1 that is set to a value close to 1. We see that the a priori SNR is, then, a combination of the (current) a posteriori SNR and an estimate of the (previous) instantaneous SNR with the object power spectrum estimated from filtering by Hs((p − 1)L, ω).

The above expression for Hs(pL, ω) is quite complicated, but it can be argued [7],[13],[17] that the a priori SNR γpr(pL, ω) is the dominant factor. γpr(pL, ω) can be interpreted as a heavily smoothed version of γpo(pL, ω) when γpo(pL, ω) is small, and as a delayed version of γpo(pL, ω) when γpo(pL, ω) is large [7],[13],[17]. Thus, for low SNR cases, γpr(pL, ω) is heavily smoothed, which results in a smooth Hs(pL, ω) in the low regions of the spectrum. Because musical residual noise occurs in the low SNR regions of the spectrum, the smooth suppression function gives reduced musicality. In the high SNR regions, γpr(pL, ω) roughly tracks γpo(pL, ω), which is an estimate of the instantaneous SNR with the object power spectrum estimated from spectral subtraction. This results in an SNR-based suppression filter similar to the Wiener filter of Section 13.3.1 for high SNR. Thus, the Ephraim and Malah algorithm can be interpreted as using a fast-tracking SNR estimate for high-SNR frequency components and a highly-smoothed, slow-tracking SNR estimate for low SNR components [7],[17].

This algorithm significantly reduces the amount of musical noise compared to the spectral subtraction and basic Wiener filtering methods [7],[13],[17]. As with other approaches to enhancement, the speech quality and background artifacts can be controlled by the degree of spectral smoothing. Some musical residual noise can be perceived, for example, if the smoothing of the a posteriori SNR estimate is not sufficient. On the other hand, when the smoothing is excessive, the beginnings and ends of sounds that are low in SNR are distorted due to the a posteriori SNR estimate’s being too slow in catching up with the transient speech [7],[17].
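A sketch of the per-frame gain computation follows. It uses SciPy’s exponentially scaled Bessel functions i0e and i1e so that the exp(−x/2) factor in G[x] is absorbed without numerical overflow; the value α = 0.98 and the flooring constants are illustrative choices.

```python
import numpy as np
from scipy.special import i0e, i1e   # exp(-|x|)-scaled Bessel functions

def mmse_stsa_gain(gamma_pr, gamma_po):
    """Suppression filter Hs(pL, w) per frequency, in terms of the
    a priori SNR gamma_pr and the a posteriori SNR gamma_po."""
    gamma_po = np.maximum(gamma_po, 1e-6)
    v = gamma_po * gamma_pr / (1.0 + gamma_pr)
    # exp(-v/2) * [(1 + v) I0(v/2) + v I1(v/2)], computed stably:
    Gv = (1.0 + v) * i0e(v / 2.0) + v * i1e(v / 2.0)
    return (np.sqrt(np.pi) / 2.0) * (np.sqrt(v) / gamma_po) * Gv

def decision_directed_snrs(Y2, Sb_hat, X2_prev, alpha=0.98):
    """A priori and a posteriori SNRs for the current frame, given the
    noisy power spectrum Y2 and the previous frame's enhanced power
    spectrum X2_prev = |Hs((p-1)L, w) Y((p-1)L, w)|^2."""
    gamma_po = Y2 / Sb_hat
    gamma_pr = (alpha * X2_prev / Sb_hat
                + (1.0 - alpha) * np.maximum(gamma_po - 1.0, 0.0))
    return gamma_pr, gamma_po
```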

13.3.5 Binaural Representations

Consider now the output of the filter, 1 − Hs(pL, ω), with Hs(pL, ω) being a suppression filter. This filter output, which we refer to as the output complement, serves two purposes. First, it gives a means for assessing performance, showing pieces of the object signal, as well as the noise background, that were eliminated by the suppression filter. Because the complement and the object estimate sum to the original signal, the complement contains anything that is not captured by the suppression filter.

The filter output complement also opens the possibility of forming a binaural presentation for the listener. For example, we can send the object estimate and its complement into separate ears. In experiments using the Wiener filter of Section 13.3.2, this stereo presentation appears to give the illusion that the object and its complement emanate from different directions, and thus there is further enhancement [46]. A further advantage of this presentation is that, because the complement plus the object estimate reconstruct the original signal, no object component is lost by separating the complement. A disadvantage is possible confusion in having components of one signal appear to come from different directions.
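Once an object estimate is in hand, such a presentation is immediate to construct; in the sketch below the complement is formed in the time domain by subtraction, and the channel assignment is arbitrary.

```python
import numpy as np

def binaural_pair(y, x_hat):
    """Stereo display: object estimate in one ear, complement in the
    other. Since x_hat + (y - x_hat) = y, the two channels together
    retain every component of the original noisy signal."""
    complement = y - x_hat                 # output of the filter 1 - Hs
    return np.stack([x_hat, complement])   # rows = (left, right)
```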

There are also other binaural presentations of interest for enhancement. For example, with the object estimate plus its complement (i.e., the original noisy signal) in one ear and the complement only in the second ear, one perceives the noise coming from directly ahead, while the object appears to come from an angle. Alternately, one can present the object estimate in both ears and its complement in one ear. Other variations include presenting the complement signal in both ears, with the object estimate in one ear and the inverted (negated) object estimate in the other. It is interesting to note the following early experiment described by van Bergeijk, Pierce, and David [3] (pp. 164–165): “Imagine that we listen to a noise reaching both ears simultaneously, as it might from a source dead ahead, together with a low-frequency tone reaching one ear direct and the other ear inverted, as it might from a source to one side. In such circumstances we can hear a tone that is around 10 dB weaker than if it reached both ears in the same manner, uninverted, undelayed…. This is another way of saying that we are using our power of directional discrimination in separating the tone from the noise.”

13.4 Model-Based Processing

Heretofore, the additive noise reduction methods of this chapter have not relied on a speech model. On the other hand, we can design a noise reduction filter that exploits estimated speech model parameters. For example, the Wiener filter can be constructed with an object power spectrum estimate that is based on an all-pole vocal tract transfer function. This filter can then be applied to enhance speech, just as we did with the nonparametric Wiener filter in the previous sections. Such a filter could be obtained by simply applying the deconvolution methods that we have studied such as the all-pole vocal tract (correlation and covariance linear prediction) estimation methods of Chapter 5 or homomorphic filtering methods of Chapter 6. The problem with this approach, however, is that such estimation methods that work for clean speech often degrade for noisy speech. A more formal approach in additive noise is to apply stochastic estimation methods such as maximum likelihood (ML), maximum a posteriori (MAP), or minimum-mean-squared error (MMSE) estimation (Appendix 13.A).

MAP estimation of all-pole parameters10 has been applied by Lim and Oppenheim [32], maximizing the a posteriori probability density of the linear prediction coefficients a (in vector form), given a noisy speech vector y (for each speech frame), i.e., maximizing pa|y(a|y) with respect to a. For the speech-in-noise problem, solution to the MAP problem requires solving a set of nonlinear equations. Reformulating the MAP problem, however, leads to an iterative approach that requires a linear solution on each iteration and thus avoids the nonlinear equations [32]. (For convenience, we henceforth drop the subscripts on density notation.) Specifically, we maximize p(a, x|y), where x represents the clean speech, so we are estimating the all-pole parameters and the desired speech simultaneously. The iterative algorithm, referred to as linearized MAP (LMAP), begins with an initial guess â0 and estimates the speech as the conditional mean E{x|â0, y}, which is a linear problem. Then, having a speech estimate, a new parameter vector â1 is estimated using the autocorrelation method of linear prediction and the procedure is repeated to obtain a series of parameter vectors âi that increases p(a, x|y) on each iteration. For a stationary stochastic speech process, it can be shown that when an infinitely-long signal is available, estimating the clean speech as E{x|âi, y} (on each iteration i) is equivalent to applying a zero-phase Wiener filter with frequency response

10 Without noise, all-pole MAP estimation can be shown to reduce to the autocorrelation method of linear prediction.

Hi(ω) = Ŝxi(ω)/[Ŝxi(ω) + Ŝb(ω)]

and where the power spectrum estimate of the speech on the ith iteration is given by

Ŝxi(ω) = A²/|1 − Σk âki exp(−jωk)|²

where âki for k = 1, 2, …, p are the predictor coefficients estimated using the autocorrelation method and A is the linear prediction gain as determined in Chapter 5 (Section 5.3.4). The LMAP algorithm is illustrated in Figure 13.7, where it is emphasized that the LMAP algorithm estimates not only the all-pole parameter vector, but also the clean speech from Wiener filtering on each iteration.11

11 When the speech is modeled with vocal tract poles and zeros, LMAP can be generalized to estimate both the poles and zeros, and speech simultaneously (Exercise 13.5).
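The sketch below illustrates an LMAP-style iteration on a single (pre-windowed) frame: an all-pole fit by the autocorrelation method alternating with zero-phase Wiener filtering. It is a simplified stand-in for the algorithm of [32], which derives the speech estimate as a conditional mean; the model order, iteration count, and FFT size are illustrative.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_autocorr(x, order):
    """Autocorrelation method of linear prediction: coefficients a_k
    and gain A for the all-pole spectrum A^2/|1 - sum a_k e^{-jwk}|^2."""
    r = np.correlate(x, x, mode="full")[x.size - 1:][:order + 1]
    r[0] *= 1.0 + 1e-9                  # slight loading for stability
    a = solve_toeplitz(r[:order], r[1:order + 1])
    A2 = r[0] - np.dot(a, r[1:order + 1])
    return a, np.sqrt(max(A2, 1e-12))

def lmap_frame(y_frame, Sb_hat, order=10, n_iter=3, nfft=512):
    """One frame of an LMAP-style iteration (compare Figure 13.7):
    alternate an all-pole fit with zero-phase Wiener filtering.
    `Sb_hat` must hold nfft//2 + 1 background power values."""
    x_hat = y_frame.copy()
    for _ in range(n_iter):
        a, A = lpc_autocorr(x_hat, order)
        denom = np.abs(np.fft.rfft(np.concatenate(([1.0], -a)), nfft)) ** 2
        Sx = A ** 2 / np.maximum(denom, 1e-12)  # all-pole object spectrum
        H = Sx / (Sx + Sb_hat)                  # zero-phase Wiener filter
        x_hat = np.fft.irfft(H * np.fft.rfft(x_hat, nfft), nfft)[:y_frame.size]
    return x_hat
```

Keeping the number of iterations small reflects the bandwidth-narrowing behavior noted above.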

The LMAP algorithm was evaluated by Lim and Oppenheim [32] and shown to improve speech quality using the subjective diagnostic acceptability measure (DAM) [Chapter 12 (Section 12.1)], in the sense that perceived noise is reduced. In addition, an objective mean-squared error in the all-pole envelope was reduced for a variety of SNRs. Nevertheless, based on the subjective diagnostic rhyme test (DRT) (Chapter 12), intelligibility does not increase.

Numerous limitations of the LMAP algorithm were improved upon by Hansen and Clements [21]. These limitations include decreasing formant bandwidth with increasing iteration (as occurs with the iterative Wiener filter described in Section 13.3.2), frame-to-frame pole jitter in stationary regions, and lack of a formal convergence criterion. In order to address these limitations, Hansen and Clements [21] introduced a number of spectral constraints within the LMAP iteration steps. Specifically, spectral constraints were imposed on all-pole parameters across time and, within a frame, across iterations so that poles do not fall too close to the unit circle, thus preventing excessively narrow bandwidths, and so that poles do not have large fluctuations from frame to frame.

Figure 13.7 Linearized MAP (LMAP) algorithm for the estimation of both the all-pole parameters and speech simultaneously from a noisy speech signal.

SOURCE: J.S. Lim and A.V. Oppenheim, “Enhancement and Bandwidth Compression of Noisy Speech” [32]. ©1979, IEEE. Used by permission.


Finally, we end this section by providing a glimpse into a companion problem to vocal tract parameter estimation in noise: estimation of the speech source, e.g., pitch and voicing. This is an important problem not only for its own sake, but also for an alternative means of noise reduction through model-based synthesis. Therefore, reliable source estimation, in addition, serves to improve model- and synthesis-based speech signal processing in noise, such as various classes of speech coders and modification techniques. In Chapter 10, we described numerous classes of pitch and voicing estimators. Although some of these estimators have been shown empirically to have certain immunity to noise disturbances, additive noise was not made explicit in the pitch and voicing model and in the resulting estimation algorithms. A number of speech researchers have brought more statistical decision-theoretic formalisms to this problem. McAulay [38] was one of the first to introduce optimum speech classification to source estimation. This approach is based on principles of decision theory in which voiced/unvoiced hypotheses are formalized for voicing estimation. McAulay [39], as well as Wise, Caprio, and Parks [58], have also introduced maximum-likelihood approaches to pitch estimation in the presence of noise. These efforts represent a small subset of the many possibilities for source estimation under degrading conditions.

13.5 Enhancement Based on Auditory Masking

In the phenomenon of auditory masking, one sound component is concealed by the presence of another sound component. Heretofore in this chapter and throughout the text, we have sporadically made use of this auditory masking principle in reducing the perception of noise. In Chapter 12, we exploited masking of quantization noise by a signal, both noise and signal occurring within a particular frequency band. In Section 13.3.2 of this chapter, we exploited the masking of additive noise by rapid change in a signal, both noise and signal change occurring at a particular time instant. These two different psychoacoustic phenomena are referred to as frequency and temporal masking, respectively. Research in psychoacoustics has also shown that we can have difficulty hearing weak signals that fall in the frequency or time vicinity of stronger signals (as well as those superimposed in time or frequency on the masking signal, as in the above two cases). A small spectral component may be masked by a stronger nearby spectral component. A similar masking can occur in time for two closely-spaced sounds. In this section, this principle of masking is exploited for noise reduction in the frequency domain. While temporal masking by adjacent sounds has proven useful, particularly in wideband audio coding [28], it has been less widely used in speech processing because it is more difficult to quantify.

In this section, we begin with a further look at frequency-domain masking that is based on the concept of a critical band. Then using the critical band paradigm, we describe an approach to determine the masking threshold for complex signals such as speech. The speech masking threshold is the spectral level (determined from the speech spectrum) below which non-speech components are masked by speech components in frequency. Finally, we illustrate the use of the masking threshold in two different noise reduction systems that are based on generalizing spectral subtraction.

13.5.1 Frequency-Domain Masking Principles

We saw in Chapter 8 that the basilar membrane, located at the front-end of the human auditory system, can be modeled as a bank of about 10,000 overlapping bandpass filters, each tuned to a specific frequency (the characteristic frequency) and with bandwidths that increase roughly logarithmically with increasing characteristic frequency. These physiologically-based filters thus perform a spectral analysis of sound pressure level appearing at the ear drum. In contrast, there also exist psychoacoustically-based filters that relate to a human’s ability to perceptually resolve sound with respect to frequency. The bandwidths of these filters are known as the critical bands of hearing and are similar in nature to the physiologically-based filters.

Frequency analysis by a human has been studied by using perceptual masking. Consider a tone at some intensity that we are trying to perceive; we call this tone the maskee. A second tone, adjacent in frequency, attempts to drown out the presence of the maskee; we call this adjacent tone the masker. Our goal is to determine the intensity level of the maskee (relative to the absolute level of hearing) at which it is not audible in the presence of the masker. This intensity level is called the masking threshold of the maskee. The general shape12 of the masking curve for a masking tone at frequency Ωo with a particular sound pressure level (SPL) in decibels was first established by Wegel and Lane [56] and is shown in Figure 13.8. Adjacent tones that have an SPL below the solid lines are not audible in the presence of the tone at Ωo. We see then that there is a range of frequencies about the masker whose audibility is affected.

12 The precise shape is more complicated due to the generation of harmonics of the tonal masker by nonlinearities in the auditory system; the shape in Figure 13.8 more closely corresponds to a narrow band of noise centered at Ωo acting as the masker [18].

We see in Figure 13.8 that maskee tones above the masking frequency are more easily masked than tones below this frequency. The masking threshold is therefore asymmetric, the masking threshold curve for frequencies higher than Ωo having a milder slope, as we see in Figure 13.8. Furthermore, the steepness of this slope in the higher frequencies is dependent on the level of the masking tone at frequency Ωo, with a milder slope as the level of the masking tone increases. On the other hand, for frequencies lower than Ωo, the masking curve is modeled with a fixed slope [18],[56].

Figure 13.8 General shape of the masking threshold curve for a masking tone at frequency Ωo. Tones with intensity below the masking threshold curve are masked (i.e., made inaudible) by the masking tone.

Image

Another important property of masking curves is that the bandwidth of these curves increases roughly logarithmically as the frequency of the masker increases. In other words, the range of frequencies that are affected by the masker increases as the frequency of the masking tone increases. This range of frequencies in which the masker and maskee interact was quantified by Fletcher [15],[16],[18] through a different experiment. In Fletcher’s experiment, a tone (the maskee) is masked by a band of noise centered at the maskee frequency. The level of the tone was set so that the tone was not audible in the presence of wideband white noise. The bandwidth of the noise was decreased until the tone became audible. This experiment was repeated at different frequencies and the resulting bandwidths were dubbed by Fletcher the critical bands. The critical band also relates to the effective bandwidth of the general masking curve in Figure 13.8. Critical bands reflect the frequency range in which two sounds are not experienced independently but are affected by each other in the human perception of sound, thus also relating to our ability to perceptually resolve frequency components of a signal.

The roughly logarithmically increasing width of the critical band filters suggests that about 24 critical band filters cover the maximum frequency range of human perception, roughly 15,000 Hz. A means of mapping linear frequency to this perceptual representation is through the bark scale. In this mapping, one bark covers one critical band, with the functional relation of frequency ƒ to bark z given by [44]

(13.17)

z = 13 arctan(0.00076ƒ) + 3.5 arctan[(ƒ/7500)²]

In the low end of the bark scale (< 1000 Hz), the bandwidths of the critical band filters are found to be about 100 Hz; at higher frequencies the bandwidths reach up to about 3000 Hz [18]. A similar mapping, which we apply in Chapter 14, uses the mel scale. The mel scale is approximately linear up to 1000 Hz and logarithmic thereafter [44]:

(13.18)

m = 2595 log10(1 + ƒ/700)

Although Equation (13.17) provides a continuous mapping from linear to bark scale, most perceptually motivated speech processing algorithms use quantized bark numbers 1, 2, 3, …, 24 that correspond approximately to the upper band edges of the 24 critical bands that cover our range of hearing. We must keep in mind that although these bark frequencies cover our hearing frequency range, physiologically there exist about 10,000 overlapping cochlear filters along the basilar membrane. Nevertheless, this reduced bark representation (as well as a quantized mel scale) allows us to exploit perceptual masking properties with feasible computation in speech signal processing, and also provides a perceptually-based framework for feature extraction in recognition applications, as we will see in Chapter 14.
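As an aside, the two mappings (and the quantization of DFT bins to bark numbers) are simple to compute. The following Python sketch uses the formula constants given above; the sampling rate and DFT length are chosen purely for illustration:

import numpy as np

def hz_to_bark(f):
    # Bark mapping of Equation (13.17); the constants vary slightly
    # across the literature.
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def hz_to_mel(f):
    # Mel mapping of Equation (13.18): approximately linear below about
    # 1000 Hz and logarithmic above.
    return 2595.0 * np.log10(1.0 + f / 700.0)

# Quantized bark numbers (1, 2, ..., 24) for the DFT bin frequencies,
# assuming (for illustration) an 8000-Hz sampling rate and N = 512:
f = np.arange(257) * 8000.0 / 512.0
bark_number = np.minimum(np.floor(hz_to_bark(f)).astype(int) + 1, 24)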

13.5.2 Calculation of the Masking Threshold

For complex signals such as speech, the effects of individual masking components are additive; the overall masking at a frequency component due to all the other frequency components is given by the sum of the masking due to the individual frequency components, giving a single masking threshold [47],[49],[52]. This threshold tells us what is or is not perceptible across the spectrum. For a background noise disturbance (the maskee) in the presence of speech (the masker), we want to determine the masking threshold curve, derived from the speech spectrum, below which the background noise is inaudible. For the speech threshold calculation, however, we must consider that the masking ability of tonal and noise components of speech (in masking background noise) is different [22].

Based on the above masking properties, a common approach to calculating the background noise masking threshold on each speech frame, denoted by T(pL, ω), was developed by Johnston [29], who does the analysis on a critical-band basis. This approach approximates the masking threshold and reduces computation as would be required on a linear-frequency basis. The method can be stated in the following four steps [29],[54]:

S1: The masking threshold is obtained on each analysis frame from the clean speech by first finding spectral energies (by summing squared magnitude values of the discrete STFT), denoted by Ek with k the bark number, within the above 24 critical bands; as we have seen, the critical band edges have logarithmically increasing frequency spacing. This step accounts approximately for the frequency selectivity of a masking curve associated with a single tone at bark number k with energy Ek. Because only noisy speech is available in practice, an approximate estimate of the clean speech spectrum is computed with spectral subtraction.

S2: To account for masking among neighboring critical bands, the critical band energies Ek from step S1 are convolved with a “spreading function” [47]. This spreading function has an asymmetric shape similar to that in Figure 13.8, but with fixed slopes,13 and has a range of about 15 on a bark scale [29],[54]. If we denote the spreading function by hk on the bark scale, then the resulting masking threshold curve is given by Tk = Ek * hk.

13 In other methods [49],[52], the masking threshold has a slope dependent on the masker level for frequencies higher than the maskers.

S3: We next subtract a threshold offset that depends on the noise-like or tone-like nature of the masker. One approach to determine this threshold offset uses the method of Sinha and Tewfik, based on speech being typically tone-like in low frequencies and noise-like in high frequencies [50],[54].

S4: We then map the masking threshold Tk resulting from step S3 from the bark scale back to a linear frequency scale to obtain T(pL, ω) where ω is sampled as DFT frequencies [29].
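A minimal sketch of the four steps in Python may help fix ideas. The spreading function and the tonality offset below are illustrative placeholders (Johnston’s exact shapes are given in [29]), and the critical-band edges are assumed precomputed as DFT-bin indices:

import numpy as np

def masking_threshold(stft_mag, band_edges, spread, offset_db):
    # stft_mag:   |X(pL, omega)| on one frame, sampled at DFT frequencies
    # band_edges: DFT-bin indices of the critical-band edges (length K+1)
    # spread:     spreading function h_k on the bark scale (odd length)
    # offset_db:  per-band tonality offset in dB (length K); a simplification
    K = len(band_edges) - 1
    # S1: critical-band energies E_k
    E = np.array([np.sum(stft_mag[band_edges[k]:band_edges[k + 1]] ** 2)
                  for k in range(K)])
    # S2: spread the energies across neighboring bands: T_k = E_k * h_k
    T = np.convolve(E, spread, mode='same')
    # S3: subtract the tonality-dependent offset (done in dB)
    T = 10.0 ** ((10.0 * np.log10(T + 1e-12) - offset_db) / 10.0)
    # S4: map back to a linear frequency scale (a step-like curve)
    T_lin = np.zeros_like(stft_mag)
    for k in range(K):
        T_lin[band_edges[k]:band_edges[k + 1]] = T[k]
    return T_lin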

An example by Virag [54] of masking curves for clean and noisy speech is shown in Figure 13.9 for a voiced speech segment over an 8000-Hz band. The above critical-band approach gives a step-like masking curve, as shown in Figure 13.9, because we are mapping a bark scale back to a linear frequency scale. A comparison of the masking thresholds derived from the clean and the noisy speech (enhanced with spectral subtraction) shows little difference in the masking curves.

Figure 13.9 Auditory masking threshold curves derived from clean and noisy speech. The light dashed curve is the STFT magnitude in dB.

SOURCE: N. Virag, “Single Channel Speech Enhancement Based on Masking Properties of the Human Auditory System” [54]. ©1999, IEEE. Used by permission.

Image

13.5.3 Exploiting Frequency Masking in Noise Reduction

In exploiting frequency masking, the basic approach is to render inaudible the spectral components of the annoying background residual (from an enhancement process) by forcing them to fall below a masking threshold curve derived from a measured speech spectrum.14 We are interested in masking this annoying (often musical) residual while maximizing noise reduction and minimizing speech distortion. There are a variety of psychoacoustically motivated speech enhancement algorithms that seek to achieve this goal by using suppression filters similar to those from spectral subtraction and Wiener filtering [10],[17],[49],[54]. Each algorithm establishes a different optimal perceptual tradeoff between the noise reduction, background residual (musical) artifacts, and speech distortion. In this section, for illustration, we describe two particular suppression algorithms that exploit masking in different ways. The first approach, by Virag [54], applies less attenuation when noise is heavily masked so as to limit speech distortion. The second approach, by Gustafsson, Jax, and Vary [17],[19], seeks residual noise that is perceptually equivalent to an attenuated version of the input noise without explicit consideration of speech distortion.

14 An interesting question arises as to whether the masking phenomenon of the auditory system provides its own mechanism for speech enhancement. In fact, masking is associated with lateral inhibition, which we discussed in Chapter 8, and which led Wang and Shamma [55] to propose the “auditory spectrum” that, in the presence of noise, represents an enhanced version of spectral components important for perception.

In the approach of Virag, a masking threshold curve is used to modify parameters of a suppression filter that is a generalization of spectral subtraction. The suppression filter in this method, originally proposed by Berouti, Schwartz, and Makhoul [4], can be written as

(13.19)

Image

where Q(pL, ω) is the ratio of the estimated background power spectrum to the measured STFT magnitude:

Image

An advantage of this filtering scheme over basic spectral subtraction is that it provides a tradeoff between noise reduction and speech and background residual distortion. In Virag’s algorithm, the parameters of the generalized spectral subtraction filter are adapted to the masking threshold curve on each frame [54]. The factor α controls the extent of noise reduction. Typically, for α > 1 noise reduction is obtained at the expense of speech distortion. The additional factor β gives the minimum noise floor and provides a means to add background noise to mask the perceived residual (musical) noise but at the expense of added background noise. The exponent γ1 = 1/γ2 controls the sharpness of the transition in the suppression curve associated with Hs(pL, ω) (similar to that in Figure 13.1).
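For orientation, although Equation (13.19) itself is not reproduced here, the underlying rule of Berouti, Schwartz, and Makhoul [4] is commonly stated in the power domain (the special case γ1 = 1) as

\[
|\hat{X}(pL,\omega)|^{2} =
\begin{cases}
|Y(pL,\omega)|^{2} - \alpha\,\hat{S}_b(pL,\omega), & \text{if } |Y(pL,\omega)|^{2} - \alpha\,\hat{S}_b(pL,\omega) > \beta\,\hat{S}_b(pL,\omega)\\[2pt]
\beta\,\hat{S}_b(pL,\omega), & \text{otherwise}
\end{cases}
\]

with Ŝb the background power spectrum estimate and suppression filter Hs(pL, ω) = |X̂(pL, ω)|/|Y(pL, ω)|; the exponents γ1 and γ2 generalize this rule to spectral powers other than 2.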

The steps in Virag’s noise reduction algorithm are stated as follows:

S1: Estimation of the background noise power spectrum required in the spectral subtraction filter by averaging over non-speech frames.

S2: Calculation, from the short-time speech spectrum, of the masking threshold curve on each frame, denoted by T(pL, ω), using the approach by Johnston described in the previous section.

S3: Adaptation of the parameters α and β of the spectral subtraction filter to the masking curves T(pL, ω). Ideally, we want the residual (musical) noise to fall below the masking curves, thus making it inaudible. When the masking threshold is high, the background noise is already masked and need not be reduced, thereby avoiding unnecessary speech distortion. When the masking threshold is low, we reduce the residual noise to prevent it from appearing above the masking threshold. Maximum and minimum parameter values are defined such that αmin and βmin map to T(pL, ω)max, corresponding to the least noise reduction, and αmax and βmax map to T(pL, ω)min, corresponding to the largest noise reduction. Interpolation is performed across these extrema (a sketch of one such interpolation follows step S4 below).

S4: Application of the noise suppression filter from step S3 and overlap-add (OLA) synthesis.
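The adaptation in step S3 can be sketched as a simple interpolation between the parameter extrema. The linear interpolation rule and the parameter ranges below are illustrative placeholders rather than Virag’s exact choices:

import numpy as np

def adapt_parameters(T, a_min=1.0, a_max=6.0, b_min=0.0, b_max=0.02):
    # T: masking threshold T(pL, omega) on one frame (linear scale).
    # Normalize T to [0, 1]: a high threshold maps to (alpha_min, beta_min),
    # the least noise reduction; a low threshold maps to
    # (alpha_max, beta_max), the largest noise reduction.
    t = (T - T.min()) / (T.max() - T.min() + 1e-12)
    alpha = a_max + t * (a_min - a_max)
    beta = b_max + t * (b_min - b_max)
    return alpha, beta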

In determining the performance of the algorithm, Virag used both objective and subjective measurements of speech quality. Two objective measures used by Virag are the articulation index and a perceptually-weighted all-pole spectral distance measure (giving less weight to regions of the spectrum with greater energy), both of which were briefly described in Chapter 12. For the above enhancement algorithm these methods were found to correlate with subjective testing, unlike other objective measures such as segmental signal-to-noise ratio.15 Although not achieving the theoretical limit (defined in Equation (13.4) as having the clean STFT magnitude and noisy phase), Virag found that the proposed spectral subtraction scheme that adapts to auditory masking outperformed the more classical spectral subtraction approaches, according to the above two objective measures. Finally, Virag used the subjective Mean Opinion Score (MOS)16 test to show that the auditory-based algorithm also outperforms other subtractive-type noise suppression algorithms with respect to human perception; the algorithm was judged to reduce musical artifacts and give acceptable speech distortion [54].

15The standard segmental signal-to-noise ratio weights equally all spectral components and does not impart a perceptual weighting, thus not correlating well with subjective measures.

16The Mean Opinion Score (MOS), alluded to in Chapter 12, is one standardized subjective test. In this test, the listener is asked to rank a test utterance between 1 (least desirable) and 5 (most desirable). The score reflects the listener’s opinion of the speech distortion, noise reduction, and residual noise.

In an alternative suppression algorithm based on auditory masking, rather than using the masking threshold curve to modify a standard suppression filter, Gustafsson, Jax, and Vary [19] use the masking threshold to derive a new suppression filter that results in perceived noise which is an attenuated version of the original background noise. In this formulation, given an original noisy signal y[n] = x[n] + b[n], the desired signal can be written as d[n] = x[n] + αb[n], where α is a noise suppression scale factor. With hs[n] denoting the impulse response of the suppression filter, an estimate of the short-time power spectrum of the noise error, αb[n] − hs[n] * b[n], can be shown to be (Exercise 13.6):

(13.20)

[α − Hs(pL, ω)]² Ŝb(pL, ω)

where Hs(pL, ω) is the frequency response of hs[n] on the pth frame and Ŝb(pL, ω) is an estimate of the background power spectrum. If this error falls below the speech masking threshold curve, then only the attenuated noise is perceived. Thus we form the constraint:

[α − Hs(pL, ω)]² Ŝb(pL, ω) ≤ T(pL, ω)

so that

(13.21)

α − [T(pL, ω)/Ŝb(pL, ω)]^{1/2} ≤ Hs(pL, ω) ≤ α + [T(pL, ω)/Ŝb(pL, ω)]^{1/2}

This gives a range of values of Hs(pL, ω) for which the output noise is perceived as the desired attenuated original noise, so that no musicality occurs in the residual noise. Selecting the upper limit of Equation (13.21) (and constraining Hs(pL, ω) ≤ 1) gives the minimum attenuation (and thus distortion) of the speech signal. As expected, this algorithm gives a noise output that is perceptually equivalent to an attenuated version of the noise input and thus contains no musical artifacts, while it gives speech distortion similar to that in conventional spectral subtraction [17],[19]. Tradeoffs in the degree of noise attenuation and the speech distortion can be obtained through the parameter α. The reader is asked to explore this tradeoff, as well as a comparison with the basic Wiener filter, in Exercise 13.6. An extension of the suppression algorithm of Gustafsson, Jax, and Vary that reduces speech distortion has been introduced by Govindasamy [17]. This method uses frequency-domain masking to explicitly seek to hide the speech distortion (defined as βx[n] − hs[n] * x[n], with β a constant scale factor) simultaneously with the noise distortion (αb[n] − hs[n] * b[n]).
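With the reconstruction of Equations (13.20) and (13.21) above, the filter selection reduces to a one-line rule per frequency. A sketch, assuming the masking threshold T and background estimate Ŝb are given:

import numpy as np

def gustafsson_gain(T, Sb_hat, alpha=0.1):
    # Upper limit of Equation (13.21), clipped so that Hs <= 1. Any Hs in
    # [alpha - sqrt(T/Sb), alpha + sqrt(T/Sb)] keeps the noise error below
    # the masking threshold, so the residual noise is perceived as
    # alpha * b[n]; the upper limit minimizes speech attenuation.
    return np.minimum(1.0, alpha + np.sqrt(T / (Sb_hat + 1e-12)))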

13.6 Temporal Processing in a Time-Frequency Space

In this chapter, we have thus far performed speech enhancement by applying various forms of spectral subtraction and Wiener filtering on short-time speech segments, holding the time variable fixed in the STFT. We now take a different approach in which we hold the frequency variable fixed and filter along time-trajectories of STFT filter-bank outputs.

13.6.1 Formulation

Recall from Chapter 7 the filter bank interpretation of the STFT:

(13.22)

X(n, ω) = Σ_{m} x[m]w[n − m]e^{−jωm} = e^{−jωn}{x[n] * (w[n]e^{jωn})}

where w[n] is the analysis window, also referred to as the analysis filter. We refer to the demodulated output of each filter as the time-trajectory at frequency ω. For simplicity, throughout this section we assume no time decimation of the STFT by the frame interval L, unless otherwise needed.

Suppose we denote a short-time segment at time n, xn[m], by the two-dimensional function ƒ[n, m] in the time index variables n and m. Then the corresponding STFT can be expressed as

X(n, ω) = Σ_{m} ƒ[n, m]e^{−jωm}

Considering the Fourier transform of X(n, ω) along the time axis, at specific frequencies, we then have the function [2]

(13.23)

X̃(θ, ω) = Σ_{n} X(n, ω)e^{−jθn} = Σ_{n} Σ_{m} ƒ[n, m]e^{−jθn}e^{−jωm}

which is a two-dimensional (2-D) Fourier transform of the 2-D sequence ƒ[n, m]. The 2-D transform X̃(θ, ω) can be interpreted as a frequency analysis of the filter-bank outputs. The frequency composition of the time-trajectory of each channel is referred to as the modulation spectrum, with modulation frequency as the frequency variable θ [25].
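A numerical sketch of the modulation spectrum: compute an STFT with no time decimation and then Fourier-transform along the time axis. Note that this frames-based STFT references each frame’s phase to the frame start, differing from the fixed-time-origin convention above by a per-frame phase factor; the window type and lengths are illustrative:

import numpy as np

def modulation_spectrum(x, nwin=256, hop=1):
    # STFT X(n, omega_k) with hop = 1 (no time decimation, as assumed in
    # this section), followed by an FFT along the time axis n, giving the
    # 2-D transform of Equation (13.23); theta is the modulation frequency.
    w = np.hanning(nwin)
    frames = np.array([x[i:i + nwin] * w
                       for i in range(0, len(x) - nwin + 1, hop)])
    X = np.fft.fft(frames, axis=1)     # frequency axis omega_k
    return np.fft.fft(X, axis=0)       # modulation axis theta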

13.6.2 Temporal Filtering

The temporal processing of interest to us is the filtering of time-trajectories to remove distortions incurred by a sequence x[n]. The blind homomorphic deconvolution method of cepstral mean subtraction (CMS) introduced in Section 13.2.3 aims at this same objective and is related to the temporal filtering of this section. We formalize this temporal processing with a multiplicative modification P̃(θ, ω):

Ỹ(θ, ω) = P̃(θ, ω)X̃(θ, ω)

which can be written as a filtering operation in the time-domain variable n [2]; i.e., for each ω, we invert Ỹ(θ, ω) with respect to the variable θ to obtain

(13.24)

Y(n, ω) = P(n, ω) * X(n, ω) = Σ_{r} P(n − r, ω)X(r, ω)

where P(n, ω) denotes the time-trajectory filter at frequency ω. We see then that the STFT is convolved along the time dimension, while it is multiplied along the frequency dimension.
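In discrete form, Equation (13.24) is a one-dimensional convolution down each fixed-frequency column of the STFT array; a minimal sketch:

import numpy as np

def filter_trajectories(X, p):
    # Convolve each time-trajectory X(n, omega_k) with p[n], as in
    # Equation (13.24); rows of X index time n, columns index frequency k.
    Y = np.zeros_like(X)
    for k in range(X.shape[1]):
        Y[:, k] = np.convolve(X[:, k], p)[:X.shape[0]]
    return Y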

We now consider the problem of obtaining a sequence from the modified 2-D function Y(n, ω). In practice, we replace continuous frequency ω by ωk = 2πk/N, corresponding to N uniformly spaced filters in the discrete STFT, which we denote by Y(n, k). One approach to synthesis17 is to apply the filter-bank summation (FBS) method given in Equation (7.9) of Chapter 7. For the modified discrete STFT, Y(n, k), the FBS method gives

17 Throughout this section, the filter-bank summation (FBS) method is used for synthesis. Other methods of synthesis from a modified STFT, such as the overlap-add (OLA) or least-squared-error (LSE) method of Chapter 7, however, can also be applied. As in Chapter 7, each method has a different impact on the synthesized signal. The effect of OLA synthesis, for example, from a temporally processed STFT has been investigated by Avendano [2].

(13.25)

y[n] = (1/(Nw[0])) Σ_{k=0}^{N−1} Y(n, k)e^{j(2π/N)nk}

We saw in Chapter 7 that when no modification is applied, the original sequence x[n] is recovered uniquely with this approach under the condition that the STFT bandpass filters sum to a constant (or, more strictly, that w[n] has length less than N). On the other hand, with modification by time-trajectory filters pk[n] (representing a possibly different filter for each uniformly spaced frequency ωk = 2πk/N), we can show that (Exercise 13.7)

(13.26)

y[n] = x[n] * [(1/(Nw[0])) Σ_{k=0}^{N−1} (pk[n] * w[n])e^{j(2π/N)nk}]

Filtering along the N discrete STFT time-trajectories (with possibly different filters) is thus equivalent to a single linear time-invariant filtering operation on the original time sequence x[n] [2]. The following example considers the simplifying case where the same filter is used for each channel:

Example 13.6       Consider a convolutional modification along time-trajectories of the discrete STFT with the same causal sequence p[n] for each discrete frequency ωk = 2πk/N. The resulting discrete STFT is thus

Y(n, k) = X(n, k) * p[n],     k = 0, 1, …, N − 1

i.e., we filter by p[n] along each time-trajectory of the discrete STFT. For the FBS output we have Equation (13.26). We are interested in finding the condition on the filter p[n] and the analysis window w[n] under which y[n] = x[n].

Using Equation (13.26), we can show that y[n] can be written as

y[n] = x[n] * [(1/(Nw[0])) Σ_{k=0}^{N−1} (p[n] * w[n])e^{j(2π/N)nk}]

Rearranging this expression, we have

(13.27)

y[n] = x[n] * [(1/w[0]) w̃[n] Σ_{q=−∞}^{∞} δ[n − qN]]

Figure 13.10 Illustration of the constraint on w[n] and p[n] for y[n] = x[n] in Example 13.6: (a) the sequence w̃[n] = p[n] * w[n] with p[n] assumed causal; (b) the constraint given by w̃[qN] = w[0]δ[q].

Image

where w̃[n] = p[n] * w[n] and where we have used the result that a sum of harmonically related complex exponentials equals an impulse train. We can then simplify Equation (13.27) as

y[n] = x[n] * υ[n]

where

υ[n] = (1/w[0]) w̃[n] Σ_{q=−∞}^{∞} δ[n − qN]

Then the constraint on p[n] and w[n] for y[n] = x[n], illustrated pictorially in Figure 13.10, is given by

(1/w[0]) w̃[n] Σ_{q=−∞}^{∞} δ[n − qN] = δ[n],  i.e.,  w̃[qN] = w[0]δ[q]

or, assuming no zeros within the duration of w̃[n], the constraint for y[n] = x[n] becomes

N ≥ Nw + M − 1  (together with the normalization w̃[0] = w[0])

where Nw and M are the lengths of the window w[n] and filter p[n], respectively. ▪

We see from the previous example that, under an appropriate constraint on p[n] and w[n], we can recover x[n] when X(n, ω) is temporally filtered along the n dimension by a single filter. We have also seen in this section that linear filtering along time-trajectories is equivalent to applying a single linear time-invariant filter to the sequence x[n]. Although these results are of academic interest, our ultimate objective is to use temporal filtering along channels to remove distortion. We will now see that one approach to realizing the advantage of temporal processing lies with nonlinear transformations of the trajectories prior to filtering, for which there does not always exist an equivalent time-domain filtering of x[n] [2].

13.6.3 Nonlinear Transformations of Time-Trajectories

We have seen in Chapter 6 examples of nonlinear processing in which the logarithm and spectral root operators were applied to the STFT. Although the resulting homomorphic filtering was applied to spectral slices of the nonlinearly transformed STFT, and not along STFT filter-bank output trajectories, the concept motivates the temporal processing of this section.

Magnitude-Only Processing — We begin with “magnitude-only” processing, where only the STFT magnitude of a sequence x[n] is processed along time-trajectories to yield a new STFT magnitude:

|Y(n, ω)| = P(n, ω) * |X(n, ω)|

where we assume that the filter P(n, ω) is such that |Y(n, ω)| is a positive 2-D function. As with other processing methods of this chapter that utilize only the STFT magnitude, we attach the phase of the original STFT to the processed STFT magnitude,18 resulting in a modified 2-D function of the form

18 We can also consider a spectral magnitude-only reconstruction, e.g., the iterative solution in Section 7.5.3, whereby a signal phase is obtained through the modified STFT magnitude.

Y(n, ω) = |Y(n, ω)|e^{j∠X(n, ω)}

where ∠X(n, ω) denotes the phase of X(n, ω). Then, using the modified discrete STFT, Y(n, k), we apply FBS synthesis to obtain a sequence. This nonlinear process can be shown to be equivalent to a time-varying filter that consists of a sum of time-varying bandpass filters corresponding to the discretized frequencies, specifically (Exercise 13.8),

(13.28)

Image

where ωk = 2πk/N and where the time-varying filters

Image

with

Image

Denoting the composite filter in Equation (13.28) by q[n, m], according to our interpretation in Chapter 2, q[n, m] is a time-varying filter impulse response at time n to a unit sample applied m samples earlier. In contrast to the time-varying filter of Equation (13.26) [derived from temporal processing of X(n, ω)], this time-varying filter is quite complicated, nonlinear, and signal-dependent in the sense that it requires the phase of the STFT of x[n]. This is also in contrast to the time-varying multiplicative modification of Chapter 7, for which the FBS method results in an equivalent time-varying linear filter, Equation (7.23), that is not signal-dependent. For example, such an equivalent time-domain linear filter can be found for the zero-phase spectral subtraction and Wiener filters developed in the earlier sections of this chapter.

The magnitude function is one example of a nonlinear operation on the STFT prior to temporal filtering along time-trajectories. More generally, we can write

(13.29)

Y(n, ω) = O⁻¹[P(n, ω) * O(X(n, ω))]

where O is a nonlinear operator and where we invoke the inverse operator O⁻¹ with the objective of signal synthesis in mind. We now look at two other methods of nonlinear temporal processing, for which there is no equivalent time-domain linear filter (not even time-varying and signal-dependent) [2], and which have effective application with convolutional and additive disturbances. In these applications, we do not always seek to synthesize a signal; rather, the modified STFT may be replaced by some other feature set. We touch briefly upon these feature sets here and discuss them in more detail in Chapter 14 in the specific application domain of speaker recognition.

RASTA Processing — A generalization of cepstral mean subtraction (CMS), which we introduced in Section 13.2.3, is RelAtive SpecTrAl processing (RASTA) of temporal trajectories. RASTA, proposed by Hermansky and Morgan [23], addresses the problem of a slowly time-varying linear channel g[n, m] (i.e., convolutional distortion), in contrast to the time-invariant channel g[n] removed by CMS. The essence of RASTA is a filter along each cepstral (equivalently, log-spectral) trajectory that removes low and high modulation frequencies, and not simply the DC component, as does CMS.

In addition to being motivated by a generalization of CMS, RASTA is also motivated by certain auditory principles. This auditory-based motivation is, in part, similar to that for adaptivity in the Wiener filter of Section 13.3.2: The auditory system is particularly sensitive to change in a signal. However, there is apparent evidence that auditory channels have preference for modulation frequencies near 4 Hz [23]. This peak modulation frequency is sometimes called the syllabic rate because it corresponds roughly to the rate at which syllables occur. RASTA exploits this modulation frequency preference. With slowly varying (rather than fixed) channel degradation, and given our insensitivity to low modulation frequencies,19 in RASTA a filter that notches out frequency components at and near DC is applied to each channel. In addition, the RASTA filter suppresses high modulation frequencies to account for the human’s preference for signal change at a 4 Hz rate. (We let the reader explore the possible inconsistency of this high-frequency removal with our auditory-based motivation for RASTA.)

19 It is known, for example, that humans become relatively insensitive to stationary background noises over long time durations.

In RASTA, the nonlinear operator O in the general nonlinear processing scheme of Equation (13.29) becomes the magnitude followed by the logarithm operator. The RASTA filtering scheme then attenuates the slow and fast changes in the temporal trajectories of the logarithm of the STFT magnitude. Using the signal processing framework in Equation (13.29), we thus write the modified STFT magnitude used in RASTA enhancement as:

|X̂(n, ω)| = exp(p[n] * log|Y(n, ω)|)

Figure 13.11 Frequency response of the RASTA bandpass filter.

SOURCE: H. Hermansky and N. Morgan, “RASTA Processing of Speech” [23]. ©1994, IEEE. Used by permission.

Image

where a single filter p[n] is used along each temporal trajectory and where here Y(n, ω) denotes the STFT of a convolutionally distorted sequence x[n]. The particular discrete-time filter used in one formulation of RASTA is an IIR filter given by [23]

P(z) = 0.1 z⁴ (2 + z^{−1} − z^{−3} − 2z^{−4}) / (1 − 0.98z^{−1})

where the denominator provides a lowpass effect and the numerator a highpass effect. The sampling frequency of this RASTA filter is 100 Hz, i.e., the frame interval L corresponds to 10 ms.20

20 The RASTA filter was originally designed in the context of speech recognition where speech features are not obtained at every time sample [23].

The frequency response of the resulting bandpass RASTA filter is shown in Figure 13.11. The RASTA filter is seen to peak at about 4 Hz. As does CMS, RASTA reduces slowly varying signal components, but, in addition, suppresses variations above about 16 Hz. The complete RASTA temporal processing for blind deconvolution is illustrated in Figure 13.12. In this figure, a slowly varying distortion log |G(n, ω)|, due to a convolutional distortion g[n] to be removed by the RASTA filter p[n], is added to the rapidly varying speech contribution log |X(n, ω)|.

Although we have formulated RASTA as applied to the STFT, the primary application for RASTA has been in speech and speaker recognition where filtering is performed on temporal trajectories of critical band energies used as features in these applications. In fact, the characteristics of the RASTA filter shown in Figure 13.11 were obtained systematically by optimizing performance of a speech recognition experiment in a degrading and changing telephone environment [23]. In Chapter 14, we describe another common use of RASTA in the particular application of speaker recognition.
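A sketch of RASTA filtering of the log-magnitude trajectories, with coefficients following the transfer function given above (sampled at 100 Hz); dropping the z⁴ advance makes the filter causal at the cost of a four-frame delay:

import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_traj):
    # Bandpass RASTA filter [23]: the numerator is a highpass (difference)
    # FIR, the denominator a lowpass integrator.
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    a = np.array([1.0, -0.98])
    return lfilter(b, a, log_traj)

# Apply to each trajectory of log|Y(n, omega_k)| (rows index time):
# logY_filtered = np.apply_along_axis(rasta_filter, 0, logY)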

RASTA-Like Additive Noise Reduction — In addition to reducing convolutional distortion, RASTA can also be used to reduce additive noise. The temporal processing is applied to the STFT magnitude and the original (noisy) phase along each temporal trajectory is kept intact. In performing noise reduction along STFT temporal trajectories,21 we assume that the noise background changes slowly relative to the rate of change of speech, which is concentrated in the 1–16 Hz range.

21 Noise reduction has also been performed by temporally filtering along sinewave amplitude, frequency, and phase trajectories [48]. Given the continuity of these functions by the interpolation methods of Chapter 9, it is natural to consider this form of temporal filtering.

Figure 13.12 Complete flow diagram of RASTA processing for blind deconvolution. A linear filter p[n] is applied to each nonlinearly-processed STFT temporal trajectory consisting of the rapidly varying object component log |X(pL, ωk)| and the slowly varying distortion component log |G(pL, ωk)| where ωk = 2πk/N.

Image

A nonlinear operator used in Equation (13.29) such as the logarithm, however, does not preserve the additive noise property and thus linear trajectory filtering is not strictly appropriate. Nevertheless, a cubic-root nonlinear operator followed by RASTA filtering has led to noise reduction similar in effect to spectral subtraction, including the characteristic musicality [2],[24].

An alternative approach taken by Avendano [2] is to design a Wiener-like optimal filter that gives a best estimate of each desired temporal trajectory. In his design, Avendano chose a power-law modification of the magnitude trajectories, i.e.,

Image

and, with the original phase trajectories, an enhanced signal is obtained using FBS synthesis. Here Y(n, ω) denotes the STFT of a sequence x[n] degraded by additive noise. Motivation for the power-law, as well as the earlier cube-root modification, is evidence that the auditory system exploits this form of nonlinearity on envelope trajectories [2],[24]. For the discretized frequencies ωk = 2πk/N used in FBS, the objective is to find for each trajectory an optimal filter pk[n] = P(n, ωk) that, when applied to the noisy trajectory, yk[n] = |Y(n, ωk)|^{1/γ}, results in a modified trajectory that matches a desired trajectory, dk[n], which is the trajectory of the clean speech. This filter is designed to satisfy a least-squared-error criterion, i.e., we seek to minimize

(13.30)

Ek = Σ_{n} (dk[n] − pk[n] * yk[n])²

In forming the error Ek, a clean speech reference is used to obtain the desired trajectories dk[n]. Assuming a non-causal filter pk[n] that is non-zero over the interval [−L/2, L/2] (L assumed even), Avendano then solved this mean-squared-error optimization problem for the L + 1 unknown values of pk[n] (Exercise 13.11) and found that the resulting filters have frequency responses that resemble the RASTA filter of Figure 13.11 but with somewhat milder rolloffs. The filters are Wiener-like in the sense of preserving the measured modulation spectrum in regions of high SNR and attenuating this spectrum in low SNR regions. Interestingly, it was observed that the non-causal filter impulse responses are nearly symmetric and thus nearly zero-phase. With this filter design technique, the value of γ = 1.5 for the power-law nonlinearity was found in informal listening to give preferred perceptual quality.
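The design of Equation (13.30) is a standard linear least-squares problem; a sketch in numpy, with the noisy trajectory y and the clean (desired) trajectory d assumed given, and the filter support chosen for illustration:

import numpy as np

def design_trajectory_filter(y, d, L=40):
    # Solve min_p sum_n (d[n] - (p * y)[n])^2 for a non-causal FIR p[n]
    # supported on [-L/2, L/2], i.e., L + 1 taps, by building a matrix
    # whose columns are the trajectory shifted by each lag.
    lags = np.arange(-L // 2, L // 2 + 1)
    n = np.arange(L // 2, len(y) - L // 2)
    A = np.stack([y[n - l] for l in lags], axis=1)
    p, *_ = np.linalg.lstsq(A, d[n], rcond=None)
    return p, lags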

For various noise conditions, the optimal RASTA-like filtering was also found to give a reduction in mean-squared error in critical band energies relative to conventional spectral subtraction and Wiener filtering [2]. This reduction in mean-squared error, however, was obtained at the expense of more annoying residual “level fluctuations.” Nevertheless, an important aspect of this comparison is that because of the nonlinear power operation, there is no equivalent single time-domain filtering process as with conventional short-time spectral subtraction and Wiener filtering [2]. Although this optimal filtering of temporal trajectories suffers from residual fluctuations, it has not exploited any form of adaptive smoothing as in the Wiener filtering of time slices of Sections 13.3.2 and 13.3.4. Moreover, the approach is based on auditory principles that give promise for further improvement.

13.7 Summary

In this chapter, we have studied approaches to reduce additive noise and convolutional distortion in a speech signal over a single channel.22 For each disturbance, we first investigated techniques of filtering one spectral slice at a particular time instant, motivated by the Fourier transform view of the STFT. These techniques include spectral subtraction, Wiener filtering, and optimal spectral magnitude estimation. For each disturbance, we also investigated filtering a multitude of time slices at a particular frequency, motivated by the bandpass filtering view of the STFT. We observed that one approach to realizing the advantage of the latter temporal processing lies with nonlinear transformations of the temporal trajectories, prior to filtering, thus mimicking, in part, early stages of auditory front-end processing. These techniques include cepstral mean subtraction and RASTA processing. We also introduced the principle of auditory masking both in time, with adaptive Wiener filtering for concealing noise under dynamically changing components of speech, and, in frequency, with spectral subtraction for concealing small spectral components by nearby large spectral components. In addition, we exploited auditory phenomena in creating binaural presentations of the enhanced signal and its complement for possible further enhancement. A characteristic of all the techniques studied is that the original noisy STFT phase is preserved. No attempt is made to estimate the phase of the desired signal; rather, the noisy measurement phase is returned to the estimate. We saw that the work of Vary [53], however, indicates that phase can be important perceptually at low SNR, pointing to the need for the development of phase estimation algorithms in noise (one possibility is given in Exercise 13.4). Finally, in this chapter, we have observed or implied that, in spite of significant improvements in reducing signal distortion and residual artifacts, improved intelligibility of the enhanced waveform for the human listener remains elusive.

22 This is in contrast to enhancement with multiple recordings of the degraded speech as from multiple microphones. A useful survey of such techniques is given by Deller, Proakis, and Hansen [11].

Given the space constraints of this chapter, there remain many developing areas in speech enhancement that we did not cover. Three of these areas involve multi-resolution analysis, nonlinear filtering, and advanced auditory processing models. Here we mention a few representative examples in each area. In the first area of multi-resolution analysis, Anderson and Clements [1] have exploited auditory masking principles in the context of multi-resolution sinusoidal analysis/synthesis, similar to that illustrated in Figure 9.18. Irino [26] has developed a multi-resolution analysis/synthesis scheme based explicitly on level-adaptive auditory filters with impulse responses modeled as gammachirps [Chapter 11 (Section 11.2.2)], and Hansen and Nandkumar [20] have introduced an auditory-based multi-resolution Wiener filter that exploits lateral inhibition across channels. Wavelet-based noise reduction systems, whose bases approximate critical band responses of the human auditory system, bridge multi-resolution analysis and nonlinear filtering by applying nonlinear thresholding techniques to wavelet coefficients [12],[45]. Other nonlinear filtering methods include Occam filters based on data compression [42], and Teager-based and related quadratic-energy-based estimation algorithms [14],[27]. All of these methods, both multi-resolution and nonlinear, provide different approaches that seek to preserve fine temporal structure while suppressing additive noise. In the third developing area, advanced auditory processing models, we have only begun to explore the possibilities. In the growing field of auditory scene analysis [6], for example, components of sounds that originate from different sources (e.g., speech and additive background noise) are grouped according to models of auditory perception of mixtures of sounds. Component groups for signal separation include harmonically related sounds, common onset and offset times, and common amplitude or frequency modulation23 [9].

23 A fascinating set of experiments by McAdams [36] shows the importance of modulation in perceptually separating two summed synthetically-generated vowels. In one experiment, with fixed pitch and vocal tract, the vowels were not perceived as distinct. When frequency modulation is added to the pitch of one vowel, aural separation of the vowels significantly improves.

Although these emerging areas are in their infancy, they are providing the basis for new directions not only for the reduction of additive noise and convolutional distortions, but also for the mitigation of other common distortions not covered in this chapter. These disturbances include reverberation, nonlinear degradations that we introduce in Chapter 14, and interfering speakers that we touched on briefly in Chapter 9 (Exercise 9.17).

Appendix 13.A: Stochastic-Theoretic Parameter Estimation

In this appendix, we consider the estimation of a parameter vector x given a measurement vector y that is a probabilistic mapping of x; e.g., x might hold linear prediction coefficients and y noisy speech measurements. Three estimation methods of interest are maximum likelihood (ML), maximum a posteriori (MAP), and minimum-mean-squared error (MMSE) estimation [32].

Maximum Likelihood (ML): Suppose that the parameter vector x is deterministic and the probability density of y given x is known. In ML estimation, the parameter vector is selected that most likely resulted in the observation vector y. This corresponds to maximizing the conditional probability density p(y|x) over all x in the parameter space, where y falls in the observation space.

Maximum a posteriori (MAP): Suppose that the parameter vector x is random and the a posteriori probability density of x given y, p(x|y), is known. In MAP estimation, the parameter vector is selected to maximize p(x|y) over the space of parameter vectors. When the a priori probability density p(x) is flat over the range of x, ML and MAP yield the same parameter estimate.

Minimum-Mean-Squared Error (MMSE): Suppose again that the parameter vector x is random and the a posteriori probability density of x given y, p(x|y), is known. In MMSE estimation, the parameter vector is selected by minimizing the mean-squared error E[‖x − x̂‖²], which can be shown to result in the conditional a posteriori mean x̂ = E[x|y]; thus, when the maximum of p(x|y) equals its mean, the MAP and MMSE estimates are equal.
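As a standard illustration of the last remark (not drawn from the text): when x and y are jointly Gaussian, the posterior p(x|y) is Gaussian, so its maximum coincides with its mean and the MAP and MMSE estimates agree:

\[
\hat{x}_{\mathrm{MAP}} = \hat{x}_{\mathrm{MMSE}} = E[x \mid y] = m_x + \Lambda_{xy}\Lambda_{yy}^{-1}(y - m_y)
\]

where mx and my are the means of x and y, and Λxy and Λyy are the cross- and auto-covariance matrices.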

Exercises

13.1 Consider a signal y[n] of the form y[n] = x[n] + b[n] where x[n] is a sinewave, i.e., x[n] = A cos(ωon), and b[n] is a white noise background disturbance with variance σ². In this problem, you investigate the signal properties used in Example 13.1 for y[n], analyzed by a short-time window w[n] of length Nw.

(a) Show that when the sinewave frequency ωo is larger than the width of the main lobe of the analysis window, it follows that

Image

where Image and thus

Image

where E denotes the expectation operator.

(b) Show that the average power of the windowed noise is constant with frequency, i.e.,

Image

(c) Using the property that x[n] and b[n] are uncorrelated, show that the SNR expression in Equation (13.9) follows. Argue that Δw in Equation (13.9) represents the 3-dB bandwidth of the window main lobe.

13.2 For the signal y[n] = x[n] + b[n] with the object random process x[n] uncorrelated with the background noise random process b[n], derive the Wiener filter in Equation (13.10). Then show its time-varying counterpart in Equation (13.11) in terms of the signal-to-noise ratio R(n, ω) = Sx(n, ω)/Sb(ω).

13.3 Consider a filter bank hk[n], k = 0, 1, …, N − 1, that meets the FBS constraint. In this problem, you develop a single noise suppression filter, applied to all N channel outputs, for a noisy input y[n] = x[n] + b[n]. Assume the object random process x[n] is uncorrelated with the background noise random process b[n]. Specifically, find the optimal noise suppression filter hs[n] that minimizes the error criterion

Image

Express your solution in terms of the object and background spectra Sx(ω) and Sb(ω), respectively. Explain intuitively the difference between your solution and the standard Wiener filter.

13.4 All enhancement methods in this chapter avoid phase estimation, whether in processing STFT slices along frequency or temporal trajectories. In this problem you are asked to develop a method of estimating the phase of the Fourier transform of a sequence x[n], i.e., the phase of X(ω), from the noisy sequence y[n] = x[n] + b[n]. We work here with the entire sequences, rather than the STFT, although the approach can be generalized to the STFT.

Suppose that we switch the roles of time and frequency. Then one approach to estimating the phase function ∠X(ω) is to estimate the complex Fourier transform X(ω) in the presence of the disturbance B(ω) (i.e., the Fourier transform of b[n]) where we view the complex functions X(ω), B(ω), and Y(ω) as “time signals.”

(a) Find the best frequency-domain linear smoothing filter Hs(ω) such that

X̂(ω) = Hs(ω) * Y(ω)   (convolution in the variable ω)

gives the minimum-mean-squared error between the desired X(ω) and its estimate X̂(ω), i.e., minimize the error

Image

Using Parseval’s Theorem, express your result in the time domain as an operation on y[n] and assume you know |x[n]|2 and |b[n]|2. The procedure yields both a magnitude and phase estimate of X(ω) through smoothing the real and imaginary parts of Y(ω). The magnitude may not be as accurate an estimate as that derived from the conventional Wiener filter, but the phase estimate may be an improvement. Hint: Given that we have switched the roles of time and frequency, we expect the result to invoke a multiplication of y[n] by a function, corresponding to a smoothing of Y(ω). An example of a phase estimate from this method, for an exponentially-decaying sinewave x[n] and white Gaussian noise b[n], is shown in Figure 13.13b. In this example, |x[n]|2 is assumed to be known and |b[n]|2 is replaced by the variance of b[n].

(b) Suppose we now combine the two estimates from the frequency-domain filter of part (a) and a conventional time-domain Wiener filter. Propose an iterative estimation scheme that uses both estimates and that may provide a phase estimate that improves on each iteration. Hint: Apply the conventional time-domain Wiener filter, but a “deterministic” version where |X(ω)|² and |B(ω)|² are assumed known, followed by the frequency-domain filter from part (a), first estimating magnitude and then estimating phase. Heuristically, the time-domain method deemphasizes low-energy regions of the complex spectrum, where the phase is known to bear little resemblance to the phase of the actual signal. The frequency-domain filter then smooths the complex spectrum. The resulting smooth phase estimate for the example of an exponentially-decaying sinewave in white noise is shown in Figure 13.13c after three iterations. For the frequency-domain filter, as in part (a), |x[n]|² is assumed to be known and |b[n]|² is replaced by the variance of b[n]. For the time-domain filter, |B(ω)|² is replaced by the variance of b[n].

Figure 13.13 Iterative phase estimation based on time-frequency Wiener filtering of an exponentially-decaying sinewave in white noise in Exercise 13.4: (a) original noisy phase; (b) phase estimate from the frequency-domain Wiener filter; (c) phase estimate from the combined time- and frequency-domain Wiener filters after three iterations. In each panel, the phase of the decaying sinewave is shown as a dashed line and the phase estimate by a solid line.

Image

13.5 Suppose a speech waveform is modeled with vocal tract poles and zeros, and consider the problem of estimating the speech in the presence of additive noise. Propose a (linear) iterative method, as a generalization of the LMAP algorithm for all-pole estimation of Section 13.4, that estimates both the poles and zeros, as well as the speech, simultaneously. Hint: Maximize the a posteriori probability density p(a, b, x | y). The vectors a and b represent the pole and zero polynomial coefficients, respectively, and the vectors x and y represent (for each analysis frame) the clean and noisy speech, respectively.

13.6 In this problem, you explore the use of a masking threshold in a suppression filter hs[n] that results in perceived noise that is an attenuated version of the original noise. In this formulation, given an original noisy signal y[n] = x[n] + b[n], the desired signal can be written as d[n] = x[n] + αb[n] where α is the noise suppression scale factor.

(a) Show that an estimate of the short-time power spectrum of the noise error αb[n]−hs[n] * b[n] is given by Equation (13.20).

(b) Derive the suppression filter range, Equation (13.21), for which the short-time noise error power spectrum estimate falls below the masking threshold T(pL, ω).

(c) Discuss tradeoffs in the degree of noise attenuation and the speech distortion that can be obtained through the parameter α. Compare this suppression tradeoff with that of the standard Wiener filter.

13.7 Show that with FBS synthesis in Section 13.6.2, the operation equivalent to filtering an STFT by pk[n] along temporal trajectories is a filtering by a single time-invariant linear filter consisting of a sum of bandpass filters, i.e.,

y[n] = x[n] * [(1/(Nw[0])) Σ_{k=0}^{N−1} (pk[n] * w[n])e^{j(2π/N)nk}]

13.8 Suppose that we apply the filter P(n, ω) along each time-trajectory of the STFT magnitude of a sequence x[n], as in Section 13.6.3. Show that applying the filter-bank summation (FBS) method with discretized frequencies ωk = 2πk/N results in the input sequence x[n] being modified by the time-varying filter of Equation (13.28).

13.9 Consider a signal y[n] of the form y[n] = x[n] * g[n] where g[n] represents a linear time-invariant distortion of a desired signal x[n]. In this problem you explore different formulations of the STFT of y[n].

(a) Given y[n] = x[n] * g[n], show that

Y(n, ω) = (g[n]e^{−jωn}) * X(n, ω)

where the above convolution is performed with respect to the time variable n. Then argue that the two block diagrams in Figure 13.14 are equivalent.

Figure 13.14 Effect of convolutional distortion on the STFT: (a) filter-bank interpretation; (b) equivalence to (a).

SOURCE: C. Avendano, Temporal Processing of Speech in a Time-Feature Space [2]. ©1997, C. Avendano. Used by permission.

Image

(b) Rewrite the STFT of y[n] as

Image

and argue that if the window w[n] is long and smooth relative to the impulse response g[n], so that w[n] is approximately constant over the duration of g[n], then w[n − m]g[m] ≈ w[n]g[m], from which it follows that

Y(n, ω) ≈ X(n, ω)G(ω)

i.e., the convolutional distortion results in approximately a multiplicative modification G(ω) to the STFT of x[n]. Discuss practical conditions under which this approximation may not be valid.

13.10 Suppose we compute the complex cepstrum of the STFT of a sequence x[n] at each frame p, i.e.,

Image

where we have assumed a frame interval L = 1. Show that applying a linear time-invariant filter h[p] to log[X (p, ω)] along the time dimension for each frequency ω is equivalent to applying the same filter to c[p, n] with respect to the time variable p for each time n (i.e., for each cepstral coefficient). That is, h[p] * log[X(p, ω)] gives a complex cepstrum h[p] * c[p, n]. Show that this relation is also valid for an arbitrary frame length L. Filtering the temporal trajectories of the logarithm of the STFT is, therefore, equivalent to filtering the temporal trajectories of the corresponding cepstral coefficients.

13.11 This problem addresses the use of RASTA-like filtering to reduce additive noise, as described in Section 13.6.3. Consider the design of a distinct optimal filter pk[n] along each power-law-modified temporal trajectory yk[n] of the STFT magnitude for discretized frequencies Image and assume a known desired trajectory dk[n]. Also assume a non-causal filter pk[n], non-zero over the interval [−L/2, L/2] (L assumed even). Minimize the error function [Equation (13.30)]

Ek = Σ_{n} (dk[n] − pk[n] * yk[n])²

with respect to the unknown pk[n]. Discuss the relation of your solution for pk[n] with that of the conventional Wiener filter applied to short-time signal segments.

13.12 (MATLAB) Design in MATLAB a noise reduction system based on the spectral subtraction rule in Equation (13.6) and OLA synthesis. Use a 20-ms analysis window and a 10-ms frame interval. Apply your function to the noisy signal speech_noisy_8k (at 8000 samples/s and a 9 dB SNR) given in workspace ex13M1.mat located in companion website directory Chap_exercises/chapter13. (This signal is the one used in Example 13.5.) You will need to obtain an estimate of the power spectrum of the background noise, Ŝb(pL, ω), from an initial waveform segment. The clean signal version, speech_clean_8k, and the signal enhanced by the adaptive Wiener filter of Section 13.3.2 (Figure 13.4), speech_wiener_8k, are also given in the workspace. Comment on the musicality artifact from spectral subtraction and compare your result to speech_wiener_8k. In your spectral subtraction design, apply a scale factor α to your background power estimate, Ŝb(pL, ω), in Equation (13.6). Then comment on the tradeoff between speech distortion, noise reduction, and musicality in the residual noise as you vary α below and above unity, corresponding to an under- and over-estimation of Ŝb(pL, ω), respectively.

Bibliography

[1] D.V. Anderson and M.A. Clements, “Audio Signal Noise Reduction Using Multi-Resolution Sinusoidal Modeling,” Proc. IEEE Conf. Acoustics, Speech, and Signal Processing, vol. 2, pp. 805–808, Phoenix, AZ, March 1999.

[2] C. Avendano, Temporal Processing of Speech in a Time-Feature Space, Ph.D. Thesis, Oregon Graduate Institute of Science and Technology, April 1997.

[3] W.A. van Bergeijk, J.R. Pierce, and E.E. David, Waves and the Ear, Anchor Books, Doubleday & Company, Garden City, NY, 1960.

[4] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of Speech Corrupted by Additive Noise,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, pp. 208–211, April 1979.

[5] S.F. Boll, “Suppression of Acoustic Noise in Speech Using Spectral Subtraction,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–27, no. 2, pp. 113–120, April 1979.

[6] A.S. Bregman, Auditory Scene Analysis, The MIT Press, Cambridge, MA, 1990.

[7] O. Cappe, “Elimination of the Musical Noise Phenomenon with the Ephraim and Malah Noise Suppressor,” IEEE Trans. Speech and Audio Processing, vol. 2, no. 1, pp. 345–349, April 1994.

[8] O. Cappe and J. Laroche, “Evaluation of Short-Time Attenuation Techniques for Restoration of Musical Recordings,” IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 84–93, Jan. 1995.

[9] M. Cooke, Modeling Auditory Processing and Organization, Cambridge University Press, Cambridge, England, 1993.

[10] A. Czyzewski and R. Krolikowski, “Noise Reduction in Audio Signals Based on the Perceptual Coding Approach,” Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, Oct. 1999.

[11] J.R. Deller, J.G. Proakis, and J.H.L. Hansen, Discrete-Time Processing of Speech, Macmillan Publishing Co., New York, NY, 1993.

[12] D.L. Donoho and I.M. Johnstone, “Ideal Denoising in an Orthonormal Basis Chosen from a Library of Bases,” C.R. Academy of Science, Paris, France, vol. 319, pp. 1317–1322, 1994.

[13] Y. Ephraim and D. Malah, “Speech Enhancement Using a Minimum Mean-Square Error Short-Time Amplitude Estimator,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–32, no. 6, pp. 1109–1121, Dec. 1984.

[14] J. Fang and L.E. Atlas, “Quadratic Detectors for Energy Estimation,” IEEE Trans. Signal Processing, vol. 43, no. 11, pp. 2582–2594, Nov. 1995.

[15] H. Fletcher, “Auditory Patterns,” Rev. Mod. Phys., vol. 12, pp. 47–65, 1940.

[16] B. Gold and N. Morgan, Speech and Audio Signal Processing, John Wiley and Sons, Inc., New York, NY, 2000.

[17] S. Govindasamy, A Psychoacoustically Motivated Speech Enhancement System, S.M. Thesis, Massachusetts Institute of Technology, Dept. Electrical Engineering and Computer Science, Jan. 2000.

[18] D.M. Green, An Introduction to Hearing, John Wiley and Sons, New York, NY, 1976.

[19] S. Gustafsson, P. Jax, and P. Vary, “A Novel Psychoacoustically Motivated Audio Enhancement Algorithm Preserving Background Noise Characteristics,” Proc. IEEE Conf. Acoustics, Speech, and Signal Processing, vol. 1, pp. 397–400, Seattle, WA, May 1998.

[20] J.H. Hansen and S. Nandkumar, “Robust Estimation of Speech in Noisy Backgrounds Based on Aspects of the Auditory Process,” J. Acoustical Society of America, vol. 97, no. 6, pp. 3833–3849, June 1995.

[21] J.H. Hansen and M.A. Clements, “Constrained Iterative Speech Enhancement with Application to Automatic Speech Recognition,” IEEE Trans. Signal Processing, vol. 39, no. 4, pp. 795–805, April 1991.

[22] R.P. Hellman, “Asymmetry of Masking Between Noise and Tone,” Perception and Psychophysics, vol. 11, pp. 241–246, 1972.

[23] H. Hermansky and N. Morgan, “RASTA Processing of Speech,” IEEE Trans. Speech and Audio Processing, vol. 2, no. 4, pp. 578–589, Oct. 1994.

[24] H. Hermansky, N. Morgan, and H.G. Hirsch, “Recognition of Speech in Additive and Convolutional Noise Based on RASTA Spectral Processing,” Proc. IEEE Conf. Acoustics, Speech, and Signal Processing, vol. 2, pp. 83–86, Minneapolis, MN, April 1993.

[25] T. Houtgast and H.J.M. Steeneken, “A Review of the MTF Concept in Room Acoustics and its Use for Estimating Speech Intelligibility in Auditoria,” J. Acoustical Society of America, vol. 77, no. 3, pp. 1069–1077, March 1985.

[26] T. Irino, “Noise Suppression Using a Time-Varying, Analysis/Synthesis Gammachirp Filter Bank,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 1, pp. 97–100, Phoenix, AZ, March 1999.

[27] F. Jabloun and A.E. Cetin, “The Teager Energy Based Feature Parameters for Robust Speech Recognition in Noise,” Proc. IEEE Conf. Acoustics, Speech, and Signal Processing, vol. 1, pp. 273–276, Phoenix, AZ, March 1999.

[28] N. Jayant, J. Johnston, and R. Safranek, “Signal Compression Based on Models of Human Perception,” Proc. IEEE, vol. 81, no. 10, pp. 1385–1422, Oct. 1993.

[29] J.D. Johnston, “Transform Coding of Audio Signals Using Perceptual Noise Criteria,” IEEE J. Selected Areas Communication, vol. 6, no. 2, pp. 314–323, Feb. 1988.

[30] H.P. Knagenhjelm and W. B. Kleijn, “Spectral Dynamics is More Important than Spectral Distortion,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 1, pp. 732–735, Detroit, MI, May 1995.

[31] K.D. Kryter, “Methods for the Calculation and the Use of the Articulation Index,” J. Acoustical Society of America, vol. 34, pp. 1689–1697, Nov. 1962.

[32] J.S. Lim and A.V. Oppenheim, “Enhancement and Bandwidth Compression of Noisy Speech,” Proc. of the IEEE, vol. 67, no. 12, pp. 1586–1604, Dec. 1979.

[33] P. Lockwood and J. Boudy, “Experiments with a Nonlinear Spectral Subtractor (NSS), Hidden Markov Models and Projection, for Robust Recognition in Cars,” Speech Comm., vol. 11, pp. 215–228, June 1992.

[34] J. Makhoul, T.H. Crystal, D.M. Green, D. Hogan, R.J. McAulay, D.B. Pisoni, R.D. Sorkin, and T.G. Stockham, “Removal of Noise from Noise-Degraded Speech,” Panel on Removal of Noise from a Speech/Noise Signal, National Academy Press, Washington, D.C. 1989.

[35] D. Marr, Vision: A Computational Investigation into the Human Representation of Visual Information, W.H. Freeman and Company, New York, NY, 1982.

[36] S. McAdams, Spectral Fusion, Spectral Parsing, and Formation of Auditory Images, Ph.D. Thesis, CCRMA, Stanford University, Dept. of Music, May 1984.

[37] R.J. McAulay and M.L. Malpass, “Speech Enhancement Using a Soft-Decision Maximum Likelihood Noise Suppression Filter,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–28, no. 2, pp. 137–145, April 1980.

[38] R.J. McAulay, “Optimum Speech Classification and Its Application to Adaptive Noise Classification,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, pp. 425–428, Hartford, CT, April 1977.

[39] R.J. McAulay, “Design of a Robust Maximum Likelihood Pitch Estimator in Additive Noise,” Technical Note 1979–28, Massachusetts Institute of Technology, Lincoln Laboratory, June 11, 1979.

[40] R.J. McAulay and T.F. Quatieri, “Speech Analysis-Synthesis Based on a Sinusoidal Representation,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–34, no. 4, pp. 744–754, Aug. 1986.

[41] B.C.J. Moore, An Introduction to the Psychology of Hearing, 2nd Edition, Academic Press, New York, NY, 1988.

[42] B.K. Natarajan, “Filtering Random Noise from Deterministic Signals via Data Compression,” IEEE Trans. Signal Processing, vol. 43, no. 11, pp. 2595–2605, Nov. 1995.

[43] S.H. Nawab, T.F. Quatieri, and J.S. Lim, “Signal Reconstruction from Short-Time Fourier Transform Magnitude,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–31, no. 4, pp. 986–998, Aug. 1983.

[44] D. O’Shaughnessy, Speech Communication: Human and Machine, Addison-Wesley, Reading, MA, 1987.

[45] I. Pinter, “Perceptual Wavelet-Representation of Speech Signals and its Application to Speech Enhancement,” Computer Speech and Language, vol. 10, pp. 1–22, 1996.

[46] T.F. Quatieri and R. Baxter, “Noise Reduction Based on Spectral Change,” Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 8.2.1–8.2.4, New Paltz, NY, Oct. 1997.

[47] M.R. Schroeder, B.S. Atal, and J.L. Hall, “Optimizing Digital Speech Coders by Exploiting Masking Properties of the Human Ear,” J. Acoustical Society of America, vol. 66, pp. 1647–1652, Dec. 1979.

[48] A. Seefeldt, Enhancement of Noise-Corrupted Speech Using Sinusoidal Analysis/Synthesis, S.M. Thesis, Massachusetts Institute of Technology, Dept. Electrical Engineering and Computer Science, May 1987.

[49] D. Sen, D.H. Irving, and W.H. Holmes, “Use of an Auditory Model to Improve Speech Coders,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 2, pp. 411–414, Minneapolis, MN, April 1993.

[50] D. Sinha and A.H. Tewfik, “Low Bit Rate Transparent Audio Compression Using Adapted Wavelets,” IEEE Trans. Signal Processing, vol. 41, no. 12, pp. 3463–3479, Dec. 1993.

[51] T. Stockham, T. Cannon, and R. Ingebretsen, “Blind Deconvolution Through Digital Signal Processing,” Proc. IEEE, vol. 63, pp. 678–692, April 1975.

[52] E. Terhardt, “Calculating Virtual Pitch,” Hearing Research, vol. 1, pp. 155–199, 1979.

[53] P. Vary, “Noise Suppression by Spectral Magnitude Estimation—Mechanism and Theoretical Limits,” Signal Processing, vol. 8, pp. 387–400, 1985.

[54] N. Virag, “Single Channel Speech Enhancement Based on Masking Properties of the Human Auditory System,” IEEE Trans. Speech and Audio Processing, vol. 7, no. 2, pp. 126–137, March 1999.

[55] K. Wang and S.A. Shamma, “Self-Normalization and Noise-Robustness in Early Auditory Representations,” IEEE Trans. Speech and Audio Processing, vol. 2, no. 3, pp. 421–435, July 1994.

[56] R.L. Wegel and C.E. Lane, “The Auditory Masking of One Pure Tone by Another and Its Probable Relation to the Dynamics of the Inner Ear,” Physical Review, vol. 23, no. 2, pp. 266–285, 1924.

[57] M.R. Weiss, E. Aschkenasy, and T.W. Parsons, “Study and Development of the INTEL Technique for Improving Speech Intelligibility,” Nicolet Scientific Corp., Final Rep. NSC-FR/4023, Dec. 1974.

[58] J.D. Wise, J.R. Caprio, and T.W. Parks, “Maximum Likelihood Pitch Estimation,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–24, no. 5, pp. 418–423, Oct. 1976.
