Chapter 2
A Discrete-Time Signal Processing Framework

2.1 Introduction

In this chapter we review the foundations of discrete-time signal processing, which serve as a framework for the discrete-time speech processing approaches in the remainder of the text. We investigate some essential discrete-time tools and touch upon the limitations of these techniques, as manifested through the time-frequency uncertainty principle and the theory of time-varying linear systems, that will arise in a speech processing context. We do not cover all relevant background material in detail; we assume the reader is familiar with the basics of discrete-time signal processing, and certain topics are cited, reviewed, or extended throughout the text as they are needed.

2.2 Discrete-Time Signals

The first component of a speech processing system is the measurement of the speech signal, which is a continuously varying acoustic pressure wave. We can transduce the pressure wave from an acoustic to an electrical signal with a microphone, amplify the microphone voltage, and view the voltage on an oscilloscope. The resulting analog waveform is denoted by xa(t), which is referred to as a continuous-time signal or, alternately, as an analog waveform. To process the changing voltage xa(t) with a digital computer, we sample xa(t) at uniformly spaced time instants, an operation that is represented in Figure 2.1 by a sampler with T being the sampling interval. The sampler is sometimes called a continuous-to-discrete (C/D) converter [7] and its output is a series of numbers xa(nT) whose representation is simplified as

x[n] = xa(nT).

Figure 2.1 Measurement (a) and sampling (b) of an analog speech waveform.


The series x[n] is referred to as a discrete-time signal or sequence. Unless otherwise specified, when working with samples from an analog signal, we henceforth use the terms discrete-time signal, sequence, and (for simplicity) signal interchangeably. We have assumed that the analog signal xa(t) is sampled “fast enough” to be recoverable from the sequence x[n], a condition that is called the Nyquist criterion to which we return at the end of this chapter.

The C/D converter that generates the discrete-time signal is characterized by infinite amplitude precision. Therefore, although the signal x[n] is discrete in time, it is continuous in amplitude. In practice, however, a physical device does not achieve this infinite precision. As an approximation to a C/D converter, an analog-to-digital (A/D) converter quantizes each amplitude to a finite set of values closest to the actual analog signal amplitude [7]. The resulting digital signal is thus discrete in time and in amplitude. Associated with discrete-time signals are discrete-time systems whose input and output are sequences. Likewise, digital systems are characterized by digital inputs and outputs. The principal focus of this text is discrete-time signals and systems, except where deliberate amplitude quantization is imparted, as in speech coding for bit rate and bandwidth reduction, which is introduced in Chapter 12.

Some special sequences serve as the building blocks for a general class of discrete-time signals [7]. The unit sample or “impulse” is denoted by

δ[n] = 1,  n = 0
     = 0,  otherwise.

The unit step is given by

u[n] = 1,  n ≥ 0
     = 0,  n < 0

and can be obtained by summing the unit sample: u[n] = Σ_{k=−∞}^{n} δ[k]. Likewise, the unit sample can be obtained by differencing the unit step with itself shifted one sample to the right, i.e., forming the first backward difference: δ[n] = u[n] − u[n − 1]. The exponential sequence is given by

x[n] = Aα^n

where if A and α are real, then x[n] is real. Moreover, if 0 < α < 1 and A > 0, then the sequence x[n] is positive and decreasing with increasing n. If −1 < α < 0, then the sequence values alternate in sign. The sinusoidal sequence is given by

x[n] = A cos(ωn + φ)

with frequency ω, amplitude A, and phase offset φ. Observe that the discrete-time sinusoidal signal is periodic1 in the time variable n with period N only if N = 2πk/ω is an integer for some integer k.

1 This is in contrast to its continuous-time counterpart xa(t) = A cos(Ωt + φ), which is always periodic with period 2π/Ω. Here the uppercase frequency variable Ω is used for continuous time rather than the lowercase ω for discrete time. This notation will be used throughout the text.

The complex exponential sequence with complex gain A = |A|e^{jφ} is written as

x[n] = Ae^{jωn} = |A|e^{j(ωn+φ)} = |A| cos(ωn + φ) + j|A| sin(ωn + φ)

An interesting property, which is a consequence of the discrete time variable, is that the complex exponential sequence is periodic in the frequency variable ω with period 2π, i.e., Ae^{j(ω+2π)n} = Ae^{jωn}. This periodicity in ω also holds for the sinusoidal sequence. Therefore, in discrete time we need to consider frequencies only in the range 0 ≤ ω < 2π. The complex exponential and the above four real sequences serve as building blocks for discrete-time speech signals throughout the text.
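These building blocks are simple to generate numerically. The following Python sketch (an illustration with arbitrary parameter choices, not from the original text) constructs each sequence and verifies the summing, differencing, and frequency-periodicity relations above.

```python
import numpy as np

n = np.arange(-10, 11)                  # time indices

delta = (n == 0).astype(float)          # unit sample
u = (n >= 0).astype(float)              # unit step
A, alpha = 1.0, 0.8
expo = A * alpha**n                     # exponential A * alpha^n
omega, phi = 0.2 * np.pi, 0.0
sinus = A * np.cos(omega * n + phi)     # sinusoidal sequence

# The unit step is the running sum of the unit sample ...
assert np.allclose(u, np.cumsum(delta))
# ... and the unit sample is the first backward difference of the step.
assert np.allclose(delta[1:], u[1:] - u[:-1])

# Periodicity in frequency: e^{j(omega + 2 pi) n} equals e^{j omega n}.
assert np.allclose(np.exp(1j * (omega + 2*np.pi) * n), np.exp(1j * omega * n))
```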

2.3 Discrete-Time Systems

Discrete-time signals are often associated with discrete-time systems. A discrete-time system can be thought of as a transformation T [x] of an input sequence to an output sequence, i.e.,

y[n] = T(x[n]).

When we restrict this transformation to have the properties of linearity and time invariance, we form a class of linear time-invariant (LTI) systems. Let x1[n] and x2[n] be the inputs to a discrete-time system. Then for arbitrary constants a and b, the system is linear if and only if

T (ax1[n] + bx2[n]) = aT (x1[n]) + bT (x2[n])

which is sometimes referred to as the “principle of superposition” [7]. A system is time-invariant if a time shift in the input by n₀ samples gives a shift in the output by n₀ samples, i.e.,

if y[n] = T (x[n]),

then y[n − n₀] = T(x[n − n₀]).

An important property of an LTI system is that it is completely characterized by its impulse response, which is defined as the system’s response to a unit sample (or impulse). Given an LTI system, the output y[n] for an input x[n] is given by a sum of weighted and delayed impulse responses, i.e.,

(2.1)

y[n] = Σ_{k=−∞}^{∞} x[k] h[n − k] = x[n] * h[n]

which is referred to as the convolution of x[n] with h[n], where “*” denotes the convolution operator. We can visualize convolution either by the weighted sum in Equation (2.1) or by flipping h[n] in time and shifting the flipped h[n] past x[n]. When x[n] has finite length L and h[n] has finite length M, the resulting sequence y[n] = x[n] * h[n] has length M + L − 1. Since convolution is commutative, i.e., x[n] * h[n] = h[n] * x[n] (Exercise 2.1), we can also flip x[n] and run it past the response h[n]. The convolution operation with h[n] is sometimes referred to as filtering the input x[n] by the system h[n], an operation useful in our modeling of speech production and in almost all speech processing systems.
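The length rule and commutativity are easy to verify numerically; the short sketch below (with arbitrary short sequences, an assumption of this illustration) checks both with NumPy's convolve.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])        # length L = 4
h = np.array([1.0, -0.5, 0.25])           # length M = 3
y = np.convolve(x, h)                     # full linear convolution

assert len(y) == len(x) + len(h) - 1      # length M + L - 1 = 6
assert np.allclose(y, np.convolve(h, x))  # commutativity: x * h = h * x
```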

Two useful properties of LTI systems are stability and causality, which give a more restricted class of systems (Example 2.1). In a stable system, every bounded input produces a bounded output, i.e., if |x[n]| < ∞, then |y[n]| < ∞ for all n. A necessary and sufficient condition for stability is that h[n] be absolutely summable, i.e.,

Σ_{n=−∞}^{∞} |h[n]| < ∞

A causal system is one where, for any time instant n₀, y[n₀] does not depend on x[n] for n > n₀, i.e., the output does not depend on the future of the input. A necessary and sufficient condition for causality is that h[n] = 0 for n < 0. One can argue this necessary and sufficient condition by exploring the signal-flip interpretation of convolution (Exercise 2.2). A consequence of causality is that if x1[n] = x2[n] for n < n₀, then y1[n] = y2[n] for n < n₀.

Example 2.1       The two properties of stability and causality are illustrated with an LTI system having an exponentially decaying impulse response h[n] = Aa^n for n ≥ 0 and zero otherwise. The system is causal because h[n] = 0 for n < 0 and, when |a| < 1, the system is stable because the impulse response is absolutely summable:

Σ_{n=0}^{∞} |h[n]| = Σ_{n=0}^{∞} |A| |a|^n = |A| / (1 − |a|) < ∞

where we have used the geometric series relation Σ_{n=0}^{∞} a^n = 1/(1 − a) for |a| < 1. If, on the other hand, |a| ≥ 1, then the geometric series does not converge, the response is not absolutely summable, and the system is unstable. Note that, according to this condition, a system whose impulse response is the unit step function, i.e., h[n] = u[n], is unstable.

The terminology formulated in this section for systems is also used for sequences, although its physical meaning for sequences is lacking. A stable sequence is defined as an absolutely summable sequence, and a causal sequence is zero for n < 0. A causal sequence will also be referred to as a right-sided sequence.
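The absolute-summability test of Example 2.1 can be probed numerically by computing partial sums of |h[n]|; the sketch below (truncation length is an arbitrary choice of this illustration) shows the convergent and divergent cases.

```python
import numpy as np

def partial_abs_sum(a, A=1.0, n_max=10_000):
    """Partial sum of |A| |a|^n over n = 0..n_max-1: an indicator of absolute summability."""
    n = np.arange(n_max)
    return np.sum(np.abs(A) * np.abs(a)**n)

print(partial_abs_sum(0.9))   # ~10.0 = 1/(1 - 0.9): absolutely summable, stable
print(partial_abs_sum(1.0))   # equals n_max and keeps growing: h[n] = u[n] is unstable
```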

2.4 Discrete-Time Fourier Transform

The previous section focused on time-domain representations of signals and systems. Frequency-domain representations, the topic of this section, are useful for the analysis of signals and the design of systems for processing signals. We begin with a review of the Fourier transform.

A large class of sequences can be represented as a linear combination of complex exponentials whose frequencies lie in the range2 [−π, π]. Specifically, we write the following pair of equations:

2 Recall that e^{j(ω+2π)n} = e^{jωn}

(2.2)

x[n] = (1/2π) ∫_{−π}^{π} X(ω) e^{jωn} dω

X(ω) = Σ_{n=−∞}^{∞} x[n] e^{−jωn}

This pair of equations is known as the discrete-time Fourier transform pair representation of a sequence. For convenience, the phrase Fourier transform will often be used in place of discrete-time Fourier transform. Equation (2.2) represents x[n] as a superposition of infinitesimally small complex exponentials (X(ω) dω/2π)e^{jωn}, where X(ω) determines the relative weight of each exponential. The second equation of the pair gives X(ω), the Fourier transform of the sequence x[n]; it is referred to as the “analysis equation” because it analyzes x[n] to determine the relative weights. The first equation in the pair is the inverse Fourier transform, also referred to as the “synthesis equation” because it puts the signal back together again from its (complex exponential) components. In this text, we often use the terminology “analysis” and “synthesis” of a signal. We have not yet explicitly shown for what class of sequences such a Fourier transform pair exists. Existence means that (1) X(ω) does not diverge, i.e., the Fourier transform sum converges, and (2) x[n] can be obtained from X(ω). It can be shown that a sufficient condition for the existence of the pair is that x[n] be absolutely summable, i.e., that x[n] is stable [7]. Therefore, all stable sequences and stable system impulse responses have Fourier transforms.
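The pair can be evaluated numerically for a short stable sequence; in the sketch below (the sequence, grid density, and function names are illustrative choices, not part of the text), the synthesis integral is approximated by a Riemann sum and recovers a sample of x[n].

```python
import numpy as np

def dtft(x, n, omega):
    """Analysis equation: X(w) = sum_n x[n] e^{-j w n}."""
    return np.array([np.sum(x * np.exp(-1j * w * n)) for w in omega])

def idtft_sample(X, omega, n0):
    """Synthesis equation at time n0, with the integral done as a Riemann sum."""
    domega = omega[1] - omega[0]
    return np.sum(X * np.exp(1j * omega * n0)) * domega / (2 * np.pi)

n = np.arange(0, 8)
x = 0.5**n                                            # a short, stable sequence
omega = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
X = dtft(x, n, omega)

# Recover a sample of x[n] from its transform (up to grid error).
assert abs(idtft_sample(X, omega, 3) - x[3]) < 1e-6
```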

Some useful properties of the Fourier transform are as follows (Exercise 2.3):

P1: Since the Fourier transform is complex-valued, it can be written in rectangular and polar form as

X(ω) = Xr(ω) + jXi(ω) = |X(ω)| e^{j∠X(ω)}

where the subscripts r and i denote real and imaginary parts, respectively.

P2: The Fourier transform is periodic with period 2π:

X (ω + 2π) = X (ω)

which is consistent with the statement that the frequency range [−π, π] is sufficient for representing a discrete-time signal.

P3: For a real-valued sequence x[n], the Fourier transform is conjugate-symmetric:

X (ω) = X * (−ω)

where * denotes complex conjugate. Conjugate symmetry implies that the magnitude and real part of X(ω) are even, i.e., |X(ω)| = |X(−ω)| and Xr(ω) = Xr(−ω), while its phase and imaginary part are odd, i.e., ∠X(ω) = −∠X(−ω) and Xi(ω) = −Xi(−ω). It follows that if the Fourier transform of a sequence is not conjugate-symmetric, then the sequence must be complex-valued.

P4: The energy of a signal can be expressed by Parseval’s Theorem as

(2.3)

Σ_{n=−∞}^{∞} |x[n]|² = (1/2π) ∫_{−π}^{π} |X(ω)|² dω

which states that the total energy of a signal can be computed in either the time or the frequency domain. The functions |x[n]|² and |X(ω)|² are thought of as energy densities, i.e., the energy per unit time and the energy per unit frequency, because they describe the distribution of energy in time and frequency, respectively. Energy density is also referred to as power at a particular time or frequency.

Example 2.2       Consider the shifted unit sample

x[n] = δ[n − n₀].

The Fourier transform of x[n] is given by

X(ω) = Σ_{n=−∞}^{∞} δ[n − n₀] e^{−jωn} = e^{−jωn₀}

since x[n] is nonzero only for n = n₀. This complex function has unity magnitude and a linear phase of slope −n₀. In time, the energy in this sequence is unity and concentrated at n = n₀, but in frequency the energy is uniformly distributed over the interval [−π, π] and, as seen from Parseval's Theorem, averages to unity.

More generally, it can be shown that the Fourier transform of a displaced sequence x[n − n₀] is given by X(ω)e^{−jωn₀}. Likewise, it can be shown, consistent with the similar forms of the Fourier transform and its inverse, that the Fourier transform of e^{jω₀n}x[n] is given by X(ω − ω₀). This latter property is exploited in the following example:

Example 2.3       Consider the decaying exponential sequence multiplied by the unit step:

x[n] = a^n u[n]

with a generally complex. Then the Fourier transform of x[n] is given by

X(ω) = Σ_{n=0}^{∞} a^n e^{−jωn} = Σ_{n=0}^{∞} (ae^{−jω})^n = 1/(1 − ae^{−jω})

so that the convergence condition on a becomes |a| < 1. If we multiply the sequence by the complex exponential e^{jω₀n}, then we have the following Fourier transform pair:

a^n e^{jω₀n} u[n] ↔ 1/(1 − ae^{−j(ω−ω₀)})

An example of this latter transform pair is shown in Figure 2.2a,b, where it is seen that in frequency the energy is concentrated around ω = ω₀. The two different values of a show a broadening of the Fourier transform magnitude with decreasing a, corresponding to a faster decay of the exponential. From the linearity of the Fourier transform, and using the above relation, we can write the Fourier transform pair for a real decaying sinewave as

a^n cos(ω₀n) u[n] ↔ (1/2)[1/(1 − ae^{−j(ω−ω₀)}) + 1/(1 − ae^{−j(ω+ω₀)})]

Figure 2.2 Frequency response of decaying complex and real exponentials of Example 2.3: (a) magnitude and (b) phase for complex exponential; (c) magnitude and (d) phase for decaying sinewave (solid for slow decay [a = 0.9] and dashed for fast decay [a = 0.7]). Frequency ωo = π/2.


where we have used the identity cos(ω₀n) = (e^{jω₀n} + e^{−jω₀n})/2. Figure 2.2c,d illustrates the implications of conjugate symmetry on the Fourier transform magnitude and phase of this real sequence, i.e., the magnitude function is even, while the phase function is odd. In this case, decreasing the value of a broadens the positive and negative frequency components of the signal around the frequencies ω₀ and −ω₀, respectively.
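The broadening behavior in Figure 2.2 can be reproduced numerically from the transform pair above; in the sketch below, the values a = 0.9 and a = 0.7 and ω₀ = π/2 are taken from the figure caption, while the −3 dB width is a crude bandwidth measure in the spirit of footnote 4.

```python
import numpy as np

omega = np.linspace(-np.pi, np.pi, 2048)
w0 = np.pi / 2

def mag(a):
    """|X(w)| for x[n] = a^n e^{j w0 n} u[n]."""
    return np.abs(1.0 / (1.0 - a * np.exp(-1j * (omega - w0))))

for a in (0.9, 0.7):
    m = mag(a)
    # -3 dB points around the peak as a crude bandwidth measure
    band = omega[m > m.max() / np.sqrt(2)]
    print(f"a = {a}: peak {m.max():.1f}, approx width {band[-1] - band[0]:.3f} rad")
```

The faster decay (a = 0.7) prints a lower, wider peak, in agreement with the figure.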

Example 2.3 illustrates a fundamental property of the Fourier transform pair representation: A signal cannot be arbitrarily narrow in time and in frequency. We return to this property in the following section.

The next example derives the Fourier transform of the complex exponential, requiring in frequency the unit impulse, also called the Dirac delta function.

Example 2.4       In this case we begin in the frequency domain and perform the inverse Fourier transform. Consider a train of scaled unit impulses in frequency:

X(ω) = Σ_{r=−∞}^{∞} 2π δ(ω − ω₀ − 2πr)

where 2π periodicity is enforced by adding delta function replicas at multiples of 2π (Figure 2.3a). The inverse Fourier transform is given by3

3 Although this sequence is not absolutely summable, use of the Fourier transform pair can rigorously be justified using the theory of generalized functions [7].

x[n] = (1/2π) ∫_{−π}^{π} 2π δ(ω − ω₀) e^{jωn} dω = e^{jω₀n}

which is our familiar complex exponential. Observe that this Fourier transform pair represents the time-frequency dual of the shifted unit sample δ[n − n₀] and its transform e^{−jωn₀}. More generally, a shifted Fourier transform X(ω − ω₀) corresponds to the sequence x[n]e^{jω₀n}, a property alluded to earlier.

Figure 2.3 Dirac delta Fourier transforms of (a) complex exponential sequence, (b) sinusoidal sequence, (c) sum of complex exponentials, and (d) sum of sinusoids. For simplicity, π and 2π factors are not shown in the amplitudes.


Using the linearity of the Fourier transform, we can generalize the previous result to a sinusoidal sequence as well as to multiple complex exponentials and sines. Figure 2.3b-d illustrates the Fourier transforms of the following three classes of sequences:

Sinusoidal sequence

x[n] = A cos(ω₀n + φ) ↔ X(ω) = πA[e^{jφ} δ(ω − ω₀) + e^{−jφ} δ(ω + ω₀)]

Multiple complex exponentials

x[n] = Σ_{k=1}^{K} A_k e^{j(ω_k n + φ_k)} ↔ X(ω) = Σ_{k=1}^{K} 2πA_k e^{jφ_k} δ(ω − ω_k)

Multiple sinusoids

x[n] = Σ_{k=1}^{K} A_k cos(ω_k n + φ_k) ↔ X(ω) = Σ_{k=1}^{K} πA_k [e^{jφ_k} δ(ω − ω_k) + e^{−jφ_k} δ(ω + ω_k)]

For simplicity, each transform is represented over only one period; for generality, phase offsets are included.

2.5 Uncertainty Principle

We saw in Example 2.3 a fundamental property of the Fourier transform pair: A signal cannot be arbitrarily narrow in time and in frequency. We saw in Figure 2.2 that the Fourier transform increased in spread as the time sequence decreased in width. This property is stated more precisely by the uncertainty principle. To do so requires a formal definition of the width of a signal and of its Fourier transform. We refer to these signal characteristics as duration, denoted by D(x), and bandwidth,4 denoted by B(x), and define them respectively as

4 A more traditional definition of bandwidth, not necessarily giving the same value as that in Equation (2.4), is the distance between the 3 dB attenuation points around the average frequency.

(2.4)

D²(x) = Σ_{n=−∞}^{∞} (n − n̄)² |x[n]|²

B²(x) = (1/2π) ∫_{−π}^{π} (ω − ω̄)² |X(ω)|² dω

where n̄ is the average time of the signal, i.e., n̄ = Σ_{n=−∞}^{∞} n|x[n]|², and ω̄ is its average frequency, i.e., ω̄ = (1/2π) ∫_{−π}^{π} ω|X(ω)|² dω. In these definitions, in order that the time and frequency averages be meaningful, we assume that the signal energy is unity, i.e., Σ_{n=−∞}^{∞} |x[n]|² = 1, or that the signal has been normalized by its energy. These duration and bandwidth values give us a sense of the concentration of a signal, or of its Fourier transform, about its average location. The definitions of signal or transform width are motivated by the definition of the variance, or “spread,” of a random variable. In fact, |x[n]|² and |X(ω)|², viewed as energy densities, are analogous to the probability density function used in defining the variance of a random variable, as we will see later in the text [2]. It follows that normalizing the magnitude functions in Equation (2.4) by the total signal energy ensures probability-density-like functions that integrate to unity.
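Equation (2.4) translates directly into a numerical computation; the sketch below (with an assumed complex decaying exponential, in keeping with the later remark that complex sequences give a more meaningful bandwidth, and with a dense frequency grid) estimates D(x), B(x), and their product, anticipating the bound stated next.

```python
import numpy as np

def duration_bandwidth(x, n):
    x = x / np.sqrt(np.sum(np.abs(x)**2))            # normalize to unit energy
    n_bar = np.sum(n * np.abs(x)**2)                 # average time
    D2 = np.sum((n - n_bar)**2 * np.abs(x)**2)       # squared duration
    omega = np.linspace(-np.pi, np.pi, 8192, endpoint=False)
    X = np.array([np.sum(x * np.exp(-1j * w * n)) for w in omega])
    dw = omega[1] - omega[0]
    E = np.sum(np.abs(X)**2) * dw / (2*np.pi)        # ~1 by Parseval
    w_bar = np.sum(omega * np.abs(X)**2) * dw / (2*np.pi*E)
    B2 = np.sum((omega - w_bar)**2 * np.abs(X)**2) * dw / (2*np.pi*E)
    return np.sqrt(D2), np.sqrt(B2)

n = np.arange(0, 64)
D, B = duration_bandwidth(0.8**n * np.exp(1j*np.pi/2*n), n)
print(D, B, D * B)     # the product stays above the uncertainty limit of 1/2
```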

The uncertainty principle states that the product of signal duration and bandwidth cannot be less than a fixed limit, i.e.,

(2.5)

D(x)B(x) ≥ 1/2

Proof of the uncertainty principle is given by first applying Parseval's Theorem (taking, without loss of generality, n̄ = 0 and ω̄ = 0) to obtain

(2.6)

D²(x) = Σ_{n=−∞}^{∞} n² |x[n]|² = (1/2π) ∫_{−π}^{π} |dX(ω)/dω|² dω

from which Equation (2.5) follows. The reader is led through the derivation in Exercise 2.5. The principle implies that a signal narrow in time must have a wide Fourier transform, and a signal with a narrow Fourier transform must be wide in time.5 The uncertainty principle will play a major role in spectrographic and, more generally, time-frequency analysis of speech, especially when the speech waveform consists of dynamically changing events or closely spaced time or frequency components.

5 The nomenclature “uncertainty” is somewhat misleading because there is no uncertainty in the measurement of the signal or its Fourier transform. This terminology evolved from Heisenberg's uncertainty principle in the probabilistic context of quantum mechanics, where it was discovered that the position and momentum of a particle cannot both be measured at a particular time with absolute certainty. Heisenberg's uncertainty principle is mathematically similar to Equation (2.5), though not similar in physical interpretation, because the position and momentum functions are related through the Fourier transform. The squared magnitudes of the Heisenberg functions represent the probability of measuring a particle at a certain position and with a certain momentum, respectively, unlike the deterministic magnitude of a signal and its Fourier transform [2].

It is important to look more carefully at our definition of bandwidth in Equation (2.4). Observe that for a real sequence, from the conjugate-symmetry property, the Fourier transform magnitude is even. Thus the average frequency is zero and the bandwidth is determined by the distance between the spectral energy concentrations in positive and negative frequency. The resulting bandwidth, therefore, is not necessarily indicative of the distribution of energy of physically meaningful quantities such as system resonances (Exercise 2.4). The bandwidth of the signals with discrete-time Fourier transform magnitudes in Figure 2.2c is such a case. As a consequence, complex sequences, such as those corresponding to the transform magnitudes in Figure 2.2a, or only the positive frequencies of a real sequence, are used in computing bandwidth. In the latter case, we form a sequence s[n] with Fourier transform

S(ω) = 2X(ω),  0 ≤ ω < π
     = 0,      −π ≤ ω < 0

where we implicitly assume that S (ω) is periodic with period 2π . The sequence s[n] must be complex because conjugate symmetry is not satisfied, and so can be represented by real and imaginary components

(2.7)

s[n] = sr[n] + j si[n]

Equation (2.7), the inverse Fourier transform of S(ω), is called the analytic signal representation of x[n] and is used occasionally later in the text. It can be shown that the real component sr[n] = x[n] and that the imaginary part si[n] can be obtained from sr[n] through the frequency-domain operation (Exercise 2.6):

(2.8)

Si(ω) = H(ω) Sr(ω)

where H (ω) is called the Hilbert transformer and is given by

(2.9)

H(ω) = −j,  0 < ω < π
     = +j,  −π < ω < 0

with inverse Fourier transform (left to the reader to show)

(2.10)

h[n] = (2/π) sin²(πn/2)/n,  n ≠ 0
     = 0,                    n = 0

Because s[n] is complex, it can be expressed in polar form as

(2.11)

s[n] = A[n] e^{jθ[n]}

where A[n] = |s[n]| and θ[n] = ∠s[n]. A[n] is often called the envelope of the underlying signal x[n]; it follows that the analytic signal s[n] is sometimes referred to as its complex envelope. It is useful to express the phase θ[n] in the form

(2.12)

θ[n] = [∫_{−∞}^{t} ω(τ) dτ]|_{t=nT}

where ω(t) is referred to as the instantaneous frequency of the underlying analog signal, i.e., θ[n] is obtained by sampling the phase of the continuous-signal counterpart s(t) = A(t)e^{jθ(t)}. The discrete-time instantaneous frequency then is the derivative of the phase θ(t) evaluated at the time samples t = nT, with T the sampling interval, i.e., ω[n] = (dθ(t)/dt)|_{t=nT}. We will see later in the text that the instantaneous frequency plays an important role in many speech signal representations.
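In practice the analytic signal is computed with standard tools; the sketch below uses scipy.signal.hilbert, which returns s[n] = sr[n] + j si[n] for a real input, and extracts the envelope A[n] and a first-difference estimate of the instantaneous frequency (the AM test signal and sampling rate are assumptions of this illustration).

```python
import numpy as np
from scipy.signal import hilbert

fs = 8000.0                                   # assumed sampling rate (Hz)
t = np.arange(0, 0.05, 1/fs)
a_t = 1.0 + 0.5*np.cos(2*np.pi*40*t)          # slowly varying envelope
x = a_t * np.cos(2*np.pi*1000*t)              # real signal centered at 1 kHz

s = hilbert(x)                                # analytic signal s[n]
A = np.abs(s)                                 # envelope A[n]
theta = np.unwrap(np.angle(s))                # phase theta[n]
f_inst = np.diff(theta) / (2*np.pi) * fs      # instantaneous frequency (Hz)

print(A[:4])           # tracks a_t (up to edge effects of the finite record)
print(f_inst.mean())   # approximately 1000 Hz
```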

The analytic signal is close in form to a more familiar alternate complex signal representation. Specifically, suppose that we are given a real sequence x[n] = A[n] cos(θ[n]). Then the complex quadrature rendition of this signal is of the form

(2.13)

sq[n] = A[n] e^{jθ[n]} = A[n] cos(θ[n]) + jA[n] sin(θ[n])

The condition under which the envelope and phase of the quadrature (Equation (2.13)) and analytic (Equation (2.11)) signal representations are equivalent is developed in Exercise 2.7. The analytic and quadrature signal representations will play an important role within certain speech analysis/synthesis systems, especially those based on filter-bank, sinewave, and resonant representations that we will see throughout the text.

In closing this section, we observe that although the uncertainty principle imposes a constraint between the duration and bandwidth of a signal, the bandwidth is not determined solely by the signal duration. Cohen [2] has shown that the bandwidth of a signal, as defined in Equation (2.4), can be expressed in continuous time for a signal of the form s(t) = a(t)e^{jθ(t)} as (Exercise 2.8)

(2.14)

B²(x) = ∫_{−∞}^{∞} [a′(t)]² dt + ∫_{−∞}^{∞} [θ′(t) − Ω̄]² a²(t) dt

where Ω denotes frequency for continuous time and Ω̄ denotes the average frequency. The first term in Equation (2.14) is the average rate of change of the signal envelope a(t) and corresponds in part to signal duration; a shorter signal gives a more rapidly changing envelope. The second term is the average deviation of the instantaneous frequency θ′(t) around the average frequency Ω̄ and thus is a measure of frequency spread. The Fourier transform pair provides an “average” signal characterization that does not necessarily reveal such finer aspects of the signal. The importance of these bandwidth components in the speech context, and the more general importance of a finer signal representation than that provided by the Fourier transform, will become clear throughout the text when we describe the rapidly varying nature of certain speech sounds and when we introduce alternate tools for analysis of fine detail in the signal.

2.6 z-Transform

A limitation of the Fourier transform is that it cannot represent most unstable sequences, i.e., sequences that are not absolutely summable. The z-transform is a generalization of the Fourier transform that allows for a much larger class of sequences and is given by

(2.15)

X(z) = Σ_{n=−∞}^{∞} x[n] z^{−n}

which is an infinite series in the complex variable z = re^{jω}. For r = 1, the z-transform reduces to the Fourier transform

X(ω) = X(z)|_{z=e^{jω}}

where for convenience we have abused notation by replacing e^{jω} with ω. Therefore, we can think of the z-transform as a generalization of the Fourier transform that represents a broader class of signals as linear combinations of complex exponentials whose magnitudes may or may not be unity; sums over sequences that are not absolutely summable may be made to converge. We can see this property by rewriting Equation (2.15) as

(2.16)

X(z) = Σ_{n=−∞}^{∞} (x[n] r^{−n}) e^{−jωn}

where the sequence x[n]r^{−n} may be absolutely summable when x[n] is not.

Associated with the z-transform is a region of convergence (ROC) defined as all values of r such that

Σ_{n=−∞}^{∞} |x[n]| r^{−n} < ∞

This condition of absolute summability is a sufficient condition for the convergence of X(z). Since |z| = |re^{jω}| = r, we can view the ROC in the z-plane. In particular, the ROC is generally an annular (ring) region in the z-plane. For r = 1, the z-transform becomes the Fourier transform, and so the Fourier transform exists if the ROC includes the unit circle. As we have deduced from Equation (2.16), however, it is possible for the z-transform to converge when the Fourier transform does not.

As with the Fourier transform, there exists an inverse z-transform, also referred to as the z-transform “synthesis equation.” The z-transform pair is written as

x[n] = (1/2πj) ∮_C X(z) z^{n−1} dz

X(z) = Σ_{n=−∞}^{∞} x[n] z^{−n}

where the integral in the inverse transform is taken over C, a counterclockwise closed contour in the region of convergence of X (z) encircling the origin in the z-plane, and which can be formally evaluated using residue theory [7]. Often we are interested in the specific class of rational functions of the form

X(z) = P(z)/Q(z)

where the roots of the polynomials P(z) and Q(z) define the zeros, i.e., values of z such that P(z) = 0, and poles, i.e., values of z such that Q(z) = 0, respectively. When the z-transform is known to be rational, a partial fraction expansion in terms of single and multiple poles (order two or higher) can be used [7]. We now look at a few simple sequences whose rational z-transforms form the building blocks of a more general class of rational functions. The ROC for these examples is illustrated in Figure 2.4.

Example 2.5       Consider the shifted impulse

x[n] = δ[n − n₀].

Then the z-transform is given by

X(z) = Σ_{n=−∞}^{∞} δ[n − n₀] z^{−n} = z^{−n₀}

where the ROC is the entire z-plane, excluding z = 0 for n₀ > 0 (and excluding z = ∞ for n₀ < 0).

Figure 2.4 Region of convergence (ROC) of the z-transform for Examples 2.5–2.7: (a) x[n] = δ[n] − aδ[n − 1]; (b) x[n] = a^n u[n]; (c) x[n] = −b^n u[−n − 1]; (d) x[n] = a^n u[n] − b^n u[−n − 1]. Poles are indicated by small crosses and zeros by small circles, and the ROC is indicated by shaded regions.


Consider now the difference of unit samples

x[n] = δ[n] − aδ[n − 1]

with a generally complex. From linearity of the z-transform operator and the previous result, the z-transform is given by

X (z) = 1 − az−1

where the ROC is again the entire z-plane, excluding z = 0. For this case, we say that X(z) has a pole at z = 0, i.e., the transform goes to infinity, and a zero at z = a, i.e., the transform takes on the value zero.

Example 2.6       Consider the decaying exponential

x[n] = a^n u[n]

where a is generally complex. From the convergence property of a geometric series (Example 2.1), the z-transform is given by

X(z) = Σ_{n=0}^{∞} (az^{−1})^n = 1/(1 − az^{−1})

where the ROC is the region of the z-plane outside the radius |a|, i.e., |z| > |a|. For this case, X(z) has a pole at z = a and a zero at z = 0. For |a| < 1, the ROC includes the unit circle.

A similar case is the time-reversed version of the decaying exponential

x[n] = −b^n u[−n − 1].

Again, from the convergence property of a geometric series, the z-transform is given by

X(z) = 1/(1 − bz^{−1})

but now the ROC is the region of the z-plane inside the radius |b|, i.e., |z| < |b|. For this case, X(z) has a pole at z = b and a zero at z = 0. For |b| > 1, the ROC includes the unit circle.

From the above examples, we see that both the z-transform and an ROC are required to specify a sequence. When a sequence consists of a sum of components, such as the sequences of the previous examples, then, in general, the ROC is the intersection of the ROC for individual terms, as we show in the following example:

Example 2.7       We now combine the elemental sequences of the previous example. Let x1[n] and x2[n] denote the decaying exponential and its time-reversal, respectively. Then the sum of the two sequences is given by

x[n] = x1[n] + x2[n] = a^n u[n] − b^n u[−n − 1].

From the linearity of the z-transform, X (z) is given by

X(z) = 1/(1 − az^{−1}) + 1/(1 − bz^{−1})

where, because both sequences must converge to have x[n] converge, the ROC is an annulus in the z-plane defined by |a| < |z| < |b|. Rewriting X (z) as

X(z) = [2 − (a + b)z^{−1}] / [(1 − az^{−1})(1 − bz^{−1})]

we see that X(z) is characterized by two poles, at z = a and z = b, one zero at z = (a + b)/2, and a zero at z = 0.

The z-transforms of the previous examples can be generalized to complex a and b; the poles and zeros are not constrained to the real axis, but can lie anywhere in the z-plane. For a general rational X (z), factoring P (z) and Q (z) gives the z-transform in the form

(2.17)

X(z) = A z^{−r} [Π_{k=1}^{M_i} (1 − a_k z^{−1}) Π_{k=1}^{M_o} (1 − b_k z)] / [Π_{k=1}^{N_i} (1 − c_k z^{−1}) Π_{k=1}^{N_o} (1 − d_k z)]

where the factors (1 − a_k z^{−1}) and (1 − c_k z^{−1}) correspond to zeros and poles inside the unit circle, and (1 − b_k z) and (1 − d_k z) to zeros and poles outside the unit circle, with |a_k|, |b_k|, |c_k|, |d_k| < 1. If equality holds in any of these relations, then a zero or pole lies on the unit circle. The term z^{−r} represents a shift of the sequence with respect to the time origin (Example 2.5). For a real sequence x[n], the poles and zeros occur in complex conjugate pairs. The ROC is generally an annulus bounded by the poles, the specific shape depending on whether the signal is right-sided (x[n] = 0 for n < 0), left-sided (x[n] = 0 for n > 0), or two-sided (neither right-sided nor left-sided). The ROC may or may not include the unit circle. For a right-sided sequence, the ROC extends outside the outermost pole, while for a left-sided sequence, the ROC extends inside the innermost pole. For a two-sided sequence, the ROC is an annulus in the z-plane bounded by poles and, given convergence, not containing any poles. These different configurations are illustrated in Figure 2.5.

Figure 2.5 Poles and zeros, and region of convergence (ROC) of rational z-transforms: (a) left-sided, (b) right-sided, and (c) two-sided. The ROC is indicated by shaded regions.


A specific case, important in the modeling of speech production, is when x[n] is stable and causal; in this case, because the unit circle must be included in the ROC, all poles are inside the unit circle and the ROC is outside the outermost pole. This configuration is shown later in Figure 2.8 and will be further described in Section 2.8.

It was stated earlier that a rational z-transform can be represented by a partial fraction expansion. The details of this partial fraction expansion are given in [7]; here it suffices to describe a particular signal class. Consider the case where the number of poles is greater than the number of zeros. Then, for the case of no poles outside the unit circle, we can write Equation (2.17) as

(2.18)

X(z) = Σ_{k=1}^{N_i} A_k / (1 − c_k z^{−1})

This additive decomposition, in contrast to the earlier multiplicative representation of component poles, will be useful in modeling of speech signals, as in the representation of resonances of a vocal tract system impulse response.
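The expansion of Equation (2.18) can be computed with scipy.signal.residuez; the sketch below (an assumed two-pole example, not from the text) recovers the residues A_k and poles c_k and checks them against the impulse response.

```python
import numpy as np
from scipy.signal import residuez

b = [1.0]                                  # numerator coefficients (in z^-1)
a = np.convolve([1, -0.5], [1, -0.25])     # poles at z = 0.5 and z = 0.25
r, p, k = residuez(b, a)                   # residues A_k, poles c_k, direct terms

for rk, pk in zip(r, p):
    print(f"{rk.real:+.3f} / (1 - {pk.real:.2f} z^-1)")

# Check against the impulse response h[n] = sum_k A_k (c_k)^n u[n]
n = np.arange(5)
h = sum(rk * pk**n for rk, pk in zip(r, p)).real
print(h)   # 1, 0.75, 0.4375, ... as the difference equation recursion would give
```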

2.7 LTI Systems in the Frequency Domain

We have completed our brief review of frequency-domain representations of sequences; we now look at similar representations for systems. Consider the complex exponential x[n] = ejωn as the input to an LTI system. From the convolutional property

(2.19)

y[n] = Σ_{m=−∞}^{∞} h[m] x[n − m] = Σ_{m=−∞}^{∞} h[m] e^{jω(n−m)} = e^{jωn} Σ_{m=−∞}^{∞} h[m] e^{−jωm}

The last sum in Equation (2.19) is the Fourier transform of the system impulse response, which we denote by H(ω):

(2.20)

H(ω) = Σ_{m=−∞}^{∞} h[m] e^{−jωm}

so that

y[n] = H(ω) e^{jωn}.

Therefore, a complex exponential input to an LTI system results in the same complex exponential at the output, but modified by H (ω). It follows that the complex exponential is an eigenfunction of an LTI system, and H (ω) is the associated eigenvalue [7].6 H (ω) is often referred to as the system frequency response because it describes the change in ejωn with frequency; its z-domain generalization, H (z), is referred to as the system function or sometimes the transfer function.

6 This eigenfunction/eigenvalue terminology is also often used in mathematics. In linear algebra, for example, for a matrix A and a vector v, Av = λv, where v is the eigenvector and λ the eigenvalue.

The following example exploits the eigenfunction property of the complex exponential:

Example 2.8       A sinusoidal sequence can be expressed as

x[n] = A cos(ω₀n + φ) = (A/2)e^{jφ} e^{jω₀n} + (A/2)e^{−jφ} e^{−jω₀n}

Then, by superposition, the output of an LTI system H(ω) for the input x[n] is given by

y[n] = (A/2)e^{jφ} H(ω₀) e^{jω₀n} + (A/2)e^{−jφ} H(−ω₀) e^{−jω₀n}

where we have used the conjugate symmetry property of the Fourier transform, H(−ω₀) = H*(ω₀). Using the relation a + a* = 2Re[a] = 2|a| cos(θ), where a = |a|e^{jθ}, the output can be expressed as

y[n] = A |H(ω₀)| cos(ω₀n + φ + ∠H(ω₀))

where we have invoked the polar form H(ω) = |H(ω)| e^{j∠H(ω)}. Because the system is linear, this result can be generalized to a sum of sinewaves, i.e., for an input of the form

x[n] = Σ_{k=1}^{K} A_k cos(ω_k n + φ_k)

the output is given by

y[n] = Σ_{k=1}^{K} A_k |H(ω_k)| cos(ω_k n + φ_k + ∠H(ω_k))

A similar expression is obtained for an input consisting of a sum of complex exponentials.
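The eigenfunction property of Example 2.8 is easy to confirm numerically; in the sketch below (the filter and frequency are arbitrary choices of this illustration), a sinewave is filtered and compared, after the causal filter's transient decays, with the predicted scaled and phase-shifted sinewave.

```python
import numpy as np
from scipy.signal import lfilter, freqz

b, a = [1.0], [1.0, -0.9]              # H(z) = 1/(1 - 0.9 z^-1)
w0, A, phi = 0.3*np.pi, 2.0, 0.5
n = np.arange(2000)
y = lfilter(b, a, A*np.cos(w0*n + phi))

_, H = freqz(b, a, worN=[w0])          # frequency response evaluated at w0
y_pred = A*np.abs(H[0])*np.cos(w0*n + phi + np.angle(H[0]))

# Agreement once the transient of the causal filter has died out
assert np.allclose(y[500:], y_pred[500:], atol=1e-8)
```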

Two important consequences of the eigenfunction/eigenvalue property of complex exponentials for LTI systems, to be proven in Exercise 2.12, are stated below.

Convolution Theorem — This theorem states that convolution of sequences corresponds to multiplication of their corresponding Fourier transforms. Specifically, if

x[n] ↔ X (ω)

h[n] ↔ H (ω)

and if

y[n] = x[n] * h[n]

then

Y(ω) = X (ω) H (ω).

Windowing (Modulation) Theorem — The following theorem is the dual of the Convolution Theorem. Let

x[n] ↔ X (ω)

w[n] ↔ W (ω)

and if

y[n] = x[n]w[n]

then

Y(ω) = (1/2π) X(ω) ⊛ W(ω) = (1/2π) ∫_{−π}^{π} X(θ) W(ω − θ) dθ

where ⊛ denotes circular convolution, corresponding to one function being circularly shifted relative to the other with period 2π. We can also think of each function as being defined only in the interval [−π, π] and being shifted modulo 2π in the convolution. The “windowing” terminology comes about because, when the duration of w[n] is short relative to that of x[n], we think of w[n] as viewing (extracting) a short piece of the sequence x[n].

Example 2.9       Consider a sequence consisting of a periodic train of unit samples

x[n] = Σ_{k=−∞}^{∞} δ[n − kP]

with Fourier transform (Exercise 2.13)

X(ω) = (2π/P) Σ_{k=−∞}^{∞} δ(ω − (2π/P)k)

Suppose that x[n] is the input to an LTI system with impulse response given by

h[n] = a^n u[n]

with Fourier transform

H(ω) = 1/(1 − ae^{−jω})

Figure 2.6 Illustration of the Convolution Theorem in Example 2.9.


Then from the Convolution Theorem

Y(ω) = X(ω)H(ω) = (2π/P) Σ_{k=−∞}^{∞} H((2π/P)k) δ(ω − (2π/P)k)

We think of the magnitude of the system function as a spectral envelope of the train of Dirac delta pulses (Figure 2.6). This representation will be particularly important in modeling voiced speech sounds such as vowels.

The importance of Example 2.10, the dual of Example 2.9, will become apparent later in the text when we perform short-time analysis of speech.

Example 2.10       Consider again the sequence x[n] of Example 2.9, and suppose that the sequence is multiplied by a Hamming window of the form [7]

w[n] = 0.54 − 0.46 cos(2πn/(Nw − 1)),  0 ≤ n ≤ Nw − 1
     = 0,                               otherwise

Figure 2.7 Illustration of the Windowing Theorem in Example 2.10.


with Fourier transform denoted by W (ω) (the reader should sketch the Fourier transform of w[n]). Then from the Windowing Theorem

Y(ω) = (1/2π) X(ω) ⊛ W(ω) = (1/P) Σ_{k=−∞}^{∞} W(ω − (2π/P)k)

Therefore, the window function is replicated at the uniformly spaced frequencies of the periodic pulse train (Figure 2.7).

Both the Convolution and Windowing Theorems can be generalized with the z-transform [7]. For example, if y[n] = x[n] * h[n], then Y (z) = X (z) H (z), where the ROC of Y (z) is the intersection of the ROCs of X (z) and H (z). Extension of the Windowing Theorem is left to the reader to ponder.

2.8 Properties of LTI Systems

We saw in Equation (2.17) that an important class of LTI systems is represented by a rational z-transform. In this section, we constrain this class to satisfy properties of stability and causality that we cited earlier. LTI systems with rational z-transforms that satisfy these two properties are referred to as digital filters.7

7 The term digital filters more generally includes nonlinear filters where superposition does not hold. Also note that the causality constraint is artificial. We will see later in the text that noncausal filters are viable, as well as quite useful, in digital computer processing of speech.

2.8.1 Difference Equation Realization

The output of a digital filter is related to the input by an N th-order difference equation of the form

(2.21)

y[n] = Σ_{k=1}^{N} a_k y[n − k] + Σ_{k=0}^{M} b_k x[n − k]

which corresponds to a rational system function. The starting condition of the difference equation is that of initial rest because linearity and time invariance require that the output be zero for all time if the input is zero for all time [7]. We can evaluate the impulse response by a recursive computation of the difference equation for input x[n] = δ[n] (assuming that the response is causal) or by means of the z-transform. In the latter case, we apply the z-transform to both sides of the difference equation and use the linearity of the z-transform operator and the delay property (i.e., x[n − n₀] has z-transform X(z)z^{−n₀}) to obtain

Y(z) = Σ_{k=1}^{N} a_k Y(z) z^{−k} + Σ_{k=0}^{M} b_k X(z) z^{−k}

or

H(z) = Y(z)/X(z) = [Σ_{k=0}^{M} b_k z^{−k}] / [1 − Σ_{k=1}^{N} a_k z^{−k}]

which is a rational function with numerator and denominator that are polynomials in the variable z^{−1}. The transfer function can be expressed in the factored form of Equation (2.17), but since we assume that the system is causal and stable, poles are restricted to lie inside the unit circle. This constraint is understood by recalling that causality implies right-sidedness, which implies that the ROC is outside the outermost pole, and that stability implies that the ROC includes the unit circle, so that all poles must fall inside the unit circle. The zeros, on the other hand, can fall inside or outside the unit circle (Figure 2.8). The factored transfer function is then reduced to

(2.22)

H(z) = A z^{−r} [Π_{k=1}^{M_i} (1 − a_k z^{−1}) Π_{k=1}^{M_o} (1 − b_k z)] / Π_{k=1}^{N_i} (1 − c_k z^{−1})

with |a_k|, |b_k|, |c_k| < 1, with M_i + M_o = M and N_i = N, and where the terms (1 − a_k z^{−1}) and (1 − c_k z^{−1}) represent, respectively, zeros and poles inside the unit circle and (1 − b_k z) represents zeros outside the unit circle.
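The recursion of Equation (2.21) can be evaluated directly under the initial-rest condition; the following sketch (an assumed one-pole example) computes the impulse response recursively. Note that it uses the sign convention of Equation (2.21), in which the denominator is 1 − Σ a_k z^{−k}; this differs from the convention of scipy.signal.lfilter, where the feedback coefficients carry the opposite sign.

```python
import numpy as np

def difference_equation(b, a, x):
    """y[n] = sum_{k=0..M} b[k] x[n-k] + sum_{k=1..N} a[k-1] y[n-k], initial rest."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        y[n] += sum(a[k - 1] * y[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
    return y

# Impulse response of H(z) = 1/(1 - 0.9 z^-1): h[n] = 0.9^n u[n]
x = np.zeros(10); x[0] = 1.0
print(difference_equation([1.0], [0.9], x))   # 1, 0.9, 0.81, ...
```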

Figure 2.8 Pole-zero configuration for a causal and stable discrete-time system.


2.8.2 Magnitude-Phase Relationships

When all zeros as well as poles of a rational transfer function H (z) are inside the unit circle, the transfer function is said to be minimum phase, and the corresponding impulse response is referred to as a minimum-phase sequence [7]. Such zeros and poles will sometimes be called “minimum-phase components” of the transfer function. An arbitrary causal and stable digital filter, however, will have its poles, but not necessarily its zeros, inside the unit circle. Zeros (or poles) outside the unit circle are referred to as maximum-phase components. Therefore, H (z) is generally mixed-phase, consisting of a minimum-phase and maximum-phase component, i.e.,

H(z) = Hmin(z) Hmax(z)

where Hmin(z) consists of all poles and zeros inside the unit circle, and Hmax(z) consists of all zeros outside the unit circle. Although not strictly following our definitions, we also include in these terms zeros on the unit circle. The terminology “minimum-phase” and “maximum-phase” is also applied to discrete-time signals, as well as to systems and their impulse responses.

Alternatively, any digital filter can be represented by the cascade of a minimum-phase “reference” system Hrmp(z) and an all-pass system Aall(z):

H(z) = Hrmp(z) Aall(z)

where the all-pass system is characterized by a frequency response with unity magnitude for all ω. In particular, an arbitrary rational all-pass Aall(z) can be shown to consist of a cascade of factors of the form (z^{−1} − a*)/(1 − az^{−1}) where |a| < 1 (Exercise 2.10). Consequently, such all-pass systems have the property that their poles and zeros occur at conjugate reciprocal locations; by this, we mean that a zero at z = 1/a* implies a pole at z = a. It follows that if H(z) contains Q zeros then, because |H(ω)| = |Hrmp(ω)|, there exists a maximum of 2^Q different phase functions (excluding linear phase), and thus 2^Q different sequences, for the given magnitude function.8 These phase functions can be obtained by reflecting zeros about the unit circle to their conjugate reciprocal locations, i.e., by multiplying H(z) by all-pass factors of the form (z^{−1} − a*)/(1 − az^{−1}). Because such an all-pass factor has negative phase (Exercise 2.10), if we flip a zero of a minimum-phase function outside the unit circle, then the resulting phase is more negative than the original. The term minimum phase-lag would thus be more precise than the commonly used minimum phase [7].

8 Because we assume causality and stability, the poles lie inside the unit circle. Different phase functions, for a specified magnitude, therefore are not contributed by the poles.

There are a number of important properties of minimum-phase sequences that have important consequences for speech modeling and processing. A minimum-phase sequence is uniquely specified by the magnitude of its Fourier transform. This result will be proven formally in Chapter 10 in the context of the complex cepstrum representation of a sequence, but can be seen intuitively from the example of a stable rational z-transform. In this case, for a given Fourier-transform magnitude, there is only one sequence with all zeros (or poles) inside the unit circle. Likewise, the Fourier transform phase of a minimum-phase sequence H (ω) uniquely specifies the sequence (to within a scale factor).

Another useful property relates to the energy concentration of a minimum-phase sequence. From Parseval’s theorem, all sequences with the same Fourier transform magnitude have the same energy, but when we flip zeros (or poles) inside and outside the unit circle to their conjugate reciprocal locations, this energy gets distributed along the time axis in different ways. It can be shown that a finite-length (all-zero) minimum-phase sequence has energy most concentrated near (and to the right of) the time origin, relative to all other finite-length causal sequences with the same Fourier transform magnitude, and thus tends to be characterized by an abrupt onset or what is sometimes referred to as a fast “attack” of the sequence.9 This property can be formally expressed as [7]

9 It is of interest to note that a sufficient but not necessary condition for a causal sequence to be minimum phase is that |h[0]| > Σ_{n=1}^{∞} |h[n]|.

(2.23)

Σ_{n=0}^{m} |h[n]|² ≤ Σ_{n=0}^{m} |hrmp[n]|²,  for all m

where h[n] is a causal sequence with the Fourier transform magnitude equal to that of the reference minimum-phase sequence hrmp[n]. As zeros are flipped outside the unit circle, the energy of the sequence is delayed in time, the maximum-phase counterpart having maximum energy delay (or phase lag) [7]. Similar energy localization properties are found with respect to poles. However, because causality strictly cannot be made to hold when a z-transform contains maximum-phase poles, it is more useful to investigate how the energy of the sequence shifts with respect to the time origin. As illustrated in Example 2.11, flipping poles from inside to outside the unit circle to their conjugate reciprocal location moves energy to the left of the time origin, transforming the fast attack of the minimum-phase sequence to a more gradual onset. We will see throughout the text that numerous speech analysis schemes result in a minimum-phase vocal tract impulse response estimate. Because the vocal tract is not necessarily minimum phase, synthesized speech may be characterized in these cases by an unnaturally abrupt vocal tract impulse response.
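The energy-concentration property of Equation (2.23) can be demonstrated by reflecting a zero to its conjugate reciprocal location; in the sketch below (an assumed two-zero example, with a helper function introduced only for this illustration), the two sequences have identical Fourier transform magnitudes, but the minimum-phase one front-loads its energy.

```python
import numpy as np

def dtft_mag(h, w):
    """|H(w)| for the finite sequence h[0..M] (helper for this sketch only)."""
    return np.abs(np.polyval(h[::-1], np.exp(-1j * w)))

h_min = np.convolve([1.0, -0.5], [1.0, -0.8])   # zeros at z = 0.5, 0.8: minimum phase
# Reflect the zero at 0.8 to its conjugate reciprocal: (1 - 0.8 z^-1) -> (z^-1 - 0.8)
h_mix = np.convolve([1.0, -0.5], [-0.8, 1.0])   # one zero now outside the unit circle

w = np.linspace(0, np.pi, 512)
assert np.allclose(dtft_mag(h_min, w), dtft_mag(h_mix, w))   # same |H(w)|

# Partial energies, as in Equation (2.23): minimum phase dominates term by term.
print(np.cumsum(h_min**2))   # [1.    2.69  2.85]
print(np.cumsum(h_mix**2))   # [0.64  2.6   2.85]
```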

Example 2.11       An example comparing a mixed-phase impulse response h[n], having poles inside and outside the unit circle, with its minimum-phase reference hrmp[n] is given in Figure 2.9. The minimum-phase sequence has pole pairs at 0.95e^{±j0.1} and 0.95e^{±j0.3}. The mixed-phase sequence has pole pairs at 0.95e^{±j0.1} and (1/0.95)e^{±j0.3}. The minimum-phase sequence (a) is concentrated to the right of the origin and in this case is less “dispersed” than its non-minimum-phase counterpart (c). Panels (b) and (d) show that the frequency response magnitudes of the two sequences are identical. As we will see later in the text, there are perceptual differences in speech synthesis between the fast and gradual “attack” of the minimum-phase and mixed-phase sequences, respectively.

Figure 2.9 Comparison in Example 2.11 of (a) a minimum-phase sequence hrmp[n] with (c) a mixed-phase sequence h[n] obtained by flipping one pole pair of hrmp[n] outside the unit circle to its conjugate reciprocal location. Panels (b) and (d) show the frequency response magnitudes of the minimum- and mixed-phase sequences, respectively.


2.8.3 FIR Filters

There are two classes of digital filters: finite impulse response (FIR) and infinite impulse response (IIR) filters [7],[10]. The impulse response of an FIR filter has finite duration and corresponds to having no denominator in the rational function H(z), i.e., there is no feedback in the difference equation (2.21). This results in the reduced form

(2.24)

y[n] = Σ_{k=0}^{M} b_k x[n − k]

Implementing such a filter thus requires simply a train of delay, multiply, and add operations. By applying the unit sample input and interpreting the output as the sum of weighted delayed unit samples, we obtain the impulse response given by

h[n] = b_n,  0 ≤ n ≤ M
     = 0,    otherwise

Because h[n] is bounded over the finite duration 0 ≤ n ≤ M, it is causal and stable. The corresponding rational transfer function in Equation (2.22) reduces to the form

H(z) = A z^{−r} Π_{k=1}^{M_i} (1 − a_k z^{−1}) Π_{k=1}^{M_o} (1 − b_k z)

with Mi + Mo = M and with zeros inside and outside the unit circle; the ROC is the entire z-plane except at the only possible poles z = 0 or z = ∞.

FIR filters can be designed to have perfectly linear phase. For example, if we impose on the impulse response symmetry of the form h[n] = h[M − n], then under the simplifying assumption that M is even (Exercise 2.14), H(ω) = A(ω)e^{−jω(M/2)} where A(ω) is purely real, implying that phase distortion will not occur due to filtering [10], an important property in speech processing.10

10 This does not mean that A(ω) is positive. However, if the filter h[n] has most of its spectral energy where A(ω) > 0, then little speech phase distortion will occur.

2.8.4 IIR Filters

IIR filters include the denominator term in H(z) and thus have feedback in the difference equation representation of Equation (2.21). Because symmetry is required for linear phase, most11 IIR filters will not have linear phase since they are right-sided and infinite in duration. Generally, IIR filters have both poles and zeros. As we noted earlier for the special case where the number of zeros is less than the number of poles, the system function H(z) can be expressed in a partial fraction expansion as in Equation (2.18). Under this condition, for causal systems, the impulse response can be written in the form

11 A class of linear phase IIR filters has been shown to exist [1]. The transfer function for this filter class, however, is not rational and thus does not have an associated difference equation.

h[n] = Σ_{k=1}^{N_i} A_k (c_k)^n u[n]

where ck is generally complex so that the impulse response is a sum of decaying complex exponentials. Equivalently, because h[n] is real, it can be written by combining complex conjugate pairs as a set of decaying sinusoids of the form

h[n] = Σ_{k=1}^{N_i/2} B_k |c_k|^n cos(θ_k n + φ_k) u[n],  with c_k = |c_k| e^{jθ_k}

where we have assumed no real poles and thus Ni is even. Given a desired spectral magnitude and phase response, there exist numerous IIR filter design methods [7],[10].

In the implementation of IIR filters, there exists more flexibility than with FIR filters. A “direct-form” method is seen in the recursive difference equation itself [7],[10]. The partial fraction expansion of Equation (2.18) gives another implementation that, as we will see later in the text, is particularly useful in a parallel resonance realization of a vocal tract transfer function. Suppose, for example, that the number of poles in H(z) is even and that all poles occur in complex conjugate pairs. Then we can alter the partial fraction expansion in Equation (2.18) to take the form

(2.25)

H(z) = Σ_{k=1}^{N_i/2} (B_k + C_k z^{−1}) / [(1 − c_k z^{−1})(1 − c_k* z^{−1})]

which represents N_i/2 second-order IIR filters in parallel. Other digital filter implementation structures are introduced as needed in speech analysis/synthesis schemes throughout the text.
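A minimal parallel-resonator realization in the spirit of Equation (2.25) sums the outputs of second-order sections, each carrying one complex conjugate pole pair; the sketch below uses assumed resonance values and is not a vocal tract model from the text.

```python
import numpy as np
from scipy.signal import lfilter

def second_order_section(r, w):
    """Denominator (1 - c z^-1)(1 - c* z^-1) for the pole pair c = r e^{jw}."""
    return [1.0, -2.0*r*np.cos(w), r*r]

x = np.zeros(64); x[0] = 1.0                   # unit sample input
resonances = [(0.95, 0.1*np.pi), (0.9, 0.3*np.pi)]   # assumed (radius, frequency)
y = sum(lfilter([1.0], second_order_section(r, w), x) for r, w in resonances)

print(y[:6])   # impulse response: a sum of two decaying sinusoids
```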

2.9 Time-Varying Systems

Up to now we have studied linear systems that are time-invariant, i.e., if x[n] results in y[n], then a shifted input x[n − n₀] results in a shifted output y[n − n₀]. In the speech production mechanism, however, we often encounter time-varying linear systems. Although superposition holds in such systems, time-invariance does not. A simple illustrative example is a “system” that multiplies the input by a sequence h[n]. The system is linear because (αx1[n] + βx2[n])h[n] = αx1[n]h[n] + βx2[n]h[n], but it is not time-invariant because, in general, x[n − n₀]h[n] ≠ x[n − n₀]h[n − n₀].

A time-varying linear system is characterized by an impulse response that changes with the time at which the unit sample is applied. Such a system can be represented by a two-dimensional function g[n, m], which is the impulse response at time n to a unit sample applied at time m. The response, for example, to a unit sample at time m = 0 is g[n, 0], while the response to δ[n − n₀] is g[n, n₀]. The two-dimensional function g[n, m] is sometimes referred to as Green's function [8]. Because the system is linear, and because an input x[n] consists of a sum of weighted and delayed unit samples, i.e., x[n] = Σ_{m=−∞}^{∞} x[m]δ[n − m], the output to the input x[n] is given by

(2.26)

y[n] = Σ_{m=−∞}^{∞} g[n, m] x[m]

which is a superposition sum, but not a convolution. We can see how Equation (2.26) differs from a convolution by invoking an alternate two-dimensional representation, which is the response of the system at time n to a unit sample applied m samples earlier, at time n − m. This new function, called the time-varying unit sample response and denoted by h[n, m], is related to Green's function by h[n, m] = g[n, n − m] or, equivalently, h[n, n − m] = g[n, m]. We can then write the time-varying system output as (Exercise 2.16)

(2.27)

y[n] = Σ_{m=−∞}^{∞} h[n, m] x[n − m]

where we have invoked a change in variables and where the weight on each impulse response corresponds to the input sequence m samples in the past. When the system is time-invariant, it follows that Equation (2.27) is expressed as (Exercise 2.16)

(2.28)

y[n] = Σ_{m=−∞}^{∞} h[m] x[n − m] = h[n] * x[n]

which is the convolution of the input with the impulse response of the resulting linear time-invariant system.
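The superposition sum of Equation (2.27) can be implemented directly; the sketch below applies a toy causal time-varying unit sample response (an exponential decay whose rate drifts with n, an assumption for illustration only).

```python
import numpy as np

def time_varying_filter(x, h_of):
    """y[n] = sum_m h[n, m] x[n - m], with h_of(n) returning h[n, 0..M]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        h = h_of(n)
        for m in range(min(len(h), n + 1)):
            y[n] += h[m] * x[n - m]
    return y

# h[n, m] = a(n)^m for m = 0..20, with a decay rate that changes over time
h_of = lambda n: (0.5 + 0.4*n/100.0) ** np.arange(21)
x = np.random.default_rng(0).standard_normal(100)
y = time_varying_filter(x, h_of)
print(y[:5])
```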

It is of interest to determine whether we can devise Fourier and z-transform pairs, i.e., a frequency response and transfer function, for linear time-varying systems, as we had done with linear time-invariant systems. Let us return to the familiar complex exponential as an input to the linear time-varying system with impulse response h[n, m]. Then, from Equation (2.27), the output y[n] is expressed as

(2.29)

y[n] = Σ_{m=−∞}^{∞} h[n, m] e^{jω(n−m)} = e^{jωn} H(n, ω)

where

H(n, ω) = Σ_{m=−∞}^{∞} h[n, m] e^{−jωm}

which is the Fourier transform of h[n, m] at time n evaluated with respect to the variable m and referred to as the time-varying frequency response. Equivalently, we can write the time-varying frequency response in terms of Green’s function as (Exercise 2.17)

(2.30)

H(n, ω) = e^{−jωn} Σ_{m=−∞}^{∞} g[n, m] e^{jωm}

which, except for the linear phase factor e^{−jωn}, is the Fourier transform of Green's function at time n.

Because the system of interest is linear, its output to an arbitrary input x[n] is given by the following superposition [8] (Exercise 2.15):

(2.31)

y[n] = (1/2π) ∫_{−π}^{π} X(ω) H(n, ω) e^{jωn} dω

so that the output y[n] at time n is the inverse Fourier transform of the product X(ω)H(n, ω), which can be thought of as a generalization of the Convolution Theorem for linear time-invariant systems. This generalization, however, can be taken only so far. For example, the elements of a cascade of two time-varying linear systems, i.e., H1(n, ω) followed by H2(n, ω), do not generally combine in the frequency domain by multiplication, and the elements cannot generally be interchanged, as illustrated in the following example. Consequently, care must be taken in interchanging the order of time-varying systems in the context of speech modeling and processing.

Example 2.12       Consider the linear time-varying multiplier operation

y[n] = x[n] e^{jω₀n}

cascaded with a linear time-invariant ideal lowpass filter h[n], as illustrated in Figure 2.10. Then, in general, (x[n]e^{jω₀n}) * h[n] ≠ (x[n] * h[n])e^{jω₀n}. For example, let x[n] be bandlimited with X(ω) nonzero only for |ω| < ω_c, let h[n] have lowpass cutoff frequency ω_c, and let ω₀ > 2ω_c. When the lowpass filter follows the multiplier, the output is zero; when the order is interchanged, the output is nonzero.
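The non-commutativity in Example 2.12 can be checked numerically; in the sketch below (with an assumed input, modulation frequency, and a designed FIR lowpass filter standing in for the ideal filter), modulating before filtering yields essentially zero output while the interchanged order does not.

```python
import numpy as np
from scipy.signal import firwin, lfilter

n = np.arange(512)
x = np.cos(0.1*np.pi*n)                    # lowpass input
w0 = 0.5*np.pi                             # modulator frequency
h = firwin(101, 0.25)                      # FIR lowpass, cutoff 0.25*pi rad/sample

y1 = lfilter(h, 1.0, x*np.exp(1j*w0*n))    # modulate, then filter
y2 = lfilter(h, 1.0, x)*np.exp(1j*w0*n)    # filter, then modulate

print(np.max(np.abs(y1)), np.max(np.abs(y2)))   # near zero versus near one
```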

We will see in following chapters that under certain “slowly varying” conditions, linear time-varying systems can be approximated by linear time-invariant systems. The accuracy of this approximation will depend on the time-duration over which we view the system and its input, as well as the rate at which the system changes. More formal conditions have been derived by Matz and Hlawatsch [5] under which a “transfer function calculus” is allowed for time-varying systems.

Figure 2.10 Cascade configurations of linear time-invariant and time-varying systems.


2.10 Discrete Fourier Transform

The Fourier transform of a discrete-time sequence is a continuous function of frequency [Equation (2.2)]. Because, in practice, when using digital computers we cannot work with continuous frequency, we need to sample the Fourier transform, and, in particular, we want to sample finely enough to be able to recover the sequence. For sequences of finite length N, sampling yields a new transform referred to as the discrete Fourier transform or DFT. The DFT pair representation of x[n] is given by

(2.32)

X(k) = Σ_{n=0}^{N−1} x[n] e^{−j(2π/N)kn},  k = 0, 1, …, N − 1

x[n] = (1/N) Σ_{k=0}^{N−1} X(k) e^{j(2π/N)kn},  n = 0, 1, …, N − 1

The sequence x[n] and transform X(k), although specified only in the interval [0, N − 1], take on an implicit periodicity with period N in operations involving the DFT [7]. Observe that if we sample the discrete-time Fourier transform (continuous in frequency) at the uniformly spaced frequencies ω_k = 2πk/N, we obtain the values of the discrete Fourier transform. The two equations of the DFT pair are sometimes referred to as the analysis (DFT) and synthesis (inverse DFT) equations for the DFT.

The properties of the DFT are similar to those of the discrete-time Fourier transform. These include the DFT counterparts to the four properties given in Section 2.4. For example, Parseval's theorem for the DFT can be shown to be given by

Σ_{n=0}^{N−1} |x[n]|² = (1/N) Σ_{k=0}^{N−1} |X(k)|²

The functions |x[n]|² and |X(k)|² are still thought of as “energy densities,” i.e., the energy per unit time and the energy per unit frequency, because they describe the distribution of energy in time and frequency, respectively.

There are, however, some important distinctions from the discrete-time Fourier transform. Because in the DFT representation we think of x[n] as one period of a periodic sequence with period N, the argument of x[n] is computed modulo N, i.e., x[n] = x[n modulo N], which is also denoted by the subscript notation x[n] = x[(n)_N]. Consequently, many operations are performed modulo N. Consider, for example, a shifted unit sample.

Example 2.13       For a shifted unit sample, the delay is computed modulo N:

x[n] = δ[(n − n₀)_N]

which we can think of as shifting the periodic construction of the unit sample (with period N), sometimes referred to as a circular shift or rotation, and then extracting the resulting sequence over 0 ≤ n ≤ N − 1. The DFT is given by

X(k) = Σ_{n=0}^{N−1} δ[(n − n₀)_N] e^{−j(2π/N)kn} = e^{−j(2π/N)kn₀}

because e^{−j(2π/N)kn} is periodic in n with period N and because x[n] is nonzero only for n = n₀.

Consider now the Convolution Theorem introduced in Section 2.7. We saw that convolution of two sequences, sometimes referred to as a linear convolution, corresponds to multiplication of their discrete-time Fourier transforms. Multiplication of DFTs of two sequences, on the other hand, corresponds to a circular convolution of the (implied) periodic sequences [7]. Specifically, let x[n] have N-point DFT X(k) and h[n] have N-point DFT H(k). Then the inverse DFT of Y(k) = X(k)H(k) is not the linear convolution y[n] = x[n] * h[n], but rather the circular convolution y[n] = x[n] ⊛ h[n], where one function is circularly shifted relative to the other with period N. We can also think of each sequence as being defined only in the interval [0, N − 1] and being shifted modulo N in the convolution. Suppose now that x[n] is nonzero over the interval 0 ≤ n ≤ M − 1 and h[n] is nonzero over the interval 0 ≤ n ≤ L − 1. The linear and circular convolutions are equivalent only when the sum of the sequence durations (minus one) is no greater than the DFT length, i.e.,

x[n] ⊛ h[n] = x[n] * h[n]

only if M + L − 1 ≤ N. The implication is that, because circular convolution is performed modulo N, zero padding of the respective sequences is required to obtain a linear convolution. Similar considerations must be made in frequency for the DFT realization of the Windowing Theorem. A more extensive description of the DFT realization of the Convolution and Windowing Theorems can be found in [7].

Finally, there exists a fast implementation of the DFT known as the fast Fourier transform (FFT). The straightforward DFT requires on the order of N² operations, i.e., multiplies and additions, while the FFT requires on the order of N log N operations [7]. In performing convolution with the DFT, it is therefore often faster to compute two forward FFTs and one inverse FFT than to apply the direct shift, multiply, and accumulate operations.
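
A rough timing comparison suggests the savings; the sequence length below is an arbitrary assumption, and the crossover point depends on the machine and on the conv.m implementation:

    N = 2^15;
    x = randn(1, N); h = randn(1, N);
    tic; y1 = conv(x, h); t_direct = toc;            % direct shift-multiply-accumulate
    Nfft = 2^nextpow2(2*N - 1);                      % pad to a power of two
    tic; y2 = ifft(fft(x, Nfft) .* fft(h, Nfft));    % two forward FFTs, one inverse
    y2 = y2(1:2*N-1); t_fft = toc;
    [t_direct, t_fft]                                % FFT route is typically much faster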

2.11 Conversion of Continuous Signals and Systems to Discrete Time

2.11.1 Sampling Theorem

We began this review by sampling a continuous-time speech waveform to generate a discrete-time sequence, i.e., x[n] = xa(t)|t=nT, where T is the sampling interval. An implicit assumption was that the sampling is fast enough so that we can recover xa(t) from x[n]. The condition under which recovery is possible is called the Sampling Theorem.

Specifically, if xa(t) is bandlimited, i.e., its Fourier transform Xa(Ω) is zero above some highest frequency ΩN, then xa(t) can be recovered from its samples provided the sampling frequency Ωs = 2π/T satisfies Ωs ≥ 2ΩN; the minimum rate 2ΩN is called the Nyquist rate. For example, in speech we might assume a 5000 Hz bandwidth. Therefore, for signal recovery we must sample at Fs ≥ 2(5000) = 10000 samples/s, corresponding to a T = 100 μs sampling interval.

The basis for the Sampling Theorem is that sampling xa(t) with sampling interval T results in spectral duplicates spaced by Ωs = 2π/T, so that sampling at or above the Nyquist rate avoids aliasing, thus preserving the spectral integrity of the signal. The sampling can be performed with a periodic impulse train with spacing T and unity weights, i.e., p(t) = Σ_{n=−∞}^{∞} δ(t − nT). The impulse train resulting from multiplication with the signal xa(t), denoted by xp(t), has weights equal to the signal values at the sampling instants, i.e.,

(2.33)

xp(t) = xa(t)p(t) = Σ_{n=−∞}^{∞} xa(nT) δ(t − nT)

The impulse weights are values of the discrete-time signal, i.e., x[n] = xa(nT), and therefore, as illustrated in Figure 2.11, the cascade of sampling with the impulse train p(t) followed by conversion of the resulting impulse weights to a sequence is thought of as an ideal A/D converter (or C/D converter). In the frequency domain, the impulse train p(t) maps to another impulse train with spacing Ωs = 2πFs, i.e., the Fourier transform of p(t) is P(Ω) = (2π/T) Σ_{k=−∞}^{∞} δ(Ω − kΩs), where Ωs = 2π/T. Using the continuous-time version of the Windowing Theorem, it follows that P(Ω) convolves with the Fourier transform of the signal xa(t), thus resulting in a continuous-time Fourier transform with spectral duplicates

(2.34)

Xp(Ω) = (1/T) Σ_{k=−∞}^{∞} Xa(Ω − kΩs)

Figure 2.11 Path from sampling to reconstruction (2ΩN = Ωs).

Image

where Ωs = 2πFs. Therefore, the original continuous-time signal xa(t) can be recovered by applying an ideal lowpass analog filter with gain T in the passband |Ω| < Ωs/2 and zero outside this band, the gain T compensating for the 1/T factor in Equation (2.34).

This perspective also leads to a reconstruction formula that interpolates the signal samples with a sinc function. Using the continuous-time version of the Convolution Theorem, application of the ideal lowpass filter of width Ωs corresponds to the convolution of the filter impulse response with the signal-weighted impulse train xp(t). Thus, we have a reconstruction formula given by

xa(t) = Σ_{n=−∞}^{∞} xa(nT) · sin[π(t − nT)/T] / [π(t − nT)/T]

because the function sin(πt/T)/(πt/T) is the inverse Fourier transform of the ideal lowpass filter. As illustrated in Figure 2.11, the cascade of the conversion of the sequence y[n] = x[n] to a continuous-time impulse train (with weights xa(nT)) followed by lowpass filtering is thought of as a discrete-to-continuous (D/C) converter. In practice, however, a digital-to-analog (D/A) converter is used. Unlike the ideal D/C converter, because of quantization error and other forms of distortion, D/A converters do not achieve perfect reconstruction.
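
The reconstruction formula can be simulated directly. The following sketch interpolates samples of a hypothetical 1000 Hz tone taken at Fs = 10000 samples/s over a dense time grid; sinc.m, from the MATLAB Signal Processing Toolbox, computes sin(πu)/(πu):

    Fs = 10000; T = 1/Fs;
    n = 0:49; x = cos(2*pi*1000*n*T);          % samples of a 1 kHz tone
    t = (0:0.05:49) * T;                       % dense grid spanning the samples
    xa = zeros(size(t));
    for m = 1:length(n)                        % sum of sinc-weighted samples
        xa = xa + x(m) * sinc((t - n(m)*T)/T);
    end
    % away from the endpoints, xa closely matches cos(2*pi*1000*t)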

The relation between the Fourier transform of xa(t), Xa(Ω), and the discrete-time Fourier transform of x[n] = xa(nT), X(ω), can now be deduced. When the Sampling Theorem holds, then over the frequency interval [−π, π], X(ω) is a frequency-scaled (or frequency-normalized) version of Xa(Ω). Specifically, over the interval [−π, π] we have

X(ω) = (1/T) Xa(ω/T),   |ω| ≤ π

This relation can be obtained by applying the continuous-time Fourier transform to Equation (2.33) to write Xp(Ω) as

(2.35)

Xp(Ω) = Σ_{n=−∞}^{∞} xa(nT) e^{−jΩTn}

and then comparing this result with the expression for the discrete-time Fourier transform in Equation (2.2); the two expressions coincide under the substitution ω = ΩT [7]. We see that if the sampling is performed exactly at the Nyquist rate, then the normalized frequency π corresponds to the highest frequency in the signal. For example, when FN = 5000 Hz and Fs = 10000 samples/s, π corresponds to 5000 Hz.
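
In terms of the DFT of Section 2.10, this normalization means that bin k of an N-point DFT of the sampled signal corresponds to the analog frequency kFs/N Hz, as in this small sketch with assumed values:

    Fs = 10000; N = 1024;        % assumed sampling rate and DFT length
    k = 0:N/2;                   % bins covering the interval [0, pi]
    F = k * Fs / N;              % corresponding analog frequencies in Hz
    F(end)                       % bin N/2 (i.e., w = pi) -> 5000 Hz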

The entire path, including sampling, application of a discrete-time system, and reconstruction, as well as the frequency relation between signals, is illustrated in Figure 2.11. In this illustration, the sampling frequency equals the Nyquist rate, i.e., 2 ΩN = Ωs. The discrete-time system shown in the figure may be, for example, a digital filter which has been designed in discrete time or derived from sampling an analog filter with some desired properties.

A topic related to the Sampling Theorem is the decrease and increase of the sampling rate, referred to as decimation and interpolation or, alternatively, as downsampling and upsampling, respectively. Changing the sampling rate can be important in speech processing, where one parameter may be deemed slowly varying relative to another; for example, the state of the vocal tract may vary more slowly, and thus have a smaller bandwidth, than the state of the vocal cords. Therefore, different sampling rates may be applied in their estimation, requiring a change in sampling rate in waveform reconstruction. For example, although the speech waveform is sampled at, say, 10000 samples/s, the vocal cord parameters may be sampled at 100 times/s, while the vocal tract parameters are sampled at 50 times/s. The reader should briefly review one of the numerous tutorials on decimation and interpolation [3],[7],[10].
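
As a pointer to those tutorials, the MATLAB Signal Processing Toolbox functions decimate.m and interp.m carry out the two operations, including the required lowpass filtering; a minimal sketch, with a random stand-in for a sampled waveform:

    x  = randn(1, 1000);     % stand-in for a waveform at 10000 samples/s
    xd = decimate(x, 2);     % lowpass filter, then keep every 2nd sample (5000/s)
    xu = interp(xd, 2);      % insert zeros, then lowpass interpolate (10000/s)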

2.11.2 Sampling a System Response

In the previous section, we sampled a continuous-time waveform to obtain discrete-time samples for processing by a digital computer or other discrete-time-based system. We will also have occasion to transform analog systems to discrete-time systems, as, for example, in sampling a continuous-time representation of the vocal tract impulse response, or in the replication of the spectral shape of an analog filter. One approach to this transformation is to simply sample the continuous-time impulse response of the analog system; i.e., we perform the continuous-to-discrete-time mapping

h[n] = ha(nT)

where ha(t) is the analog system impulse response and T is the sampling interval. This method of discrete-time filter design is referred to as the impulse invariance method [7].

Similar to sampling of continuous-time waveforms, the discrete-time Fourier transform of the sequence h[n], H(ω), is related to the continuous-time Fourier transform of ha(t), Ha(Ω), by the relation

H(ω) = (1/T) Ha(ω/T),   |ω| ≤ π

where we assume that ha(t) is bandlimited and that the sampling rate satisfies the Nyquist criterion [7]. The frequency response of the analog system is therefore preserved (up to the amplitude scaling 1/T and the frequency normalization ω = ΩT). It is also of interest to determine how poles and zeros are transformed in going from the continuous- to the discrete-time filter domain as, for example, in transforming a continuous-time vocal tract impulse response. To obtain a flavor for this style of conversion, consider the continuous-time rendition of the IIR filter in Equation (2.25), i.e.,

ha(t) = Σ_{k=1}^{N} Ak e^{sk t} u(t)

whose Laplace transform is given in partial fraction expansion form (the continuous counterpart to Equation (2.18)) [7]

Ha(s) = Σ_{k=1}^{N} Ak / (s − sk)

Then the impulse invariance method results in the discrete-time impulse response

h[n] = ha(nT) = Σ_{k=1}^{N} Ak e^{sk nT} u[n] = Σ_{k=1}^{N} Ak (e^{sk T})^n u[n]

whose z-transform is given by

H(z) = Σ_{k=1}^{N} Ak / (1 − e^{sk T} z^{−1})

with poles at z = e^{sk T}. A pole in the s-plane at s = sk located to the left of the jΩ axis, i.e., Re[sk] < 0, which is the stability condition for a causal continuous-time system, is thus mapped to a pole inside the unit circle in the z-plane, because |e^{sk T}| = e^{Re[sk]T} < 1 when Re[sk] < 0. Although the poles are mapped inside the unit circle, the mapping of the zeros depends on both the resulting poles and the coefficients Ak in the partial fraction expansion. It is conceivable, therefore, that a minimum-phase response may be mapped to a mixed-phase response with zeros outside the unit circle, a consideration that can be particularly important in modeling the vocal tract impulse response.
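
The pole mapping itself is a one-line computation. The following sketch, with a hypothetical s-plane pole, confirms that Re[sk] < 0 places the mapped pole inside the unit circle:

    T  = 1e-4;                    % sampling interval (Fs = 10000 samples/s)
    sk = -500 + 1j*2*pi*1000;     % assumed s-plane pole, Re[sk] < 0
    zk = exp(sk*T);               % impulse-invariance pole mapping z = e^{sk T}
    [abs(zk), exp(real(sk)*T)]    % equal, and < 1: pole lies inside the unit circle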

Other continuous-to-discrete-time conversion methods, e.g., the bilinear transformation [7], will be described as needed throughout the text.

2.11.3 Numerical Simulation of Differential Equations

In a loose sense, the impulse invariance method can be thought of as a numerical simulation of a continuous-time system by a discrete-time system. Suppose that the continuous-time system is represented by a differential equation. Then a discrete-time simulation of this analog system could alternatively be obtained by approximating the derivatives by finite differences; e.g., the derivative dy(t)/dt evaluated at t = nT is approximated by the first backward difference (y[n] − y[n − 1])/T. In mapping the frequency response of the continuous-time system to the unit circle, however, such an approach has been shown to be undesirable due to the need for an exceedingly fast sampling rate, as well as due to the restriction on the nature of the resulting frequency response [6]. Nevertheless, in this text we will have occasion to revisit this approach in a number of contexts, such as in realizing analog models of speech production or analog signal processing tools in discrete time. This becomes especially important when considering differential equations that are not necessarily time-invariant, are possibly coupled, and/or contain a nonlinear element. In these scenarios, approximating derivatives by differences is one solution option, where digital signal processing techniques are applied synergistically with more conventional numerical analysis methods. Other solution options exist, such as the use of a wave digital filter methodology to solve coupled, time-varying, nonlinear differential equations [4].
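
As a concrete instance of the difference approximation, consider the first-order equation dy(t)/dt + a y(t) = x(t). Replacing the derivative by the first backward difference gives (y[n] − y[n − 1])/T + a y[n] = x[n], which can be solved for y[n] at each step. A minimal sketch, with hypothetical values of a and T:

    Fs = 10000; T = 1/Fs; a = 2000;       % assumed sampling rate and coefficient
    x = [1/T, zeros(1, 99)];              % unit-area pulse approximating an impulse
    y = zeros(1, 100);
    y(1) = T*x(1) / (1 + a*T);            % first step, with y[0] = 0 assumed
    for n = 2:100
        y(n) = (y(n-1) + T*x(n)) / (1 + a*T);   % backward-difference update
    end
    % y roughly follows the analog impulse response exp(-a*t) at t = nT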

2.12 Summary

In this chapter we have reviewed the foundation of discrete-time signal processing that will serve as a framework for the remainder of the text. We reviewed discrete-time signals and systems and their Fourier and z-transform representations. A fundamental property of the Fourier transform is the uncertainty principle, which imposes a constraint between the duration and bandwidth of a sequence and thus limits our ability to resolve, simultaneously in time and frequency, dynamically changing events or events closely spaced in time and frequency. We will investigate this limitation in a speech context more fully in Chapters 7 and 11.

In this chapter, we introduced the concepts of minimum- and mixed-phase sequences and looked at important relationships between the magnitude and phase of their Fourier transforms. A property of these sequences, which we will see influences the perception of synthesized speech, is that a minimum-phase sequence is often characterized by a sharper “attack” than that of a mixed-phase counterpart with the same Fourier transform magnitude. Also in this chapter, we briefly reviewed some considerations in obtaining a discrete-time sequence by sampling a continuous-time signal, and also reviewed constraints for representing a sequence from samples of its discrete-time Fourier transform, i.e., from its DFT. Finally, we introduced the notion of time-varying linear systems whose output is not represented by a convolution sum, but rather by a more general superposition of delayed and weighted input values. An important property of time-varying systems is that they do not commute, implying that care must be taken when interchanging their order. The importance of time-varying systems in a speech processing context will become evident as we proceed deeper into the text.

EXERCISES

2.1 Prove the commutativity of discrete-time convolution, i.e., x[n] * h[n] = h[n] * x[n].

2.2 Argue, through the “flip and shift” interpretation of discrete-time convolution, that a necessary and sufficient condition for causality of an LTI system is that the system impulse response be zero for n < 0.

2.3 In this exercise, you are asked to consider a number of properties of the Fourier transform.

(a) Prove the 2π periodicity of the Fourier transform.

(b) Prove the conjugate symmetry of the Fourier transform for a real sequence x[n]. Then, using this property, show the following resulting properties: (1) the magnitude and real part of X(ω) are even, i.e., |X(ω)| = |X(−ω)| and Xr(ω) = Xr(−ω); (2) its phase and imaginary part are odd, i.e., ∠X(ω) = −∠X(−ω) and Xi(ω) = −Xi(−ω).

(c) Prove Parseval’s Theorem. Hint: In Equation (2.3), write |X(ω)|² = X(ω)X*(ω) and substitute the Fourier transform expression for X(ω).

2.4 In this exercise, you explore the notion of duration and bandwidth of a sequence. Assume that the signal energy is unity.

(a) Show that the average frequency location Image of a real sequence x[n] is zero. Compute the average time location Image of an even complex sequence.

(b) Determine the bandwidth Image of the sequences with Fourier transforms in Figure 2.12.

Figure 2.12 Fourier transforms with different bandwidths: (a) complex sequence; (b) real sequence.

Image

2.5 In this exercise, you are asked to prove the uncertainty principle in Equation (2.5).

(a) Show that if x[n] has Fourier transform X(ω), then differentiating in frequency results in the pair

Image

(b) Using your result from part (a), derive Equation (2.6) using the Schwartz inequality: Image.

(c) Complete the derivation of the uncertainty principle. Hint: Write the Fourier transform X(ω) in terms of its magnitude and phase, integrate, and then determine terms that are always positive.

2.6 In this exercise, you are asked to investigate the analytic signal representation.

(a) Show that the real part of the analytic signal representation of x[n] in Equation (2.7) is given by sr[n] = x[n]/2. Show that the imaginary part si[n] is obtained by applying the Hilbert transformer

Image

to sr[n] = x[n]/2 as in Equation (2.8).

(b) Find the output of the Hilbert transformer when the input sequence is given by

x[n] = cos(ω0n)

where 0 < ω0 < π.

(c) Find the magnitude and phase of the analytic signal

s[n] = sr[n] + jsi[n]

for the sequence x[n] of part (b).

(d) Repeat parts (b) and (c) for the input sequence

Image

where 0 < ω0 − ω1 < π and 0 < ω0 + ω1 < π.

2.7 In this exercise, you are asked to find the condition under which the analytic and quadrature representations of the real signal Image are equivalent. We write the quadrature and analytic signal representations, respectively, as

Image

and

Image

(a) Show that

Image

and therefore

Image

(b) Let Sq(ω) and Sa(ω) denote the Fourier transforms of the quadrature and analytic signals, respectively. Show that

Image

(c) Using the following mean-squared difference as a criterion of closeness between the quadrature and analytic signals:

Image

(the factor of 2 is needed to account for removing negative frequencies in obtaining the analytic signal), and using Parseval’s Theorem and your result from part (b), show that

Image

which is twice the energy of the quadrature signal for negative frequencies.

(d) Based on the closeness criterion of part (c), state whether the analytic and quadrature representations are equivalent for each of the following two signals:

Image

where 0 ≤ ω < π. Explain your reasoning.

2.8 Consider a complex continuous-time signal of the form Image. Show that the bandwidth of s(t), as defined in Equation (2.4), can be expressed as

(2.36)

Image

where Image is the signal’s average frequency, i.e., Image. Hint: Use the relation Image.

2.9 The autocorrelation function for a real-valued stable sequence x[n] is defined as

Image

(a) Show that the z-transform of cxx[n] is

Cxx(z) = X(z)X(z⁻¹).

Determine the region of convergence for Cxx(z).

(b) Suppose that x[n] = aⁿu[n]. Sketch the pole-zero plot for Cxx(z), including the region of convergence. Also find cxx[n] by evaluating the inverse z-transform of Cxx(z).

(c) Specify another sequence, x1[n], that is not equal to x[n] in part (b) but that has the same autocorrelation function, cxx[n], as x[n] in part (b).

2.10 Show that the magnitude of the all-pass transfer function Image is unity on the unit circle. Show that the phase of this all-pass function is negative on the unit circle.

2.11 We saw in the text that replacing a zero (or pole) by its conjugate reciprocal counterpart does not alter the magnitude of its Fourier transform. In this exercise, we explore the change in Fourier transform phase with such operations.

(a) Consider a minimum-phase zero corresponding to the component Q(z) = 1 − qz⁻¹, with |q| < 1, of the rational z-transform X(z). Denote the phase of Q(z) along the unit circle by θ1(ω) and write the polar representation of X(z) along the unit circle as X(ω) = A(ω)e^{jθ(ω)}. Give an expression for the phase of the new z-transform when the zero is flipped to its conjugate reciprocal location.

(b) Suppose now that the zero in part (a) is flipped to its conjugate reciprocal pole location, i.e., the factor (1 − qz⁻¹) becomes 1/(1 − q*z). Show that the phase of the resulting z-transform is unaltered.

2.12 Use the eigenfunction property and superposition principle of linear systems to derive the Convolution Theorem. Then argue informally for the Windowing Theorem using the fact that the discrete-time Fourier transform and its inverse have similar structures. Specifically, the discrete-time Fourier transform differs from its inverse by a scale factor, a sign change in the exponent of the complex exponential, and by a sum over all time rather than a circular integration over frequency.

2.13 Suppose

Image

Show that the Fourier transform of x[n] is given by

Image

Hint: Work backward and observe that the inverse Fourier transform of X(ω) can be expressed as Image.

2.14 By imposing on an FIR impulse response “symmetry” of the form

h[n] = h[M − n]

show under the simplifying assumption that M is even that

H(ω) = A(ω)e^{−jω(M/2)}

where A(ω) is real, and therefore the frequency response has linear phase.

2.15 Show that the output y[n] of a linear time-varying system with response h[n, m] is the inverse Fourier transform of the product of X(ω) and H(n, ω), as in Equation (2.31).

2.16 In this exercise, you show that the output expression Equation (2.26) for a linear time-varying system reduces to a convolution when the system is time-invariant.

(a) Argue that Green’s function g[n, m] and the time-varying unit sample response h[n, m] are related by h[n, m] = g[n, n − m] or, equivalently, h[n, n − m] = g[n, m].

(b) From part (a), derive Equation (2.27).

(c) Argue that if the system is time-invariant, then Green’s function g[n, m] depends only on the difference n − m, which is the number of samples between the application of the unit sample input at time m and the output at time n, i.e., g[n, m] = g[n − m].

(d) From your results in parts (a)–(c), show that when the system is time-invariant h[n, m] = h[m], and thus the output of the linear time-varying system in Equation (2.27) reduces to

y[n] = Σ_{m=−∞}^{∞} h[m] x[n − m] = x[n] * h[n]

2.17 Derive Equation (2.30), i.e., show that the frequency response of a linear time-varying system can be written in terms of Green’s function as

H(n, ω) = Σ_{m=−∞}^{∞} g[n, m] e^{−jω(n − m)}

2.18 Consider an input sequence x[n] which is processed by an LTI system with impulse response h[n] to yield an output sequence y[n]. Let X(ω), H(ω), and Y(ω) denote the Fourier transforms of the input sequence, system impulse response, and output sequence, respectively. That is,

Y(ω) = X(ω)H(ω)

where the system frequency response is given by

H(ω) = A(ω) / (1 − C(ω))

(a) We can estimate Y(ω) by formulating an iteration of the form

Yk(ω) = A(ω)X(ω) + C(ω)Yk−1(ω)

where

Y1(ω) = 0.

Derive a closed-form expression for Yk(ω).

(b) Show that if

|C(ω)| < 1,

then Yk(ω) converges to Y(ω) for each ω, as k increases. That is, show

Image

Hint: Use a geometric series.

(c) Propose a similar iterative scheme for recovering X (ω) from Y(ω). Discuss briefly why such iterative methods are useful in overcoming practical problems in recovering convolutionally distorted signals.

2.19 In Exercise 2.6, you found the analytic signal representation of the sequence

Image

which allowed you to compute the magnitude (which can be thought of as the “envelope” of x[n]) a[n] = sin(ω1n)/(πn) and also the phase Image. In this exercise, you are asked to stretch out x[n] along the time axis without changing the sinewave frequency ω0. The approach of this exercise is similar in style to that used later in the text in modifying the articulation rate of a speaker, i.e., in making a speaker sound as though he/she is talking faster or slower. This operation is also referred to as time-scale modification.

(a) Design an interpolation scheme to stretch out the envelope of x[n] along the time axis by a factor of two. The interpolation will first require inserting a zero between each pair of adjacent samples of a[n].

(b) Using the phase function from Exercise 2.6, design an interpolation scheme to stretch the phase by two in time while maintaining the underlying sinewave frequency. This will involve a conventional interpolator followed by multiplicative scaling.

(c) Use the results of parts (a) and (b) to obtain the time-scaled signal

Image

2.20 (MATLAB) In this MATLAB exercise, you investigate some properties of a windowed speech waveform. On the companion website, you will find in the directory Chap exercises/chapter2 a workspace ex2M1.mat. Load this workspace and plot the speech waveform labeled speech1_10k. This speech segment was taken from a vowel sound that is approximately periodic (sometimes referred to as “quasi-periodic”), is 25 ms in duration, and was sampled at 10000 samples/s.

(a) Plot the log-magnitude of the Fourier transform of the signal over the interval [0, π], using a 1024-point FFT. The signal should be windowed with a Hamming window of two different durations, 25 ms and 10 ms, with the window placed, in each case, at the signal’s center. Show the log-magnitude plot for each duration. In doing this exercise, use MATLAB functions fft.m and hamming.m.

(b) From the speech waveform, estimate the period in seconds of the quasi-periodic waveform. As we will see in Chapter 3, the pitch is the reciprocal of the period and equals the rate of vocal cord vibration. In discrete time, the pitch of the waveform is given (in radians) by ω0 = 2π/P, where P is the period in samples. From Example 2.10, we see that for a discrete-time periodic impulse train, the pitch is the spacing between window transform replicas in frequency. Using your result from part (a) and Example 2.10, argue which spectral estimate from part (a), i.e., from a 25 ms or from a 10 ms window, is more suitable for pitch estimation. Propose a method of pitch estimation using the log-magnitude of the Fourier transform.

2.21 (MATLAB) In this MATLAB exercise, you will listen to various versions of a bandpass-filtered speech waveform. On the companion website, you will find in the directory Chap exercises/chapter2 a workspace ex2M2.mat. Load this workspace and plot the speech waveforms labeled speech2_10k and speech3_10k. These speech passages were sampled at 10000 samples/s.

(a) Use the MATLAB function remez.m to design a lowpass filter, i.e.,

bl[n] = remez(102, [0 .5 .6 1], [1 1 0 0]).

Plot the log-magnitude of its Fourier transform over the interval [0, π], using a 1024-point FFT. In doing this exercise, use MATLAB function fft.m.

(b) Normalize bl[n] by the sum of its values and then design a highpass filter as

bh[n] = δ[n − 52] − bl[n].

Plot the log-magnitude of the Fourier transform of bh[n] over the interval [0, π], using a 1024-point FFT. In doing this exercise, use MATLAB function fft.m and the script high_low.m provided in the directory Chap exercises/chapter2.

(c) Using the MATLAB function conv.m, filter the speech waveforms speech2_10k and speech3_10k with the lowpass and highpass filters you designed in parts (a) and (b).

(d) It is sometimes said that speech possesses redundant information in frequency, i.e., the same information may appear in different frequency bands. Using the MATLAB function sound.m, listen to the lowpass- and highpass-filtered speech from part (c) and describe its intelligibility. Comment on the notion of spectral redundancy.

BIBLIOGRAPHY

[1] M.A. Clements and J.W. Pease, “On Causal Linear Phase IIR Digital Filters,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 37, no. 4, pp. 479–485, April 1989.

[2] L. Cohen, Time-Frequency Analysis, Prentice Hall, Englewood Cliffs, NJ, 1995.

[3] R.E. Crochiere and L.R. Rabiner, Multirate Digital Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1983.

[4] A. Fettweis, “Wave Digital Filters: Theory and Practice,” Proc. IEEE, vol. 74, no. 2, pp. 270–327, 1986.

[5] G. Matz and F. Hlawatsch, “Time-Frequency Transfer Function Calculus (Symbolic Calculus) of Linear Time-Varying Systems (Linear Operators) Based on a Generalized Underspread Theory,” J. of Mathematical Physics, vol. 39, no. 8, pp. 4041–4070, Aug. 1998.

[6] A.V. Oppenheim and R.W. Schafer, Digital Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1975.

[7] A.V. Oppenheim and R.W. Schafer, Discrete-Time Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1989.

[8] M.R. Portnoff, “Time-Frequency Representation of Digital Signals and Systems Based on Short-Time Fourier Analysis,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–28, no. 1, pp. 55–69, Feb. 1980.

[9] T.F. Quatieri and A.V. Oppenheim, “Iterative Techniques for Minimum Phase Signal Reconstruction from Phase or Magnitude,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP–29, no. 6, pp. 1187–1193, Dec. 1981.

[10] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, NJ, 1978.
