Chapter 4
Acoustics of Speech Production

4.1 Introduction

In the previous chapter, we used the term “sound” without a formal definition. In this chapter, we give a more quantitative description of the phenomenon of sound through the wave equation. The wave equation, derived in Section 4.2, is a linear partial differential equation whose solution describes pressure and velocity of air particles in a sound wave as a function of time and space, and provides a means to approximately describe the acoustics of speech production. In arriving at the wave equation, we remove nonlinear contributions to the relationship between particle pressure and velocity. This simplification, although valid under a small particle velocity assumption, prohibits rotational flows, such as vortices, alluded to in Chapter 3 and further described in Chapter 11.

In Section 4.3, we gain intuition for sound in the vocal tract by using a simple lossless uniform tube model. This model is then extended to a nonuniform tube with energy loss, including vocal tract wall vibration, viscosity, and thermal conduction of air particles, and radiation loss at the lips. These losses are represented by partial differential equations coupled to the wave equation, and the full equation set is solved by numerical simulation to obtain airflow velocity and pressure at the lips output. Although solution by numerical simulation gives insight into properties of the speech waveform, it does not result in a closed-form transfer function from the glottis to the lips. In Section 4.4, an alternative to numerical simulation is presented in which the vocal tract is approximated by a concatenation of acoustic tubes, each of small length and of uniform cross-section. With this concatenated acoustic tube model and the wave equation, we derive a transfer function of air particle velocity from the glottis to the lips. This transfer function is first derived in continuous time and is then converted to a discrete-time form using the impulse invariance method of Chapter 2.

The notion of a vocal tract transfer function assumes that the vocal tract is linear and time-invariant. More accurate models invoke a time-varying vocal tract and nonlinear coupling between the vocal tract input, i.e., the glottal airflow velocity, and the pressure in the vocal tract resonant cavity. In Section 4.5, a simplified model of the nonlinear interaction between the glottal source and a time-varying vocal tract is described. Numerical simulation of the model shows certain time- and frequency-domain properties of speech not seen with a linear model, an important example being the “truncation” effect in speech production where a rapid bandwidth increase in vocal tract formants during glottal opening corresponds to an abrupt decay of the speech waveform within a glottal cycle.

4.2 Physics of Sound

4.2.1 Basics

The generation of sound involves the vibration of particles in a medium. In speech production, air particles are perturbed near the lips, and this perturbation moves as a chain reaction through free space to the listener; the perturbation of air molecules ends in the listener’s ear canal, vibrating the ear drum and initiating a series of transductions of this mechanical vibration to neural firing patterns that are ultimately perceived by the brain. The mechanism of the chain reaction in sound generation can be illustrated by a simple analogy. Consider a set of pool balls configured in a row on a pool table. The striking of the cue ball sets up a chain reaction whereby each ball in the series is struck until the last ball reaches the intended pool-table pocket. Each ball has a “particle velocity” (m/s) and there is a “pressure” (newtons/m2) felt by each ball due to the interaction between balls. After the first ball is struck, a pressure increase is felt successively by each ball, as well as a change in velocity. As the balls move closer together, we say they are in a compression state which occurs locally in space, i.e., in the vicinity of each ball. On the other hand, if the pool stick were to move fast enough away from the first ball, we can stretch our imagination to envision a vacuum set up behind the first ball, thus moving each ball successively in the other direction, creating a local decrease in pressure. For this case, the balls move farther apart which is referred to as a rarefaction state near each ball. If we move the pool stick sinusoidally, and fast enough, we can imagine creating a traveling wave of local compression and rarefaction fluctuations.

A deficiency in this pool-ball analogy is that there is no springiness between the balls as there is between air particles. Air particles stick together and forces exist between them. To refine our analogy, we thus connect the pool balls with springs. Again we have two states: (1) compression (pushing in the springs between balls) and (2) rarefaction (pulling out the springs between balls). Also, as before, a traveling wave of compression and rarefaction fluctuations is created by pushing and pulling the springs. Observe that the pressure and velocity changes are not permanent; i.e., the balls are not permanently displaced, and this is why we refer to the changes in pressure and velocity as local fluctuations in compression and rarefaction.

In describing sound propagation, we replace the pool balls with air molecules and the springs with forces between the air molecules; in Figure 4.1 we depict these forces with little springs. As with the pool balls with connecting springs, air molecules are locally displaced, they have a particle velocity, and there is pressure built up among the molecules due to the springiness and collisions that occur with themselves.1

1 Pressure = force/unit area = time rate of change of momentum/unit area [2].

Figure 4.1 Compression and rarefaction of air particles in front of an infinitely large wall: (a) illustration of springiness among air particles; (b) compression; (c) rarefaction.

Image

Consider now placing a moving wall in front of the air molecules, and suppose that the wall is infinitely large in the y, z direction (z coming out of the page). As with the balls attached by springs, by moving the wall to the right or to the left, we create a chain reaction of compression or rarefaction, respectively, and, with a sinusoidal movement of the wall, we create a traveling wave of compression and rarefaction fluctuations. When the wall is pushed to the right, the molecules in front of the wall are compressed and collide with one another and build up pressure. The pressure increase in front of the wall moves as a chain reaction to the right. When pulled back, the piston creates a region of rarefaction in front of the wall. The local decrease in pressure travels as a chain reaction to the left. When moved sinusoidally, the wall generates a traveling wave of rarefaction and compression fluctuations characterized by local sinusoidal changes in particle pressure. In addition, local changes in particle velocity and displacement accompany pressure changes. Here the sound travels as a plane wave longitudinally along the x axis of Figure 4.1 [2]. This is in contrast to perturbations of air particles in other forms, i.e., other than the longitudinal flow wave of the compression-rarefaction type, such as rotational or jet flow [14].

We are now in a position to give a formal definition of a sound wave.

2 Sound waves can also propagate in, for example, solid and liquid media.

A pressure disturbance will reside about some ambient (atmospheric) pressure. The pressure variation at the ear causes the ear’s diaphragm to vibrate, thus allowing the listener to “perceive” the sound wave. For a sinusoidal disturbance, with the sinusoidally-varying wall, there are a number of important properties that characterize the sound wave (Figure 4.2).

The wavelength is the distance between two consecutive peak compressions (or rarefactions) in space (not in time), and is denoted by λ. If you take a snapshot at a particular time instant, you will see a sinusoidally-varying pressure or velocity pattern whose crests are separated by the wavelength. Alternately, the wavelength is the distance the wave travels in one cycle of the vibration of air particles. The frequency of the sound wave, denoted by ƒ, is the number of cycles of compression (or rarefaction) of air particle vibration per second. If we were to place a probe in the air medium, we would see a change in pressure or in particle velocity of ƒ cycles in one second. Therefore, the wave travels a distance of ƒ wavelengths in one second. The velocity of sound, denoted by c, is therefore given by c = ƒ λ. Because the speed of sound c = ƒ λ and observing that the radian frequency Ω = 2π ƒ, then Ω/c = 2π/λ. We denote 2π/λ by k, which we call the wavenumber. At sea level and 70°F, c = 344 m/s; this value is approximate because the velocity of sound varies with temperature. The velocity of sound is distinctly different from particle velocity; the former describes the movement of the traveling compression and rarefaction through the medium, and the latter describes the local movement of the particles which, on the average, go nowhere (as we saw earlier in the pool-ball analogy).

Figure 4.2 Traveling sinusoidal wave characterized by wavelength λ and frequency ƒ.

Image

Example 4.1       Suppose the frequency of a sound wave ƒ = 50 cycles/s (or Hertz, denoted by Hz) and assume that the velocity of sound at sea level is c = 344 m/s. Then the wavelength of the sound wave λ = 6.88 m. For ƒ = 1000 Hz, λ = 0.344 m, while for ƒ = 10000 cycles/s, λ = 0.0344 m. This wide range of wavelengths occurs in the propagation of speech sounds.Image

In describing sound propagation, it is important to distinguish pressure change due to variations that occur rapidly or slowly [2]. An isothermal process is a slow variation that stays at constant temperature. Isothermal compression of a gas results in an increase in pressure because a given number of molecules are forced into a smaller volume and will necessarily collide with each other (and the boundaries of an enclosure) more often, thus increasing the time rate of change of momentum. However, because the variation is slow, there is time for the heat generated by collisions to flow to other parts of the air medium; thus the temperature remains constant. On the other hand, an adiabatic process is a fast variation in which there is no time for heat to flow away and thus there occurs a temperature change in the medium. The heat generated causes an even greater number of collisions. The molecules get hotter faster and collide with each other (or the boundaries of an enclosure) more frequently. They have greater momentum (they are moving faster) and thus transfer more momentum to each other during collisions. It follows that an adiabatic gas is “stiffer” (than an isothermal gas) since it takes more force to expand or compress it.3 For most of the audible range, a sound wave is an adiabatic process,4 a property that we use in deriving the wave equation.

3 Think about a bicycle pump. If you move the pump handle very quickly, the temperature rises and the local heat energy increases, not having time to distribute itself within the pump. As you pump faster, the hot air becomes stiffer, i.e., the gas is adiabatic. On the other hand, if you move the handle slower, then the instantaneous temperature rise has time to distribute itself and there is negligible average temperature increase; the pumping thus becomes easier. This is an isothermal process.

4 This result is determined by measuring the relative speed of a thermal and acoustic wave. At very low frequency, the adiabatic condition may break down for sound waves.

4.2.2 The Wave Equation

Imagine a small cube of air particles sitting in space with volume ΔxΔyΔz and which is characterized by mass, pressure, and velocity. As before, a vibrating wall lies to the left of the cube and is infinite in extent (Figure 4.3), so all quantities are one-dimensional (1-D), i.e., propagating waves are planar with no change in the y or z direction. The pressure within the cube is a function of both time and space and is denoted by p(x, t), and fluctuates about an ambient or average (atmospheric) pressure Po. The pressure is the amount of force acting over a unit area of surface and is measured in newtons/m2 and increases with the number of particle collisions and change in particle momentum. Therefore, p(x, t) is an incremental pressure, the total pressure being expressed as Po + p(x, t). Henceforth, the average pressure po will be dropped because the wave equation requires only incremental pressure. The atmospheric pressure is typically po = 105 newtons/m2, while the threshold of hearing, i.e., the minimum perceivable pressure above atmospheric pressure, is about p(x, t) = 2(10−5) newtons/m2 at 1000 Hz (while the threshold of pain is about 20 newtons/m2) [23]. The ear is thus extremely sensitive to pressure changes. The particle velocity is the rate of change in the location of an air particle and fluctuates about zero average velocity. The particle velocity is measured in m/s and is denoted by υ(x, t). The density of air particles, denoted by ρ(x, t), is the mass per unit volume and is measured in kg/m3 around an average density ρo, the total density being ρo + ρ(x, t).

Figure 4.3 Cube of air in front of an infinite vibrating wall. Cube configuration is shown with first-order pressure change across the cube. Arrows indicate forces on two vertical surfaces.

Image

There are three laws of physics used to obtain the desired relations between pressure and velocity of the air particles within the cube [2]. Newton’s Second Law of Motion states that the total force on the cube is the mass times the acceleration of the cube and is written as F = ma. This law predicts that a constant applied force produces a constant acceleration of the cube. The Gas Law, from thermodynamics, relates pressure, volume, and temperature and, under the adiabatic condition—the case of interest for speech sound propagation—reduces to the relation PVγ = C, where P is the total pressure on the cube of volume V, C is a constant, and γ = 1.4 is the ratio of the specific heat of air at constant pressure to the specific heat of air at constant volume [2]. Given that the cube is assumed deformable, this formula predicts that if the pressure increases, then the volume of the cube decreases. The third law is the Conservation of Mass: The total mass of gas inside the deformable cube must remain fixed. This law states that a pressure differential can deform the cube, but the total number of particles in the deformable cube remains constant. The moving boundaries of the cube are defined by the original particles. The local density may change, but the total mass does not change. Using these three laws, we can derive the wave equation. We do not give the complete derivation of the wave equation, but rather give a flavor of one approach to its derivation.

The first step in deriving the wave equation relies on Newton’s Second Law of Motion. Three assumptions are made that result in equations that are linear and computationally tractable [2],[27]. First, we assume there is negligible friction of air particles in the cube with those outside the cube, i.e., there is no shearing pressure due to horizontal movement of the air. We refer to this shearing pressure as viscosity. Therefore, the pressure on the cube is due to only forces on the two vertical faces of the cube, as illustrated in Figure 4.3. Our second assumption is that the cube of air is small enough so that the pressure change across the cube in the horizontal dimension (Δx) is of “first order,” corresponding to sounds of not extremely large intensity.

This means that the second- and higher-order terms in a Taylor series expansion of the pressure function with respect to the x argument can be neglected, resulting in

Image

where Image is the rate at which the pressure increases in the x direction (left to right). Henceforth, we drop the (x, t) notation unless explicitly needed. The third assumption is that the density of air particles is constant in the cube and equal to the average atmospheric density, i.e., ρ(x, t) + ρo = ρo, which we will denote by simply ρ.

Because the pressure is force divided by the surface area of the vertical face of the cube, the net force in the x direction on the cube is the pressure difference across the cube multiplied by the surface area:

Image

We have assumed the density in the cube to be constant. Therefore, the mass m of the cube is given by

m = ρΔxΔyΔz.

The acceleration of the cube of air is given by Image, where υ(x, t) denotes the velocity of the particles in the cube.5 Finally, from Newton’s Second Law of Motion, F = ma, we have the net force acting on the cube of air as

5 Although the velocity of particles can change across the tube, we let υ(x, t) denote the average velocity in the cube, which becomes exact in the limit as Δx → 0 [2].

Image

and cancelling terms we have

(4.1)

Image

which is accurate “in the limit,” i.e., as Δx gets very small the approximations become more accurate because the differential pressure becomes “1st order” and the density constant.

The reader might have observed a slight-of-hand in arriving at the final form in Equation (4.1); the total derivative Image in Newton’s Second Law was replaced with the partial derivative Image. Because υ(x, t) is a function of space x as well as time t, the true acceleration of the air particles is given by [19]

(4.2)

Image

and therefore Equation (4.1) is correctly written as

(4.3)

Image

which is a nonlinear equation in the variable υ because the particle velocity υ multiplies Image. Consequently, it is difficult to determine a general solution. The approximation Equation (4.1) is accurate when the correction term Image is small relative to Image. Typically, in speech production it is assumed that the particle velocity is very small and therefore the correction term introduces only second-order effects.6 This approximation, in vector form, rules out rotational or jet flow[19], and thus the possibility of vortices along the oral cavity, as alluded to in Chapter 3 and further described in Chapter 11.

6 Portnoff [26] argues that when the DC component of the particle velocity υ, i.e., the steady net flow component to υ, is small relative to the speed of sound, then the correction term is negligible. The conditions under which the steady flow component of particle velocity through the vocal folds and within the vocal tract is small relative to the speed of sound are discussed in Chapter 11.

Equation (4.1) takes us halfway to the wave equation. Completing the derivation requires the use of the Gas Law and Conservation of Mass principle which can be shown to result in the relation [2]

(4.4)

Image

where c is the velocity of sound and ρ is the air density assumed to be constant.7 The pair of equations Equation (4.1) and Equation (4.4) represents one form of the wave equation. A second form is obtained by differentiating Equations (4.1) and (4.4) by x and t, respectively:

7 As with Equation (4.1), the correct form of Equation (4.4) involves a nonlinear correction term and can be shown to be of the form Image[19]. As before, the correction term is negligible if we assume the velocity multiplier is small.

Image

which, when combined to eliminate the mixed partials, can be written as a second-order partial differential equation in pressure only:

(4.5)

Image

Likewise, the above equation pair can be alternatively combined to form the second-order partial differential equation in velocity only:

(4.6)

Image

The alternate forms of the wave equation given by Equations (4.5) and (4.6) describe the pressure and velocity of air particles, respectively, as a function of position and time. In summary, the two different wave equation pairs are approximately valid under the following assumptions: (1) The medium is homogeneous (constant density), (2) The pressure change across a small distance can be linearized, (3) There is no viscosity of air particles, (4) The air particle velocity is small (implying that the full derivative in Equation (4.3) is not necessary), and (5) Sound is adiabatic.

Our goal is to predict the speech sound pressure that we measure at the output of the lips. This must account for various sound sources, time and space variations of the vocal tract, nasal coupling, radiation at the lips, and energy losses. There are also other effects such as nonlinear coupling between airflow through the glottis and pressure in the vocal tract, and rotational flows that may arise, for example, with particle shearing pressure or when the particle velocity is not small enough to make nonlinear correction terms negligible. A formal theory of sound propagation for the complete configuration with all such considerations does not exist. Nevertheless, we can gain much insight into speech sound production with simplified analog models.

4.3 Uniform Tube Model

4.3.1 Lossless Case

In this section, following the development of Rabiner and Schafer [28], we use the wave equation to derive pressure and velocity relations in the uniform lossless tube of Figure 4.4. The uniform tube approximates an oral cavity with a roughly constant cross-section, the moving piston provides a model for glottal airflow velocity, and the open tube end represents the open lips.8 This simple configuration approximates the vocal tract shape for the vowel /A/, as in the word “up.” The tube has a time- and space-invariant cross-section A(x, t) = A and planar sound waves are assumed to propagate longitudinally along the x axis. We assume, for the moment, that there is no friction along the walls of the tube. In reality, there is friction and the velocity is maximum at the center of the tube because the (minutely) corrugated surface of the tube will stop the wave and give zero velocity along the walls. The tube is open on the right side at x = l, l being the tube length. Recalling that p(x, t) represents the incremental pressure about the average (atmospheric) pressure, at the open end of the tube, we assume there are no variations in air pressure, i.e., p(x = l, t) = 0, but there are variations in particle velocity. This open-tube configuration is analogous to a short-circuit in electrical circuits. On the left end of the tube, the moving piston provides an ideal particle velocity source υ(x, t), i.e., the piston moves independently of pressure. This is analogous to an ideal current source in electrical circuits. We shortly further describe and refine these acoustic/electric analogies.

8 The bend in the vocal tract must also be taken into account. Sondhi [29] has found, however, that the curvature changes the location of resonances by only a few percent.

Figure 4.4 Uniform lossless tube. The piston provides an ideal airflow velocity source, and the pressure at the end of the tube is assumed zero.

Image

In the case of one-dimensional (planar) sound propagation in confined areas, it is convenient to use the velocity of a volume of air rather than particle velocity. We denote volume velocity by u(x, t) and define it as the rate of flow of air particles perpendicularly through a specified area. Therefore, the relation between volume velocity and particle velocity for our uniform tube is given by u(x, t) = (x, t). We can then rewrite the various forms of the wave equation in terms of volume velocity using the relation υ(x, t) = u(x, t)/A. In particular, the second-order differential equation, Equation pair (4.5), (4.6), remains intact except for a replacement of υ(x, t) by u(x, t), while the Equation pair (4.1), (4.4) becomes

(4.7)

Image

We can show that two solutions of the following form (Exercise 4.1) satisfy Equation (4.7):

(4.8)

Image

which represent velocity and pressure, each of which is comprised of a forward- and backward-traveling wave (Figure 4.5), denoted by the superscript + and −, respectively. To see the forward-backward traveling wave property of Equation (4.8), consider a velocity wave at time t0 and time t1. To show that the wave travels a distance c(t1t0), consider the forward going wave

Figure 4.5 Forward-traveling velocity waves. The quantity c(t1t0) is the distance moved in the time t1t0, where t1 > t0 and where c is the speed of sound. Backward-traveling waves are similar, but moving toward the left.

Image

Image

where x0 = (t1t0)c. Thus, the wave at time t1 is the wave at time t0 shifted in space by x0to the right. A similar argument can be made for the backward traveling wave. Therefore, the forward and backward waves propagate without change in shape.

It is instructive to observe that the wave equation for plane-wave sound propagation is similar to that for plane-wave propagation along an electrical transmission line, where voltage and current are analogous to pressure and volume velocity [2],[28]. The voltage e(x, t) and current i(x, t) for a lossless transmission line are related by the coupled equations

(4.9)

Image

where L is inductance and C is capacitance per unit length of the transmission line. Figure 4.6 illustrates the acoustic and electrical analogies where Image, defined as the acoustic inductance, is analogous to the electrical inductance L, and Image, defined as the acoustic capacitance, is analogous to the electrical capacitance C. These quantities reflect the “inertia” (mass or density) and the “springiness” (elasticity) of the air medium, respectively. In the context of electrical transmission lines, steady-state solutions are often written in the frequency domain using a complex sinewave representation. For example, a sinusoidal current wave of frequency Ω, traveling forward at speed c in the transmission line medium, is written as i(x, t) = I(x, Ω)ejΩ(tx/c). Motivated by this electrical/acoustic analogy, we will use similar representations in describing pressure and volume velocity in acoustic propagation.

We have seen above the general form of the solution for a uniform-tube configuration. In order to determine a specific solution, we observe that at the boundary x = 0 the volume velocity is determined by the sinusoidally vibrating piston, i.e., u(0, t) = U(Ω)ejΩt, where by the above convention we use the complex form of the real driving velocity Re[U(Ω)ejΩt]. This is analogous to a sinusoidal current source for a transmission line. We write the source as a function of frequency because we are ultimately interested in the frequency response of the uniform tube. Because this volume velocity represents the glottal airflow velocity, we also write this boundary condition as u(0, t) = ug(t) = ug(Ω)ejΩt. The second boundary condition is that the pressure is zero at the tube termination (the lips), i.e., p(l, t) = 0, which is analogous to placing a short circuit at the end of a transmission line. We will see later in this section that neither of these boundary conditions is accurate in practice with a real oral tract, i.e., the glottal airflow velocity source is not ideal, and there is energy loss due to radiation from the lips, resulting in a nonzero pressure drop at the lips.

Figure 4.6 Acoustic tube/electric transmission line analogies. Inductance and capacitance represent quantities in per-unit length.

SOURCE: L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals [28]. ©1978, Pearson Education, Inc. Used by permission.

Image

We have already seen the general solution to the wave equation in Equation (4.8). Because our velocity driving function is sinusoidal,9 we assume the specific velocity and pressure wave solutions to be of the form

9 Observe that the vibrating piston can take on a more general form which, from the Fourier transform pair for continuous-time signals, can be written as a sum (integral) of complex exponentials of the form U(Ω)ejΩt, i.e., Image When the uniform tube configuration is linear, the solution can be written as a sum of the individual solutions to the differential components U(Ω)ejΩtd Ω. With the boundary condition at the tube termination of zero pressure p(l, t) = 0, the uniform tube can be shown to be linear (Exercise 4.2).

(4.10)

Image

which are sinewaves traveling forward and backward in time and where k+ and k represent the amplitudes of the forward- and backward-going waves, respectively.

The next objective, given our two boundary conditions, is to solve for the unknown constants k+ and k. The boundary condition at the source gives

u(0, t) = k+ejΩtkejΩt = ug(Ω)ejΩt

resulting in

(4.11)

Image

Likewise, the boundary condition at the open end gives

Image

yielding

(4.12)

Image

Solving simultaneously for the two expressions in k+ and k, from Equation (4.11), k = −ug(Ω) + k+, which we substitute into Equation (4.12) to obtain

k+ejΩl/c − [ug(Ω) − k+]ejΩl/c = 0,

which is rewritten as

(4.13)

Image

Likewise, from Equation (4.11), k+ = ug(Ω) + k, which we substitute into Equation (4.12) to obtain

[ug(Ω) + k]ejΩl/c + kejΩl/c = 0,

which is rewritten as

(4.14)

Image

Substituting Equations (4.13) and (4.14) into the Equation pair (4.10), we obtain

(4.15)

Image

which expresses the relationship between the sinusoidal volume velocity source and the pressure and volume velocity at any point in the tube. The particular solution is referred to as standing waves because the forward- and backward-traveling waves have added to make the wave shape stationary in time; the difference Image or the sum Image no longer appears as an argument in the complex exponential function. Since the volume velocity and pressure solutions are in complex form, we can think of the cosine and sine multiplier functions as the envelope of the volume velocity and pressure functions along the tube, given respectively by Image and Image. We see that the two envelope functions are 90° out of phase (Figure 4.7), and so velocity and pressure are “orthogonal” in space. The two functions are also orthogonal in time because of the imaginary j multiplier of the pressure function. We can see this relation by writing the real counterpart of Equation (4.15):

Figure 4.7 Envelopes of pressure and volume velocity standing waves resulting from sound propagation in a uniform tube with zero pressure termination.

Image

(4.16)

Image

which reveals that at any time instant the velocity and pressure variations are 90° out of phase. An interesting consequence of this phase relation is that the acoustic kinetic and potential energies along the tube are also “orthogonal” (Exercise 4.3).

Consider now the specific relationship between the volume velocity at the open end of the tube (the lips) and the volume velocity at the source (the glottis). From the volume velocity component of Equation (4.15), we have at x = l

(4.17)

Image

Let Image, i.e., the complex amplitude at the open tube end for complex input ug(Ω)ejΩt. Then we can write

(4.18)

Image

where Va(Ω) denotes the frequency response relating input to output volume velocities of the analog uniform-tube model of our simple speech production configuration. The roots of the denominator of Va (Ω), i.e., the uniformly-spaced frequencies at which the peak (infinite) amplitude occurs, are the resonances of the analog system. These roots are given by solutions to Image, where k is an integer or

Image

Example 4.2       Consider the frequency response in Equation (4.18) and suppose that a uniform tube has length l = 35 cm. With the speed of sound c = 350 m/s, the roots of our particular uniform tube are

Image

The frequency response is shown in Figure 4.8. Observe that the resonances have zero bandwidth since there is no energy loss in the system (the reader should argue this using the definitions of bandwidth in Chapter 2). As we decrease the length l of the tube, the resonant frequencies increase. We have all had the experience of filling a tall bottle under a faucet. The bottle (tube) is open at the top, and we can think of the water hitting the bottom as providing a volume velocity source. As the water runs, the open volume decreases, reducing the tube length and increasing the aurally perceived resonances. An example is shown in Figure 4.8 in which the effective bottle length decreases from l = 35 cm to l = 17.5 cm. Resonances can also be thought of as “natural frequencies” which are excited by a disturbance. Why should a slug of air have natural frequencies? As we have seen, air has mass and springiness, and, as with lumped mechanical or electrical systems having inductance and capacitance, such a “mass-spring” system is more responsive to certain frequencies. An alternate analogy is the transmission line that we reviewed earlier as having inductance and capacitance per unit length; the acoustic “line” also has inductance and capacitance per unit length. Image

Figure 4.8 Illustration of time-varying resonances as the effective length of a bottle decreases as it is filled with running water (Example 4.2). We aurally perceive the increasing resonances as the tube length decreases.

Image

An important quantity relating pressure and volume velocity is their ratio along the tube. We call this quantity the acoustic impedance, analogous to electric impedance given by voltage divided by current, and from Equation (4.15) we can express it as

Image

When the tube is very short with length denoted by Δx, using a Taylor series expansion of the tangent function, we obtain the approximation (looking into the tube from x = 0)

Image

where Image can be thought of as an acoustic mass, and the short plug of air acts as a lumped acoustic inductance10 where Image is inductance per unit length. As with electrical and mechanical impedance, a large value of ZA(Ω) implies that a large pressure is required to obtain a specified volume velocity, i.e., the medium “impedes” the flow of particles when acted upon. Impedance will become particularly important at glottal and lip boundaries because the impedance gives the pressure and velocity relations, coupled with the wave equation, that must be maintained in crossing these boundaries.

10 When the short tube is closed on the right, rather than open, then its acoustic impedance is of the form Image and acts rather as a lumped acoustic capacitance (Exercise 4.8).

Returning to our solution Equation (4.18), we can change the frequency response to a transfer function by replacing Ω by Image to obtain11

11 The Laplace transform of the continuous-time impulse response gives its transfer function. Recall that in the Laplace transform the complex variable s = σ + jΩ. When s is evaluated on the imaginary axis jΩ, i.e., σ = 0, then s = jΩ. Therefore, Image.

(4.19)

Image

From functional analysis, if a complex function ƒ (x) of a complex variable x meets certain restrictions, then, using a Taylor expansion, we can write ƒ (x) in terms of an infinite product of its zero factors, i.e., in terms of factors of the form (xxk), where xk is a zero. Flanagan has used this relation to show that Equation (4.19) can be written as [8]

Image

where, because the corresponding response is real, the poles occur in complex conjugate pairs. There are then an infinite number of poles that occur on the jΩ axis and can be written as

Image

corresponding to the resonant frequencies of the uniform tube. The poles fall on the jΩ axis, and thus the frequency response at resonant frequencies is infinite in amplitude, because there is no energy loss in the simple configuration. In particular, it was assumed that friction of air particles does not occur along the vocal tract walls, that heat is not conducted away through the vocal tract walls, and that the walls are rigid and so energy was not lost due to their movement. In addition, because we assumed boundaries consisting of an ideal volume velocity source and zero pressure at the lips, no energy loss was entailed at the input or output to the uniform tube. In practice, on the other hand, such losses do occur and will not allow a frequency response of infinite amplitude at resonance. Consider by analogy a lumped electrical circuit with a resistor, capacitor, and inductor in series. Without the resistor to introduce thermal loss into the network, the amplitude of the frequency response at resonance equals infinity. We investigate these forms of energy loss in the following two sections. Energy loss may also occur due to rotational flow, e.g., vortices and other nonlinear effects that were discarded in the derivation of the wave equation, such as from Equation (4.3). These more complex forms of energy loss are discussed in Chapter 11.

4.3.2 Effect of Energy Loss

Energy loss can be described by differential equations coupled to the wave equation that describes the pressure/volume velocity relations in the lossless uniform tube. These coupled equations are typically quite complicated and a closed-form solution is difficult to obtain. The solution therefore is often found by a numerical simulation, requiring a discretization in time and in space along the x variable.

Wall Vibration — Vocal tract walls are pliant and so can move under pressure induced by sound propagation in the vocal tract. To predict the effect of wall vibration on sound propagation, we generalize the partial differential equations of the previous section to a tube whose cross-section is nonuniform and time-varying. Under assumptions similar to those used in the derivation of Equation (4.7), as well as the additional assumption that the cross-section of a nonuniform tube does not change “too rapidly” in space (i.e., the x direction) and in time, Portnoff [26] has shown that sound propagation in a nonuniform tube with time- and space-varying cross-section A(x, t) is given by

(4.20)

Image

which, for a uniform time-invariant cross-section, reduces to the previous Equation pair(4.1), (4.4).

Portnoff then assumed that small, differential pieces of the surface of the wall, denoted by Image, are independent, i.e., “locally reacting,” and that the mechanics of each piece are modeled by a mass mImage, spring constant kImage, and damping constant bImage per unit surface area illustrated in Figure 4.9. Because the change in cross-section due to pressure change is very small relative to the average cross-section, Portnoff expressed the cross-section along the dimension x as A(x, t) = Ao(x, t) + ΔA(x, t), where ΔA(x, t) is a linear perturbation about the average Ao(x, t). Using the lumped parameter model in Figure 4.9, the second-order differential equation for the perturbation term is given by

(4.21)

Image

where the mass, damping, and stiffness per unit surface area of the wall are assumed to be constant, i.e., not a function of the variable x along the tube, and where we have used the relation ΔA(x, t) = So(x, t)r, So(x, t) being the average vocal tract perimeter at equilibrium and r being the displacement of the wall perpendicular to the wall. (The reader should perform the appropriate replacement of variables in the model equation of Figure 4.9.) The pressure and perturbation relations in Equation (4.21) are coupled to the relations in the modified wave equation pair, Equation (4.20), and thus the three equations are solved simultaneously, under the two boundary conditions of the known volume velocity source u(0, t) and the output pressure p(l, t) = 0. Discarding second-order terms in the expansion for, Image the three coupled equations in variables p, u, and ΔA for the uniform tube reduce approximately to [26]

(4.22)

Image

where we assume that the average cross-section Ao is constant, i.e., the vocal tract is time-invariant being held in a particular configuration.

Under the steady-state assumption that sound propagation has occurred long enough so that transient responses have died out, and given that the above three coupled equations are linear and time-invariant, an input ug(t) = u(0, t) = U(Ω)ejΩt results in solutions of the form p(x, t) = P(x, Ω)ejΩt, u(x, t) = U(x, Ω)ejΩt, and ΔA(x, t) = ΔÂ(x, Ω)ejΩt along the tube. Therefore, the time-dependence of the three coupled equations is eliminated; the resulting three coupled equations are then functions of the variable x for each frequency Ω:

(4.23)

Image

Figure 4.9 Mechanical model of differential surface element Image of vibrating wall. Adjacent wall elements are assumed uncoupled.

SOURCE: M.R. Portnoff, A Quasi-One-Dimensional Digital Simulation for the Time-Varying Vocal Tract [26]. ©1973, M.R. Portnoff and the Massachusetts Institute of Technology. Used by permission.

Image

Portnoff has solved these coupled equations using standard numerical simulation techniques12 invoking finite-difference approximations to the spatial partial differential operator Image. Suppose that the length of the tube is l. Then we seek the acoustic pressure and volume velocity at N equally spaced points over the interval [0, l]. In particular, Image was approximated by a central difference with averaging; that is, for a partial differential with respect to x of a function ƒ (x), first compute the central difference over a spatial difference Image

12 We can also obtain in this case an approximate closed-form, frequency-domain solution [26],[28], but we choose here to describe the numerical simulation because this solution approach is the basis for later-encountered nonlinear coupled equations.

Image

and then perform a first backward average:

Image

It can be shown that this transformation from the continuous- to the discrete-space variable is equivalent to a bilinear transformation.13 This partial differential approximation leads to 2N − 2 equations in 2N unknown pressure and velocity variables. The two additional equations necessary for a unique solution are supplied by the boundary conditions at the lips and glottis.

13 The bilinear transformation for going from continuous to discrete time is given by the s-plane to z-plane mapping of Image and has the property that the imaginary axis in the s-plane maps to the unit circle in the z-plane [22]. The frequency Ω in continuous time is related to the discrete-time frequency by the tangential mapping Image. Likewise, the bilinear mapping in going from continuous space to discrete space is given by Image and has the same tangential mapping as the time variable.

The resulting frequency response Image of the numerical simulation is shown in Figure 4.10a (pressure and velocity computed at 96 samples in space along the x variable) for a uniform tube with yielding walls and no other losses, terminated in a zero pressure boundary condition [26]. The tube is 17.5 cm in length and 5 cm2 in cross section. The yielding wall parameters are given by Flanagan [8] to be mImage = 0.4 gm/cm2, bImage = 6500 dyne-sec/cm3, and kImage = 0. The result is markedly different from the lossless case. Because of energy loss due to the wall vibration, the poles of the transfer function are no longer on the jΩ axis and so the bandwidth is nonzero. In addition, the resonant frequencies have slightly increased. Finally, these effects are most pronounced at low frequencies because the inertial mass of the wall results in less wall motion at high frequencies.

Viscosity and Thermal Loss — The effects of both viscous and thermal loss can be represented by modification of Equations (4.20) and (4.21) with the introduction of a resistive term, representing the energy loss due to friction of air particles along the wall, and a conductive term, representing heat loss through the vibrating walls, respectively [8],[26]. The resulting coupled equations are solved numerically for the steady-state condition, again using a central difference approximation to the spatial partial derivative [26]. With viscous and thermal loss only, effects are less noticeable than with wall vibration loss, being more pronounced at high frequencies where more friction and heat are generated. The addition of viscosity and thermal conduction to the presence of vibrating walls yields slight decreases in resonant frequencies and some broadening of the bandwidths. Figure 4.10b gives the result of Portnoff’s numerical simulation with all three losses for the uniform tube with zero pressure termination [26].

4.3.3 Boundary Effects

We have up to now assumed that the pressure at the lips is zero and that the volume velocity source is ideal, and thus there is no energy loss at the output or input of our uniform tube. A more realistic picture for voiced speech is shown in Figure 4.11 which shows the addition of a glottal and radiation load.

A description of sound radiation at the lips and diffraction about the head is quite complicated. The effect of radiation, however, can be simplified by determining the acoustic impedance “seen” by the vocal tract toward the outside world. An approximate radiation impedance can be obtained by determining the impedance felt by a piston in a rigid sphere representing the head. The expression of this impedance as determined by Morse and Ingard [19], however, cannot be written in closed form. In a second approximation, the piston is assumed small compared to the sphere diameter, modeled by a piston set in an infinitely large wall (Figure 4.11). For this configuration, it has been shown [2],[8] that the acoustic impedance consists of a radiation resistance Rr (energy loss via sound propagation from the lips) in parallel with a radiation inductance Lr (inertial air mass pushed out at lips):

Figure 4.10 Frequency response of uniform tube with (a) vibrating walls with p(l, 0) = 0; (b) vibrating walls, and viscous and thermal loss with p(l, 0) = 0; (c) vibrating walls, viscous and thermal loss, and radiation loss [26],[28]. 3 dB bandwidths are given.

SOURCE: M.R. Portnoff, A Quasi-One-Dimensional Digital Simulation for the Time-Varying Vocal Tract [26]. ©1973, M.R. Portnoff and the Massachusetts Institute of Technology. Used by permission.

Image

Figure 4.11 Glottal and lip boundary conditions as impedance loads, given by linearized serial and parallel electric circuit models, respectively. Piston in infinite wall model for radiation from lips is illustrated.

Image

(4.24)

Image

For the infinite baffle, Flanagan has given values of Rr = 128/9π2 and Lr = 8a/3π c where a is the radius of the opening and c the speed of sound.

Equation (4.24) can be represented in the time domain by a differential equation that is coupled to the wave equation solution of sound propagation within the vocal tract. As with other energy losses, Portnoff [26] has numerically simulated the effect of this coupling for the steady-state vocal tract condition. The frequency response Image with loss from the radiation load, as well as from vibrating walls, viscosity, and thermal conduction, is shown in Figure 4.10c. Consequences of the radiation load are broader bandwidths and a lowering of the resonances, the major effect being on the higher frequencies because radiation loss is greatest in this region. We can see this effect through Equation (4.24). For very small Ω ≈ 0,

Zr ≈ 0

so that the radiation load acts as a short circuit with p(l, t) ≈ 0. For very large Ω with ΩLr >> Rr

Image

and so the radiation load takes on the resistive component at high frequencies. Because the energy dissipated in radiation is proportional to the real part of the complex impedance Zr [2],[28] and because the real part of Zr monotonically increases with the resistance Rr (the reader should confirm this monotonicity), we deduce that the greatest energy loss, and thus the greatest formant bandwidth increase, from radiation is incurred at high frequencies.

Our final boundary condition is the glottal source and impedance. This is the most difficult addendum to our simple uniform tube because the glottal volume velocity has been shown to be nonlinearly related to the pressure variations in the vocal tract. We return to the true complexity of the glottal impedance later in this chapter; for now it suffices to use a simplification to obtain a flavor for the glottal impedance effect. By simplifying a nonlinear, time-varying two-mass vocal fold model for predicting glottal airflow (Figure 3.5 in Chapter 3), Flanagan and Ishizaka [9] proposed a linearized time-invariant glottal impedance as a resistance Rg in series with an inductance Lg:

(4.25)

Image

As illustrated in Figure 4.11, this impedance is placed in parallel with an ideal volume velocity source ug(t) that, for the case of voiced speech, is a typical glottal airflow velocity function as given in Chapter 3. Applying Kirchoff’s current law for electric circuits to the glottal source and impedance in Figure 4.11, we find the modified boundary condition is given by

Image

which in the time domain is a differential equation coupled to the partial differential wave equation and other differential loss equations described above. Portnoff [26] performed a numerical simulation of this more realistic glottal boundary condition and found for steady-state a broadening of bandwidths at low resonances; this is consistent with Equation (4.25) because Zg (Ω) approaches an open circuit for high frequencies (Zg(Ω) ≈ j Ω Lg), and approaches a pure resistance for low frequencies (Zg(Ω) ≈ Rg).

We have up to now found the frequency response relating volume velocity at the lips, U(l,Ω), to input volume velocity at the glottis, ug(Ω). In practice, we measure the pressure at the lips with a pressure-sensitive microphone transducer. The pressure-to-volume velocity frequency response can be found as [28]

(4.26)

Image

where the radiation impedance Image, from Figure 4.11 and Equation (4.24), introduces a highpass filter effect onto the frequency response Va(Ω).

4.3.4 A Complete Model

From the previous sections, the complete model of the uniform tube involves boundary condition effects, as well as energy loss due to vibrating walls, viscosity, and thermal conduction. The effects can be summarized as follows:

1. Resonances are due to the vocal tract; there is some shift of the resonant center frequencies due to the various sources of energy loss.

2. Bandwidths of the lower resonances are controlled by vibrating walls and glottal impedance loss. In practice, the glottal impedance is assumed to be infinite, i.e., Zg = ∞, and the sources of broadening the bandwidths of lower resonances are lumped into the vibrating walls.

3. Bandwidths of the higher resonances are controlled by radiation, viscous, and thermal loss.

The systems described in the previous sections, from which these conclusions are drawn, were numerically simulated under certain simplifying assumptions such as a fixed uniform tube (except for the inclusion of yielding walls) and linearized and time-invariant networks representing glottal and radiation boundary conditions. In reality, these conditions are only approximate. For example, the vocal tract tube varies in time as well as in space, and the glottal airflow impedance model was significantly simplified, the nonlinear two-mass vocal fold model of Figure 3.5 being a more realistic basis. As such, the time-dependence cannot be taken out of the equations, as we did, for example, in taking Equation (4.22) to Equation (4.23), and therefore the resulting coupled equations, generally nonlinear,14 must be solved in time as well as in space. Solving these nonlinear two-dimensional partial differential equations poses a considerable challenge.

14 For the moment, we are not addressing all nonlinear contributions that influence airflow and pressure, as described at the end of Section 4.2.

Portnoff performed a numerical simulation of an even more complete model of sound propagation in the vocal tract using Flanagan and Ishizaka’s nonlinear two-mass vocal fold model and with fixed nonuniform tubes representing various vowels. This model involves coupling of the vocal tract wave equation with differential equations of loss, as well as nonlinear differential equations describing the two-mass vocal fold model of Flanagan and Ishizaka and a nonlinear volume velocity/pressure relation at the glottis to be described in Section 4.5. Unlike in the numerical simulation of Equation (4.23), the time-dependence of these coupled equations cannot be eliminated because of the presence of nonlinearity. The numerical simulation, therefore, requires discretization in time Δt as well as in the spatial variable Δx. As illustrated in Figure 4.12, the discretization occurs over a semi-infinite strip in the time-space plane where the boundary condition in space on the line x = −l is given by the glottal airflow source velocity and impedance model, and the boundary condition on the line x = 0 is given by the radiation load. The boundary condition on the time line t = 0 (0 ≤ xl) is given by the is given by assuming zero initial conditions of pressure and volume velocity. Discretization in time is performed by replacing time differentials with a central difference operator followed by a first backward average, as was the discretization in space shown earlier. This discretization is performed over a net of uniformly spaced discrete points in time and space shown in Figure 4.1215 and results in a set of difference equations whose solution gives values of acoustic volume velocity and pressure at points on the net.

15 For a lossless system, with time and spatial sampling constrained by the relation Δx = c Δt , the partial differential Equation (4.20) is simulated exactly [26].

Figure 4.12 Numerical simulation specifications for a complete production model with glottal and radiation boundary conditions: (a) boundary conditions with semi-infinite strip in time-space (t, x) plane on which partial differential equations are solved; (b) rectangular net of time-space discretization.

SOURCE: M.R. Portnoff, A Quasi-One-Dimensional Digital Simulation for the Time-Varying Vocal Tract [26]. ©1973, M.R. Portnoff and the Massachusetts Institute of Technology. Used by permission.

Image

Example 4.3       The result of Portnoff’s more complete numerical simulation for the vowel /o/ is shown in Figure 4.13 illustrating both the estimated frequency response Va(Ω), as well as the estimated glottal volume velocity and lip pressure function [26]. The nonuniform area function was estimated from vocal tract measurements. Observe that the vocal tract impulse response is “truncated” roughly during the open phase of the vocal fold vibration, the envelope of the waveform abruptly falling at the onset of the glottal opening. In Section 4.5, we show that this envelope decay is typical and reflects a rapid decrease in the temporal envelope of primarily the first formant, and thus a corresponding increase in its bandwidth, due to a nonlinear coupling between the vocal tract pressure and glottal volume velocity.Image

The resulting simulations give a reasonable match16 (with respect to formant center frequencies and bandwidths) of predicted frequency responses with those of natural vowels [26],[28]. Nevertheless, the need to invoke a time and spatial numerical simulation is cumbersome. A simpler alternative, but requiring additional approximations on pressure/volume velocity relations, is to estimate the desired transfer function Va(s) through a concatenated tube approximation of the cross-section function A(x, t). This approach leads to a discrete-time model whose implementation can be made computationally efficient using digital signal processing techniques, and which is described in the following section.

16 For turbulent noise sources, generated by high velocity flow through a constriction, the approximations leading to our coupled sets of differential equations are not necessarily valid [26].

Figure 4.13 Complete numerical simulation of the vowel /o/: (a) frequency response of output lip pressure; (b) glottal airflow volume velocity; (c) lip output sound pressure [26].

SOURCE: M.R. Portnoff,A Quasi-One-Dimensional Digital Simulation for the Time-Varying Vocal Tract [26]. ©1973, M.R. Portnoff and the Massachusetts Institute of Technology. Used by permission.

Image

4.4 A Discrete-Time Model Based on Tube Concatenation

A widely used model of sound propagation in the vocal tract during speech production is based on the assumption that the vocal tract can be represented by a concatenation of short lossless uniform acoustic tubes, as illustrated in Figure 4.14. We assume no energy loss within each tube and that loss is incurred only at the two boundaries, i.e., at the glottis and the lips. Furthermore, as alluded to earlier, it is typical to assume the glottal impedance to be infinite, so that loss is incurred only by way of radiation from the lips. Advantages of this model are that it is linear and easy to work with, providing a straightforward means to obtain a frequency response Image avoiding numerical simulation. Although assuming no loss in each tube segment, we will see that the radiation impedance can be modified to match observed formant bandwidths. The concatenated tube model also provides a convenient transition from a continuous-time to a discrete-time all-pole model, and the resulting all-pole model leads naturally into the linear prediction speech analysis of Chapter 5. On the other hand, although the frequency response predicted from the concatenated tube model can be made to approximately match spectral measurements, the concatenated tube model is less accurate in representing the physics of sound propagation than the coupled partial differential equation models of the previous sections. The contributions of energy loss from vibrating walls, viscosity, and thermal conduction, as well as nonlinear coupling between the glottal and vocal tract airflow, are not represented in the lossless concatenated tube model.

Figure 4.14 Concatenated tube model. The kth tube has cross-sectional area Ak and length lk.

Image

4.4.1 Sound Propagation in the Concatenated Tube Model

Consider the concatenated tube model in Figure 4.14 in which the kth tube has cross-sectional area Ak and length lk . Because each tube is lossless and because we assume planar wave propagation, the pressure and volume velocity relations in each tube satisfy the wave equation, Equation (4.7), each quantity having a forward- and backward-traveling wave component as given in Equation (4.8). The volume velocity and presure solution for the kth tube is therefore written as

(4.27)

Image

where the kth tube has length lk, the spatial variable x falling in the range 0 ≤ xlk , as illustrated in Figure 4.14. To give a specific solution requires boundary conditions that are provided by the pressure and volume velocity at two adjacent tubes. In particular, we use the physical principle that pressure and volume velocity must be continuous both in time and in space everywhere in the system [28]. Therefore, at the kth / (k + 1)st junction we have

uk (lk, t) = uk+1(0, t)
pk (lk, t) = pk+1(0, t).

Applying the two boundary conditions with Equation (4.27), we then have

(4.28)

Image

Now define Image, which is the time of propagation down the length of the tube. Then the first equation in the Equation (4.28) pair can be written as

Image

and substituting this expression for Image into the second equation of our pair, we obtain, after some algebra,

(4.29)

Image

Then subtracting the top from the bottom component of the above modified equation pair, we obtain, after some rearranging,

(4.30)

Image

Equations (4.29) and (4.30) illustrate the general rule that at a discontinuity along x in the area function A(x, t) there occur propagation and reflection of the traveling wave, and so in each uniform tube, part of a traveling wave propagates to the next tube and part is reflected back (Figure 4.15). This is analogous to propagation and reflection due to a change in the impedance along an electrical transmission line.

From Equation (4.29), we can therefore interpret Image, which is the forward-traveling wave in the (k + 1)st tube at x = 0, as having two components:

1. A portion of the forward-traveling wave from the previous tube,Image, propagates across the boundary.

2. A portion of the backward-traveling wave within the (k+1)st tube,Image, gets reflected back.

Observe that Image occurs at the (k+1)st junction and equals the forward wave Image, which appeared at the kth junction τk seconds earlier. This interpretation is applied generally to the forward/backward waves at each tube junction. Likewise, from Equation (4.30), we can interpret Image, which is the backward-traveling wave in the kth tube at x = lk, as having two components:

1. A portion of the forward-traveling wave in the kth tube,Image gets reflected back.

2. A portion of the backward-traveling wave within the (k + 1)st tube, Image, propagates across the boundary.

Figure 4.15 further illustrates the forward- and backward-traveling wave components at the tube boundaries.

The above interpretations allow us to see how the forward- and backward-traveling waves propagate across a junction. As a consequence, we can think of Image as a reflection coefficient at the kth junction, i.e., the boundary between the kth and (k + 1)st tube, that we denote by rk:

(4.31)

Image

which is the amount of Image that is reflected at the kth junction. It is straightforward to show that because Ak > 0, it follows that −1 ≤ rk ≤ 1. From our definition Equation (4.31), we can write Equations (4.29) and (4.30) as

(4.32)

Image

which can be envisioned by way of a signal flow graph illustrated in Figure 4.16a. We will see shortly that these equations at the junctions lead to the volume velocity relations for multiple concatenated tubes. Before doing so, however, we consider the boundary conditions at the lips and at the glottis.

Figure 4.15 Sound waves in the concatenated tube model consist of forward- and backward-going traveling waves that arise from reflection and transmission at a tube junction.

Image

Figure 4.16 Signal flow graphs of (a) two concatenated tubes; (b) lip boundary condition; (c) glottal boundary condition. An open circle denotes addition.

SOURCE: L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals [28]. ©1978, Pearson Education, Inc. Used by permission.

Image

Suppose there are a total of N tubes. Then the boundary conditions at the lips relate pressure pN (lN, t) and the volume velocity uN (lN , t) at the output of the N th tube [28]. From continuity pN (lN , t) = pL(t), which is the pressure at the lips output. Likewise, uN (lN , t) = uL(t), which is the volume velocity at the output of the lips. Suppose now that the radiation impedance is real, corresponding to a very large frequency Ω and thus, from Equation (4.24), a purely resistive load, i.e., Zr(Ω) = Rr. Then, from Figure 4.11, we can write the pressure/volume velocity relation at the lips as

(4.33)

Image

where Zr is not a function of Ω. From our earlier expressions for pressure and volume velocity within a tube, we then have

Image

This expression can be written as

Image

and thus

(4.34)

Image

where the reflection coefficient at the lips is defined as

Image

Equation (4.34) states that the forward-going wave is reflected back at the lips boundary with reflection coefficient rL. There is no backward-going wave contribution from free space to the right. The volume velocity at the lips output can then be expressed as

Image

which is shown schematically in Figure 4.16b. A complex radiation impedance Zr (Ω) can be invoked by using a frequency-domain version of the previous derivation or by using a differential equation, the inductance resulting in a first-order ordinary differential equation.

We can also model the radiation from the lips with an additional tube of cross-section AN+1 and infinite length. The tube has no backward-propagating wave because of its infinite length, which is consistent with sound radiating out from the lips. In addition, if we select AN + 1 such that Image, then the expression for rL becomes

(4.35)

Image

which is our original definition of reflection coefficient rN for two adjacent tubes of cross-sectional area AN + 1 and AN . Observe that when Zr (Ω) = 0 then rL = 1, the pressure drop is zero, and the imaginary added tube becomes infinite in cross-section because AN must be finite.

The boundary condition model for the glottis is shown in Figure 4.16c where the glottal impedance is assumed approximated as a series resistance and inductance. We can think of the glottal slit, anatomically, as sitting just below (schematically, to the left of) the first tube of the concatenated tube model, and so we denote the pressure at the output of the glottis by p1 (0, t), i.e., the pressure at the left edge of the first tube, and the volume velocity output of the glottis by u1 (0, t), i.e., the volume velocity at the left edge of the first tube. Suppose that the glottal impedance is real (purely resistive), i.e.,Zg = Rg, then using Kirchoff’s current law, we can write in the time domain [28]

Image

Then from our earlier expressions for volume velocity and pressure within a uniform tube [Equation (4.27)], we have

Image

Solving for the forward-going traveling wave

(4.36)

Image

where

Image

If Zg(Ω) is complex, then, as with a complex Zr(Ω), a differential equation realization of the glottal boundary condition is required. A flow diagram of the flow relations in Equation (4.36) is illustrated in Figure 4.16c. As with the radiation impedance, we can also model the effect of glottal impedance with an additional tube of cross-section A0 and infinite length. (We are looking down into the lungs.) If we make A0 such that Image, then our expression for rg becomes

Image

Consider now the special case of two concatenated lossless tubes of equal length, with radiation and glottal boundary conditions, depicted in the flow diagram of Figure 4.18a. In Exercise 4.9, using this flow diagram, you are asked to show that the transfer function relating the volume velocity at the lips to the glottis is given by

(4.37)

Image

where b = (1 + rg)(1 + rL)(1 +r1)/2, a1 = r1rg + r1rL, and a2 = rLrg. In this model, as well as more general models with an arbitrary number of lossless tubes, all loss in the system occurs at the two boundary conditions. Nevertheless, it is possible to select loss at the glottis and lips to give observed formant bandwidths. Furthermore, with this configuration Fant [7] and Flanagan [8] have shown that with appropriate choice of section lengths and cross-sectional areas, realistic formant frequencies can be obtained for vowels. As we noted earlier, however, the resulting model is not necessarily consistent with the underlying physics. In the following section, we describe a means of converting the analog concatenated tube model to discrete time. In this discrete-time realization, as in the analog case, all loss, and hence control of formant bandwidths, is introduced at the two boundaries.

4.4.2 A Discrete-Time Realization

The flow diagram in Figure 4.18a, because of the discrete delay elements, suggests that the concatenated tubes may be easily taken to a discrete-time realization. Consider a model consisting of N lossless concatenated tubes with total length l. As with the two-tube example, we make the time delays down each tube, and thus the length of each tube, equal. Each tube is of length Image and the time for the wave to propagate through one tube is Image Following Rabiner and Schafer [28], consider now the response of the N -tube model to a glottal input volume velocity ug(t) equal to a single impulse δ(t). If the impulse traveled down the tube without reflection, it would appear at the output of the final tube with a time delay of Nτ seconds. The impulse, however, is partially reflected and partially propagated at each junction so that the impulse response is of the form

(4.38)

Image

where 2τ is the round-trip delay within a tube. The earliest arrival is at time Nτ and the following arrivals occur at multiples of 2τ after this earliest arrival due to multiple reflections and propagations. Because the Laplace transform (we are in continuous time) of a delayed impulse δ(tto) is given by est0, we can write the Laplace transform of Equation (4.38) as

Image

where esNτ corresponds to the delay required to propagate through N sections. Without this time delay, the impulse response is simply υa(t + Nτ); therefore, because we can always recover υa(t) with a simple time shift, we ignore the time delay of Nτ. Making the change of variables s = jΩ, we obtain the frequency response

Image

which can be shown to be periodic with period Image (Figure 4.17a):

Image

The intuition for this periodicity is that we have “discretized” the continuous-space tube with space-interval Image, and the corresponding time-interval Image, so we expect periodicity to appear in the transfer function representation.

From Figure 4.17a, we see that Va(Ω ) has the form of a Fourier transform of a sampled continuous waveform with sampling time interval T = 2τ. We use this observation to transform the analog filtering operation to discrete-time form with the following steps illustrated in Figure 4.17:

S1: Using the impulse-invariance method, i.e., replacing esT with the complex variable z where T = 2τ,we transform the system function Va(Ω) to discrete-time:

Image

which, with the replacement of esT by the complex variable z, becomes

Image

The frequency response V(ω) = V(z)|z=ejw will be designed to match desired formant resonances over the interval [−π, π].

S2: Consider an excitation function ug(t) that is bandlimited with maximum frequency Image and sampled with a periodic impulse train with sampling interval T = 2τ, thus meeting the Nyquist criterion to avoid aliasing. The Fourier transform of the resulting excitation is denoted by ug(Ω). We then convert the impulse-sampled continuous-time input to discrete time. This operation is illustrated in the frequency-domain in Figure 4.17b.

S3: A consequence of using the impulse invariance method to perform filter conversion is a straightforward conversion of a continuous-time flow graph representation of the model to a discrete-time version. An example of this transformation is shown in Figure 4.18 for the two-tube case. Since the mapping of esT to z yields V(z), the discrete-time signal flow graph can be obtained in a similar way. A delay of τ seconds corresponds to the continuous-time factor esτ = es2τ/2 = esT/2, which in discrete-time is a half-sample delay. Therefore, in a signal flow graph we can replace the delay τ by z−1/2. Since a half-sample delay is difficult to implement (requiring interpolation), we move all lower-branch delays to the upper branch, observing that delay is preserved in any closed branch with this change, as illustrated in Figure 4.18. The resulting delay offset can be compensated at the output [28].

Figure 4.17 Frequency-domain view of discretizing analog filtering by concatenated tubes with equal length: (a) conversion of impulse-sampled continuous-time impulse response of concatenated tubes to discrete time; (b) conversion of impulse-sampled continuous-time glottal input to discrete time.

Image

S4: The final step in the conversion of continuous-time filtering to discrete-time is to multiply the discrete-time frequency responses of the excitation and the vocal tract impulse response to form the frequency response of the discrete-time speech output.

We now step back and view the discretization process from a different perspective. The original continuously spatial-varying vocal tract, assuming that it can be modeled as linear and time-invariant, has an impulse response that we denote by Image. By discretizing the vocal tract spatially, we have in effect sampled the impulse response temporally with sampling interval T = 2τ. The constraint Δx = cτ indicates that spatial and time sampling are intimately connected. The resulting frequency response is generally an aliased version of the (true) frequency response Image because the sampling may not meet the Nyquist criterion. The smaller the sub-tube length (increasing the number of tubes), then the less aliasing and the closer is the discrete-time frequency response to the (true) analog Image. We will see shortly that increasing the number of tubes implies that we increase the number of poles in the discrete model. The following example illustrates an interesting case:

Figure 4.18 Signal flow graph conversion to discrete time: (a) lossless two-tube model; (b) discrete-time version of (a); (c) conversion of (b) with single-sample delays.

SOURCE: L.R. Rabiner and R.W. Schafer,Digital Processing of Speech Signals [28].©1978, Pearson Education, Inc. Used by permission.

Image

Example 4.4       Consider a uniform tube, itself modeled as a concatenation of uniform tubes, but with only a single tube element. The frequency response Image has an infinite number of resonances, as shown in Example 4.2. (Note that Va(Ω ) had previously denoted the true frequency response.) How can we possibly capture all of these resonances in discrete time? If we sample withImage, then we have only “half a resonance” over the discrete-time frequency range [0,π]; therefore, we need to divide the uniform tube into two equal tubes to allow a full resonance over [0,π], i.e., we need to sample spatially twice as fast. This is an intriguing example which gives insight into the relation of the continuous- and discrete-time representations of resonant tubes. (The reader is asked to further elaborate on this example in Exercise 4.18.)Image

Our next objective is to derive a general expression for V(z) in terms of the reflection coefficients. From the discrete-time flow graph in Figure 4.18, we know that

Image

To obtain the function ƒ (rg, r1, r2 … , rL) from the flow graph directly is quite cumbersome. The flow graph, however, gives a modular structure by which the transfer function can be computed [28]. The resulting transfer function can be shown to be a stable all-pole function with bandwidths determined solely by the loss due to Zg(Ω) and Zr(Ω), i.e.,V(z,) is of the form

Image

where

Image

and where the poles correspond to the formants of the vocal tract. Moreover, under the condition rg = 1, i.e., infinite glottal impedance (Zg(Ω) = ∞) so that no loss occurs at the glottis, a recursion from the lips to the glottis can be derived whereby the transfer function associated with each tube junction is introduced recursively [28]. The recursion for the denominator polynomial D (z) is given by

Image

Because the vocal tract tube cross-sections (from which reflection coefficients are derived) Ak > 0, we can show that the poles are all inside the unit circle, i.e., the resulting system function is stable.17 As before, we can imagine the (N + 1)st tube is infinite in length with AN +1 selected so that rN = rL. When Zg(Ω) = ∞, and when Zr = 0 so that rN = rL = 1, then there is a short circuit at the lips. Under this condition, there is no loss anywhere in the system and hence zero bandwidths arise; with Zg(Ω) = ∞, the radiation impedance Zr(Ω) is the only source of loss in the system and controls the resonance bandwidths. The following example illustrates how to choose the number of tube elements to meet a desired bandwidth constraint.

17 When Zg(Ω) = ∞ the recursion is associated with Levinson’s recursion, which will be derived in the context of linear prediction analysis of Chapter 5. Using Levinson’s recursion, we will prove that the poles of D(z) lie inside the unit circle.

Example 4.5       Let the vocal tract length l = 17.5 cm and the speed of sound c = 350 m/s. We want to find the number of tube sections N required to cover a bandwidth of 5000 Hz, i.e., the excitation bandwidth and the vocal tract bandwidth are 5000 Hz. Recall that Image and that Image is the cutoff bandwidth. Therefore we want Image Solving for τ, the delay across a single tube, Image Thus, from above we have Image Since N is also the order of the all-pole denominator, we can model up to Image complex conjugate poles. We can also think of this as modeling one resonance per 1000 Hz.Image

We see that the all-pole transfer function is a function of only the reflection coefficients of the original concatenated tube model, and that the reflection coefficients are a function of the cross-sectional area functions of each tube, i.e., Image. Therefore, if we could estimate the area functions, we could then obtain the all-pole discrete-time transfer function. An example of this transition from the cross-sectional areas Ak to V(z) is given in the following example:

Example 4.6       This example compares the concatenated tube method with the Portnoff numerical solution using coupled partial differential equations [26]. Because an infinite glottal impedance is assumed, the only loss in the system is at the lips via the radiation impedance. This can be introduced, as we saw above, with an infinitely-long (N + 1)th tube, depicted in Figure 4.19 with a terminating cross-sectional area selected to match the radiation impedance, according to Equation (4.35), so that rN = rL. By altering this last reflection coefficient, we can change the energy loss in the system and thus control the bandwidths. For example, we see in Figure 4.19 the two different cases of rN = 0.714 (non-zero bandwidths) and rL = 1.0 (zero bandwidths). This example summarizes in effect all we have seen up to now by comparing two discrete-time realizations of the vocal tract transfer function that have similar frequency responses: (1) A numerical simulation, derived with central difference approximations to partial derivatives in time and space, and (2) A (spatially) discretized concatenated tube model that maps to discretized time.Image

4.4.3 Complete Discrete-Time Model

We have in the previous sections developed a discrete-time model of speech production. The discrete-time model was derived from a set of concatenated lossless tubes approximating the spatially-varying vocal tract, with a glottal input given by an ideal volume velocity source; the radiation load at the concatenated tube output is characterized by a highpass frequency response. In this section, we revisit and extend the discrete-time model of speech production with the three speech sound sources we introduced in Chapter 3, i.e., speech sounds with periodic, noise, and impulsive inputs. Before proceeding, however, we need to account for a missing piece in our discrete-time model.

Looking back to Section 4.3.3, we see that in practice speech pressure is measured at the lips output, in contrast to volume velocity which is represented in the discrete-time transfer function of Section 4.3.1. Using Equation (4.26), showing the conversion from volume velocity to pressure at the lips, the discrete-time transfer function from the volume velocity input to the pressure output, denoted by H(z), is given by

H (z) = V (z) R (z).

Figure 4.19 Comparison of the concatenated tube approximation with the “exact” solution for area function (estimated by Faut [7]) of the Russian vowel /a/ [26],[28]: (a) cross-section A(x) for a vocal tract model with 10 lossless sections and terminated with a 30 cm2 section that does not reflect; (b) reflection coefficients rk for 10 sections; (c) frequency response of the concatenated tube model—the solid curve corresponds to the lossless termination (zero bandwidths) and the dashed curve corresponds to the condition with loss (finite bandwidths); (d) frequency response derived from numerical simulation of Portnoff.

SOURCE: M.R. Portnoff, A Quasi-One-Dimensional Digital Simulation for the Time-Varying Vocal Tract [26]. ©1973, M.R. Portnoff and the Massachusetts Institute of Technology. Used by permission.

Image

R (z) denotes the discrete-time radiation impedance and V(z) is the discrete-time all-pole vocal tract transfer function from the volume velocity at the glottis to volume velocity at the lips. The radiation impedance R (z) = Zr (z) is a discrete-time counterpart to the analog radiation impedance Zr(s). You will show in Exercise 4.20 that R(z) ≈ 1 − z− 1 and thus acts as approximately a differentiation of volume velocity to obtain pressure, introducing about a 6 dB/octave highpass effect. Although R(z) is derived as a single zero on the unit circle, it is more realistically modeled as a zero slightly inside the unit circle, i.e.,

R(z) ≈ 1 − αz−1

with α < 1, because near-field measurements at the lips do not give quite the 6 dB/octave rolloff predicted by a zero on the unit circle [8]. The analog Zr(s) was derived by Flanagan under the assumption of pressure measurements in the far field, i.e., “sufficiently” far from the source [8]. Considering the pressure/volume velocity relation at the lips as a differentiator, the speech pressure waveform in continuous time, x(t), measured in front of the lips can be expressed as

Image

where the gain A controls loudness. (The reader should prove this equality.) The effect of radiation is therefore typically included in the source function; the source to the vocal tract becomes the derivative of the glottal flow volume velocity often referred as the glottal flow derivative, i.e., the source is thought of as18 Image rather than ug(t).

18 Moving the differentiation to the source holds strictly only when the speech production components are linear. We saw in Chapter 2 that components of a nonlinear system are not necessarily commutative.

The discrete-time speech production model for periodic, noise, and impulsive sound sources is illustrated in Figure 4.20. Consider first the periodic (voiced) speech case. For an input consisting of glottal airflow over a single glottal cycle, the z-transform of the speech output is expressed as

X(z) = AυG(z)H(z)
              = AυG(z)V(z)R(z)

where Aυ is again controlling the loudness of the sound and is determined by the subglottal pressure which increases as we speak louder, thus increasing the volume velocity at the glottis. G(z) is the z-transform of the glottal flow input, g[n], over one cycle and which may differ with the particular speech sound (i.e., phone), speaker, and speaking style.R(z) is the radiation impedance that we model as a single zero R(z) = 1 − αz−1, and V(z) is a stable all-pole vocal tract transfer function from the volume velocity at the glottis to the volume velocity at the lips, and which is also a function of the particular speech sound, speaker, and speaking style. We have seen in Chapter 3 that an approximation of a typical glottal flow waveform over one cycle is of the form

Figure 4.20 Overview of the complete discrete-time speech production model.

Image

g[n] = (βnu[−n]) * (βnu[−n]),

i.e., two time-reversed exponentially decaying sequences, that has z-transform

Image

which for real β < 1 represents two identical poles outside the unit circle. This model assumes infinite glottal impedance, i.e., no loss at the glottis [no loss at the glottis allowed us in the previous section to obtain an all-pole model for V(z)]. All loss in the system is assumed to occur by radiation at the lips. For the voiced case, the z-transform at the output of the lips can then be written over one glottal cycle as a rational function with poles inside and outside the unit circle and a single zero inside the unit circle, i.e.,

(4.39)

Image

where we assume Ci pole pairs of V(z) inside the unit circle. This rational function, along with a uniformly-spaced impulse train to impart periodicity to the input, is illustrated in the upper branch of Figure 4.20. Observe in this rational function that V(z) and R(z) are minimum-phase, while G(z), as we have modeled it, is maximum-phase, having two poles outside the unit circle. Referring to our discussion in Chapter 2 of frequency-domain phase properties of a sequence, we can deduce that the glottal flow input is responsible for a gradual “attack” to the speech waveform within a glottal cycle during voicing. We return to this important characteristic in later chapters when we develop methods of speech analysis and synthesis.

If we apply the approximate differentiation of the radiation load to the glottal input during voicing, we obtain a “source” function illustrated in Figure 4.21 for a typical glottal airflow over one glottal cycle. In the glottal flow derivative, a rapid closing of the vocal folds results in a large negative impulse-like response, called the glottal pulse, which occurs at the end of the open phase and during the return phase of the glottal cycle, as shown in Figure 4.21. The glottal pulse is sometimes considered the primary excitation for voiced speech, and has a wide bandwidth due to its impulse-like nature [1]. (Note that we are using the term “glottal pulse” more strictly than in Chapter 3.) We have also illustrated this alternative source perspective in the upper branch of Figure 4.20 by applying R(z) just after G(z).

Two other inputs to the vocal tract are noise and impulsive sources. When the source is noise, as, for example, with fricative consonants, then the source is no longer a periodic glottal airflow sequence, but rather a random sequence with typically a flat spectrum, i.e., white noise, although this noise may be colored by the particular constriction and shape of the oral tract. The output z-transform at the lips is then expressed by

Figure 4.21 Schematic of relation between (a) the glottal airflow and (b) the glottal flow derivative over a glottal cycle.

Image

X(z) = AnU(z)H(z)
              = AnU(z)V(z)R(z)

where U(z) denotes the z-transform of a noise sequence, u[n]. The third source is the “burst” which occurs during plosive consonants and which for simplicity we have modeled as an impulse. The output z-transform at the lips for the impulsive input is given by

X(z) = Ai H(z) = Ai V(z)R(z).

The noise and impulse source are shown in the lower two branches of Figure 4.20. Observe that all sources can occur simultaneously as with, for example, voiced fricatives, voiced plosives, and aspirated voicing. Furthermore, these sources may not be simply linearly combined, as we saw in the voiced fricative model of Example 3.4 of Chapter 3, a possibility that we have represented by the linear/nonlinear combiner element in Figure 4.20.

In the noise and impulse source state, oral tract constrictions may give zeros (absorption of energy by back-cavity anti-resonances) as well as poles. Zeros in the transfer function also occur for nasal consonants, as well as for nasalized vowels. Methods have been developed to compute the transfer function of these configurations [16],[17]. These techniques are based on concatenated tube models with continuity constraints at tube junctions and boundary conditions similar to those used in this chapter for obtaining the transfer function of the oral tract. (A number of simplifying cases are studied in Exercises 4.7 and 4.12.) In these cases, the vocal tract transfer function V(z) has poles inside the unit circle, but may have zeros inside and outside the unit circle, and thus X(z) is of the form

(4.40)

Image

where (1 − akz−1) and (1 − bkz) are zeros inside and outside the unit circle, respectively, and are due to the oral and/or nasal tract configurations. The vocal transfer function is therefore generally mixed phase, i.e., it is not necessarily an all-pole minimum-phase transfer function. The maximum-phase elements of the vocal tract can also contribute (i.e., in addition to the maximum-phase glottal flow alluded to earlier) to a more gradual attack of the speech waveform during voicing or at a plosive than is obtained by a minimum-phase sequence.

The discrete-time model given in this section, sometimes referred to as the “source/filter” model, can perform well in matching a measured spectrum or waveform or when only input/output relations are desired. The model output waveform and its spectrum are good matches to measurements with appropriate selection of the source, vocal tract filter parameters, and radiation. Nevertheless, much of the physics of our original model, as well as physics we have not yet attempted to model, such as effects of nonlinearities, are not represented by the simple source/filter model. Furthermore, different vocal tract shapes, sources, nonlinear effects, and subsystem coupling may yield a similar output, i.e., an output measurement may not be uniquely invertible. In Section 4.5, we look at one of the many fascinating possible model refinements: the nonlinear coupling between the glottal flow velocity and vocal tract pressure during voicing. A further glimpse beyond the simplicity of the source/filter model is saved for Chapter 11.

4.5 Vocal Fold/Vocal Tract Interaction

In the source/filter speech production model of Section 4.4.3, we assumed that the glottal impedance is infinite and that the glottal airflow source is not influenced by, or not “coupled” to, the vocal tract. This allowed us to model the glottal source as an ideal volume velocity that is convolved with a linear vocal tract filter impulse response to produce the speech waveform. In reality, there exists a nonlinear coupling between the glottal airflow velocity and the pressure within the vocal tract. In a qualitative sense, the pressure in the vocal tract cavity just above the glottis “backs up” against the glottal flow and interacts nonlinearly with the flow. In certain wind instruments, as in the trumpet, such a mechanism is even more pronounced and, indeed, is essential in determining the sound of the instrument; the vibration of the “lip reed” is strongly, and nonlinearly, coupled to the resonant frequencies of the cavity [3]. An accurate integrated vocal fold/vocal tract model would preserve the constraints between the glottal airflow volume velocity and pressure within the vocal tract chamber.

We presented in Section 4.3.4 one approach to obtaining a more accurate coupled source/filter model that uses the partial differential wave equation description of pressure and volume velocity in the vocal tract, the differential equations of loss in the tract and from lip radiation, and Flanagan and Ishizaka’s two-mass model of vocal fold vibration (requiring measurement of model parameters) that invokes a nonlinear relation between glottal volume velocity and pressure. The resulting coupled equations have time-varying parameters and are nonlinear, and thus require a numerical solution, as was performed by Portnoff [26]. We saw that one unexpected result from Portnoff’s simulation is that the vocal tract impulse response can abruptly decay within a glottal cycle (Figure 4.13). In this section, we use a simplified model of vocal fold/vocal tract coupling that gives insight into and predicts this “truncation” phenomenon, as well as predicts the modulation of formant frequencies and bandwidths within a glottal cycle. These insights are gained without the need of a numerical simulation and without the requirement of measuring the vocal fold parameters of Flanagan and Ishizaka’s two-mass model [1].

4.5.1 A Model for Source/Tract Interaction

Figure 4.22 shows an electrical analog of a model for airflow through the glottis. In this model, voltage is the analog of sound pressure, and current is the analog of volume velocity. psg at the left is the subglottal (below the glottis) pressure in the lungs that is the power source for speech; during voicing it is assumed fixed. p(t) is the sound pressure corresponding to a single first formant19 in front of the glottis. Zg(t) is the time-varying impedance of the glottis, defined as the ratio of transglottal pressure (across the glottis), ptg(t), to the glottal volume velocity (through the glottis), ug(t), i.e.,

19 Recall that the vocal tract transfer function has many complex pole pairs corresponding to speech formants. Only the first formant is included in Figure 4.22, because for higher formants the impedance of the vocal tract is negligible compared to that at the glottis. Numerical simulations, to be described, initially included multiple formants of the vocal tract and multiple subglottal resonances. These simulations verify that formants above the first formant and subglottal resonances negligibly influence glottal flow [1].

(4.41)

Image

where all quantities are complex. The R, C, and L elements are resistance, capacitance, and inductance, respectively, of an electrical model for the first formant whose nominal frequency and (3 dB attenuation) bandwidth are given by

(4.42)

Image

From Figure 4.22, we see that if Zg(t) is comparable to the impedance of the first-formant model, there will be considerable “interaction” between the source and the vocal tract, in violation of the source/filter assumption of independence. This interaction will induce an effective change in nominal center frequency Ωo and bandwidth Bo of the first formant.

Empirically, and guided by aerodynamic theories, it has been found that Zg(t) is not only time-varying, but also nonlinear. Specifically, it has been shown by van den Berg [31] that the transglottal pressure and glottal volume velocity are related by

(4.43)

Image

Figure 4.22 Diagram of a simple electrical model for vocal fold/vocal tract interaction. Only the first-formant model is included.

SOURCE: C.R. Jankowski, Fine Structure Features for Speaker Identification [10]. ©1996, C.R. Jankowski and the Massachusetts Institute of Technology. Used by permission.

Image

where ρ is the density of air and A(t) is the smallest time-varying area of the glottal slit. (Looking down the glottis, the cross-sectional glottal area changes with depth.) The term k = 1.1 and includes the effect of a nonuniform glottis with depth. Writing the current nodal equation of the circuit in Figure 4.22, and using Equation (4.43), we can show that (Exercise 4.19)

(4.44)

Image

where

p(t) + ptg(t) = psg.

Equations (4.43) and (4.44) represent a pair of coupled nonlinear differential equations with time-varying coefficients, the time variation entering through the changing glottal slit area and the nonlinearity entering through the van den Berg glottal pressure/volume velocity relation.

One approach to simultaneously solving the above Equation pair (4.43), (4.44) is through numerical integration [1]. In this simulation, it was determined that the time-domain skewness of the glottal flow over one cycle (we saw this skewness in Chapter 3) is due in part to the time-varying glottal area function and in part to the loading by the vocal tract first formant; pressure from the vocal tract against the glottis will slow down the flow and influence its skewness. In addition to an asymmetric glottal flow, the numerical simulation also revealed an intriguing sinusoidal-like “ripple” component to the flow. The timing and amount of ripple are dependent on the configuration of the glottis during both the open and closed phases [1],[24],[25]. For example, with folds that open in a zipper-like fashion, the ripple may begin at a low level early into the glottal cycle, and then grow as the vocal folds open more completely. We can think of the ripple as part of the fine structure of the glottal flow, superimposed on the more slowly-varying asymmetric coarse structure of the glottal flow, as illustrated in Figure 4.23, showing the two components in a schematic of a typical glottal flow derivative over one cycle. We describe the separation of coarse- and fine-structure glottal flow components in Chapter 5. Finally, the “truncation” effect over a glottal cycle, alluded to earlier, was also observed.

Figure 4.23 Glottal flow derivative waveform showing coarse and ripple component of fine structure due to source/vocal tract interaction.

SOURCE: M.D. Plumpe, T.F. Quatieri, and D.A. Reynolds, “Modeling of the Glottal Flow Derivative Waveform with Application to Speaker Identification” [25]. ©1999, IEEE. Used by permission.

Image

There is an alternative way of looking at this problem that leads to an equivalent representation of the vocal fold/vocal tract interaction which gives similar observed effects, but more intuition about the interaction [1]. We again simplify the problem by assuming a one-formant vocal tract load. Then, from Equation (4.44) and with ptg(t) = psgp(t) as the pressure across the glottis, we can write

Image

Observing that from the Taylor series expansion of Image we can linearize the square root function as Image and assuming p(t) << psg, we have

Image

Then with some algebra, we have

Image

that can be rewritten as

(4.45)

Image

where

Image

Now differentiating Equation (4.45) with respect to time, we can show that (Exercise 4.19)

(4.46)

Image

which is represented by the Norton equivalent circuit of Figure 4.24 to the original circuit of Figure 4.22, where the equivalent time-varying resistance and inductance are given by

(4.47)

Image

which are in parallel, respectively, with the first-formant resistance, capacitance, and inductance, R, C, and L. The new time-varying volume velocity source is given by usc(t) of Equation (4.45).

Figure 4.24 Transformed vocal fold/vocal tract first-formant interaction model that is Norton equivalent to circuit of Figure 4.22.

SOURCE: C.R. Jankowski, Fine Structure Features for Speaker Identification [10]. ©1996, C.R. Jankowski and the Massachusetts Institute of Technology. Used by permission.

Image

4.5.2 Formant Frequency and Bandwidth Modulation

Equation (4.46) can be interpreted as a linear differential equation with time-varying coefficients that cause a modulation of the formant frequency and bandwidth. To see how this modulation arises, we map Equation (4.46) to the frequency domain.

Standard Laplace transform methods cannot directly be applied because the coefficients of the equation are time-varying. Nevertheless, to obtain an approximate frequency-domain representation of the vocal cord/vocal tract interaction, we assume that at each time instant the glottal impedance is stationary, i.e., we pretend at each time instant that the linear differential equation represents a time-invariant system. Under this assumption, we can obtain a “pseudo-transfer function” from the volume-velocity source to the vocal tract pressure, Image with time-varying coefficients that are functions of the changing formant center frequency and bandwidth [1]. This transfer function is given by (Exercise 4.19)

(4.48)

Image

with denominator polynomial coefficients given by the time-varying formant frequency Ω1(t) and time-varying bandwidth B1(t):

(4.49)

Image

where the nominal frequency and bandwidth, Ωo and Bo, are defined in Equation (4.42), and g0(t) is proportional to the time-varying area of the glottis, as seen in Equation (4.45). Over a glottal cycle, the changing bandwidth B1(t) then follows that of the area function. The formant frequency Ω1(t), being proportional to the derivative of the area function, rises at the onset of the glottal open phase and falls near the termination of this phase [1],[10],[24],[25].

We see in the expression for the bandwidth B1(t) that the nominal bandwidth, Bo, is multiplied by a time-dependent bandwidth multiplier factor, Image. Because A(t) ≥ 0 then B1(t) ≥ Bo, i.e., opening the glottis only increases the bandwidth of the formant. As with bandwidth modulation, the nominal formant frequency, Ωo, is multiplied by a time-dependent factor. Due to the dependence of Ω1(t) on the derivative of the glottal area, during a complete glottal cycle, Ω1(t) will modulate both below and above its nominal value. The following example illustrates these properties:

Example 4.7       Figure 4.25b gives the theoretically calculated B1(t) as a function of time over two glottal cycles for five Russian vowels /a/, /o/, /u/ , /i/, and /e/ for the glottal area function A(t) in Figure 4.25a (from [1],[10]). The increase in bandwidth due to the opening glottis is higher for vowels with higher first formant. Even in the cases of minimum bandwidth modulation, i.e., for vowels with very low first formant (e.g., /i/,/u/), B1(t) is increased by a factor of about three to four.
        Figure 4.25c gives the Ω1(t) multiplier for the five vowels, again for the glottal area function A(t) in Figure 4.25a. The Ω1(t) multiplier ranges from about 0.8 to 1.2. It is not necessarily larger for vowels with high first formant, as with bandwidth modulation. An interesting property of the Ω1(t) multiplier is that at both glottal opening and closure, the change in Ω1(t) is rather instantaneous, and can be from 10–20% of the formant frequency. Note that the glottal area function in Figure 4.25a is a simple approximation; we have seen that an actual glottal area does not increase as rapidly at glottal opening as at closing, due to physical constraints. Image

Figure 4.25 Illustration of time-varying first formant F1 frequency and bandwidth for five Russian vowels [1]: (a) glottal area function; (b) bandwidth Image (c) formant frequency multiplier Image

SOURCE: C.R. Jankowski, Fine Structure Features for Speaker Identification [10]. ©1996, C.R. Jankowski and the Massachusetts Institute of Technology. Used by permission.

Image

The increase of B1(t) within a glottal cycle is responsible for the effect we have been calling “truncation,” and is due to a decrease in the impedance at the glottis as the glottis opens; the reduced glottal impedance diverts the glottal source velocity away from the vocal tract.

Figure 4.26 Illustration of the effect of truncation with synthetic waveforms. Truncated responses are denoted by solid lines and corresponding untruncated responses by dotted lines. The signal frequency is 300 Hz in panels (a) and (c) and 700 Hz in panels (b) and (d). The maximum bandwidth (referred to as truncated BW) during truncation is 200 Hz in panels (a) and (b) and 400 Hz in panels (c) and (d) for a nominal bandwidth of 60 Hz.

SOURCE: C.R. Jankowski, Fine Structure Features for Speaker Identification [10]. ©1996, C.R. Jankowski and the Massachusetts Institute of Technology. Used by permission.

Image

Figure 4.26 shows the result of truncation for numerous synthetic waveforms [10]. The waveforms are generated with an implementation of the Klatt speech synthesizer which has the capability of continuously varying formant frequencies and bandwidths by updating resonator parameters every sample [12]. The bandwidth of the waveform shown is instantaneously changed from a nominal value of 60 Hz. All panels show the truncated response with a solid line, and the corresponding nontruncated response with a dotted line. Panels (a) and (c) are for the formant frequency equal to 300 Hz, while the formant frequency is 700 Hz in panels (b) and (d). The maximum bandwidth during truncation is 200 Hz in panels (a) and (b) and 400 Hz in panels (c) and (d). The truncation effect for an actual speech waveform is shown in Figure 4.27. The effect is shown from both a time- and frequency-domain perspective; a sudden drop in the waveform envelope is accompanied by a rapid reduction in formant energy as revealed in the wideband spectrogram. Certain informal perceptual listening tests have indicated possible difficulty in aurally distinguishing between the presence and absence of truncation [20]. Nevertheless, this difficulty would not rule out its use as a feature by machine; there may exist parameters that are acoustically relevant for automatic recognition tasks, but are perceptually of little use.

Figure 4.27 Illustration of truncation in a speech waveform: (a) segment from /o/ in “pop”; (b) wideband spectrogram of (a) using a 4-ms analysis window and a 0.5-ms frame interval. An abrupt drop in formant energy corresponds to truncation of the waveform. In the frequency domain, the energy drop, along with poor frequency resolution of the wideband spectrogram, prevent a clear view of formant movement.

Image

Alternatively, to obtain an approximate time-domain representation of the effect of the source/vocal tract interaction, Ananthapadmanabha and Fant held the vocal tract filter fixed and mapped all source/vocal tract interaction to the source. Using Equation (4.48), an approximate expression for the glottal flow derivative can be derived as [1],[24],[25]

Image

where the first term on the right is the coarse glottal flow derivative and the second term is the “ripple” component of the fine structure. The ripple has a frequency close to the first formant of the vocal tract, and the function ƒ(t) represents an amplitude modulation controlled by the glottal area function. These relations reveal an approximate duality of ripple in the time domain and formant modulation in the frequency domain [1],[4],[24],[25].

4.6 Summary

In this chapter, we described sound propagation through the vocal tract in speech production. We began with the one-dimensional wave equation that gives a mathematical description of the approximate relation of sound pressure and velocity for longitudinal (plane wave) flow. We used this version of the wave equation to first describe sound propagation for a uniform tube model. We then gave more realism to the model by adding energy loss at the glottal and lip boundaries, as well as internal to the vocal tract, and by generalizing the single-tube case to the concatenation of many uniform tubes of different cross-sectional area. This model assumes that coupling does not occur between the vocal folds and vocal tract, and that the vocal tract is linear and time-invariant. We converted this analog model to a discrete-time representation using the impulse invariant method reviewed in Chapter 2. Analog models with vocal fold/vocal tract nonlinear coupling were also described, their conversion to discrete-time form relying on numerical simulation. Nonlinear coupling can result in behavior not predicted by a linear time-invariant model such as “truncation” of the speech waveform, accompanied by an increase in first-formant bandwidth and modulation of the first-formant frequency that occur during the glottal open phase. Finally, we noted that there are other even more complex factors in sound propagation for speech production that have been postponed to Chapter 11. These factors predict, for example, the movement of air molecules, not as a compression-rarefaction wave, but under certain conditions, as a jet flow through the glottis and vortices along the vocal tract. There currently exists no general theory that can describe the full complexity of this system, or that can predict the importance of all these effects for speech perception.

EXERCISES

4.1 Show that the two traveling velocity and pressure waves in Equation (4.8) satisfy the first- and second-order forms of the wave equation pairs, Equations (4.5) and (4.6), and Equations (4.1) and (4.4), respectively.

4.2 Consider the lossless uniform tube in Section 4.3. Using the wave equation, argue that for a boundary condition at the tube termination of zero pressure p(l, t) = 0, the resulting system is both linear and time-invariant.

4.3 Consider the standing wave solution for a uniform lossless tube given in Equation (4.16). The energy density per unit volume is given by the sum of kinetic energy Image and the potential energy Image along the tube, where u denotes volume velocity, p denotes pressure, ρ is the air density, and c is the speed of sound. Show from Equation (4.16) that the sum of the kinetic and potential energies is constant, while the two energy forms are interchanged both in time and space, i.e., there exists a time-space “orthogonality” with respect to kinetic and potential energy.

4.4 In this problem you are given a speech measurement and are asked to deduce properties of the speech production mechanism. Specifically, a speech waveform x[n] was passed through a nonlinear channel with output

y[n] = x3[n]

Figure 4.28 Measured spectrum from nonlinear channel output.

Image

and the spectral envelope of the output during a vowel was measured (Figure 4.28). Estimate the true resonances of the vocal tract response for the original waveform x[n]. Assume a sampling rate of 10000 samples/s. What is the relation of the bandwidth of the original resonances to that of the bandwidth of the measured resonances?

4.5 Consider the time-varying tube created from a running faucet into a bottle shown in Figure 4.8. The initial length is 17.5 cm and changes as

l(t) = 17.5 − 3.5t

over a 5-second interval. Assume the speed of sound c = 350 m/s. Suppose the water hitting the bottom of the time-varying tube acts as an impulse train with a 1.0-second spacing (more of a thought experiment than what happens in reality). Sketch the spectrogram of the output over a 5000 Hz band and the 5-second interval, assuming that you are given “infinitely good” time and frequency resolution. Use MATLAB if you find it useful. In light of the uncertainty principle, what problems do you encounter in representing the signal properties with a wideband and narrowband spectrogram?

4.6 For the vowel /A/ (as in the word “about” or “up”), the cross-sectional area of the vocal tract is roughly uniform along its length. For an adult male speaker, it can be approximated by a uniform tube of length 17 cm and cross-sectional area 5 cm2, as shown in Figure 4.29. When the vocal folds are vibrating to produce this vowel, the area of the space between the vocal folds is zero during part of the cycle (Figure 4.29a), and has a cross-sectional area of about 0.015 cm2 during the open phase of the vibration (Figure 4.29b). As a consequence of this variation in the glottal opening, there are modulations of the natural frequencies or formant frequencies of the tube. Find the maximum and minimum values of the lowest natural frequency, i.e., first formant F1, corresponding to the open-and closed-glottis condition. In the open configuration, model the system as a concatenation of two tubes, one with cross-sectional area A1 = 0.015 cm2 and the other with cross-sectional area A2 = 5 cm2. In addition, assume the pressure is zero at the left end of the small tube when the glottis is open.

Figure 4.29 Approximation of vocal-tract shapes for the vowel /A/: (a) with vocal folds closed, i.e., with closed termination; (b) with glottis slightly open during the open phase of vocal fold vibration.

Image

4.7 An exceedingly crude approximation to a nasal sound is given by a uniform tube, closed at one end and open at the other (Figure 4.30). The boundary condition of zero volume velocity is imposed at the closed end, i.e.,

u(0,t) = 0.

(a) Using the wave equation general solution Equation (4.8), derive the frequency response describing the airflow volume velocity at the “glottis” location (x = l/2), Ug(Ω), to the airflow volume velocity at the “nasal” output (x = l), UN(Ω); that is, determine

Image

where

uN(t) = u(l, t)

and

ug(t) = u(l/2, t).

(b) Generalize the frequency response Va(Ω) to a transfer function in the s-plane. Then use the impulse invariance method to convert the Laplace transform to a z-transform. Draw a signal flow graph for your z-transform. Is the system stable?

(c) Give the required signal input bandwidth and sampling rate (as a function of the variables l = tube length and c = speed of sound) in a digital simulation of the z-transform of part (b). Hint: Note the periodicity of the continuous-time frequency response.

(d) Given what we have learned about the frequency characteristics of nasal sounds (e.g., pole-zero placement), comment on the effectiveness of this simple model.

Figure 4.30 Simple model of a nasal sound.

Image

4.8 Consider a uniform tube of length l closed at the right end with a volume velocity source ug(0, t) at the left end.

(a) Using the wave equation pair Equation (4.5), (4.6) and the boundary condition u(l, t) = 0 at the closed end, find the volume velocity and pressure in terms of the volume velocity source along the x dimension of the tube. Plot the pressure and volume velocity envelopes along the tube.

(b) Suppose the uniform tube is very large and represents a concert hall with the volume velocity source being a musical instrument on stage and the closed right end the back wall of the balcony. Explain why the serious music listener sits a few rows from the back wall.

(c) Now suppose that the closed tube is very short. Using the acoustic impedance looking into the closed tube, state a condition under which the closed tube acts like an acoustic capacitor.

4.9 Consider a two-tube lossless vocal tract model, depicted by the flow diagram in Figure 4.18a, including radiation and glottal boundary conditions.

(a) Derive the transfer function for the two-tube model given in Equation (4.37), corresponding to Figure 4.18.

(b) Sketch an extension of the flow diagram in Figure 4.18a for the case of three concatenated tubes, again including radiation and glottal boundary conditions.

4.10 Figure 4.31 represents the magnitude of the discrete-time Fourier transform of a steady-state vowel segment which has been extracted using a rectangular window. The envelope of the spectral magnitude, |H(ω)|, is sketched with a dashed line. Note that four formants are assumed, and that only the main lobe of the window Fourier transform is depicted.

(a) Suppose that the sampling rate is 12000 samples/s, set to meet the Nyquist rate. What is the first formant (F1) in Hertz? How long is the rectangular window in milliseconds? How long is the window in time samples?

(b) Suppose F1 = 600 Hz, and you are not given the sampling rate. What is the pitch in Hertz? What is the pitch period in milliseconds? What is the pitch period in time samples?

(c) Suppose you are told that the speech spectrum in Figure 4.31 corresponds to a uniform concatenated tube model. How many uniform tubes make up the model? Is it possible to estimate the cross-sectional areas of each tube with the given information? If not, explain why. If yes, derive estimates of the tube cross-sections, and given a sampling rate of 12000 samples/s, compute the length of the vocal tract for your uniform concatenated tube model. Assume the speech of sound = 350 m/s.

Figure 4.31 |H(ω)| of steady-state vowel segment.

Image

(d) Is it possible that a lossless concatenated tube model of the vocal tract corresponds to the spectrum of Figure 4.31? In explaining your answer, consider glottal and radiation effects.

4.11 Consider a two-tube lossless vocal tract model. Draw a signal flow diagram using reflection coefficients and delay elements for the case in which tube cross-sections and lengths are: A1 = 1 cm2, l1 = 9 cm, A2 = 7 cm2, and l2 = 8 cm. Include glottal and radiation effects in the flow diagram.

4.12 Fricatives are created by constrictions in the vocal tract resulting in both a front and back cavity. Air rushing through the constriction causes turbulence at the constriction and results in a pressure source. Consider a simple case of a front-excited vocal tract; for example, the fricative /f/ where roughly no front cavity occurs since the constriction occurs just behind the lips. We crudely approximate this configuration by a lossless uniform tube, closed at the left, and with a known pressure source at the right (Figure 4.32). The boundary condition at the left end (glottis) gives a volume velocity of zero, i.e.,

ug(t) = u(0, t) = 0

and at the right end (just behind the lips) gives a known pressure

pL(t) = p(l, t) = pL(Ω)ejΩt.

(a) Using the wave equation pair Equation (4.7) and its general solution Equation (4.8), derive the transfer function describing the volume velocity at the constriction to the pressure at the constriction, i.e.,

Image

You should find that H(Ω) has both poles and zeros. Do not, however, explicitly solve for the pole-zero locations. It is interesting to note that analysis of actual fricatives shows pole-zero cancellation causes a “highpass” effect.

(b) Using the radiation load in Equation (4.24), find the pressure in front of the lips at the mouth.

Figure 4.32 Model of fricative.

Image

4.13 If we model the vocal tract as a uniform lossless acoustic tube, its resonant frequencies occur at odd multiples of Image where l is the vocal tract length and c is the speed of sound. However, in spectral analysis of real speech waveforms, we observe a quite different situation. List all the factors that affect the location and shape of resonant frequencies in real speech signals and discuss briefly their role in shaping the speech spectrum. Limit your discussion to speech vowels. Assume the classical theory of speech production.

Figure 4.33 |H(ω)| of steady-state vowel segment.

Image

4.14 Figure 4.33 represents the magnitude of the discrete-time Fourier transform of a steady-state vowel segment which has been extracted using a rectangular window. The envelope of the spectral magnitude, |H(ω)|, is sketched with a dashed line. Note that three formants (resonances) are shown, and that only the main lobe of the window Fourier transform is depicted.

(a) Suppose the sampling rate is 6000 samples/s and was set to meet the Nyquist rate. What is the pitch period in milliseconds? How long is the rectangular window in milliseconds?

(b) If F1 = 750 Hz and the vocal tract is considered to be a single acoustic tube, what is the length of the vocal tract? Assume zero pressure drop at the lips, an ideal volume velocity source, and speed of sound c = 350 m/s.

(c) If the length of the vocal tract were shortened, how would this affect the spacing of the window main lobes that make up the discrete-time magnitude spectrum of the signal? Explain your answer.

(d) Suppose that H(ω) represents the frequency response between the lip pressure and glottal volume velocity. How would the spectral magnitude change if there were no radiation load at the lips?

4.15 Give a qualitative discussion on the limitations of the general discrete-time model of speech production. Proceed from the sound wave equation to the concatenated lossless tube model, and end with the continuous-to-discrete-time conversion process.

4.16 Two agendas for speech signal processing research, sometimes conflicting, are the development of computational speech models and the understanding of the physics of speech production. In this problem, you are asked to consider these two agendas in the context of “helium speech.”

(a) Discuss which aspects of the text thus far fall into each agenda. Consider the plane-wave approximation, the concatenated tube models, and the source/filter convolutional model, as well as any other aspects of the text you deem important for distinguishing the two agendas.

(b) Consider a uniform-tube vocal tract and its volume velocity transfer function in Equation (4.18). For pure helium, the velocity of sound is approximately three times that of air. According to Equation (4.18), how should the formant frequencies change when helium replaces air? Explain your reasoning.

(c) Empirically, it has been found that when helium replaces air, formant frequencies move up less than the factor predicted by Equation (4.18) and in an irregular fashion, i.e., not all formant frequencies change by the same factor. What does this imply about your result in part (b)? What may this inconsistency imply about our current speech models? Keep in mind the warning by R.W. Hamming (originator of the Hamming window): “Beware of finding what you are looking for.”

4.17 Figure 4.34 illustrates an actual voiced speech segment. Why is this speech segment not representative of the output of the linear source/filter model of voiced speech production in which the source consists of a typical periodic glottal airflow volume velocity? That is, what is the anomaly? Keep in mind the reflection by Thomas Kuhn: “Normal science does not aim at novelties of fact or theory, and, when successful, finds none … Discovery commences with the awareness of anomaly, i.e., with the recognition that nature has somehow violated the paradigm-induced expectations that govern normal science” [15].

Figure 4.34 Voiced speech waveform with anomaly.

Image

4.18 Consider a single lossless uniform tube (Figure 4.35) of length l = 17.5 cm with an ideal volume velocity source ug(t) at the left end and zero pressure at the right end, i.e., p(l, t) = 0. Assume the speed of sound c = 350 m/s.

(a) Give the continuous-time transfer function of the volume velocity at the right end to the source, Image. Sketch |Va(Ω)|. How long is the impulse response of the tube?

(b) Consider now a discrete-time representation of the tube, derived from a concatenated-tube model consisting of a single tube of length l. Using the impulse-invariance method, determine the z-transform of the impulse response of the resulting discrete-time system, i.e., the discrete-time transfer function. Note that the sampling interval T = 2τ where τ = l/c, which is the time delay along the tube. How many resonances are represented over the frequency interval [0, π]?

(c) Suppose now we divide the tube into two concatenated tubes of equal length l/2 with the same cross-sectional area. The time delay across each tube is τ′ = l/2c, and the corresponding time sampling interval is T = 2τ′ = l/c. Sketch the discrete-time frequency response from the source to the lips. Deduce the discrete-time transfer function. How many resonances are now represented over the frequency interval [0, π,]?

Figure 4.35 Single uniform lossless tube.

Image

(d) Find a continuous-time impulse response with bandwidth π/T, that, when sampled with sampling interval T of part (c), gives the discrete-time frequency response [of part (c)] over the discrete-frequency interval [−π, π]. How does this continuous-time impulse response differ from the true continuous-time impulse response of the tube? How does it differ from the true impulse response at the time samples nT = n2τ′? How many equal-length tubes (same cross-section) of a concatenated tube model do you need as a basis for the resulting discrete-time model to represent the true impulse response for all time?

(e) Suppose now that energy loss takes place in a spatially-varying tube, and its frequency response, given in Figure 4.36, is bandlimited to 24000 Hz. Suppose all energy loss occurs at the output radiation load, and suppose that the spatially-varying tract is approximated by a concatenated-tube model. How many equal-length tubes are required in a discrete-time model to represent the frequency response over only the first 12000 Hz?

Figure 4.36 Lossy tube frequency response.

Image

4.19 In this problem you will fill in some of the missing steps in Section 4.5 on vocal fold/vocal cord interaction and the “truncation” effect. We saw in Figure 4.22 an electrical analog of a simple model for airflow through the glottis. Psg at the left is the subglottal pressure in the lungs that is the power source for speech, p(t) is the sound pressure corresponding to a single first formant, and Zg(t) is the time-varying impedance of the glottis, defined as the ratio of transglottal pressure across the glottis, ptg(t), to the glottal volume velocity through the glottis, ug(t).

(a) Writing the current nodal equation of the circuit in Figure 4.22, and using Equation (4.43), show that

(4.50)

Image

(b) Differentiate Equation (4.45) with respect to time and show that

(4.51)

Image

Then argue from Equation (4.51) that Figure 4.24 gives the Norton equivalent circuit to the original circuit of Figure 4.22, where the equivalent time-varying resistance and inductance are given by

(4.52)

Image

where

Image

and with the new time-varying volume velocity source given by

Image

(c) Assuming that the time-varying coefficients of Equation (4.51) are fixed at each time instant, map Equation (4.51) to the “pseudo-transfer function” given by Equation (4.48) with denominator polynomial coefficients given by the time-varying formant frequency, Ω1(t), and time-varying bandwidth, B1(t), defined in Equation (4.49).

(d) We saw in part (c) of this problem and in Section 4.5.2 that the time-varying center frequency and bandwidth of the first formant associated with Equation (4.51) are given by

(4.53)

Image

where Ωo and Bo are the nominal center frequency and (3 dB) bandwidth defined in Equation (4.42). For the area function A(t) in Figure 4.25a, argue for the general shapes of the modulated bandwidths and formant multipliers in Figures 4.25b,c.

(e) Suppose an “ideal” pressure response is of the form

Image

and suppose that the actual pressure

Image

where Ω1(t) and B1(t) are given in Equation (4.53) and for an underlying glottal area function shown in Figure 4.25a. Sketch p1(t) for one glottal cycle, showing only the effect of the bandwidth multiplier factor Image in Equation (4.53). Then sketch p1(t) showing only the effect of the frequency multiplier factor Image. Finally, sketch the effect of both multiplier factors, and compare your result with po(t).

(f) Consider an analytic signal representation associated with the vocal tract first formant, x(t) = a(t)ejImage(t), with amplitude a(t) and phase Image(t). Motivated by the bandwidth expression of Equation (2.14) from Chapter 2, i.e.,

(4.54)

Image

(with Image the mean frequency) Cohen [5] proposed a definition of instantaneous bandwidth that is dependent only on the amplitude of x(t) and not its frequency, i.e.,

Image

Show in the special case of a decaying exponential, i.e.,

x(t) = eαt,

that Cohen’s definition of instantaneous bandwidth becomes

(4.55)

Image

A more standard definition of bandwidth is the frequency distance between spectrum samples that are 3 dB lower than at the spectral peak. Show that the expression in Equation (4.55) is one-half of this standard definition. Likewise, again motivated in part by the bandwidth definition in Equation (4.54), as described in Chapter 2, Cohen defined Image as the instantaneous frequency of x(t). Describe qualitatively how the bandwidth and frequency multiplier factors in part (d) of this problem affect the bandwidth defined in Equation (4.54). In your description, consider how a rapid truncation corresponds to a larger instantaneous bandwidth BW(t) and assume that Image. roughly follows the time-varying frequency Ω1(t).

4.20 (MATLAB) The bilinear transformation used in mapping a Laplace transform to a z-transform is given by the s-plane to z-plane mapping of Image and has the property that the imaginary axis in the s-plane maps to the unit circle in the z-plane [22]. The constant T is the sampling interval. Use the bilinear transformation to obtain the z-transform representation of the continuous-time radiation load of Equation (4.24). Specifically, with T = 10−4, c = 350 m/s, Rr = 1.4, and lr = (31.5)10−6, argue that Zr(z) ≈ 1 − z−1 where any fixed scaling is ignored. Use MATLAB to plot the magnitude of Zr(z) on the unit circle and show that Zr(z) introduces about a 6 dB/octave highpass effect. Argue from both a time- and frequency-domain viewpoint that the impulse response associated with Zr(z) acts as a differentiator.

4.21 (MATLAB)20 In his work on nonlinear modeling of speech production, Teager [30] used the following “energy operator” on speech-related signals x[n]:

20 This problem can be considered a preview to Section 11.5.

(4.56)

Image

Kaiser [11] analyzed Image and showed that it yields the energy of simple oscillators that generated the signal x[n]; this energy measure is a function of the frequency as well as the amplitude composition of a signal. Kaiser also showed that Image can track the instantaneous frequency in single sinusoids and chirp signals, possibly exponentially damped. Teager applied Image to signals resulting from bandpass filtering of speech vowels in the vicinity of their formants. The output from the energy operator frequently consisted of several pulses per pitch period, with decaying peak amplitude. Teager suggested that these energy pulses indicate modulation of formants caused by nonlinear phenomena such as time-varying vortices in the vocal tract.

We can interpret these energy pulses by using a frequency modulation (FM) model for the time-varying formants. It is important to emphasize that we do not model speech production with FM, but rather the FM is a mathematical vehicle to model the acoustical consequences of nonlinear mechanisms of speech production. Specifically, consider the following exponentially-damped FM signal with sine modulation:

(4.57)

Image

where Ω0 is the center (or carrier) frequency, Image is the modulation index, Δ is the frequency deviation, and Ωm is the frequency of the modulating sinusoid with phase offset θ. The instantaneous frequency is Image.

(a) Assume Ωm is sufficiently small so that cos(Ωm) ≈ 1 and sin (Ωm) ≈ Ωm, and show that applying the energy operator Image to x[n] yields:

(4.58)

Image

Thus, Image can track the instantaneous frequency Ω[n] of FM-sinewave signals. Using MATLAB, plot the (square root of the) energy operator output Image where x[n] is the FM-sinewave signal in Equation (4.57) with A = 10, a = 0.002, Ω0 = 0.2π, Ωm = 0.02π, θ = 0, and (a) Δ = Ω0, (b) Δ = 0.2Ω0, and (c) Δ = 0.01Ω0.

(b) The outputs of Equation (4.58) due to input synthetic FM-sinewave signals correspond roughly to measurements made on actual speech using Image. The instantaneous frequency Ω[n] plays the role of a time-varying formant. Thus, if we view Ω0 as the center value of a formant, then the operator Image followed by an envelope detector, followed by dividing with the envelope and square of the inverse sine, will yield the time-varying formant (within a single pitch period) as the instantaneous frequency Ω[n] of the FM signal. Figure 4.37 shows (a) a segment of a speech vowel /a/ sampled at Fs = 10 kHz and (b) the output from Image when applied to a bandpass filtered version of (a) extracted around a formant at F0 = 1000 Hz using a filter with impulse response exp(−b2n2) cos(Ω0n), where Ω0 = 2πF0/Fs and b = 800/Fs. In Figure 4.37b there are two pulses per pitch period, and the exponentially-damped sine squared model Equation (4.58) can be used to approximately represent the shape of these measured energy pulses. There are cases where we observe more than two pulses per glottal cycle, but also cases with only one major pulse per pitch period. Explain why these observations can be partially explained from the FM-sine model by comparing against your results in part (a) for different modulation indices.

Figure 4.37 Result of energy operator on speech vowel /a/: (a) original speech; (b) output of energy operator.

Image

Bibliography

[1] T.V. Ananthapadmanabha and G. Fant, “Calculation of True Glottal Flow and Its Components,” Speech Communications, vol. 1, pp. 167–184, 1982.

[2] L.L. Beranek, Acoustics, McGraw-Hill, New York, NY, 1954.

[3] M. Campbell and C. Greated, The Musician’s Guide to Acoustics, Schirmer Books, New York, NY, 1987.

[4] D.G. Childers and C.F. Wong, “Measuring and Modeling Vocal Source-Tract Interaction,” IEEE Trans. Biomedical Engineering, vol. 41, no. 7, pp. 663–671, July 1994.

[5] L. Cohen, Time-Frequency Analysis, Prentice Hall, Englewood Cliffs, NJ, 1995.

[6] P.B. Denes and E.N. Pinson, The Speech Chain: The Physics and Biology of Spoken Language, Anchor Press-Doubleday, Garden City, NY, 1973.

[7] G. Fant, Acoustic Theory of Speech Production, Mouton, The Hague, 1970.

[8] J.L. Flanagan, Speech Analysis, Synthesis, and Perception, Springer-Verlag, New York, NY, 1972.

[9] J.L. Flanagan and K. Ishizaka, “Computer Model to Characterize the Air Volume Displaced by the Vibrating Vocal Cords,” J. Acoustical Society of America, vol. 63, pp. 1559–1565, Nov. 1978.

[10] C.R. Jankowski, Fine Structure Features for Speaker Identification, Ph.D. Thesis, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, June 1996.

[11] J.F. Kaiser, “On a Simple Algorithm to Calculate the ‘Energy’ of a Signal,” Proc. Int. Conf. Acoustics, Speech, and Signal Processing, vol. 1, pp. 381–384, Albuquerque, NM, April 1990.

[12] D. Klatt, “Software for a Cascade/Parallel Formant Synthesizer,” J. Acoustical Society of America, vol. 67, no. 3, pp. 971–995, 1980.

[13] D. H. Klatt and L.C. Klatt, “Analysis, Synthesis, and Perception of Voice Quality Variations among Female and Male Talkers,” J. Acoustical Society of America, vol. 87, no. 2, pp. 820–857, 1990.

[14] P.K. Kundu, Fluids Mechanics, Academic Press, New York, NY, 1990.

[15] T. Kuhn, The Structure of Scientific Revolution, University of Chicago Press, 1970.

[16] I.T. Lim and B.G. Lee, “Lossless Pole-Zero Modeling of Speech Signals,” IEEE Trans. Speech and Audio Processing, vol. 1, no. 3, pp. 269–276, July 1993.

[17] I.T. Lim and B.G. Lee, “Lossy Pole-Zero Modeling for Speech Signals,” IEEE Trans. Speech and Audio Processing, vol. 4, no. 2, pp. 81–88, March 1996.

[18] M. Liu and A. Lacroix, “Pole-Zero Modeling of Vocal Tract for Fricative Sounds,” Proc. Int. Conf. Acoustics, Speech, and Signal Processing, vol. 3, pp. 1659–1662, Munich, Germany, 1997.

[19] P.M. Morse and K.U. Ingard, Theoretical Acoustics, Princeton University Press, Princeton, NJ, 1986.

[20] L. Nord, T.V. Ananthapadmanabha, and G. Fant, “Signal Analysis and Perceptual Tests of Vowel Responses with an Interactive Source-Filter Model,” J. Phonetics, vol. 14, pp. 401–404, 1986.

[21] C.L. Nikias and M.R. Raghuveer, “Bispectrum Estimation: A Digital Signal Processing Framework,” Proc. IEEE, vol. 75, no. 1, pp. 869–891, July 1987.

[22] A.V. Oppenheim and R.W. Schafer, Discrete-Time Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1989.

[23] A. Pickles, An Introduction to Auditory Physiology, Second Edition, Academic Press, New York, NY, 1988.

[24] M.D. Plumpe, Modeling of the Glottal Flow Derivative Waveform with Application to Speaker Identification, Masters Thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, Feb. 1997.

[25] M.D. Plumpe, T.F. Quatieri, and D.A. Reynolds, “Modeling of the Glottal Flow Derivative Waveform with Application to Speaker Identification,” IEEE Trans. Speech and Audio Processing, vol.7, no. 5, pp. 569–586, Sept. 1999.

[26] M.R. Portnoff, A Quasi-One-Dimensional Digital Simulation for the Time-Varying Vocal Tract, Masters Thesis, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, May 1973.

[27] A.D. Pierce, Acoustics: An Introduction to Its Physical Principles and Applications, McGraw-Hill, New York, NY, 1981.

[28] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, NJ, 1978.

[29] M.M. Sondhi, “Resonances of a Bent Vocal Tract,” J. Acoustical Society of America, vol. 79, pp.1113–1116, April 1986.

[30] H.M. Teager and S.M. Teager, “Evidence for Nonlinear Production Mechanisms in the Vocal Tract,” chapter in Speech Production and Modeling, W.J. Hardcastle and A. Marchal, eds., NATO Adv. Study Inst. Series D, vol. 55, Bonas, France; Kluwer Acad. Publ., Boston, MA, pp. 241–261, 1990.

[31] J. W. van den Berg, “On the Air Response and the Bernoulli Effect of the Human Larynx,” J. Acoustical Society of America, vol. 29, pp. 626–631, 1957.

[32] K. Wang and S.A. Shamma, “Self-Normalization and Noise Robustness in Early Auditory Representations,” IEEE Trans. Speech and Audio Processing, vol. 2, no. 3, pp. 421–435, July 1995.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset