Chapter 1
Introduction

1.1 Discrete-Time Speech Signal Processing

Speech has evolved as a primary form of communication between humans. Nevertheless, there often arise conditions under which we measure and then transform the speech signal to another form in order to enhance our ability to communicate. One early example is the transduction by a telephone handset of the continuously-varying speech pressure signal at the output of the lips to a continuously-varying (analog) electric voltage signal. The resulting signal can be transmitted and processed electrically with analog circuitry and then transduced back by the receiving handset to a speech pressure signal. With the advent of the wonders of digital technology, the analog-to-digital (A/D) converter has entered as a further “transduction” that samples the electrical speech signal, e.g., at 8000 samples per second for telephone speech, so that the speech signal can be digitally transmitted and processed. Digital processors, with their fast speed, low cost and power, and tremendous versatility, have replaced a large part of analog-based technology.

The topic of this text, discrete-time speech signal processing, can be loosely defined as the manipulation of sampled speech signals by a digital processor to obtain a new signal with some desired properties. Consider, for example, changing a speaker’s rate of articulation with the use of a digital computer. In the modification of articulation rate, sometimes referred to as time-scale modification of speech, the objective is a new speech waveform that corresponds to a person talking faster or slower than the original rate, but that maintains the character of the speaker’s voice, i.e., there should be little change in the pitch (or rate of vocal cord vibration) and spectrum of the original utterance. This operation may be useful, for example, in fast scanning of a long recording in a message playback system or slowing down difficult-to-understand speech. In this application, we might begin with an analog recording of a speech utterance (Figure 1.1). This continuous-time waveform is passed through an A/D waveform converter to obtain a sequence of numbers, referred to as a discrete-time signal, which is entered into the digital computer. Discrete-time signal processing is then applied to obtain the required speech modification that is performed based on a model of speech production and a model of how articulation rate change occurs. These speech-generation models may themselves be designed as analog models that are transformed into discrete time. The modified discrete-time signal is converted back to analog form with a digital-to-analog (D/A) converter, and then finally perhaps stored as an analog waveform or played directly through an amplifier and speakers. Although the signal processing required for a high-quality modification could conceivably be performed by analog circuitry built into a redesigned tape recorder,1 current digital processors allow far greater design flexibility. Time-scale modification is one of many applications of discrete-time speech signal processing that we explore throughout the text.

1 Observe that time-scale modification cannot be performed simply by changing the speed of a tape recorder because this changes the pitch and spectrum of the speech.

Figure 1.1 Time-scale modification as an example of discrete-time speech signal processing.

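To make the idea of discrete-time processing concrete, the sketch below performs a crude time-scale modification by windowed overlap-add: frames are read from the input signal at one hop size and written to the output at another. This is only an illustration under simplifying assumptions, not the method developed in this text; the function name, frame length, and hop sizes are arbitrary choices, and the methods developed later in the text must also preserve the pitch and spectral structure across frames to achieve high quality.

```python
# A minimal sketch of time-scale modification by windowed overlap-add (OLA).
# Illustrative only: no pitch or phase alignment is performed.
import numpy as np

def ola_time_scale(x, rate=1.5, frame_len=512, synthesis_hop=128):
    """Compress (rate > 1) or stretch (rate < 1) x by reading frames at
    an analysis hop of rate * synthesis_hop and writing them at synthesis_hop."""
    window = np.hanning(frame_len)
    analysis_hop = int(round(rate * synthesis_hop))
    n_frames = max(1, (len(x) - frame_len) // analysis_hop + 1)
    y = np.zeros(n_frames * synthesis_hop + frame_len)
    norm = np.zeros_like(y)                    # window-overlap normalization
    for m in range(n_frames):
        a = m * analysis_hop                   # where the frame is read
        s = m * synthesis_hop                  # where the frame is written
        frame = x[a:a + frame_len]
        if len(frame) < frame_len:
            frame = np.pad(frame, (0, frame_len - len(frame)))
        y[s:s + frame_len] += window * frame
        norm[s:s + frame_len] += window
    return y / np.maximum(norm, 1e-8)

# Example: compress a one-second, 8000-Hz test tone to roughly two-thirds its length.
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t)
y = ola_time_scale(x, rate=1.5)
```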

1.2 The Speech Communication Pathway

In the processing of speech signals, it is important to understand the pathway of communication from speaker to listener [2]. At the linguistic level of communication, an idea is first formed in the mind of the speaker. The idea is then transformed to words, phrases, and sentences according to the grammatical rules of the language. At the physiological level of communication, the brain creates electric signals that move along the motor nerves; these electric signals activate muscles in the vocal tract and vocal cords. This vocal tract and vocal cord movement results in pressure changes within the vocal tract, and, in particular, at the lips, initiating a sound wave that propagates in space. The sound wave propagates through space as a chain reaction among the air particles, resulting in a pressure change at the ear canal and thus vibrating the ear drum. The pressure change at the lip, the sound propagation, and the resulting pressure change at the ear drum of the listener are considered the acoustic level in the speech communication pathway. The vibration at the ear drum induces electric signals that move along the sensory nerves to the brain; we are now back to the physiological level. Finally, at the linguistic level of the listener, the brain performs speech recognition and understanding.

The linguistic and physiological activity of the speaker and listener can be thought of as the “transmitter” and “receiver,” respectively, in the speech communication pathway. The transmitter and receiver of the system, however, have other functions besides basic communications. In the transmitter there is feedback through the ear which allows monitoring and correction of one’s own speech (the importance of this feedback has been seen in studies of the speech of the deaf). Examples of the use of this feedback are in controlling articulation rate and in the adaptation of speech production to mimic voices. The receiver also has additional functions. It performs voice recognition and it is robust in noise and other interferences; in a room of multiple speakers, for example, the listener can focus on a single low-volume speaker in spite of louder interfering speakers. Although we have made great strides in reproducing parts of this communication system by synthetic means, we are far from emulating the human communication system.

1.3 Analysis/Synthesis Based on Speech Production and Perception

In this text, we do not cover the entire speech communication pathway. We break into the pathway and make an analog-to-digital measurement of the acoustic waveform. From these measurements and our understanding of speech production, we build engineering models of how the vocal tract and vocal cords produce sound waves, beginning with analog representations which are then transformed to discrete time. We also consider the receiver, i.e., the signal processing of the ear and higher auditory levels, although to a lesser extent than the transmitter, because it is imperative to account for the effect of speech signal processing on perception.

To preview the building of a speech model, consider Figure 1.2 which shows a model of vowel production. In vowel production, air is forced from the lungs by contraction of the muscles around the lung cavity. Air then flows past the vocal cords, which are two masses of flesh, causing periodic vibration of the cords whose rate gives the pitch of the sound; the resulting periodic puffs of air act as an excitation input, or source, to the vocal tract. The vocal tract is the cavity between the vocal cords and the lips, and acts as a resonator that spectrally shapes the periodic input, much like the cavity of a musical wind instrument. From this basic understanding of the speech production mechanism, we can build a simple engineering model, referred to as the source/filter model. Specifically, if we assume that the vocal tract is a linear time-invariant system, or filter, with a periodic impulse-like input, then the pressure output at the lips is the convolution of the impulse-like train with the vocal tract impulse response, and therefore is itself periodic. This is a simple model of a steady-state vowel. A particular vowel, as, for example, “a” in the word “father,” is one of many basic sounds of a language that are called phonemes and for which we build different production models. A typical speech utterance consists of a string of vowel and consonant phonemes whose temporal and spectral characteristics change with time, corresponding to a changing excitation source and vocal tract system. In addition, the time-varying source and system can also nonlinearly interact in a complex way. Therefore, although our simple model for a steady vowel seems plausible, the sounds of speech are not always well represented by linear time-invariant systems.
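As a concrete illustration of the source/filter idea, the following sketch synthesizes a steady-state vowel by passing a periodic impulse-like train through a cascade of two-pole resonators standing in for the vocal tract. The pitch, formant frequencies, and bandwidths below are illustrative values chosen for this example rather than measurements, and the glottal pulse shape and lip radiation are ignored.

```python
# A minimal sketch of the source/filter model of a steady-state vowel:
# a periodic impulse train (source) drives a cascade of resonators (filter).
import numpy as np
from scipy.signal import lfilter

fs = 8000                        # sampling rate (Hz)
pitch_hz = 100                   # rate of vocal cord vibration
formants = [(700, 130), (1220, 70), (2600, 160)]   # (center Hz, bandwidth Hz), illustrative

# Periodic impulse-like excitation, one impulse per pitch period.
n = np.arange(int(0.5 * fs))     # half a second of samples
period = int(fs / pitch_hz)
source = np.zeros_like(n, dtype=float)
source[::period] = 1.0

# Vocal tract modeled as a cascade of two-pole resonators.
speech = source
for f, bw in formants:
    r = np.exp(-np.pi * bw / fs)                 # pole radius from bandwidth
    theta = 2 * np.pi * f / fs                   # pole angle from formant frequency
    a = [1.0, -2 * r * np.cos(theta), r * r]     # denominator coefficients
    speech = lfilter([1.0], a, speech)           # convolve with the resonator response
```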

Figure 1.2 Speech production mechanism and model of a steady-state vowel. The acoustic waveform is modeled as the output of a linear time-invariant system with a periodic impulse-like input. In the frequency domain, the vocal tract system function spectrally shapes the harmonic input.


Figure 1.3 Discrete-time speech signal processing overview. Applications within the text include speech modification, coding, enhancement, and speaker recognition.


Based on discrete-time models of speech production, we embark on the design of speech analysis/synthesis systems (Figure 1.3). In analysis, we take apart the speech waveform to extract underlying parameters of the time-varying model. The analysis is performed with temporal and spectral resolution that is adequate for the measurement of the speech model parameters. In synthesis, based on these parameter estimates and models, we then put the waveform back together. An objective in this development is to achieve an identity system for which the output equals the input when no speech manipulation is performed. We also investigate waveform and spectral representations that do not involve models, but rather various useful mathematical representations in time or in frequency from which other analysis/synthesis methods can be derived. These analysis/synthesis methods are the backbone for applications that transform the speech waveform into some desirable form.

1.4 Applications

This text deals with applications of discrete-time speech analysis/synthesis primarily in the following areas: (1) speech modification, (2) speech coding, (3) speech enhancement, and (4) speaker recognition (Figure 1.3). Other important application areas for discrete-time speech signal processing, including speech recognition, language recognition, and speech synthesis from text, are not covered; to do so would require a deeper study of statistical discrete-time signal processing and linguistics than can be satisfactorily covered within the boundaries of this text. Tutorials in these areas can be found in [1],[3],[4],[5],[6],[7].

Modification — The goal in speech modification is to alter the speech signal to have some desired property. Modifications of interest include time-scale, pitch, and spectral changes. Applications of time-scale modification include fitting radio and TV commercials into an allocated time slot and synchronizing audio and video presentations. In addition, speeding up speech has use in message playback, voice mail, and reading machines and books for the blind, while slowing down speech has application to learning a foreign language. Voice transformations using pitch and spectral modification have application in voice disguise, entertainment, and concatenative speech synthesis. The spectral change of frequency compression and expansion may be useful in transforming speech as an aid to the partially deaf. Many of the techniques we develop also have applicability to music and special effects. In music modification, a goal is to create new and exotic sounds and enhance electronic musical instruments. Cross synthesis, used for special effects, combines different source and system components of sounds, such as blending the human excitation with the resonances of a musical instrument. We will see that separation of the source and system components of a sound is also important in a variety of other speech application areas.

Coding — In the application of speech coding, the goal is to reduce the information rate, measured in bits per second, while maintaining the quality of the original speech waveform.2 We study three broad classes of speech coders. Waveform coders, which represent the speech waveform directly and do not rely on a speech production model, operate in the high range of 16–64 kbps (kbps denoting thousands of bits per second). Vocoders are largely speech model-based and rely on a small set of model parameters; they operate in the low bit-rate range of 1.2–4.8 kbps, and tend to be of lower quality than waveform coders. Hybrid coders are partly waveform-based and partly speech model-based and operate in the 4.8–16 kbps range with quality between that of waveform coders and vocoders. Applications of speech coders include digital telephony over constrained-bandwidth channels, such as cellular, satellite, and Internet communications. Other applications are video phones where bits are traded off between speech and image data, secure speech links for government and military communications, and voice storage as with computer voice mail where storage capacity is limited. This last application can also benefit from time-scale compression where both information reduction and voice speed-up are desirable.

2 The term quality refers to speech attributes such as naturalness, intelligibility, and speaker recognizability.

Enhancement — In the third application—speech enhancement—the goal is to improve the quality of degraded speech. One approach is to preprocess the speech waveform before it is degraded. Another is postprocessing enhancement after signal degradation. Applications of preprocessing include increasing the broadcast range of transmitters constrained by a peak power transmission limit, as, for example, in AM radio and TV transmission. Applications of postprocessing include reduction of additive noise in digital telephony and vehicle and aircraft communications, reduction of interfering backgrounds and speakers for the hearing-impaired, removal of unwanted convolutional channel distortion and reverberation, and restoration of old phonograph recordings degraded, for example, by acoustic horns and impulse-like scratches from age and wear.

Speaker Recognition — This area of speech signal processing exploits the variability of speech model parameters across speakers. Applications include verifying a person’s identity for entrance to a secure facility or personal account, and voice identification in forensic investigation. An understanding of the speech model features that cue a person’s identity is also important in speech modification where we can transform model parameters for the study of specific voice characteristics; thus, speech modification and speaker recognition can be developed synergistically.

1.5 Outline of Book

The goal of this book is to provide an understanding of discrete-time speech signal processing techniques that are motivated by speech model building, as well as by the above applications. We will see how signal processing algorithms are driven by both time- and frequency-domain representations of speech production, as well as by aspects of speech perception. In addition, we investigate the capability of these algorithms to analyze the speech signal with appropriate time-frequency resolution, as well as the capability to synthesize a desired waveform.

Chapter 2 reviews the foundation of discrete-time signal processing which serves as the framework for the remainder of the text. We investigate some essential discrete-time tools and touch upon limitations of these techniques, as manifested through the uncertainty principle and the theory of time-varying linear systems that arise in a speech signal processing context. Chapter 3 describes qualitatively the main functions of the speech production mechanism and the associated anatomy. Acoustic and articulatory descriptors of speech sounds are given, some simple linear and time-invariant models are proposed, and, based on these features and models, the study of phonetics is introduced. Implications of sound production mechanisms for signal processing algorithms are discussed. In Chapter 4, we develop a more quantitative description of the acoustics of speech production, showing how the heuristics of Chapter 3 are approximately supported with linear and time-invariant mathematical models, as well as predicting other effects not seen by a qualitative perspective, such as a nonlinear acoustic coupling between the source and system functions.

Based on the acoustic models of Chapters 3 and 4, in Chapter 5 we investigate pole-zero transfer function representations of the three broad speech sound classes of periodic (e.g., vowels), noise-like (e.g., fricative consonants), and impulsive (e.g., plosive consonants), loosely categorized as “deterministic,” i.e., with a periodic or impulsive source, and “stochastic,” i.e., with a noise source. There also exist many speech sounds having a combination of these sound elements. In this chapter, methodologies are developed for estimating all-pole system parameters for each sound class, an approach referred to as linear prediction analysis. Extension of these methods is made to pole-zero system models. For both all-pole and pole-zero analysis, corresponding synthesis methods are developed. Linear prediction analysis first extracts the system component and then, by inverse filtering, extracts the source component. We can think of the source extraction as a method of deconvolution. Focus is given to estimating the source function during periodic sounds, particularly a “pitch synchronous” technique, based on the closed phase of the glottis, i.e., the slit between the vocal cords. This method of glottal flow waveform estimation reveals a nonlinear coupling between the source and the system. Chapter 6 describes an alternate means of deconvolution of the source and system components, referred to as homomorphic filtering. In this approach, convolutionally combined signals are mapped to additively combined signals on which linear filtering is applied for signal separation. Unlike linear prediction, which is a “parametric” (all-pole) approach to deconvolution, homomorphic filtering is “nonparametric” in that a specific model need not be imposed on the system transfer function in analysis. Corresponding synthesis methods are also developed, and special attention is given to the importance of phase in speech synthesis.
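A minimal sketch of the autocorrelation method of linear prediction, followed by inverse filtering to recover a source (residual) estimate, is given below. It assumes a single windowed frame and omits the refinements developed in Chapter 5, such as pre-emphasis, pitch-synchronous and closed-phase analysis, and pole-zero extensions; the function name, prediction order, and the synthetic test frame are arbitrary choices for illustration.

```python
# A minimal sketch of all-pole (linear prediction) analysis by the
# autocorrelation method, with inverse filtering to estimate the source.
import numpy as np
from scipy.signal import lfilter

def lpc(frame, order=10):
    """Levinson-Durbin solution of the autocorrelation normal equations."""
    frame = np.asarray(frame, dtype=float)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12                      # guard against an all-zero frame
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= (1.0 - k * k)
    return a                                # A(z) = 1 + a_1 z^-1 + ... + a_p z^-p

# Example on a synthetic windowed frame; any array of speech samples works here.
frame = np.hamming(240) * np.random.randn(240)
a = lpc(frame, order=10)
residual = lfilter(a, [1.0], frame)         # inverse filter: deconvolves the all-pole
                                            # system, leaving a source (residual) estimate
```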

In Chapter 7, we introduce the short-time Fourier transform (STFT) and its magnitude for analyzing the spectral evolution of time-varying speech waveforms. Synthesis techniques are developed from both the STFT and the STFT magnitude. Time-frequency resolution properties of the STFT are studied and application to time-scale modification is made. In this chapter, the STFT is viewed in terms of a filter-bank analysis of speech which leads to an extension to constant-Q analysis and the wavelet transform described in Chapter 8. The wavelet transform represents one approach to addressing time-frequency resolution limitations of the STFT as revealed through the uncertainty principle. The filter-bank perspective of the STFT also leads to an analysis/synthesis method in Chapter 8 referred to as the phase vocoder, as well as other filter-bank structures. Also in Chapter 8, certain principles of auditory processing are introduced, beginning with a filter-bank representation of the auditory front-end. These principles, as well as others described as needed in later chapters, are used throughout the text to help motivate various signal processing techniques, as, for example, signal phase preservation. The analysis stage of the phase vocoder views the output of a bank of bandpass filters as sinewave signal components. Rather than relying on a filter bank to extract the underlying sinewave components and their parameters, an alternate approach is to explicitly model and estimate time-varying parameters of sinewave components by way of spectral peaks in the short-time Fourier transform. The resulting sinewave analysis/synthesis scheme, described in Chapter 9, resolves many of the problems encountered by the phase vocoder, e.g., a characteristic phase distortion problem, and provides a useful framework for a large range of speech applications, including speech modification, coding, and speech enhancement by speaker separation.
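The sketch below illustrates the kind of measurement that underlies such sinewave analysis: a short-time Fourier transform is computed frame by frame, and the locations and amplitudes of spectral peaks are recorded. Peak interpolation, frame-to-frame peak matching, phase tracking, and synthesis, which are essential to the scheme of Chapter 9, are omitted; the frame length, hop, FFT size, and test signal are arbitrary choices for illustration.

```python
# A minimal sketch of STFT analysis with per-frame spectral peak picking.
import numpy as np

def stft_peaks(x, fs, frame_len=400, hop=100, n_fft=1024, n_peaks=40):
    window = np.hamming(frame_len)
    freqs, amps = [], []
    for start in range(0, len(x) - frame_len, hop):
        frame = window * x[start:start + frame_len]
        spec = np.abs(np.fft.rfft(frame, n_fft))
        # local maxima of the magnitude spectrum
        peaks = np.where((spec[1:-1] > spec[:-2]) & (spec[1:-1] > spec[2:]))[0] + 1
        peaks = peaks[np.argsort(spec[peaks])[::-1][:n_peaks]]   # strongest first
        freqs.append(peaks * fs / n_fft)      # peak frequencies (Hz)
        amps.append(spec[peaks])              # peak amplitudes
    return freqs, amps

# Example: peaks of a synthetic two-tone signal sampled at 8000 Hz.
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
frame_freqs, frame_amps = stft_peaks(x, fs)
```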

Pitch and a voicing decision, i.e., whether the vocal tract source is periodic or noisy, play a major role in the application of speech analysis/synthesis to speech modification, coding, and enhancement. Time-domain methods of pitch and voicing estimation follow from specific analysis techniques developed throughout the text, e.g., linear prediction or homomorphic analysis. Chapter 10, on the other hand, describes pitch and voicing estimation from a frequency-domain perspective, based primarily on the sinewave modeling approach of Chapter 9.
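As a simple illustration of the frequency-domain viewpoint, the sketch below scores candidate fundamental frequencies by summing magnitude-spectrum values at their harmonics and keeps the best-scoring candidate. This is a generic harmonic-summation estimator, not the sinewave-based method of Chapter 10, and the search range, candidate grid, number of harmonics, and test frame are arbitrary choices.

```python
# A minimal sketch of frequency-domain pitch estimation by harmonic summation.
import numpy as np

def pitch_harmonic_sum(frame, fs, f0_min=50.0, f0_max=400.0, n_harm=8):
    spec = np.abs(np.fft.rfft(np.hamming(len(frame)) * frame, 4096))
    bin_hz = fs / 4096.0
    candidates = np.arange(f0_min, f0_max, 1.0)        # 1-Hz grid of F0 candidates
    scores = []
    for f0 in candidates:
        bins = (np.arange(1, n_harm + 1) * f0 / bin_hz).astype(int)
        bins = bins[bins < len(spec)]
        scores.append(spec[bins].sum())                # energy collected at harmonics
    return candidates[int(np.argmax(scores))]

# Example: a synthetic frame with decaying harmonics of 120 Hz.
fs = 8000
t = np.arange(2048) / fs
frame = sum((1.0 / k) * np.sin(2 * np.pi * 120 * k * t) for k in range(1, 6))
print(pitch_harmonic_sum(frame, fs))                   # expected to be near 120 Hz
```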

Chapter 11 then deviates from the main trend of the text and investigates advanced topics in nonlinear estimation and modeling techniques. Here we first go beyond the STFT and wavelet transforms of the previous chapters to time-frequency analysis methods including the Wigner distribution and its variations referred to as bilinear time-frequency distributions. These distributions, aimed at undermining the uncertainty principle, attempt to estimate important fine-structure speech events not revealed by the STFT and wavelet transform, such as events that occur within a glottal cycle. In the latter half of this chapter, we introduce a second approach to analysis of fine structure whose original development was motivated by nonlinear aeroacoustic models for spatially distributed sound sources and modulations induced by nonacoustic fluid motion. For example, a “vortex ring,” generated by a fast-moving air jet from the glottis and traveling along the vocal tract, can be converted to a secondary acoustic sound source when it interacts with vocal tract boundaries such as the epiglottis, the false vocal folds, teeth, or other inclusions in the vocal tract. In this model, during periodic sounds, secondary sources occur within a glottal cycle and can exist simultaneously with the primary glottal source. Such aeroacoustic models follow from complex nonlinear behavior of fluid flow, quite different from the small compression and rarefaction perturbations associated with acoustic sound waves in the vocal tract that are given in Chapter 4. This aeroacoustic modeling approach provides the impetus for the high-resolution Teager energy operator developed in the final section of Chapter 11. This operator is characterized by a time resolution that can track rapid signal energy changes within a glottal cycle.
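The discrete form of the operator is simple enough to state here: Psi[x(n)] = x(n)^2 - x(n-1)x(n+1). A minimal sketch follows; interpretation of the operator's output and its use for tracking amplitude and frequency modulations are taken up in Chapter 11.

```python
# A minimal sketch of the discrete Teager energy operator,
# psi[x(n)] = x(n)**2 - x(n-1)*x(n+1); boundary samples are left at zero.
import numpy as np

def teager_energy(x):
    x = np.asarray(x, dtype=float)
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]   # interior samples only
    return psi

# For a pure tone A*cos(omega*n + phase), the output is exactly
# A**2 * sin(omega)**2, reflecting both amplitude and frequency.
```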

Based on the foundational Chapters 2–11, Chapters 12, 13, and 14 then address the three application areas of speech coding, speech enhancement, and speaker recognition, respectively. We do not devote a separate chapter to the speech modification application, but rather use this application to illustrate principles throughout the text. Certain other applications not covered in Chapters 12, 13, and 14 are addressed sporadically for this same purpose, including restoration of old acoustic recordings, and dynamic range compression and signal separation for signal enhancement.

1.6 Summary

In this chapter, we first defined discrete-time speech signal processing as the manipulation of sampled speech signals by a digital processor to obtain a new signal with some desired properties. The application of time-scale modification, where a speaker’s articulation rate is altered, was used to illustrate this definition and to indicate the design flexibility of discrete-time processing. We saw that the goal of this book is to provide an understanding of discrete-time speech signal processing techniques driven by both time- and frequency-domain models of speech production, as well as by aspects of speech perception. The speech signal processing algorithms are also motivated by applications that include speech modification, coding, enhancement, and speaker recognition. Finally, we gave a brief outline of the text.

Bibliography

[1] J.R. Deller, J.G. Proakis, and J.H.L. Hansen, Discrete-Time Processing of Speech, Macmillan Publishing Co., New York, NY, 1993.

[2] P.B. Denes and E.N. Pinson, The Speech Chain: The Physics and Biology of Spoken Language, Anchor Press-Doubleday, Garden City, NY, 1973.

[3] F. Jelinek, Statistical Methods for Speech Recognition, The MIT Press, Cambridge, MA, 1998.

[4] W.B. Kleijn and K.K. Paliwal, eds., Speech Coding and Synthesis, Elsevier, 1995.

[5] D. O’Shaughnessy, Speech Communication: Human and Machine, Addison-Wesley, Reading, MA, 1987.

[6] L.R. Rabiner and B.H. Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, NJ, 1993.

[7] M.A. Zissman, “Comparison of Four Approaches to Automatic Language Identification of Telephone Speech,” IEEE Trans. on Speech and Audio Processing, vol. 4, no. 1, pp. 31–44, Jan. 1996.
