Audio Primer

Sound occurs because of a vibration of molecules that arrives at our ears as a wave. Typically the vibrating molecules are those of the air, but sound also propagates through other media, including liquids and solids.

The rate at which the molecules vibrate determines the pitch of the sound, whereas the amount (amplitude) of vibration determines the volume. The rate of vibration is known as the frequency and is measured in hertz (Hz). One hertz represents one complete cycle or vibration per second. A person with unimpaired hearing is able to perceive sound from around 20Hz to around 20KHz (20,000 hertz). However, human perception isn't evenly distributed across that frequency range: far more attention or emphasis is given to the lower frequencies, which, perhaps not surprisingly, match the frequency range of human speech. Transforming the frequency scale to a log representation provides a reasonable first approximation of the “weight” given to different frequencies by our hearing system. Figure 7.4 (shown earlier) shows the perceptually critical bands of hearing.
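
To make the log-scale point concrete, the following Java fragment (a small illustrative sketch, not part of the original example code) shows that the 20Hz to 20KHz audible range spans roughly ten octaves: each doubling of frequency, rather than each fixed number of hertz, represents an equal perceptual step.

// Sketch: divide the audible range (20Hz to 20KHz) into equal steps on a
// log (octave) scale rather than a linear one. Each successive band edge
// is double the previous, one octave per step.
public class OctaveBands {
    public static void main(String[] args) {
        double low = 20.0, high = 20000.0;
        double octaves = Math.log(high / low) / Math.log(2.0); // about 9.97 octaves
        System.out.printf("Audible range spans %.2f octaves%n", octaves);
        for (double f = low; f <= high; f *= 2.0) {
            System.out.printf("Band edge: %.0f Hz%n", f);
        }
    }
}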

Nearly every sound, with the exception of pure tones generated musically or electronically, is a complex amalgam of vibrations at different frequencies. It is the sum of these individual vibrations and their amplitudes (strengths or volumes) that makes up the sound. Thus a sound can not only be described but also composed or generated by detailing the individual frequencies (and their amplitudes) that compose it. Similarly, a sound can be altered by changing the frequency or amplitude of one or more of the pure tones that compose it. This type of functionality is available in some of the more sophisticated audio studio applications.
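
Because a complex sound is simply the sum of pure tones, one can be generated by literally adding sine waves together. The following Java fragment is a minimal sketch of this additive approach; the three component frequencies and amplitudes are arbitrary example values.

// Additive synthesis sketch: a complex sound built as the sum of pure tones.
// The frequencies and amplitudes below are arbitrary example values.
public class AdditiveSynthesis {
    public static void main(String[] args) {
        int sampleRate = 44100;                       // samples per second
        double[] freqs = {220.0, 440.0, 880.0};       // component frequencies (Hz)
        double[] amps  = {0.5,   0.3,   0.2};         // component amplitudes
        double[] signal = new double[sampleRate];     // one second of audio
        for (int n = 0; n < signal.length; n++) {
            double t = (double) n / sampleRate;       // time in seconds
            for (int i = 0; i < freqs.length; i++) {
                signal[n] += amps[i] * Math.sin(2 * Math.PI * freqs[i] * t);
            }
        }
        System.out.println("First sample values: " + signal[0] + ", " + signal[1]);
    }
}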

Normal sounds such as speech, music, and much of what we consider noise (for example, traffic or office sounds) aren't static and unvarying but constantly changing in their component frequency and amplitude characteristics. Indeed, it is that fundamental time-varying property that allows us to generate speech as a sequence of sounds (phonemes) and music as a sequence of notes.

Sound, as it arrives at our ears, is inherently an analogue quantity. Digitization is the process of transforming an analogue sound into a digital representation. Dedicated hardware, such as a PC's soundcard, is required to perform this task of analogue-to-digital (A-to-D) conversion, as well as the inverse digital-to-analogue (D-to-A) conversion when a digital sound is to be presented (sent to speakers).

In performing digitization, two choices must be made, both of which significantly impact the quality of the recorded (in the computer) sound and the size of the resulting media object (the file size if it is saved, or the bandwidth required if it is being transmitted). These are the sampling frequency and the quantization level.

The first choice is the sampling rate (frequency)—the number of times per second that the sound will be captured (turned into a number). It is vital for the sound to be sampled frequently enough to capture its ever-changing nature and the frequency of the individual components of each sound. The Nyquist Theorem describes exactly this relationship between the sampling frequency and the frequency of the signal being captured: if a signal is sampled at frequency fn, only signals up to fn/2 will be accurately represented. For instance, the sampling rate used for audio CDs is 44.1KHz (44,100 hertz), meaning that all sounds up to 22.05KHz will be reliably captured: quite sufficient for the human ear. However, if a lower sampling rate is used (as is often done), the higher frequency components of the sound won't be represented correctly. For instance, if sampling at 11.025KHz (a submultiple of 44.1KHz that is often used), nothing above about 5.5KHz would be correctly represented. Not only could this result in the loss of an important part of the sound, but it also tends to adversely affect perceptions of naturalness because nearly all sounds have resonances that extend into the higher frequencies.

Signal frequencies above the Nyquist frequency (half the sampling rate) aren't lost but are folded back into the lower frequency domain, in a process similar to taking the modulus of a number. This is known as aliasing. It is a familiar visual phenomenon: the rotors on helicopters and planes, and even the spokes on wheels, appear to be stationary or going backward on film (the interaction of the frequency of rotation of the rotor, blade, or spoke with the much lower sampling frequency at which the film was shot). If there are strong signals above the Nyquist frequency, this fold-back significantly corrupts the signal, manifesting as a hiss or other noise. For this reason, low-pass filters are normally used to eliminate these high frequency components prior to sampling.
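
The fold-back can be demonstrated numerically. In the following Java sketch (an illustration, not part of the JMF), a 7KHz cosine sampled at 11.025KHz produces exactly the same sample values as a 4.025KHz cosine, which is why the two tones are indistinguishable once sampled.

// Aliasing sketch: a tone above the Nyquist frequency folds back.
// A 7000Hz cosine sampled at 11025Hz yields the same sample values as a
// 4025Hz cosine (11025 - 7000 = 4025), so the two cannot be told apart.
public class AliasingDemo {
    public static void main(String[] args) {
        double fs = 11025.0;        // sampling rate (Hz)
        double fHigh = 7000.0;      // above the Nyquist frequency (5512.5Hz)
        double fAlias = fs - fHigh; // folded-back frequency: 4025Hz
        for (int n = 0; n < 5; n++) {
            double a = Math.cos(2 * Math.PI * fHigh * n / fs);
            double b = Math.cos(2 * Math.PI * fAlias * n / fs);
            System.out.printf("n=%d  high=%.6f  alias=%.6f%n", n, a, b);
        }
    }
}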

The second choice in the digitization process is the quantization level: the number of bits used to represent each sample. The greater the number of bits employed, the better the dynamic range or sound resolution, because the amplitude at each point in time can be defined more accurately. Choosing an adequate number of bits (for example, the 16 used in CD audio) ensures that quieter passages aren't lost. Too few bits make the audio signal sound fuzzy, much like speech over a poor telephone line.
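
A common rule of thumb is that each bit of quantization contributes roughly 6dB of dynamic range, so 16-bit audio offers about 96dB against roughly 48dB for 8 bits. The following Java fragment is a small illustrative sketch that quantizes a sample value to a given number of bits and reports the resulting error.

// Quantization sketch: round a sample in the range [-1.0, 1.0] to the
// nearest of 2^bits levels and report the resulting quantization error.
// Rule of thumb: dynamic range is roughly 6dB per bit.
public class Quantize {
    static double quantize(double sample, int bits) {
        double levels = Math.pow(2, bits - 1);   // e.g. 32768 levels for 16 bits
        return Math.round(sample * levels) / levels;
    }
    public static void main(String[] args) {
        double sample = 0.1234567;
        for (int bits : new int[] {8, 16}) {
            double q = quantize(sample, bits);
            System.out.printf("%2d bits: value=%.7f error=%.7f (~%.0fdB range)%n",
                    bits, q, Math.abs(sample - q), 6.02 * bits);
        }
    }
}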

Choices of sampling rate and quantization not only affect the quality of the resulting audio, but also directly determine the bandwidth (size) of that audio object. This is a very important factor when considering streaming audio over a network with a bandwidth limitation. The following formula illustrates the relationship, while Table 7.1 shows the bandwidth for one second of audio at some of the more common sampling rate and quantization level combinations:

Bits per second = Number of channels x Sampling rate x Quantization level

Table 7.1. Bandwidth Requirements for Audio at Different Sampling Rates and Quantization Levels
Guideline (Example of Quality)   Sampling Rate   Quantization Level   Number of Channels   Kilobytes/Second
CD Audio                         44.1KHz         16                   2 (stereo)           176.4
FM Radio                         22.05KHz        16                   2 (stereo)           88.2
Stereo 1 - Acceptable            11.025KHz       16                   2 (stereo)           44.1
AM Radio                         11.025KHz       16                   1 (mono)             22.05
Stereo 2 - Grainy                11.025KHz       8                    2 (stereo)           22.05
Old hand-held game machine       11.025KHz       8                    1 (mono)             11.025
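
As a check on the table, the formula can be evaluated directly. The following Java fragment (illustrative only) reproduces the CD audio and hand-held game rows of Table 7.1.

// Bandwidth sketch: bits per second = channels x sampling rate x quantization.
// The CD audio row of Table 7.1 is 2 x 44100 x 16 bits per second, which is
// 176,400 bytes (176.4 kilobytes) per second.
public class AudioBandwidth {
    static double kilobytesPerSecond(int channels, double sampleRate, int bits) {
        double bitsPerSecond = channels * sampleRate * bits;
        return bitsPerSecond / 8.0 / 1000.0;   // bits -> bytes -> kilobytes
    }
    public static void main(String[] args) {
        System.out.println(kilobytesPerSecond(2, 44100, 16));   // 176.4 (CD audio)
        System.out.println(kilobytesPerSecond(1, 11025, 8));    // 11.025 (hand-held game)
    }
}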

Clearly, choosing lower sampling rates, quantization levels, and numbers of channels can significantly reduce bandwidth requirements, but at the expense of a potentially significant reduction in quality. Some of the most commonly employed codecs (compression schemes) for audio coding will be discussed in the next section. These audio codecs can significantly reduce the bandwidth requirements.

Speech and Music

The two most commonly processed forms of audio data are speech and music, each of which has its own unique characteristics.

Speech is produced by the human vocal apparatus, in which the placement of the articulators—the lips, tongue, jaw, and velum (the soft palate, which includes the uvula)—forms the shape of the passage through which air flows. The shape of this passage determines its resonant frequencies and hence the sound produced from the lips as air escapes.

One property of speech sounds lends itself well to compression—most of the signal's energy is concentrated in a frequency range from 100Hz to under 5KHz (the exact range varies with the sound and the speaker). This isn't to say that higher frequency components of the sound don't exist; they certainly do. Rather, most of the information that people use to identify the sound, as well as other information such as speaker gender and identity, can be found in this region. This can be exploited by sampling at a frequency high enough to capture the vital information, but not high enough to preserve the total sound. Although the digitized speech might not sound exactly like the original, most of the vital information will still be preserved, and at a considerable bandwidth saving. For example, speech sampled at 11KHz is still easily intelligible.
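
In JMF terms, requesting speech at a reduced sampling rate is a matter of specifying the desired audio format. The following fragment is a minimal sketch assuming the four-argument constructor of javax.media.format.AudioFormat; the 11.025KHz, 16-bit, mono values are example choices.

import javax.media.format.AudioFormat;

// Sketch: describing reduced-rate speech in JMF. An 11.025KHz, 16-bit, mono,
// linear format captures the frequency region (up to about 5.5KHz) that
// carries most of the information in speech.
public class SpeechFormat {
    public static void main(String[] args) {
        AudioFormat speech = new AudioFormat(
                AudioFormat.LINEAR,   // linear (uncompressed) encoding
                11025.0,              // sampling rate in Hz
                16,                   // quantization level in bits
                1);                   // single (mono) channel
        System.out.println(speech);
    }
}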

Music is such an encompassing category, being dependent on the form of music and type of instruments used, that it is difficult to make generalizations about its properties. However, music is far more likely to cover a wider frequency range than speech and, hence, suffer more from sampling at lower frequencies.

Figure 7.11 shows the time waveform of a short passage of speech, “The JMF: an API for Handling Time-based Media,” spoken by an adult male, in the top plot, as well as the first chords of “Smoke on the Water” played on an electric guitar in the bottom plot.

Figure 7.11. Contrast between two sound waveforms.


An alternative form of encoding music is known as MIDI (pronounced “mid-ee”), for Musical Instrument Digital Interface. This is a digital format for recording the instruments and notes that are played in a piece of music—not the sounds themselves. As such, it is an extremely compact format when contrasted with digitized sound. On the downside, MIDI doesn't guarantee the same level of fidelity in reproduction that sampling can—it is dependent on the quality of the playback instrument (often a computer soundcard, but originally synthesizers and similar instruments) and its capability to use its voices (different sampled or synthesized instruments) to reproduce the sound appropriately.
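
The note-based nature of MIDI can be illustrated with Java's standard javax.sound.midi package (part of Java Sound rather than the JMF). In the following sketch, the entire musical “recording” amounts to a note number and a velocity; the synthesizer's own voices produce the actual sound.

import javax.sound.midi.MidiChannel;
import javax.sound.midi.MidiSystem;
import javax.sound.midi.Synthesizer;

// MIDI sketch: play middle C for one second on the default synthesizer.
// The stored information is just an instruction (note 60, velocity 93);
// the synthesizer's voices generate the actual waveform.
public class MidiNote {
    public static void main(String[] args) throws Exception {
        Synthesizer synth = MidiSystem.getSynthesizer();
        synth.open();
        MidiChannel channel = synth.getChannels()[0];
        channel.noteOn(60, 93);    // note number 60 is middle C
        Thread.sleep(1000);        // let it sound for a second
        channel.noteOff(60);
        synth.close();
    }
}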

Content Types, Formats, and Codecs

Each of the three major audio content types originated on a particular computer platform—Wave (WAV) from the Windows platform, AIFF from the Macintosh platform, and AU from Unix. All three have grown in parallel such that they provide roughly similar functionality in terms of supported formats. The JMF provides support for all three, as well as MIDI, GSM, and the various MPEG schemes.

Until recently, the dominant codecs in the audio arena had their origins in the telecommunications area, being codecs for compressing speech over telephone lines. Among this group are codecs such as ADPCM, A-Law, and U-Law. A common approach among such codecs is known as companding. The basis of companding is to use a non-linear quantization scale: quantization steps are finer at low amplitudes and coarser at high ones, so fewer bits are spent on the higher values (somewhat analogous to transforming to the log domain).
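
The non-linear scale can be seen from the textbook U-Law (mu-law) curve. The following Java fragment is a simplified illustration only; it applies the mu-law formula but omits the bit-level packing performed by a real G.711 encoder.

// Companding sketch: the mu-law curve (mu = 255) stretches small amplitudes
// and compresses large ones, approximating a log scale. A real G.711 codec
// then stores the companded value in 8 bits; that step is omitted here.
public class MuLawCompand {
    static final double MU = 255.0;

    static double compress(double x) {   // x in the range [-1.0, 1.0]
        return Math.signum(x) * Math.log(1 + MU * Math.abs(x)) / Math.log(1 + MU);
    }
    public static void main(String[] args) {
        for (double x : new double[] {0.01, 0.1, 0.5, 1.0}) {
            System.out.printf("input %.2f -> companded %.3f%n", x, compress(x));
        }
    }
}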

New codecs have appeared that are based on perceptual compression and designed for music (a more challenging problem than speech). The MPEG schemes are the best known of these. In particular, MP3 (MPEG Layer 3, not MPEG-3, because no such entity exists) is famous because of its use to encode music on the Internet. The MPEG compression scheme is frequency-domain based: sound is transformed into a number of (for example, 32) frequency channel values. The frequency dependence of the threshold of hearing (the minimum volume at which a sound can be heard) is combined with masking effects (loud sounds at one frequency raise the hearing threshold at nearby frequencies) so that the minimum number of bits is used to encode each channel and quantization noise is hidden. The MP3 scheme is well known for achieving roughly 10:1 compression while maintaining (perceptually) high quality.

The following lists some of the better-known audio codecs:

ADPCM (Adaptive Differential Pulse Code Modulation)— A temporally based compression scheme that looks at the difference between successive samples. The scheme is further strengthened (but complicated) by predicting what the next sample should be and transmitting or storing only the difference between the predicted and actual sample. A non-linear scheme is employed to record this value. (A simplified difference-coding sketch appears after this list.) ADPCM is supported by the JMF.

A-Law— A companding compression scheme, A-Law is a standard from the ITU that is closely related to G.711 (U-Law) and is used in those countries where U-Law isn't found. Designed for the compression of speech over phone lines, it is able to reduce 12-bit samples to an 8-bit quantization. A-Law is supported by the JMF.

G.711 (U-Law)— A companding compression scheme, G.711 is an ITU standard employed in Japan and North America, as well as being commonly used on the Web and by Sun and NeXT machines. It is related to the A-Law scheme. G.711 compression is supported by the JMF.

GSM— An international standard for mobile digital telephones, GSM is based on linear predictive coding: the prediction of future samples based on (a weighted sum of) those that have already been seen. The scheme achieves significant compression but at a noticeable loss of quality. GSM is supported by the JMF.

MPEG Layer I, II, and III— From the MPEG-1 and MPEG-2 standards, the three layers represent an increasingly (from I to III) sophisticated compression scheme based on perception (see the previous discussion). Layer I corresponds to a data rate of 192Kbps, Layer II to a data rate of 128Kbps, and Layer III (MP3: the most famous and widely used) to an upper bound on data rate of 64Kbps. The JMF supports all three layers.

RealAudio— From RealNetworks, and famous because of its widespread exposure and usage on the Internet, RealAudio is a codec designed to support the real-time streaming of audio. RealAudio is a proprietary codec.
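
As promised in the ADPCM entry, the following Java fragment is a simplified difference-coding sketch. It is not a real ADPCM or GSM implementation; it merely shows the underlying idea of predicting each sample from the previous one and storing only the prediction error.

// Difference-coding sketch (simplified, not a real ADPCM or GSM codec):
// predict each sample as the previous one and store only the prediction
// error. Slowly varying signals yield small errors that need fewer bits.
public class DifferenceCoding {
    public static void main(String[] args) {
        int[] samples = {1000, 1010, 1025, 1030, 1028, 1015};
        int[] errors = new int[samples.length];
        int predicted = 0;                          // predictor state
        for (int i = 0; i < samples.length; i++) {
            errors[i] = samples[i] - predicted;     // value actually stored
            predicted = samples[i];                 // next prediction is the last sample
        }
        // The decoder reverses the process by accumulating the errors.
        int reconstructed = 0;
        for (int e : errors) {
            reconstructed += e;
            System.out.print(reconstructed + " ");  // recovers the original samples
        }
        System.out.println();
    }
}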

To illustrate the differences in degree of compression and audio quality between different codecs, the book's Web site (www.samspublishing.com) has a number of versions of the same audio sample. The audio sample is a short piece in four segments. The first segment is speech from an adult male speaker of Australian English and serves as an introduction. The three remaining segments are all instrumental music. The first musical piece is an organ playing a few bars of “California Dreaming.” The second musical piece is a guitar playing a few well-known bars of “Smoke on the Water.” The third and final musical piece is a five-second segment of a didgeridoo, a traditional wind instrument of the Australian Aborigines, being played. The same original audio sample has been transcoded using a number of different codecs so that they can be contrasted. The name of each file identifies the codec, sampling rate, and quantization level used:

<codec>_<sampling rate>_<quantisation>.<content_type>

For instance, GSM_8_16.wav is a Wave file encoded using GSM at 8KHz sampling and a quantization level of 16 bits. Sampling rates that aren't exact multiples of one thousand (for example, 22.05KHz and 11.025KHz) are rounded for the purposes of filenames only. Thus, Linear_22_16.wav is a Wave file with linear encoding sampled at 22.05KHz with 16-bit quantization.
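
If needed, such file names can be pulled apart programmatically. The following Java fragment is a small illustrative sketch that splits a name following the convention above into its fields.

// Sketch: parse a name of the form <codec>_<sampling rate>_<quantisation>.<content_type>
public class SampleFileName {
    public static void main(String[] args) {
        String fileName = "GSM_8_16.wav";
        String base = fileName.substring(0, fileName.lastIndexOf('.'));
        String contentType = fileName.substring(fileName.lastIndexOf('.') + 1);
        String[] parts = base.split("_");
        System.out.println("Codec:         " + parts[0]);            // GSM
        System.out.println("Sampling rate: " + parts[1] + "KHz");    // 8
        System.out.println("Quantization:  " + parts[2] + " bits");  // 16
        System.out.println("Content type:  " + contentType);         // wav
    }
}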
