Chapter 6
Amplitude Envelope and Audio Edit Points

In Chapter 4 we discussed processing an audio signal's amplitude envelope with dynamics processors such as compressors and expanders. In this chapter we explore amplitude envelope and technical ear training from a slightly different perspective: that of audio editing software.

The process of digital audio editing, especially with classical or acoustic music using a source-destination method, offers an excellent opportunity for ear training. Likewise, the process of music editing requires an engineer to have a keen ear for transparent splicing of audio. Music editing involves making transparent connections or splices between takes of a piece of music, and it often requires specifying precise edit locations by ear. In this chapter we will explore how aspects of digital editing can be used systematically as an ear training method, even out of the context of an editing session. The chapter describes a software tool based on audio editing techniques that is an effective ear trainer offering benefits that transfer beyond audio editing.

6.1 Digital Audio Editing: The Source-Destination Technique

Before describing the software and method for ear training, it is important to understand some digital audio editing techniques used in classical music postproduction. Classical music demands a high level of precision, perhaps more than other genres, to achieve the necessary transparency.

Based on hundreds of hours of classical music editing, I have found that the process of repeatedly adjusting the location of edit points to create smooth cross-fades by ear not only results in a clean final recording, but the process also seems to promote improved listening skills that translate to other areas of critical listening. Through highly focused listening required for audio editing, with the goal of matching edit points from different takes, we participate in an effective form of ear training.

Digital audio editing systems allow us to see a visual representation of our audio waveforms, and to move, insert, copy, or paste audio files anywhere along a visual timeline. Source-destination editing, also known as four-point editing, offers a slightly different workflow than simply moving and trimming clips or regions. Not all DAWs offer source-destination editing, but Pyramix by Merging Technologies and Sequoia by Magix are two notable titles that do.

When I edit classical music using the source-destination method, I listen through to find the musical note on which I will make a splice or edit from one take to another. I mark the rough timeline location and then home in on the precise placement of my edit point by ear. Waveform views in a digital audio workstation help with the rough placement of the initial marker for an edit, but it is often more efficient and more accurate to find the precise location of an edit by ear, rather than by looking for a visual feature.

During the editing process, I work from a list of takes from a recording session and I assemble a complete piece of music using the best takes from each section of a musical score. Through source-destination editing, I build a complete musical performance (the destination) by taking the best excerpts from a list of recording session takes (the source) and piecing them together.

In source-destination editing, we find an edit location by listening to a recorded take while following the musical score. Then we place a marker at a chosen edit point in the DAW waveform timeline. As an editing engineer, I usually audition a short excerpt—typically 0.5 to 5 seconds in length—of a recorded take, up to the specific musical note where I want to make an edit. Next, we listen to the same musical excerpt from a different take and we compare it to the previous take. Usually we try to place an edit point precisely at the onset of a musical note, so that the transition from one take to another will be transparent. Source-destination editing allows us to audition a few seconds of audio leading up to an edit point marker in each take and have the audio stop precisely at the marker. Our goal as editing engineers is to focus on the sonic characteristics of the note onset that occurs during the final few milliseconds of an excerpt and match the sound quality between takes by adjusting the location of the edit point (i.e., the end point of the excerpt). The edit point marker may appear as a movable bracket on the audio signal waveform, as in Figure 6.1. It is our focus on the final milliseconds of an audio excerpt that is critical to finding an appropriate edit point. When we choose a musical note onset as an edit point, it is important to set the edit point such that it actually occurs sometime during the very beginning of a note attack. Figure 6.1 shows a gate (square bracket indicating the edit point) aligned with the attack of a note.

When we audition an audio clip up to a note onset, we hear only the first few milliseconds or tens of milliseconds of the note. By stopping playback immediately at the note onset, it is possible to hear a transient, percussive sound at the truncated note onset. The specific sound of the cut note will vary directly with the amount of the incoming note that is sounded before being cut. Figure 6.2 illustrates, in block form, the process of auditioning source and destination program material.

Once the audio clip cutoff timbres are matched as closely as possible between takes, we make the edit with a cross-fade from one take into another and audition the edit cross-fade to check for sonic anomalies. Figure 6.3 illustrates, in block form, a composite version (the destination) of three different source takes of the same musical program material.
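To make the splice mechanics concrete, the cross-fade described above can be sketched in code. This is a minimal illustration, not any DAW's actual implementation: the function name, the mono float arrays, and the equal-power (sine/cosine) fade shape are all assumptions made for the sketch.

```python
import numpy as np

def equal_power_crossfade(take_a, take_b, edit_a, edit_b, fade_len):
    """Splice take_b onto take_a with an equal-power cross-fade of
    fade_len samples centered on each take's edit point.

    take_a, take_b: 1-D float arrays (mono, for simplicity).
    edit_a, edit_b: edit-point sample indices in each take.
    """
    half = fade_len // 2
    start_a = edit_a - half  # overlap start in the outgoing take
    start_b = edit_b - half  # overlap start in the incoming take
    # Quarter-wave sine/cosine fades keep summed power roughly constant
    # across the overlap, avoiding the level dip a linear fade can cause
    # on uncorrelated material from different takes.
    t = np.linspace(0.0, np.pi / 2, fade_len)
    fade_out = np.cos(t)
    fade_in = np.sin(t)
    overlap = (take_a[start_a:start_a + fade_len] * fade_out
               + take_b[start_b:start_b + fade_len] * fade_in)
    return np.concatenate([take_a[:start_a], overlap,
                           take_b[start_b + fade_len:]])
```

In practice the fade length would be chosen as described below: short for transient material, longer for sustained notes.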

Figure 6.1 A typical view of a waveform in a digital editor with the edit point marker that indicates where the edit point will occur and the audio will cross-fade into a new take. The location of the marker, indicated by a large bracket, is adjustable in time (left/sooner or right/later). The arrow indicates simply that the bracket can slide to the left or right. We listen to the audio up to this large bracket with a predetermined pre-roll time that usually ranges from 0.5 to 5 seconds.

Figure 6.2 The software module presented here re-creates the process of auditioning a sound clip up to a predefined point and matching that end point in a second sound clip. We audition the source and destination audio excerpts up to a chosen edit point, usually placed at the onset of a note or strong beat. In an editing session, the two audio clips (source and destination) would be of identical musical material but from different takes. One of our goals is to match the sound of the source and destination clip end points (as defined by the edit marker locations in each clip). The greater the similarity between the two cutoff timbres, the more successful the edit will be.

Figure 6.3 Source and destination waveform timelines are shown here in block form along with an example of how a set of takes (source) might fit together to form a complete performance (destination). In this example takes 1, 2, and 3 are the same musical program material, and therefore a composite version could be produced of the best sections from each take to form the destination.

During the process of auditioning a cross-fade, we must also pay close attention to the sound quality of each cross-fade, whose length may range from a few milliseconds to several hundred milliseconds depending on the context. That is, we typically use short cross-fades for transient sounds or notes with fast onsets, and longer cross-fades for editing during sustained notes.

The process of listening back to a cross-fade and adjusting its parameters, such as length, position, and shape, also offers an opportunity to improve critical listening skills. When editing any kind of audio source material, the goal is to make the composite edited audio seamless, with no audible edits. Classical music recordings can contain huge numbers of edits, sometimes averaging 10 or more per minute, yet when the edits are done well it is nearly impossible to hear even a single one in the finished recording. Here are some cross-fade artifacts that we should listen for when editing, and that we can also listen for in commercial recordings:

  • a slight, momentary dip in level
  • a slight, momentary increase in level
  • a note onset or speech syllable that is cut off
  • a timing mismatch—maybe the musical note following the edit feels rushed or delayed slightly
  • a sudden change in ambience or reverberation, such as when an edit is made at a cold start midway through a piece and there is no lingering sound in the take into which we are going
  • a low-frequency thump
  • a doubling, chorus, or phasing effect, especially with longer cross-fades
  • a shift in the stereo image
  • a change in timbre
  • an abrupt change in loudness after the edit
  • a singer’s or speaker’s breath sound that is cut off
  • a click—if the cross-fade is very short

6.2 Software Exercise Module

All of the “Technical Ear Trainer” software modules are available on the companion website: www.routledge.com/cw/corey.

Based on source-destination editing, the associated technical ear training software module was designed to mimic the process of comparing the final few milliseconds of two short clips of identical music from different takes. The advantage of the software practice module is that it promotes critical listening skills without requiring an actual editing project. The main difference between the practice module and an editing project is that the practice software will work with only one “take,” that being any linear PCM sound file loaded in, whereas in an actual editing project we work with different takes of the same material. Because of this difference, the two clips of audio are actually the same audio, and therefore it is possible to find identical-sounding end points. The benefit of working this way is that the software has the ability to judge if the sound clips end at precisely the same point.

To start, we load any audio file that is at least 10 seconds in duration. The software randomly chooses a short excerpt or clip (which is called clip 1 or the “reference”) from any stereo music recording loaded into the software. The exact duration of clip 1 is not revealed, but we can listen to it by pressing the number 1 in the interface. The software also randomizes the excerpt lengths, which range from 500 milliseconds to 2 seconds, to ensure that we are not simply being trained to identify the duration of the audio clips. We can compare clip 1 to clip 2 (“your answer”) as many times as we want. The duration of clip 2 is displayed in the interface.

The goal of the exercise is to adjust the duration of clip 2 (your answer) until its cutoff point is exactly the same as clip 1 (reference). By focusing on the final few milliseconds before the cutoff point and listening to the amplitude envelope, timbre, and musical content, we compare the two clips and adjust the length of clip 2 so that the sound of its cutoff point matches clip 1. By pursuing a cycle of auditioning, comparing, and adjusting the length of clip 2, we can learn the sound of the clip 1 cutoff point and adjust the length of clip 2 until its cutoff point sounds exactly the same.

We can adjust the length of clip 2 by “nudging” the end point either earlier or later in time. We can choose from different nudge time step sizes: 5, 10, 15, 25, 50, or 100 milliseconds. The smaller the nudge step size, the more difficult it is to hear a difference from one step to another.
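The exercise's bookkeeping can be sketched as follows. This is a hypothetical reconstruction, not the trainer's actual code: the class name, the starting answer length, and the random-number handling are assumptions; only the nudge step sizes and the 500 to 2000 ms excerpt range come from the text.

```python
import random

NUDGE_STEPS_MS = (5, 10, 15, 25, 50, 100)  # step sizes offered by the trainer
CLIP_MIN_MS, CLIP_MAX_MS = 500, 2000       # excerpt-length range from the text

class ClipMatchExercise:
    """Bookkeeping for one question: a hidden reference length and a
    visible answer length that is nudged in fixed millisecond steps."""

    def __init__(self, step_ms=25, rng=None):
        assert step_ms in NUDGE_STEPS_MS
        rng = rng or random.Random()
        self.step_ms = step_ms
        # Keep everything on a grid of step_ms so the answer can land
        # exactly on the hidden reference length.
        lo = -(-CLIP_MIN_MS // step_ms)  # ceiling division
        hi = CLIP_MAX_MS // step_ms
        self._lo_ms, self._hi_ms = lo * step_ms, hi * step_ms
        self.reference_ms = rng.randint(lo, hi) * step_ms  # hidden
        self.answer_ms = (1000 // step_ms) * step_ms       # displayed

    def nudge(self, direction):
        """direction = +1 (end point later/longer) or -1 (earlier/shorter)."""
        new_ms = self.answer_ms + direction * self.step_ms
        self.answer_ms = min(self._hi_ms, max(self._lo_ms, new_ms))

    def check(self):
        """Signed error in ms; 0 means the two cutoffs coincide."""
        return self.answer_ms - self.reference_ms
```

With a smaller `step_ms`, adjacent answers differ less and the matching task becomes correspondingly harder, just as described above.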

Figure 6.4 shows the waveforms of four sound clips of increasing length from 825 ms to 900 ms in steps of 25 ms. This particular example shows how the end of the clip can vary significantly depending on the length chosen. Although the second (850 ms) and third (875 ms) waveforms in Figure 6.4 look similar, there is a noticeable difference in the perceived percussive or transient sound at the cutoff point. With smaller step or nudge sizes, the difference between steps is less obvious and would require more training for correct identification.

Figure 6.4 Clips of a music recording of four different lengths: 825 ms, 850 ms, 875 ms, and 900 ms. This particular example shows how the end of the clip can vary significantly depending on the length chosen. We should focus on the quality of the transient sound at the cutoff point of the clip to determine the one that sounds most like the reference. The 825-ms duration clip contains a faint percussive sound at the end of the clip, but because the note (a drum hit in this case) that begins to sound is almost completely cut off, it comes out as a short click. In this example, we can focus on the percussive quality, timbre, and envelope of the incoming drum hit at the clip cutoff to determine the correct sound clip length.

After deciding on a clip length, we press the “Check Answer” button to find out the correct duration. We can continue to audition the two clips for that question once we know the correct duration. The software indicates whether the response for the previous question was correct or not, and if incorrect, it indicates whether clip 2 was too short or too long and the size of the error. Figure 6.5 shows a screenshot of the software module.

There is no view of the waveform as we would typically see in a digital audio editor because the goal of this training is to create an environment where we rely solely on what we hear with minimal visual information about the audio signal. There is, however, a green bar that increases in length over a timeline, tracking the playback of clip 2 in real time, as a visual indication that clip 2 is being played. Also, the play buttons for the respective clips turn green briefly while the audio is playing and then return to gray when the audio stops.

With this ear training method, our goal is to compare one sound to another and attempt to match them. There is no need to translate the sound feature to a verbal descriptor, but instead the focus lies solely on our perception of the features of the audio signal. Although there is a numeric display indicating the length of the sound clip, this number serves only as a reference for keeping track of where the end point is set. The number has no bearing on the sound features heard other than for a specific excerpt. For instance, a 600-ms randomly chosen clip will have different cutoff point features than other randomly chosen 600-ms clips.

Figure 6.5 A screenshot of the training software. The large squares with “1” and “2” are playback buttons for clips 1 and 2, respectively. Clip 1 (the reference) is of unknown duration, and the length of clip 2 must be adjusted to match clip 1. Below the clip 2 play button are two horizontal bars. The top one indicates, with a white circle, the duration of clip 2, in the timeline from 0 to 2000 milliseconds. The bottom bar increases in length (from left to right) up to the circle in the top bar, tracking the playback of clip 2, to serve as a visual indication that clip 2 is being played.

I recommend that you start with the less challenging exercises that use large step sizes of 100 ms and progress through to the most challenging exercises, where the smallest step size is 5 ms.

Almost any stereo recording in the format of linear PCM (AIFF or WAV) can be used with the training software, as long as it is at least 30 seconds in duration.

6.3 Focus of the Exercise

The main goal of the training program I describe in this chapter is to focus on the amplitude envelope of a signal at a specific point in time—that being the end of a short audio excerpt or cutoff point. Although the audio is not being processed in any way other than that there is a fast fade-out, the location of the cutoff point determines how and at what point a musical note may get cut. In this exercise, focus on the final few milliseconds of the first clip, hold the end sound in memory, and compare it to the second clip.
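The cutoff described above, a hard truncation softened by a fast fade-out, can be sketched as follows. The 2 ms linear fade is an assumed value; the text does not specify the trainer's actual fade shape or length.

```python
import numpy as np

def truncate_clip(signal, cutoff_ms, sample_rate=44100, fade_ms=2.0):
    """Cut a clip at cutoff_ms with a very fast linear fade-out.

    The short fade keeps the truncation itself from adding a click, so
    any transient heard at the end of the clip comes from where the cut
    lands on the incoming note's amplitude envelope.
    """
    end = int(sample_rate * cutoff_ms / 1000.0)
    fade_len = max(1, int(sample_rate * fade_ms / 1000.0))
    clip = np.array(signal[:end], dtype=float)
    clip[-fade_len:] *= np.linspace(1.0, 0.0, fade_len)
    return clip
```

Shifting `cutoff_ms` by one nudge step moves the cut relative to the note's envelope, which is exactly what changes the sound quality at the clip's end point.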

Because the software randomly chooses the location of an excerpt, a cutoff point can occur almost anywhere in an audio signal. Nonetheless, two specific cases are worth describing: cutoff points that occur during the onset of a strong note or beat, and those that occur during a sustained note, between strong beats.

First, let us explore a cutoff occurring at the beginning of a strong note or beat. If the cut occurs during the attack portion of a musical note, the cutoff may produce a transient signal whose characteristics vary depending on the precise location of the cut relative to the note’s amplitude envelope. We can then match the resulting transient sound by adjusting the cutoff point. Depending on how much of a note or percussive sound gets cut off, the spectral content of that particular sound will vary with the note’s modified duration. Generally a shorter note segment will have a higher spectral centroid than a longer segment and have a brighter sound quality. An audio signal’s spectral centroid is the average frequency of the signal’s frequency spectrum. The spectral centroid is a single number that indicates where the center of mass of a spectrum is located, which gives us some indication of the timbre. If there is a click at the end of an excerpt—produced as a result of the cutoff point—it can serve as a cue for the location of the end point. We can assess the spectral quality of the click and match the cutoff location based on the click’s duration.
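The spectral centroid defined above can be computed directly from a signal's magnitude spectrum. A minimal sketch:

```python
import numpy as np

def spectral_centroid(signal, sample_rate=44100):
    """Magnitude-weighted mean frequency (in Hz) of the signal's
    spectrum: a single-number correlate of perceived brightness."""
    mags = np.abs(np.fft.rfft(signal))           # magnitude spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(np.sum(freqs * mags) / np.sum(mags))
```

A pure 440 Hz tone has a centroid of about 440 Hz; cutting a note mid-attack tends to spread energy toward higher frequencies, raising the centroid and giving the brighter quality described above.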

Next we can discuss a cutoff that occurs during a more sustained or decaying audio signal. For this type of cut, we should focus on the duration of the sustained signal and match its length. This might be analogous to adjusting the hold time of a gate (dynamics processor) with a very short release time. With this type of matching, we may shift our focus to musical qualities such as tempo and timing to determine how long a final note is held before being cut off.

With any end point location, our goal is to track the amplitude envelope and spectral content of the end of the clip. The hope is that the skills learned in this exercise generalize to increased hearing acuity, helping us hear subtle details in a sound recording that were not apparent before. I have found that by listening to short excerpts out of the context of an entire musical piece, I begin to hear sounds within the recording in new ways, as some sounds become unmasked and thus more audible. Listening to clips allows us to focus on features that may be partially or completely masked when heard in context (i.e., in much longer excerpts) or features that are simply less apparent in the larger context.

When listening to a full piece, our auditory systems try to follow musical lines, timbres, and spatial characteristics, so our attention is constantly pulled through the piece and we are never given time to focus on every aspect of what is a complex stimulus. When we take a short clip out of context, we can repeat it quickly while it remains in our short-term memory and therefore start to unpack details that may have eluded us while listening to the full piece. When we repeat a clip out of the context of an entire recording, we may experience a change in our perception of the audio signal. Similarly, if we repeat a single word over and over, its meaning momentarily starts to fade and we begin to focus on the timbre of the word rather than its meaning.
It is common for composers (especially of minimalist music) to take short musical phrases or excerpts of recordings and repeat them to create a new type of sound and perceptual effect, allowing listeners to hear new details in the sound that may not have been apparent before. For an example, listen to “It’s Gonna Rain” by Steve Reich, which uses a recording of a person saying the words “it’s gonna rain” played back on two analog tape machines. In the piece, Reich loops those three words or portions of the three words to create rhythmic, spatial, and timbral effects through a gradually increasing delay between the two tape machines. He takes advantage of the human auditory system’s natural tendency to find patterns in sound and lose the meaning of words that are repeated over and over.

The audio clip edit ear training method may help us focus on quieter or lower-level features of a given program material in the midst of louder features. Quieter features of a program may be partly or mostly masked, perceptually less prominent, or relegated to the background of the perceived sound scene or sound stage. Examples might include the following (some of which were mentioned earlier):

  • reverberation and delay effects for specific instruments
  • artifacts of dynamic range compression for specific instruments
  • specific musical instrument sound qualities, such as a drummer’s brush sounds or the articulation of an acoustic double bass in a jazz piece
  • specific features of each musical voice/instrument, such as the temporal nature or spatial location of amplitude envelope components (attack, decay, sustain, and release)
  • definition and clarity of elements within the sound image, and the width of individual elements

Sounds taken out of context start to give a new impression of the sonic quality and also the musical feel of a recording. Additional detail from an excerpt is often heard when a short clip of music is played repeatedly, detail that would not necessarily be heard in context.

As I was creating this practice module, I chose, perhaps arbitrarily, the jazz-bossa nova recording “Desafinado” by Stan Getz, João Gilberto, and Antônio Carlos Jobim (from the 1964 album Getz/Gilberto) as a sound file to test my software development. The recording features prominent vocals and saxophone, acoustic bass, acoustic guitar, piano, and drums played lightly. Through my testing and extensive listening, I have gained new impressions of the timbres and sound qualities in the recording that I was not previously aware of. Even though this might seem like a fairly straightforward recording from a production point of view—all acoustic instruments with minimal processing—I began to uncover subtle details with reverb, timbre, and dynamics. In this recording, the drums are fairly quiet and more in the background, but if an excerpt falls between vocal phrases or guitar chords, the drum part may perceptually move to the foreground as the matching exercise changes our focus. It also may be easier to focus on characteristics of the drums, such as their reverberation or echo, if we can hear that particular musical part more clearly. Once we identify details within a short excerpt, it can make it easier to hear these features within the context of the entire recording and also generalize our ability to identify these types of sound features to other recordings.

Summary

This chapter outlines an ear training method based on the source-destination audio editing technique. Because of the critical listening required to perform accurate audio editing, the process of finding and matching edit points can serve as an effective form of ear training. With the interactive software exercise module, the goal is to practice matching the length of one sound excerpt to a reference excerpt. By focusing on the timbre and amplitude envelope of the final milliseconds of the clip, the end point can be determined based on the nature of any transients or length of sustained signals. By not including verbal or meaningful numeric descriptors, the exercise is focused solely on the perceived audio signal and on matching the end point of the audio signals.

In any audio project, try to listen to:

  • the quality of transients—are they sharp and clear or broad and smeared?
  • the shape of any cutoffs or fade-outs that are present
  • the amplitude envelope of every signal
  • lower-level and background elements such as reverb
