5

Dialogue

Game dialogue is an area of game audio that has seen significant improvements in recent years. For some strange reason, in the past many game producers and designers appeared to be under the impression that they were born with an innate ability to write good dialogue, rather than appreciating that, like any other skill, it takes years of practice and application to master.

Trying to avoid clumsy game dialogue is particularly difficult, as a huge burden is often put on dialogue as the sole driver of a game’s narrative. Games are perhaps not yet as developed in terms of visual storytelling as film, where a couple of shots may convey several key story elements, so they often rely on dialogue not only to move the narrative forward but also to provide a great deal of instructional information to the player. Games also tend to operate within well-established genres (action/space opera/Tolkienesque fantasy) where it is hard to avoid cliché. Even for an experienced writer this is challenging.

Concepts and Planning

The first thing to appreciate about implementing dialogue for games is that it is, to a large degree, an asset-management problem. You need to approach the whole exercise with excellent organization and management skills if you are going to avoid some painful mistakes.

Obviously your dialogue production process cannot start until the story, characters, and mission details have been completed and signed off by all parties. At this stage, the type of dialogue required may fall into one of these broad categories.

Types of speech

Informational Dialogue

This dialogue provides players with the information they need to play the game. This might be instructions on how to use a piece of equipment or designating a rendezvous point. The key points with this dialogue are that it needs to be (1) clear and (2) heard. It is usually clear and articulate and in complete sentences, very different from natural dialogue. This dialogue might be panned to the center speaker or “radioized” over distance (so that when attenuation over distance takes it below a given volume level, it is heard as if over a radio headset), as it is vital that it is heard.

Character Dialogue

This dialogue conveys the thoughts and emotions of the speaker. It is more naturalistic than the informational style because its emphasis is on conveying the emotions of the words. This is more likely to be spatialized in the game world but may still also be “bled” into the center speaker for clarity (see SoundClasses in Chapter 6).

Ambient/Incidental Chatter

This dialogue helps to set the scene for the location or situation you are in. Although crucial for establishing the ambience, the information conveyed by the words is not necessarily vitally important. This would be spatialized to originate from the characters in the game.

Nonspeech Vocal Elements

These are the expressions of pain, dying, and screaming that can make a day in the editing room such a pleasure. Again, these would be spatialized in the game.

The ideal aim is for the informational category to be subsumed into character so that any instructions necessary emerge naturally out of the character’s speech. Given the relative lack of dialogue in games and the nature of what are often very specific goal-based narratives, this is often difficult to achieve.

Repetition and Variation

Earlier in the book we discussed how repetition destroys immersion in games. This is a particularly acute problem for dialogue. Since you were a baby, the human voice has been the most important sound in the world to you. It is the thing you pay most attention to and you have spent your life understanding the subtle nuances of frequency, volume, and timbre that can change a compliment into a sarcastic swipe. Our hearing is most acute for the range within which human speech falls. In other words, we are very sensitive to speech. Even if a character says the same line, in the same way (i.e., playing back the same wav) several hours apart from the first time the character said it, the chances are that we will notice and this will momentarily destroy our immersion in the game world.

The nature of games is that we are often in similar situations again and again, and are consequently receiving similar dialogue (“enemy spotted”/“I’ve been hit”). This indicates that for the most heavily used dialogue for each character, we need to record lots of different versions of the same line.

In addition to needing multiple versions of the same line, you may also need the same line said by different characters. If you have three characters in your group, then you may need three versions of each line, as each character might be in a position where he or she needs to say the line. For example, in our generic military shooter, an enemy might be spotted in a particular position by any of the characters, depending on who got to the location first.

At the start of a game you may be able to choose your character, or at least choose between a male and female character. This will mean at least two versions of every line of your dialogue for the entire game.

The game may have a branching dialogue system where the character’s next line will vary according to which response the player chooses. You only have to go a short way along such a dialogue tree to see the number of lines implied. In this example, four sentences are exchanged; however, the number of sound files necessary for all the permutations is 25.

image
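The figure’s tree is specific, but you can get a rough feel for how quickly any branching conversation multiplies your line count with a small script. The sketch below is a minimal illustration (the conversation is invented, and real trees are rarely full or uniform): it counts every line in a conversation expressed as nested dictionaries, since each one would need its own recording.

```python
# Count how many recorded lines a branching conversation implies.
# Each key is a line of dialogue; its value maps the possible replies to the
# next line (another dict), or to None when that branch of the conversation ends.
conversation = {
    "Halt! State your business.": {
        "I'm expected inside.": {
            "Nobody told me. Wait here.": {
                "Fine, I'll wait.": None,
                "I don't have time for this.": None,
            },
        },
        "None of your business.": {
            "Then you're not coming in.": {
                "We'll see about that.": None,
            },
        },
    },
}

def count_lines(node):
    """Recursively count every line in the tree (each needs its own sound file)."""
    if node is None:
        return 0
    total = 0
    for line, responses in node.items():
        total += 1                       # this line itself
        total += count_lines(responses)  # everything that can follow it
    return total

print(count_lines(conversation))
```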

Localization

Once you’ve got your script with all 30,000 lines worked out, it might dawn on you that most of the world does not actually speak English. The nine foreign-language versions need their dialogue “localized” (if you’re lucky it will just be French, Italian, German, and Spanish). Localization can be a huge job requiring specific skills, which is why many outsourcing companies have sprung up to provide this service. These recording, editing, and organizational skills generally fall outside of the implementation remit of this book; however, these are the principles with which you should become familiar:

1. For the in game cut scenes or full motion videos (FMVs), the foreign language dialogue will need to sync to the existing English-language animations. It would cost too much to reanimate to match the timing of the new language. Many foreign-language voice Actors are very experienced in dubbing for films and so are expert at matching the rhythm and flow of the new lines to the visuals. As you will appreciate, however, particularly with close-ups on the face, this is difficult to do. You can use the English recordings as a template for time reference so that the local Actors can attempt to approximate the same timing. Given that some languages will require considerably more words or syllables to communicate the same sentence, this is never going to be ideal, particularly if your English performance is fast.

2. The new language files need to match the exact length of the existing English language assets so that the overall file size does not differ.

The good news is that you may not need to replace the non-word-based vocals such as screams and expressions of pain, although we believe in some regions “aiieee!” is more popular than “arrgghhh!”.

Localization is achieved in UDK by keeping your [SoundCue]s in separate packages from the wavs they reference. This way the wavs package can have a three-letter language code suffix, and the appropriate one is chosen according to the game’s language settings—for example, _KOR (Korean), _DEU (German), or _JPN (Japanese). See the UDK support files on the UDK website for more details.
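The same suffix convention can be applied when you are organizing or batch-checking your localized assets outside the editor. The helper below is a hypothetical sketch (the package names and language mapping are placeholders, not UDK code): it simply resolves which wav package to use for a given language setting, falling back to the international English package when no localized version exists.

```python
# Hypothetical helper: resolve a localized wav package name from a base name and a
# three-letter language code, mirroring the _INT/_KOR/_DEU/_JPN suffix scheme.
LOCALIZED_SUFFIXES = {"english": "INT", "korean": "KOR", "german": "DEU", "japanese": "JPN"}

def localized_package(base_name, language, available):
    """Return the package to use, falling back to the English (_INT) package."""
    suffix = LOCALIZED_SUFFIXES.get(language.lower(), "INT")
    candidate = f"{base_name}_{suffix}"
    return candidate if candidate in available else f"{base_name}_INT"

packages_on_disk = {"GATA_Dialogue_INT", "GATA_Dialogue_DEU", "GATA_Dialogue_JPN"}
print(localized_package("GATA_Dialogue", "german", packages_on_disk))  # GATA_Dialogue_DEU
print(localized_package("GATA_Dialogue", "korean", packages_on_disk))  # falls back to _INT
```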

Casting

Your game characters have the potential to be speaking off screen on many occasions, so you will want to cast voices that are distinctive and easily differentiated from each other. You’ll note that despite the reality, where an army platoon would often be from a similar geographical location, in games the characters are rather more international, or at least have distinct localized accents, so that the player finds it easier to identify them. Their character is also part of individualizing their voices, so if you have a team then also think about whether you might have the “serious” one, the “joker,” the “rookie,” and the like (or perhaps something more imaginative). The main characters should be particularly distinctive from the others (it has been known in film for a supporting character’s voice to be completely overdubbed with a new one because it was too similar to that of the lead Actor, in order to help the audience differentiate them).

Despite the continuing delusions of certain producers, there is no evidence to support the premise that having a moderately successful Actor playing the voice of Zarg in your latest game will actually lead to any increase in sales, so it’s recommended that you choose a cheaper Actor with some understanding of games. Any budget spent on experienced voice Actors (apart from ones who have done one too many TV commercials) will be wisely invested and may help to preserve your sanity through the long hours ahead.

When specifying the voice talent you are seeking, you should obviously outline the sex, the “voice age” you are looking for (which does not necessarily match their actual age), and the accent. In addition to using well-known voices as references, you may also want to describe the qualities you’re after such as the pitch (high/low), timbre (throaty/nasal), and intonation (how the pitch varies—usually closely linked to accent). It may also be useful to refer to aspects of character and personality. You may wish to consider the Actor’s flexibility, as a good Actor should be able to give you at least four different characters, often many more.

Recording Preparation

This is the key time to get your organizational strategies in place to save time later on. (You might approach this stage differently if you were recording a session for concatenated dialogue; see the ‘Concatenation, Sports, and How Not to Drive People Nuts: Part 2’ section below.)

Script Format and Planning

Although Actors are used to a very specific format for their scripts, it is often essential to use a spreadsheet for tracking game dialogue. From this document it is relatively easy to increase the font size so that it’s easier to read, or, if time allows (or if it speeds things up with the Actors), you can copy the script out from here into a movie script format (see the online Bibliography for this chapter).

Your script needs to contain the lines together with key information about the line. Obviously a dialogue line can be read in many different ways so it’s important that you, and the Actor, know the context of the line and the tone with which it should be read. A classic mistake would be to have the Actor whispering a line and later realize that this line actually takes place just as he’s supposed to be leaping onto a helicopter to escape from an erupting volcano. The line might sound good in the studio, but when put into the game you just would not be able to hear it.

Your spreadsheet should contain at least the following information. (See the downloadable template on the website.)

image

Character. Name of the character.

Line/cue. The actual dialogue you want the Actor to say.

Actor name. The Actor who’s playing the character.

Area/location. Informational and character dialogue are often linked to a specific game location, so these items would be tagged with the mission, area, and character name. Non-specific ambient and incidental dialogue could be named “generic” in the mission column.

Context/situation. This should include the setting and scene description to give as much detail on the scenarios as possible (i.e., distance from player, other details about the sound environment). You might even consider bringing screenshots or video footage from the game. This is simple to do and could save you time. Your characters may be saying the same line under different circumstances, so the Actors will need to know the context.

Inflection. Conversational, angry, sarcastic, panicked, fearful, whisper, shout. (You should agree on a predetermined preamp setting for lines tagged “whisper” or “shout” and make sure this is consistently applied to all lines.)

Effect. Any postprocessing effects, such as “radioize”, that will be applied to the recording later.

Filename. You need to come up with a meaningful system for naming your files (see Appendix A).

Line I.D. You will need some kind of number for each line for identification.

Take or variation number. You’ll usually try to get at least two different versions of the line, even if the first is fine, to give you flexibility later on. In the case of incidental dialogue that may be needed repeatedly, you would try to get many versions.

Keep? It may be immediately obvious that the line is a “keeper” (i.e., good) so it’s worth noting this at the time to save trawling through alternative takes later.
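Because the spreadsheet is ultimately just structured data, it is worth keeping it in a form your tools can read. Below is a minimal sketch (the column names follow the list above; the specific line, character, and filename are invented) showing how a single dialogue line might be written as a record that a script could later use for batch tasks such as renaming files or checking for missing takes.

```python
import csv

# One dialogue line as a record, using the columns described above.
# The values here are invented purely for illustration.
FIELDS = ["character", "line", "actor", "area", "context", "inflection",
          "effect", "filename", "line_id", "take", "keep"]

row = {
    "character": "Sergeant Kemp",
    "line": "Enemy spotted, north ridge!",
    "actor": "TBC",
    "area": "Mission02_Docks",
    "context": "Shouted across open ground during a firefight",
    "inflection": "shout",
    "effect": "radioize",
    "filename": "M02_KEMP_ENEMYSPOTTED_01.wav",
    "line_id": "1047",
    "take": "1",
    "keep": "yes",
}

with open("dialogue_script.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow(row)
```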

There are three other elements you should try to have with you in the studio if at all possible:

1.  The writer. This person knows the context and the intention of the dialogue better than anyone.

2.  A character biography sheet. This should include some background about the character together with any well-known voices you want to reference (e.g., “like a cross between Donald Duck and David Duchovny”).

3.  A pronunciation sheet. If dialogue is going to be part of your life, it’s worth spending some time researching phonemes (the individual sounds from which words are formed—discussed further later in the chapter). This will allow you to read phonetically so that you can remember and consistently re-create the correct pronunciation for each word. If your game is based around the city of Fraqu (pronounced with a silent q, “fraoo”), then make sure someone knows the correct way to say it and get your Actors to say it consistently.

Session Planning

Before booking the Actors and recording sessions you are going to need to make a decision regarding your approach to recording. Are your sessions going to be planned around using individual Actors or groups (ensembles) of Actors? There are advantages and disadvantages to both approaches.

The advantage with ensemble recordings is that the dialogue will flow much more naturally as the characters talk to each other, rather than one person simply making statements to thin air. In addition to allowing you to record all sides of a conversation at the same time, you can also get the kind of chemistry between the Actors that can make dialogue really come to life. The downsides are that this chemistry can also work in reverse. More egos in the room means more pressure on the Actors, more potential for problems or misunderstandings, and, of course, more people to make mistakes and therefore more retakes.

Getting the script to the Actors in advance and putting some time aside for rehearsal can alleviate some of these problems. You are in a different (cheaper) space, so the pressure of the red recording light is off and the Actors can get to know each other, the script, and you. If you go down the ensemble route, you can probably expect to plan for 50 to 80 lines per hour for typical (average 10-word) dialogue lines. For an Actor working alone you could sensibly plan for around 100 lines per hour (with an average of 2 takes per line) although this could be up to 200 with an experienced Actor, or as low as 50 for complex emotional scenes.

In your spreadsheet it’s as easy as sorting by the character name column to see all the lines for that individual; then you can decide the order in which to record them. If you’re working on a narrative-based game, it would be beneficial if you could group the lines into some kind of linear order so that the Actors can track what their characters are going through in terms of the story. It’s more difficult (and hence more time consuming) to skip around in mood from one line to the next. If there are lines where the character is shouting or exerting him- or herself, these can easily strain an Actor’s voice, so you should plan to have these lines toward the end of your session.
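This kind of sorting is easy to script once the dialogue lives in a spreadsheet-style file. The sketch below is one possible approach, assuming the CSV columns suggested earlier (with the area column standing in for story order, which is an assumption): it groups lines by character, keeps each character’s block roughly in story order, and pushes any “shout” lines to the end of that block to protect the Actor’s voice.

```python
import csv
from itertools import groupby

def session_order(csv_path):
    """Return dialogue lines grouped per character, story order first, shouts last."""
    with open(csv_path, newline="") as f:
        lines = list(csv.DictReader(f))

    # Sort by character so each Actor's lines sit together, then by story/area order,
    # with shouted lines pushed to the end of that character's block.
    lines.sort(key=lambda r: (r["character"],
                              r["inflection"].lower() == "shout",
                              r["area"],
                              int(r["line_id"])))

    for character, block in groupby(lines, key=lambda r: r["character"]):
        print(f"--- {character} ---")
        for r in block:
            print(f'{r["line_id"]:>6}  [{r["inflection"]}]  {r["line"]}')
    return lines

# session_order("dialogue_script.csv")
```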

In addition to the documentation, you should prepare and bring the following items to your session to help make your voice Actors more comfortable and therefore more productive:

  Pencils

  Warm and cold water (avoid milk, tea, and coffee as these can build up phlegm in the throat)

  Chewing gum (gets saliva going for dry mouths)

  Green apples (helps to alleviate “clicky” mouths)

  Tissues (for all the spittle)

Good studios are expensive places, so an investment of time and effort into this planning stage will save a lot of money and effort in the long run.

Recording

Wild Lines

The most common recording to be done for a game is a “wild line,” which is a line that is not synced to any kind of picture or pre-existing audio. The dialogue will be recorded first and the animation and lip-sync will then be constructed to match the previously recorded dialogue.

Studio Prep and Setup

You may have to record dialogue in multiple studios, sometimes with the same Actor many months apart, yet you need to ensure that the final lines still have the same characteristics so that one does not pop out as sounding obviously different from another. To ensure consistency between multiple takes and different locations, accurately note your setup so that you can reproduce it again. Avoid using any compression or EQ at the recording stage, as this allows you greater flexibility later on. As noted in the discussion of the spreadsheet, consider establishing particular preamp settings for getting the best signal-to-noise ratio for normal, whispered, and shouted lines.

Make a note of all the equipment you are using (mic, preamp, outboard FX) and take pictures and make measurements of the positions of this equipment within the room and relative to the Actor. Note these in your DAW session.

You will need a music stand or equivalent to hold the script just below head height so that the Actors can read and project their voices directly forward. If the script is positioned too low, they will bow their heads and the dialogue will come out muffled. It’s common to put the mic above the Actor’s head and angled slightly toward the mouth. In addition to being out of the way of the script, this also helps, in combination with a pop shield, to alleviate the “popping” of the microphone that can occur with certain plosive consonants (such as “p” and “t”) that produce a sudden rush of air toward the microphone’s diaphragm.

You might also consider placing a second mic at a greater distance from the Actor, which can come in useful for lines that require the voices to sound more distant (see the discussion of outdoor dialogue in the “Voices at a Distance” section).

Performance

The most important thing about any studio session is not the equipment, it’s the performance of the Actor. Post-production editing can cover a multitude of sins, but it cannot make a flat performance come to life.

The ideal is to have a professional Director. These people know how to communicate with Actors and will get you the very best results. If it’s down to you, then it is about knowing what you want (which your script planning should have provided for you), communicating it effectively to the Actors, and making them comfortable and relaxed. Few things are going to upset an Actor more than being directed by someone who can’t decide what they want. It’s fine to try different approaches, but it’s important to exude the confidence that you know what you’re doing (even if you don’t).

  Talk about the character and context with the Actor.

  Have a good idea of what you want, but be prepared to be surprised by any original angle the Actor may bring—be flexible enough to run with it if it works.

  Be clear and specific in your comments.

  Try to avoid reading the line and getting the Actor to imitate it. Most Actors don’t like “feeding,” and this should be used only as a last resort.

  Get and track multiple takes.

Near and Far (Proximity Effect and Off-Axis)

With the script and microphone in place, it’s unlikely that your Actors will be doing so much moving around that they affect the quality of the sound. However, it’s worth considering the nature of the microphone’s pickup pattern for a moment, as the distance and direction from which the sound source comes to the microphone can have a dramatic effect on the sound. A “normal” distance could be anything from 8 to 18 inches depending on the mic’s characteristics, but you may want to go outside of this range for specific effects.

Proximity Effect

You will no doubt be familiar with the rumbling boom of the dramatic movie trailer voice, or the resonant soothing lows of the late night radio DJ. Both are aided by the proximity effect of directional microphones. When you are very close to a microphone, the bass frequencies in your voice are artificially increased. Although this may be great for your dramatic “voice of God” sequence, you should avoid it for normal dialogue.

Although the distances where this effect becomes apparent can vary from mic to mic, the chances are that any voices within three inches will start to be affected. In addition to helping with “pops,” your correctly positioned pop shield is also a useful tool for keeping an overenthusiastic Actor at a safe distance from the mic.

Voices at a Distance

It is notoriously difficult to get voices that have been recorded in a studio environment to sound like they are outside. Because this is a frequent requirement for game audio, we need to examine the possible solutions.

The first and most important thing is to get actors to project their voices as if they are outside. This takes some practice and skill to overcome the immediately obvious environment they’re in. You can help them by turning down the volume of their own voice in their ears, or even playing back some noisy wind ambience that they feel they need to shout over the top of. You certainly need to avoid the proximity effect, but you should also experiment with placing the microphone at different distances from the Actor and at different angles (your room needs to be properly acoustically treated and “dead” for this). The “off-axis” frequency response of a microphone is significantly different from that when facing it straight on and can be effective, together with the performance, in creating the impression of outside space.

(Other techniques you can try, after the session, to make things sound like they are outside are discussed in the online Appendix E.)

ADR Lines and Motion Capture Recordings

In the film world, “automated” or “automatic” dialogue replacement (ADR) (sometimes referred to as “looping”), is used when the dialogue recorded on the set is of unusable quality. The Actors will come into the studio, watch their original performances on the screen, and attempt to synchronize their new recording with the original dialogue.

On occasion you will also need to use this process for game audio cut scenes. Cut scene animations are either animated in the engine by hand or are based on animation information gathered during motion capture sessions (usually a combination of both). The most natural results in terms of the dialogue are of course derived from a group of Actors working together on the motion capture stage.

Through the use of wireless headset mics, or by booming the Actors (using a boom microphone), it is possible to get good results, but not all motion capture stages are friendly places for sound, so this situation may require ADR.

Even if you know that the dialogue recorded on the stage is not going to be of sufficient quality, you should record it anyway. Listening back to the original “scratch track” can help the Actors remember their original performances and synchronize more easily to the video.

Editing and Processing

A good dialogue editor should be able to nearly match the production rate in terms of productivity (50 to 100 lines per hour). To get the dialogue ready to go into the game, there are a number of processes you may need to apply to the recorded tracks.

Editing

Get rid of any silence at the start and end of the line (top “n” tail). Remember, silence takes up the same amount of memory as sound.

High-Pass Filter

If you recorded in an appropriate environment you likely won’t have any problems like hiss or high-end noise, but you may want to get rid of any low-frequency elements like mic-stand knocks or foot movements by using your EQ to filter off the low end. You can normally filter off anything below around 100 to 150 Hz for male voices and 150 to 200 Hz for female voices without significantly affecting the natural timbre of the voice.
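If you have a large batch of files to clean up, this kind of filtering can also be automated outside your DAW. The sketch below is one possible approach using SciPy (not a tool the book prescribes): it applies a high-pass Butterworth filter at a cutoff you choose per voice, assuming mono or stereo PCM wav files.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def high_pass_wav(in_path, out_path, cutoff_hz=120, order=4):
    """Remove low-frequency rumble (mic-stand knocks, footsteps) below cutoff_hz."""
    rate, data = wavfile.read(in_path)
    audio = data.astype(np.float64)

    # Design a Butterworth high-pass filter and apply it forward and backward
    # (zero phase), so the timing of consonants is not smeared.
    sos = butter(order, cutoff_hz, btype="highpass", fs=rate, output="sos")
    filtered = sosfiltfilt(sos, audio, axis=0)

    wavfile.write(out_path, rate, filtered.astype(data.dtype))

# e.g. ~100-150 Hz for male voices, ~150-200 Hz for female voices:
# high_pass_wav("line_raw.wav", "line_filtered.wav", cutoff_hz=150)
```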

Clicks and Pops

Despite your care with the pop shield, the odd pop may occasionally get through. You can zoom right in and edit the waveform manually or use a click/pop eliminator tool (normally found in the noise reduction/restoration menu of your DAW).

Breaths

What you do with the breaths will depend on the circumstances within which the dialogue will be used. Leaving breaths in will give a more intimate feel, whereas taking them out will leave the dialogue sounding more formal or official. Sometimes they are part of the performance, in which case you should leave well alone; otherwise, experiment with the settings on your expander, a processor that will drop the volume of any audio below a given threshold by a set amount. (At extreme settings an expander serves as a noise gate.)

De-Essing

There are a number of strategies for reducing the volume of sibilant consonants like “s” and “z,” which can be unnaturally emphasized through the recording process, but your DAW’s de-esser will usually do the trick by using a compressor that responds only to these sibilant frequencies.

Levels

Dialogue can vary hugely in dynamic range from a whisper to a shout, so you will often need to use some processing to keep these levels under some kind of control. No one is going to be able to do this better than an experienced engineer riding a fader level, but this is rarely possible given the amount of time it would take to sit through all 350,000 lines. Automatic volume adjustments based on average values of the wav file are rarely successful, as perceived volume change depends on more than simply a change in the sound pressure level (dB). (See Appendix E for a recommended software solution.)
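If you do want to automate level control, measuring perceived loudness rather than raw sample averages gets much closer to what an engineer does by ear. The sketch below is one possible approach (it is not the solution referred to in Appendix E), using the third-party pyloudnorm library to normalize a 16-bit PCM file toward a target integrated loudness.

```python
import numpy as np
import pyloudnorm as pyln
from scipy.io import wavfile

def normalize_loudness(in_path, out_path, target_lufs=-23.0):
    """Measure integrated loudness (ITU-R BS.1770) and normalize toward a target."""
    rate, data = wavfile.read(in_path)                # assumes 16-bit PCM input
    audio = data.astype(np.float64) / np.iinfo(data.dtype).max  # scale to -1..1

    meter = pyln.Meter(rate)                          # BS.1770 loudness meter
    loudness = meter.integrated_loudness(audio)       # measured loudness in LUFS
    normalized = pyln.normalize.loudness(audio, loudness, target_lufs)
    normalized = np.clip(normalized, -1.0, 1.0)       # avoid clipping on loud takes

    wavfile.write(out_path, rate, (normalized * 32767).astype(np.int16))

# normalize_loudness("line_01.wav", "line_01_norm.wav", target_lufs=-23.0)
```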

Rapid Prototyping for Dialogue Systems

Often you will want to get dialogue into the game quickly for testing and prototyping reasons. In addition to simply recording it yourself, you could also try using your computer’s built-in speech synthesizer or one of the many downloadable text-to-speech (TTS) programs. Many of these programs will allow you to batch-convert multiple lines to separate audio files which will allow you to get on with prototyping and testing, and will give you a general indication of the potential size of the final assets. Fortunately for us, UDK has its own built-in text-to-speech functionality that we’ll be using in the next example.
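Outside UDK, a short script can do the same batch job for placeholder assets. The sketch below is one possible approach using the third-party pyttsx3 library (the lines and filenames are invented): it renders each line of a script to its own audio file so you can wire up and test your systems, and gauge rough asset sizes, before the real recordings arrive.

```python
import pyttsx3

# Hypothetical placeholder lines, keyed by the filenames the game expects.
lines = {
    "M02_KEMP_ENEMYSPOTTED_01.wav": "Enemy spotted, north ridge!",
    "M02_KEMP_HIT_01.wav": "I'm hit, I'm hit!",
    "M02_KEMP_CLEAR_01.wav": "All clear, move up.",
}

engine = pyttsx3.init()
engine.setProperty("rate", 160)      # speaking speed in words per minute

for filename, text in lines.items():
    # Queue each line to be synthesized to its own file...
    engine.save_to_file(text, filename)

engine.runAndWait()                  # ...then render the whole batch
```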

Implementation

Branching Dialogue

500 Branching Dialogue

A guard is blocking the door to Room 500. You have to persuade him to let you through to the next area.

This is a simple example of branching dialogue. The game is paused while an additional interface is overlaid that allows you to choose your responses. Different lines are synthesized using the text-to-speech (TTS) function available within a [SoundCue] to react appropriately to the different responses you might give.

image

The system uses UDK’s Scaleform GFx functionality, which allows you to build user interfaces in Flash. These provide the buttons for the player to click on in order to interact and navigate the conversation. (See the UDK documentation for more on GFx.) Depending on how polite you are, the conversation will end with the bot either graciously standing to one side to allow you through or attempting to annihilate you with its mighty weapons of death.

image

The branching dialogue tree is shown below. The boxes in bold indicate outcomes where the bot will attack; the boxes with a hashed outline indicate where the bot has decided to let you through.

image

Although this dialogue tree relies on the simple playback of dialogue lines, you can appreciate how it can create a narrative problem (given the complexities that can arise), an asset problem (given the number of sound files it may require), and an organizational problem (to keep track of this number of lines). The real solution is to work with your programmer to call your [SoundCue]s from a database, as trying to build a visual system (such as the Kismet one here) quickly becomes nightmarish.
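What “calling your [SoundCue]s from a database” means in practice is keeping the branching logic as data rather than as a hand-built visual graph. The sketch below is a minimal illustration of the idea, not UDK code (the node names, cue names, and lines are invented): each node stores the cue to play and the responses that lead to other nodes, and a tiny loop walks the table.

```python
# A branching conversation kept as data. Each node names the cue to play and maps
# the player's possible responses onto the next node ("attack"/"let_through" end it).
DIALOGUE_DB = {
    "guard_greeting": {
        "cue": "Guard_Greeting_Cue",
        "responses": {"Please let me pass.": "guard_consider",
                      "Out of my way, fool.": "attack"},
    },
    "guard_consider": {
        "cue": "Guard_Consider_Cue",
        "responses": {"I'd really appreciate it.": "let_through",
                      "Don't make me ask twice.": "attack"},
    },
}

def run_conversation(start, choose):
    """Walk the table from `start`, using `choose` to pick a response each time."""
    node_id = start
    while node_id in DIALOGUE_DB:
        node = DIALOGUE_DB[node_id]
        print(f"[play cue] {node['cue']}")       # in-game: trigger the cue here
        options = list(node["responses"])
        picked = choose(options)
        node_id = node["responses"][picked]
    return node_id                               # "attack" or "let_through"

# Example: always pick the polite option.
outcome = run_conversation("guard_greeting", choose=lambda opts: opts[0])
print("Outcome:", outcome)
```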

The speech itself is created using TTS functionality within the [SoundCue]. After importing a fake (very short) SoundNodeWave into the [SoundCue], the Use TTS checkbox is ticked and the dialogue entered in the Spoken Text field. You’ll note that the duration of the SoundNodeWave updates to reflect the length of this dialogue, and the actual SoundNodeWave you imported is not played at all. Using a combination of the duration and sample rate, you can record these figures to give you a good indication of the likely size of your files when the final dialogue goes in.

image

You have to be aware that although the duration appears to have changed, the [PlaySound] object will still be flagged as finished after the duration of the original imported SoundNodeWave, so you should add a [Delay] of the approximate length of your line as we have in the Kismet system for this example.
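Using the duration reported by the [SoundCue] and your intended sample rate, a rough size estimate is just arithmetic. The sketch below assumes uncompressed 16-bit mono PCM, so treat the result as an upper bound before any compression the engine applies.

```python
def estimated_wav_bytes(duration_seconds, sample_rate=22050, bit_depth=16, channels=1):
    """Rough uncompressed size: samples per second * bytes per sample * channels."""
    return int(duration_seconds * sample_rate * (bit_depth // 8) * channels)

# e.g. 3,000 lines of dialogue averaging 4 seconds each at 22.05 kHz, 16-bit mono:
total = 3000 * estimated_wav_bytes(4.0)
print(f"{total / (1024 * 1024):.1f} MB before any compression")
```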

AI, Variation, and How Not to Drive People Nuts: Part 1

Dialogue is very effective in making your game AI look stupid. A non-player character (NPC) saying “It’s all quiet” during a pitched battle or saying nothing when you crash an aircraft immediately behind him will instantly break the believability of the game. Dialogue is often an indicator of a change in AI state (“I think I heard something”), and you feel more involved in the game when the dialogue comments on your actions. You are reliant to a great extent on the effectiveness of these AI systems to ensure that your dialogue is appropriate to the current situation. The more game conditions there are, the more specific you can be with the dialogue. Obviously situational “barks” that occur frequently will need the greatest number of variants. You also need to work with your programmers to determine the frequency with which certain cues are called (or if they’re not listening, put some empty slots into those cues).

501 Abuse

image

In this room there is a poor bot just waiting to be repeatedly shot. As the bot is shot, he responds with a vocal rebuke. Because this incident (shooting a bot) may occur many times in your game, you want to ensure that the dialogue does not repeat the same line twice in quick succession.

image

The [SoundCue] for the bot’s responses uses the [Random] node, where the 15 dialogue lines are equally weighted (and therefore equally likely to be chosen). Within the [Random] node you’ll see that the additional option “Randomize Without Replacement” has been chosen. This tracks which inputs have already been triggered and then chooses randomly from the remaining options. The result is a random order of choices that also ensures a given line is not chosen again until all of the other lines have been played, keeping repeats as far apart as possible.
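The same behavior is easy to describe outside the editor as a “shuffle bag.” The following is a minimal sketch of the idea in Python (not engine code): every variation is played once, in random order, before any of them can repeat.

```python
import random

class ShuffleBag:
    """Pick items in random order, refilling only when every item has been used."""
    def __init__(self, items):
        self.items = list(items)
        self.bag = []

    def next(self):
        if not self.bag:               # bag empty: reshuffle all variations
            self.bag = self.items[:]
            random.shuffle(self.bag)
        return self.bag.pop()          # take one without replacement

pain_lines = ShuffleBag([f"ouch_{i:02d}.wav" for i in range(1, 16)])
for _ in range(5):
    print(pain_lines.next())           # no repeats until all 15 have played
```

A refinement for a production system would be to reject a reshuffle whose first pick matches the last line played, so a repeat cannot straddle the boundary between two cycles.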

image

The system in Kismet also uses a [Gate] (see image above) so that a dialogue line, once begun, is not interrupted by another. When each line is finished, the [PlaySound] object opens the [Gate] to allow the next line through.

Exercise 501_00 I think I heard a shot!

To get into the safe, you need to shoot the door 10 times. Each time you do, the guard in the next room thinks he hears something. If you pause for a few seconds, he will then return to normal duty. If you don’t wait, he will be alerted and you will have failed the mission.

Tips

1.  Put a [DynamicTriggerVolume] around the safe. (These are easier to set up for link gun weapon damage than a normal [TriggerVolume]).

2.  In the Collision menu of the properties of this Actor (F4), make sure the BlockRigidBody flag is checked and that the Collision Type drop-down menu is set to COLLIDE_BlockWeapons. Then, within the Collision Component menu, check the first flag, named Block Complex Collision Trace.

3.  With the [DynamicTriggerVolume] selected in the level right-click in Kismet to create a New Event Using [DynamicTriggerVolume] (***) Take Damage. In the properties of this event (Seq Event_Take Damage) set the damage threshold to 1.0.

4.  Send the out of this event to a [SoundCue] that has “Randomize Without Replacement” on at least 10 lines that convey the message “I think I heard something”. Also pass it through a [Gate] (Ctrl + G) to a [Delay] (Ctrl + G) object. This will set the delay before the guard thinks the threat has passed.

5.  After this [Delay] go to another [SoundCue] that indicates the bot has given up and is back to normal (“Oh well, must have been the wind.”)

6.  Link the out of the [Take Damage] event to an [Int Counter] to count the number of shots. Set the value B to 2. If the player shoots twice in quick succession then A=B. Link this to a [SoundCue] to play an alert sound. (Also link it to the guard spawn command that we’ve provided in the exercise.)

7.  Use a looping [Delay] of around 3 seconds to send an Action/[Set Variable] of zero to value A of the [Int Counter] so that the player who waits can then fire again without being detected.

8.  Use another [Int Counter] that you don’t reset to also count the shots fired. When this equals 10, send the output to the explosion [Toggle] and the [Matinee] provided, which opens the safe.

(If you just want to focus on the dialogue, then open EX501_00_Done, where we’ve provided the whole system for you!).

Concatenation, Sports, and How Not to Drive People Nuts: Part 2

In terms of dialogue and sound, sports games present huge challenges. Dialogue is there for authenticity, reinforcement, information, and advice. At the same time as fulfilling all these functions, the speech needs to match, and make sense alongside, the crowd, the camera, and the graphic overlays.

Some elements can be highly repeatable. We’ve seen how informational sounds that carry meaning are less offensive when repeated because we’re interested in the message they convey rather than the sound itself. Likewise, in sports games we can get away with repeating unmemorable words like “Strike!” that simply acknowledge an oft-repeated action. Other dialogue cues, such as remarks about specific situations or jokes, will be much more readily remembered. Your system also needs to show an awareness of time and memory by suppressing certain elements if they repeat within certain time frames. In natural speech we would not continually refer to the player by name but would, after the first mention, replace the name with a pronoun, “he” or “she.” However, if we came back to that player on another occasion later on, we would reintroduce the use of his or her name. As you might begin to appreciate, this is a complex task.
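One way to picture the “time and memory” requirement is as a small cooldown table. The sketch below is purely illustrative (the timings, line IDs, and phrasing rules are invented): it suppresses a comment if the same one played too recently, and swaps a player’s name for a pronoun if that player was mentioned within the last few seconds.

```python
import time

class CommentaryMemory:
    """Track when lines and player names were last used, to avoid unnatural repetition."""
    def __init__(self, line_cooldown=45.0, name_cooldown=20.0):
        self.line_cooldown = line_cooldown
        self.name_cooldown = name_cooldown
        self.last_line = {}   # line id -> last time it was spoken
        self.last_name = {}   # player name -> last time it was mentioned

    def say(self, line_id, template, player, pronoun="he"):
        now = time.monotonic()
        if now - self.last_line.get(line_id, -1e9) < self.line_cooldown:
            return None                                  # too soon: stay quiet
        # Use the player's name the first time, then a pronoun for a while afterwards.
        recently_named = now - self.last_name.get(player, -1e9) < self.name_cooldown
        subject = pronoun if recently_named else player
        self.last_line[line_id] = now
        self.last_name[player] = now
        return template.format(subject=subject)

memory = CommentaryMemory()
print(memory.say("great_tackle", "Great tackle by {subject}!", "Kowalski"))
print(memory.say("great_pass", "{subject} threads a lovely pass.", "Kowalski"))
```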

Sports dialogue can be broadly separated into two categories, “play by play” and “color” commentary.

The play-by-play commentary will reflect the specific action of the game as it takes place. The color commentator will provide background, analysis, or opinion, often during pauses or lulls in the play-by-play action and often reflecting on what has just happened. Obviously some events demand immediate comment from the play-by-play commentator, so this will override the color commentary. After an interruption, a human would most likely come back to the original topic but this time phrasing it in a slightly different way (“As I was saying”). To replicate this tendency in games we’d have to track exactly where any interruption occurred and have an alternative take for the topic (a “recovery” version). However, where the interruption occurred in the sentence would affect how much meaning has already been conveyed (or not). We’d need to examine how people interrupt themselves or change topic midsentence and decide whether to incorporate natural mistakes and fumbles into our system.

In sports games, everything tends to be much more accelerated than it would be in reality. This makes life in terms of repetition more difficult, but more importantly, events happen faster than your speech system can keep up with. A list of things to talk about will build up, but once they get a chance to play they may no longer be relevant (e.g., play has continued and somebody else now has the ball). You need a system of priorities and expiry times; if a second event has overtaken the first, or if some things haven’t been said within a certain period, then they are no longer worth saying.
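A priority-and-expiry queue of this kind can be sketched in a few lines. The example below is a minimal illustration (the event names, priorities, and expiry times are invented): the speech system takes the most important event that has not yet gone stale, and silently drops anything past its expiry.

```python
import heapq
import time

class SpeechQueue:
    """Queue commentary events by priority, discarding any that have expired."""
    def __init__(self):
        self.events = []   # heap of (negative priority, queued time, expiry, text)

    def push(self, text, priority, lifetime_seconds):
        now = time.monotonic()
        heapq.heappush(self.events, (-priority, now, now + lifetime_seconds, text))

    def next_line(self):
        now = time.monotonic()
        while self.events:
            _, _, expires_at, text = heapq.heappop(self.events)
            if now <= expires_at:      # still relevant: say it
                return text
            # otherwise play has moved on: drop the stale comment and keep looking
        return None

queue = SpeechQueue()
queue.push("Lovely interception there.", priority=2, lifetime_seconds=4.0)
queue.push("GOAL!", priority=10, lifetime_seconds=30.0)
print(queue.next_line())   # "GOAL!" wins on priority; the interception may expire unheard
```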

All these challenges would be significant in themselves but are heightened by (1) our innate sensitivity to speech and language and (2) the huge amount of time that many players spend playing these games. The commentator dialogue will be a constant presence over the many months or even years that these games are played. Rarely are games audio systems exposed to such scrutiny as this.

Let’s examine a typical scenario for a sports game to look at a particular challenge regarding the speech. In the English Premier League there are 20 teams. In a season each team will play every other team twice, once at home and once away, giving each team 38 games. At the start of the match we’d like the commentator to say: “Welcome to today’s match with Team A playing at home/away against Team B. This promises to be an exciting match.”

To ensure that this sentence states each of the possible matches in the football season, we’d need to record this statement 760 times to have a version for each of the possible combinations of teams. Now imagine that we have 11 players on each team (not including subs) and each player might intercept a ball that has been passed by another player (or pass to, or tackle, or foul). Similarly in a baseball league there might be hundreds of named players, so our combinatorial explosion has returned once again!
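The arithmetic behind that explosion is easy to check. The short sketch below reproduces the 760 opening lines for the league example and shows how quickly per-player combinations grow (using the example’s 11-player squad and ignoring substitutes).

```python
from itertools import permutations

teams = 20
# "Team A playing at home/away against Team B": every ordered pair of different
# teams, times two for the home/away word.
opening_lines = teams * (teams - 1) * 2
print(opening_lines)       # 760 recordings of one opening sentence

players_per_team = 11
# "Player X passes to Player Y": every ordered pair of different players in a squad.
passes = len(list(permutations(range(players_per_team), 2)))
print(passes)              # 110 combinations within a single 11-player squad, per verb
```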

You can perhaps appreciate from the football/soccer sentence above that it’s only actually certain words within the sentence that need replacing. We could keep the sentence structure the same each time but just swap in/out the appropriate team names or the word “home” or “away.” This method is called concatenation or “stitching.” We’ve already come across the [Concatenator] node in the [SoundCue] that will string together its inputs end to end, and this is what we want to happen to our sentence elements.

Sentences usually comprise one or more phrases. By using a phrase-based approach, you can create a compound sentence by stitching together an opening, middle, and end phrase. This can work well with speech of a relatively relaxed, low intensity. The challenge here is to get the flow of pitch and spacing over the words feeling right across all versions. It is often the spacing of words that stops the phrases from sounding natural and human. (If you want to hear some examples of bad speech concatenation, try phoning up your local utility service or bank.) You can also appreciate how a recording session for a concatenated system would differ from a normal session in that you’d want to construct a script so that you get the maximum number of the words you want in the least amount of time.
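The stitching itself can be prototyped outside the engine to hear how the joins feel. The sketch below is one possible approach using Python’s standard wave module (the filenames are hypothetical): it joins opening, middle, and end phrases end to end, assuming all the source wavs share the same sample rate and bit depth.

```python
import wave

def concatenate_wavs(paths, out_path):
    """Stitch several wav files end to end into one file (formats must match)."""
    frames, params = [], None
    for path in paths:
        with wave.open(path, "rb") as w:
            if params is None:
                params = w.getparams()        # take the format from the first file
            frames.append(w.readframes(w.getnframes()))
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        for chunk in frames:
            out.writeframes(chunk)

# "Nice pass from" + "Player 03" + "to" + "Player 05":
# concatenate_wavs(["nice_pass_from.wav", "player_03.wav", "to.wav", "player_05.wav"],
#                  "stitched_line.wav")
```

Listening to the joins this produces is a quick way to hear the pitch and spacing problems described below before committing to a recording approach.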

502 Football/Soccer

In Room 502 you can see a prototype of a football kick around (five-a-side) for illustration purposes. When you press the button you can see that there is a pass made between random players (represented by the colored spheres).

image

This system works very simply by tracking which player has passed to which player via two variables, StartingBot and EndingBot. When the player passes the ball (press the E key), the opening phrase “Nice pass from” is played. Then the first variable StartingBot is compared against the variables for each player (1 to 5) and the appropriate player name [PlaySound] is chosen. The linking word “to” is played by the central [PlaySound], and then the EndingBot variable is compared to see which player name should be used to complete the sentence.

image

“Nice pass from” “[player name]” “to” “[player name]”

Although functional, you can hear that this does not really sound in any sense natural. In a real match these comments would also vary in inflection depending on the circumstance; for example, a pass of the ball a little way down the field would be different from a pass that’s occurring just in front of the goal of the opposing team with only 30 seconds left in the match. It might be that you record two or three variations to be chosen depending on the current game intensity level. The problem with more excited speech is that we tend to speak more quickly. Even in normal speech our words are often not actually separate but flow into each other. If you say the first part of the phrase above “Nice pass from player 01” you’ll notice that in “nice pass” the “s” sound from “nice” leads into the “p” sound of “pass.” In this case it is not too much of a problem because it is unvoiced (i.e., does not have a tone to it). The “m” of “from,” however, is voiced; even when your mouth is closed the “m” is a tone that then leads into the tone of “p” from “player.” An unusual jump in this tone, produced for example by the words “from” and “player” resulting from two different recording sessions or a different take of the same session, will again feel unnatural.

More “seamless” stitching can be achieved, but it requires a deep understanding of aspects of speech such as phonemes and prosody that are beyond the remit of this introductory book. See the bibliography of this chapter for further reading suggestions.

Crowds

A reasonable approximation of a crowd can be created simply by layering in specific crowd reactions against a constant randomized bed of general crowd noise. This is the nature of sports crowds themselves and so can work well, particularly if linked with some variable fade in/fade outs within the [PlaySound] objects.

image

For more complex blending of sourced crowd recordings, we need to have control over the volume of these sounds while they are playing in order to be able to fade them up or down depending on game variables. This is similar to the approach we used for interactive music layers. By using variables to read through arrays containing volume information, you can control the curves of how the different [SoundCue]s respond to the variables. The good news is that as this isn’t music, you won’t need to worry about aligning them in time.

image

image
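The idea of reading volumes from arrays indexed by a game variable can be sketched very simply. The example below is illustrative only (the layer names and curve values are invented): an “excitement” value between 0 and 1 looks up a gain for each crowd layer from a small table, interpolating between the sampled points.

```python
# Volume curves for each crowd layer, sampled at excitement = 0.0 ... 1.0 in steps of 0.25.
CURVES = {
    "crowd_murmur.wav":   [1.0, 0.8, 0.6, 0.3, 0.1],
    "crowd_chanting.wav": [0.0, 0.3, 0.7, 0.9, 1.0],
    "crowd_roar.wav":     [0.0, 0.0, 0.2, 0.6, 1.0],
}

def layer_gains(excitement):
    """Interpolate each layer's gain from its curve for a 0-1 excitement value."""
    excitement = min(max(excitement, 0.0), 1.0)
    points = len(next(iter(CURVES.values())))
    position = excitement * (points - 1)
    lower = int(position)
    upper = min(lower + 1, points - 1)
    blend = position - lower
    return {layer: curve[lower] + (curve[upper] - curve[lower]) * blend
            for layer, curve in CURVES.items()}

print(layer_gains(0.6))   # murmur fading down while chanting and roar fade up
```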

Exercise 502_00 Concatenation

In this exercise you have a fairground game where you have to shoot the colored ducks.

The game allows the player to shoot three targets (each of which is worth a different number of points) in a row, and then it will recap what the player hit and the total points scored. You have to add the dialogue.

This is the playback system.

“You shot (***), (***) and (***)”

“You scored (***)”

Tips

1.  Consider the elements of dialogue you need.

a.  “You shot” and “You scored”

b.  Ducks. The colors of the Ducks hit (“Red” – worth 15 points, “Green” – worth 10 points, and “Blue” – worth 5 points)

c.  Scores (work out the possible scores you can get from all the combinations of these three numbers: 5, 10, and 15)

2.  The system has been set up for you so you just need to add your [SoundCue]s. The challenge lies in (1) recording and editing the dialogue for the smoothest possible delivery of each line and (2) seeing if you can improve upon the current system.

In-Engine Cut Scenes: A Dialogue Dilemma

Your approach to dialogue in cut scenes will depend on whether they are FMV (Full Motion Video) or in-game. If they are FMV, then you will add the dialogue as you would in a normal piece of linear video such as an animation or a film. This would conventionally be panned to the center speaker, but some panning could work depending on the setup of the particular scene. If the cut scene takes place in game, then you have the same problem that we first came across when discussing cut scenes in Chapter 3. The sound is heard from the point of view of the camera, which can result in some extreme and disconcerting panning of sounds as the camera switches positions. The solution proposed in Area 305 was to mute the game sounds (via SoundModes) and to re-create these via the use of SoundTracks in Matinee. The additional consideration for dialogue here is that since the dialogue will no longer be emitted from an object in the game, then it will not be affected by any attenuation or reverberation effects. If you are going to use this method, you will have to apply these to the sound files themselves in your DAW so that the dialogue sounds like it is taking place within the environment rather than playing the acoustically dry studio recording.

Conclusions

It’s pretty clear from the inordinate amount of effort involved in recording dialogue, and the complexities involved in developing effective concatenation systems, that this is an area of game audio desperate for further technological solutions. We can only hope that it’s only a matter of time before more effective synthesis and processing techniques help to ease some of this often laborious and repetitive work. That way we can get on and concentrate on making it sound good.
