Chapter 18. Audio, Video, and Speech


In This Chapter

• Audio

• Video

• Speech


This chapter covers the areas of rich media that have become increasingly important to software over the past decade: audio, video, and speech (the latter of which could be considered a very special kind of audio). In all three of these areas, Windows Presentation Foundation significantly lowers the bar compared to previous Windows desktop technologies. (Audio, video, and speech are also similar in that it’s difficult to demonstrate them in a book with static pictures!) So, although you might not have considered incorporating these feature areas in the past, you might change your mind after reading this chapter!

Audio

The audio support in WPF is simple to use. But unlike most of WPF, it’s not revolutionary or next-generation, nor does it exploit the latest advances in hardware. Instead, it’s a thin layer over existing functionality in Win32 and Windows Media Player that covers the most common audio needs. You won’t be able to build a professional audio application solely using WPF, but you can easily enhance an application with music and sound effects!

As with many other tasks in WPF, you can accomplish playing audio in multiple ways, each with its own pros and cons. The choices for audio are represented by several different classes:

• SoundPlayer

• SoundPlayerAction

• MediaPlayer

• MediaElement and MediaTimeline

SoundPlayer

The easiest way to play audio files in a WPF application is to use the same mechanism used by non-WPF applications: the System.Media.SoundPlayer class. SoundPlayer, a part of the .NET Framework since version 2.0, is a simple wrapper for the Win32 PlaySound API. This means that it has a bunch of limitations, such as the following:

• It only supports .wav audio files.

• It has no support for playing multiple sounds simultaneously. (Any new sound being played interrupts a currently playing sound.)

• It has no support for varying the volume of sounds.

It is, however, the most lightweight approach for playing a sound, so it’s very appropriate for simple sound effects. The following code shows how to use SoundPlayer to play a sound:

SoundPlayer player = new SoundPlayer("tada.wav");
player.Play();

The string passed to SoundPlayer’s constructor can be any filename or a URL. Starting with version 3.5 of the .NET Framework, you can use any appropriate relative or absolute pack URI, as with controls such as Image. Therefore, the sound file can be included in your project like other WPF binary resources (with a Resource or Content build action), or it can be loose at the site of origin.
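For example, assuming the sound file has been added to the project with a Resource build action under a hypothetical sounds folder, an absolute pack URI can be passed directly to the constructor (a sketch; the path is illustrative):

SoundPlayer player = new SoundPlayer("pack://application:,,,/sounds/tada.wav");
player.Play();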

Calling Play plays the sound asynchronously, but you can also call PlaySync to play it on the current thread, or PlayLooping to make the sound repeat asynchronously until you call Stop (or until any other sound is played from any instance of SoundPlayer, or even via direct calls to the underlying Win32 API).

For performance reasons, the audio file isn’t loaded until the first time the sound is played. But this behavior could cause an unwanted pause, especially if you’re retrieving a large audio file over the network. Therefore, SoundPlayer also defines Load and LoadAsync methods for performing the loading at any point prior to the first playing.
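Here’s a minimal sketch combining these members, assuming a hypothetical loop.wav file:

SoundPlayer player = new SoundPlayer("loop.wav"); // hypothetical file
player.Load();        // preload so the first play doesn't pause
player.PlayLooping(); // repeat asynchronously until stopped
// ... later, perhaps in another event handler:
player.Stop();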

If you want to play a familiar system sound without worrying about its filename and path on the target computer, the System.Media namespace also contains a SystemSounds class with static Asterisk, Beep, Exclamation, Hand, and Question properties. Each property is of type SystemSound, which has its own Play method (for asynchronous nonlooping playing only). However, I would use sounds from this class sparingly (if at all) to avoid annoying users with sounds that they expect to come only from Windows itself!
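For example, the following line asynchronously plays whatever sound the user has configured for the Asterisk event:

SystemSounds.Asterisk.Play();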

SoundPlayerAction

If you want to use SoundPlayer to add simple sound effects to user interface events such as hovering over or clicking a Button, you can easily define the appropriate event handlers that use SoundPlayer in their implementation. However, WPF defines a SoundPlayerAction class (which derives from TriggerAction) that enables you to use SoundPlayer without writing any procedural code.

The following XAML snippet adds EventTriggers directly to a Button that play an audio file when the Button is clicked or the mouse pointer enters its bounds:

<Button>
<Button.Triggers>
  <EventTrigger RoutedEvent="Button.Click">
  <EventTrigger.Actions>
    <SoundPlayerAction Source="click.wav"/>
  </EventTrigger.Actions>
  </EventTrigger>
  <EventTrigger RoutedEvent="Button.MouseEnter">
  <EventTrigger.Actions>
    <SoundPlayerAction Source="hover.wav"/>
  </EventTrigger.Actions>
  </EventTrigger>
</Button.Triggers>
</Button>

SoundPlayerAction simply wraps SoundPlayer in a trigger-friendly way, so it has all the same limitations. Actually, it has even more limitations because you can’t customize how it interacts with SoundPlayer. SoundPlayerAction internally constructs a SoundPlayer instance with its Source value and calls Play whenever the action is invoked. You can’t play the sound synchronously (but why would you want to?), make it loop, or preload the audio file.

MediaPlayer

If the limitations of SoundPlayer and SoundPlayerAction are not acceptable, you can use the WPF-specific MediaPlayer class in the System.Windows.Media namespace. It is built on top of Windows Media Player, so it supports all of its audio formats (.wav, .wma, .mp3, and so on). Multiple sounds can be played simultaneously (although via different instances of MediaPlayer), and the volume can be controlled by setting its Volume property to a double between 0 and 1 (with 0.5 as the default value).

But MediaPlayer has even more features for giving you a lot of control over the audio:

• You can pause the audio with its Pause method (if CanPause is true).

• You can mute the audio by setting its IsMuted property to true.

• You can shift the balance toward the left or right speaker by setting its Balance property to a value between -1 and 1. -1 means that all the audio is sent to the left speaker, 0 (the default) means that the audio is sent equally to both speakers, and 1 means that all the audio is sent to the right speaker.

• For audio formats that support it, you can speed up or slow down the audio (without affecting its pitch) by setting its SpeedRatio property to any nonnegative double value. 1.0 is the default value, so a value less than 1.0 slows it down, whereas a value greater than 1.0 speeds it up.

• You can get the length of the audio clip with its NaturalDuration property (which is unaffected by SpeedRatio) and get the current position with the Position property.

• If the audio format supports seeking, you can even set the current position with the Position property.

Here is the simplest way to use MediaPlayer to play an audio file:

MediaPlayer player = new MediaPlayer();
player.Open(new Uri("music.wma", UriKind.Relative));
player.Play();

A single instance can play multiple audio files, but only one at a time. After you open a file with Open, methods such as Play, Pause, and Stop apply to that file. You can also call Close to release the file (which also stops the audio if it’s currently playing). The file is always played asynchronously, so you would not want to call Close immediately after the preceding code because you wouldn’t hear anything play!
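Here’s a sketch that exercises several of the members just described (the filename is hypothetical):

MediaPlayer player = new MediaPlayer();
player.Open(new Uri("music.wma", UriKind.Relative));
player.Volume = 0.8;      // 0 to 1; 0.5 is the default
player.Balance = -0.5;    // shift toward the left speaker
player.SpeedRatio = 1.25; // play 25% faster without changing pitch
player.Play();
...
if (player.CanPause) player.Pause();        // pause, if supported
player.Position = TimeSpan.FromSeconds(30); // seek, if supported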


Tip

For more details and quirks related to MediaPlayer, be sure to read the upcoming “Video” section, even if you have no intention of using WPF’s video support.


MediaElement and MediaTimeline

MediaPlayer gives you a lot more flexibility than SoundPlayer, but it is designed for procedural code only. (Its main functionality is exposed through methods, its properties are not dependency properties, and its events are not routed events.) Somewhat like how SoundPlayerAction wraps SoundPlayer for declarative use, WPF provides a MediaElement class that wraps MediaPlayer for declarative use.

MediaElement is a full-blown FrameworkElement in the System.Windows.Controls namespace, so it’s meant to be embedded in a user interface, it participates in layout, and so on. (This sounds odd until you realize that MediaElement is also used for video, as discussed in the next section.) MediaElement exposes most of the properties and events of MediaPlayer as dependency properties and routed events.

You can set MediaElement’s Source property to the URI of an audio file, but then it plays as soon as the element is loaded. To declaratively play sounds at arbitrary times instead, you should set Source on the fly by using animation with a MediaTimeline.

Just like the earlier example that uses SoundPlayerAction, the following XAML shows how to use MediaElement and MediaTimeline to play an audio file when a Button is clicked or the mouse pointer enters its bounds:

<MediaElement x:Name="audio"/>
...
<Button>
<Button.Triggers>
  <EventTrigger RoutedEvent="Button.Click">
  <EventTrigger.Actions>
    <BeginStoryboard>
      <Storyboard>
        <MediaTimeline Source="click.wma" Storyboard.TargetName="audio"/>
      </Storyboard>
    </BeginStoryboard>
  </EventTrigger.Actions>
  </EventTrigger>
  <EventTrigger RoutedEvent="Button.MouseEnter">
  <EventTrigger.Actions>
    <BeginStoryboard>
      <Storyboard>
        <MediaTimeline Source="hover.wma" Storyboard.TargetName="audio"/>
      </Storyboard>
    </BeginStoryboard>
  </EventTrigger.Actions>
  </EventTrigger>
</Button.Triggers>
</Button>

In addition to the BeginStoryboard action, you can use the same Storyboard with the PauseStoryboard, ResumeStoryboard, SeekStoryboard, and StopStoryboard actions to pause, resume, seek, and stop the audio.


Tip

To create continuously looping background audio, you can set MediaTimeline’s RepeatBehavior to Forever and use it in a trigger on MediaElement’s Loaded event. Here’s an example:

<MediaElement x:Name="audio">
<MediaElement.Triggers>
  <EventTrigger RoutedEvent="MediaElement.Loaded">
  <EventTrigger.Actions>
    <BeginStoryboard>
      <Storyboard>
        <MediaTimeline Source="music.mp3" Storyboard.TargetName="audio"
          RepeatBehavior="Forever"/>
      </Storyboard>
    </BeginStoryboard>
  </EventTrigger.Actions>
  </EventTrigger>
</MediaElement.Triggers>
</MediaElement>

Unfortunately, a slight pause might be heard every time the audio reaches the end, before it plays again from the beginning. One (admittedly weird) workaround is to create a video containing the desired audio, replace the Source with that video file, and keep the MediaElement hidden from view. This works because WPF has tighter integration with video and supports seamless looping in that case.


Video

WPF’s video support is built on the same MediaPlayer class described in the previous section, and its companion classes, such as MediaElement and MediaTimeline. Therefore, all file formats supported by Windows Media Player (.wmv, .avi, .mpg, and so on) can be easily used in WPF applications as well. In addition, much of the discussion in this section also applies to playing audio with MediaPlayer and/or MediaElement.


Warning: WPF’s audio and video support requires Windows Media Player 10 or higher!

Without at least version 10 of Windows Media Player installed, using MediaPlayer (and the classes built on top of it) throws an exception. This affects only versions of Windows prior to Windows Vista.



Warning: Prior to Windows Vista, Windows Media Player is 32-bit only!

The 64-bit versions of Windows prior to Windows Vista contain only a 32-bit version of Windows Media Player. Because WPF’s video (and richer audio) support is built on Windows Media Player, you can’t use it from a 64-bit application running on these platforms. Instead, you must ensure that your application runs as a 32-bit process (for example, by compiling it for x86). In this case, it automatically uses the 32-bit version of the .NET Framework (which is installed alongside the 64-bit version).


Controlling the Visual Aspects of MediaElement

Like Viewbox and Image, MediaElement has Stretch and StretchDirection properties that control how the video fills the space given to it. Figure 18.1 shows the three different Stretch values operating on a MediaElement placed directly inside a Window:

<Window xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation">
  <MediaElement Source="C:\Users\Public\Videos\Sample Videos\Butterfly.wmv"
    Stretch="XXX"/>
</Window>

FIGURE 18.1 MediaElement in a window with three different Stretch settings.

Of course, the neat thing about MediaElement is that it enables video to be manipulated in richer ways, like most other FrameworkElements. The following XAML, rendered in Figure 18.2, places two instances of a video on top of each other, both half-transparent, both clipped with a circle, and one rotated 180°:

<Canvas>
  <MediaElement Source="C:\Users\Public\Videos\Sample Videos\Butterfly.wmv"
    Opacity="0.5">
  <MediaElement.Clip>
    <EllipseGeometry Center="220,220" RadiusX="220" RadiusY="220"/>
  </MediaElement.Clip>
  <MediaElement.LayoutTransform>
    <RotateTransform Angle="180"/>
  </MediaElement.LayoutTransform>
  </MediaElement>

  <MediaElement Source="C:\Users\Public\Videos\Sample Videos\Butterfly.wmv"
    Opacity="0.5">
  <MediaElement.Clip>
    <EllipseGeometry Center="220,220" RadiusX="220" RadiusY="220"/>
  </MediaElement.Clip>
  </MediaElement>
</Canvas>

FIGURE 18.2 Clipped, rotated, and half-transparent video inside two MediaElements.

Furthermore, by placing MediaElement inside a VisualBrush, you can easily use video just about anywhere—as a background for a ListBox, as a material on a 3D surface, and so on. Just be sure to measure the performance implications before going overboard with VisualBrush and video!


FAQ: How do I take snapshots of individual video frames?

You can set the Position of video to a specific point to “freeze frame” it. But if you want to persist that frame as a separate Image, you can render a MediaElement into a RenderTargetBitmap (just like any other Visual). Here’s an example:

MediaElement mediaElement = ...;
Size desiredSize = ...;
Size dpi = ...;
RenderTargetBitmap bitmap = new RenderTargetBitmap((int)desiredSize.Width,
  (int)desiredSize.Height, dpi.Width, dpi.Height, PixelFormats.Pbgra32);
bitmap.Render(mediaElement);
Image image = new Image();
image.Source = BitmapFrame.Create(bitmap);
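If you want to persist the frame to disk rather than display it, you could encode the RenderTargetBitmap with one of WPF’s bitmap encoders. Here’s a sketch using PngBitmapEncoder (the filename is hypothetical):

PngBitmapEncoder encoder = new PngBitmapEncoder();
encoder.Frames.Add(BitmapFrame.Create(bitmap));
using (FileStream fs = new FileStream("frame.png", FileMode.Create))
  encoder.Save(fs); // writes the captured frame as a .png file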

If you are working with MediaPlayer rather than MediaElement, you could create a DrawingVisual to pass to RenderTargetBitmap’s Render method, as follows:

DrawingVisual visual = new DrawingVisual();
MediaPlayer mediaPlayer = ...;
Size desiredSize = ...;
using (DrawingContext dc = visual.RenderOpen())
{
  dc.DrawVideo(mediaPlayer, new Rect(0, 0, desiredSize.Width,
    desiredSize.Height));
}
RenderTargetBitmap bitmap = ...; // constructed as in the previous snippet
bitmap.Render(visual);

The key to this code is DrawingContext’s DrawVideo method, which accepts an instance of MediaPlayer and a Rect. In fact, MediaElement uses DrawVideo inside its OnRender method to do its own video rendering!


Controlling the Underlying Media

The previous two XAML snippets use the simple approach of setting MediaElement’s Source directly. This causes the media to play immediately when the element is loaded. It’s more likely that you’ll want to play, pause, and stop the video at specific times. As in the “Audio” section, the following XAML accomplishes this with a trigger that uses MediaTimeline. It also contains triggers that use PauseStoryboard and ResumeStoryboard to provide the functionality for a simple media player:

<Grid>
<Grid.Triggers>
  <EventTrigger RoutedEvent="Button.Click" SourceName="playButton">
  <EventTrigger.Actions>
    <BeginStoryboard Name="beginStoryboard">
      <Storyboard>
        <MediaTimeline Source="C:\Users\Public\Videos\Sample Videos\Butterfly.wmv"
          Storyboard.TargetName="video"/>
      </Storyboard>
    </BeginStoryboard>
  </EventTrigger.Actions>
  </EventTrigger>
  <EventTrigger RoutedEvent="Button.Click" SourceName="pauseButton">
  <EventTrigger.Actions>
    <PauseStoryboard BeginStoryboardName="beginStoryboard"/>
  </EventTrigger.Actions>
  </EventTrigger>
  <EventTrigger RoutedEvent="Button.Click" SourceName="resumeButton">
  <EventTrigger.Actions>
    <ResumeStoryboard BeginStoryboardName="beginStoryboard"/>
  </EventTrigger.Actions>
  </EventTrigger>
</Grid.Triggers>

  <MediaElement x:Name="video"/>
  <StackPanel Orientation="Horizontal" VerticalAlignment="Bottom">
    <Button x:Name="playButton" Background="#55FFFFFF" Height="40">Play</Button>
    <Button x:Name="pauseButton" Background="#55FFFFFF" Height="40">Pause</Button>
    <Button x:Name="resumeButton" Background="#55FFFFFF" Height="40">Resume
    </Button>
  </StackPanel>
</Grid>

The user interface includes three translucent Buttons for controlling the video playing underneath them, as shown in Figure 18.3.

FIGURE 18.3 A simple video player, with Buttons that use storyboards to control the video.


Tip

When combining a MediaTimeline with other animations inside the same Storyboard, you might want to customize the way in which these animations are synchronized. Playing media often has an initial delay from loading and buffering, causing it to fall behind other animations. And if you give a Storyboard a fixed duration, it might cut off the end of the media because of such delays.

To change this behavior, you can set Storyboard’s SlipBehavior property to Slip rather than its default value, Grow. This causes all animations to wait until the media is ready before doing anything.
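Here’s a sketch of what this could look like, with a hypothetical video file and an animation on the same target:

<Storyboard SlipBehavior="Slip">
  <MediaTimeline Source="video.wmv" Storyboard.TargetName="video"/>
  <DoubleAnimation Storyboard.TargetName="video"
    Storyboard.TargetProperty="Opacity" From="0" To="1" Duration="0:0:2"/>
</Storyboard>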


Although the default behavior for media specified as the Source of a MediaElement is to begin playing when the element is loaded, you can change this behavior with MediaElement’s LoadedBehavior and UnloadedBehavior properties, both of type MediaState. MediaState is an enumeration with the values Play (the default for LoadedBehavior), Pause, Stop, Close (the default for UnloadedBehavior), and Manual.

If you want to control the media from procedural code, MediaElement exposes the methods of the MediaPlayer it wraps (Play, Stop, and so on), but you can call these only when LoadedBehavior and UnloadedBehavior are set to Manual. In addition, you can set the Position and SpeedRatio properties only when the element is in this manual mode.
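For example, given the MediaElement named video from the earlier snippets, manual mode could look like the following sketch (the filename is hypothetical):

video.LoadedBehavior = MediaState.Manual;
video.UnloadedBehavior = MediaState.Manual;
video.Source = new Uri("video.wmv", UriKind.Relative);
video.Play();                              // allowed only in manual mode
video.SpeedRatio = 2.0;                    // likewise for SpeedRatio...
video.Position = TimeSpan.FromSeconds(10); // ...and Position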

Note that manual mode is applicable only when you don’t have any MediaTimelines in triggers attached to the MediaElement. When MediaElement is an animation target, its behavior is always driven by an animation clock (exposed as its Clock property of type MediaClock) and can’t be altered manually unless you interact with the clock.


Tip

To include streaming audio or video in an application, you can simply set Source to a streaming URL. Any encoding supported by Windows Media Player works, such as ASF-encoded .wmv files. If you want to include a live video feed from a local webcam (which doesn’t have a URL you can point to), see Chapter 19, “Interoperability with Non-WPF Technologies,” which shows a way to accomplish this.



Warning: Media files can’t be embedded resources!

The URIs given as Source values to MediaPlayer, MediaElement, and MediaTimeline are not as general-purpose as the URIs used elsewhere in WPF. They must be paths understood by Windows Media Player, such as absolute or relative file system paths or a URL. This means that there’s no built-in support for referencing a media file embedded as a resource. Ironically, the only mechanism discussed in this chapter that supports specifying media as an arbitrary stream is the otherwise very limited SoundPlayer/SoundPlayerAction!

This also means that you can’t refer to files at the site of origin by using the pack://siteoforigin:,,,/ syntax. Instead, you can hard-code the appropriate path or URL, or programmatically retrieve the site of origin by using ApplicationDeployment.CurrentDeployment.ActivationUri (in the System.Deployment.Application namespace defined in System.Deployment.dll) and then prepend it to a filename to form a fully qualified URI.
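Here’s a sketch of that workaround. Note that it applies only to ClickOnce-deployed applications, and the filename is hypothetical:

if (ApplicationDeployment.IsNetworkDeployed)
{
  // ActivationUri can be null, depending on how the deployment is configured
  Uri siteOfOrigin = ApplicationDeployment.CurrentDeployment.ActivationUri;
  if (siteOfOrigin != null)
    mediaElement.Source = new Uri(siteOfOrigin, "music.wma");
}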



Tip

To diagnose any errors when using MediaPlayer or MediaElement, you should attach an event handler to the MediaFailed event defined by both classes. This could look like the following:

<MediaElement Source="nonExistentFile.wmv" MediaFailed="OnMediaFailed"/>

where the OnMediaFailed code-behind method is defined as follows:

void OnMediaFailed(object o, ExceptionRoutedEventArgs e)
{
  MessageBox.Show(e.ErrorException.ToString());
}

If the Source file doesn’t exist, you’ll now see the following exception rather than silent failure:

System.IO.FileNotFoundException: Cannot find the media file. --->
System.Runtime.InteropServices.COMException (0xC00D1197):
Exception from HRESULT: 0xC00D1197

Most people are surprised when they learn that you need to opt in to this behavior rather than get such exceptions by default. But because of the asynchronous nature of media processing, a directly thrown exception might not be catchable anywhere outside a global handler.



FAQ: How can I get metadata associated with audio or video, such as artist or genre?

WPF does not expose a way to retrieve such metadata. Instead, you must use unmanaged Windows Media Player APIs to access this information.


Speech

The speech APIs in the System.Speech namespace make it easy to incorporate both speech recognition and speech synthesis. They are built on top of Microsoft SAPI APIs and use W3C standard formats for synthesis and recognition grammars, so they integrate very well with existing engines.

Although these System.Speech APIs were introduced with WPF, they are not tied to WPF; you won’t find any dependency properties, routed events, the built-in ability to animate voice, and so on. Therefore, you can easily use them in any .NET desktop application, whether WPF based, Windows Forms based, or even console based.

Speech Synthesis

Speech synthesis, also known as text-to-speech, is the process of turning text into audio. This requires a “voice” to speak the text. Recent versions of Windows have great voices installed by default. Microsoft’s SAPI SDK (a free download at http://microsoft.com/speech) includes other voices and can be installed on just about all versions of Windows.

Bringing Text to Life

To get started with speech synthesis, add a reference to System.Speech.dll to your project. The relevant APIs are in the System.Speech.Synthesis namespace. Getting text to be spoken is as simple as this:

SpeechSynthesizer synthesizer = new SpeechSynthesizer();
synthesizer.Speak("I love WPF!");

The text is spoken synchronously, using the voice, rate, and volume settings chosen in the Text to Speech area of Control Panel. To have text spoken asynchronously, you can call SpeakAsync instead of Speak:

synthesizer.SpeakAsync("I love WPF!");

You can change the rate and volume of the spoken text by setting SpeechSynthesizer’s Rate and Volume properties. They are both integers, but Rate has a range of -10 to 10, whereas Volume has a range of 0 to 100. You can also cancel pending asynchronous speech by calling SpeakAsyncCancelAll.
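For example:

synthesizer.Rate = 2;    // -10 (slowest) to 10 (fastest); 0 is the default
synthesizer.Volume = 50; // 0 (silent) to 100 (loudest); 100 is the default
synthesizer.SpeakAsync("I love WPF!");
...
synthesizer.SpeakAsyncCancelAll(); // cancels any pending asynchronous speech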

If you have multiple voices installed, you can change the voice at any time by calling SelectVoice:

synthesizer.SelectVoice("Microsoft David");

You can enumerate the voices with GetInstalledVoices or even attempt to select a voice with a desired gender and age (which, for some reason, seems a little creepy):

synthesizer.SelectVoiceByHints(VoiceGender.Female, VoiceAge.Adult);

You can even send its output to a .wav file rather than to speakers with the SetOutputToWaveFile method:

synthesizer.SetOutputToWaveFile(@"c:\Users\Adam\Documents\speech.wav");

This affects any subsequent calls to Speak or SpeakAsync. You can point the synthesizer back to the speakers by calling SetOutputToDefaultAudioDevice.

SSML and PromptBuilder

You can do a lot by passing simple strings to SpeechSynthesizer and using its various members to change voices, rate, volume, and so on. But SpeechSynthesizer also supports input in the form of a standard XML-based language known as Speech Synthesis Markup Language (SSML). This enables you to encapsulate complex speech in a single chunk and have more control over the synthesizer’s behavior. You can pass SSML content to SpeechSynthesizer directly via its SpeakSsml and SpeakSsmlAsync methods, but SpeechSynthesizer also has overloads of Speak and SpeakAsync that accept an instance of PromptBuilder.

PromptBuilder is a handy class that makes it easy to programmatically build complex speech input. With PromptBuilder, you can express most of what you could accomplish with an SSML file, but it’s generally simpler to learn than SSML.


Tip

Speech Synthesis Markup Language (SSML) is a W3C Recommendation published at http://w3.org/TR/speech-synthesis.


The following code builds a simple dialog with PromptBuilder and then speaks it by passing it to SpeakAsync:

SpeechSynthesizer synthesizer = new SpeechSynthesizer();
PromptBuilder promptBuilder = new PromptBuilder();

promptBuilder.AppendTextWithHint("WPF", SayAs.SpellOut);
promptBuilder.AppendText("sounds better than WPF.");

// Pause for 2 seconds
promptBuilder.AppendBreak(new TimeSpan(0, 0, 2));

promptBuilder.AppendText("The time is");
promptBuilder.AppendTextWithHint(DateTime.Now.ToString("hh:mm"), SayAs.Time);

// Pause for 2 seconds
promptBuilder.AppendBreak(new TimeSpan(0, 0, 2));

promptBuilder.AppendText("Hey Zira, can you spell queue?");

promptBuilder.StartVoice("Microsoft Zira");
promptBuilder.AppendTextWithHint("queue", SayAs.SpellOut);
promptBuilder.EndVoice();

promptBuilder.AppendText("Do it faster!");

promptBuilder.StartVoice("Microsoft Zira");
promptBuilder.StartStyle(new PromptStyle(PromptRate.ExtraFast));
promptBuilder.AppendTextWithHint("queue", SayAs.SpellOut);
promptBuilder.EndStyle();
promptBuilder.EndVoice();

// Speak all the content in the PromptBuilder
synthesizer.SpeakAsync(promptBuilder);

After you instantiate a PromptBuilder, you keep appending different types of content. The preceding code makes use of AppendTextWithHint to spell out some words (which produces a better pronunciation of WPF) and to pronounce a string representing time (such as “08:25”) more naturally. You can also surround chunks of content with StartXXX/EndXXX methods that change the voice or style of the surrounding text, and you can denote where paragraphs and sentences begin and end. These chunks can be nested, just like the XML elements you would create if you were writing raw SSML.
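For example, here’s a sketch of denoting paragraph and sentence boundaries:

promptBuilder.StartParagraph();
promptBuilder.StartSentence();
promptBuilder.AppendText("This is one sentence.");
promptBuilder.EndSentence();
promptBuilder.StartSentence();
promptBuilder.AppendText("This is another.");
promptBuilder.EndSentence();
promptBuilder.EndParagraph();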


Tip

SpeechSynthesizer even supports playing .wav audio files! You can do this in two easy ways. One is using PromptBuilder’s AppendAudio method:

promptBuilder.AppendAudio("sound.wav");

(You can also include the equivalent directive in an SSML file and pass it to SpeakSsml or SpeakSsmlAsync.)

Another way is to use an overload of Speak or SpeakAsync that accepts a Prompt instance such as FilePrompt. With FilePrompt, you can speak the content of a file, whether it’s a plain-text file, an SSML file, or a .wav file:

synthesizer.SpeakAsync(new FilePrompt("text.txt", SynthesisMediaType.Text));
synthesizer.SpeakAsync(new FilePrompt("content.ssml", SynthesisMediaType.Ssml));
synthesizer.SpeakAsync(new FilePrompt("sound.wav", SynthesisMediaType.WaveAudio));


Speech Recognition

Speech recognition is exactly the opposite of speech synthesis. Recognition is all about extracting speech sounds from an audio input and turning them into text.


Tip

For speech recognition to work, you need to have a speech recognition engine installed and running. Windows Vista or later comes with one, and Office XP or later comes with one as well. You can also install a free one from http://microsoft.com/speech. You can start the built-in Windows engine by selecting Windows Speech Recognition from the Start menu under Accessories, Ease of Access.


Converting Spoken Words into Text

To use speech recognition, you must add a reference to System.Speech.dll to your project (just as with speech synthesis). This time, the relevant APIs are in the System.Speech.Recognition namespace. The simplest form of recognition is demonstrated by the following code, which instantiates a SpeechRecognitionEngine, loads a grammar, and attaches an event handler to its SpeechRecognized event:

SpeechRecognitionEngine engine = new SpeechRecognitionEngine();
engine.LoadGrammar(new DictationGrammar());
engine.SetInputToDefaultAudioDevice();

// Attach the event handler (defined below) before recognition begins:
engine.SpeechRecognized +=
  new EventHandler<SpeechRecognizedEventArgs>(recognizer_SpeechRecognized);

// Keep going until RecognizeAsyncStop or RecognizeAsyncCancel is called:
engine.RecognizeAsync(RecognizeMode.Multiple);

You must manually configure SpeechRecognitionEngine’s input source (such as the default audio device, an audio stream, or a .wav file on disk) and tell it when to start listening via a call to Recognize or RecognizeAsync. If you call RecognizeAsync with RecognizeMode.Multiple, recognition will continually operate in the background until either RecognizeAsyncStop or RecognizeAsyncCancel is called. RecognizeAsyncStop terminates after the current recognition action finishes, whereas RecognizeAsyncCancel terminates recognition immediately.
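As a sketch of the synchronous alternative, the following recognizes a single phrase from a hypothetical .wav file rather than the microphone:

SpeechRecognitionEngine engine = new SpeechRecognitionEngine();
engine.LoadGrammar(new DictationGrammar());
engine.SetInputToWaveFile("recording.wav"); // hypothetical file
RecognitionResult result = engine.Recognize(); // blocks until done
if (result != null)
  textBox.Text += result.Text + " ";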

DictationGrammar, the only grammar shipped in the .NET Framework, is suitable for generic speech recognition. SpeechRecognized is raised whenever spoken words or phrases are converted to text, so a simple implementation could be written as follows:

void recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
  if (e.Result != null)
    textBox.Text += e.Result.Text + " ";
}

This approach is adequate for dictating text into a TextBox. However, this is unnecessary on Windows Vista or later because you already get that functionality for free! For example, if you enable Windows speech recognition via the Control Panel and give any WPF TextBox focus, the words you speak into the microphone automatically appear, as shown in Figure 18.4. This works because the Windows speech recognition system integrates with the UI Automation interfaces exposed by WPF elements. You can even invoke actions such as clicking on Buttons by speaking their automation names! (This is not specific to WPF, but also true of Windows Forms or any other user interface frameworks with built-in integration with Windows accessibility.)

FIGURE 18.4 Dictating content into a WPF TextBox using the Windows Speech Recognition program.

Speech recognition is typically used to add custom spoken commands to a program that are more sophisticated than the default functionality exposed through accessibility. Such commands typically consist of a few words or phrases that an application knows in advance. To handle this efficiently, you need to give SpeechRecognitionEngine more information about your expectations. That’s where SRGS comes in.

Specifying a Grammar with SRGS

If you want to programmatically act on certain words or phrases, writing a SpeechRecognized event handler is tricky if you don’t constrain the input. You need to ignore irrelevant phrases and possibly pick out relevant words from larger phrases that can’t be easily predicted. For example, if one of the words you want to act on is go, do you accept words such as goat, assuming that the recognizer simply misunderstood the user?

To avoid this kind of grunt work and guesswork, SpeechRecognitionEngine supports specifying a custom grammar based on the Speech Recognition Grammar Specification (SRGS). With a grammar that captures your possible valid inputs, the recognizer can automatically ignore meaningless results and improve the accuracy of its recognition.


Tip

Speech Recognition Grammar Specification (SRGS) is a W3C Recommendation published at http://w3.org/TR/speech-grammar.


To attach a custom grammar, you can call the same LoadGrammar method shown earlier. SRGS-based grammars can be described in XML, so the following code loads a custom grammar from an SRGS XML file in the current directory:

SpeechRecognitionEngine engine = new SpeechRecognitionEngine();
SrgsDocument doc = new SrgsDocument("grammar.xml");
engine.LoadGrammar(new Grammar(doc));

SrgsDocument and the other SRGS-related types are defined in the System.Speech.Recognition.SrgsGrammar namespace.

An SrgsDocument can also be built in-memory using a handful of APIs. The following code builds a grammar that allows only two commands, stop and go:

SpeechRecognitionEngine engine = new SpeechRecognitionEngine();
SrgsDocument doc = new SrgsDocument();
SrgsRule command = new SrgsRule("command", new SrgsOneOf("stop", "go"));
doc.Rules.Add(command);
doc.Root = command;
engine.LoadGrammar(new Grammar(doc));

You can express much more intricate grammars, however. The following example could be used by a card game, enabling a user to give commands such as three of hearts or ace of spades to play those cards:

SpeechRecognitionEngine engine = new SpeechRecognitionEngine();
SrgsDocument doc = new SrgsDocument();
SrgsRule command = new SrgsRule("command");
SrgsRule rank = new SrgsRule("rank");
SrgsItem of = new SrgsItem("of");
SrgsRule suit = new SrgsRule("suit");
SrgsItem card = new SrgsItem(new SrgsRuleRef(rank), of, new SrgsRuleRef(suit));
command.Add(card);
rank.Add(new SrgsOneOf("two", "three", "four", "five", "six", "seven",
  "eight", "nine", "ten", "jack", "queen", "king", "ace"));
of.SetRepeat(0, 1);
suit.Add(new SrgsOneOf("clubs", "diamonds", "spades", "hearts"));
doc.Rules.Add(command, rank, suit);
doc.Root = command;
engine.LoadGrammar(new Grammar(doc));

This grammar defines the notion of a card as “rank of suit” where rank has 13 possible values, suit has 4 possible values, and “of” can be omitted (hence the SetRepeat call that allows it to be said zero or one time).

Specifying a Grammar with GrammarBuilder

Specifying grammars with the APIs in System.Speech.Recognition.SrgsGrammar or with an SRGS XML file (whose syntax is not covered here) can be complicated. Therefore, the System.Speech.Recognition namespace also contains a GrammarBuilder class that exposes the most commonly used aspects of recognition grammars via much simpler APIs. Grammar (the type passed to LoadGrammar) has an overloaded constructor that accepts an instance of GrammarBuilder, so it can easily be plugged in wherever you can use an SrgsDocument.

For example, here’s the first grammar from the previous section, reimplemented using GrammarBuilder:

SpeechRecognitionEngine engine = new SpeechRecognitionEngine();
GrammarBuilder builder = new GrammarBuilder(new Choices("stop", "go"));
engine.LoadGrammar(new Grammar(builder));

And here’s the reimplemented card game grammar:

SpeechRecognitionEngine engine = new SpeechRecognitionEngine();
GrammarBuilder builder = new GrammarBuilder();
builder.Append(new Choices("two", "three", "four", "five", "six", "seven",
  "eight", "nine", "ten", "jack", "queen", "king", "ace"));
builder.Append("of", 0, 1);
builder.Append(new Choices("clubs", "diamonds", "spades", "hearts"));
engine.LoadGrammar(new Grammar(builder));

GrammarBuilder doesn’t expose all the power and flexibility of SrgsDocument, but it’s often all that you need. In the card game example, the user can speak “two clubs” or perhaps something that sounds like “too uh cubs,” and the SpeechRecognized event handler should receive the canonical “two of clubs” string. You can get even fancier in your grammars and tag pieces with semantic labels so that the event handler can pick out concepts such as the rank and suit without having to parse even the canonical string.
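Here’s a sketch of that idea, using the SemanticResultKey class to label the rank and suit portions of the card game grammar:

SpeechRecognitionEngine engine = new SpeechRecognitionEngine();
GrammarBuilder builder = new GrammarBuilder();
builder.Append(new SemanticResultKey("rank", new Choices("two", "three",
  "four", "five", "six", "seven", "eight", "nine", "ten", "jack", "queen",
  "king", "ace")));
builder.Append("of", 0, 1);
builder.Append(new SemanticResultKey("suit",
  new Choices("clubs", "diamonds", "spades", "hearts")));
engine.LoadGrammar(new Grammar(builder));

Inside the SpeechRecognized event handler, the labeled pieces can then be retrieved directly, without parsing e.Result.Text:

string rank = (string)e.Result.Semantics["rank"].Value;
string suit = (string)e.Result.Semantics["suit"].Value;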

Summary

WPF’s support for audio, video, and speech rounds out its rich media offerings. The audio support is limited but is enough to accomplish the most common tasks. The video support is only a subset of what’s provided by the underlying Windows Media Player APIs, but the seamless integration with the rest of WPF (so you can transform or animate video just as you can any other content) makes it extremely compelling. WPF’s standards-based speech synthesis and recognition support is state of the art and easy to use, even though it’s mainly just a wrapper on top of the unmanaged Microsoft SAPI APIs.
