Chapter 14. Speech Synthesis

Kevin A. Lenzo

Talking computers are ubiquitous in science fiction; machines speak so often in the movies that we think nothing of it. From an alien robot poised to destroy the Earth unless given a speech key (“Klaatu barada nikto” in The Day the Earth Stood Still and echoed in Army of Darkness) to the terrifyingly calm HAL 9000 in 2001: A Space Odyssey, machines have communicated with people through speech. The computer in “Star Trek” has spoken and understood speech since the earliest episodes. Speech sounds easy, because it’s natural to us—but it’s not natural for computers.

Let’s ignore the problem of getting computers to think of things worth saying, and consider only turning word sequences into speech. Let’s ignore prosody, too. Prosody—the intonation, rhythm, and timing of speech—is important to how we interpret what’s said, as well as how we feel about it, but it can’t be given proper care in the span of this article, partly because it’s almost completely unsolved. The input to our system is stark and minimal, as is the output—there are no lingering pauses or dynamics, no irony or sarcasm. (Not on purpose, anyway.)

If we are given plain text as input, how should it be spoken? How do we make a transducer that accepts text and outputs audio? In this article, we’ll walk through a series of Perl speech synthesizers. Many of the ideas here apply to both natural and artificial languages, and are recurrent themes in the synthesis work at Carnegie Mellon University, where I spend my days (and nights). The code in this article is available on the web site for this book (http://www.oreilly.com/catalog/tpj3) and at http://www.cs.cmu.edu/~lenzo/tpj/synthesis_examples.html. Audio samples of each method are included.

Pre-Recorded Sentences

As a first attempt, we could use pre-recorded speech for everything we might want our computer to say. It wouldn’t be very flexible, but it would sound great. For every possible input, we’d need a recording. Your phone company probably does this—“What city, please?” or “What listing?”

Spokesvoices like James Earl Jones are recorded as expertly as possible to show off the sound quality, and greet you with phrases like “Welcome to Bell Atlantic.” This sort of thing is pretty easy to do—just play the right sound file at the right time:

while (<>) {
    chomp;              # strip the newline so the hash key matches
    say $speech{$_};
}

%speech is a hash of speech samples, perhaps tied to a database using tie or dbmopen. say is a subroutine that knows about sending audio to something that can play it, and $_ is, of course, Perl’s default scalar variable.
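
One way to write say is to hand each waveform off to an external player. Here’s a minimal sketch, assuming the values in %speech are paths to pre-recorded sound files and that a command-line player such as SoX’s play is installed; both are assumptions of this sketch, not requirements of the technique:

sub say {
    # Sketch only: assumes %speech holds file paths and that SoX's
    # "play" utility is on the PATH; substitute whatever player you have.
    my ($wavefile) = @_;
    return unless defined $wavefile;    # unknown input: stay silent
    system('play', '-q', $wavefile) == 0
        or warn "couldn't play $wavefile (exit status $?)\n";
}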

As limited as this might seem, it is the perfect solution for applications that require only a small vocabulary—in games like WarCraft, for instance, where the characters have a few randomized responses they utter as you order them about—“Yes, Master?” or “Of course, Master.” If you need any more flexibility, though, you’ll be stuck—and if you ever need to record new material, you have to be careful to emulate the same recording conditions.

Lexical Synthesis in One s///

Let’s look at some variations of the expression s/($text)/speech $1/eg (which was the original title of this article as it appeared in TPJ):

while (<>) {
    s/($text)/$speech{$1}/g;
    say $_;
}

Here’s a guide to this code snippet:

  • $text is a regular expression that accepts words.

  • %speech is a hash of audio waveforms, one per word.

  • $1 holds each match of $text (within the parentheses).

  • say is a subroutine that sends a waveform to an audio device.

The s///g replaces all of the words matched by $text with hash entries from %speech; this tokenizes the input, and turns each token into an audio waveform. More formally, $text is an expression that matches words in the language defined by %speech. Whole words are replaced with their audio entries. Words not in the language have no entry in %speech, and therefore produce no output. So, for our purposes, the expression in $text can be quite simple:

$text = '\S+|\s+';

\s+ matches one or more whitespace characters, and \S+ matches one or more non-whitespace characters. Clearly, punctuation is a part of the word in this tokenization. For example, if the previous sentence were the input text, “Clearly,” and “tokenization” would become tokens, as would all the other words; if the sentence you’re reading now were the input, we’d even have quoted words as tokens. If we remove punctuation, and fold the uppercase letters to lowercase so that “Clearly” and “clearly” are the same word, we can generalize our tokenization a bit. With a few changes we can split off some punctuation, in much the same way whitespace was separated from the words. For instance, we could do this:

$text = '[.!?"]|\S+|\s+';

which turns some of the punctuation into separate tokens. We can convert to lowercase with the lc operator:

while (<>) {
    s/($text)/say $speech{lc($1)}/eg;
}

Each sample is spoken immediately as it is matched, rather than being collected and passed to say after the entire line is processed. The /e on the end of the s/// tells Perl to evaluate say $speech{lc($1)} as a full-fledged Perl expression.

This can work nicely for a small, constrained vocabulary, but you need a sample for every word ever to be spoken. The result sounds unnatural, because the words are spoken without regard to their context in the sentence. For instance, words at the end of an utterance are often spoken more slowly, and with a final drop in pitch. This technique doesn’t handle that.

The Out-of-Vocabulary Problem: Synthesis in One s///e

How do we handle “out-of-vocabulary” words? If we wanted our system to speak a word for which there is no presampled audio, we could break it down into smaller parts: phonemes. Just like our written words are composed of an alphabet of letters, our spoken words are composed of an alphabet of phonemes. So we’d need to make a mapping from a sequence of text words to phonemes, and ultimately to audio waveforms.

Now, the regex $text remains the same, but we change the substitution:

while (<>) {
    s/($text)/map { say $speech{$_} } text2phones lc($1)/eg;
}

The %speech hash now contains phonemes. More precisely, it contains phones, which are concrete, context-dependent realizations of abstract phonemes. For instance, the phoneme /T/ can be spoken as the phone [D] in words like “butter,” or as [T] if you pronounce it in isolation (such as at a spelling bee). The audio depends on more than just the phoneme; it also depends on the setting and context.

A couple of points are worth making here. First, notice that the language is still a “regular” grammar—the sort that can be recognized by a vanilla regular expression. A regular grammar deals with patterns of symbols composed with only concatenation (“and”), disjunction (“or”), and closure (zero or more iterations of a pattern). Second, Perl’s so-called “regular expressions” are more powerful than we need, because they can handle irregular grammars as well: features such as backreferences and look-ahead assertions let them recognize languages that aren’t regular at all.

Text-to-Phoneme Conversion

What magic lies hidden in text2phones? English spelling encodes a great deal of history, but the pronunciations of words are not always obvious—for instance, words such as “night,” “knight,” “ought,” and “cough” don’t sound like they have a “gh” in them; well, at least not the “g” as in “great” or the “h” as in “history.” Languages such as Spanish and Greek have closer relationships between their written and spoken forms; Chinese, on the other hand, has no relationship whatsoever.

There are a number of possible strategies for tackling this problem. Many speech synthesizers use a two-tiered method—first, check if it’s in a small dictionary of pronunciations, and, if not, resort to a text-to-phoneme conversion strategy. Any word pronounced properly by the rules is removed from the dictionary, so as to keep the total conversion package reasonably small. Festival, a free, open source synthesizer from the University of Edinburgh, uses this strategy, as does my own Perl-based synthesis, phonebox.
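
To make the two-tiered idea concrete, here is a minimal sketch along those lines; %lexicon and letter_to_phonemes are placeholders for whatever pronouncing dictionary and letter-to-sound rules you actually use (a trained decision tree such as t2p, for instance), not pieces of any particular package:

our %lexicon;    # word => "AH B S ER D AH T IY Z", loaded from a dictionary

sub text2phones {
    # Dictionary first; fall back to letter-to-sound rules for
    # out-of-vocabulary words.
    my $word = lc shift;
    return split ' ', $lexicon{$word} if exists $lexicon{$word};
    return letter_to_phonemes($word);
}

sub letter_to_phonemes {
    # Placeholder for the letter-to-sound fallback (e.g., a decision tree).
    die "letter-to-sound rules not implemented for '$_[0]'\n";
}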

Looking things up in a dictionary is good work when you can get it, but you’ll always run into out-of-vocabulary words for which there’s no predefined pronunciation. Even ignoring the fact that human languages change all the time by adding words and changing usages, the number of entries will get quite large if every word is included. Building generic text-to-phoneme transducers is an interesting problem that requires generalization to unforeseen data; decision trees, borrowed from the discipline of machine learning, work reasonably well. A decision tree is an organization of nodes in which each node represents a decision; you can download Perl code to build dictionary decision trees from http://www.cs.cmu.edu/~lenzo/t2p and from the web site for this book. The technique is described there, and the distribution includes the letter-to-phoneme converter as well as some sample data.

At CMU we train decision trees for text-to-phoneme conversion using pronouncing dictionaries such as the CMU dictionary, the Oxford Advanced Learner’s dictionary, the Moby lexicon, or any number of others. Given a set of words and their pronunciations, a set of alignments between the letters and phones is produced, generating a mapping from letters to phonemes. For example, here are some possible alignments of the letters in the word “algebraically” with a string of phonemes:

  a   l   g   e   b   r   a   i   c   a   l   l   y
 AE   L  JH  AH   B   R  EY  IH   K   _   L   _  IY

There are some tricky issues in generating good alignments; for this example alone there are 78 possible alignments. This number comes from C(13,11)—pronounced “thirteen choose eleven”—which is the number of possible ways you can choose eleven items from a set of thirteen. In this case, there are thirteen letters and eleven phonemes, so there are eleven slots that get phonemes, and two that don’t. The ones that don’t get a phoneme end up aligning with a null. (C(n,k) is equal to n!/(k!(n-k)!), where n! is n factorial, or n • (n-1) • (n-2) • … • 1.) Good alignments are critical for getting acceptable results from any learning method. NetTalk, a neural network for speech output that is successful despite relatively poor performance of its text-to-phoneme rules, uses this sort of encoding of letters in a window of neighboring letters. Methods for generating these alignments are discussed in [Black, Lenzo, and Pagel] and [Pagel, Lenzo, and Black].
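
The count is easy to verify with a few lines of throwaway Perl (purely illustrative; it plays no part in the synthesizer):

sub choose {
    # C(n,k); the running product stays integral at every step.
    my ($n, $k) = @_;
    my $c = 1;
    $c = $c * ($n - $k + $_) / $_ for 1 .. $k;
    return $c;
}
print choose(13, 11), "\n";    # prints 78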

Once the alignments have been generated, they are exploded into feature vectors, one per letter, that include some context: three letters on each side. Each feature (neighboring letter, in this case) is given a name; “L” for the letter itself, “L1” for one to the left, “R1” for one to the right, and so on. The word “absurdities,” for instance, ends up producing eleven feature vectors, as shown in Table 14-1.

Table 14-1. The eleven feature vectors of “absurdities”

L3  L2  L1  L   R1  R2  R3  Phoneme
-   -   -   a   b   s   u   AH
-   -   a   b   s   u   r   B
-   a   b   s   u   r   d   S
a   b   s   u   r   d   i   ER
b   s   u   r   d   i   t   _
s   u   r   d   i   t   i   D
u   r   d   i   t   i   e   AH
r   d   i   t   i   e   s   T
d   i   t   i   e   s   -   IY
i   t   i   e   s   -   -   _
t   i   e   s   -   -   -   Z
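
Producing these windows is mechanical. Here is a minimal sketch of how the vectors in Table 14-1 could be generated, assuming the letter-phoneme alignment is already in hand; the name make_vectors and the hash layout are just for illustration:

sub make_vectors {
    # Slide a seven-letter window over the word, padding with '-' at
    # the edges; each position yields one feature vector.
    my ($letters, $phones) = @_;    # array refs of equal length
    my @padded = (('-') x 3, @$letters, ('-') x 3);
    my @vectors;
    for my $i (0 .. $#$letters) {
        push @vectors, {
            L3 => $padded[$i],   L2 => $padded[$i+1], L1 => $padded[$i+2],
            L  => $padded[$i+3],
            R1 => $padded[$i+4], R2 => $padded[$i+5], R3 => $padded[$i+6],
            Phoneme => $phones->[$i],
        };
    }
    return @vectors;
}

# "absurdities", aligned as in Table 14-1
my @vectors = make_vectors([split //, 'absurdities'],
                           [qw(AH B S ER _ D AH T IY _ Z)]);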

These vectors are taken together to build a decision tree that tests the features and produces an output phoneme (or no phoneme at all, denoted by the underscore in the fifth and tenth vectors). The result is an example of a finite state transducer, the core of several speech synthesis systems, such as the Lucent synthesizer. Here’s one subtree trained from the CMU dictionary:

if ($feature{'L'} eq 'C') {
    if ($feature{'R1'} eq 'A') {
        if ($feature{'L1'} eq 'A') {
            if ($feature{'L2'} eq 'F') {
                return 'S';
            }
            if ($feature{'L2'} eq 'R') {
                if ($feature{'L3'} eq 'U') {
                    return 'S';
                }
                return 'K';
            }
            return 'K';
        }
        ...

This snippet implies that the letter C (tested in the order L, R1, L1, L2) should become the S phoneme (as in the Americanized “façade”) if it has the context shown in Table 14-2.

Table 14-2. The ç in façade

Feature         L3  L2  L1  L   R1  R2  R3  Phoneme
Letter          ?   F   A   C   A   ?   ?   S
Testing order   -   4   3   1   2   -   -

If the second letter to the left were an R instead of an F, the subtree would output the S phoneme if there’s a U just to the left of that R (as in “Curaçao”), and the K phoneme otherwise. You can see that these trees actually represent a great deal of the dictionary directly.

Once the text has been converted into a phoneme sequence using these letter contexts, each phoneme can be immediately replaced with its audio waveform, but only if we ignore the context of the phoneme.

More Context: Two Substitutions

You’ll quickly find that simply stitching together phoneme sequences sounds pretty bad; examples of this are on both the book’s web site and my own. Part of the reason is that the actual sounds of a language depend upon the neighboring sounds—they’re context-dependent. Just as the letter transducer uses context to decide which phoneme to speak, the phone depends upon the neighboring phonemes. We need the phonemes to know who their neighbors are, but as long as we have just one s/// that does the whole job, we can’t do that.

For an example of why context is necessary, consider the words “pat” and “spat.” In “pat,” the p is aspirated—it’s spoken with a puff of air. In “spat,” there is no aspiration. Also, when an R comes after a vowel, it colors the vowel in a way that’s different from an R that’s before a vowel; L does the same thing, revealing the two kinds of [L] in the word “lateral.”

Capturing the context and using it to pick the sample can be accomplished easily with Perl’s irregular expressions:

s/($text)/phonemes lc($1)/eg;
s/(?=(^|$phoneme))\s*($phoneme)\s*(?=($phoneme|$))/say $speech{"$2($1,$3)"}/eg;

Here, the (?=) items are assertions that must be satisfied but aren’t considered part of the matching text.

The first s/// turns a word sequence into a phoneme sequence, and the second speaks a phoneme in the context of its left and right neighbors; it captures some of the coarticulation effects that smear sounds into one another. This would be fine if you had every possible entry for every possible phoneme in context listed in %speech. That number can be quite large—for instance, if the size of the phonemic inventory is, say, 51 sounds, that gives a total of 132,651 (51 cubed) entries in the table. It’s difficult to get complete coverage of these units from a single talker, let alone with consistent quality, so we turn that last hash table lookup into a subroutine. A phoneme in context like this is often called a triphone, meaning one phoneme with its left and right neighbors fixed.

s/($text)/phonemes lc($1)/eg;
s/(?=(^|$phoneme))\s*($phoneme)\s*(?=($phoneme|$))/say best_triphone $2, $1, $3/eg;

sub best_triphone {
    my ($phoneme, $left, $right) = @_;
    return $speech{"$phoneme($left,$right)"}     ||
                    $speech{"$phoneme(,$right)"} ||
                    $speech{"$phoneme($left,)"}  ||
                    $speech{"$phoneme(,)"};
}

So the hash of samples, %speech, needs to contain some entries for the “backed-off” triphones, such as P(,R), which is the phoneme P with an R on the right and anything on the left. It will also need a context-free entry for P, keyed as P(,) in this scheme, in case neither context matches.
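
For P before R, for instance, the relevant entries might look something like this (the waveform values are stand-ins; the keys follow the scheme above):

%speech = (
    'P(S,R)' => $wave_p_between_s_and_r,    # full triphone, as in "spray"
    'P(,R)'  => $wave_p_before_r,           # right context only
    'P(S,)'  => $wave_p_after_s,            # left context only
    'P(,)'   => $wave_p_with_no_context,    # no context at all
);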

This is effectively an implementation of optimality theory—it finds the best non-empty subset given a partially ordered set of constraints. The “unwrapping” of the single phoneme into itself plus context is another example of the two-level conversion we used in the letter-to-phoneme rules: a finite context-sensitive language is expressed with an equivalent regular grammar, which lets us parse in a particularly easy way—with finite state machines. With these context-sensitive rewrite rules, we have a triphone synthesizer. It doesn’t consider prosody, and it performs no analysis beyond one level of neighboring context, but it works.

What Else?

Even with everything described here, there’s still much missing: intonation, stress, voice quality, and everything else that really makes a voice sound like an individual. A complete speech synthesis system is more complex than what you’ve seen here, but the basic techniques can be applied to any system.

Now that speech-interactive systems are actually out there, on the phone and on the desktop, we’re seeing just how unsolved many of these synthesis problems really are. They don’t sound very good; there’s not a lot of individuality in the voices; they certainly don’t sound interested in what they’re doing; but the work goes on. I like to think we’re making progress—the Earth does not stand still. Gort! Klaatu barada nikto.

References

The Festival Speech Synthesis System. http://www.cstr.ed.ac.uk/projects/festival.html.

Phonebox text-to-speech synthesis. http://www.cs.cmu.edu/~lenzo/phonebox/.

t2p text-to-phoneme converter. http://www.cs.cmu.edu/~lenzo/t2p/.

CMU Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.

Oxford Advanced Learner’s Dictionary. http://www.speech.cs.cmu.edu/comp.speech/Section1/Lexical/cuvolad-dict.html.

Moby Lexicon. http://www.dcs.shef.ac.uk/research/ilash/Moby/.

NetTalk data. http://www.boltz.cs.cmu.edu/benchmarks/nettalk.html.

Black, Lenzo, and Pagel, “Issues in Building General Letter to Sound Rules,” for the 1998 ESCA Speech Synthesis Workshop, Jenolan Caves, Blue Mountains, Australia. http://www.cs.cmu.edu/~lenzo/areas/papers/ESCA98/ESCA98_lts.ps.

Friedl, Jeffrey. Mastering Regular Expressions. O’Reilly & Associates, Inc., 1997.

Pagel, Lenzo, and Black, “Letter to Sound Rules for Accented Lexicon Compression,” for the 1998 International Conference on Spoken Language Processing, Sydney, Australia. http://www.cs.cmu.edu/~lenzo/areas/papers/ICSLP98/ICSLP98_lts.ps.

Sejnowski, T.J., and Rosenberg, C.R. Parallel networks that learn to pronounce English text. In Complex Systems, 1, 145-168. 1987.

Sproat, Richard, editor. Multilingual Text-to-Speech Synthesis, the Bell Labs Approach. Kluwer Academic Publishers, 1998.
