WaveNet

WaveNet is a deep generative network for producing raw audio waveforms. The sound waves it generates mimic the human voice and sound more natural than those of the best existing text-to-speech systems, reducing the gap between system and human performance by 50%.

A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and it can switch between them by conditioning on the speaker's identity. The model is autoregressive and probabilistic, and it can be trained efficiently on audio containing tens of thousands of samples per second.
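Concretely, the WaveNet paper factorizes the joint probability of a waveform x = (x_1, ..., x_T) as a product of conditional distributions, one per audio sample, and adds extra inputs h (such as the speaker's identity) as a conditioning term. The two formulas below restate that factorization:

```latex
% Joint probability of a waveform, factorized autoregressively
p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})

% With an additional conditioning input h (for example, speaker identity)
p(\mathbf{x} \mid \mathbf{h}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}, \mathbf{h})
```

Each new sample therefore depends only on the samples that came before it, which is what makes step-by-step sampling (described later in this section) possible.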

As depicted in the movie Her, allowing people to talk to machines is a long-standing dream of human-computer interaction. The ability of computers to understand speech has improved tremendously over the past few years as a result of deep neural networks (for example, Google Assistant, Siri, Alexa, and Cortana). Generating speech with computers, on the other hand, is a process referred to as speech synthesis or text-to-speech (TTS). In the traditional TTS approach, a large database of short sound fragments is recorded by a single speaker and then combined to form the required utterances. This approach is very inflexible because we can't change the speaker without recording an entirely new database.
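As a rough illustration of why this fragment-based approach is so rigid, consider the following minimal sketch. The fragment names, lengths, and random placeholder audio are invented purely for the example:

```python
import numpy as np

# Crude stand-in for concatenative synthesis: utterances are formed by
# stitching together pre-recorded fragments from a single speaker.
# Random noise stands in for real recorded audio snippets here.
fragments = {
    "hel":   np.random.randn(3200),
    "lo":    np.random.randn(2400),
    "world": np.random.randn(4800),
}

# "Synthesis" is just concatenation of the chosen fragments in order.
utterance = np.concatenate([fragments["hel"], fragments["lo"], fragments["world"]])

# The speaker's voice is baked into the recordings themselves: changing
# the voice means re-recording the entire fragment database.
```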

This difficulty has led to a great need for other methods of generating speech, where all the information that's needed to generate the audio is stored in the parameters of a model. With such a model, the contents and various attributes of the speech can be controlled through the inputs given to the model. Speech generated this way, one sample at a time, produces a waveform that can be examined at several timescales:

[Figures: the generated speech visualized at timescales of 1 second, 100 milliseconds, 10 milliseconds, and 1 millisecond]

The Pixel Recurrent Neural Network (PixelRNN) and Pixel Convolutional Neural Network (PixelCNN) models from Google showed that it's possible to generate images with complex structure by predicting one pixel at a time, or even one color channel at a time, which requires thousands of predictions per image. WaveNet adapts this idea, turning the two-dimensional PixelNets into a one-dimensional model that operates on audio samples; this idea is shown in the following diagram:

The preceding diagram displays the structure of a WaveNet model. WaveNet is a fully convolutional neural network in which the convolutional layers have a variety of dilation factors. These factors allow the receptive field of WaveNet to grow exponentially with depth and cover thousands of timesteps.
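To make the exponential growth of the receptive field concrete, here is a minimal NumPy sketch of a causal dilated convolution. The two-tap filter and the dilation schedule of 1, 2, 4, ..., 512 follow the WaveNet paper; the function itself is a simplified stand-in for a real convolutional layer, not the paper's implementation:

```python
import numpy as np

def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution with a given dilation factor.

    x: input signal, shape (T,)
    weights: filter taps, shape (k,)
    Each output sample depends only on current and past inputs.
    """
    k = len(weights)
    # Left-pad so that out[t] never sees inputs beyond time t (causality).
    pad = (k - 1) * dilation
    x_padded = np.concatenate([np.zeros(pad), x])
    out = np.zeros(len(x))
    for t in range(len(x)):
        for i in range(k):
            # Tap i looks i * dilation samples into the past.
            out[t] += weights[i] * x_padded[pad + t - i * dilation]
    return out

# One stack of layers with filter size 2 and dilations 1, 2, 4, ..., 512
# has a receptive field of 1 + sum((k - 1) * d) input samples.
k = 2
dilations = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512
receptive_field = 1 + sum((k - 1) * d for d in dilations)
print(receptive_field)  # 1024 samples covered by just 10 layers
```

Doubling the dilation at each layer is what lets ten layers cover over a thousand timesteps; a stack of ordinary (undilated) convolutions of the same depth would cover only about ten.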

During training, the input sequences are real waveforms recorded from human speakers. Once training is complete, we generate synthetic utterances by sampling from the network. At each step of sampling, a value is drawn from the probability distribution computed by the network; that value is then fed back in as the input for the next step, and a new prediction is made. Building up samples one step at a time like this is computationally expensive, but it's necessary for generating complex, realistic-sounding audio.
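The following is a minimal sketch of that sampling loop, assuming a hypothetical `predict_proba(context)` that wraps the trained network's forward pass and returns a probability distribution over 256 quantized amplitude levels (the output discretization used in the WaveNet paper). The seed context and level count here are illustrative choices, not the paper's exact procedure:

```python
import numpy as np

def sample_waveform(predict_proba, n_samples, receptive_field, n_levels=256):
    """Autoregressive sampling sketch.

    predict_proba: hypothetical callable; given the most recent samples,
    returns a length-n_levels probability distribution for the next one.
    """
    rng = np.random.default_rng(0)
    # Seed the context with "silence" (the middle quantization level).
    waveform = [n_levels // 2] * receptive_field
    for _ in range(n_samples):
        context = waveform[-receptive_field:]   # most recent samples only
        probs = predict_proba(context)          # one full network forward pass
        nxt = rng.choice(n_levels, p=probs)     # draw a value from the distribution
        waveform.append(nxt)                    # feed it back as the next input
    return np.array(waveform[receptive_field:])
```

Note that every generated sample costs a full forward pass through the network, which is why naive sampling is slow: one second of 16 kHz audio requires 16,000 sequential passes.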

More information about PixelRNN can be found at https://arxiv.org/pdf/1601.06759.pdf, while information about Conditional Image Generation with PixelCNN Decoders can be found at https://arxiv.org/pdf/1606.05328.pdf.