WaveNet, in brief

The WaveNet paper was published in 2016 and reported results that outperformed classical TTS approaches. At its core, WaveNet is a generative model of raw audio: it takes a sequence of audio samples as input and predicts the most likely next sample. By adding an extra conditioning input, it can be adapted to other tasks; for instance, if the text transcript is provided alongside the audio during training, WaveNet can be used as a TTS system.
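
To make the autoregressive idea concrete, here is a minimal sampling-loop sketch (not the paper's implementation). It assumes a hypothetical `model.predict` that returns a probability distribution over quantized amplitude values for the next sample, optionally given a conditioning input such as linguistic features:

```python
import numpy as np

def generate(model, n_samples, conditioning=None, context=1024):
    # `model.predict` is a hypothetical interface: given the most recent
    # `context` samples (and an optional conditioning vector), it returns
    # a probability distribution over the quantized values of the next sample.
    audio = [0] * context                          # seed with silence
    for _ in range(n_samples):
        window = np.array(audio[-context:])
        probs = model.predict(window, conditioning)
        audio.append(int(np.random.choice(len(probs), p=probs)))
    return np.array(audio[context:])
```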

WaveNet uses several ideas that make very deep networks trainable. The central one is the dilated causal convolution, which lets the receptive field grow exponentially with depth (check out the paper to learn more about it).
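
As a rough illustration (a PyTorch sketch, not the original implementation, and omitting WaveNet's gated activations and residual/skip connections): a causal convolution pads only on the left, so each output depends only on past samples, and stacking layers with dilations 1, 2, 4, ... doubles the receptive field with every layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """1-D convolution that only sees past samples (causal), with the
    spacing between filter taps controlled by `dilation`."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        # Pad on the left only, so the output at time t never sees inputs > t.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                    # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))     # left-pad the time axis
        return self.conv(x)                  # output keeps the input length

# Dilations 1, 2, 4, ... make the receptive field grow exponentially with depth.
stack = nn.Sequential(*[CausalDilatedConv1d(16, dilation=2 ** i) for i in range(6)])
x = torch.randn(1, 16, 1024)                 # dummy audio-like tensor
print(stack(x).shape)                        # torch.Size([1, 16, 1024])
```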

In the paper, TTS is tackled among other tasks, and the model is not fed raw text directly but engineered linguistic features that require extra domain knowledge. Thus, WaveNet is not an end-to-end TTS model. Besides, the architecture is quite complex, and it requires a lot of tuning as well as a tremendous amount of computational power to get decent results in a reasonable amount of time.

WaveNet is evaluated on both a North American English and a Mandarin Chinese dataset. Mean opinion scores (MOS) are reported and compared with a concatenative (HMM-driven unit selection) and a parametric (LSTM-based) system:

| System        | North American English MOS | Mandarin Chinese MOS |
|---------------|----------------------------|----------------------|
| WaveNet       | 4.21 ± 0.081               | 4.08 ± 0.085         |
| Parametric    | 3.67 ± 0.098               | 3.79 ± 0.084         |
| Concatenative | 3.86 ± 0.137               | 3.47 ± 0.108         |

Since we are more interested in end-to-end models for TTS, we will focus on one of the most popular ones: Tacotron.
