WaveNet, in brief

The WaveNet paper was published in 2016 and reported results that outperformed classical TTS approaches. At its core, WaveNet is a generative model of raw audio: it takes a sequence of audio samples as input and predicts the most likely next sample. By adding an extra conditioning input, it can be adapted to other tasks; for instance, if the text transcript is provided alongside the audio during training, WaveNet can be used as a TTS system.
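
To make the autoregressive idea concrete, here is a minimal sampling-loop sketch (not the paper's implementation). It assumes a hypothetical `model.predict` that returns a probability distribution over quantized amplitude values for the next sample, optionally given a conditioning input such as linguistic features:

```python
import numpy as np

def generate(model, n_samples, conditioning=None, context=1024):
    # `model.predict` is a hypothetical interface: given the most recent
    # `context` samples (and an optional conditioning vector), it returns
    # a probability distribution over the quantized values of the next sample.
    audio = [0] * context                          # seed with silence
    for _ in range(n_samples):
        window = np.array(audio[-context:])
        probs = model.predict(window, conditioning)
        audio.append(int(np.random.choice(len(probs), p=probs)))
    return np.array(audio[context:])
```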

WaveNet uses several ideas that make very deep networks trainable. The central one is the dilated causal convolution, which lets the receptive field grow exponentially with depth (check out the paper to learn more about it).
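
As a rough illustration (a PyTorch sketch, not the original implementation, and omitting WaveNet's gated activations and residual/skip connections): a causal convolution pads only on the left, so each output depends only on past samples, and stacking layers with dilations 1, 2, 4, ... doubles the receptive field with every layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """1-D convolution that only sees past samples (causal), with the
    spacing between filter taps controlled by `dilation`."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        # Pad on the left only, so the output at time t never sees inputs > t.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                    # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))     # left-pad the time axis
        return self.conv(x)                  # output keeps the input length

# Dilations 1, 2, 4, ... make the receptive field grow exponentially with depth.
stack = nn.Sequential(*[CausalDilatedConv1d(16, dilation=2 ** i) for i in range(6)])
x = torch.randn(1, 16, 1024)                 # dummy audio-like tensor
print(stack(x).shape)                        # torch.Size([1, 16, 1024])
```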

In the paper, TTS is tackled among other tasks, and the model is not fed raw text directly but engineered linguistic features that require extra domain knowledge. Thus, WaveNet is not an end-to-end TTS model. Besides, the architecture is quite complex, and it requires a lot of tuning as well as a tremendous amount of computational power to get decent results in a reasonable amount of time.

WaveNet is evaluated on both a North American English and a Mandarin Chinese dataset. Mean opinion scores (MOS) are reported and compared with a concatenative (HMM-driven unit selection) and a parametric (LSTM-based) system:

| System        | North American English MOS | Mandarin Chinese MOS |
|---------------|----------------------------|----------------------|
| WaveNet       | 4.21 ± 0.081               | 4.08 ± 0.085         |
| Parametric    | 3.67 ± 0.098               | 3.79 ± 0.084         |
| Concatenative | 3.86 ± 0.137               | 3.47 ± 0.108         |

Since we are more interested in end-to-end models for TTS, we will focus on one of the most popular ones: Tacotron.
