Limitations

The main contribution of Tacotron is undoubtedly the provision of an end-to-end deep learning-based TTS system that is intelligible and decently natural. Indeed, Tacotron's naturalness was assessed through a mean opinion score (MOS) test and compared against a state-of-the-art parametric system and a state-of-the-art concatenative system (the same systems used in the WaveNet paper). Even though it remains less natural than the concatenative system, it beats the parametric one. Note that the MOS was assessed on a North American English dataset:

System          North American English MOS
Tacotron        3.82 ± 0.085
Parametric      3.69 ± 0.109
Concatenative   4.09 ± 0.119

However, it is important to keep in mind that Tacotron relies on several deliberately simple design choices. For instance, the Griffin-Lim reconstruction algorithm is lightweight and straightforward, but it is also known to cause artifacts that can negatively impact naturalness. Replacing this block of the pipeline with a more powerful technique could potentially further increase the MOS. Many other parts can be tuned and improved: the model's hyperparameters, the attention mechanism, the learning rate schedule, the loss functions, and more.
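To make the role of this reconstruction step concrete, here is a minimal sketch of how a predicted magnitude spectrogram can be inverted to a waveform with librosa's Griffin-Lim implementation. The STFT parameters (FFT size, hop and window lengths) and the number of iterations are illustrative assumptions, not the exact values used in the paper.

```python
import numpy as np
import librosa

# Illustrative STFT parameters (assumptions, not the paper's exact settings).
N_FFT = 2048
HOP_LENGTH = 200   # ~12.5 ms at 16 kHz
WIN_LENGTH = 800   # ~50 ms at 16 kHz


def griffin_lim_vocoder(magnitude: np.ndarray, n_iter: int = 50) -> np.ndarray:
    """Reconstruct a time-domain waveform from a magnitude spectrogram.

    Griffin-Lim iteratively estimates the missing phase. It is cheap and
    simple, but the estimated phase is imperfect, which is the source of
    the audible artifacts discussed above.
    """
    return librosa.griffinlim(
        magnitude,
        n_iter=n_iter,
        hop_length=HOP_LENGTH,
        win_length=WIN_LENGTH,
        n_fft=N_FFT,
    )


if __name__ == "__main__":
    # Toy example: build a spectrogram from a synthetic tone, then invert it.
    sr = 16000
    t = np.linspace(0, 1.0, sr, endpoint=False)
    y = 0.5 * np.sin(2 * np.pi * 440.0 * t)
    S = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP_LENGTH,
                            win_length=WIN_LENGTH))
    y_reconstructed = griffin_lim_vocoder(S)
```

In the full system, the magnitude spectrogram would come from the model's decoder output rather than from a real waveform; swapping this block for a neural vocoder is one natural way to chase a higher MOS.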

Now that we have a strong understanding of how Tacotron works, it is time to get our hands dirty and implement it.
