The main contribution of Tacotron is undoubtedly an end-to-end, deep-learning-based TTS system that is intelligible and decently natural. Indeed, Tacotron's naturalness is assessed through the MOS and compared with a state-of-the-art parametric system and a state-of-the-art concatenative system (the same ones as in the WaveNet paper). Even though it remains less natural than the latter, it beats the former. Note that the MOS was assessed on a North American English dataset:
| System | North American English MOS |
|---|---|
| Tacotron | 3.82 ± 0.085 |
| Parametric | 3.69 ± 0.109 |
| Concatenative | 4.09 ± 0.119 |
However, it is important to keep in mind that Tacotron makes many deliberately simple design choices. For instance, the Griffin-Lim reconstruction algorithm is lightweight and straightforward, but it is also known to introduce artifacts that can hurt naturalness. Replacing this block of the pipeline with a more powerful technique (such as a neural vocoder) could potentially increase the MOS further. Many other parts can be tuned and improved as well: the model's hyperparameters, the attention mechanism, the learning rate schedule, the loss functions, and more.
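To make the Griffin-Lim step concrete, here is a minimal sketch of the algorithm built on `scipy.signal.stft`/`istft`. This is an illustration of the general technique, not Tacotron's actual implementation (which operates on its own spectrogram parameters); the window sizes and iteration count below are arbitrary choices for the example. The idea is simple: starting from a random phase, alternately invert the spectrogram to a waveform and re-extract the phase of that waveform's STFT, while always keeping the target magnitude.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude, n_iter=50, nperseg=512, noverlap=384):
    """Estimate a waveform from an STFT magnitude spectrogram by
    iteratively refining a phase estimate (Griffin-Lim)."""
    # Start from a random phase estimate.
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        # Invert the current complex spectrogram to a time-domain signal...
        _, signal = istft(magnitude * phase,
                          nperseg=nperseg, noverlap=noverlap)
        # ...then keep only the phase of that signal's STFT,
        # discarding its magnitude in favor of the target one.
        _, _, spec = stft(signal, nperseg=nperseg, noverlap=noverlap)
        phase = np.exp(1j * np.angle(spec))
    _, signal = istft(magnitude * phase, nperseg=nperseg, noverlap=noverlap)
    return signal
```

The artifacts mentioned above stem from the fact that an arbitrary magnitude spectrogram (such as one predicted by a neural network) is generally not exactly consistent with any waveform, so the iteration converges to an approximation whose phase errors are audible as a metallic, reverberant quality.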
Now that we have a strong understanding of how Tacotron works, it is time to get our hands dirty and implement it.