Traditional techniques – concatenative and parametric models

Before the rise of deep learning in TTS tasks, either concatenative or parametric models were used.

To create a concatenative model, one needs to record high-quality audio, split it into small chunks (units), and then recombine these chunks to form new speech. With parametric models, we have to generate the acoustic features with signal processing techniques, which requires some extra domain knowledge.
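The concatenative pipeline described above can be sketched very roughly as a lookup-and-join over a unit inventory. The sketch below is purely illustrative: the inventory, unit length, and word-level units are all hypothetical (real systems use much finer units such as diphones, selected by cost functions), and random noise stands in for real recordings.

```python
import numpy as np

SAMPLE_RATE = 16_000

# Hypothetical unit inventory: each word maps to a pre-recorded
# half-second waveform (random noise stands in for real audio).
unit_inventory = {
    word: np.random.uniform(-1.0, 1.0, SAMPLE_RATE // 2)
    for word in ["hello", "world", "speech"]
}

def synthesize(text):
    """Naive concatenative synthesis: look up each word's recorded
    unit and join the waveforms end to end."""
    units = [unit_inventory[word] for word in text.lower().split()]
    return np.concatenate(units)

audio = synthesize("hello world")
print(audio.shape)  # two half-second units joined: (16000,)
```

In a production system, the recombination step is far more involved: units are chosen to minimize join discontinuities and matched to the target prosody, which is why the inventories grow so large.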

Concatenative models tend to be intelligible, but lack naturalness. They require a huge dataset that covers as many human-generated audio units as possible, so they usually take a long time to develop.

In general, parametric models perform worse than concatenative models. They may lack intelligibility, and do not sound particularly natural either. This is because the feature generation process is based on how we humans think speech works, and that model of speech is probably biased and restrictive. With a deep learning approach, by contrast, there are very few preconceptions about what speech is: the model learns features that are inherent to the data. That is where the potential of deep learning lies.
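To make the "hand-designed features" point concrete, here is a minimal sketch of a parametric front end. The frame length, hop size, and the two features (frame energy and zero-crossing rate) are illustrative choices, not any particular system's design; the point is that each feature encodes a human assumption about speech, whereas a deep model would learn its own representation from data.

```python
import numpy as np

def hand_crafted_features(signal, frame_len=400, hop=160):
    """Toy parametric feature extraction: slice the waveform into
    overlapping frames and compute two hand-designed features per
    frame. Both features bake in assumptions about what matters
    in speech."""
    frames = [
        signal[i : i + frame_len]
        for i in range(0, len(signal) - frame_len + 1, hop)
    ]
    features = []
    for frame in frames:
        energy = float(np.mean(frame ** 2))  # loudness proxy
        # Zero-crossing rate: a crude voiced/unvoiced indicator.
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2)
        features.append((energy, zcr))
    return np.array(features)

# One second of a 220 Hz sine at 16 kHz as a stand-in for speech.
sig = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
feats = hand_crafted_features(sig)
print(feats.shape)  # one (energy, zcr) pair per frame: (98, 2)
```

A real parametric synthesizer would extract richer features (pitch, spectral envelope, aperiodicity) and drive a vocoder with them, but the bias is the same: whatever the hand-designed features cannot represent, the model cannot produce.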
