Naturalness versus intelligibility 

The quality of a TTS system is traditionally assessed through two criteria: naturalness and intelligibility. This is motivated by the fact that people are sensitive not only to what the audio content is, but also to how that content is delivered. In short, we want a TTS system that produces clear audio content in a human-like way. More precisely, intelligibility concerns how clear and clean the audio is, and naturalness concerns communicating the message with the proper pronunciation, timing, and range of emotions.

With a highly intelligible system, it is effortless for the user to distinguish between different words. On the other hand, when intelligibility is low, some words might be confused with others or be difficult to identify, and the separation between words might be unclear. In most scenarios, intelligibility is the more important of the two criteria. That is because conveying a clear and unambiguous message to the user is often the priority, whether it sounds natural or not. If a user can't understand the generated audio, the system has failed. Therefore, a minimum level of intelligibility is necessary before we try to optimize the naturalness of the generated speech.

When a TTS algorithm has a high level of naturalness, the produced content is so smooth that the user feels like another human being is talking to them. It is hardly possible to tell that the speech was artificially created. On the other hand, a discontinuous, monotonous, and lifeless intonation is typical of unnatural speech.

Note that these are relatively subjective criteria: both depend on how a listener perceives the speech. Therefore, they are not measured with objective metrics. Indeed, because of the nature of the problem, a TTS system can only be evaluated by humans.
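Because the evaluation rests on human judgment, the ratings collected from several listeners have to be summarized in some way. The following is a minimal sketch of one possible approach, not a prescribed procedure. It assumes a hypothetical 1-to-5 rating scale, and the function name summarize_ratings and the scores shown are placeholders invented for illustration, not real measurements.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_ratings(ratings):
    """Return (average rating, half-width of an approximate 95% interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    avg = mean(ratings)
    # Normal approximation; with only a few listeners, a t-distribution
    # would give a more appropriate (wider) interval.
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return avg, half_width


# Placeholder scores on an assumed scale
# (1 = clearly artificial, 5 = indistinguishable from a human speaker).
naturalness_scores = [4, 5, 3, 4, 4, 5, 4, 3]

avg, half_width = summarize_ratings(naturalness_scores)
print(f"naturalness: {avg:.2f} +/- {half_width:.2f}")
```

The same function could be applied to a separate list of intelligibility ratings, which keeps the two criteria apart rather than merging them into a single score.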