How is the performance of a TTS system evaluated?

The mean opinion score (MOS), a subjective measure of sound quality, is one of the most commonly used measures of the performance of a TTS algorithm. Usually, several native speakers are asked to rate the naturalness of each audio sample from 1 (bad quality) to 5 (excellent quality), and the mean of those ratings is the MOS. Audio samples recorded by professionals typically have an MOS of around 4.55, as reported in the paper WaveNet: A Generative Model for Raw Audio (https://arxiv.org/abs/1609.03499), which is presented later in this chapter.
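As a minimal sketch of the arithmetic, the MOS is simply the average of the listeners' ratings. The function name `mean_opinion_score` and the ratings below are hypothetical, chosen for illustration only; they do not come from any published study.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Return the MOS: the arithmetic mean of naturalness ratings on a 1-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating {rating} is outside the 1-5 scale")
    return mean(ratings)


# Hypothetical ratings given by six listeners to one synthesized sample.
ratings = [4, 5, 4, 3, 5, 4]
mos = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f}")
```

In practice each listener usually rates many samples, and the ratings are pooled per system before averaging.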

This way of benchmarking TTS algorithms is not entirely satisfactory, however. In particular, it does not allow for a rigorous comparison of algorithms presented in different papers, because algorithm A is not necessarily evaluated by the same sample of listeners as algorithm B. Different individuals are likely to have somewhat different standards for what sounds natural, so if A has an MOS of 4.2 and B has an MOS of 4.1, it does not necessarily mean that A is better than B (unless both are evaluated within the same study, by the same group of individuals). Besides, both the sample size and the population from which the listeners are drawn are difficult to standardize, and either might affect the score.
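To illustrate why a gap such as 4.2 versus 4.1 can be inconclusive, the sketch below attaches an approximate 95% confidence interval to each MOS. The ratings are invented, the helper names are hypothetical, and the interval uses a normal approximation (z = 1.96), which is an assumption; with listener panels this small, a t-based interval would be somewhat wider.

```python
from math import sqrt
from statistics import mean, stdev


def mos_with_ci(ratings, z=1.96):
    """Return (MOS, lower bound, upper bound) using a normal approximation."""
    n = len(ratings)
    if n < 2:
        raise ValueError("at least two ratings are needed to estimate the spread")
    mos = mean(ratings)
    half_width = z * stdev(ratings) / sqrt(n)  # z times the standard error
    return mos, mos - half_width, mos + half_width


def intervals_overlap(a, b):
    """True if the (MOS, lower, upper) intervals of a and b share any point."""
    return a[1] <= b[2] and b[1] <= a[2]


# Invented ratings from two separate panels of ten listeners each.
ratings_a = [5, 4, 4, 5, 3, 4, 5, 4, 4, 4]  # MOS 4.2
ratings_b = [4, 4, 5, 4, 3, 4, 5, 4, 4, 4]  # MOS 4.1

result_a = mos_with_ci(ratings_a)
result_b = mos_with_ci(ratings_b)
print("A: MOS %.2f, interval [%.2f, %.2f]" % result_a)
print("B: MOS %.2f, interval [%.2f, %.2f]" % result_b)
print("Intervals overlap:", intervals_overlap(result_a, result_b))
```

With these invented numbers the two intervals overlap substantially, so the data alone would not support ranking A above B. An interval of this kind captures only sampling noise within one panel; it does not correct for two panels applying different standards, which is the deeper problem when scores come from different studies.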
