Audio samples from "Combining speakers of multiple languages to improve quality of neural voices"
Authors: Javier Latorre, Charlotte Bailleul, Tuuli Morrill, Alistair Conkie, Yannis Stylianou
Abstract: In this work, we explore multiple architectures and training procedures for developing a multi-speaker and multi-lingual neural TTS system with the goals to a) improve the quality when the available data in the target language is limited and b) enable cross-lingual synthesis. We report results from a large experiment using 30 speakers in 8 different languages across 15 different locales. The system is trained on the same amount of data per speaker. Compared to a single-speaker model, when the suggested system is fine tuned to a speaker, it produces significantly better quality in most of the cases while it only uses less than 40% of the speaker’s data used to build the single-speaker model. In cross-lingual synthesis, on average, the generated quality is within 80% of native single-speaker models, in terms of Mean Opinion Score.
The following samples show the quality of the different models in inlingual synthesis (voice language equal to text language)
Language
Voice
Recording reference
Unit selection (USEL)
Single speaker model (SingSpkr)
Fine-tuned polyglot model without speaker embedding or residual VAE (FT)
Fine-tuned polyglot model with residual VAE (FTres)
Fine-tuned polyglot model with residual VAE and speaker embedding (FTresSE)
Non fine-tuned polyglot model with residual VAE and speaker embedding (resSE)
Danish
Lower Pitch Voice
Higher Pitch Voice
German
Lower Pitch Voice
Higher Pitch Voice
Australian English
Lower Pitch Voice
Higher Pitch Voice
British English
Lower Pitch Voice
Higher Pitch Voice
Irish English
Lower Pitch Voice
Higher Pitch Voice
Indian English
Lower Pitch Voice
Higher Pitch Voice
American English
Lower Pitch Voice
Higher Pitch Voice
South African English
Lower Pitch Voice
Higher Pitch Voice
Castilian Spanish
Lower Pitch Voice
Higher Pitch Voice
Mexican Spanish
Lower Pitch Voice
Higher Pitch Voice
Canadian French
Lower Pitch Voice
Higher Pitch Voice
France French
Lower Pitch Voice
Higher Pitch Voice
Italian
Lower Pitch Voice
Higher Pitch Voice
Dutch
Lower Pitch Voice
Higher Pitch Voice
Brazilian Portuguese
Lower Pitch Voice
Higher Pitch Voice
The following samples show the quality in cross-lingual synthesis (voice language different from text language)
References from a native single speaker model
Voice
American English
Mexican Spanish
France French
Germany German
Lower Pitch Voice
Higher Pitch Voice
Cross-lingual samples from polyglot models
Voice Language
Voice
Text Language
Fine-tuned polyglot model without speaker embedding or residual VAE (FT)
Fine-tuned polyglot model with residual VAE (FTres)
Fine-tuned polyglot model with residual VAE and speaker embedding (FTresSE)
Non fine-tuned polyglot model with residual VAE and speaker embedding (resSE)