Audio samples from "Combining speakers of multiple languages to improve quality of neural voices"


Authors: Javier Latorre, Charlotte Bailleul, Tuuli Morrill, Alistair Conkie, Yannis Stylianou

Abstract: In this work, we explore multiple architectures and training procedures for developing a multi-speaker and multi-lingual neural TTS system with the goals of a) improving the quality when the available data in the target language is limited and b) enabling cross-lingual synthesis. We report results from a large experiment using 30 speakers in 8 different languages across 15 different locales. The system is trained on the same amount of data per speaker. Compared to a single-speaker model, when the suggested system is fine-tuned to a speaker, it produces significantly better quality in most cases while using less than 40% of the speaker’s data used to build the single-speaker model. In cross-lingual synthesis, on average, the generated quality is within 80% of native single-speaker models, in terms of Mean Opinion Score.


The following samples show the quality of the different models in inlingual synthesis (voice language equal to text language)


Language | Voice | Recording reference | Unit selection (USEL) | Single-speaker model (SingSpkr) | Fine-tuned polyglot model without speaker embedding or residual VAE (FT) | Fine-tuned polyglot model with residual VAE (FTres) | Fine-tuned polyglot model with residual VAE and speaker embedding (FTresSE) | Non-fine-tuned polyglot model with residual VAE and speaker embedding (resSE)
Danish Lower Pitch Voice
Higher Pitch Voice
German Lower Pitch Voice
Higher Pitch Voice
Australian English Lower Pitch Voice
Higher Pitch Voice
British English Lower Pitch Voice
Higher Pitch Voice
Irish English Lower Pitch Voice
Higher Pitch Voice
Indian English Lower Pitch Voice
Higher Pitch Voice
American English Lower Pitch Voice
Higher Pitch Voice
South African English Lower Pitch Voice
Higher Pitch Voice
Castilian Spanish Lower Pitch Voice
Higher Pitch Voice
Mexican Spanish Lower Pitch Voice
Higher Pitch Voice
Canadian French Lower Pitch Voice
Higher Pitch Voice
France French Lower Pitch Voice
Higher Pitch Voice
Italian Lower Pitch Voice
Higher Pitch Voice
Dutch Lower Pitch Voice
Higher Pitch Voice
Brazilian Portuguese Lower Pitch Voice
Higher Pitch Voice


The following samples show the quality in cross-lingual synthesis (voice language different from text language)

References from a native single speaker model

Voice | American English | Mexican Spanish | France French | Germany German
Lower Pitch Voice
Higher Pitch Voice

Cross-lingual samples from polyglot models

Voice Language | Voice | Text Language | Fine-tuned polyglot model without speaker embedding or residual VAE (FT) | Fine-tuned polyglot model with residual VAE (FTres) | Fine-tuned polyglot model with residual VAE and speaker embedding (FTresSE) | Non-fine-tuned polyglot model with residual VAE and speaker embedding (resSE)
Danish Lower Pitch Voice American English
Mexican Spanish
France French
Germany German
Higher Pitch Voice American English
Mexican Spanish
France French
Germany German
German Lower Pitch Voice American English
Mexican Spanish
France French
Higher Pitch Voice American English
Mexican Spanish
France French
Australian English Lower Pitch Voice Mexican Spanish
France French
Germany German
Higher Pitch Voice Mexican Spanish
France French
Germany German
British English Lower Pitch Voice Mexican Spanish
France French
Germany German
Higher Pitch Voice Mexican Spanish
France French
Germany German
Indian English Lower Pitch Voice Mexican Spanish
France French
Germany German
Higher Pitch Voice Mexican Spanish
France French
Germany German
Irish English Lower Pitch Voice Mexican Spanish
France French
Germany German
Higher Pitch Voice Mexican Spanish
France French
Germany German
American English Lower Pitch Voice Mexican Spanish
France French
Germany German
Higher Pitch Voice Mexican Spanish
France French
Germany German
South African English Lower Pitch Voice Mexican Spanish
France French
Germany German
Higher Pitch Voice Mexican Spanish
France French
Germany German
Castilian Spanish Lower Pitch Voice American English
France French
Germany German
Higher Pitch Voice American English
France French
Germany German
Mexican Spanish Lower Pitch Voice American English
France French
Germany German
Higher Pitch Voice American English
France French
Germany German
France French Lower Pitch Voice American English
Mexican Spanish
Germany German
Higher Pitch Voice American English
Mexican Spanish
Germany German
Canadian French Lower Pitch Voice American English
Mexican Spanish
Germany German
Higher Pitch Voice American English
Mexican Spanish
Germany German
Italian Lower Pitch Voice American English
Mexican Spanish
France French
Germany German
Higher Pitch Voice American English
Mexican Spanish
France French
Germany German
Dutch Lower Pitch Voice American English
Mexican Spanish
France French
Germany German
Higher Pitch Voice American English
Mexican Spanish
France French
Germany German
Brazilian Portuguese Lower Pitch Voice American English
Mexican Spanish
France French
Germany German
Higher Pitch Voice American English
Mexican Spanish
France French
Germany German



© 2020 Apple Inc. All rights reserved.