Audio samples from "Combining speakers of multiple languages to improve quality of neural voices"

Authors: Javier Latorre, Charlotte Bailleul, Tuuli Morrill, Alistair Conkie, Yannis Stylianou

Abstract: In this work, we explore multiple architectures and training procedures for developing a multi-speaker and multi-lingual neural TTS system with the goals to a) improve the quality when the available data in the target language is limited and b) enable cross-lingual synthesis. We report results from a large experiment using 30 speakers in 8 different languages across 15 different locales. The system is trained on the same amount of data per speaker. Compared to a single-speaker model, when the suggested system is fine-tuned to a speaker, it produces significantly better quality in most of the cases while it only uses less than 40% of the speaker’s data used to build the single-speaker model. In cross-lingual synthesis, on average, the generated quality is within 80% of native single-speaker models, in terms of Mean Opinion Score.
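
To make the cross-lingual claim concrete, here is a minimal sketch of one way to read "within 80% of native single-speaker models": the Mean Opinion Score (MOS) of the cross-lingual voice divided by the MOS of the native single-speaker voice. The function name `relative_mos` and the scores are hypothetical, chosen only for illustration; they are not results from the paper, and the paper may compute the figure differently.

```python
def relative_mos(cross_lingual_mos: float, native_mos: float) -> float:
    """Return the cross-lingual MOS as a fraction of the native MOS."""
    if native_mos <= 0:
        raise ValueError("native_mos must be positive")
    return cross_lingual_mos / native_mos


# Hypothetical scores, for illustration only (not taken from the paper).
example_ratio = relative_mos(cross_lingual_mos=3.2, native_mos=4.0)
print(f"Cross-lingual quality relative to native: {example_ratio:.0%}")
```

Under this reading, a ratio of 0.8 or higher would count as "within 80%" of the native model.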

The following samples show the quality of the different models in in-lingual synthesis (voice language the same as the text language).

Language	Voice	Recording reference	Unit selection (USEL)	Single speaker model (SingSpkr)	Fine-tune polyglot model without speaker embedding or residual VAE (FT)	Fine-tune polyglot model with residual VAE (FTres)	Fine-tune polyglot model with residual VAE and speaker embedding (FTresSE)	Non fine-tuned polyglot model with residual VAE and speaker embedding (resSE)
Danish	Lower Pitch Voice
Danish	Higher Pitch Voice
German	Lower Pitch Voice
German	Higher Pitch Voice
Australian English	Lower Pitch Voice
Australian English	Higher Pitch Voice
British English	Lower Pitch Voice
British English	Higher Pitch Voice
Irish English	Lower Pitch Voice
Irish English	Higher Pitch Voice
Indian English	Lower Pitch Voice
Indian English	Higher Pitch Voice
American English	Lower Pitch Voice
American English	Higher Pitch Voice
South African English	Lower Pitch Voice
South African English	Higher Pitch Voice
Castilian Spanish	Lower Pitch Voice
Castilian Spanish	Higher Pitch Voice
Mexican Spanish	Lower Pitch Voice
Mexican Spanish	Higher Pitch Voice
Canadian French	Lower Pitch Voice
Canadian French	Higher Pitch Voice
France French	Lower Pitch Voice
France French	Higher Pitch Voice
Italian	Lower Pitch Voice
Italian	Higher Pitch Voice
Dutch	Lower Pitch Voice
Dutch	Higher Pitch Voice
Brazilian Portuguese	Lower Pitch Voice
Brazilian Portuguese	Higher Pitch Voice

The following samples show the quality in cross-lingual synthesis (voice language different from text language)

References from a native single speaker model

Voice	American English	Mexican Spanish	France French	Germany German
Lower Pitch Voice
Higher Pitch Voice

Cross-lingual samples from polyglot models

Voice Language	Voice	Text Language
Danish	Lower Pitch Voice	American English
		Mexican Spanish
		France French
		Germany German
	Higher Pitch Voice	American English
		Mexican Spanish
		France French
		Germany German
German	Lower Pitch Voice	American English
		Mexican Spanish
		France French
	Higher Pitch Voice	American English
		Mexican Spanish
		France French
Australian English	Lower Pitch Voice	Mexican Spanish
		France French
		Germany German
	Higher Pitch Voice	Mexican Spanish
		France French
		Germany German
British English	Lower Pitch Voice	Mexican Spanish
		France French
		Germany German
	Higher Pitch Voice	Mexican Spanish
		France French
		Germany German
Indian English	Lower Pitch Voice	Mexican Spanish
		France French
		Germany German
	Higher Pitch Voice	Mexican Spanish
		France French
		Germany German
Irish English	Lower Pitch Voice	Mexican Spanish
		France French
		Germany German
	Higher Pitch Voice	Mexican Spanish
		France French
		Germany German
American English	Lower Pitch Voice	Mexican Spanish
		France French
		Germany German
	Higher Pitch Voice	Mexican Spanish
		France French
		Germany German
South African English	Lower Pitch Voice	Mexican Spanish
		France French
		Germany German
	Higher Pitch Voice	Mexican Spanish
		France French
		Germany German
Castilian Spanish	Lower Pitch Voice	American English
		France French
		Germany German
	Higher Pitch Voice	American English
		France French
		Germany German
Mexican Spanish	Lower Pitch Voice	American English
		France French
		Germany German
	Higher Pitch Voice	American English
		France French
		Germany German
France French	Lower Pitch Voice	American English
		Mexican Spanish
		Germany German
	Higher Pitch Voice	American English
		Mexican Spanish
		Germany German
Canadian French	Lower Pitch Voice	American English
		Mexican Spanish
		Germany German
	Higher Pitch Voice	American English
		Mexican Spanish
		Germany German
Italian	Lower Pitch Voice	American English
		Mexican Spanish
		France French
		Germany German
	Higher Pitch Voice	American English
		Mexican Spanish
		France French
		Germany German
Dutch	Lower Pitch Voice	American English
		Mexican Spanish
		France French
		Germany German
	Higher Pitch Voice	American English
		Mexican Spanish
		France French
		Germany German
Brazilian Portuguese	Lower Pitch Voice	American English
		Mexican Spanish
		France French
		Germany German
	Higher Pitch Voice	American English
		Mexican Spanish
		France French
		Germany German
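
The layout of the cross-lingual table above appears to follow a simple rule: each voice is paired with every one of the four text locales whose language differs from the voice's own. The sketch below enumerates the rows under that assumption. The language grouping of each locale and all names (`TEXT_LOCALES`, `VOICE_LOCALES`, `cross_lingual_rows`) are hypothetical and inferred from the table, not taken from the paper.

```python
# Text locales used for cross-lingual synthesis, with an assumed language label.
TEXT_LOCALES = [
    ("American English", "English"),
    ("Mexican Spanish", "Spanish"),
    ("France French", "French"),
    ("Germany German", "German"),
]

# Voice locales in table order, with an assumed language label.
VOICE_LOCALES = [
    ("Danish", "Danish"),
    ("German", "German"),
    ("Australian English", "English"),
    ("British English", "English"),
    ("Indian English", "English"),
    ("Irish English", "English"),
    ("American English", "English"),
    ("South African English", "English"),
    ("Castilian Spanish", "Spanish"),
    ("Mexican Spanish", "Spanish"),
    ("France French", "French"),
    ("Canadian French", "French"),
    ("Italian", "Italian"),
    ("Dutch", "Dutch"),
    ("Brazilian Portuguese", "Portuguese"),
]

VOICES = ["Lower Pitch Voice", "Higher Pitch Voice"]


def cross_lingual_rows():
    """List (voice locale, voice, text locale) where the languages differ."""
    rows = []
    for locale, language in VOICE_LOCALES:
        for voice in VOICES:
            for text_locale, text_language in TEXT_LOCALES:
                if text_language != language:
                    rows.append((locale, voice, text_locale))
    return rows


rows = cross_lingual_rows()
print(len(rows))  # number of cross-lingual sample rows
```

With this rule, voices in English, Spanish, French, or German get three text locales each, and voices in the remaining languages get all four.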