Speech synthesis results
We showcase synthesized speech from the baselines and the proposed method. Our goal is to mimic both the voice characteristics of a speaker and the recording conditions (such as noise and microphone response). To remove the effect of the pretrained vocoder when comparing synthesized speech samples with real speech samples, all real speech samples are converted to mel-spectrograms and reconstructed back to waveforms using the same vocoder that the generative models use.
Our setting:
- We only need (audio, text) pairs during training.
- We do not use any style labels such as speaker IDs, nor any attribute labels.
- Note that our model is a sequence-to-sequence model, not an image model.
We trained two models using the proposed method: one on the LibriTTS dataset, which contains ~500 hours of audio from ~2,300 speakers, and the other on the VCTK dataset, which contains ~40 hours of audio from 110 speakers.
Proposed and baseline methods:
- gst-n: Global style tokens with n tokens (unsupervised).
- proposed: Our proposed style equalization (unsupervised).
- gst-nS: A supervised variant of gst-n that uses a pretrained speaker embedding (shown as gst-64s, gst-192s, etc. in the tables). The speaker embedding requires speaker IDs to train and was trained on 2000 hours of audio from 7000 speakers in the VoxCeleb dataset. Note that this information is not accessible to the proposed method.
All models in the same comparison are trained on the same dataset (gst-nS has additional speaker information). All models generate mel-spectrograms with the same sampling rate and window size, which are then reconstructed to waveforms using the same vocoder (WaveGlow). To isolate the effect of the vocoder, each real style input is also presented after being converted to a mel-spectrogram and reconstructed by the same vocoder.
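The comparison protocol above (every sample, real or synthesized, passes through one mel-spectrogram front end and one vocoder) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the parameter values (22050 Hz, 1024-sample window, 256-sample hop, 80 mel bands), and the HTK-style mel formula are all assumptions, and `vocoder` is a hypothetical callable standing in for WaveGlow.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters with centers spaced evenly on the mel scale.
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = mel_pts[i], mel_pts[i + 1], mel_pts[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mel_spectrogram(wav, sr=22050, n_fft=1024, hop=256, n_mels=80):
    # Assumed settings; the point is that every model shares the same ones.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))          # (n_frames, n_fft // 2 + 1)
    mel = mag @ mel_filterbank(sr, n_fft, n_mels).T    # (n_frames, n_mels)
    return np.log(np.clip(mel, 1e-5, None))

def through_same_vocoder(wav, vocoder):
    # Real speech takes the same mel -> waveform path as model outputs,
    # so vocoder artifacts affect both sides of the comparison equally.
    return vocoder(mel_spectrogram(wav))

# One second of a 440 Hz tone as a stand-in for a real recording.
wav = np.sin(2 * np.pi * 440.0 * np.arange(22050) / 22050)
mel = mel_spectrogram(wav)  # shape (83, 80)
```

With a real vocoder in place of the hypothetical callable, `through_same_vocoder(real_wav, vocoder)` would give the "real" reference that is compared against each model's output.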
LibriTTS, unseen speaker, nonparallel text
We randomly select style examples from the dev-clean split of the LibriTTS dataset. The input text is fixed (shown above each table) while we change the style inputs.
Input text 1:
I did not see any reason to change the captain.
style text | style input | gst-64 (unsupervised) | gst-192 (unsupervised) | proposed (unsupervised) | gst-64s (supervised) | gst-192s (supervised) |
---|---|---|---|---|---|---|
When the candle ends sent up their conical yellow flames, all the colored figures from Austria stood out clear and full of meaning against the green boughs. | ||||||
The man shrugged his broad shoulders and turned back into the arabesque chamber. | ||||||
As it dropped it set at liberty three legs on hinges, which supported the panel when let down, and which placed themselves straight on the ground like the legs of a table, and supported it above the earth like a platform. | ||||||
People came running in from all sides; they threw water in the princess's face and did all they could to restore her, but nothing would bring her to. | ||||||
Some do better than others, but none build like Mother Magpie. |
Input text 2:
Next year it plans to open an office in Tokyo.
style text | style input | gst-64 (unsupervised) | gst-192 (unsupervised) | proposed (unsupervised) | gst-64s (supervised) | gst-192s (supervised) |
---|---|---|---|---|---|---|
I had meant it to be the story of my life, but how little of my life is in it! | ||||||
As the inspiring music, the grand tramp drew near, Christie felt the old thrill and longed to fall in and follow the flag anywhere. | ||||||
We saw the United States flag flying from the ramparts, and thought that Yank would probably be asleep or catching lice, or maybe engaged in a game of seven-up. | ||||||
We can give this poor beggar some alms and send him away with a blessing." | ||||||
The terrible office he had held for twenty-five years had succeeded in making him more or less than man. |
LibriTTS, ablation study
We conduct an ablation study on the effect of the proposed style equalization, comparing the proposed model trained with and without it. All style inputs are unseen, i.e., not in the training set. Please note the difference between the parallel and nonparallel settings.
Example: 1
setting | target text | style input | without style eq. | proposed |
---|---|---|---|---|
parallel text | There is a healthy bank-holiday atmosphere about this book which is extremely pleasant. | |||
nonparallel text | What is the difference between cappuccino and latte? |
Example: 2
setting | target text | style input | without style eq. | proposed |
---|---|---|---|---|
parallel text | Sheep Rock is about twenty miles from Sisson's, and is one of the principal winter pasture grounds of the wild sheep, from which it takes its name. | |||
nonparallel text | It has been a while since my last cigarette. |
Example: 3
setting | target text | style input | without style eq. | proposed |
---|---|---|---|---|
parallel text | Mrs. Bozzle, who well understood that business was business, and that wives were not business, felt no anger at this, and handed her husband his best coat. | |||
nonparallel text | The trees grow taller and taller, and finally into the sky. |
LibriTTS, unseen style interpolation
We showcase the capability of the proposed method to interpolate between two unseen styles.
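As a rough illustration of what the interpolation coefficients in the tables below mean, the following sketch blends two style representations linearly. It assumes that each style is encoded as a fixed-length vector and that interpolation is linear in that space; neither is confirmed on this page, and `interpolate_styles` and the random vectors are hypothetical stand-ins.

```python
import numpy as np

def interpolate_styles(style_1, style_2, coeffs=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Linearly blend two style vectors.

    A coefficient of 0 returns style_1 and a coefficient of 1 returns style_2,
    matching the row order of the tables (style 1 at the top, style 2 at the bottom).
    """
    style_1 = np.asarray(style_1, dtype=float)
    style_2 = np.asarray(style_2, dtype=float)
    return [(1.0 - c) * style_1 + c * style_2 for c in coeffs]

# Random vectors stand in for the encodings of two unseen style inputs.
rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=8), rng.normal(size=8)
blends = interpolate_styles(z1, z2)
```

Each blended vector would then condition the synthesizer on the same input text, producing one audio sample per coefficient.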
Input text 1:
In a short time, boil up the vinegar again, add pepper and ginger in the above proportion, and instantly cover them up.
style text 1 | In a short time, boil up the vinegar again, add pepper and ginger in the above proportion, and instantly cover them up. |
---|---|
style 1 | |
interp coeff = 0 | |
interp coeff = 0.25 | |
interp coeff = 0.5 | |
interp coeff = 0.75 | |
interp coeff = 1 | |
style 2 | |
style text 2 | Perhaps the profession of doing good may be full, but every body should be kind at least to himself. |
Input text 2:
The storm rushed in; she put up her hand to shield the light from danger.
style text 1 | The storm rushed in; she put up her hand to shield the light from danger. |
---|---|
style 1 | |
interp coeff = 0 | |
interp coeff = 0.25 | |
interp coeff = 0.5 | |
interp coeff = 0.75 | |
interp coeff = 1 | |
style 2 | |
style text 2 | "I've seen them do that in the wild west shows too many times not to know how myself." |
LibriTTS, seen speaker, nonparallel text
We randomly select style examples from the train-all-960 split (train-clean-100 + train-clean-360 + train-other-500) of the LibriTTS dataset. The input text is fixed (shown above each table) while we change the style inputs.
Input text 0:
Please change the channel of the television, thank you.
style text | style input | gst-64 (unsupervised) | gst-192 (unsupervised) | proposed (unsupervised) | gst-64s (supervised) | gst-192s (supervised) |
---|---|---|---|---|---|---|
"She'll wake up fast enough when it's time to eat, and so will you," said Marie, with profound wisdom. | ||||||
The greatest general of the South was Lee, and his greatest lieutenant was Jackson. | ||||||
Then indeed his cheek turned livid, and the eye which had hitherto preserved its steadiness sought the floor. | ||||||
He's like a cat,--as sleek, and cunning, and fierce. | ||||||
"Yes, the noise outside the city wall is new, but the principle is old." |
Input text 1:
What is it that you are looking for?
style text | style input | gst-64 (unsupervised) | gst-192 (unsupervised) | proposed (unsupervised) | gst-64s (supervised) | gst-192s (supervised) |
---|---|---|---|---|---|---|
Was ever such a view entertained of Caesar, Socrates or of any other historical character? | ||||||
The room was now in dusk, save for the bulbs which made the portrait shine forth like a wayside shrine. | ||||||
Below this diadem hung, pendent, clusters of other disks, swarmed like the globular hiving of the constellation Hercules' captured stars. | ||||||
"What is his name, Miss Greeb?" repeated Lucian, quite impervious to the hint. | ||||||
"There, Rob, you must forgive him; we're none of-us-perfect. |
LibriTTS, random styles from the prior distribution
We showcase the capability of the proposed method to sample random styles from the learned prior distribution.
Input text 1:
This is not the end, it is just the beginning.
Input text 2:
You can't always get what you want.
VCTK, seen speaker, nonparallel text
We showcase nonparallel-text speech synthesis with seen speakers. In each row, the style input appears before the synthesized results.
input text | style input | gst-16 (unsupervised) | gst-64 (unsupervised) | proposed (unsupervised) | gst-16s (supervised) | gst-64s (supervised) |
---|---|---|---|---|---|---|
My car is just right by the corner. | ||||||
There is a house on top of the mountain. | ||||||
Can you show me where the coffee shop is? | ||||||
How are you doing today? | ||||||
The light shining through the windows makes the room beautiful. | ||||||
Can you bring me some tea, please? | ||||||
I look at the sky and see nothing. | ||||||
I think I am going to find my keys soon. | ||||||
Please teach me calculus. | ||||||
Am I dreaming, or is it really you? |
VCTK, seen speaker, parallel text
We showcase parallel-text speech synthesis with seen speakers. In each row, the style input appears before the synthesized results.
input text | style input | gst-16 (unsupervised) | gst-64 (unsupervised) | proposed (unsupervised) | gst-16s (supervised) | gst-64s (supervised) |
---|---|---|---|---|---|---|
What kind of person is he? | ||||||
It is still too early for any likely contenders to have emerged. | ||||||
If you're going to do it, do it right. | ||||||
On the front line beyond the bridge the scene was utter chaos. | ||||||
Still, in the end, it was a fair result. | ||||||
I think it is a sensible change. | ||||||
When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow. | ||||||
Mexico City was a wonderful experience. | ||||||
If the red of the second bow falls upon the green of the first, the result is to give a bow with an abnormally wide yellow band, since red and green light when mixed form yellow. | ||||||
The allegations were still under investigation, he added. |