Audio samples from “Hierarchical prosody modeling and control in non-autoregressive parallel neural TTS”


Paper: arXiv

Authors: Tuomo Raitio, Jiangchuan Li, Shreyas Seshadri

Abstract: Neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable from natural speech. However, the synthetic speech often represents the average prosodic style of the database instead of having more versatile prosodic variation. Moreover, many models lack the ability to control the output prosody, which does not allow for different styles for the same text input. In this work, we train a non-autoregressive parallel neural TTS model hierarchically conditioned on both coarse and fine-grained acoustic speech features to learn a latent prosody space with intuitive and meaningful dimensions. Experiments show that a non-autoregressive TTS model hierarchically conditioned on utterance-wise pitch, pitch range, duration, energy, and spectral tilt can effectively control each prosodic dimension, generate a wide variety of speaking styles, and provide word-wise emphasis control, while maintaining equal or better quality to the baseline model.




The following samples demonstrate the prosody control capability by adjusting the bias of each prosodic dimension:


Baseline: high-quality parallel TTS model without prosody control:

No control
Voice 1
Voice 2

Proposed hierarchical parallel prosody control model trained with 36 hours of the high-pitched Voice 1 speech data:

Feature / Bias -1.0 -0.7 -0.5 -0.3 0.0 0.3 0.5 0.7 1.0
Pitch
Pitch range
Duration
Energy
Spectral tilt

Proposed hierarchical parallel prosody control model trained with 23 hours of the low-pitched Voice 2 speech data:

Feature / Bias -1.0 -0.7 -0.5 -0.3 0.0 0.3 0.5 0.7 1.0
Pitch
Pitch range
Duration
Energy
Spectral tilt
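
The bias sweep in the two tables above can be pictured with a minimal sketch. It assumes that the model predicts one utterance-wise value per prosodic feature from the text and that the bias is simply added to that prediction before synthesis. The page does not spell out this mechanism, and the names `FEATURES`, `BIAS_GRID`, `apply_bias`, and `sweep` are hypothetical and not taken from any released code.

```python
# Hypothetical sketch: the names and the additive-bias mechanism are assumptions.
FEATURES = ("pitch", "pitch_range", "duration", "energy", "spectral_tilt")
BIAS_GRID = (-1.0, -0.7, -0.5, -0.3, 0.0, 0.3, 0.5, 0.7, 1.0)


def apply_bias(predicted, feature, bias):
    """Return a copy of `predicted` with `bias` added to one feature."""
    if feature not in FEATURES:
        raise ValueError(f"unknown prosodic feature: {feature!r}")
    biased = dict(predicted)
    biased[feature] = biased[feature] + bias
    return biased


def sweep(predicted):
    """Build one biased feature set per (feature, bias) cell of the table."""
    return {
        (feature, bias): apply_bias(predicted, feature, bias)
        for feature in FEATURES
        for bias in BIAS_GRID
    }


# Placeholder predictions for one utterance (made-up values, not model output).
predicted = {name: 0.0 for name in FEATURES}
grid = sweep(predicted)
```

Each entry of `grid` corresponds to one cell of a table: a single feature is shifted while the other four keep their predicted values, which is why each row changes only one prosodic dimension.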




The following samples demonstrate the extrapolation ability of the model for spectral tilt:


Feature / Bias -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0
Voice 1 Spectral tilt
Voice 2 Spectral tilt




The following samples demonstrate the emphasis control capability of the proposed model by adding a bias of 0.5 to the coarse-grained pitch, pitch range, and duration of the emphasized words:


Text (emphasized word in bold) Voice 1 Voice 2
The total average height of the troposphere is 13 kilometers.
The total average height of the troposphere is 13 kilometers.
The total average height of the troposphere is 13 kilometers.
The total average height of the troposphere is 13 kilometers.
The total average height of the troposphere is 13 kilometers.
The total average height of the troposphere is 13 kilometers.
The total average height of the troposphere is 13 kilometers.
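
The word-wise emphasis setting described above can be sketched as follows. This is an illustration only: it assumes that words are obtained by splitting on whitespace and that a bias can be specified per word and per feature. The helper `word_biases` and the constants `FEATURES` and `EMPHASIS_FEATURES` are hypothetical names, and mapping word-level biases onto the model's internal time resolution is left out.

```python
# Hypothetical sketch: whitespace tokenization and per-word biases are assumptions.
FEATURES = ("pitch", "pitch_range", "duration", "energy", "spectral_tilt")
EMPHASIS_FEATURES = ("pitch", "pitch_range", "duration")


def word_biases(text, emphasized_indices, amount=0.5):
    """Return the words of `text` and one bias dict per word.

    Emphasized words get `amount` on pitch, pitch range, and duration;
    every other bias stays at 0.0.
    """
    words = text.split()
    biases = []
    for index in range(len(words)):
        emphasized = index in emphasized_indices
        biases.append({
            name: amount if emphasized and name in EMPHASIS_FEATURES else 0.0
            for name in FEATURES
        })
    return words, biases


text = "The total average height of the troposphere is 13 kilometers."
# Emphasize the word at index 6 ("troposphere"); the choice is illustrative.
words, biases = word_biases(text, {6})
```

Using word indices rather than word strings avoids ambiguity when a word such as "the" appears more than once in the sentence.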



© 2021 Apple Inc. All rights reserved.