Audio samples from “Vocal effort modeling in neural TTS for improving the intelligibility of synthetic speech in noise”



Authors: Tuomo Raitio, Petko Petkov, Jiangchuan Li, Muhammed Shifas, Andrea Davis, Yannis Stylianou

Abstract: We present a neural text-to-speech (TTS) method that models natural vocal effort variation to improve the intelligibility of synthetic speech in the presence of noise. The method first measures the spectral tilt of unlabeled conventional speech data and then conditions a neural TTS model on normalized spectral tilt, among other prosodic factors. Changing spectral tilt while keeping the other prosodic factors fixed enables effective vocal effort control at synthesis time, independent of those factors. By extrapolating spectral tilt beyond the range seen in the original data, we can generate speech with high vocal effort, thus improving its intelligibility in the presence of masking noise. We evaluate the intelligibility and quality of normal speech and speech with increased vocal effort under various masking noise conditions, and compare them with well-known speech intelligibility-enhancing algorithms. The evaluations show that the proposed method improves the intelligibility of synthetic speech with little loss in speech quality.
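The abstract's first step is measuring spectral tilt from unlabeled speech. As an illustrative sketch (the paper's exact estimator is not specified on this page), spectral tilt can be approximated per frame as the slope of a line fit to the log-magnitude spectrum over log-frequency:

```python
# Hypothetical sketch of a spectral-tilt estimator; the function name,
# band limits, and fitting choices are assumptions, not the paper's method.
import numpy as np

def spectral_tilt(frame, sr=16000, n_fft=1024):
    """Estimate spectral tilt (dB per octave) of one speech frame by fitting
    a line to the log-magnitude spectrum against log2 frequency."""
    windowed = frame * np.hanning(len(frame))
    spec = np.abs(np.fft.rfft(windowed, n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    # Restrict to a typical voiced-speech band and avoid log(0) at DC.
    band = (freqs >= 100) & (freqs <= 5000)
    log_f = np.log2(freqs[band])
    log_mag = 20.0 * np.log10(spec[band] + 1e-10)
    slope, _ = np.polyfit(log_f, log_mag, 1)  # dB per octave
    return slope
```

Utterance-level tilt could then be averaged over voiced frames and normalized over the corpus before being used as a conditioning input to the TTS model.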






The following samples demonstrate the proposed TTS system with increased vocal effort in comparison to other speech intelligibility enhancement methods.

Baseline: Synthetic speech
Proposed: Synthetic speech with increased vocal effort
SS: Synthetic speech processed with spectral shaping (SS)
SSDRC: Synthetic speech processed with spectral shaping and dynamic range compression (SSDRC)

In the proposed system, only vocal effort (through spectral tilt) was changed; other utterance-level prosodic factors were kept the same.
Audio levels are normalized between the systems.
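SSDRC combines spectral shaping with dynamic range compression. The following toy compressor is only a rough illustration of the DRC idea (names and parameters are invented here; it is not the published SSDRC algorithm): it attenuates regions whose short-term level exceeds a threshold, reducing the signal's dynamic range.

```python
# Toy dynamic range compressor for illustration only; NOT the SSDRC
# algorithm used in the comparison above.
import numpy as np

def simple_drc(x, sr=16000, threshold_db=-20.0, ratio=4.0, win_ms=10.0):
    """Attenuate samples whose short-term RMS envelope exceeds a threshold,
    compressing levels above the threshold by the given ratio."""
    win = max(1, int(sr * win_ms / 1000))
    # Short-term power envelope via a moving average of the squared signal.
    power = np.convolve(x ** 2, np.ones(win) / win, mode="same")
    env_db = 10.0 * np.log10(power + 1e-12)
    over = np.maximum(env_db - threshold_db, 0.0)
    gain_db = -over * (1.0 - 1.0 / ratio)  # compress only above threshold
    return x * 10.0 ** (gain_db / 20.0)
```

Quiet regions pass through unchanged, while loud regions are pulled toward the threshold, which is the level-redistribution effect that makes compressed speech more audible in noise.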


Voice 1 Baseline Proposed SS SSDRC
Utterance 1
Utterance 2
Utterance 3
Utterance 4


Voice 2 Baseline Proposed SS SSDRC
Utterance 1
Utterance 2
Utterance 3
Utterance 4


The following samples demonstrate the intelligibility of the proposed TTS system with increased vocal effort under various masking noise conditions.

CS -7dB: Competing speaker with SNR of -7 dB
CS -14dB: Competing speaker with SNR of -14 dB
CS -21dB: Competing speaker with SNR of -21 dB
SSN +1dB: Speech shaped noise with SNR of +1 dB
SSN -4dB: Speech shaped noise with SNR of -4 dB
SSN -9dB: Speech shaped noise with SNR of -9 dB

Voice 1 is masked by Voice 2 and vice versa.
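For each condition above, the masker is scaled so that the mixture reaches the target SNR. A minimal sketch of that mixing step (the function name and scaling convention are assumptions):

```python
# Illustrative SNR mixing; conventions for level references vary across
# evaluations, so treat this as a sketch rather than the exact setup.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add a noise masker to speech, scaling the noise so that
    10*log10(P_speech / P_noise) equals snr_db."""
    noise = np.resize(noise, speech.shape)  # loop/trim masker to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_speech / (10.0 ** (snr_db / 10.0))
    noise = noise * np.sqrt(target_p_noise / (p_noise + 1e-12))
    return speech + noise
```

Negative SNRs such as -14 dB or -21 dB mean the masker is substantially louder than the speech, which is why the competing-speaker conditions are so demanding.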


Voice 1 Baseline Proposed SS SSDRC
CS -7dB
CS -14dB
CS -21dB
SSN +1dB
SSN -4dB
SSN -9dB


Voice 2 Baseline Proposed SS SSDRC
CS -7dB
CS -14dB
CS -21dB
SSN +1dB
SSN -4dB
SSN -9dB


The following samples demonstrate the effect of adjusting vocal effort from low (breathy speech) to high (loud speech).

-3: Synthesis with reduced vocal effort (extrapolated to -3)
-2: Synthesis with reduced vocal effort (extrapolated to -2)
-1: Synthesis with reduced vocal effort (minimum seen in the data)
0: Baseline synthesis
+1: Synthesis with increased vocal effort (maximum seen in the data)
+2: Synthesis with increased vocal effort (extrapolated to +2)
+3: Synthesis with increased vocal effort (extrapolated to +3)

Only vocal effort (through spectral tilt) was changed; other utterance-level prosodic factors were kept the same.
Audio levels are not normalized between the samples.
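The -3 to +3 scale above suggests that the conditioning value is normalized so the corpus minimum and maximum map to -1 and +1, with values outside that interval extrapolating beyond the observed range. A sketch under that assumption (the paper's exact normalization may differ):

```python
# Assumed min/max normalization of the tilt conditioning value; the
# actual normalization used in the paper is not specified on this page.
import numpy as np

def effort_scale(tilt_values):
    """Build mappings between measured spectral-tilt values and a control
    scale where the corpus minimum is -1 and the maximum is +1."""
    lo, hi = np.min(tilt_values), np.max(tilt_values)
    mid, half = (lo + hi) / 2.0, (hi - lo) / 2.0

    def to_scale(t):
        return (t - mid) / half

    def from_scale(s):
        # |s| > 1 extrapolates past the observed range, e.g. +3 requests
        # more vocal effort than any training utterance exhibited.
        return mid + s * half

    return to_scale, from_scale
```

On this scale, -1 and +1 correspond to the minimum and maximum seen in the data (the -1 and +1 rows above), while ±2 and ±3 are pure extrapolation by the model.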


Voice 1 -3 -2 -1 0 +1 +2 +3
Utterance 1
Utterance 2
Utterance 3


Voice 2 -3 -2 -1 0 +1 +2 +3
Utterance 1
Utterance 2
Utterance 3



© 2022 Apple Inc. All rights reserved.