Abstract
Generated Audio:
Transcription:
In this paper, we present a streaming text-to-speech system called SpeakStream that can generate audio incrementally from streaming text using a decoder-only architecture. The model is trained using a next-step prediction loss on interleaved text-speech data. During inference, SpeakStream generates speech incrementally while absorbing streaming input text, making it suitable for cascaded conversational agents where a large language model streams its output to a text-to-speech system. Our experiments show that SpeakStream matches non-streaming models' quality while enabling streaming capabilities with state-of-the-art latency results. The audio you are hearing is generated by our model. We will release our code soon.
SpeakStream architecture
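As a rough illustration of the idea described in the abstract (a single decoder-only sequence in which text and speech alternate), the sketch below builds an interleaved sequence and runs a toy streaming loop. It is not the released SpeakStream code: the names `interleave`, `stream_tts`, and `generate_speech` are hypothetical, the speech tokens are placeholder integers, and the text and speech chunks are assumed to be already aligned with each other.

```python
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple

Token = Tuple[str, object]  # (modality, value); modality is "text" or "speech"


def interleave(
    text_chunks: Sequence[Sequence[str]],
    speech_chunks: Sequence[Sequence[int]],
) -> List[Token]:
    """Build one training sequence: text chunk, its speech chunk, next text chunk, ...

    Assumes chunk i of the speech corresponds to chunk i of the text.
    """
    if len(text_chunks) != len(speech_chunks):
        raise ValueError("each text chunk needs exactly one speech chunk")
    sequence: List[Token] = []
    for text, speech in zip(text_chunks, speech_chunks):
        sequence.extend(("text", tok) for tok in text)
        sequence.extend(("speech", tok) for tok in speech)
    return sequence


def stream_tts(
    text_stream: Iterable[Sequence[str]],
    generate_speech: Callable[[List[Token]], List[int]],
) -> Iterator[List[int]]:
    """Toy inference loop: absorb a text chunk, emit speech, repeat.

    `generate_speech` is a hypothetical stand-in for the decoder; it receives
    the full interleaved context so far and returns the next speech tokens.
    """
    context: List[Token] = []
    for text in text_stream:
        context.extend(("text", tok) for tok in text)
        speech = generate_speech(context)
        context.extend(("speech", tok) for tok in speech)
        yield speech
```

In this sketch, audio for the first text chunk can be produced before later text arrives, which is the property that makes the approach usable behind a language model that streams its output.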
Ablation Study (training on 24h single-speaker data)
We compare the quality of audio generated with different text window lengths and speech generation lengths. The results can be found in our paper. Here are some audio samples.
NonStreaming | |
T5S1 | |
T5S2 | |
T5S3 | |
T5S4 | |
T5S5 | |
BibTeX
@article{bai2025speakstream,
title={SpeakStream: Streaming Text-to-Speech with Interleaved Data},
author={Bai, He and Gu, Zijin and Likhomanenko, Tatiana and Jaitly, Navdeep},
journal={arXiv preprint arXiv:2505.19206},
year={2025}
}