SpeakStream: Streaming Text-to-Speech with Interleaved Data

Apple

Abstract

Generated Audio:

Transcription:

In this paper, we present a streaming text-to-speech system called SpeakStream that can generate audio incrementally from streaming text using a decoder-only architecture. The model is trained using next-step prediction loss on interleaved text-speech data. During inference, SpeakStream generates speech incrementally while absorbing streaming input text, making it suitable for cascaded conversational agents where a large language model streams its output to a text-to-speech system. Our experiments show that SpeakStream matches non-streaming models' quality while enabling streaming capabilities with state-of-the-art latency results. The audio you are hearing is generated by our model. We will release our code soon.
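To make the idea of interleaved text-speech data concrete, here is a minimal sketch of how one training sequence might be assembled. It is an illustration under assumptions, not the released SpeakStream code: the word-level alignment input, the `interleave` function, the `words_per_chunk` parameter, and the `("text", ...)` / `("speech", ...)` tagging are all hypothetical names chosen for this example.

```python
from typing import List, Tuple, Union

# Hypothetical aligned input: each word paired with the discrete speech
# tokens that realize it (e.g. obtained from a forced alignment).
AlignedWord = Tuple[str, List[int]]
Item = Tuple[str, Union[str, int]]


def interleave(aligned: List[AlignedWord], words_per_chunk: int = 2) -> List[Item]:
    """Build one flat sequence that alternates text chunks and speech chunks.

    Each chunk contributes its words first, then the speech tokens for
    exactly those words, so a decoder-only model trained with next-step
    prediction sees text shortly before the speech it has to produce.
    """
    if words_per_chunk < 1:
        raise ValueError("words_per_chunk must be at least 1")

    sequence: List[Item] = []
    for start in range(0, len(aligned), words_per_chunk):
        chunk = aligned[start:start + words_per_chunk]
        # Text part of the chunk.
        for word, _ in chunk:
            sequence.append(("text", word))
        # Speech part of the chunk, in the same word order.
        for _, speech_tokens in chunk:
            for token in speech_tokens:
                sequence.append(("speech", token))
    return sequence


if __name__ == "__main__":
    example = [("hello", [1, 2]), ("world", [3]), ("again", [4, 5])]
    for item in interleave(example, words_per_chunk=2):
        print(item)
```

In this sketch, a smaller `words_per_chunk` means speech can start after less text has arrived, which is the property that matters for streaming. The chunking scheme and token types used in the actual system may differ.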

SpeakStream architecture

Ablation Study (training on 24h single-speaker data)

We compare the quality of the generated audio with different text window lengths and speech generation lengths. The results can be found in our paper. Here are some audio samples.

NonStreaming
T5S1
T5S2
T5S3
T5S4
T5S5

BibTeX

@article{bai2025speakstream,
    title={SpeakStream: Streaming Text-to-Speech with Interleaved Data},
    author={Bai, He and Gu, Zijin and Likhomanenko, Tatiana and Jaitly, Navdeep},
    journal={arXiv preprint arXiv:2505.19206},
    year={2025}
}