Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models

Abstract

Controllable generative sequence models with the capability to extract and replicate the style of specific examples enable many applications, including narrating audiobooks in different voices, auto-completing and auto-correcting written handwriting, and generating missing training samples for downstream recognition tasks. However, typical training algorithms for these controllable sequence generative models suffer from the training-inference mismatch, where the same sample is used as content and style input during training but different samples are given during inference.

In this paper, we tackle the training-inference mismatch encountered during unsupervised learning of controllable generative sequence models. By introducing a style transformation module that we call style equalization, we enable training using different content and style samples and thereby mitigate the training-inference mismatch. To demonstrate its generality, we applied style equalization to text-to-speech and text-to-handwriting synthesis on three datasets. Our models achieve state-of-the-art style replication, with a mean style opinion score similar to that of real data. Moreover, the proposed method enables style interpolation between sequences and generates novel styles.
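The training-inference mismatch described above can be illustrated with a toy sketch of how content and style inputs are paired. This is only an illustration under stated assumptions: the `Sample` tuple, the function names `same_sample_pairs` and `cross_sample_pairs`, and the toy dataset are all hypothetical and are not taken from the paper. The sketch shows the pairing problem only; it does not implement style equalization itself, whose details are not given on this page.

```python
import random
from typing import List, Tuple

# Hypothetical toy representation: (content text, style identifier).
Sample = Tuple[str, str]


def same_sample_pairs(dataset: List[Sample]) -> List[Tuple[Sample, Sample]]:
    """Typical unsupervised training: each sample is both content and style input."""
    return [(sample, sample) for sample in dataset]


def cross_sample_pairs(
    dataset: List[Sample], rng: random.Random
) -> List[Tuple[Sample, Sample]]:
    """Inference-like pairing: the style reference is a different sample."""
    if len(dataset) < 2:
        raise ValueError("need at least two samples to pair different ones")
    pairs = []
    for i, content_sample in enumerate(dataset):
        # Pick any index other than i as the style reference.
        j = rng.choice([k for k in range(len(dataset)) if k != i])
        pairs.append((content_sample, dataset[j]))
    return pairs


# Assumed toy data, for illustration only.
toy_dataset: List[Sample] = [
    ("hello world", "writer_a"),
    ("good morning", "writer_b"),
    ("see you soon", "writer_c"),
]

training_pairs = same_sample_pairs(toy_dataset)
inference_pairs = cross_sample_pairs(toy_dataset, random.Random(0))

for content_sample, style_sample in inference_pairs:
    print(f"content={content_sample[0]!r} style_from={style_sample[1]!r}")
```

A model trained only on `training_pairs` never sees a style input whose content differs from the target, which is the situation it faces on every pair in `inference_pairs`.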

Our paper is here.

Our handwriting generative model has been used to improve handwriting recognition.

Our speech generative model has also been used to improve speech recognition.

Quick introduction video


Speech synthesis demo video

Given a reference speech recording, our model generates new audio that sounds as if it were recorded in the original environment by the same speaker. In other words, we mimic the voice characteristics of the speaker, the background noise, the echo, the microphone response, etc., but with our target content.

In the video below, we type the content in the input text box (top row), use the slider to choose a random speech audio as the style reference input (middle row), and synthesize the input text with the style of the reference audio (bottom row).

Please unmute the video and turn on your audio.

As can be seen, our method accurately mimics the style of the reference example while producing the correct content.

Here is a quick comparison with global style tokens, which is also an unsupervised method.
The goal is to read the input text in the same style (e.g., voice characteristics, background noise, echo, etc.) as the style input.

Input text 1: I did not see any reason to change the captain.

Style text | Style input | Global style token | Proposed
When the candle ends sent up their conical yellow flames, all the colored figures from Austria stood out clear and full of meaning against the green boughs. | [audio] | [audio] | [audio]
The man shrugged his broad shoulders and turned back into the arabesque chamber. | [audio] | [audio] | [audio]

Input text 2: Next year it plans to open an office in Tokyo.

Style text | Style input | Global style token | Proposed
I had meant it to be the story of my life, but how little of my life is in it! | [audio] | [audio] | [audio]
As the inspiring music, the grand tramp drew near, Christie felt the old thrill and longed to fall in and follow the flag anywhere. | [audio] | [audio] | [audio]

Please click to see a detailed comparison.

Handwriting synthesis demo video

Given a reference handwriting sample, which comprises a sequence of pen movements, our model generates new handwriting in the same writing style as the unseen reference.

In the video below, we type the content in the input text box (top row), use the slider to choose a random style (the rasterized style handwriting is shown alongside the selection in the middle row), and synthesize the input content with the selected handwriting style (shown as a sequence of strokes in the bottom row).

*For privacy reasons, the style references used in this video are synthetic. They are similar to unseen real style examples in our dataset and are synthesized using a generative model with a different architecture. The generated results are very similar when real samples are used as the style input. Note that all the evaluations reported in the paper are done using real unseen style examples.

Please click to see more handwriting examples.