In an earlier post, I walked you through a project I’ve been working on, which I called “triforce” at the time.
I’ve finished training a first pass on this collection of models and want to write about the results.
Deploying this speech CLIP model on the outputs of my autoregressive speech token generator made all of the difference. Outputs are consistently awesome, and almost always clearly convey the desired speech.
Adding CLIP to the ensemble
After training the three triforce models, I was having considerable difficulty with the autoregressive portion of the model. Specifically, while it would generate a lot of really good speech, it would also regularly produce audio where a single syllable dragged on, for example: “three pigs went to the paaaaaaaaaaaaaaaaaaaaaaaa”.
After spending some time tinkering with different methods of autoregressive generation (thanks for all your work, HF team!), I finally came around to the realization that the secret sauce is adding a fourth model to the ensemble that serves the same purpose for my models that CLIP serves for DALL-E.
This was simple to train. It’s essentially the same thing as CLIP, but with the ViT image encoder replaced by a speech token encoder. The result is a model that gets really good at determining whether a given audio clip corresponds to a given piece of text. I operate in the speech token regime because, unlike with DALL-E and CLIP, fully decoding an output from my model takes a huge amount of time (thanks, diffusion models…). By operating on speech tokens, I can use the speech CLIP model directly on the outputs of my autoregressive generator.
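To make this concrete, here is a minimal sketch of what a CLIP-style text/speech-token model can look like. Everything here (dimensions, layer counts, mean-pooling, the `SpeechTextCLIP` name) is an illustrative assumption rather than the released model; only the overall setup follows the description above: two token encoders projected into a shared space and trained with CLIP’s symmetric contrastive loss.

```python
# A minimal sketch of a CLIP-style model over text tokens and speech tokens.
# Architecture details are illustrative assumptions, not the released model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechTextCLIP(nn.Module):
    def __init__(self, text_vocab, speech_vocab, dim=512, heads=8, layers=6):
        super().__init__()
        def make_encoder():
            layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            return nn.TransformerEncoder(layer, layers)
        self.text_emb = nn.Embedding(text_vocab, dim)
        self.speech_emb = nn.Embedding(speech_vocab, dim)
        self.text_enc = make_encoder()
        self.speech_enc = make_encoder()
        self.logit_scale = nn.Parameter(torch.tensor(2.0))  # learned temperature

    def _embed(self, tokens, emb, enc):
        h = enc(emb(tokens)).mean(dim=1)       # mean-pool over the sequence
        return F.normalize(h, dim=-1)

    def forward(self, text_tokens, speech_tokens):
        t = self._embed(text_tokens, self.text_emb, self.text_enc)
        s = self._embed(speech_tokens, self.speech_emb, self.speech_enc)
        return self.logit_scale.exp() * t @ s.T  # [batch, batch] similarity logits

def clip_loss(logits):
    """Symmetric contrastive (InfoNCE) loss, as in CLIP."""
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```

At inference time, the idea is to sample several candidate speech-token sequences from the autoregressive model, score each against the input text with this model, and only pay the (expensive) decoding cost for the best-scoring candidate.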
With CLIP added to the ensemble, the name “triforce” no longer worked as well. So…
Code
I’ve assembled a repo that can be used to perform inference with my models. I call the ensemble “Tortoise TTS”, a name that pokes fun at this method’s insanely slow generation rate. The repo can be found here: https://github.com/neonbjb/tortoise-tts
Data
Autoregressive and diffusion models like lots of data. To train this model, I assembled a speech dataset of ~18 million audio clips ranging from 2 to 12 seconds in length, with a mean of ~4 seconds. To create these audio clips, I first crawled a variety of sources for speech audio; audiobooks, podcasts, and YouTube were the primary sources.
Next, I used pydub to break the audio up wherever there were segments of silence. Clips between 2 and 12 seconds were kept; everything else was thrown out.
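A rough sketch of this segmentation step is below. The thresholds and padding values are illustrative guesses, not the exact settings used to build the dataset.

```python
# Silence-based segmentation with pydub; parameter values are assumptions.
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_file("podcast_episode.mp3")
chunks = split_on_silence(
    audio,
    min_silence_len=500,              # gaps of at least 500 ms split the audio
    silence_thresh=audio.dBFS - 16,   # 16 dB below the average loudness counts as silence
    keep_silence=200,                 # keep a little padding around each clip
)

# Keep only clips between 2 and 12 seconds (pydub lengths are in milliseconds).
kept = [c for c in chunks if 2_000 <= len(c) <= 12_000]
for i, clip in enumerate(kept):
    clip.export(f"clip_{i:06d}.wav", format="wav")
```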
Next, the resulting clips were fed through a cleaning pipeline. This pipeline consisted of several bespoke ML models that detected:
- People talking over one another (multiple voices)
- Excessive reverb
- Background music
- Environmental noise
- White noise (e.g. from a bad mic)
- Low quality (<20kHz)
Any clip that exhibited any of these problems was pruned from the final dataset; a rough sketch of this filtering step is below.
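The actual detectors are bespoke ML models that are not released, so the `QualityDetector` class below is only a stand-in interface with placeholder logic; the point is the shape of the pruning step, not the models themselves.

```python
# Hypothetical sketch of the pruning step; QualityDetector is a placeholder.
from pathlib import Path

class QualityDetector:
    """Stand-in for one of the bespoke detector models."""
    def __init__(self, name):
        self.name = name

    def flags(self, clip_path: Path) -> bool:
        # The real model would analyze the audio; this placeholder passes everything.
        return False

DETECTOR_NAMES = [
    "multiple_voices", "excessive_reverb", "background_music",
    "environmental_noise", "white_noise", "low_quality",
]
detectors = [QualityDetector(n) for n in DETECTOR_NAMES]

# A clip survives only if none of the detectors flag it.
clean_clips = [
    p for p in Path("clips").glob("*.wav")
    if not any(d.flags(p) for d in detectors)
]
```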
Next, I needed to provide text transcripts to go along with the audio clips. This was the original reason I wrote the ocotillo library. Thanks again to the HF team for open-sourcing such an incredible speech transcription model as W2V2.
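My actual pipeline goes through ocotillo, but for illustration, here is a minimal sketch of transcribing a single clip with W2V2 via Hugging Face transformers. The specific checkpoint name is an assumption; any CTC-based wav2vec 2.0 model would work the same way.

```python
# Transcribing one clip with wav2vec 2.0; the checkpoint choice is illustrative.
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

waveform, sr = torchaudio.load("clip_000000.wav")
# W2V2 expects 16 kHz mono audio.
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
inputs = processor(waveform.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```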
The data does not always have permissive licenses, so I will not be releasing it.
I do plan on releasing my speech processing and cleaning scripts in the future. If you are interested, please contact me directly.
Results
This model is pretty damned cool. I am really excited to share some of the things it is capable of generating as well as some lessons learned.
That’s going to come another day, though. Soon!
In the coming weeks, I will provide pretrained weights for the model as well as output samples. These will be accompanied by a separate blog post.