I’ve spent the majority of the last two months working on improving the diffusion model in Tortoise TTS.
The model used in v1 had a few major shortcomings:
- Conditioning inputs were bottlenecked to a very low-dimensional input into the main model, limiting their effectiveness.
- The model was trained on audio signals at 11kHz. To make this feasible, I needed to chop the signals up into small clips, which limited the available context, and:
- The model itself was relatively shallow: the top levels only had a depth of 1, and the channel counts at those levels were very restrictive.
- No processing was performed on the input codes, meaning the main u-net needed to do all the heavy lifting here.
I tried out many solutions to the above problems. Of them, the spatial dimensionality of the model was the hardest to tackle. I experimented at length with models that operated at 5kHz, with the intent of training a separate super-resolution model. The problem is that even at this dimensionality, training is ridiculously slow and model size is constrained.
MEL 4 lyfe
A couple of weeks ago I had a breakthrough which seems obvious in hindsight: train the diffusion model to reconstruct real MEL spectrograms, rather than the audio signal itself, then use an existing vocoder to go from the MEL to the audio signal.
This approach needed a little work to make it feasible. First, I set about finding a good vocoder. Ideally it would be trained on a diverse dataset (which basically means LibriTTS) so that it could reconstruct a variety of voices, and pretrained so I wouldn’t have to burn resources training another network. Finally, something efficient would be nice too; we don’t need to make Tortoise TTS any slower!
I ended up going with UnivNet. After testing the pretrained model, I found it worked well on many voices, including ones outside the standard TTS datasets. It is also fast and fairly simple.
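Roughly, the inference flow now looks like the sketch below. The function names and call signatures are placeholders for illustration, not the real Tortoise-TTS or UnivNet APIs:

```python
import torch

@torch.no_grad()
def synthesize(diffusion_model, vocoder, codes, conditioning):
    """Hypothetical glue code: these call signatures are placeholders,
    not the actual Tortoise-TTS or UnivNet interfaces."""
    # 1) The diffusion model denoises random noise into a (normalized) MEL
    #    spectrogram, conditioned on the AR codes and a voice conditioning clip.
    mel = diffusion_model.sample(codes, conditioning)  # (batch, n_mels, frames)
    # 2) A pretrained vocoder (UnivNet here) inverts the MEL into a waveform.
    return vocoder(mel)                                # (batch, 1, samples)
```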
The next hurdle to overcome is that diffusion models can only generate data in the range [-1, 1], and (log)MEL spectrograms far exceed this range. An easy solution is to normalize the spectrograms, but I wasn’t exactly sure how that would work. From past experience, I know that spectrograms are extremely sensitive to relatively tiny variances when compared with other media like audio signals or images.
So I built a normalization scheme and tested the concept on some toy models. To my surprise, it worked great! I set about training a diffusion model that would generate MEL spectrograms (which operate at a 256x spatial reduction compared to 22kHz audio signals).
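Concretely, the scheme boils down to a fixed affine map from log-MEL values into [-1, 1] and back. The clamping bounds below are illustrative placeholders that depend on the MEL extraction settings, not the exact values I settled on:

```python
import torch

# Illustrative bounds for a log-MEL spectrogram. The real values depend on the
# MEL extraction settings; these are placeholders, not the production numbers.
MEL_MIN, MEL_MAX = -11.5, 2.0

def normalize_mel(mel: torch.Tensor) -> torch.Tensor:
    """Squash a log-MEL spectrogram into [-1, 1] for the diffusion model."""
    mel = mel.clamp(MEL_MIN, MEL_MAX)
    return 2 * (mel - MEL_MIN) / (MEL_MAX - MEL_MIN) - 1

def denormalize_mel(mel: torch.Tensor) -> torch.Tensor:
    """Invert the normalization before handing the generated MEL to the vocoder."""
    return (mel + 1) / 2 * (MEL_MAX - MEL_MIN) + MEL_MIN
```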
At this point, I’ve trained one of these models with 40M samples seen. The results are definitely much better than the existing diffusion model used in Tortoise-TTS. Except I probably won’t release it, because:
Going Flat
While the model was training on my main deep learning machine, I got to thinking: why am I using a u-net at all at such a low spatial resolution? It is entirely feasible to simply use a standard, flat transformer as the entire network that makes up the diffusion model.
I’m pretty down on conv-nets these days and am especially skeptical of u-nets. U-nets could be powerful, but there is a serious lack of research into training them efficiently. You only need to look at the gradient norms of each layer of a u-net to see how poorly the architecture trains; my hypothesis is that the lower (parameter-rich) parts of these models hardly train at all.
So let’s build a flat, fully-attentional diffusion network. You got it! After experimenting with this architecture in the toy regime, I found it performed incredibly well: at the loss level, it was easily 20% better than the u-net model.
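To make “flat” concrete, here is a rough sketch of the kind of backbone I mean: no downsampling or upsampling, just an input projection, a transformer stack running at the full MEL frame rate, and a projection back out. The hyperparameters and the crude timestep embedding are illustrative only, not the actual architecture:

```python
import torch
import torch.nn as nn

class FlatDiffusionNet(nn.Module):
    """Sketch of a flat (non-u-net) diffusion backbone: the noisy MEL is
    embedded, processed at full resolution by a transformer stack, and
    projected back to MEL channels. All hyperparameters are illustrative."""

    def __init__(self, n_mels=100, dim=512, depth=8, heads=8):
        super().__init__()
        self.inp = nn.Linear(n_mels, dim)
        # Crude stand-in for a proper sinusoidal timestep embedding.
        self.time_emb = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        # Stand-in for the code/latent conditioning projection.
        self.cond_emb = nn.Linear(n_mels, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=dim * 4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.out = nn.Linear(dim, n_mels)

    def forward(self, noisy_mel, timesteps, cond):
        # noisy_mel, cond: (batch, n_mels, frames); timesteps: (batch,)
        x = self.inp(noisy_mel.transpose(1, 2))        # (batch, frames, dim)
        x = x + self.cond_emb(cond.transpose(1, 2))    # per-frame conditioning
        x = x + self.time_emb(timesteps.float().unsqueeze(-1)).unsqueeze(1)
        x = self.transformer(x)                        # full-resolution attention
        return self.out(x).transpose(1, 2)             # (batch, n_mels, frames)
```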
Other things to try
One other thing I really want to try is improving the structural conditioning signal provided to the diffusion network. Currently, I am simply providing the codes output by the autoregressive model (or the DVAE, during training).
However, I think this model would benefit greatly from a more robust signal. Specifically: why not condition it directly on the outputs of the autoregressive model? Then it would have access to the actual text inputs, among other things.
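To make the distinction concrete, here is a hedged sketch of the two conditioning paths; the class names, vocabulary size, and dimensions are made up for illustration:

```python
import torch
import torch.nn as nn

class CodeConditioning(nn.Module):
    """Current approach (sketch): embed the discrete codes produced by the
    autoregressive model (or the DVAE during training)."""
    def __init__(self, n_codes=8192, dim=512):
        super().__init__()
        self.embed = nn.Embedding(n_codes, dim)

    def forward(self, codes):          # codes: (batch, seq) of code indices
        return self.embed(codes)       # (batch, seq, dim)

class LatentConditioning(nn.Module):
    """Proposed approach (sketch): consume the autoregressive model's hidden
    states directly, which already reflect the text inputs and the speaker."""
    def __init__(self, ar_dim=1024, dim=512):
        super().__init__()
        self.proj = nn.Linear(ar_dim, dim)

    def forward(self, ar_latents):     # ar_latents: (batch, seq, ar_dim)
        return self.proj(ar_latents)   # (batch, seq, dim)
```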
I plan to experiment with this before I release v2 of the tortoise-tts model suite. Stay tuned. I’m getting pretty excited; this thing is coming together exceptionally well.