As I covered in my last post, I’m currently working on improving the quality of the diffusion model used to rebuild discretized audio signals for tortoise-tts. Since realizing that the diffusion model can work entirely with spectrograms, I have been re-structuring the model to be a flat transformer/resnet hybrid.
One nifty thing about this set-up is that I can now concatenate the diffusion inputs with the conditioning signal derived from the discretized audio signal and feed the whole thing into the model. This mirrors what several authors of diffusion super-resolution models do with their low-resolution inputs.
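As a rough sketch of this set-up (all names and shapes here are illustrative, not tortoise-tts internals), channel-wise concatenation in PyTorch looks like:

```python
import torch

# Hypothetical shapes: a batch of noised MEL spectrograms (the diffusion
# input at some timestep) and a time-aligned conditioning signal derived
# from the discretized audio, both shaped [batch, channels, time].
noised_mel = torch.randn(2, 80, 400)
conditioning = torch.randn(2, 80, 400)

# Concatenate along the channel axis so every layer of the model
# sees the noised input and the conditioning signal side by side.
model_input = torch.cat([noised_mel, conditioning], dim=1)
print(model_input.shape)  # torch.Size([2, 160, 400])
```

The model then consumes `model_input` exactly as it would an unconditioned input, just with twice the channels in its first layer.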
I naively figured that I could simply plop an embedding of the discretized audio signal into the model inputs. This turned out to be wrong: the diffusion model ignored the conditioning signal and learned to produce gibberish that sounded like human speech but contained no actual words.
It turns out that for the concatenation set-up to work, the value being concatenated must carry a signal the model can actually use. Since I was feeding the signal through an embedding, it was pseudo-random (like most intermediate states in a freshly initialized neural network). Diffusion models will not learn an embedding for your conditioning signal; in the early phases of training, they will instead learn to just ignore it!
This was confirmed by adding a surrogate loss to my model. I applied an MSE loss to the embeddings to force them to predict a reconstructed MEL signal (the same one that the diffusion model also learns to predict). This immediately solved the problem: the network learns to use the conditioning signal, resulting in dramatically improved overall losses.
It’s useful to remember that neural networks (and diffusion models specifically!) do not simply “learn” everything you give them. They will take shortcuts, to their own eventual detriment. This is especially true when you are combining multiple inputs.