For my next project, I want to play around in the music generation space. I think it’ll be interesting to apply some of the lessons learned building Tortoise to music.
The first step is building the musical equivalent of a vocoder: a model that will transform a MEL spectrogram to waveform data. That way the main generator(s) can work in highly reduced spectrogram space, just like Tortoise.
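To put a number on "highly reduced," here's a quick sketch using torchaudio. The sample rate, FFT size, hop length, and mel bin count below are just illustrative values, not the settings I'm actually training with.

```python
import torch
import torchaudio

sample_rate = 22050                      # ~22k waveform samples per second of audio
waveform = torch.randn(1, sample_rate)   # one second of (fake) mono audio

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=1024,
    hop_length=256,                      # each mel frame summarizes 256 waveform samples
    n_mels=80,
)

mel = to_mel(waveform)
print(waveform.shape)                    # torch.Size([1, 22050])
print(mel.shape)                         # torch.Size([1, 80, 87]) -- ~87 frames vs. 22050 samples
```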
I could just train a new UnivNet model, and that probably would have been the wisest choice. However, I don't really like training GANs, I have no experience training UnivNet, and I really don't care about inference speed for this project. So instead I opted to do this with a diffusion model.
The architecture I decided to use is fairly conventional: a UNet model like the one you can get from the OpenAI Guided Diffusion repo, combined with a structural guidance scheme I developed while working on Tortoise. The MEL <=> waveform conversion does not need global attention, so all of the global attention layers were removed.
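For reference, something like the following is how you would instantiate that UNet in 1D with the per-resolution attention disabled. Treat it as a sketch: every hyperparameter value is a placeholder, the structural guidance conditioning isn't shown, and the repo's middle block still hard-codes an attention layer that has to be removed separately.

```python
# Sketch only: the UNet from openai/guided-diffusion, configured for 1D data
# with no per-resolution attention. All values below are placeholders.
from guided_diffusion.unet import UNetModel

model = UNetModel(
    image_size=2048,           # length of a training crop (placeholder)
    in_channels=1,             # mono waveform
    model_channels=64,         # the base channel count discussed below
    out_channels=2,            # predicted mean + variance, as in guided-diffusion
    num_res_blocks=2,
    attention_resolutions=(),  # no attention blocks at any resolution
    channel_mult=(1, 2, 4, 8),
    dims=1,                    # 1D convolutions instead of 2D
)
# Note: the repo's middle block still contains an AttentionBlock; dropping *all*
# global attention means editing that as well.
```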
Since we are working with raw audio data, the sequences are very long: 20,000 samples for a single second of music (and that isn't even high quality!). Working at such a high dimensionality forced me to choose a low model dimensionality at the base layers: 64 channels was as big as I could get away with.
The results were awful:
Over the past few weeks, I've been tinkering with the architecture, thinking there was something wrong with how I was modeling the problem. Several permutations and a lot of reading later, I had wasted ~200 hours of GPU time with nothing to show for it.
This morning when I woke up, I remembered something odd that had occurred when I trained toy diffusion models for Tortoise: they all produced horrible output quality, even though their loss curves looked fairly normal. These models were very small, so I never thought much about it at the time. I got to thinking, though: what if there were some minimum bound on the number of channels in the base layer of a diffusion network?
To test out this theory, I needed a way to compress the output of the diffusion model. One idea that came to mind was to use the concepts behind PixelShuffle. So I compressed the waveform by 16x using a 1D form of PixelShuffle (sketched below), increased the base dimensionality to 256, and started another training run..
..And it worked! Just 10k iterations in (~15% of total training), the audio quality is already considerably better than the "fully trained" models that did not use PixelShuffle:
Better still, the resulting model is even smaller and executes faster.
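For clarity, here's roughly what I mean by a 1D form of PixelShuffle: fold blocks of the length axis into channels on the way in, and unfold them on the way out. This is a minimal sketch (the shapes are illustrative), not the exact code I'm running.

```python
import torch

def pixel_unshuffle_1d(x: torch.Tensor, r: int) -> torch.Tensor:
    """Fold the length axis into channels: (B, C, L) -> (B, C*r, L//r)."""
    b, c, l = x.shape
    assert l % r == 0
    x = x.view(b, c, l // r, r)   # split length into blocks of r
    x = x.permute(0, 1, 3, 2)     # move the within-block offset next to channels
    return x.reshape(b, c * r, l // r)

def pixel_shuffle_1d(x: torch.Tensor, r: int) -> torch.Tensor:
    """Inverse operation: (B, C*r, L) -> (B, C, L*r)."""
    b, cr, l = x.shape
    assert cr % r == 0
    x = x.view(b, cr // r, r, l)
    x = x.permute(0, 1, 3, 2)
    return x.reshape(b, cr // r, l * r)

# One second of audio becomes a 16-channel sequence that is 16x shorter, which is
# what makes a wider base (256 channels) affordable.
wav = torch.randn(1, 1, 20480)
compressed = pixel_unshuffle_1d(wav, 16)     # (1, 16, 1280)
restored = pixel_shuffle_1d(compressed, 16)  # (1, 1, 20480)
assert torch.equal(wav, restored)
```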
I'm not sure what causes this phenomenon. I don't think it is limited to audio data: I have trained diffusion models for images and noticed the same problem. Models with 64 base channels performed poorly, but models with 128 base channels did great. I wonder if it has something to do with the timestep signal?
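To make that hunch a bit more concrete: in guided-diffusion-style UNets, the timestep is encoded as a sinusoidal embedding whose width is tied to the base channel count, run through a small MLP, and injected into every residual block, so a 64-channel base also means a narrower code for the noise level. The sketch below just mirrors the usual embedding formulation; it's an illustration of the hypothesis, not evidence for it.

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int, max_period: int = 10000) -> torch.Tensor:
    """Map integer timesteps (B,) to sinusoidal embeddings (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

t = torch.tensor([0, 250, 999])
print(timestep_embedding(t, 64).shape)   # torch.Size([3, 64])  -- 64-channel base
print(timestep_embedding(t, 256).shape)  # torch.Size([3, 256]) -- 256-channel base
```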
Anyhow, I hope you can learn from my wasted resources. If you're training a diffusion model, don't go below 128 base channels and expect good results at inference time. As a corollary: don't expect toy diffusion models to work in many cases.
A final note: this may only be a problem with OpenAI's diffusion architecture and training process, since that is all I have experience with. Models like DiffWave suggest that it can at least be overcome with different design choices.