For the past two years, I’ve been tinkering around with generative models in my spare time. I think I’ve landed on an approach that produces by far the most compelling results available today, and which scales like big language models. I’d like to outline the approach here.
First of all, I want to touch on something that’ll become immediately obvious: this isn’t a novel architecture or anything. In fact, it is pretty much OpenAI’s DALL-E with a diffusion upsampler attached. Rather, it’s a way of thinking about how one can (1) improve upon DALL-E and (2) universally model generative domains using a single set of techniques.
Three Models
This approach uses three different neural networks to produce the finished result, all trained separately from one another.
The first is a discrete VAE (DVAE). This model is responsible for translating your base medium (images, music, video, etc.) into a string of integers. The DVAE preserves the structure of your medium but simplifies its contents into discrete bins that can be reasoned about. It can also compress the medium so that it is more computationally tractable to reason about.
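To make the discretization step concrete, here is a minimal sketch of what a VQ-style DVAE could look like, assuming images as the medium. The layer sizes, codebook size, and class names are illustrative (and the training details, like straight-through or gumbel-softmax estimation, are omitted); this is not the exact model I use.

```python
import torch
import torch.nn as nn

class DiscreteVAE(nn.Module):
    def __init__(self, codebook_size=8192, codebook_dim=256):
        super().__init__()
        # Strided convolutions compress the medium (here: images) 8x per side.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, codebook_dim, 4, stride=2, padding=1),
        )
        # Each spatial position gets snapped to its nearest codebook entry;
        # the entry's index is the discrete token.
        self.codebook = nn.Embedding(codebook_size, codebook_dim)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(codebook_dim, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )

    def encode(self, images):
        z = self.encoder(images)                      # (B, D, H/8, W/8)
        flat = z.permute(0, 2, 3, 1).reshape(-1, z.shape[1])
        dists = torch.cdist(flat, self.codebook.weight)
        codes = dists.argmin(dim=-1)                  # the "string of integers"
        return codes.view(z.shape[0], -1)

    def decode(self, codes, hw):
        z = self.codebook(codes).view(codes.shape[0], *hw, -1)
        return self.decoder(z.permute(0, 3, 1, 2))    # lossy reconstruction
```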
The second is the causal transformer, essentially a GPT model. This model is trained on next-token prediction, where the tokens are the discrete outputs of the DVAE. These models are especially neat because you can throw anything you like into the sequence and they will learn how to reason about it. Have text and audio and want to produce images? Discretize all three and throw them into your causal transformer! It’ll learn how to convert between these mediums and predict image tokens. Want to flip the problem around and predict text from images and audio clips? Just flip the sequence around! The flexibility of this architecture is incredible.
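As a sketch of what “throw everything into the sequence” means in practice: give each modality its own region of a shared vocabulary, concatenate the token streams, and train an ordinary causal transformer over the result. The vocabulary sizes and model shape below are made up for illustration.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, AUDIO_VOCAB, IMAGE_VOCAB = 256, 8192, 8192  # made-up sizes

def build_sequence(text_tokens, audio_tokens, image_tokens):
    # [text | audio | image]: the model predicts image tokens conditioned on
    # everything to their left. Reorder the segments to flip the task.
    audio = audio_tokens + TEXT_VOCAB
    image = image_tokens + TEXT_VOCAB + AUDIO_VOCAB
    return torch.cat([text_tokens, audio, image])

class CausalTransformer(nn.Module):
    def __init__(self, vocab=TEXT_VOCAB + AUDIO_VOCAB + IMAGE_VOCAB,
                 dim=512, heads=8, layers=8, max_len=2048):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, seq):                           # seq: (B, T) integers
        T = seq.shape[1]
        h = self.tok(seq) + self.pos(torch.arange(T, device=seq.device))
        # Causal mask: each position may only attend to itself and the past.
        mask = torch.triu(torch.full((T, T), float('-inf'),
                                     device=seq.device), diagonal=1)
        return self.head(self.blocks(h, mask=mask))   # next-token logits

# Training is plain next-token prediction:
#   loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab), seq[:, 1:].reshape(-1))
```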
The final stage is the diffusion network. To understand why this is necessary, you first have to understand that DVAEs have absolutely awful decoders. They are always lossy, and that cannot be fixed because VAEs do not scale. Anecdotally, this is almost certainly the reason that DALL-E’s generates are so blurry.
Diffusion models are, bar none, the best super-resolution models in existence. What is good for super-resolution is also good for upsampling the output of your DVAE decoder. You simply feed the output of your DVAE decoder as a prior to your diffusion model and train it to reproduce full-resolution images. Unlike DVAEs, diffusion models respond excellently to scaling. Unlike GANs, diffusion models do not suffer from mode collapse.
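Here is a rough sketch of what that conditioning can look like, in the style of SR3-like diffusion super-resolution: the blurry DVAE reconstruction is upsampled and stacked channel-wise with the noisy full-resolution image, and the network learns to predict the added noise. The tiny denoiser below is a stand-in for a real U-Net, and all names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoiserStub(nn.Module):
    # 6 input channels: 3 from the noisy high-res image, 3 from the
    # upsampled DVAE reconstruction used as the prior.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, x, t):
        return self.net(x)  # a real model would also embed the timestep t

def training_step(denoiser, hires, dvae_recon, alphas_cumprod):
    # One DDPM-style step: predict the noise that was added to the full-res
    # image, given the blurry DVAE output. All tensors share one device.
    B = hires.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=hires.device)
    a = alphas_cumprod[t].view(B, 1, 1, 1)
    noise = torch.randn_like(hires)
    noisy = a.sqrt() * hires + (1 - a).sqrt() * noise
    # Condition by upsampling the reconstruction and stacking channels.
    prior = F.interpolate(dvae_recon, size=hires.shape[-2:],
                          mode='bilinear', align_corners=False)
    pred_noise = denoiser(torch.cat([noisy, prior], dim=1), t)
    return F.mse_loss(pred_noise, noise)
```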
That is the slowest generative model in existence
… You’re right. Autoregressive transformers are slow to sample from, and so are diffusion networks. This is not a fast technique, and it’ll likely never see use on the edge. However, it is capable of producing extremely compelling generates. Better than anything I have seen or heard in the literature. While there is certainly a place in the world for something that works fast, there is also a place for something that truly works well. I think we are only a few years away from ML models that produce generates the average human would consider true art. Voice, music, paintings, etc. – it’s all possible, with enough data, compute and patience.
To that end, I am currently building a text-to-speech triforce model which I suspect will blow every previous TTS model out of the water. It’s going to be slow and ungodly large, but my goal is to build something that you can truly enjoy listening to. Something that you might actually use to narrate audiobooks or as a stand-in for voice actors, for example.
Like all large transformer models, this thing is going to be enormously data hungry, so my last few months have been spent building a massive speech dataset pulled from podcasts, audiobooks and YouTube. I hope to write about that soon.