We released DALL-E 3 this week. It has been a labor of love for Aditya, Gabe, and myself for a little over a year. It really is an impressive machine we have built. It continues to surprise me every day, despite having worked on it for so long. I’m extremely grateful to my fellow authors for a year of amazing learning and creating. I really hope everyone enjoys it and the world is a more colorful, graphical place because of it.
I’ve met quite a few amazing people through this blog, most of whom I’ve only had the chance to trade e-mails with. I’m attending ICML next week and would love to grab a coffee or beer with any of you. Shoot me an e-mail if interested. jbetker -at- gmail.
A pet peeve of mine that often shows up in ML discourse is the claim that humans are much more data efficient at learning than the models we are currently training. The argument typically goes like this: “I’m blown away by how much knowledge my 3-year-old has. They are smarter than most language models, despite being trained on a very small training dataset. Clearly, our models are missing something important because they cannot learn like my 3-year-old!” But is the training dataset of a 3-year-old actually smaller than that of a typical language model? For fun,…
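The excerpt cuts off before the actual numbers, but the flavor of the argument is easy to sketch. Below is a rough, purely illustrative back-of-envelope comparison; every constant in it is my own assumption rather than a figure from the post.

```python
# Back-of-envelope sketch (all numbers are rough assumptions, not from the post):
# compare the raw sensory stream a toddler receives with an LLM's text corpus.

SECONDS_AWAKE_PER_DAY = 12 * 3600       # assume ~12 waking hours per day
DAYS = 3 * 365                          # three years
VISUAL_BYTES_PER_SECOND = 10_000_000    # assume ~10 MB/s of effective visual input

toddler_bytes = SECONDS_AWAKE_PER_DAY * DAYS * VISUAL_BYTES_PER_SECOND

LLM_TOKENS = 1_000_000_000_000          # assume a ~1T-token training corpus
BYTES_PER_TOKEN = 4                     # assume ~4 bytes of text per token

llm_bytes = LLM_TOKENS * BYTES_PER_TOKEN

print(f"toddler visual stream: ~{toddler_bytes / 1e12:.0f} TB")
print(f"LLM text corpus:       ~{llm_bytes / 1e12:.0f} TB")
```

Under these (very debatable) assumptions, the toddler's raw sensory stream comes out orders of magnitude larger than the text corpus, which is the spirit of the question the post is asking.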
In my last post, I briefly discussed the infuriating fact that a neural network, even when deeply flawed, will often “work” in the sense that it’ll do better than random at classification, or a generative network might create things that sometimes look like plausible samples from the dataset. Given an idea that you’re testing out that is performing poorly – how, then, do you tell the difference between a botched implementation and an idea that just isn’t good? I think this is one of the toughest questions I have to deal with on a daily basis as an ML engineer. It’s the difference…
I don’t read as many papers as I once did. I find this surprising as I always assumed that when I made ML my full-time job, I would spend a lot more time reading up on all of the things that other folks in the field are up to. To some extent, this is a weakness. There is a healthy balance one should strike between reading and writing and I’m definitely skewing a bit too far towards the writing side of things (code, not papers). With that said, I have the honor of working with some of the people I…
I’ve been at OpenAI for almost a year now. In that time, I’ve trained a lot of generative models. More than anyone really has any right to train. As I’ve spent these hours observing the effects of tweaking various model configurations and hyperparameters, one thing that has struck me is the similarity between all the training runs. It’s becoming awfully clear to me that these models are truly approximating their datasets to an incredible degree. What that means is not only that they learn what it means to be a dog or a cat, but also the interstitial frequencies between…
Obligatory: the views and opinions expressed in this post are my own and do not represent the views and opinions of my employer. In light of all the hype going around about ChatGPT, I wanted to offer my “hot take” on what the next 2-5 years of the web look like. One aspect of the rise of generative models that isn’t getting enough attention is the long-term effect on the information economy. I think that being able to automatically produce arbitrary content that is indistinguishable from human-generated content at scale is the death knell of the web…
I’m going to take a stab at nailing down what I believe to be the five fundamental components of a deep neural network. I think there’s value in understanding complex systems at a simple, piecewise level. If you’re new to the field, I hope the understandings I’ve built up over the last few years help you!
Data Representation
The unit of data representation in a DNN is a vector. Vectors are called many different things: embeddings, tensors, activations, hidden states. They’re all just lists of floating point numbers that represent some single thing.
Storage
The learned weights of…
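As a minimal illustration of those first two components, here is a tiny PyTorch sketch (the shapes and sizes are arbitrary): a vector produced by an embedding lookup, and a learned weight matrix of the kind where knowledge is actually stored.

```python
# Minimal sketch of "data representation" and "storage" (illustrative only).
import torch
import torch.nn as nn

# Data representation: a "thing" (here, token id 7) becomes a vector of floats.
embedding = nn.Embedding(num_embeddings=100, embedding_dim=16)
vector = embedding(torch.tensor([7]))       # shape: (1, 16)

# Storage: the learned knowledge lives in weight matrices like this one.
layer = nn.Linear(16, 16)
print(vector.shape, layer.weight.shape)     # torch.Size([1, 16]) torch.Size([16, 16])
```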
Since joining OpenAI, I’ve had the distinct pleasure of interacting with some of the smartest people on the planet on the subject of generative models. In these conversations, I am often struck by how many different ways there are to “understand” how diffusion works. I don’t think most folks’ understanding of this paradigm is “right” or “wrong”: they are just different. I think there is distinct value in having different viewpoints here: an engineer’s perspective might be more useful for deploying these things in real products, whereas a mathematician’s conceptualization may aid improvements in the core technology. I’d…
I’ve been meaning to write this for a couple of months now, but simply haven’t found the time. Life has gotten quite busy for me lately, and I hope to explain why. First, the elephant in the room – I have left Google and finally stepped into the ML industry. I’ve accepted a position as a research engineer at OpenAI. To say that I am over the moon about this would be to understate it. This is, quite literally, my dream job. Somehow I have convinced someone to pay me to do the exact thing that I spend most of…
In machine learning research, there is often a stated desire to build “end to end” training pipelines, where all of the models cohesively learn from a single training objective. In the past, it has been demonstrated that such models perform better than ones built from multiple components, each trained with its own loss. The reasoning behind this notion is sound: every time you break a model up into different parts, you must necessarily introduce a new lossy medium between them. The prevailing theory is that these losses build up and produce an altogether inferior model at the end of the pipeline….
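A toy sketch of the distinction, with made-up modules and shapes: in the end-to-end case a single loss backpropagates through every component, while in the component-wise case the second module trains against a detached intermediate and the first module never hears about its errors.

```python
# Hedged sketch of end-to-end vs. component-wise training (toy modules, not from the post).
import torch
import torch.nn as nn

encoder = nn.Linear(8, 4)
decoder = nn.Linear(4, 1)
x, y = torch.randn(32, 8), torch.randn(32, 1)

# End-to-end: one objective, gradients flow through every component.
loss_e2e = nn.functional.mse_loss(decoder(encoder(x)), y)
loss_e2e.backward()

# Component-wise: the decoder trains on a detached intermediate, a "lossy medium"
# in the sense that the encoder never receives gradients from the decoder's errors.
encoder.zero_grad(); decoder.zero_grad()
intermediate = encoder(x).detach()
loss_dec = nn.functional.mse_loss(decoder(intermediate), y)
loss_dec.backward()
```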
Lab notes is a way for me to openly blog about the things I am building. I intend to talk about things I am building and the methods I plan to use to build them. Everything written here should be treated with a healthy amount of skepticism. I’ve been researching something this week that shows a lot of promise, and I really wanted to write about it. I call them “cheater latents”. They’re inspired by something I observed in Tortoise: in an early version of Tortoise, I trained the AR model by using the output clip itself as the conditioning…
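To make the truncated idea above concrete, here is a minimal, hypothetical sketch of what conditioning the AR model on a latent derived from the target clip itself might look like; the module, shapes, and pooling are my own illustration, not the actual Tortoise code.

```python
# Illustrative sketch only: conditioning an autoregressive model on a latent
# computed from the ground-truth output clip itself (the "cheating" part).
import torch
import torch.nn as nn

class ConditioningEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(80, dim)    # assume 80-bin mel frames as input

    def forward(self, target_mel):        # (batch, frames, 80)
        # Collapse the whole target clip into a single "cheater" latent.
        return self.proj(target_mel).mean(dim=1)   # (batch, dim)

cond_enc = ConditioningEncoder()
target_mel = torch.randn(4, 200, 80)      # the very clip the AR model must produce
cheater_latent = cond_enc(target_mel)     # fed to the AR model alongside the text
print(cheater_latent.shape)               # torch.Size([4, 256])
```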
Lab notes is a way for me to openly blog about the things I am building. I intend to talk about things I am building and the methods I plan to use to build them. Everything written here should be treated with a healthy amount of skepticism. I wanted to write about something I built about a month ago that I think is really neat and I would like to return to someday. A quick disclaimer is that I think there is a strong probability that formal research on this idea already exists: I vaguely recall reading something similar about…
A lot of people have asked about the computers I used to train TorToiSe. I’ve been meaning to snap some pictures, but it’s never “convenient” to turn these servers off, so I keep procrastinating. We had some severe thunderstorms today here in the Front Range which forced me to shut down my servers. I took the opportunity to take some photos.
A little history to start
Building out my servers has been a long, multi-year effort. I started the process right around the time that NVIDIA launched their Ampere GPU lineup. I was fortunate to be able to grab 6…
For my next project, I want to play around in the music generation space. I think it’ll be interesting to apply some of the lessons learned building Tortoise to music. The first step is building the musical equivalent of a vocoder: a model that will transform a MEL spectrogram into waveform data. That way the main generator(s) can work in highly reduced spectrogram space, just like Tortoise. I could just train a new Univnet model. That would probably be the wisest choice. However, I don’t really like training GANs and I have no experience training Univnet. Finally, I really…
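For a sense of what this “musical vocoder” has to do, here is a classical, non-neural baseline using torchaudio (InverseMelScale followed by Griffin-Lim). This is emphatically not the model being proposed, just an illustration of the interface: mel spectrogram in, waveform out. The sample rate, FFT size, and hop length below are assumed values.

```python
# Classical baseline illustrating the vocoder's job (not the neural model from the post).
import torch
import torchaudio

sample_rate, n_fft, hop, n_mels = 22050, 1024, 256, 80   # assumed audio parameters
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
mel_to_linear = torchaudio.transforms.InverseMelScale(
    n_stft=n_fft // 2 + 1, n_mels=n_mels, sample_rate=sample_rate)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=hop)

waveform = torch.randn(1, sample_rate)           # stand-in for one second of real audio
mel = to_mel(waveform)                           # (1, 80, frames): the reduced representation
reconstructed = griffin_lim(mel_to_linear(mel))  # rough waveform reconstruction
print(mel.shape, reconstructed.shape)
```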
Overview
TorToiSe is a text-to-speech (TTS) program which can mimic voices given 2-4 examples. It is composed of five separately-trained neural networks that are pipelined together to produce the final output. This document will first go into detail about each of the five models that make up Tortoise, and will wrap up with a system-level description of how they interoperate.
The Autoregressive Decoder
Reference Clips
A list of reference clips is also provided to the model. The model uses these clips to figure out how to properly mimic the voice, intonation, prosody, etc. of the speech it is expected to…
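As a rough illustration of the conditioning step described above, here is a hypothetical sketch in which each reference clip is encoded to an embedding and pooled into a single voice latent for the autoregressive decoder; the module and dimensions are my own stand-ins, not the real Tortoise encoder.

```python
# Illustrative sketch of conditioning on reference clips (shapes/names are assumptions).
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    def __init__(self, n_mels=80, dim=512):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)

    def forward(self, clips):                      # (n_clips, frames, n_mels)
        per_clip = self.proj(clips).mean(dim=1)    # one embedding per reference clip
        return per_clip.mean(dim=0, keepdim=True)  # pooled "voice" embedding

ref_enc = ReferenceEncoder()
reference_clips = torch.randn(3, 400, 80)   # e.g. three reference clips as mel frames
voice_embedding = ref_enc(reference_clips)  # (1, 512), given to the autoregressive decoder
print(voice_embedding.shape)
```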
As I covered in my last post, I’m currently working on improving the quality of the diffusion model used to rebuild discretized audio signals for tortoise-tts. Since realizing that the diffusion model can work entirely with spectrograms, I have been re-structuring the model to be a flat transformer/resnet hybrid. One nifty thing about this set-up is that I can now concatenate the diffusion inputs with the conditioning signal from the discretized audio signal and feed the whole thing into the model. This is the same thing that multiple authors working on diffusion models have done with the low-resolution inputs for super-resolution…
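The concatenation trick itself is simple; here is a hedged sketch with assumed shapes, where the noisy spectrogram being denoised and the conditioning signal are stacked along the channel dimension and handed to a stand-in model, much like super-resolution diffusion models do with the low-resolution image.

```python
# Sketch of channel-wise concatenation of diffusion input and conditioning (shapes assumed).
import torch
import torch.nn as nn

batch, channels, frames = 4, 80, 400
noisy_spectrogram = torch.randn(batch, channels, frames)   # the diffusion input x_t
conditioning = torch.randn(batch, channels, frames)        # aligned conditioning signal

model_input = torch.cat([noisy_spectrogram, conditioning], dim=1)       # (4, 160, 400)
denoiser = nn.Conv1d(2 * channels, channels, kernel_size=3, padding=1)  # stand-in model
predicted = denoiser(model_input)
print(model_input.shape, predicted.shape)
```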
I’ve spent the majority of the last two months working on improving the diffusion model in Tortoise TTS. The model used in v1 had a few major shortcomings:
- Conditioning inputs were bottlenecked to a very small dimensional input into the main model, limiting their effectiveness.
- The model was trained on audio signals at 11kHz. To make this feasible, I needed to chop up the signals into small clips, which have limited context.
- The model itself was relatively shallow: the top layers only had a depth of 1, and model channels at the top levels were very restricted.
- No processing…
I’ve updated the tortoise-tts repo with a script that automatically downloads model weights (thanks to the HuggingFace Hub for hosting them!). I’ve also created a colab notebook if you want to try this out on Google hardware. Make sure you pick a GPU runtime. Sample outputs can be found in the results/ folder of the GitHub repo. Find some handpicked generations below. I’m not done with this project. It is clear to me that the autoregressive model does an extremely good job at producing realistic prosody. I will be making a few tweaks to make it less sensitive to the…
In an earlier post, I walked you through a project I’ve been working on, which I called “triforce” at the time. I’ve finished training a first pass on this collection of models and want to write about the results. Deploying this speech CLIP model on the outputs of my autoregressive speech token generator made all of the difference. Outputs are consistently awesome, and almost always clearly convey the desired speech.
Adding CLIP to the ensemble
After training the three triforce models, I was having considerable difficulty with the autoregressive portion of the model. Specifically, while I would generate a lot…
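In code, the re-ranking step looks roughly like the sketch below: sample a batch of candidates from the autoregressive generator, embed them and the text with the CLIP-style model, and keep the candidate with the highest similarity. The embedding dimension, candidate count, and function names are my own placeholders.

```python
# Hedged sketch of CLIP-style re-ranking of autoregressive candidates (placeholder names).
import torch

def rerank_with_clip(text_embedding, candidate_embeddings):
    """Return the index of the candidate most similar to the text."""
    text = torch.nn.functional.normalize(text_embedding, dim=-1)
    cands = torch.nn.functional.normalize(candidate_embeddings, dim=-1)
    scores = cands @ text            # cosine similarity per candidate
    return scores.argmax().item()

text_embedding = torch.randn(512)            # from the CLIP text branch (assumed dim)
candidate_embeddings = torch.randn(16, 512)  # 16 sampled candidates, CLIP speech branch
best = rerank_with_clip(text_embedding, candidate_embeddings)
print("best candidate:", best)
```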