I’ve been trying to figure out how to best write this article for most of the last year. Today, I’ve decided to just write down something, rather than continue trying to wordsmith exactly what I mean.
I am tremendously excited by everything that is going on in ML right now. The breadth of the problem space to which we can apply generalist learning techniques seems virtually unbounded, and every time we scale something up, we see new capabilities start to emerge.
With all that being said, I think it’s worth considering from time to time what we haven’t achieved. Let’s take a quick tour of the current state of the art:
The image space I know quite well. With DALL-E 3, we cracked spelling (most of the time), which was a major capability gap of text-to-image models compared to humans. We still have a lot of problems. These models can’t tell left from right, can’t count past three, and still have a hard time getting the pose and position of body parts right. We still have to take hacky approaches to get high-resolution images (specifically, chaining multiple neural networks together, with smaller, less “intelligent” ones responsible for modeling the high-resolution space), resulting in comically distorted high-resolution details from time to time.
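For the curious, that cascaded setup looks roughly like the sketch below. This is my own illustration of the pattern, with placeholder functions standing in for the real networks – not any particular production pipeline:

```python
import numpy as np

# Sketch of a cascaded text-to-image pipeline (placeholder "models").
# The pattern: a large base model handles layout and semantics at low resolution,
# and progressively smaller super-resolution stages fill in high-frequency detail.

def base_model(prompt: str, size: int = 64) -> np.ndarray:
    """Stand-in for the big, "intelligent" text-conditioned model."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.random((size, size, 3))  # 64x64 RGB placeholder

def upsampler(image: np.ndarray, factor: int) -> np.ndarray:
    """Stand-in for a smaller super-resolution model; here just nearest-neighbor."""
    return image.repeat(factor, axis=0).repeat(factor, axis=1)

def generate(prompt: str) -> np.ndarray:
    low = base_model(prompt)   # 64x64: composition, spelling, semantics
    mid = upsampler(low, 4)    # 256x256: mid-frequency detail
    high = upsampler(mid, 4)   # 1024x1024: fine texture (where the distortions creep in)
    return high

print(generate("a photo of a left hand with five fingers").shape)  # (1024, 1024, 3)
```

The smaller upsampling stages never see the whole scene at full fidelity, which is exactly why fine details can come out distorted.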
Text is a domain I know less about, but I am an avid user of ChatGPT. GPT-4 is exceptional at understanding intent, following directions, and being creative. It’s much less amazing at providing specific information. I notice this a lot when asking for recipe instructions or variations – it always picks the most generic, boring ingredients! The issue shows up more generally in any area where I’d consider myself experienced – coffee, airplanes, gardening, biking, etc. It’s just not a great resource for these things, as it only “knows” slightly more about any niche topic than the average person off the street. Of course, there’s also the recently discovered reversal curse – roughly, a model trained that “A is B” often fails to infer that “B is A”. This doesn’t bother me as much as it does some of my colleagues, but it is certainly a shortcoming!
I use ChatGPT for coding quite a bit. It’s fantastic at answering API questions and coming up with simple algorithms, but it consistently falls flat once you ask it to design systems beyond a certain size or complexity. A common failure mode is a loop: you present a bug, it recommends a fix, another bug pops up, and it recommends removing the previous fix.
Audio is another domain I spend a lot of my mental energy on. We’ve got a pretty solid grasp on speech recognition and generation, but our models are currently quite weak at conversation. Despite their abilities with language, they don’t understand how to actually talk to people in a way that doesn’t sound robotic. Music understanding and generation is completely off the map. There’s been some solid progress this year by the folks at Meta and Stability, but I can’t help but get the feeling that the text-to-music models they’ve put forth are just regurgitating beats, rhythms, and melodies from their training datasets. I have serious doubts that any of these models have actual musical understanding, or could ever create a song the way GPT-4 can create a poem.
Video is an up-and-coming modality, and most of my experience with it comes from Runway. They’ve got some really cool tech, but it’s painfully obvious that their models do not understand basic causal sequences of events – go look at any of their videos that run a scene for more than a few seconds to see what I mean.
There are a couple of common themes running through the shortcomings I described above:
- They arise from a lack of data in a specific domain or capability (or failure to train on the data we do have)
- There’s a failure to generalize at the semantic level – as an example, knowing that hands should be rendered with five fingers is as simple as learning to count, but the models fail to do this even at scales considered “large” by modern standards.
This is important because it gives me the feeling sometimes that we’re still at just the very start of our journey in this space. More importantly, realizing that enormous models like GPT-4 or DALL-E 3 still have fundamental shortcomings is a sign that attempts to get truly intelligent behavior out of relatively small models like Llama 2 or Stable Diffusion are kind of hopeless.
I’ve had the privilege of playing with models at multiple scales, and I get the feeling that scaling compute directly addresses the above two points. It addresses the first by fitting the data manifold more precisely, allowing even minor data points from the training set to emerge in the model’s outputs. It addresses the second through raw compression.
Ilya recently did a presentation on this at Berkeley, and had a quote somewhere in there about squeezing bits and how “compressing the last few bits is where all of the interesting stuff happens”. I love this: you can imagine a giant image generation model with many hundreds of billions of parameters having to contort its entire parameter space to learn that the world is not symmetric, and that there exist these important concepts called left and right, which can be expressed as a single bit. And that’s just the tip of the iceberg.
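To make the “last few bits” framing a bit more concrete, here’s my own gloss of the standard prediction-compression correspondence (my sketch, not something from the talk): paired with an ideal arithmetic coder, a model’s loss on a sample is literally the number of bits needed to encode it, so a single global fact like “left and right are different” can only ever shave about one bit per image off an enormous total codelength.

```latex
% Bits spent encoding a sample x with model p_theta and an ideal arithmetic coder:
L_\theta(x) = -\log_2 p_\theta(x)\ \text{bits}

% Averaged over the data, this is exactly the cross-entropy training objective:
\mathbb{E}_{x \sim \mathcal{D}}\big[L_\theta(x)\big] = H(\mathcal{D}, p_\theta)

% A global binary concept (e.g. left vs. right) resolves at most ~1 bit per sample,
% a tiny fraction of the total codelength -- hence the hard, interesting structure
% lives in "the last few bits".
```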
In many ways, this gets at OpenAI’s core mission. We’re all on an unstoppable ship of technological progress – in 10 years, we’ll be running something like GPT-4 on our smartphones. OpenAI (and others!) is building massive supercomputers to shine a light on what will be possible when that happens, and after.