Folks in the field of AI like to make predictions about AGI. I have thoughts, and I’ve always wanted to write them down. Let’s do that. Since this isn’t something I’ve touched on in the past, I’ll start by doing my best to define what I mean by “general intelligence”: a generally intelligent entity is one that achieves a special synthesis of three things. First, a way of interacting with and observing a complex environment; typically this means embodiment, the ability to perceive and interact with the natural world. Second, a robust world model covering the environment. This is the mechanism which…
I’m very pleased to show the world GPT-4o. I came into the project mid-last year with Alexis Conneau with the goal of scaling up speech models and building an “AudioLM”. We knew we had something special late last year, but I don’t think either of us imagined that we’d be able to pull off something as cool as GPT-4o in such a short time frame. That came from the dedicated work of a core team of “believers”. I’m incredibly proud to have had the chance to work with so many talented and motivated people. I agree with Sam that interacting…
At my job, I’m currently in a cycle that involves working with software engineers quite a bit. One thing that has happened a number of times is that a software engineer will bring up “research code” with a condescending tone. The implication is that research code is messy, unreadable, and difficult to maintain. I don’t deny this! It often is those things, but I also think it has a beauty to its purpose and prose that is worth acknowledging. Most code has a purpose from the get-go. Someone thinks “wouldn’t it be nice if my computer did <x>”,…
From 2019 to 2021, I was fascinated by neural network architectures. I think a lot of researchers in the field were at the time. The transformer paper had been out for a little while, and it was starting to sink in how transformational it was going to be. The general question in the air was: what other simple tweaks can we make to greatly improve performance? As time has passed, I’ve internally converged on the understanding that there are only a few types of architectural tweaks that actually meaningfully impact performance across model scales. These tweaks seem to fall into one of…
Google has a neat internal website called “Rules of Thumb”, which compares the marginal cost of computational resources to the unit of a “SWE”. “SWE” stands for “Software Engineer”, and one SWE is the marginal cost of paying salary and benefits for an average engineer at the company. Throughout design docs at the company, you’ll see costs referred to in units of SWEs. For example, “deploying service <X> at 1000 QPS will cost ~100 SWEs in resources”. I always thought comparing costs of fixed assets like compute, RAM, or database accesses to the cost of hiring a new employee was…
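To make the unit concrete, here is a minimal sketch of the arithmetic. Every dollar figure below is a made-up placeholder of my own, not an actual internal rate; only the idea of dividing a resource bill by a fully-loaded engineer cost comes from the post.

```python
# Hypothetical back-of-envelope conversion of resource cost into "SWE" units.
# All dollar figures are invented placeholders, not real internal rates.

SWE_ANNUAL_COST = 300_000   # assumed fully loaded cost of one engineer, $/year
CPU_CORE_YEAR = 500         # assumed marginal cost of one CPU core, $/year
RAM_GB_YEAR = 50            # assumed marginal cost of 1 GB of RAM, $/year

def cost_in_swes(cpu_cores: float, ram_gb: float) -> float:
    """Express a service's yearly resource footprint in SWE units."""
    dollars = cpu_cores * CPU_CORE_YEAR + ram_gb * RAM_GB_YEAR
    return dollars / SWE_ANNUAL_COST

# e.g. a hypothetical service needing 40,000 cores and 200,000 GB of RAM:
print(f"{cost_in_swes(40_000, 200_000):.0f} SWEs")  # ~100 SWEs with these numbers
```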
I’ve listened to a couple of interviews with Dario Amodei, CEO of Anthropic, this year. In both of them, he dropped the term “compute multiplier” a few times. This concept is exceptionally important in the field of ML, and I don’t see it talked about enough. In this post, I’m going to attempt to explain what it is and why it is so important. Computational Efficiency: Chinchilla is undoubtedly the landmark academic paper of 2022 in the field of Machine Learning. It’s best known for documenting the optimal relationship between the amount of compute poured into training a neural network…
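As a rough reference for the relationship the paper documents, here is a minimal sketch using the commonly quoted approximations: about 20 training tokens per parameter at the compute-optimal point, and training compute estimated as 6 · N · D FLOPs. Both constants are rules of thumb rather than exact values.

```python
# Rule-of-thumb reading of Chinchilla: compute-optimal training uses roughly
# ~20 tokens per parameter, with training compute approximated as C ~= 6*N*D FLOPs.
# Both constants are the commonly quoted approximations, used here illustratively.

TOKENS_PER_PARAM = 20

def chinchilla_optimal(params: float) -> tuple[float, float]:
    """Return (approx. optimal training tokens, approx. training FLOPs)."""
    tokens = TOKENS_PER_PARAM * params
    flops = 6 * params * tokens
    return tokens, flops

tokens, flops = chinchilla_optimal(70e9)  # a 70B-parameter model
print(f"{tokens:.2e} tokens, {flops:.2e} FLOPs")  # ~1.4e12 tokens, ~5.9e23 FLOPs
```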
In my last post, I made a claim that the recently discovered reversal curse is not something that worries me. In fact, when I originally learned of it, I can’t say I was very surprised. In this post, I wanted to dig into that a little bit more. My hypothesis is that the reversal curse is an attribute of knowledge look-up, not a problem with the ability of LLMs to perform reasoning. Lookup in NNs: Let me first describe how I think knowledge look-up in neural networks currently works. At a high level, autoregressive neural networks map inputs into high-dimensional…
I’ve been trying to figure out how to best write this article for most of the last year. Today, I’ve decided to just write something down, rather than continue trying to wordsmith exactly what I mean. I am tremendously excited by everything that is going on in ML right now. The breadth of the problem space to which we can apply generalist learning techniques seems virtually unbounded, and every time we scale something up, we see new capabilities start to emerge. With all that being said, I think it’s worth considering from time to time what we haven’t achieved. Let’s…
We released DALL-E 3 this week. It has been a labor of love for Aditya, Gabe, and me for a little over a year. It really is an impressive machine we have built. It continues to surprise me every day, despite having worked on it for so long. I’m extremely grateful to my fellow authors for a year of amazing learning and creating. I really hope everyone enjoys it and that the world is a more colorful, graphical place because of it.
I’ve met quite a few amazing people through this blog, most of whom I’ve only had the chance to trade e-mails with. I’m attending ICML next week and would love to grab a coffee or beer with any of you. Shoot me an e-mail if interested. jbetker -at- gmail.
A pet peeve of mine that often shows up in ML discourse is the claim that humans are much more data-efficient at learning than the models we are currently training. The argument typically goes like this: “I’m blown away by how much knowledge my 3-year-old has. They are smarter than most language models, despite being trained on a very small training dataset. Clearly, our models are missing something important because they cannot learn like my 3-year-old!” But is the training dataset of a 3-year-old actually smaller than that of a typical language model? For fun,…
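For a sense of scale only, here is a rough illustrative estimate of my own; every constant is an assumption I am making for the example, not a figure from the post.

```python
# Illustrative back-of-envelope only; every constant below is an assumption,
# not a measurement and not a number taken from the post.

WAKING_HOURS_PER_DAY = 12
DAYS = 3 * 365
VISUAL_BITRATE_BPS = 10e6  # assume ~10 Mbit/s of useful visual signal

seconds = WAKING_HOURS_PER_DAY * 3600 * DAYS
visual_bytes = seconds * VISUAL_BITRATE_BPS / 8
print(f"{visual_bytes / 1e12:.0f} TB of raw visual input")  # ~59 TB with these assumptions

# Compare with a 1-trillion-token text corpus at an assumed ~4 bytes per token:
text_bytes = 1e12 * 4
print(f"{text_bytes / 1e12:.0f} TB of text")  # ~4 TB
```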
In my last post, I briefly discussed the infuriating fact that a neural network, even when deeply flawed, will often “work” in the sense that it’ll perform above random at classification, or a generative network might create things that sometimes look plausibly drawn from the dataset. Given an idea that you’re testing out that is performing poorly – how, then, do you tell the difference between a botched implementation and an idea that just isn’t good? I think this is one of the toughest questions I have to deal with on a daily basis as an ML engineer. It’s the difference…
I don’t read as many papers as I once did. I find this surprising, as I always assumed that when I made ML my full-time job, I would spend a lot more time reading up on all of the things that other folks in the field are up to. To some extent, this is a weakness. There is a healthy balance one should strike between reading and writing, and I’m definitely skewing a bit too far towards the writing side of things (code, not papers). With that said, I have the honor of working with some of the people I…
I’ve been at OpenAI for almost a year now. In that time, I’ve trained a lot of generative models. More than anyone really has any right to train. As I’ve spent these hours observing the effects of tweaking various model configurations and hyperparameters, one thing that has struck me is the similarities between all the training runs. It’s becoming awfully clear to me that these models are truly approximating their datasets to an incredible degree. What that means is not only that they learn what it means to be a dog or a cat, but the interstitial frequencies between…
Obligatory: the views and opinions expressed in this post are my own and do not represent the views and opinions of my employer. In light of all the hype going around about ChatGPT, I wanted to offer my “hot take” on what the next 2-5 years of the web look like. One aspect of the rise of generative models that isn’t getting enough attention is the long-term effect on the information economy. I think that being able to automatically produce arbitrary content that is indistinguishable from human-generated content at scale is the death knell of the web…
I’m going to take a stab at nailing down what I believe to be the five fundamental components of a deep neural network. I think there’s value in understanding complex systems at a simple, piecewise level. If you’re new to the field, I hope the understanding I’ve built up over the last few years helps you! Data Representation: The unit of data representation in a DNN is a vector. Vectors are called many different things: embeddings, tensors, activations, hidden states. They’re each just a list of floating point numbers that represents some single thing. Storage: The learned weights of…
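As a minimal illustration of that point (the layer and sizes below are arbitrary, chosen only to show that an embedding is just a short list of floats):

```python
import torch
import torch.nn as nn

# A token id gets mapped to a learned vector: just a list of floats.
embedding = nn.Embedding(num_embeddings=50_000, embedding_dim=8)
token_id = torch.tensor([42])
vec = embedding(token_id)

print(vec.shape)  # torch.Size([1, 8])
print(vec)        # eight floating point numbers representing one thing: token 42
```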
Since joining OpenAI, I’ve had the distinct pleasure of interacting with some of the smartest people on the planet on the subject of generative models. In these conversations, I am often struck by how many different ways there are to “understand” how diffusion works. I don’t think most folks’ understandings of this paradigm are “right” or “wrong”: they are just different. I think there is distinct value in having different viewpoints here: an engineer’s perspective might be more useful for deploying these things in real products, whereas a mathematician’s conceptualization may aid improvements in the core technology. I’d…
I’ve been meaning to write this for a couple of months now, but simply haven’t found the time. Life has gotten quite busy for me lately, and I hope to explain why. First, the elephant in the room – I have left Google and finally stepped into the ML industry. I’ve accepted a position as a research engineer at OpenAI. To say that I am over the moon about this would be to understate it. This is, quite literally, my dream job. Somehow I have convinced someone to pay me to do the exact thing that I spend most of…
In machine learning research, there is often a stated desire to build “end to end” training pipelines, where all of the models cohesively learn from a single training objective. In the past, it has been demonstrated that such models perform better than ones built from multiple components, each trained with its own loss. The reasoning behind this notion is sound: every time you break up a model into different parts, you necessarily introduce a new lossy medium. The prevailing theory is that these losses build up and produce an altogether inferior model at the end of the pipeline…
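To make the distinction concrete, here is a minimal sketch of my own; the module names, shapes, and targets are arbitrary. In the end-to-end case a single loss backpropagates through every component, while in the component-wise case each piece is fit to its own target and the interface between them becomes the lossy medium described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Arbitrary two-stage pipeline: an encoder feeding a prediction head.
encoder, head = nn.Linear(16, 8), nn.Linear(8, 1)
x, y = torch.randn(32, 16), torch.randn(32, 1)

# End to end: one loss, and gradients flow back through both components.
loss = F.mse_loss(head(encoder(x)), y)
loss.backward()  # encoder.weight.grad is populated via the head

# Component-wise alternative: the encoder is fit to its own intermediate
# target, and the head only ever sees detached encoder outputs, which is
# the lossy interface between the two pieces.
intermediate_target = torch.randn(32, 8)  # stand-in for a hand-designed representation
encoder_loss = F.mse_loss(encoder(x), intermediate_target)
head_loss = F.mse_loss(head(encoder(x).detach()), y)
```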
Lab notes is a way for me to blog openly about the things I am building and the methods I plan to use to build them. Everything written here should be treated with a healthy amount of skepticism. I’ve been researching something this week that shows a lot of promise, and I really wanted to write about it. I call them “cheater latents”. They’re inspired by something I observed in Tortoise: in an early version of Tortoise, I trained the AR model by using the output clip itself as the conditioning…