I’ve listened to a couple of interviews with Dario Amodei, CEO of Anthropic, this year. In both of them, he dropped the term “compute multiplier” a few times. This concept is exceptionally important in the field of ML, and I don’t see it talked about enough. In this post, I’m going to attempt to explain what it is and why it is so important.
Computational Efficiency
Chinchilla is undoubtedly the landmark academic paper of 2022 in the field of machine learning. It’s best known for documenting the optimal relationship between the amount of compute poured into training a neural network and the amount of data used to train it. In the process, it refuted some of the findings of an OpenAI paper from 2020, Scaling Laws for Neural Language Models, which claimed that the optimal data-to-compute ratio was far smaller than it actually is. (By the way, I highly recommend this post if you want to read more about Chinchilla’s findings)
Chinchilla did another thing, though – it highlighted the importance of studying the computational efficiency of our learning algorithms. Let’s dig in there a bit:
Compute efficiency is a measure of how well your model performs, taking into account only the amount of compute you used to train it.
“How well your model performs” is generally measured by a loss function applied to a held-out test dataset, but it can be measured using any stable, low-variance metric that improves as your model trains.
“Compute” is generally measured in FLOPs, but it can be thought of as “how many GPUs, for how long”.
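To make “compute” concrete, here’s a back-of-the-envelope sketch using the common C ≈ 6·N·D rule of thumb from the scaling-laws literature (N is parameter count, D is training tokens). The model size and token count below are made up purely for illustration:

```python
# Rough training-compute estimate via the common C ~= 6 * N * D rule of thumb.
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer."""
    return 6.0 * n_params * n_tokens

# e.g. a 7B-parameter model trained on 1.4T tokens
print(f"~{training_flops(7e9, 1.4e12):.2e} FLOPs")  # ~5.88e+22 FLOPs
```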
Multipliers
Putting the two together, you can re-define compute efficiency a little bit: given a set of design and hyperparameter choices and a fixed test-loss target, compute efficiency measures how much compute is required to meet that test loss. A more efficient model requires fewer GPUs; a less efficient one needs more.
And this is where the term “compute multiplier” comes in. If you make any discovery that improves compute efficiency across all model scales, you have discovered a compute multiplier. If your discovery increases efficiency by 20%, it’s as if your training fleet suddenly has 20% more GPUs in it.
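To make that concrete, here’s a toy sketch of how you might actually measure one: run the baseline recipe and a candidate recipe at a few compute budgets, then compare how much compute each needs to hit the same target loss. Every number below is invented purely for illustration – these aren’t real measurements of anything:

```python
import numpy as np

# Made-up (compute, test-loss) points for two training recipes, e.g. from a
# small scan over model sizes. Compute is in FLOPs; losses are invented.
baseline = {"compute": np.array([1e18, 1e19, 1e20, 1e21]),
            "loss":    np.array([3.20, 2.95, 2.74, 2.57])}
improved = {"compute": np.array([1e18, 1e19, 1e20, 1e21]),
            "loss":    np.array([3.18, 2.93, 2.72, 2.55])}

def compute_at_loss(run, target_loss):
    """Interpolate (in log-compute) how much compute is needed to hit target_loss."""
    # np.interp needs an increasing x-axis, and loss decreases with compute,
    # so interpolate over the reversed arrays.
    log_c = np.interp(target_loss, run["loss"][::-1], np.log10(run["compute"])[::-1])
    return 10 ** log_c

target = 2.80
multiplier = compute_at_loss(baseline, target) / compute_at_loss(improved, target)
print(f"compute multiplier at loss {target}: ~{multiplier:.2f}x")  # ~1.25x
```

In this made-up example, the improved recipe reaches the same loss with roughly 25% less compute – a ~1.25x compute multiplier.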
Due to the way scaling actually works, compute multipliers are generally worth more than a proportional increase in the number of GPUs you have. This is because adding more GPUs comes with overhead that decreases the net efficiency of the system; for example, slow interconnect speeds might mean that a GPU cluster that is 20% larger is only 18% faster at crunching your giant matmuls.
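A crude way to see that sub-linear scaling is an Amdahl’s-law-style toy where only part of the work parallelizes across GPUs. The 0.9 “parallel fraction” below is an arbitrary number I picked, not a measurement of any real cluster:

```python
# Amdahl's-law-style toy: only `parallel_frac` of the work scales with GPU count,
# so a 20% bigger cluster buys less than a 20% speedup.
def effective_speedup(extra_gpu_frac: float, parallel_frac: float = 0.9) -> float:
    gpus = 1.0 + extra_gpu_frac
    return 1.0 / ((1.0 - parallel_frac) + parallel_frac / gpus)

print(f"{effective_speedup(0.20):.2f}x")  # ~1.18x from 20% more GPUs, not 1.20x
```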
In a world where a single H100 costs $25k at a minimum, and we’re training these LLMs on thousands of GPUs, you can see why compute multipliers start to make a huge difference. Finding a 5% compute multiplier means you potentially saved your 4000-GPU company $5M in GPUs. That savings can be applied in several ways: buying fewer GPUs, training faster, or training a larger & better model.
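The arithmetic, spelled out (using the same rough $25k-per-H100 and 4000-GPU numbers as above):

```python
gpu_price = 25_000   # USD per H100, roughly
fleet_size = 4_000   # GPUs
multiplier = 1.05    # a 5% compute multiplier

equivalent_gpus = fleet_size * (multiplier - 1)   # ~200 "virtual" GPUs
print(f"~{equivalent_gpus:.0f} GPU-equivalents, ~${equivalent_gpus * gpu_price / 1e6:.0f}M")
# ~200 GPU-equivalents, ~$5M
```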
Where are they hiding?
I started this post mentioning that Dario had talked about compute multipliers. In both podcasts I listened to, the context was an infosec discussion – he considers proprietary compute multipliers to be among the most valuable corporate secrets that Anthropic has.
As such, I obviously can’t share specifics on where you might go to look for these things. That’s OK, though, as the point of this post is to foster a different way of thinking about how improvements are made to ML algorithms. I think the search for compute multipliers is far more ubiquitous than anyone who doesn’t religiously adhere to scaling laws might realize: every single scientist in the field of machine learning who isn’t looking into new applications of existing techniques should be performing compute-efficiency scans to ensure that their discoveries are actually relevant. Some examples (a toy compute-matched comparison is sketched after the list):
- You invent a new architecture that proposes to replace the transformer – you’d better show that it achieves a better test loss than a transformer at a fixed compute budget!
- You create a new dataset – does it improve a test loss you care about, compared with an otherwise identical model trained for the same number of tokens (or the same amount of time)?
- You invent a new optimizer algorithm – what test loss does it achieve when compared with the old one?
- You tweak some hyperparameters – … you get the point
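For concreteness, here’s the skeleton of what a compute-matched comparison might look like. Everything here is a toy: `make_model`, `train_step`, and `eval_test_loss` are hypothetical stand-ins for whatever your real training stack provides, and the loss curves are faked.

```python
import random

FLOP_BUDGET = 1e9  # the same fixed compute budget for every candidate

def make_model(recipe):
    # Toy stand-in: a "model" is just its recipe plus a fake loss state.
    return {"recipe": recipe, "loss": 4.0}

def train_step(model):
    # Toy stand-in: nudge the loss down and return the FLOPs this step consumed.
    model["loss"] *= 0.999 if model["recipe"] == "baseline" else 0.9985
    return 1e6

def eval_test_loss(model):
    # Toy stand-in: a noisy held-out evaluation.
    return model["loss"] + random.uniform(0.0, 0.01)

def run_with_budget(recipe, budget=FLOP_BUDGET):
    model, spent = make_model(recipe), 0.0
    while spent < budget:
        spent += train_step(model)
    return eval_test_loss(model)

for recipe in ("baseline", "new_idea"):
    print(recipe, round(run_with_budget(recipe), 3))
```

The point isn’t the toy numbers – it’s that every candidate gets exactly the same FLOP budget, so whichever wins, wins on compute efficiency.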
In general, the important conclusion is that if you believe in scaling laws (and you should!), then it is not impressive to simply come up with an idea and achieve a state-of-the-art score on some evaluation metric. Anyone can do that with any architecture and enough compute. You must measure the performance of your idea against other ideas with compute held fixed.
Funnily enough, if everyone did this, we’d see a lot fewer papers in the field, as it’d become clear pretty quickly that most ideas simply don’t pan out where it matters. The only reason the paper mills keep turning is that just about anything can be SOTA with enough scale, compute, and eval hacking.