I don’t read as many papers as I once did. I find this surprising as I always assumed that when I made ML my full-time job, I would spend a lot more time reading up on all of the things that other folks in the field are up to.
To some extent, this is a weakness. There is a healthy balance one should strike between reading and writing, and I’m definitely skewing a bit too far towards the writing side of things (code, not papers). With that said, I have the honor of working with some of the people I respect most in the field, and they don’t read much more than I do.
There are several good reasons for this, but the one I’d like to talk about today is the importance of slow progress and ablations in the field. These two things work harmoniously to make papers either too boring to read past a quick glance or completely unusable for any future work. Let’s talk about why.
One of the most important things that I’ve learned over the last few years is the surprising capacity of neural networks to just work. What this means is that across the spectrum of crazy ideas you can try to use to improve the performance of a neural network, most of them will at least train and produce results. For this reason, we need to come up with measurements of how well a neural network works – these take the form of the various evals we use in the field.
The pie-in-the-sky dream is to use these evals to judge the comparative performance of various techniques and ideas against one another. If you have a crazy idea, code it up and get a better eval – it’s a good idea, right? Time to write a paper.
Well… it’s not really that simple. For one thing – implementation matters a lot. Two “transformers” written from scratch by two different people will often have different performance characteristics. There’s a laundry list of small decisions you make when implementing these things, and they all interact with other changes you might be testing out in unintuitive ways. Examples of common variables include init scales, optimizer choices, data preprocessing choices, the kind and placement of normalization layers, activation functions, and how positional information is fed into the attention layers – the list goes on and on.
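To make that concrete, here’s a toy PyTorch sketch of two blocks that are nominally “the same transformer” but differ in exactly these small decisions. The dimensions, init values, and layer choices are all made up for illustration – this isn’t from any particular paper or codebase.

```python
# Toy sketch: two "identical" transformer blocks that differ only in small
# implementation decisions -- norm placement, activation, and init scale.
import torch
import torch.nn as nn

class BlockA(nn.Module):
    """Post-norm, GELU, default PyTorch init."""
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        return self.norm2(x + self.mlp(x))

class BlockB(nn.Module):
    """Pre-norm, ReLU, residual-branch init scaled down with depth."""
    def __init__(self, d=512, heads=8, n_layers=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        # Shrink the output projection of the residual branch (a common trick).
        nn.init.normal_(self.mlp[-1].weight, std=0.02 / (2 * n_layers) ** 0.5)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))
```

Train both with the same data and the same budget and the evals can easily come out different – before anyone has touched a “new technique”.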
When you write a paper saying that your new NN technique is better than mine, the first question I’m going to ask is “can I see the code?” Because odds are your new technique isn’t better than mine; your implementation is. I’d love to copy it. 🙂
Smart researchers are well aware of these problems and work around them in one of two ways:
1. Only make small improvements to existing open source code with well-characterized performance. This only works if everything is open sourced, BTW – including the training code and the dataset. It also only truly works for trainers whose authors paid careful attention to determinism.
2. Perform ablations, and a lot of them. Start with a bare implementation of some base model like a transformer. Train it for a long time. Make a small tweak. Train it again. Make another small tweak and train it again.
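For what it’s worth, here’s a hypothetical skeleton of what that second recipe looks like in code. `train_and_eval` and the config keys are stand-ins for your actual trainer, not a real API – the only point is that exactly one thing changes per run while the data, seed, and step count stay fixed.

```python
# Hypothetical ablation sweep: one baseline config, then one small tweak at a
# time, with data, seed, and step count held fixed across every run.
from copy import deepcopy

BASE_CONFIG = {
    "norm": "pre",        # normalization placement
    "activation": "gelu",
    "init_std": 0.02,
    "optimizer": "adamw",
    "seed": 0,
    "steps": 100_000,
}

ABLATIONS = [
    {},                       # the untouched baseline
    {"norm": "post"},
    {"activation": "relu"},
    {"init_std": 0.006},
    {"optimizer": "sgd"},
]

def train_and_eval(config):
    # Placeholder: plug in your real training loop and eval harness here.
    return 0.0

for tweak in ABLATIONS:
    config = deepcopy(BASE_CONFIG)
    config.update(tweak)      # exactly one change relative to the baseline
    score = train_and_eval(config)
    print(tweak or "baseline", "->", score)
```

Publishing that little table of (tweak, score) pairs is the part most papers skip.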
Most good researchers follow the second recipe, but most don’t bother to publish the intermediate steps. It’s a shame, because I think this is the most important data to have on hand when determining whether or not a new idea is worth pursuing!
This is why I really enjoy reading the ablation sections of any research paper: they give me a sense of what actually matters, and how much of a result is simply due to the implementation choices the researcher made.
I want to leave off with one last important thing to consider when reading about the results of an experiment, whether or not ablations are present: in ML, the two most important determinants of performance are data and compute. If any “new technique” changes either the amount of compute or the type of data used to train a NN, the technique itself is questionable at best. It’s fairly obvious that we shouldn’t be training on different datasets from run to run, but measuring compute changes is often much more difficult.
To ablate compute changes properly, you really need to track the number and dimensions of all the matrix multiplications in your NN and compare that tally against the same for your baselines. If your new number is different, that will inevitably affect performance – even if the difference is in something “dumb” like a normalization layer. NNs are tricky little beasts and find ways to use any compute you give them. This was not appreciated enough before Chinchilla (myself included!), but researchers are starting to “get it”.
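Here’s a rough sketch of the kind of accounting I mean, done with PyTorch forward hooks. It’s an assumption-heavy toy – it only counts `nn.Linear` matmuls, so the attention score/value products (and anything else exotic) need their own bookkeeping – but it catches a lot of accidental compute bumps.

```python
# Rough sketch: count matmul FLOPs in one forward pass by hooking nn.Linear.
# Only Linear layers are counted; attention score/value matmuls, convolutions,
# etc. would need their own hooks.
import torch
import torch.nn as nn

def count_linear_flops(model, example_input):
    flops = 0

    def hook(module, inputs, output):
        nonlocal flops
        tokens = inputs[0].numel() // module.in_features  # batch * sequence positions
        flops += 2 * tokens * module.in_features * module.out_features

    handles = [m.register_forward_hook(hook)
               for m in model.modules() if isinstance(m, nn.Linear)]
    with torch.no_grad():
        model(example_input)
    for h in handles:
        h.remove()
    return flops

# Usage: run the same input through your baseline and your "new technique"
# and compare the two totals before comparing any evals, e.g.
#   count_linear_flops(new_model, x) vs. count_linear_flops(baseline_model, x)
```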