A pet peeve of mine that often shows up in ML discourse is the claim that humans are much more data efficient at learning than the models we are currently training. The argument typically goes like this:
“I’m blown away by how much knowledge my 3 year old has. They are smarter than most language models, despite being trained on a very small training dataset. Clearly, our models are missing something important because they cannot learn like my 3 year old!”
But is the training dataset of a 3 year old actually smaller than a typical language model? For fun, I’d like to do some napkin math to bring the numbers down to levels that we can actually reason over.
Starting with the LLM itself – let’s use Llama 65B. This model was trained on 1.4T tokens. For some easy math, let’s assume that the codebook size was 65536, which means that each token represents 16 bits of data. That means Llama was trained on 22.4 TBits of data in total.
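That arithmetic is easy to check (note the 65,536-entry codebook is this post's round-number assumption, not Llama's actual tokenizer vocabulary):

```python
# Napkin math: total bits in Llama 65B's training set,
# assuming a 65,536-entry codebook (16 bits per token).
tokens = 1.4e12                    # 1.4T training tokens
bits_per_token = 16                # log2(65536)
total_bits = tokens * bits_per_token
print(f"{total_bits / 1e12:.1f} Tbit")  # → 22.4 Tbit
```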
Human “training data”
Let’s try to figure out how much information a human can gather in 3 years. To do this, we’ll first decompose our world experience into individual “experiences” which happen at a regular interval across those 3 years. Let’s say that a human has a new experience every second (it’s probably more frequent than this). Let’s also assume that the human is awake on average 12 hours a day. Over a time span of 3 years, that means a human will have 3 * 365 * 12 * 60 * 60 = 47,304,000 experiences.
Let’s now compare those 3 years of experience with the data we used to train that 65B Llama model: 22,400,000,000,000 bits / 47,304,000 experiences ≈ 473,533 bits ≈ 474 KBit/experience. That is to say – if every given human experience carries more than 474 KBit of information, then a 3 year old human is technically getting trained on more raw information than Llama 65B.
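The experience count and the per-experience budget work out like this (a quick check of the arithmetic above):

```python
# One experience per second, 12 waking hours a day, for 3 years.
experiences = 3 * 365 * 12 * 60 * 60          # 47,304,000 experiences
llama_bits = 1.4e12 * 16                      # 22.4 Tbit of training data
bits_per_experience = llama_bits / experiences
print(f"{experiences:,} experiences")
print(f"{bits_per_experience / 1e3:.0f} Kbit/experience")  # → 474 Kbit/experience
```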
Let’s digest that a bit further by the modalities of human experience:
The internet tells me that a human eye can perceive 576 Megapixels and 10M colors. 10M colors is ~23 bits. We’ve got two eyes, so the total is 576,000,000 pixels * 23 bits/pixel * 2 eyes. That comes out to ~26 GBit per experience. Hm.
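Plugging in the eye numbers above (both figures are internet-sourced estimates, not measurements):

```python
pixels = 576_000_000          # claimed megapixel count of one eye
bits_per_pixel = 23           # ~log2(10,000,000 colors)
eyes = 2
vision_bits = pixels * bits_per_pixel * eyes
print(f"{vision_bits / 1e9:.1f} Gbit per experience")  # → 26.5 Gbit per experience
```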
I don’t think the human brain actually perceives all of the visual information it is presented with. Rather, it focuses on a very small fraction (attention!). But even a very small fraction of 26GBit is a big number! Basically any way you try to pare this number down, it’s going to be big.
Young humans can perceive sound frequencies up to 20kHz. I don’t really know how fine-grained the pressure fluctuations our ears can perceive are, so I’ll use 8 bits (256 pressure levels) as a reasonable lower bound. That means over the course of one second, a human could theoretically perceive 20,000 * 8 = 160 KBits of audio data.
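The audio estimate, under the same assumptions (20 kHz bandwidth, 8 bits of amplitude resolution):

```python
sample_rate = 20_000      # Hz, upper bound of young-human hearing
bits_per_sample = 8       # assumed amplitude resolution (256 levels)
audio_bits = sample_rate * bits_per_sample  # bits per one-second experience
print(f"{audio_bits / 1e3:.0f} Kbit per experience")  # → 160 Kbit per experience
```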
Touch, smell and taste
AFAIU, smelling is performed by chemicals binding to smell receptors. The action of binding is an on or off proposition, and the internet tells me we have ~400 different smell receptors. That comes out to an easy 400 bits of information from smell.
I’ll measure touch similarly – the internet tells me we have ~4M touch receptors (why does the number 4 keep coming up?). Each one is independent and (I assume?) can be on or off, which comes out to 4MBits of touch information.
Taste is a complex amalgam of touch, smell and tastebuds. We can taste something like 5 independent tastes, I’ll assume smell and touch is covered above. So let’s say taste is a simple 5 bits of information.
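For what it’s worth, stacking all of these rough modality estimates against the ~474 KBit threshold shows just how lopsided the comparison is – vision alone dominates by several orders of magnitude:

```python
# All figures are the rough per-experience estimates from above.
vision = 576_000_000 * 23 * 2   # bits: pixels * bits/pixel * eyes
audio = 20_000 * 8              # bits: samples/sec * bits/sample
smell = 400                     # bits: one per smell receptor type
touch = 4_000_000               # bits: one per touch receptor
taste = 5                       # bits: one per independent taste
total = vision + audio + smell + touch + taste

threshold = 1.4e12 * 16 / (3 * 365 * 12 * 60 * 60)  # ~474 Kbit/experience
print(f"total is ~{total / threshold:,.0f}x the threshold")
```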
I’m not going to bother adding all these information sources together as I think this is all pretty pie in the sky. The point I’m trying to make – and I hope I’ve demonstrated it clearly – is that we can pretty easily make the argument that the human brain receives at least 474 KBit of information per second.
And if we can make this claim, then we can also make the claim that a 3 year old has very likely been trained on as much data as Llama 65B (though I suspect it’s quite a lot more!).
I expect the skeptics reading this will make the following counterpoint: most human experiences are redundant! Even though the total information input is very dense, the amount of novel information is quite small!
But you can say the same thing about our models! The text datasets these models are trained on are composed of all the human text on the internet. That text is necessarily highly redundant – humans love to talk about the same things, day in and day out. Politics, sex, war, food, dieting, exercise, sports, fashion, etc. The vast majority of human text has very low semantic entropy.
I actually think redundancy is very important to a learning system. It aids in the act of compression, which seems to be linked to intelligence. By being exposed to the same observations over long periods of time, we learn what matters and what does not. Highly redundant experiences fade from our attention and we instead focus on novel, unexpected occurrences. I would not be surprised to learn that our models work in the same way. Just the fact that training over multiple epochs with sensible data augmentations improves model performance seems like a small signal that this is the case.
Other types of efficiency
It’s disingenuous to only consider data efficiency when talking about the human brain. When it comes to energy efficiency, our brains are pretty damned remarkable. I don’t think we’re even within a few orders of magnitude of that type of efficiency with silicon. This gives me a lot of hope, though! If capabilities are already as awesome as they are at the paltry efficiencies we’ve been able to achieve, I can’t wait to see what they’ll look like in a decade or two.