In my last post, I made a claim that the recently discovered reversal curse is not something that worries me. In fact, when I originally learned of it, I can’t say I was very surprised. In this post, I wanted to dig into that a little bit more.
My hypothesis is that the reversal curse is an attribute of knowledge look-up, not a problem with the ability of LLMs to perform reasoning.
Lookup in NNs
Let me first describe how I think knowledge look-up in neural networks currently works.
At a high level, autoregressive neural networks map inputs into high-dimensional vectors (we call them “latents”). At the terminus of the network, we map those latents into sets of probabilities which can be used to make a series of predictions through the process of sampling.
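To make that last step concrete, here is a minimal sketch of turning final-layer scores into probabilities and sampling from them. The four-word vocabulary and the scores are invented for illustration, and I skip the projection from latent to scores that a real model performs first.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and made-up final-layer scores (assumptions).
vocab = ["sunny", "snowy", "rainy", "windy"]
logits = [2.0, 0.5, 1.0, 0.1]

probs = softmax(logits)

# Sampling: draw one token according to those probabilities.
rng = random.Random(0)  # fixed seed so the draw is repeatable
next_token = rng.choices(vocab, weights=probs, k=1)[0]
print({w: round(p, 3) for w, p in zip(vocab, probs)}, "->", next_token)
```

Repeating the draw, appending each sampled token to the context, is what produces a series of predictions.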
The true power of AR models is in the context, which is the sequence of latents that came before the current prediction. Modern AR models learn to use the context to aid their predictions using a mechanism called self-attention. Empirically, we find that the richer the context, the more accurate the prediction. For example, if you prompt a language model with the following two statements:
- The weather today is
- It’s December in Colorado, the weather today is
The prediction will be more accurate in the second case.
How does this mechanistically work inside of the model? Let’s assume that the model’s latent space is analogous to the internal registers of a computer processor. It can be loaded with a small bank of immediately relevant knowledge which is used at the last layer to make a prediction for the next token.
In the first example above, the register bank may only contain the values: [‘what is the weather’, ‘time: today’]. Probabilistically, the completion that should be produced from such a register bank is “sunny”.
In the second example, the register bank might contain the following: [‘location: colorado’, ‘month: december’, ‘what is the weather’, ‘time: today’], which will favor the word “snowy” with far higher probability.
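A toy version of the two register banks might look like the sketch below. The `EVIDENCE` table and its numbers are entirely made up; the only point is that extra entries in the bank shift which completion wins.

```python
from collections import Counter

# Hypothetical evidence each register entry contributes toward a completion.
# The numbers are invented for illustration only.
EVIDENCE = {
    "what is the weather": {"sunny": 3, "rainy": 2, "snowy": 1},
    "time: today": {},
    "location: colorado": {"snowy": 2, "sunny": 1},
    "month: december": {"snowy": 4},
}

def predict(register_bank):
    """Sum the evidence from every register entry and normalize."""
    scores = Counter()
    for entry in register_bank:
        scores.update(EVIDENCE.get(entry, {}))
    total = sum(scores.values())
    return {word: count / total for word, count in scores.items()}

short_bank = ["what is the weather", "time: today"]
long_bank = ["location: colorado", "month: december"] + short_bank

short_pred = predict(short_bank)
long_pred = predict(long_bank)
print(max(short_pred, key=short_pred.get))  # sunny
print(max(long_pred, key=long_pred.get))    # snowy
```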
An important thing to understand here is that the register bank described above has a limited capacity to hold information (though it grows with scale!). Thus, to maximize the model’s ability to predict the next word, it is very important that only salient knowledge from the dataset is loaded into the register bank. Put another way: for the purposes of loading knowledge from learned parameters into the latent space, the model does not use any kind of logic. It is simply a contextual lookup function that is learned from the dataset.
Neural networks don’t just learn direct input->output mappings, they also learn facts, which can be used to augment the output process. These facts are contextually applied to the latent space through the layers of the neural network. Using the above example, the final register bank after loading contextual facts from the model’s weights might be:
- [‘what is the weather’, ‘weather is an atmospheric phenomenon’, ‘possible weathers: sunny, snowy, rainy, …’, ‘common weather: sunny’, ‘time: today’]
- [‘December is a winter month’, ‘winter is cold in the northern hemisphere’, ‘we’re in Colorado’, ‘Colorado is a mountain state in the USA’, ‘Colorado is in the northern hemisphere’, ‘it snows in Colorado in the winter’, ‘what is the weather’, ‘weather is an atmospheric phenomenon’, ‘possible weathers: sunny, snowy, rainy, …’, ‘common weather: sunny’, ‘common winter weather in Colorado: snowy’, ‘time: today’, ‘today is in December’]
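The infilling step can be sketched as a plain lookup with a capacity limit. The `FACTS` table, the `infill` helper, and the capacity of 8 are all hypothetical stand-ins for what a real model stores in its weights; note that nothing here reasons, it only retrieves.

```python
# Hypothetical "learned" associations: a trigger in the bank pulls in related facts.
FACTS = {
    "month: december": ["December is a winter month"],
    "location: colorado": ["Colorado is in the northern hemisphere",
                           "it snows in Colorado in the winter"],
    "what is the weather": ["possible weathers: sunny, snowy, rainy",
                            "common weather: sunny"],
}

def infill(register_bank, capacity=8):
    """Append facts triggered by the context, stopping at a fixed capacity."""
    bank = list(register_bank)
    for entry in register_bank:
        for fact in FACTS.get(entry, []):
            if len(bank) >= capacity:
                return bank  # bank is full: remaining facts are dropped
            if fact not in bank:
                bank.append(fact)
    return bank

bank = infill(["location: colorado", "month: december",
               "what is the weather", "time: today"])
for item in bank:
    print(item)
```

With a capacity of 8, the last candidate fact does not fit, which is the toy analogue of a bank that can only hold the most salient knowledge.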
You can start to see why a model might make better predictions of the next word given world knowledge infilled by the neural network, and why context matters so much.
There’s a caveat to the above: I don’t really think that a model’s “register bank” is a discrete memory element of the kind computer programmers are used to. Rather, I think it’s more like a superposition of information states. In such a superposition, all possibilities are initially represented equally, but as you add context, some information becomes more and more likely. So by adding “Colorado” and “December” to the context above, you update the information state to be more predisposed to wintery topics.
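One loose way to picture that update is as a reweighting of a distribution over topics. The sketch below is an analogy rather than a claim about what happens inside a network: the topics and the likelihood numbers are invented.

```python
# Hypothetical topics and made-up likelihoods of seeing each cue under each topic.
TOPICS = ["winter", "summer", "sports"]
LIKELIHOOD = {
    "Colorado": {"winter": 0.5, "summer": 0.3, "sports": 0.2},
    "December": {"winter": 0.8, "summer": 0.1, "sports": 0.1},
}

def update(state, cue):
    """Reweight the state by how well each topic explains the cue, then renormalize."""
    weighted = {t: state[t] * LIKELIHOOD[cue][t] for t in TOPICS}
    total = sum(weighted.values())
    return {t: w / total for t, w in weighted.items()}

state = {t: 1 / len(TOPICS) for t in TOPICS}  # all possibilities equal at first
for cue in ["Colorado", "December"]:
    state = update(state, cue)

print({t: round(p, 3) for t, p in state.items()})
```

After both cues, most of the weight sits on the winter topic, even though nothing was ever stored in a discrete slot.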
This last point is relevant because it explains why some types of information in the context are more important than others. If “Tom Cruise” is in the context, the model can configure the information content of the latent vectors such that they indicate higher probabilities for all kinds of specific facts about the actor Tom Cruise. The model needed to learn these facts, since Tom Cruise facts are quite relevant to fitting the training dataset.
However, when “Mary Lee Pfeiffer” is in the context, the model will augment the hidden state far less. That’s because this name rarely appears in the dataset and so the model has no reason to waste capacity learning mapping functions from “Mary Lee Pfeiffer” to facts about her – at least before a certain scale.
Where logic comes from
So we’ve got a framework for how a model might retrieve information, but how would such a model perform reasoning? I think reasoning happens entirely within the attention context. In the process of training, it becomes advantageous for the model to learn that if A has a relationship with B in the context, then B has an inverse relationship with A. This is particularly important for many aspects of programming. For example, ask your favorite LLM to evaluate this program:
    a = 'john'
    l = [a, 'sally', 'fred']
    for k in l:
        print('john' == k)
GPT 3.5 easily gets this one.
In fact, 3.5 also easily gets the prompt “who is Mary Lee Pfeiffer’s son” if you first give the model “Tom Cruise’s mother is Mary Lee Pfeiffer”. This might seem inane, but it shows the model truly has some reasoning abilities: it has not encoded the response to “who is Mary Lee Pfeiffer’s son” in its parameters, but is able to use information in the context to find the answer.
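The contrast between the two routes can be sketched in a few lines. `PARAMETRIC`, `INVERSE`, `recall`, and `answer_in_context` are hypothetical names for this illustration, and the inverse table is deliberately simplified (it ignores daughters and fathers).

```python
# Hypothetical stand-in for facts memorized in the weights: one direction only.
PARAMETRIC = {("Tom Cruise", "mother"): "Mary Lee Pfeiffer"}

# Simplified inverse rule (an assumption; real kinship is richer than this).
INVERSE = {"mother": "son", "son": "mother"}

def recall(subject, relation):
    """Pure lookup: succeeds only for the direction that was stored."""
    return PARAMETRIC.get((subject, relation))

def answer_in_context(context, subject, relation):
    """Scan facts stated in the context, applying the inverse rule when needed."""
    for s, r, o in context:
        if (s, r) == (subject, relation):
            return o
        if (o, INVERSE.get(r)) == (subject, relation):
            return s
    return None

context = [("Tom Cruise", "mother", "Mary Lee Pfeiffer")]
print(recall("Mary Lee Pfeiffer", "son"))                      # None
print(answer_in_context(context, "Mary Lee Pfeiffer", "son"))  # Tom Cruise
```

The lookup fails in the reverse direction, while the in-context route succeeds because the inversion is applied to what is in front of it, not to what was memorized.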
We’ve known about this for several years now. It is exactly why chain of thought prompting works: these models are able to reason within the context, but will often fall flat at the same tasks when you attempt to zero-shot them.
Is this even a model specific thing?
I’m pretty sure knowledge recall in my own head works in a very similar way to what I described above. When asked to recall a fact, I do a type of tree search that involves dragging up contextual clues which allow my mind to hunt down facts. This type of thinking is so common that the objects I pull into my head have a term: “mnemonic”.
The authors of the reversal curse paper actually bring up this point themselves near the end:
The operating question is: assuming I have some knowledge of a relationship between objects, is there ever a case where I could not invert that relationship? I think the answer is “yes” – one place this commonly occurs for me is with people’s names. If you tell me a person’s name, I can often see their face in my mind’s eye, but I often have a hard time recalling names given a face. I think others have this difficulty too – it’s why there’s a cultural guessing game that everyone likes to play when watching movies (“who is that actor?”). Incidentally, I’m quite bad at that game, even though I often know the actors in question.
Another example that comes to mind – we have this term “light bulb moment” that refers to the moment when you remember something that pulls together two facts into new knowledge. The important point here is that you already knew everything you needed to have this “light bulb moment”, you only needed the right context to help you pull it out of your memory.
I think we have a tendency to assume our models have more context than they actually have. When you prompt a model with “who is Mary Lee Pfeiffer’s son?”, it is as if you walked up to a total stranger in a random place in the world and asked the same question. Neither person nor machine has any context for the question, and their ability to respond accurately will be entirely conditional on their ability to recall facts. I would not be surprised if highly capable Jeopardy contestants exhibited difficulties retrieving obscure relationships similar to those of our language models.
In conclusion – memory recall is not the same thing as capacity for logical reasoning, and we shouldn’t be alarmed that our models do not mix the two.