ARC is a benchmark developed to test out-of-distribution reasoning and common sense in general solvers. It is specifically designed to be:
- Easily solvable by most humans
- Not amenable to any kind of brute-force solver (e.g. trying every permutation of a solution)
- Not solvable through rote memorization
The designers of ARC achieved this in a creative way: each problem is a visual puzzle in which the participant must find an algorithm that explains the symmetries seen across several demonstrations, then apply that algorithm to a final input. This sounds complicated, but in practice it is quite intuitive – most children can complete ARC questions.
LLMs are being pitched as general solvers, so lately we have been trying them out on this challenge. However, to make ARC solvable by a pure language model, you must remove the visual “clues” from the problem.
More concretely, let’s take the test problem on the ARC GitHub. Humans see this:
While the exact same problem is fed into a language model like this:
{"train": [{"input": [[0, 0, 0, 0, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 0, 0, 0, 0], [0, 0, 0, 4, 5, 0, 0, 0, 0], [0, 0, 0, 4, 5, 4, 4, 0, 0], [0, 0, 3, 3, 5, 0, 0, 0, 0], [0, 0, 0, 3, 5, 0, 0, 0, 0], [0, 0, 0, 3, 5, 3, 3, 3, 0], [0, 0, 0, 3, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 0, 0, 0, 0]], "output": [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 4], [0, 0, 4, 4], [0, 0, 3, 3], [0, 0, 0, 3], [0, 3, 3, 3], [0, 0, 0, 3], [0, 0, 0, 0], [0, 0, 0, 0]]}, {"input": [[0, 0, 0, 0, 5, 0, 0, 0, 0], [0, 0, 0, 2, 5, 0, 0, 0, 0], [0, 0, 0, 2, 5, 2, 6, 0, 0], [0, 0, 0, 2, 5, 0, 0, 0, 0], [0, 0, 0, 2, 5, 2, 2, 2, 0], [0, 0, 6, 6, 5, 6, 0, 0, 0], [0, 0, 0, 2, 5, 0, 0, 0, 0], [0, 2, 2, 0, 5, 2, 0, 0, 0], [0, 0, 0, 2, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 0, 0, 0, 0]], "output": [[0, 0, 0, 0], [0, 0, 0, 2], [0, 0, 6, 2], [0, 0, 0, 2], [0, 2, 2, 2], [0, 0, 6, 6], [0, 0, 0, 2], [0, 2, 2, 2], [0, 0, 0, 2], [0, 0, 0, 0]]}, {"input": [[0, 0, 0, 0, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 7, 0, 0, 0], [0, 0, 0, 8, 5, 0, 0, 0, 0], [0, 0, 0, 8, 5, 0, 0, 0, 0], [0, 7, 8, 8, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 8, 8, 0, 0], [0, 0, 0, 8, 5, 0, 0, 0, 0], [0, 0, 0, 8, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 8, 7, 0, 0], [0, 0, 0, 0, 5, 0, 0, 0, 0]], "output": [[0, 0, 0, 0], [0, 0, 0, 7], [0, 0, 0, 8], [0, 0, 0, 8], [0, 7, 8, 8], [0, 0, 8, 8], [0, 0, 0, 8], [0, 0, 0, 8], [0, 0, 7, 8], [0, 0, 0, 0]]}], "test": [{"input": [[0, 0, 0, 0, 5, 0, 0, 0, 0], [0, 0, 0, 1, 5, 0, 0, 0, 0], [0, 0, 0, 1, 5, 1, 0, 0, 0], [0, 1, 1, 1, 5, 1, 1, 1, 6], [0, 0, 0, 6, 5, 6, 6, 0, 0], [0, 0, 0, 0, 5, 1, 1, 1, 0], [0, 0, 0, 1, 5, 0, 0, 0, 0], [0, 0, 0, 1, 5, 1, 6, 0, 0], [0, 0, 0, 0, 5, 6, 0, 0, 0], [0, 0, 0, 0, 5, 0, 0, 0, 0]], "output":
(Note: there may be a bit of prompt scaffolding around this)
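As an aside, here is a minimal sketch (my own, not part of any eval harness) of how you might reconstruct the human-friendly colored view from that JSON in a terminal. The digit-to-color mapping is an assumption that only roughly follows the ARC viewer’s palette, and `task.json` is a hypothetical filename for the JSON above.

```python
import json

# Assumed 256-color ANSI codes loosely matching the ARC palette
# (0=black, 1=blue, 2=red, 3=green, 4=yellow, 5=grey, 6=magenta, 7=orange, 8=cyan, 9=maroon).
ANSI = {0: 16, 1: 21, 2: 196, 3: 46, 4: 226, 5: 244, 6: 201, 7: 208, 8: 51, 9: 88}

def render(grid):
    """Print a grid as colored blocks instead of digits."""
    for row in grid:
        print("".join(f"\033[48;5;{ANSI[cell]}m  \033[0m" for cell in row))

task = json.loads(open("task.json").read())  # hypothetical path to the JSON above
for i, pair in enumerate(task["train"]):
    print(f"--- train {i}: input ---")
    render(pair["input"])
    print(f"--- train {i}: output ---")
    render(pair["output"])
```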
This makes the problem considerably harder. While it’s true that a clever programmer presented with the text version above would probably figure it out given enough time, I do not think that most humans could solve it. Here’s an example set of steps you could pursue if you wanted to tackle such a problem from a command-line interface (a sketch of a few of them follows below):
1. Recognize that the format is JSON and parse it
2. See that the inputs and outputs are grids, and that there is some pattern relating their sizes
3. Print the grids out using a pretty-printing library
4. Recognize that there is some pattern to the numbers
5. Develop an algorithm that reproduces the observed pattern
6. Write the algorithm down as code
7. Verify that the algorithm works on all of the provided examples
8. Apply the algorithm to the final input and submit the result
Step (5) is the really tricky bit. It becomes much easier when you show the grids as colored boxes, but do not underestimate how hard it is to find the symmetries using only text. Mere mortals, myself included, need not apply.
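To make steps (5)–(8) concrete, here is a minimal sketch of what writing the algorithm down and verifying it might look like. The candidate rule (overlay the left half of each row with its mirrored right half, dropping the divider column) is just my own reading of the three demonstrations above, not an official solution, and `task.json` is again a hypothetical filename.

```python
import json

def candidate_rule(grid):
    """One candidate reading of the demonstrations: overlay the left half of each
    row with the mirrored right half, dropping the grey divider column."""
    out = []
    for row in grid:
        left, right = row[:4], row[5:]   # column 4 is the grey (5) divider
        mirrored = right[::-1]           # reflect the right half across the divider
        out.append([l if l else m for l, m in zip(left, mirrored)])
    return out

task = json.loads(open("task.json").read())  # hypothetical path to the JSON above

# Step 7: verify the candidate rule against all provided examples.
for i, pair in enumerate(task["train"]):
    assert candidate_rule(pair["input"]) == pair["output"], f"rule fails on train {i}"

# Step 8: apply the rule to the final input and submit the result.
prediction = candidate_rule(task["test"][0]["input"])
print(json.dumps(prediction))
```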
o3
Today, OpenAI released some details about an upcoming reasoning model, “o3”. Along with very impressive math and coding results (which are much more important than ARC, but on which I feel unable to comment, since the model has long since surpassed my capabilities in either field), it was revealed that o3 achieved a new state-of-the-art score on the ARC test set.
That’d be really cool if it had done so by operating a UI, but the way it was actually done is far more impressive to me: o3 achieved these scores on the text version of this eval. It does this by brainstorming solutions and then testing those solutions in its own thoughts, over and over, until it finds one that works. It also sometimes thinks about “giving up” and “guessing” and other uncannily human things.
There are a few really important things I want to call out here:
- Using RL to learn reasoning continues to scale
- Reasoning models are getting better at building and verifying algorithms in-CoT
- There’s clear evidence of in-domain generalization from training on very few examples
I want to be clear that I am not currently involved in building or deploying the o-series of models in any way. I am just in awe of what my colleagues have achieved!