Within Benchmark limits
Can puzzle solving prove general intelligence?
Abstract reasoning puzzles can reveal important abilities, but even strong results leave open whether a system can adapt broadly.
On this page
- Why abstract puzzles matter
- How search and specialization complicate the claim
- What ARC style results can and cannot prove
Page outline Jump by section
Introduction
ARC-style puzzles are among the most ambitious attempts to test machine reasoning. The Abstraction and Reasoning Corpus (ARC), introduced by François Chollet in 2019, was designed to measure how well an AI system can infer unfamiliar rules from a handful of examples rather than rely on memorised knowledge or large training datasets. Because the benchmark focuses on novel pattern problems, many researchers view strong ARC performance as more relevant to general intelligence than traditional exams or question-answering tests. [arXiv]arxiv.orgarXiv[1911.01547] On the Measure of IntelligenceNovember 5, 2019 — by F Chollet · 2019 · Cited by 1410 — We argue that ARC can be used to…
Yet even impressive ARC results do not settle claims that artificial general intelligence (AGI) has been achieved. The benchmark can provide evidence about abstract reasoning and adaptation, but it cannot by itself demonstrate the broad, flexible competence that AGI is usually taken to require. The debate is not whether ARC matters—it clearly does—but whether success on a single family of puzzles can establish general intelligence.
Why abstract puzzles matter
ARC was created to address a weakness in many AI benchmarks. Traditional evaluations often reward systems that absorb enormous amounts of data and recognise patterns similar to those seen during training. ARC instead presents small visual grid puzzles where the correct transformation must be inferred from a few examples. The tasks are intended to test abstraction, analogy, and generalisation to novel situations. [arXiv]arxiv.orgarXiv[1911.01547] On the Measure of IntelligenceNovember 5, 2019 — by F Chollet · 2019 · Cited by 1410 — We argue that ARC can be used to…
This design reflects Chollet’s argument that intelligence is not merely accumulated skill. In his framework, intelligence is closely related to the efficiency with which a system acquires new skills when facing unfamiliar problems. ARC was built specifically to probe that ability. [arXiv]arxiv.orgarXiv[1911.01547] On the Measure of IntelligenceNovember 5, 2019 — by F Chollet · 2019 · Cited by 1410 — We argue that ARC can be used to…
The benchmark therefore captures something that many other evaluations miss:
- Solving tasks with little prior task-specific training.
- Discovering rules rather than recalling facts.
- Generalising from very limited examples.
- Handling problems that are deliberately unfamiliar.
These are important ingredients of intelligence, which is why ARC has attracted attention as a potential indicator of progress toward more capable AI systems. [arXiv]arxiv.orgarXiv[1911.01547] On the Measure of IntelligenceNovember 5, 2019 — by F Chollet · 2019 · Cited by 1410 — We argue that ARC can be used to…
How search and specialisation complicate the claim
The strongest reason ARC results do not settle AGI claims is that high scores can emerge through methods that are narrower than general intelligence.
Historically, many leading ARC systems have relied on specialised search procedures, program synthesis, hand-crafted priors, or extensive test-time computation. Rather than instantly understanding a puzzle as a human might, a system may generate and evaluate large numbers of candidate solutions until one fits the examples. ARC Prize reports have documented substantial progress from approaches that combine machine learning with program search and test-time adaptation. [arXiv]arxiv.orgARC Prize 2024: Technical Report5 Dec 2024 — In this paper, we survey top approaches, review new open-source implementations, discus…
This creates an interpretive problem. A high score could reflect:
- Genuine abstract reasoning. [facebook.com]facebook.comAGI-2 A new gold-standard benchmark for abstract reasoning from the ARC Prize team. Humans solve 100%, frontier models score less than 5%…
- Powerful search over many possibilities.
- A hybrid of reasoning and search.
- Benchmark-specific engineering optimisations.
The benchmark score alone cannot determine which explanation is correct.
The issue became particularly visible when newer reasoning systems achieved much higher ARC scores than previous models. Critics noted that some systems appeared to rely on extensive computational exploration, generating many candidate answers before selecting one. Supporters argued that search is itself part of intelligence, while sceptics responded that brute-force exploration may not demonstrate the flexible skill-acquisition efficiency ARC was originally intended to measure. [The Atlantic]theatlantic.comThe Atlantic The Man Out to Prove How Dumb AI Still IsWhile Altman, CEO of OpenAI, asserts that their latest models are approaching Artificial General Intelligence (AGI), Chollet argues that…
This disagreement reveals a broader challenge: benchmark success does not automatically reveal the underlying cognitive mechanism.
What ARC-style results can and cannot prove
ARC can provide meaningful evidence that a system possesses some capacity for abstract reasoning. If a model consistently solves novel tasks that cannot easily be memorised, that is informative. It suggests progress beyond simple pattern recall. [Nature]nature.comA Comprehensive Behavioral Dataset for the Abstraction…by S LeGris · 2025 · Cited by 3 — The Abstraction and Reasoning Corpus (A…
However, ARC cannot establish several broader claims often associated with AGI.
It cannot prove broad competence
ARC focuses on a specific class of visual reasoning problems. AGI, by contrast, is usually described as competence across many environments, tasks, and forms of knowledge.
A system could excel at ARC while remaining weak in social reasoning, long-term planning, scientific discovery, embodied interaction, or other domains often associated with general intelligence. Success on one benchmark does not guarantee competence elsewhere. [Nature]nature.comA Comprehensive Behavioral Dataset for the Abstraction…by S LeGris · 2025 · Cited by 3 — The Abstraction and Reasoning Corpus (A…
It cannot prove robust transfer
True general intelligence is often expected to transfer skills across radically different situations. ARC tests transfer within a carefully designed puzzle domain, but does not directly measure performance across the full range of real-world challenges.
This distinction matters because many AI systems perform well when tasks resemble benchmark conditions yet struggle when environments become more open-ended or ambiguous. The benchmark offers evidence about one form of generalisation, not every form. [arXiv]arxiv.orgarXiv[1911.01547] On the Measure of IntelligenceNovember 5, 2019 — by F Chollet · 2019 · Cited by 1410 — We argue that ARC can be used to…
It cannot settle disputes about the definition of intelligence
Even among researchers who admire ARC, there is no universal agreement that it captures the entirety of intelligence.
Chollet himself presented ARC as an actionable benchmark derived from a particular theory of intelligence centred on efficient skill acquisition. Other researchers place greater emphasis on agency, embodiment, social understanding, memory, planning, or interaction with complex environments. As a result, solving ARC would support one influential conception of intelligence without necessarily resolving competing definitions. [arXiv]arxiv.orgarXiv[1911.01547] On the Measure of IntelligenceNovember 5, 2019 — by F Chollet · 2019 · Cited by 1410 — We argue that ARC can be used to…
The moving-target problem
Another reason ARC does not settle AGI claims is that benchmark designers can discover weaknesses in a benchmark after systems begin to perform well on it.
The evolution from ARC-AGI-1 to ARC-AGI-2 illustrates this issue. ARC-AGI-2 was introduced because researchers believed stronger evaluations were needed to distinguish between different levels of reasoning ability and to reduce the possibility that systems were exploiting shortcuts specific to the original benchmark. Performance that looked impressive on the earlier version often dropped sharply on the newer one. [arXiv]arxiv.orgarXiv ARC-AGI-2: A New Challenge for Frontier AI Reasoning SystemsARC-AGI-2: A New Challenge for Frontier AI Reasoning SystemsMay 17, 2025…
This does not mean previous results were meaningless. Instead, it shows that benchmark success is often provisional. A score may demonstrate mastery of a particular test, while a revised benchmark can reveal limitations that were previously hidden.
The history of AI evaluation repeatedly shows this pattern: systems conquer a benchmark, researchers identify loopholes or blind spots, and a more demanding benchmark replaces it.
Can puzzle-solving prove general intelligence?
The evidence from ARC suggests a balanced conclusion. ARC-style puzzles matter because they test forms of abstraction and novel problem solving that are genuinely relevant to intelligence. Strong performance is therefore more significant than success on many conventional benchmarks. [arXiv]arxiv.orgarXiv[1911.01547] On the Measure of IntelligenceNovember 5, 2019 — by F Chollet · 2019 · Cited by 1410 — We argue that ARC can be used to…
At the same time, puzzle-solving alone cannot prove AGI. High scores do not reveal whether success came from broadly reusable reasoning abilities, benchmark-specific strategies, extensive search, or some combination of all three. Nor can a single puzzle domain establish competence across the enormous range of situations that the concept of general intelligence normally implies. [arXiv+2The Atlantic]arxiv.orgARC Prize 2024: Technical Report5 Dec 2024 — In this paper, we survey top approaches, review new open-source implementations, discus…
ARC-style evaluations are therefore best understood as valuable evidence in the AGI debate rather than decisive proof. They can show that AI systems are becoming better at abstract reasoning, but they cannot, by themselves, settle the question of whether a system has become generally intelligent.
Amazon book picks
Further Reading
Books and field guides related to Can puzzle solving prove general intelligence?. Use these as the next step if you want deeper reading beyond the article.
The Master Algorithm
Explores learning, generalization, and the limits of AI approaches.
Artificial Intelligence
Rating: 4.5/5 from 10 Google Books ratings
Provides context for reasoning, search, and intelligence evaluation.
Godel, Escher, Bach
Explores abstraction, analogy, and cognition relevant to puzzle-solving claims.
Endnotes
-
Source: arxiv.org
Link: https://arxiv.org/abs/1911.01547Source snippet
arXiv[1911.01547] On the Measure of IntelligenceNovember 5, 2019 — by F Chollet · 2019 · Cited by 1410 — We argue that ARC can be used to...
Published: November 5, 2019
-
Source: arxiv.org
Link: https://arxiv.org/html/2412.04604v1Source snippet
ARC Prize 2024: Technical Report5 Dec 2024 — In this paper, we survey top approaches, review new open-source implementations, discus...
-
Source: nature.com
Link: https://www.nature.com/articles/s41597-025-05687-1Source snippet
A Comprehensive Behavioral Dataset for the Abstraction...by S LeGris · 2025 · Cited by 3 — The Abstraction and Reasoning Corpus (A...
-
Source: arxiv.org
Link: https://arxiv.org/abs/2305.07141 -
Source: arxiv.org
Title: arXiv ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems
Link: https://arxiv.org/abs/2505.11831Source snippet
ARC-AGI-2: A New Challenge for Frontier AI Reasoning SystemsMay 17, 2025...
Published: May 17, 2025
-
Source: arxiv.org
Link: https://arxiv.org/html/2601.10904v1Source snippet
ARC Prize 2025: Technical Report15 Jan 2026 — The ARC-AGI benchmark series serves as a critical measure of few-shot generalization on nov...
-
Source: theatlantic.com
Title: The Atlantic The Man Out to Prove How Dumb AI Still Is
Link: https://www.theatlantic.com/technology/archive/2025/04/arc-agi-chollet-test/682295/Source snippet
While Altman, CEO of OpenAI, asserts that their latest models are approaching Artificial General Intelligence (AGI), Chollet argues that...
-
Source: Wikipedia
Link: https://en.wikipedia.org/wiki/Fran%C3%A7oisSource snippet
FrançoisFrançois is a French masculine given name and surname, equivalent to the English name Francis. François. Pronunciation, French...
-
Source: en.wiktionary.org
Link: https://en.wiktionary.org/wiki/Fran%C3%A7oisSource snippet
wiktionary.orgFrançoisEnglish male given names from French · English surnames · English surnames from patronymics · French terms inherite...
Additional References
-
Source: arcprize.org
Link: https://arcprize.org/Source snippet
ARC PrizeARC-AGI Benchmark Series... Trusted by the world's leading AI labs and top academic researchers, explore the only benchmark tha...
-
Source: medium.com
Link: https://medium.com/%40rajveer.rathod1301/the-abstraction-and-reasoning-corpus-arc-a-gateway-to-artificial-general-intelligence-87d4724bbb0dSource snippet
Abstraction and Reasoning Corpus for Artificial General...ARC is a dataset and benchmark designed to evaluate the reasoning, abstraction...
-
Source: linkedin.com
Link: https://www.linkedin.com/posts/craig-brinton-74945315_exclusive-this-new-benchmark-could-expose-activity-7445564535643545601-2CRlSource snippet
François Chollet's ARC-AGI-3 Benchmark Measures AI...ARC tries to close this gap by measuring intelligence as skill acquisition efficien...
-
Source: researchgate.net
Link: https://www.researchgate.net/publication/337048073_The_Measure_of_IntelligenceSource snippet
The Measure of Intelligence... Chollet's thesis that intelligence is skill-acquisition efficiency. Test-time adaptation and refinement lo...
-
Source: medium.com
Link: https://medium.com/%40ayshasaliha.cietmcet/the-frontier-of-agi-understanding-the-abstraction-and-reasoning-corpus-arc-24e562e87ff5Source snippet
Understanding the Abstraction and Reasoning Corpus (ARC)29 Dec 2025 — The Abstraction and Reasoning Corpus (ARC), created by François Cho...
-
Source: youtube.com
Link: https://www.youtube.com/watch?v=M3b59lZYBW8&vl=enSource snippet
ARC Prize Version 2 Launch Video! [Francois Chollet, Mike...Francois Chollet and Mike Knoop join Tim Scarfe to announce ARC-AGI 2 and th...
-
Source: youtube.com
Link: https://www.youtube.com/watch?v=9RnKGRDhCyoSource snippet
"How to measure intelligence?" | Six researchers debate* He clarifies that the ARC benchmark is not intended to be the definition of inte...
-
Source: researchgate.net
Title: 403193756 ARC AGI 3 A New Challenge for Frontier Agentic Intelligence
Link: https://www.researchgate.net/publication/403193756_ARC-AGI-3_A_New_Challenge_for_Frontier_Agentic_IntelligenceSource snippet
ARC-AGI-3: A New Challenge for Frontier Agentic...Mar 27, 2026 — In this paper, we present the benchmark design, its efficiency-based sc...
-
Source: threadreaderapp.com
Link: https://threadreaderapp.com/thread/1192121587467784192.htmlSource snippet
ce, as well as a new AI evaluation dataset, the "Abstraction and Reasoning Corpus".Read more...
-
Source: facebook.com
Link: https://www.facebook.com/groups/chatgpt4u/posts/1597889177507452/Source snippet
AGI-2 A new gold-standard benchmark for abstract reasoning from the ARC Prize team. Humans solve 100%, frontier models score less than 5%...
Topic Tree



