Can puzzle solving prove general intelligence?

Introduction

ARC-style puzzles are among the most ambitious attempts to test machine reasoning. The Abstraction and Reasoning Corpus (ARC), introduced by François Chollet in 2019, was designed to measure how well an AI system can infer unfamiliar rules from a handful of examples rather than rely on memorised knowledge or large training datasets. Because the benchmark focuses on novel pattern problems, many researchers view strong ARC performance as more relevant to general intelligence than traditional exams or question-answering tests. [arXiv]arxiv.orgarXiv[1911.01547] On the Measure of IntelligenceNovember 5, 2019 — by F Chollet · 2019 · Cited by 1410 — We argue that ARC can be used to…Published: November 5, 2019

ARC debate illustration 1 Yet even impressive ARC results do not settle claims that artificial general intelligence (AGI) has been achieved. The benchmark can provide evidence about abstract reasoning and adaptation, but it cannot by itself demonstrate the broad, flexible competence that AGI is usually taken to require. The debate is not whether ARC matters—it clearly does—but whether success on a single family of puzzles can establish general intelligence.

Why abstract puzzles matter

ARC was created to address a weakness in many AI benchmarks. Traditional evaluations often reward systems that absorb enormous amounts of data and recognise patterns similar to those seen during training. ARC instead presents small visual grid puzzles where the correct transformation must be inferred from a few examples. The tasks are intended to test abstraction, analogy, and generalisation to novel situations. [arXiv]arxiv.orgarXiv[1911.01547] On the Measure of IntelligenceNovember 5, 2019 — by F Chollet · 2019 · Cited by 1410 — We argue that ARC can be used to…Published: November 5, 2019

This design reflects Chollet’s argument that intelligence is not merely accumulated skill. In his framework, intelligence is closely related to the efficiency with which a system acquires new skills when facing unfamiliar problems. ARC was built specifically to probe that ability. [arXiv]arxiv.orgarXiv[1911.01547] On the Measure of IntelligenceNovember 5, 2019 — by F Chollet · 2019 · Cited by 1410 — We argue that ARC can be used to…Published: November 5, 2019

The benchmark therefore captures something that many other evaluations miss:

Solving tasks with little prior task-specific training.
Discovering rules rather than recalling facts.
Generalising from very limited examples.
Handling problems that are deliberately unfamiliar.

These are important ingredients of intelligence, which is why ARC has attracted attention as a potential indicator of progress toward more capable AI systems. [arXiv]arxiv.orgarXiv[1911.01547] On the Measure of IntelligenceNovember 5, 2019 — by F Chollet · 2019 · Cited by 1410 — We argue that ARC can be used to…Published: November 5, 2019

How search and specialisation complicate the claim

The strongest reason ARC results do not settle AGI claims is that high scores can emerge through methods that are narrower than general intelligence.

Historically, many leading ARC systems have relied on specialised search procedures, program synthesis, hand-crafted priors, or extensive test-time computation. Rather than instantly understanding a puzzle as a human might, a system may generate and evaluate large numbers of candidate solutions until one fits the examples. ARC Prize reports have documented substantial progress from approaches that combine machine learning with program search and test-time adaptation. [arXiv]arxiv.orgARC Prize 2024: Technical Report5 Dec 2024 — In this paper, we survey top approaches, review new open-source implementations, discus…

This creates an interpretive problem. A high score could reflect:

Genuine abstract reasoning. [facebook.com]facebook.comAGI-2 A new gold-standard benchmark for abstract reasoning from the ARC Prize team. Humans solve 100%, frontier models score less than 5%…
Powerful search over many possibilities.
A hybrid of reasoning and search.
Benchmark-specific engineering optimisations.

The benchmark score alone cannot determine which explanation is correct.

The issue became particularly visible when newer reasoning systems achieved much higher ARC scores than previous models. Critics noted that some systems appeared to rely on extensive computational exploration, generating many candidate answers before selecting one. Supporters argued that search is itself part of intelligence, while sceptics responded that brute-force exploration may not demonstrate the flexible skill-acquisition efficiency ARC was originally intended to measure. [The Atlantic]theatlantic.comThe Atlantic The Man Out to Prove How Dumb AI Still IsWhile Altman, CEO of OpenAI, asserts that their latest models are approaching Artificial General Intelligence (AGI), Chollet argues that…

This disagreement reveals a broader challenge: benchmark success does not automatically reveal the underlying cognitive mechanism.

ARC debate illustration 2

What ARC-style results can and cannot prove

ARC can provide meaningful evidence that a system possesses some capacity for abstract reasoning. If a model consistently solves novel tasks that cannot easily be memorised, that is informative. It suggests progress beyond simple pattern recall. [Nature]nature.comA Comprehensive Behavioral Dataset for the Abstraction…by S LeGris · 2025 · Cited by 3 — The Abstraction and Reasoning Corpus (A…

However, ARC cannot establish several broader claims often associated with AGI.

It cannot prove broad competence

ARC focuses on a specific class of visual reasoning problems. AGI, by contrast, is usually described as competence across many environments, tasks, and forms of knowledge.

A system could excel at ARC while remaining weak in social reasoning, long-term planning, scientific discovery, embodied interaction, or other domains often associated with general intelligence. Success on one benchmark does not guarantee competence elsewhere. [Nature]nature.comA Comprehensive Behavioral Dataset for the Abstraction…by S LeGris · 2025 · Cited by 3 — The Abstraction and Reasoning Corpus (A…

It cannot prove robust transfer

True general intelligence is often expected to transfer skills across radically different situations. ARC tests transfer within a carefully designed puzzle domain, but does not directly measure performance across the full range of real-world challenges.

This distinction matters because many AI systems perform well when tasks resemble benchmark conditions yet struggle when environments become more open-ended or ambiguous. The benchmark offers evidence about one form of generalisation, not every form. [arXiv]arxiv.orgarXiv[1911.01547] On the Measure of IntelligenceNovember 5, 2019 — by F Chollet · 2019 · Cited by 1410 — We argue that ARC can be used to…Published: November 5, 2019

It cannot settle disputes about the definition of intelligence

Even among researchers who admire ARC, there is no universal agreement that it captures the entirety of intelligence.

Chollet himself presented ARC as an actionable benchmark derived from a particular theory of intelligence centred on efficient skill acquisition. Other researchers place greater emphasis on agency, embodiment, social understanding, memory, planning, or interaction with complex environments. As a result, solving ARC would support one influential conception of intelligence without necessarily resolving competing definitions. [arXiv]arxiv.orgarXiv[1911.01547] On the Measure of IntelligenceNovember 5, 2019 — by F Chollet · 2019 · Cited by 1410 — We argue that ARC can be used to…Published: November 5, 2019

ARC debate illustration 3

The moving-target problem

Another reason ARC does not settle AGI claims is that benchmark designers can discover weaknesses in a benchmark after systems begin to perform well on it.

The evolution from ARC-AGI-1 to ARC-AGI-2 illustrates this issue. ARC-AGI-2 was introduced because researchers believed stronger evaluations were needed to distinguish between different levels of reasoning ability and to reduce the possibility that systems were exploiting shortcuts specific to the original benchmark. Performance that looked impressive on the earlier version often dropped sharply on the newer one. [arXiv]arxiv.orgarXiv ARC-AGI-2: A New Challenge for Frontier AI Reasoning SystemsARC-AGI-2: A New Challenge for Frontier AI Reasoning SystemsMay 17, 2025…Published: May 17, 2025

This does not mean previous results were meaningless. Instead, it shows that benchmark success is often provisional. A score may demonstrate mastery of a particular test, while a revised benchmark can reveal limitations that were previously hidden.

The history of AI evaluation repeatedly shows this pattern: systems conquer a benchmark, researchers identify loopholes or blind spots, and a more demanding benchmark replaces it.

Can puzzle-solving prove general intelligence?

The evidence from ARC suggests a balanced conclusion. ARC-style puzzles matter because they test forms of abstraction and novel problem solving that are genuinely relevant to intelligence. Strong performance is therefore more significant than success on many conventional benchmarks. [arXiv]arxiv.orgarXiv[1911.01547] On the Measure of IntelligenceNovember 5, 2019 — by F Chollet · 2019 · Cited by 1410 — We argue that ARC can be used to…Published: November 5, 2019

At the same time, puzzle-solving alone cannot prove AGI. High scores do not reveal whether success came from broadly reusable reasoning abilities, benchmark-specific strategies, extensive search, or some combination of all three. Nor can a single puzzle domain establish competence across the enormous range of situations that the concept of general intelligence normally implies. [arXiv+2The Atlantic]arxiv.orgARC Prize 2024: Technical Report5 Dec 2024 — In this paper, we survey top approaches, review new open-source implementations, discus…

ARC-style evaluations are therefore best understood as valuable evidence in the AGI debate rather than decisive proof. They can show that AI systems are becoming better at abstract reasoning, but they cannot, by themselves, settle the question of whether a system has become generally intelligent.

Amazon book picks

Marketplace Samples

Example marketplace items related to this page. Use the search link to explore similar finds on eBay.

Example eBay listing

A.I. Artificial Intelligence Bluray Steelbook + Original Film Poster

Search eBay.co.uk: artificial intelligence poster

Browse similar on eBay.co.uk

Example eBay listing

A.I. Artificial Intelligence Film Poster/Print A3/A4/A5 230gsm Framed

Search eBay.co.uk: artificial intelligence poster

Browse similar on eBay.co.uk

Example eBay listing

A.I. - Artificial Intelligence Movie/Film Poster Art PICTURE / PRINT 9" x 8"

Search eBay.co.uk: artificial intelligence poster

Browse similar on eBay.co.uk

Example eBay listing

A.i. ARTIFICIAL INTELLIGENCE (2001) UK cinema poster Spielberg Kubrick Jude Law

Search eBay.co.uk: artificial intelligence poster

Browse similar on eBay.co.uk

Browse more on eBay.co.uk

Example items shown for inspiration; availability and pricing can change. Branchoria may earn a commission if you purchase through outbound eBay links.

Endnotes

Source: arxiv.org
Link: https://arxiv.org/abs/1911.01547
Source snippet
arXiv[1911.01547] On the Measure of IntelligenceNovember 5, 2019 — by F Chollet · 2019 · Cited by 1410 — We argue that ARC can be used to...

Published: November 5, 2019
Source: arxiv.org
Link: https://arxiv.org/html/2412.04604v1
Source snippet
ARC Prize 2024: Technical Report5 Dec 2024 — In this paper, we survey top approaches, review new open-source implementations, discus...
Source: nature.com
Link: https://www.nature.com/articles/s41597-025-05687-1
Source snippet
A Comprehensive Behavioral Dataset for the Abstraction...by S LeGris · 2025 · Cited by 3 — The Abstraction and Reasoning Corpus (A...
Source: arxiv.org
Link: https://arxiv.org/abs/2305.07141
Source: arxiv.org
Title: arXiv ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems
Link: https://arxiv.org/abs/2505.11831
Source snippet
ARC-AGI-2: A New Challenge for Frontier AI Reasoning SystemsMay 17, 2025...

Published: May 17, 2025
Source: arxiv.org
Link: https://arxiv.org/html/2601.10904v1
Source snippet
ARC Prize 2025: Technical Report15 Jan 2026 — The ARC-AGI benchmark series serves as a critical measure of few-shot generalization on nov...
Source: theatlantic.com
Title: The Atlantic The Man Out to Prove How Dumb AI Still Is
Link: https://www.theatlantic.com/technology/archive/2025/04/arc-agi-chollet-test/682295/
Source snippet
While Altman, CEO of OpenAI, asserts that their latest models are approaching Artificial General Intelligence (AGI), Chollet argues that...
Source: Wikipedia
Link: https://en.wikipedia.org/wiki/Fran%C3%A7ois
Source snippet
FrançoisFrançois is a French masculine given name and surname, equivalent to the English name Francis. François. Pronunciation, French...
Source: en.wiktionary.org
Link: https://en.wiktionary.org/wiki/Fran%C3%A7ois
Source snippet
wiktionary.orgFrançoisEnglish male given names from French · English surnames · English surnames from patronymics · French terms inherite...

Additional References

Source: arcprize.org
Link: https://arcprize.org/
Source snippet
ARC PrizeARC-AGI Benchmark Series... Trusted by the world's leading AI labs and top academic researchers, explore the only benchmark tha...
Source: medium.com
Link: https://medium.com/%40rajveer.rathod1301/the-abstraction-and-reasoning-corpus-arc-a-gateway-to-artificial-general-intelligence-87d4724bbb0d
Source snippet
Abstraction and Reasoning Corpus for Artificial General...ARC is a dataset and benchmark designed to evaluate the reasoning, abstraction...
Source: linkedin.com
Link: https://www.linkedin.com/posts/craig-brinton-74945315_exclusive-this-new-benchmark-could-expose-activity-7445564535643545601-2CRl
Source snippet
François Chollet's ARC-AGI-3 Benchmark Measures AI...ARC tries to close this gap by measuring intelligence as skill acquisition efficien...
Source: researchgate.net
Link: https://www.researchgate.net/publication/337048073_The_Measure_of_Intelligence
Source snippet
The Measure of Intelligence... Chollet's thesis that intelligence is skill-acquisition efficiency. Test-time adaptation and refinement lo...
Source: medium.com
Link: https://medium.com/%40ayshasaliha.cietmcet/the-frontier-of-agi-understanding-the-abstraction-and-reasoning-corpus-arc-24e562e87ff5
Source snippet
Understanding the Abstraction and Reasoning Corpus (ARC)29 Dec 2025 — The Abstraction and Reasoning Corpus (ARC), created by François Cho...
Source: youtube.com
Link: https://www.youtube.com/watch?v=M3b59lZYBW8&vl=en
Source snippet
ARC Prize Version 2 Launch Video! [Francois Chollet, Mike...Francois Chollet and Mike Knoop join Tim Scarfe to announce ARC-AGI 2 and th...
Source: youtube.com
Link: https://www.youtube.com/watch?v=9RnKGRDhCyo
Source snippet
"How to measure intelligence?" | Six researchers debate* He clarifies that the ARC benchmark is not intended to be the definition of inte...
Source: researchgate.net
Title: 403193756 ARC AGI 3 A New Challenge for Frontier Agentic Intelligence
Link: https://www.researchgate.net/publication/403193756_ARC-AGI-3_A_New_Challenge_for_Frontier_Agentic_Intelligence
Source snippet
ARC-AGI-3: A New Challenge for Frontier Agentic...Mar 27, 2026 — In this paper, we present the benchmark design, its efficiency-based sc...
Source: threadreaderapp.com
Link: https://threadreaderapp.com/thread/1192121587467784192.html
Source snippet
ce, as well as a new AI evaluation dataset, the "Abstraction and Reasoning Corpus".Read more...
Source: facebook.com
Link: https://www.facebook.com/groups/chatgpt4u/posts/1597889177507452/
Source snippet
AGI-2 A new gold-standard benchmark for abstract reasoning from the ARC Prize team. Humans solve 100%, frontier models score less than 5%...

Can puzzle solving prove general intelligence?

Introduction

Why abstract puzzles matter

How search and specialisation complicate the claim

What ARC-style results can and cannot prove

It cannot prove broad competence

It cannot prove robust transfer

It cannot settle disputes about the definition of intelligence

The moving-target problem

Can puzzle-solving prove general intelligence?

Further Reading

The Master Algorithm

Artificial Intelligence

Human Compatible

Godel, Escher, Bach

Marketplace Samples

A.I. Artificial Intelligence Bluray Steelbook + Original Film Poster

A.I. Artificial Intelligence Film Poster/Print A3/A4/A5 230gsm Framed

A.I. - Artificial Intelligence Movie/Film Poster Art PICTURE / PRINT 9" x 8"

A.i. ARTIFICIAL INTELLIGENCE (2001) UK cinema poster Spielberg Kubrick Jude Law

Endnotes

Additional References

Follow this branch

Parent topic

Related pages 2