Can puzzle solving prove general intelligence?

Abstract reasoning puzzles can reveal important abilities, but even strong results leave open whether a system can adapt broadly.

On this page

  • Why abstract puzzles matter
  • How search and specialisation complicate the claim
  • What ARC-style results can and cannot prove

Introduction

ARC-style puzzles are among the most ambitious attempts to test machine reasoning. The Abstraction and Reasoning Corpus (ARC), introduced by François Chollet in 2019, was designed to measure how well an AI system can infer unfamiliar rules from a handful of examples rather than rely on memorised knowledge or large training datasets. Because the benchmark focuses on novel pattern problems, many researchers view strong ARC performance as more relevant to general intelligence than traditional exams or question-answering tests. [1]

Yet even impressive ARC results do not settle claims that artificial general intelligence (AGI) has been achieved. The benchmark can provide evidence about abstract reasoning and adaptation, but it cannot by itself demonstrate the broad, flexible competence that AGI is usually taken to require. The debate is not whether ARC matters, which it clearly does, but whether success on a single family of puzzles can establish general intelligence.

Why abstract puzzles matter

ARC was created to address a weakness in many AI benchmarks. Traditional evaluations often reward systems that absorb enormous amounts of data and recognise patterns similar to those seen during training. ARC instead presents small visual grid puzzles where the correct transformation must be inferred from a few examples. The tasks are intended to test abstraction, analogy, and generalisation to novel situations. [1]

This design reflects Chollet’s argument that intelligence is not merely accumulated skill. In his framework, intelligence is closely related to the efficiency with which a system acquires new skills when facing unfamiliar problems. ARC was built specifically to probe that ability. [1]

The benchmark therefore captures something that many other evaluations miss:

  • Solving tasks with little prior task-specific training.
  • Discovering rules rather than recalling facts.
  • Generalising from very limited examples.
  • Handling problems that are deliberately unfamiliar.

These are important ingredients of intelligence, which is why ARC has attracted attention as a potential indicator of progress toward more capable AI systems. [1]
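To make the task format concrete, here is a minimal Python sketch. The layout (a few "train" pairs of input and output grids plus a "test" input, with cells stored as small integers) follows the JSON structure of the public ARC dataset. Everything else is an assumption made for illustration: the task is invented rather than taken from the corpus, and the names mirror_left_right and fits_all_examples are hypothetical helpers, not part of any ARC tooling.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]  # each cell is a small integer standing for a colour

# An invented task (NOT from the real corpus). Hidden rule: mirror each row.
task: Dict[str, list] = {
    "train": [
        {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]},
        {"input": [[0, 3], [4, 0]], "output": [[3, 0], [0, 4]]},
    ],
    "test": [{"input": [[5, 0, 0, 0]]}],
}


def mirror_left_right(grid: Grid) -> Grid:
    """Hypothetical candidate rule: reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]


def fits_all_examples(rule: Callable[[Grid], Grid], examples: list) -> bool:
    """A rule is only plausible if it reproduces every demonstration pair."""
    return all(rule(ex["input"]) == ex["output"] for ex in examples)


# Apply the rule to the test input only if it explains all the demonstrations.
prediction = (
    mirror_left_right(task["test"][0]["input"])
    if fits_all_examples(mirror_left_right, task["train"])
    else None
)
print(prediction)
```

The solver sees only two demonstrations, so there is nothing to memorise: the rule has to be inferred and then applied to a grid of a different shape.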

How search and specialisation complicate the claim

The strongest reason ARC results do not settle AGI claims is that high scores can emerge through methods that are narrower than general intelligence.

Historically, many leading ARC systems have relied on specialised search procedures, program synthesis, hand-crafted priors, or extensive test-time computation. Rather than instantly understanding a puzzle as a human might, a system may generate and evaluate large numbers of candidate solutions until one fits the examples. ARC Prize reports have documented substantial progress from approaches that combine machine learning with program search and test-time adaptation. [2]

This creates an interpretive problem. A high score could reflect:

  1. Genuine abstract reasoning. [Additional References 10]
  2. Powerful search over many possibilities.
  3. A hybrid of reasoning and search.
  4. Benchmark-specific engineering optimisations.

The benchmark score alone cannot determine which explanation is correct.
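The generate-and-test style of solving described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the method of any particular ARC entry: the three grid primitives, the depth limit, and the function names run and search are all hypothetical. The sketch enumerates short compositions of primitives and returns the first one that reproduces every demonstration pair.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]

# Hypothetical primitive operations; real systems use far richer vocabularies.
PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_h": lambda g: [row[::-1] for row in g],         # mirror left-right
    "flip_v": lambda g: [row[:] for row in g[::-1]],      # mirror top-bottom
    "transpose": lambda g: [list(col) for col in zip(*g)],
}


def run(program: Tuple[str, ...], grid: Grid) -> Grid:
    """Apply the named primitives in order."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def search(
    examples: List[Tuple[Grid, Grid]], max_depth: int = 3
) -> Tuple[Optional[Tuple[str, ...]], int]:
    """Brute-force search: try every composition up to max_depth.

    Returns the first program that fits all examples, plus the number of
    candidates that were tried before it was found.
    """
    tried = 0
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            tried += 1
            if all(run(program, inp) == out for inp, out in examples):
                return program, tried
    return None, tried


# Invented demonstrations of a 90-degree clockwise rotation.
examples = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[1, 2, 3]], [[1], [2], [3]]),
]
program, tried = search(examples)
print(program, tried)
```

The answer the search returns would be scored the same whether it was found on the first attempt or after thousands of candidates. The count in tried is exactly the kind of information a bare benchmark score leaves out.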

The issue became particularly visible when newer reasoning systems achieved much higher ARC scores than previous models. Critics noted that some systems appeared to rely on extensive computational exploration, generating many candidate answers before selecting one. Supporters argued that search is itself part of intelligence, while sceptics responded that brute-force exploration may not demonstrate the flexible skill-acquisition efficiency ARC was originally intended to measure. [7]

This disagreement reveals a broader challenge: benchmark success does not automatically reveal the underlying cognitive mechanism.

What ARC-style results can and cannot prove

ARC can provide meaningful evidence that a system possesses some capacity for abstract reasoning. If a model consistently solves novel tasks that cannot easily be memorised, that is informative. It suggests progress beyond simple pattern recall. [3]

However, ARC cannot establish several broader claims often associated with AGI.

It cannot prove broad competence

ARC focuses on a specific class of visual reasoning problems. AGI, by contrast, is usually described as competence across many environments, tasks, and forms of knowledge.

A system could excel at ARC while remaining weak in social reasoning, long-term planning, scientific discovery, embodied interaction, or other domains often associated with general intelligence. Success on one benchmark does not guarantee competence elsewhere. [3]

It cannot prove robust transfer

True general intelligence is often expected to transfer skills across radically different situations. ARC tests transfer within a carefully designed puzzle domain, but does not directly measure performance across the full range of real-world challenges.

This distinction matters because many AI systems perform well when tasks resemble benchmark conditions yet struggle when environments become more open-ended or ambiguous. The benchmark offers evidence about one form of generalisation, not every form. [1]

It cannot settle disputes about the definition of intelligence

Even among researchers who admire ARC, there is no universal agreement that it captures the entirety of intelligence.

Chollet himself presented ARC as an actionable benchmark derived from a particular theory of intelligence centred on efficient skill acquisition. Other researchers place greater emphasis on agency, embodiment, social understanding, memory, planning, or interaction with complex environments. As a result, solving ARC would support one influential conception of intelligence without necessarily resolving competing definitions. [1]

The moving-target problem

Another reason ARC does not settle AGI claims is that benchmark designers can discover weaknesses in a benchmark after systems begin to perform well on it.

The evolution from ARC-AGI-1 to ARC-AGI-2 illustrates this issue. ARC-AGI-2 was introduced because researchers believed stronger evaluations were needed to distinguish between different levels of reasoning ability and to reduce the possibility that systems were exploiting shortcuts specific to the original benchmark. Performance that looked impressive on the earlier version often dropped sharply on the newer one. [5]

This does not mean previous results were meaningless. Instead, it shows that benchmark success is often provisional. A score may demonstrate mastery of a particular test, while a revised benchmark can reveal limitations that were previously hidden.

The history of AI evaluation repeatedly shows this pattern: systems conquer a benchmark, researchers identify loopholes or blind spots, and a more demanding benchmark replaces it.

Can puzzle-solving prove general intelligence?

The evidence from ARC suggests a balanced conclusion. ARC-style puzzles matter because they test forms of abstraction and novel problem solving that are genuinely relevant to intelligence. Strong performance is therefore more significant than success on many conventional benchmarks. [1]

At the same time, puzzle-solving alone cannot prove AGI. High scores do not reveal whether success came from broadly reusable reasoning abilities, benchmark-specific strategies, extensive search, or some combination of all three. Nor can a single puzzle domain establish competence across the enormous range of situations that the concept of general intelligence normally implies. [2, 7]

ARC-style evaluations are therefore best understood as valuable evidence in the AGI debate rather than decisive proof. They can show that AI systems are becoming better at abstract reasoning, but they cannot, by themselves, settle the question of whether a system has become generally intelligent.

Endnotes

  1. Source: arxiv.org
    Title: On the Measure of Intelligence (F. Chollet, arXiv:1911.01547)
    Link: https://arxiv.org/abs/1911.01547
    Published: November 5, 2019

  2. Source: arxiv.org
    Title: ARC Prize 2024: Technical Report
    Link: https://arxiv.org/html/2412.04604v1
    Date: 5 Dec 2024

  3. Source: nature.com
    Title: A Comprehensive Behavioral Dataset for the Abstraction... (S. LeGris, 2025)
    Link: https://www.nature.com/articles/s41597-025-05687-1

  4. Source: arxiv.org
    Link: https://arxiv.org/abs/2305.07141

  5. Source: arxiv.org
    Title: ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems
    Link: https://arxiv.org/abs/2505.11831
    Published: May 17, 2025

  6. Source: arxiv.org
    Title: ARC Prize 2025: Technical Report
    Link: https://arxiv.org/html/2601.10904v1
    Date: 15 Jan 2026

  7. Source: theatlantic.com
    Title: The Man Out to Prove How Dumb AI Still Is
    Link: https://www.theatlantic.com/technology/archive/2025/04/arc-agi-chollet-test/682295/

  8. Source: Wikipedia
    Title: François
    Link: https://en.wikipedia.org/wiki/Fran%C3%A7ois

  9. Source: en.wiktionary.org
    Title: François
    Link: https://en.wiktionary.org/wiki/Fran%C3%A7ois

Additional References

  1. Source: arcprize.org
    Title: ARC Prize
    Link: https://arcprize.org/

  2. Source: medium.com
    Title: Abstraction and Reasoning Corpus for Artificial General...
    Link: https://medium.com/%40rajveer.rathod1301/the-abstraction-and-reasoning-corpus-arc-a-gateway-to-artificial-general-intelligence-87d4724bbb0d

  3. Source: linkedin.com
    Title: François Chollet's ARC-AGI-3 Benchmark Measures AI...
    Link: https://www.linkedin.com/posts/craig-brinton-74945315_exclusive-this-new-benchmark-could-expose-activity-7445564535643545601-2CRl

  4. Source: researchgate.net
    Title: The Measure of Intelligence
    Link: https://www.researchgate.net/publication/337048073_The_Measure_of_Intelligence

  5. Source: medium.com
    Title: Understanding the Abstraction and Reasoning Corpus (ARC)
    Link: https://medium.com/%40ayshasaliha.cietmcet/the-frontier-of-agi-understanding-the-abstraction-and-reasoning-corpus-arc-24e562e87ff5
    Date: 29 Dec 2025

  6. Source: youtube.com
    Title: ARC Prize Version 2 Launch Video! [Francois Chollet, Mike...
    Link: https://www.youtube.com/watch?v=M3b59lZYBW8&vl=en

  7. Source: youtube.com
    Title: "How to measure intelligence?" | Six researchers debate
    Link: https://www.youtube.com/watch?v=9RnKGRDhCyo

  8. Source: researchgate.net
    Title: ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence
    Link: https://www.researchgate.net/publication/403193756_ARC-AGI-3_A_New_Challenge_for_Frontier_Agentic_Intelligence
    Date: Mar 27, 2026

  9. Source: threadreaderapp.com
    Link: https://threadreaderapp.com/thread/1192121587467784192.html
    Source snippet: ce, as well as a new AI evaluation dataset, the "Abstraction and Reasoning Corpus".

  10. Source: facebook.com
    Link: https://www.facebook.com/groups/chatgpt4u/posts/1597889177507452/
    Source snippet: AGI-2 A new gold-standard benchmark for abstract reasoning from the ARC Prize team. Humans solve 100%, frontier models score less than 5%...
