ARC-AGI-2: the toughest AI benchmark ever, explained

When François Chollet, the AI researcher who created the original ARC benchmark in 2019, released ARC-AGI-2 in March 2025, the benchmark immediately did something unusual. It made frontier AI models look much weaker than their press releases suggested. OpenAI’s GPT-5 scored 9.9 percent. Google’s Gemini 2.5 Pro scored 4.9 percent. Anthropic’s Claude variants scored in the same range. The humans Chollet’s team tested in calibration sessions, with no prior training on the benchmark, scored an average of 60 percent. A panel of ten people achieved 100 percent on the same tasks the frontier models could barely touch. ARC-AGI-2 is, by Chollet’s design, a benchmark that humans find easy and current AI struggles with. The reason it matters is what it reveals about the gap between what AI models can do and what general fluid intelligence actually is.

What ARC-AGI-2 is, and why it was needed

The original ARC-AGI benchmark, introduced by Chollet in his 2019 paper “On the Measure of Intelligence,” posed a deceptively simple problem. Each task presents a small number of input-output grid examples, typically two to four, and asks the model to infer the underlying transformation rule and apply it to a new input grid. The grids are colored cells in two-dimensional arrays, and the transformations involve operations like rotation, reflection, color mapping, pattern completion, and various forms of compositional reasoning. The tasks require no domain knowledge, no specialized training data, and no large vocabulary. They test what Chollet calls fluid intelligence, namely the ability to solve novel problems by recognizing structure rather than retrieving memorized patterns.
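To make the task format concrete, here is a minimal sketch of an ARC-style task in Python. It is a toy, not a real solver: the three candidate transformations, the example grids, and the names `CANDIDATES`, `infer_rule`, and `solve` are all assumptions made for illustration. Real ARC tasks have no fixed menu of rules to choose from, which is exactly what makes them hard.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]  # each integer stands for a cell color

def rotate90(g: Grid) -> Grid:
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]

def flip_horizontal(g: Grid) -> Grid:
    """Mirror the grid left to right."""
    return [row[::-1] for row in g]

def flip_vertical(g: Grid) -> Grid:
    """Mirror the grid top to bottom."""
    return [list(row) for row in g[::-1]]

# Hypothetical, deliberately tiny rule set (an assumption of this sketch).
CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "rotate90": rotate90,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
}

def infer_rule(examples: List[Tuple[Grid, Grid]]) -> Optional[str]:
    """Return the first candidate rule consistent with every example pair."""
    for name, fn in CANDIDATES.items():
        if all(fn(inp) == out for inp, out in examples):
            return name
    return None

def solve(examples: List[Tuple[Grid, Grid]], test_input: Grid) -> Optional[Grid]:
    """Infer a rule from the examples and apply it to the test grid."""
    name = infer_rule(examples)
    return CANDIDATES[name](test_input) if name is not None else None

# Two made-up demonstration pairs and one made-up test input.
EXAMPLES = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 3, 0]], [[0, 3, 3]]),
]
TEST_INPUT = [[5, 0, 0], [0, 6, 0]]

print(infer_rule(EXAMPLES))
print(solve(EXAMPLES, TEST_INPUT))
```

The sketch works only because the rule happens to be in the candidate list. ARC tasks are built so that the rule is new each time, and no lookup of this kind can cover them.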

ARC-AGI-1 became, between 2019 and 2024, the benchmark that resisted brute-force scaling longer than any other. Models that scored well on every other AI benchmark struggled on ARC. The pattern broke late in 2024 when OpenAI’s o3 model, using massive test-time compute and tightly engineered prompting scaffolds, achieved 87.5 percent on the semi-private evaluation set, at an estimated compute cost of approximately $346,000. The high score did not feel like a clean win to the AI research community. The combination of brute-force test-time compute and prompting techniques specifically tuned for ARC felt closer to gaming the benchmark than to demonstrating general intelligence.

ARC-AGI-2 was Chollet’s response. The benchmark preserves the input-output grid task format of its predecessor for continuity, but introduces several structural changes. The tasks emphasize on-the-fly symbol interpretation, multi-step compositional reasoning, and context-dependent rules that resist the pattern-matching approaches that brute-force methods had used to game ARC-AGI-1. The benchmark is human-calibrated through testing with 400 people in live sessions, with tasks retained only when multiple humans could reliably solve them. It has three evaluation sets (public, semi-private, and private), each calibrated to the same human difficulty.
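The calibration step can be sketched in a few lines of Python. Everything here is illustrative: the results table is invented, and the `min_solvers=2` threshold is an assumption standing in for "multiple humans". The sketch also shows why an individual average well below 100 percent is compatible with a panel that solves everything: a task counts as panel-solved if any one member solves it.

```python
from typing import Dict, List

# Invented data: task id -> one flag per tester (True = solved the task).
RESULTS: Dict[str, List[bool]] = {
    "task_a": [True, True, False, True],
    "task_b": [False, True, True, False],
    "task_c": [False, False, True, False],
    "task_d": [True, False, True, True],
}

def retained_tasks(results: Dict[str, List[bool]], min_solvers: int = 2) -> List[str]:
    """Keep only tasks solved by at least `min_solvers` testers (assumed threshold)."""
    return [task for task, solved in results.items() if sum(solved) >= min_solvers]

def mean_individual_score(results: Dict[str, List[bool]], tasks: List[str]) -> float:
    """Average, over testers, of the fraction of `tasks` each tester solved."""
    n_testers = len(results[tasks[0]])
    per_tester = [
        sum(results[task][i] for task in tasks) / len(tasks)
        for i in range(n_testers)
    ]
    return sum(per_tester) / n_testers

def panel_score(results: Dict[str, List[bool]], tasks: List[str]) -> float:
    """Fraction of `tasks` solved by at least one tester."""
    return sum(any(results[task]) for task in tasks) / len(tasks)

kept = retained_tasks(RESULTS)
print(kept)
print(f"mean individual score: {mean_individual_score(RESULTS, kept):.0%}")
print(f"panel score: {panel_score(RESULTS, kept):.0%}")
```

In this invented table, each tester solves two of the three retained tasks, yet the panel as a whole solves all three.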

What the leaderboard actually shows

The ARC-AGI-2 leaderboard, maintained by the ARC Prize Foundation, has become one of the most-watched evaluation surfaces in AI research. The current standings reveal a pattern that the frontier model marketing does not communicate clearly. Frontier closed-API models cluster in the 4 to 10 percent range on ARC-AGI-2, depending on the specific configuration and the date of evaluation. Cost-per-task varies dramatically, with some models requiring substantially more compute to reach their published scores than others.

OpenAI’s GPT-5 reached 9.9 percent at approximately $0.73 per task on the semi-private evaluation set. GPT-5 Mini scored 4.4 percent at $0.20 per task. GPT-5 Nano scored 2.5 percent. The pattern across other frontier labs is similar: Gemini, Claude, and Grok variants all sit in the same low single-digit to low double-digit range, with the specific score depending heavily on how much test-time compute the model uses and how the prompting scaffold is structured.
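A quick way to read score and cost together is to divide one by the other. The sketch below uses only the two entries above that report both figures. The "points per dollar" metric is an illustrative assumption of this sketch, not something the leaderboard publishes.

```python
from typing import List, NamedTuple

class Entry(NamedTuple):
    model: str
    score_pct: float          # ARC-AGI-2 score, in percent
    cost_per_task_usd: float  # approximate cost per task, in US dollars

# Figures as quoted in the paragraph above.
ENTRIES: List[Entry] = [
    Entry("GPT-5", 9.9, 0.73),
    Entry("GPT-5 Mini", 4.4, 0.20),
]

def points_per_dollar(entry: Entry) -> float:
    """Percentage points of score per dollar of per-task cost (illustrative metric)."""
    return entry.score_pct / entry.cost_per_task_usd

for entry in ENTRIES:
    print(f"{entry.model}: {points_per_dollar(entry):.1f} points per dollar")

big, small = ENTRIES
cost_ratio = big.cost_per_task_usd / small.cost_per_task_usd
score_ratio = big.score_pct / small.score_pct
print(f"cost ratio: {cost_ratio:.2f}x, score ratio: {score_ratio:.2f}x")
```

On these figures, GPT-5 costs about 3.65 times as much per task as GPT-5 Mini for about 2.25 times the score, which works out to roughly 13.6 versus 22.0 points per dollar.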

The most striking data point on the leaderboard, however, is not from a frontier lab. Samsung’s Tiny Recursive Model, the 7-million-parameter system documented in our TRM analysis, scores 7.8 percent on ARC-AGI-2 at training and inference costs that are several orders of magnitude lower than any frontier model. The TRM result is the part of the leaderboard that has drawn the most attention from researchers, because it suggests that the path to ARC-AGI capability may not run through scale.

The launch of ARC-AGI-3 in March 2026 raised the difficulty further. Early reports indicate frontier models score below 1 percent on the new benchmark’s interactive tasks, while humans continue to solve them at near-100 percent rates. The goalpost continues to move as the benchmark community responds to incremental progress with harder tests.


What the benchmark actually measures

Chollet’s framing of intelligence, articulated across the 2019 paper and the 2025 update, is the conceptual scaffolding that makes ARC-AGI-2 meaningful. Intelligence, in his framing, is the efficiency with which a system can acquire new skills from minimal examples in unfamiliar problem spaces. The framing is deliberately distinct from the dominant LLM benchmark culture, where models are evaluated on their performance on tasks they have been trained for, with the implicit assumption that better task performance equals better intelligence.

The distinction matters because the two framings produce different research priorities. LLM benchmarks reward larger training corpora, more compute, and longer context windows. ARC-AGI-2 rewards architectural choices that produce efficient generalization from sparse examples. The frontier LLMs perform well on the former and poorly on the latter, whereas a specialized 7-million-parameter system can match or beat many of them on the latter despite being incapable of the former. Neither result invalidates the other. They measure different capabilities.

The implications for AI research are real. The labs that bet exclusively on scaling are betting that LLM-benchmark gains continue to compound in ways that matter for the downstream applications they care about. The labs that bet on architectural innovation, including the recursive reasoning models, the various reinforcement learning approaches documented in our Qwen QwQ analysis, and the broader exploration of alternatives to next-token prediction, are hedging the scaling thesis. The patterns connect with the State of LLMs 2025 coverage and the AI models 2025 analysis.

What the benchmark is and is not

ARC-AGI-2 is not a comprehensive measure of AI capability. The benchmark tests a specific class of structured reasoning problems and does not evaluate language understanding, world knowledge, creative generation, multimodal integration, or any of the other capabilities that frontier LLMs excel at. A model that scores well on ARC-AGI-2 may still be useless for most production AI workloads. A model that scores poorly on ARC-AGI-2 may still be commercially valuable for the workloads it actually targets.

The benchmark’s value lies in what it reveals about the gap between current AI and general intelligence. Frontier LLMs can write essays, summarize documents, generate code, and pass professional exams. They cannot reliably solve the kind of abstract pattern-recognition problems that an eight-year-old finds straightforward. The gap is real, persistent, and not closing as fast as the LLM benchmark scores would suggest. The patterns connect with the methodology discussions documented in our Stanford HAI 2026 coverage.

For procurement teams, the ARC-AGI-2 results are not directly actionable in most cases. The benchmark does not predict performance on customer service, document review, code generation, or any other typical enterprise workload. The benchmark does inform a broader procurement question: how much of the marketing around frontier AI capability reflects genuine progress toward general intelligence, and how much reflects narrow performance gains on tasks where evaluation reliability has degraded through contamination?

A reorientation for AI capability assessment

The reorientation worth naming is that the AI field has been measuring capability through benchmarks whose reliability is increasingly uneven. ARC-AGI-2, with its private evaluation set, novel task generation, and human-calibrated difficulty, represents one of the more reliable signals available. The fact that frontier models score in the single digits while humans score in the 60-to-100 percent range should temper claims about how close current AI is to general intelligence.

For executives making strategic AI decisions, the implication is operational. The capability gains being marketed by frontier labs are real on the tasks the labs choose to highlight. They are not uniform across all dimensions of intelligence. The procurement question is not whether to buy frontier AI, since the capability is genuinely useful for many workloads. It is how to calibrate expectations against the specific gaps that ARC-AGI-2 and similar benchmarks expose. The pattern is documented across our agentic AI report, where the gap between demo capability and production reliability has a similar shape.

What the next ARC cycle will reveal

The ARC Prize Foundation continues to update the benchmark, with ARC-AGI-3 raising the bar further and pushing the gap between human and AI performance back open. The labs that take ARC seriously, namely those pursuing architectural innovation alongside scaling, will continue to produce results that compress the gap. The labs that bet exclusively on scaling will continue to find ARC scores embarrassing relative to their other benchmark performance, because the benchmark is specifically designed to resist what scaling alone can deliver.

So one question for any executive whose AI strategy assumes frontier model capability will continue compounding at current rates: if the next two years of progress on benchmarks like ARC-AGI-2 follow the recent trend, with scaled models still scoring in single digits against humans solving the same tasks easily, would your strategy survive the discovery that general intelligence is not on the immediate roadmap of the labs you are buying from?
