Every AI lab on the planet is claiming their model “reasons.” ARC-AGI-3 just called everyone’s bluff.
The new benchmark dropped this week and the scores are brutal. It’s an interactive test where AI agents get dropped into game-like environments with zero instructions. No prompts. No examples. No hints. Figure out the objective, figure out the rules, execute. Humans score 100%. Gemini 3.1 Pro, the best performer, scored 0.37%. GPT-5.4 got 0.26%. Opus 4.6 managed 0.25%. Grok-4.20 scored literally zero.
Let that sink in. The most capable AI systems on earth can barely crack one-third of one percent on a test that every human participant aced.
Why This Benchmark Is Different
Most AI benchmarks test pattern matching dressed up as intelligence. Can you complete this code? Can you answer this trivia question? Can you summarize this document? Those are useful skills. They’re also exactly the kind of thing you’d expect a very sophisticated autocomplete engine to handle well.
ARC-AGI-3 tests something fundamentally different: the ability to encounter a totally novel situation and figure out what’s going on from scratch. No training data to fall back on. No similar examples in your weights. Just raw “look at this thing, understand it, act on it.” That’s closer to what we actually mean when we say intelligence.
The $700K prize pool isn’t just a marketing stunt, though it is also that. It creates a clear, public scoreboard that’s resistant to the usual benchmark gaming. You can’t fine-tune your way to a good score here because each evaluation environment is novel. Either your model can genuinely adapt to new situations or it can’t. Right now, it very much can’t.
The Builder’s Take
This should recalibrate how you think about what AI agents can actually do in production. The gap between “impressive demo” and “reliable autonomous agent” is enormous. If the best models on earth can’t figure out a simple game environment without instructions, they definitely can’t reliably handle open-ended tasks in your codebase, your infrastructure, or your business processes without significant guardrails.
That’s not a reason to stop building with AI. It’s a reason to be honest about where you put humans in the loop. The current generation of models is spectacular at well-defined tasks with clear inputs and outputs. It falls apart when the problem space is ambiguous and the objectives aren’t spelled out.
If you’re building agents, design for the 0.37% world, not the marketing demo world. Assume your agent will fail at anything genuinely novel. Build recovery paths. Keep humans close to the decisions that matter. The models will get better. But the benchmark exists specifically to measure the gap between “better at pattern matching” and “actually intelligent,” and that gap is currently a canyon.
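To make that concrete, here’s a minimal sketch of what “build recovery paths, keep humans close” can look like in practice. Every name in it (`run_with_guardrails`, `agent_step`, `human_review`, the self-reported confidence field) is hypothetical, not a real API; the shape is what matters: a bounded retry loop with an explicit escalation path, instead of trusting the agent to recognize its own failure on a novel task.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentResult:
    output: str
    confidence: float  # self-reported, 0.0-1.0 (a crude proxy; swap in real evals)

def run_with_guardrails(
    task: str,
    agent_step: Callable[[str], AgentResult],
    human_review: Callable[[str, AgentResult], str],
    confidence_floor: float = 0.8,
    max_retries: int = 2,
) -> str:
    """Assume the agent will fail on anything novel: retry a bounded
    number of times, then escalate to a human rather than plowing ahead."""
    result = agent_step(task)
    attempts = 1
    while result.confidence < confidence_floor and attempts <= max_retries:
        result = agent_step(task)
        attempts += 1
    if result.confidence >= confidence_floor:
        return result.output  # well-defined task: let the agent proceed
    # Recovery path: the agent never cleared the bar, so a human decides.
    return human_review(task, result)

if __name__ == "__main__":
    # Stub agent that never gets confident, so the call escalates.
    print(run_with_guardrails(
        task="refactor the billing module",
        agent_step=lambda t: AgentResult(output=f"draft plan for: {t}", confidence=0.4),
        human_review=lambda t, r: f"escalated to human review: {t!r} ({r.output})",
    ))
```

The design choice worth copying: escalation is the default outcome, not the exception handler. The agent earns autonomy on each task by clearing a bar, which is the right posture for models scoring 0.37% on genuinely novel problems.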
Every few months someone declares that AGI is basically here. ARC-AGI-3 is the cold shower. We’re making progress. We’re not close.