Every Single AI Model Just Failed This New Test Miserably (And There's $2 Million on the Line)
The brand-new ARC-AGI-3 benchmark dropped and not one frontier AI model could score above 1%. The prize for cracking it? Two million dollars.
Think today's AI models are getting close to human-level intelligence? A brand new test just delivered a brutal reality check.
ARC-AGI-3, the latest version of a benchmark designed to measure how well AI can actually reason and think (not just memorize patterns), launched this week. The results? Every single frontier AI model scored under 1%. GPT, Claude, Gemini, all of them. Completely stumped.
To put that in perspective, the average human can handle these puzzles without much trouble. We're talking about visual pattern recognition and logical reasoning tasks that feel pretty intuitive to people but apparently break the world's most advanced AI systems.
The organization behind the test is putting $2 million on the table for anyone who can build a system that cracks it. That's double the previous prize, which signals how confident the organizers are that today's AI still has a long way to go.
This matters because companies like OpenAI and Google keep telling us we're on the doorstep of "AGI" (artificial general intelligence, meaning AI as smart as humans). But if their best models can't even score 1% on a reasoning test that humans breeze through, maybe we should pump the brakes on the hype a little.
Source: ARC Prize