The Arc Prize Foundation, co-founded by AI researcher François Chollet, has introduced ARC-AGI-2, a new benchmark designed to measure progress toward artificial general intelligence (AGI). The test presents AI models with visual pattern-recognition puzzles built from grids of colored squares, demanding abstract reasoning and adaptability. Unlike traditional assessments that models can prepare for with extensive training data, ARC-AGI-2 requires them to solve novel problems they have not seen before. In recent evaluations, reasoning models such as OpenAI’s o1-pro and DeepSeek’s R1 scored between 1% and 1.3%, while non-reasoning models like GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash achieved around 1%. Human participants, by contrast, averaged a 60% success rate, highlighting how far current AI remains from human-level abstract problem-solving.
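To make the task format concrete, here is a minimal sketch of how an ARC-style puzzle is commonly represented. It follows the JSON-like layout of the publicly released ARC-AGI-1 tasks (a list of "train" demonstration pairs plus "test" inputs, with each grid a list of rows of integers 0–9 standing for colors); whether ARC-AGI-2 keeps exactly this structure is an assumption, and the toy color-swap rule below is invented purely for illustration.

```python
# Illustrative only: a toy task in the layout used by the public ARC-AGI-1
# release (ARC-AGI-2 is assumed here to follow a similar format).
# Each grid is a list of rows; each cell is an integer 0-9 naming a color.
example_task = {
    "train": [
        {"input": [[1, 0], [0, 1]], "output": [[0, 1], [1, 0]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]},  # the solver must infer the rule and emit the output grid
    ],
}

def grid_shape(grid):
    """Return (rows, cols); ARC output grids may differ in size from their inputs."""
    return len(grid), len(grid[0]) if grid else 0

for i, pair in enumerate(example_task["train"]):
    print(f"demonstration {i}: input {grid_shape(pair['input'])} -> output {grid_shape(pair['output'])}")
```

The point of the format is that the few demonstration pairs are all the model gets: it must infer the transformation and apply it to the test input, rather than retrieve a pattern memorized during training.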
ARC-AGI-2 not only assesses problem-solving accuracy but also evaluates the efficiency of skill acquisition, discouraging reliance on brute-force computation. This focus on efficiency addresses shortcomings identified in the earlier ARC-AGI-1 test, which allowed models to leverage extensive computing power to find solutions. For instance, OpenAI’s o3 model scored 75.7% on ARC-AGI-1 but required $200 worth of computing power per task. When tested on ARC-AGI-2, the same model managed only a 4% success rate, underscoring the new benchmark’s emphasis on efficient problem-solving.
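As a rough illustration of what "efficiency" means here, the sketch below tracks a solver's accuracy alongside its average compute spend per task. This is not the ARC Prize Foundation's scoring code; the Attempt record and the dollar figures are hypothetical, and the actual leaderboard reports score and cost per task as separate measures.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    solved: bool      # did the model produce the exact expected output grid?
    cost_usd: float   # compute spend for this single task attempt

def summarize(attempts: list[Attempt]) -> tuple[float, float]:
    """Return (accuracy, average cost per task) over a set of attempts.

    Hypothetical bookkeeping: the idea ARC-AGI-2 stresses is that a score
    reached only through very large per-task compute budgets is less
    meaningful than the same score reached cheaply.
    """
    n = len(attempts)
    accuracy = sum(a.solved for a in attempts) / n
    avg_cost = sum(a.cost_usd for a in attempts) / n
    return accuracy, avg_cost

# Hypothetical run: 100 tasks, a handful solved, at a made-up cost per attempt.
attempts = [Attempt(solved=(i % 25 == 0), cost_usd=12.5) for i in range(100)]
acc, cost = summarize(attempts)
print(f"accuracy: {acc:.0%}, average cost per task: ${cost:.2f}")
```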
The introduction of ARC-AGI-2 reflects a growing need for more rigorous and comprehensive benchmarks to evaluate AI progress. Existing tests often fail to differentiate between narrow AI capabilities and genuine intelligence, leading to misleading assessments of AGI development. By emphasizing both problem-solving ability and efficiency, ARC-AGI-2 aims to guide AI research toward models that demonstrate true reasoning and adaptability, rather than relying on increased computational power. This initiative is crucial for advancing AI systems toward achieving human-like cognitive abilities.