The Abstraction and Reasoning Corpus (ARC) is designed to promote research on abstract reasoning, a fundamental aspect of human intelligence. Common approaches to ARC treat it as a language-oriented problem, addressed by large language models (LLMs) or recurrent reasoning models. However, although the puzzle-like tasks in ARC are inherently visual, existing research has rarely approached the problem from a vision-centric perspective. In this work, we formulate ARC within a vision paradigm, framing it as an image-to-image translation problem. To incorporate visual priors, we represent the inputs on a "canvas" that can be processed like natural images. It is then natural for us to apply standard vision architectures, such as a vanilla Vision Transformer (ViT), to perform image-to-image mapping. Our model is trained from scratch solely on ARC data and generalizes to unseen tasks through test-time training. Our framework, termed Vision ARC (VARC), achieves 60.4% accuracy on the ARC-1 benchmark, substantially outperforming existing methods that are also trained from scratch. Our results are competitive with those of leading LLMs and close the gap to average human performance.
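
As a concrete illustration of the pipeline the abstract describes, below is a minimal sketch, not the authors' implementation: an ARC grid is embedded on a fixed-size "canvas" and a vanilla ViT maps it to an output canvas, predicting a color per cell (image-to-image translation). The canvas size, padding value, tokenization (one token per canvas cell), and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of the canvas + vanilla-ViT idea; all names and sizes are
# illustrative assumptions, not the VARC implementation.
import torch
import torch.nn as nn

CANVAS = 32          # assumed canvas side length (ARC grids are at most 30x30)
NUM_COLORS = 10      # ARC uses 10 colors
PAD = NUM_COLORS     # assumed extra "background" value for unused canvas cells


def to_canvas(grid: torch.Tensor) -> torch.Tensor:
    """Embed an HxW ARC grid (integer values 0-9) into a CANVAS x CANVAS canvas."""
    canvas = torch.full((CANVAS, CANVAS), PAD, dtype=torch.long)
    h, w = grid.shape
    canvas[:h, :w] = grid
    return canvas


class CanvasViT(nn.Module):
    """Vanilla ViT-style encoder treating each canvas cell as one token and
    predicting a color (or PAD) per cell, i.e. image-to-image mapping."""

    def __init__(self, dim: int = 256, depth: int = 6, heads: int = 8):
        super().__init__()
        self.embed = nn.Embedding(NUM_COLORS + 1, dim)   # cell colors + PAD
        self.pos = nn.Parameter(torch.zeros(CANVAS * CANVAS, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, NUM_COLORS + 1)       # per-cell color logits

    def forward(self, canvas: torch.Tensor) -> torch.Tensor:
        # canvas: (B, CANVAS, CANVAS) integers -> (B, CANVAS*CANVAS, colors) logits
        x = self.embed(canvas.flatten(1)) + self.pos
        return self.head(self.encoder(x))


if __name__ == "__main__":
    grid = torch.randint(0, NUM_COLORS, (7, 9))          # a toy 7x9 ARC grid
    model = CanvasViT()
    logits = model(to_canvas(grid).unsqueeze(0))
    print(logits.shape)  # torch.Size([1, 1024, 11])
```

In the same spirit, test-time training would fine-tune such a model on a new task's demonstration input-output pairs before predicting the test output; the specifics of that loop here are assumptions rather than the paper's recipe.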