The Abstraction and Reasoning Corpus (ARC) is designed to promote research on abstract reasoning, a fundamental aspect of human intelligence. Common approaches to ARC treat it as a language-oriented problem, addressed by large language models (LLMs) or recurrent reasoning models. However, although the puzzle-like tasks in ARC are inherently visual, existing research has rarely approached the problem from a vision-centric perspective. In this work, we formulate ARC within a vision paradigm, framing it as an image-to-image translation problem. To incorporate visual priors, we represent the inputs on a "canvas" that can be processed like natural images. It is then natural for us to apply standard vision architectures, such as a vanilla Vision Transformer (ViT), to perform image-to-image mapping. Our model is trained from scratch solely on ARC data and generalizes to unseen tasks through test-time training. Our framework, termed Vision ARC (VARC), achieves 60.4% accuracy on the ARC-1 benchmark, substantially outperforming existing methods that are also trained from scratch. Our results are competitive with those of leading LLMs and close the gap to average human performance.
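
As a concrete illustration of the pipeline the abstract describes, below is a minimal sketch, not the authors' implementation: an ARC grid is embedded on a fixed-size "canvas" and a vanilla ViT maps it to an output canvas, predicting a color per cell (image-to-image translation). The canvas size, padding value, tokenization (one token per canvas cell), and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of the canvas + vanilla-ViT idea; all names and sizes are
# illustrative assumptions, not the VARC implementation.
import torch
import torch.nn as nn

CANVAS = 32          # assumed canvas side length (ARC grids are at most 30x30)
NUM_COLORS = 10      # ARC uses 10 colors
PAD = NUM_COLORS     # assumed extra "background" value for unused canvas cells


def to_canvas(grid: torch.Tensor) -> torch.Tensor:
    """Embed an HxW ARC grid (integer values 0-9) into a CANVAS x CANVAS canvas."""
    canvas = torch.full((CANVAS, CANVAS), PAD, dtype=torch.long)
    h, w = grid.shape
    canvas[:h, :w] = grid
    return canvas


class CanvasViT(nn.Module):
    """Vanilla ViT-style encoder treating each canvas cell as one token and
    predicting a color (or PAD) per cell, i.e. image-to-image mapping."""

    def __init__(self, dim: int = 256, depth: int = 6, heads: int = 8):
        super().__init__()
        self.embed = nn.Embedding(NUM_COLORS + 1, dim)   # cell colors + PAD
        self.pos = nn.Parameter(torch.zeros(CANVAS * CANVAS, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, NUM_COLORS + 1)       # per-cell color logits

    def forward(self, canvas: torch.Tensor) -> torch.Tensor:
        # canvas: (B, CANVAS, CANVAS) integers -> (B, CANVAS*CANVAS, colors) logits
        x = self.embed(canvas.flatten(1)) + self.pos
        return self.head(self.encoder(x))


if __name__ == "__main__":
    grid = torch.randint(0, NUM_COLORS, (7, 9))          # a toy 7x9 ARC grid
    model = CanvasViT()
    logits = model(to_canvas(grid).unsqueeze(0))
    print(logits.shape)  # torch.Size([1, 1024, 11])
```

In the same spirit, test-time training would fine-tune such a model on a new task's demonstration input-output pairs before predicting the test output; the specifics of that loop here are assumptions rather than the paper's recipe.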