Recently, it was shown that small, looped architectures such as Tiny Recursive Models (TRMs) can outperform Large Language Models (LLMs) on complex reasoning tasks, including the Abstraction and Reasoning Corpus (ARC). In this work, we investigate a core question: how can we further improve the efficiency of these methods with minimal changes? To address this, we frame the latent reasoning of TRMs as a form of classifier-free guidance and as an implicit policy-improvement algorithm. Building on these insights, we propose a novel training scheme that supervises every loop with its own target. Our approach substantially improves training efficiency: it reduces the total number of forward passes by 18x and eliminates the halting mechanism, while maintaining quality comparable to standard TRMs. Notably, we achieve 24% accuracy on ARC-1 with only 0.8M parameters, outperforming most LLMs.
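To make the per-loop supervision idea concrete, the following is a minimal PyTorch sketch of a shared recursive block whose latent state is refined over a fixed number of loops, with a loss attached to every loop rather than only the final one. All names here (`TinyRecursiveBlock`, `per_loop_loss`, `n_loops`) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of per-loop supervision for a tiny recursive model.
# Assumption: architecture and names are illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRecursiveBlock(nn.Module):
    """One shared reasoning step, applied repeatedly to a latent state."""
    def __init__(self, d_model: int):
        super().__init__()
        self.step = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, z: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # Refine the latent z conditioned on the input embedding x.
        return z + self.step(torch.cat([z, x], dim=-1))

def per_loop_loss(block, readout, x, z0, target, n_loops: int = 4):
    """Supervise every loop iteration instead of only the last one,
    so no learned halting mechanism is needed."""
    z, losses = z0, []
    for _ in range(n_loops):
        z = block(z, x)
        logits = readout(z)  # decode the current latent state
        losses.append(F.cross_entropy(logits, target))
    return torch.stack(losses).mean()

# Example usage (shapes only): batch of 8, latent dim 64, 10 output classes.
d, n_cls = 64, 10
block, readout = TinyRecursiveBlock(d), nn.Linear(d, n_cls)
x = torch.randn(8, d)              # input embedding (assumed precomputed)
z0 = torch.zeros(8, d)             # initial latent state
y = torch.randint(0, n_cls, (8,))  # ground-truth labels
per_loop_loss(block, readout, x, z0, y).backward()
```

Averaging the per-loop losses gives every iteration a direct gradient signal, which is what would let a fixed loop count replace a learned halting head in this sketch.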