Learning When to Stop: Adaptive Latent Reasoning via Reinforcement Learning

Latent reasoning is a recent development in Transformer language models that has shown potential to compress reasoning length relative to chain-of-thought reasoning. By passing the information-rich final latent state from the previous step directly into the next step of the sequence, latent reasoning removes the restriction that reasoning must be expressed in human-language tokens. We develop adaptive-length latent reasoning models and introduce a post-SFT reinforcement-learning methodology that optimizes latent reasoning length, minimizing it while maintaining accuracy. This further reduces compute usage and raises the bar on the compressive capabilities of latent reasoning models. Experiments on the Llama 3.2 1B model and the GSM8K-Aug dataset show a $52\%$ drop in total reasoning length with no penalty to accuracy. In future work, we plan to extend to additional models and datasets, analyze the relationships between training coefficients, experiment with architecture variations, and continue our efforts on knowledge distillation for latent reasoning SFT. We make our code and pretrained weights available at https://github.com/apning/adaptive-latent-reasoning.
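To make the objective of "minimizing reasoning length while maintaining accuracy" concrete, the following is a minimal sketch of one possible length-penalized reward. The abstract does not specify the reward actually used, so the function name `length_penalized_reward`, the linear penalty, and the `length_coef` and `max_latent_steps` parameters are all illustrative assumptions rather than the authors' formulation.

```python
def length_penalized_reward(
    is_correct: bool,
    num_latent_steps: int,
    max_latent_steps: int,
    length_coef: float = 0.1,  # hypothetical trade-off coefficient
) -> float:
    """Illustrative reward: pay for a correct answer, charge for latent steps.

    This is an assumed form, not the reward from the paper.
    """
    if max_latent_steps <= 0:
        raise ValueError("max_latent_steps must be positive")
    if num_latent_steps < 0:
        raise ValueError("num_latent_steps must be non-negative")

    # Accuracy term: 1 for a correct final answer, 0 otherwise.
    accuracy_reward = 1.0 if is_correct else 0.0

    # Length term: fraction of the latent-step budget that was used, capped at 1.
    used_fraction = min(num_latent_steps, max_latent_steps) / max_latent_steps

    return accuracy_reward - length_coef * used_fraction


if __name__ == "__main__":
    # A correct answer reached in fewer latent steps scores higher.
    print(length_penalized_reward(True, 2, 8))
    print(length_penalized_reward(True, 6, 8))
    print(length_penalized_reward(False, 2, 8))
```

With a coefficient below 1, a correct answer always scores higher than an incorrect one regardless of length, so under this assumed form the length penalty only ranks rollouts of equal correctness and does not reward trading accuracy for brevity.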
