Recent advances in Large Language Model (LLM) reasoning have shifted from explicit Chain-of-Thought (CoT) reasoning to more efficient latent reasoning, where intermediate thoughts are represented as vectors rather than text. However, latent reasoning can be brittle on challenging, out-of-distribution tasks, precisely where robust reasoning is most critical. To overcome these limitations, we introduce Latent Thought Policy Optimization (LTPO), a parameter-free framework that enhances LLM reasoning entirely at test time, without any model parameter updates. LTPO treats the intermediate latent "thought" vectors as dynamic parameters that are actively optimized for each problem instance. It employs an online policy gradient method guided by an intrinsic, confidence-based reward signal computed directly from the frozen LLM's own output distributions, eliminating the need for external supervision or expensive text generation during optimization. Extensive experiments on five reasoning benchmarks show that LTPO not only matches or surpasses strong baselines on standard tasks but also remains robust where others fail. Most notably, on the highly challenging AIME benchmarks, where existing latent reasoning baselines collapse to near-zero accuracy, LTPO delivers substantial improvements, demonstrating a unique capability for complex reasoning.
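The test-time loop sketched in the abstract can be pictured as follows: a Gaussian policy over latent thought vectors is updated by a REINFORCE-style policy gradient against a confidence (negative-entropy) reward read off a frozen model, with no gradient ever reaching the model's weights. The snippet below is a minimal illustrative sketch under these assumptions; the toy frozen_llm_head, confidence_reward, and all hyperparameters are hypothetical scaffolding, not the authors' released implementation.

```python
# Minimal sketch: test-time policy gradient over latent "thought" vectors with
# an intrinsic confidence reward from a frozen model (illustrative, not LTPO's
# actual code; the tiny frozen_llm_head stands in for a real frozen LLM).
import torch
import torch.nn as nn

torch.manual_seed(0)
LATENT_DIM, VOCAB_SIZE, NUM_THOUGHTS = 64, 100, 4

# Stand-in for the frozen LLM: maps pooled latent thoughts to answer logits.
frozen_llm_head = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.Tanh(),
                                nn.Linear(128, VOCAB_SIZE))
for p in frozen_llm_head.parameters():
    p.requires_grad_(False)  # no model parameter updates at test time

# Per-instance parameters: mean and log-std of a Gaussian policy over thoughts.
thought_means = torch.zeros(NUM_THOUGHTS, LATENT_DIM, requires_grad=True)
log_std = torch.full((NUM_THOUGHTS, LATENT_DIM), -1.0, requires_grad=True)
optimizer = torch.optim.Adam([thought_means, log_std], lr=0.05)

def confidence_reward(latents: torch.Tensor) -> torch.Tensor:
    """Intrinsic reward: negative entropy of the frozen model's output
    distribution, so sharper (more confident) predictions score higher."""
    logits = frozen_llm_head(latents.mean(dim=0, keepdim=True))  # pool thoughts
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return -entropy.squeeze(0)

baseline = 0.0
for step in range(200):
    # Sample latent thoughts from the current Gaussian policy.
    dist = torch.distributions.Normal(thought_means, log_std.exp())
    latents = dist.sample()
    with torch.no_grad():
        reward = confidence_reward(latents)
    # REINFORCE with a running baseline to reduce gradient variance.
    baseline = 0.9 * baseline + 0.1 * reward.item()
    log_prob = dist.log_prob(latents).sum()
    loss = -(reward.item() - baseline) * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Decode with the optimized latent thoughts; only the thoughts were updated.
with torch.no_grad():
    answer = frozen_llm_head(thought_means.mean(dim=0, keepdim=True)).argmax(dim=-1)
print("predicted answer token id:", answer.item())
```

In this sketch the only optimized quantities are the per-instance latent parameters, which mirrors the abstract's claim of test-time adaptation without external supervision or text generation during optimization.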