ZIP-RC: Optimizing Test-Time Compute via Zero-Overhead Joint Reward-Cost Prediction

Large language models excel at reasoning but lack key aspects of introspection, including the ability to anticipate their own success and the computation required to achieve it. Humans use real-time introspection to decide how much effort to invest, when to make multiple attempts, when to stop, and when to signal success or failure. Without this ability, LLMs struggle to make intelligent meta-cognitive decisions. Test-time scaling methods such as Best-of-N drive up cost and latency by spending a fixed budget of samples regardless of each sample's marginal benefit at any point in generation, and the absence of confidence signals can mislead people, prevent appropriate escalation to better tools, and undermine trustworthiness. Learned verifiers or reward models can provide confidence estimates, but they do not enable adaptive inference, and they add substantial cost by requiring extra models or forward passes. We present ZIP-RC, an adaptive inference method that equips models with zero-overhead inference-time predictions of reward and cost. At every token, ZIP-RC reuses reserved or unused logits in the same forward pass as next-token prediction to output a joint distribution over final reward and remaining length, with no extra models, architectural changes, or inference overhead. This full joint distribution is used to compute a sampling utility: a linear combination of the expected maximum reward, total compute, and latency of a set of samples if they were generated to completion. During inference, we maximize this utility with meta-actions that determine which prefixes of tokens to continue or to initiate sampling from. On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and traces smooth Pareto frontiers between quality, compute, and latency. By providing real-time reward-cost introspection, ZIP-RC enables adaptive, efficient reasoning.
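To make the utility computation concrete, here is a minimal sketch of how a joint reward-length distribution could be read from a reserved slice of the logits and turned into a sampling utility. It is an illustration under stated assumptions, not the paper's implementation. The abstract does not specify the bin layout, the independence of samples, the use of summed remaining length as compute and maximum remaining length as latency, or the weights `alpha` and `beta`; all function names here are hypothetical.

```python
import math
from itertools import combinations

def joint_from_reserved_logits(logits, reserved_start, n_reward_bins, n_length_bins):
    """Softmax over a reserved slice of the vocabulary logits, reshaped to
    a grid[r][l] over (reward bin, remaining-length bin). The slice layout
    is an assumption made for this sketch."""
    n = n_reward_bins * n_length_bins
    block = logits[reserved_start:reserved_start + n]
    m = max(block)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in block]
    z = sum(exps)
    probs = [e / z for e in exps]
    return [probs[r * n_length_bins:(r + 1) * n_length_bins]
            for r in range(n_reward_bins)]

def expected_max(pmfs, values):
    """E[max] of independent discrete variables that share the ascending
    support `values`, computed from the product of their CDFs."""
    total, prev = 0.0, 0.0
    running = [0.0] * len(pmfs)  # running CDF value for each variable
    for k, v in enumerate(values):
        cur = 1.0
        for i, pmf in enumerate(pmfs):
            running[i] += pmf[k]
            cur *= running[i]
        total += v * (cur - prev)  # P(max == v) = F(v) - F(previous value)
        prev = cur
    return total

def sampling_utility(joints, reward_values, length_values, alpha, beta):
    """Expected best reward minus weighted expected compute and latency.
    Assumes samples are independent, compute is the sum of remaining
    lengths, and latency is the longest remaining length."""
    reward_pmfs = [[sum(row) for row in j] for j in joints]
    length_pmfs = [[sum(row[l] for row in j) for l in range(len(length_values))]
                   for j in joints]
    best_reward = expected_max(reward_pmfs, reward_values)
    compute = sum(sum(p * v for p, v in zip(pmf, length_values))
                  for pmf in length_pmfs)
    latency = expected_max(length_pmfs, length_values)
    return best_reward - alpha * compute - beta * latency

def best_subset(joints, reward_values, length_values, alpha, beta):
    """Toy meta-action: exhaustively pick the non-empty subset of candidate
    prefixes whose continuation maximizes the utility."""
    idx = range(len(joints))
    subsets = [c for k in range(1, len(joints) + 1) for c in combinations(idx, k)]
    return max(subsets, key=lambda s: sampling_utility(
        [joints[i] for i in s], reward_values, length_values, alpha, beta))
```

For example, with two candidate prefixes that each have a 50% chance of a correct answer, `best_subset` keeps both when the cost weights are small, because the expected maximum reward rises from 0.5 to 0.75, and keeps only one when compute is penalized heavily. Exhaustive subset search is used only to keep the sketch short; it scales exponentially in the number of candidates.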
