Modern Large Language Models achieve impressive reasoning capabilities with long chains of thought, but they incur substantial computational cost during inference, which motivates techniques that improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, traditional token-level Speculative Decoding struggles in reasoning tasks, because token mismatches between semantically equivalent steps cause unnecessary rejections. Recent works have shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, yet existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between the draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to $\sim 2\times$ at matched accuracy.
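The routing idea described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper's implementation: `draft_step`, `target_step`, and `router` are hypothetical callables, the router is assumed to see the context together with the draft's candidate step and to return the probability that the target would produce a meaningfully better step, and `threshold`, `max_steps`, and `stop_marker` are placeholder parameters.

```python
from typing import Callable, List, Tuple

Step = str


def arbitrage_generate(
    prompt: str,
    draft_step: Callable[[str], Step],      # hypothetical: cheap draft model, one step
    target_step: Callable[[str], Step],     # hypothetical: expensive target model, one step
    router: Callable[[str, Step], float],   # hypothetical: P(target step is meaningfully better)
    threshold: float = 0.5,                 # assumed routing cutoff, not from the paper
    max_steps: int = 32,
    stop_marker: str = "<END>",             # assumed end-of-solution marker
) -> Tuple[List[Step], int]:
    """Generate reasoning steps, calling the target only when the router expects a gain."""
    context = prompt
    steps: List[Step] = []
    target_calls = 0
    for _ in range(max_steps):
        # The draft model always proposes the next step first.
        candidate = draft_step(context)
        # Spend target compute only if the router predicts an advantage.
        if router(context, candidate) > threshold:
            candidate = target_step(context)
            target_calls += 1
        steps.append(candidate)
        context += candidate
        if stop_marker in candidate:
            break
    return steps, target_calls
```

Under these assumptions, the count of target calls is a rough proxy for the target compute that routing aims to save; the oracle described above would correspond to a router that knows exactly which of the two candidate steps is of higher quality.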