Test-time compute scaling has emerged as a powerful paradigm for improving mathematical reasoning in large language models (LLMs) by allocating additional computation during inference. However, current methods distribute resources uniformly across all reasoning sub-problems, so challenging sub-problems receive too little computation while routine operations consume a disproportionate share. As a result, additional compute yields diminishing returns. Inspired by dual-process theory, we propose \textbf{SCALE} (Selective Resource Allocation), a framework that allocates computational resources selectively according to sub-problem difficulty. SCALE operates in four stages: (1) decomposing the problem into sequential reasoning sub-problems, (2) assessing the difficulty of each sub-problem to separate routine operations from computationally challenging ones, (3) assigning a processing mode, System 1 for simple sub-problems and System 2 for complex ones, and (4) executing the sub-problems sequentially with context propagation. By concentrating resources on challenging sub-problems and handling routine operations cheaply, SCALE improves accuracy while using compute more efficiently. Extensive experiments show that SCALE significantly outperforms uniform scaling baselines, improving accuracy by up to 13.75 percentage points (57.50% to 71.25% on AIME25) while reducing computational costs by 33%-53%, addressing a fundamental limitation of current test-time scaling approaches.
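The four stages can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the numeric difficulty score, and the fixed `threshold` are hypothetical, since the abstract does not say how difficulty is measured or how the two modes are realized. The toy helpers stand in for what would be model calls in practice.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    sub_problem: str
    difficulty: float
    mode: str      # "system1" (fast) or "system2" (deliberate)
    output: str


def scale_pipeline(
    problem: str,
    decompose: Callable[[str], List[str]],
    assess: Callable[[str, str], float],
    fast_solve: Callable[[str, str], str],
    slow_solve: Callable[[str, str], str],
    threshold: float = 0.5,  # assumption: a fixed cutoff on a difficulty score
) -> List[StepResult]:
    results: List[StepResult] = []
    context = problem
    # Stage 1: decompose the problem into sequential sub-problems.
    for sub in decompose(problem):
        # Stage 2: assess the difficulty of this sub-problem.
        difficulty = assess(sub, context)
        # Stage 3: assign a processing mode based on difficulty.
        if difficulty >= threshold:
            mode, output = "system2", slow_solve(sub, context)
        else:
            mode, output = "system1", fast_solve(sub, context)
        results.append(StepResult(sub, difficulty, mode, output))
        # Stage 4: propagate this step's output to later sub-problems.
        context = context + "\n" + output
    return results


# Toy stand-ins (hypothetical); real versions would query an LLM.
def toy_decompose(problem: str) -> List[str]:
    return [s.strip() for s in problem.split(";") if s.strip()]


def toy_assess(sub: str, context: str) -> float:
    return 0.9 if "prove" in sub.lower() else 0.1


def toy_fast(sub: str, context: str) -> str:
    return f"[fast] {sub}"


def toy_slow(sub: str, context: str) -> str:
    return f"[slow] {sub}"


steps = scale_pipeline(
    "expand (x+1)^2; prove the bound holds; add the results",
    toy_decompose, toy_assess, toy_fast, toy_slow,
)
modes = [s.mode for s in steps]
```

In this toy run only the middle sub-problem is routed to the slower mode, which is the selective behavior the abstract describes: the expensive path is taken only where the assessed difficulty calls for it.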