Test-time reinforcement learning mitigates the reliance on annotated data by using majority-voting results as pseudo-labels, emerging as a complementary direction to reinforcement learning with verifiable rewards (RLVR) for improving the reasoning ability of large language models (LLMs). However, this voting strategy often induces confirmation bias and suffers from sparse rewards, limiting overall performance. In this work, we propose subgroup-specific step-wise confidence-weighted pseudo-label estimation (SCOPE), a framework that integrates model confidence and dynamic subgroup partitioning to address these issues. Specifically, SCOPE incorporates the proposed step-wise confidence into pseudo-label derivation, prioritizing high-quality reasoning paths over simple frequency counts. Furthermore, it dynamically partitions the pool of candidate outputs into independent subgroups by balancing reasoning quality against exploration diversity. By deriving a local consensus via repeated sampling for each subgroup, SCOPE provides diverse supervision targets that encourage broader exploration. We conduct experiments across various models and benchmarks, and the results show that SCOPE consistently outperforms recent baselines. Notably, SCOPE achieves relative improvements of 13.1\% on the challenging AIME 2025 benchmark and 8.1\% on AMC. The code is released at \href{https://github.com/szu-tera/SCOPE}{https://github.com/szu-tera/SCOPE}.
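To make the two components concrete, the sketch below illustrates confidence-weighted pseudo-label voting and subgroup-level local consensus in Python. The helper names (\texttt{stepwise\_confidence}, \texttt{partition\_into\_subgroups}), the use of mean per-step probability as the confidence signal, and the round-robin partitioning rule are illustrative assumptions for exposition, not the released implementation.

\begin{verbatim}
import math
from collections import defaultdict

def stepwise_confidence(step_logprobs):
    # Path-level confidence proxy: average the per-step probabilities,
    # where each entry is the mean token log-probability of one reasoning
    # step (an assumption made for this sketch).
    return sum(math.exp(lp) for lp in step_logprobs) / len(step_logprobs)

def confidence_weighted_pseudo_label(candidates):
    # Pick a pseudo-label by confidence-weighted voting rather than raw
    # frequency counts. `candidates` is a list of
    # (final_answer, step_logprobs) pairs sampled for one question.
    votes = defaultdict(float)
    for answer, step_logprobs in candidates:
        votes[answer] += stepwise_confidence(step_logprobs)
    return max(votes, key=votes.get)

def partition_into_subgroups(candidates, num_groups):
    # Split the candidate pool into subgroups. Here we sort by confidence
    # and deal candidates round-robin so each subgroup mixes high- and
    # low-confidence paths; the paper's actual criterion may differ.
    ranked = sorted(candidates,
                    key=lambda c: stepwise_confidence(c[1]), reverse=True)
    groups = [[] for _ in range(num_groups)]
    for i, cand in enumerate(ranked):
        groups[i % num_groups].append(cand)
    return groups

def subgroup_pseudo_labels(candidates, num_groups=4):
    # One local-consensus pseudo-label per subgroup, giving several
    # distinct supervision targets instead of a single global vote.
    return [confidence_weighted_pseudo_label(g)
            for g in partition_into_subgroups(candidates, num_groups)
            if g]
\end{verbatim}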