Reinforcement learning with verifiable rewards (RLVR) has proven highly effective at enhancing the reasoning capability of large language models (LLMs). However, this accuracy-oriented learning paradigm often suffers from entropy collapse, which curtails policy exploration and limits reasoning gains. To address this challenge, we propose an efficient reinforcement learning framework that leverages entropy signals at both the semantic and token levels to improve reasoning. From the data perspective, we introduce semantic entropy-guided curriculum learning, ordering training data from low to high semantic entropy so that optimization progresses from easier to more challenging tasks. On the algorithmic side, we treat tokens non-uniformly: we impose KL regularization on low-entropy tokens, which critically affect policy exploration, and apply stronger constraints to the high-covariance portion of these tokens. By jointly optimizing data organization and algorithmic design, our method effectively mitigates entropy collapse and enhances LLM reasoning. Experiments on six benchmarks with base models at three different parameter scales demonstrate that our method outperforms other entropy-based approaches in improving reasoning.
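The two ingredients named above (semantic entropy-guided curriculum ordering and selective KL regularization on low-entropy, high-covariance tokens) can be illustrated with a minimal sketch. This is not the paper's implementation: the helper names (`semantic_entropy`, `curriculum_order`, `selective_kl_penalty`), the per-token entropy proxy, the covariance cutoff, and all thresholds and coefficients are assumptions chosen for illustration.

```python
# Minimal sketch (hypothetical helpers and placeholder hyperparameters):
# (1) order prompts by semantic entropy of their sampled answers,
# (2) apply a token-level KL penalty only to low-entropy tokens, with a larger
#     coefficient on the high-covariance subset of those tokens.
import math
from collections import Counter

import torch


def semantic_entropy(cluster_labels):
    """Entropy over semantic clusters of sampled answers for one prompt.

    `cluster_labels` assigns each sampled completion to a meaning cluster
    (completions judged semantically equivalent share a label)."""
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def curriculum_order(prompts, labels_per_prompt):
    """Sort prompts from low to high semantic entropy (easy -> hard)."""
    scores = [semantic_entropy(labels) for labels in labels_per_prompt]
    return [p for _, p in sorted(zip(scores, prompts), key=lambda x: x[0])]


def selective_kl_penalty(logp_new, logp_ref, advantages,
                         entropy_threshold=0.5, beta_low=0.1, beta_high=0.3):
    """Token-level KL term applied only to low-entropy tokens.

    logp_new, logp_ref: per-token log-probs of the chosen tokens under the
    current and reference policies, shape (T,).
    advantages: per-token advantages, shape (T,), used here to flag a
    "high-covariance" subset that receives the stronger coefficient.
    """
    token_entropy = -logp_new.exp() * logp_new       # rough per-token proxy
    low_entropy = token_entropy < entropy_threshold  # tokens to regularize

    kl = logp_new - logp_ref                         # per-token KL estimate
    # High-covariance subset: low-entropy tokens whose advantage-weighted KL
    # exceeds the median among low-entropy tokens (illustrative criterion).
    score = advantages * kl
    if low_entropy.any():
        cutoff = score[low_entropy].median()
    else:
        cutoff = score.new_tensor(float("inf"))
    high_cov = low_entropy & (score > cutoff)

    beta = torch.where(high_cov,
                       torch.full_like(kl, beta_high),
                       torch.full_like(kl, beta_low))
    return (beta * kl * low_entropy.float()).sum()
```

In this sketch, `curriculum_order` would feed the RLVR training loop with prompts sorted by semantic entropy, and `selective_kl_penalty` would be added to the policy loss so that only the low-entropy tokens (and more strongly, their high-covariance portion) are pulled toward the reference policy.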