Large Language Models (LLMs) have achieved remarkable progress through Reinforcement Learning with Verifiable Rewards (RLVR), yet they still rely heavily on external supervision (e.g., curated labels). Adversarial learning, particularly through self-play, offers a promising alternative that enables models to learn iteratively from themselves, thus reducing reliance on external supervision. Dual-play extends adversarial learning by assigning specialized roles to two models and training them against each other, fostering sustained competition and mutual evolution. Despite this promise, dual-play training has seen limited adoption for LLMs, largely because of their susceptibility to reward hacking and training instability. In this paper, we introduce PasoDoble, a novel dual-play framework for LLMs. PasoDoble adversarially trains two models initialized from the same base model: a Proposer, which generates challenging questions with ground-truth answers, and a Solver, which attempts to solve them. We enrich the Proposer with knowledge from a pre-training dataset to ensure the quality and diversity of its questions. To avoid reward hacking, the Proposer is rewarded only for producing valid questions that push the Solver's limits, the Solver is rewarded for answering them correctly, and both are updated jointly. To further enhance training stability, we introduce an optional offline paradigm that decouples the Proposer and Solver updates, alternately updating each for several steps while holding the other fixed. Notably, PasoDoble operates without supervision during training. Experimental results show that PasoDoble can improve the reasoning performance of LLMs. Our project page is available at https://hcy123902.github.io/PasoDoble.
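To make the training scheme concrete, the following is a minimal Python sketch of one dual-play step under the reward structure described above; the helper names (generate, solve, update), the binary rewards, and the exact-match check are illustrative assumptions for exposition, not the paper's implementation.

```python
# Hypothetical sketch of one PasoDoble dual-play step (not the paper's actual API).
from typing import Callable, List, Tuple

def dual_play_step(
    proposer,                      # model that proposes questions with ground-truth answers
    solver,                        # model that attempts to solve them
    knowledge: List[str],          # pre-training snippets that seed question quality/diversity
    generate: Callable[..., List[Tuple[str, str, bool]]],  # -> (question, answer, is_valid)
    solve: Callable[..., str],
    update: Callable[..., None],   # one RL policy update on a batch of outputs and rewards
) -> None:
    """One joint (online) update of the Proposer and the Solver."""
    proposals = generate(proposer, knowledge)
    proposer_rewards, solver_rewards, answers = [], [], []
    for question, gt_answer, is_valid in proposals:
        answer = solve(solver, question)
        solved = is_valid and answer == gt_answer
        # Solver is rewarded for solving valid questions correctly.
        solver_rewards.append(1.0 if solved else 0.0)
        # Proposer is rewarded only for valid questions that the Solver fails,
        # which discourages reward hacking via invalid or unanswerable questions.
        proposer_rewards.append(1.0 if (is_valid and not solved) else 0.0)
        answers.append(answer)
    # Online paradigm: both models are updated jointly at every step.
    update(proposer, [q for q, _, _ in proposals], proposer_rewards)
    update(solver, answers, solver_rewards)

# Optional offline paradigm (for stability): instead of joint updates, alternately
# run several update steps on one model while holding the other fixed.
```

The offline variant differs only in scheduling: the same rewards are used, but the Proposer and Solver take turns being updated for a fixed number of steps while the other is frozen.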