Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, the on-policy algorithms used for post-training are not naturally robust to the diverse contents of experience replay buffers, which asynchronous off-policy actors can efficiently populate in parallel with training. We propose learning efficiently from such off-policy data via Trajectory Balance with Asynchrony (TBA), an approach to asynchronous RL for LLMs that leverages the principled off-policy trajectory balance (TB) objective. On math, preference-tuning, and automated red-teaming tasks, we post-train models ranging from Pythia 410M to Qwen 2.5 7B, finding that TBA offers speed and performance boosts over strong baselines like Online DPO and Dr. GRPO. Beyond TBA's performance benefits (high accuracy even as asynchrony grows) and speedups ($4\times$ or more), we show that its reward- and recency-prioritizing sampling enables further gains as data generation is scaled. Our code is available at https://github.com/bbartoldson/TBA.
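For context, below is a minimal sketch of the TB objective as it is typically written for autoregressive samplers in the GFlowNet literature; the notation ($\pi_\theta$, $Z_\theta$, $R$) is illustrative and may differ from the exact formulation used in this work.

```latex
% Sketch of the trajectory balance (TB) loss for an autoregressive policy.
% Notation is illustrative: x is a prompt, y = (y_1, ..., y_T) a sampled response,
% \pi_\theta the policy, Z_\theta(x) a learned partition-function estimate, and
% R(x, y) > 0 a reward. The trajectory y may come from any behavior policy with
% full support, which is what makes the objective usable off-policy.
\[
\mathcal{L}_{\mathrm{TB}}(x, y; \theta)
  = \left( \log \frac{Z_\theta(x)\,\prod_{t=1}^{T} \pi_\theta(y_t \mid x, y_{<t})}{R(x, y)} \right)^{2}
\]
```

Because this loss is a squared log-ratio that can be evaluated on any trajectory the policy assigns nonzero probability, it can in principle be minimized on replay-buffer samples produced by stale asynchronous actors, which is the property the abstract refers to as making TB compatible with off-policy, asynchronous data generation.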