Reasoning models that leverage long chains of thought employ various cognitive skills, such as verifying their answers, backtracking, and retrying with an alternate method. Previous work has shown that when a base language model already exhibits these skills, further training with reinforcement learning (RL) can teach it to leverage them. How can we get models to leverage skills that base models do not exhibit? Our work, SkillFactory, is a method for fine-tuning models to learn rough versions of these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model; instead, it uses samples from the model itself, rearranged into training data that follows the format of those skills. These "silver" SFT traces may be imperfect, but they are nevertheless effective for priming a model to acquire the skills during RL. Our evaluation shows that (1) starting from a SkillFactory SFT initialization helps a model generalize to harder variants of a task post-RL, despite lower pre-RL performance; (2) the model indeed uses the cognitive skills; and (3) RL-trained SkillFactory models regress less on out-of-domain tasks than RL-trained base models. Our work suggests that inductive biases acquired prior to RL help models learn robust cognitive skill use.
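To make the data-construction idea concrete, the sketch below shows one way self-generated samples could be rearranged into a "silver" trace that exhibits verification, backtracking, and retrying. It is a minimal illustration under assumed names (`Attempt`, `make_backtracking_trace`) and trace wording, not the paper's actual implementation.

```python
# Hypothetical sketch: splice an incorrect and a correct self-sample from the
# same model into one SFT trace that tries, verifies, backtracks, and retries.
from dataclasses import dataclass


@dataclass
class Attempt:
    reasoning: str    # model-sampled chain of thought
    answer: str       # final answer extracted from the sample
    is_correct: bool  # checked against the gold answer


def make_backtracking_trace(question: str, wrong: Attempt, right: Attempt) -> str:
    """Rearrange two self-samples into a trace in the format of the target skills."""
    assert not wrong.is_correct and right.is_correct
    return (
        f"Question: {question}\n"
        f"{wrong.reasoning}\n"
        f"So the answer might be {wrong.answer}.\n"
        "Wait, let me verify this answer.\n"
        "On checking, this does not hold up. Let me try a different approach.\n"
        f"{right.reasoning}\n"
        f"Therefore, the answer is {right.answer}."
    )


if __name__ == "__main__":
    q = "What is 17 * 24?"
    bad = Attempt("17 * 24 is roughly 17 * 25 = 425, minus a little...", "415", False)
    good = Attempt("17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.", "408", True)
    print(make_backtracking_trace(q, bad, good))
```

Such rearranged traces would then serve as SFT data before RL, rather than distilled outputs from a stronger model.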