Data synthesis for training large reasoning models offers a scalable alternative to limited, human-curated datasets, enabling the creation of high-quality data. However, existing approaches face two key challenges: (i) indiscriminate generation that ignores the solver's ability and yields low-value problems, or reliance on complex data pipelines to balance problem difficulty; and (ii) a lack of reasoning during problem generation, which leads to shallow problem variants. In this paper, we develop a problem generator that reasons explicitly to plan problem directions before synthesis and adapts difficulty to the solver's ability. Specifically, we construct related problem pairs and augment them with intermediate problem-design CoT produced by a reasoning model. These data bootstrap problem-design strategies in the generator. We then treat the solver's feedback on synthetic problems as a reward signal, enabling the generator to calibrate difficulty and produce complementary problems near the edge of the solver's competence. Extensive experiments on 10 mathematical and general reasoning benchmarks show that our method achieves an average improvement of 2.5% and generalizes to both language and vision-language models. Moreover, a solver trained on the synthesized data provides improved rewards for continued generator training, enabling co-evolution and yielding a further 0.7% performance gain. Our code will be made publicly available here.
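The abstract describes rewarding the generator for producing problems near the edge of the solver's competence. The paper does not specify the exact reward function, but one common realization of this idea is to score a synthesized problem by how close the solver's empirical pass rate is to an intermediate target (e.g., 50%), so that trivially easy and unsolvable problems both receive low reward. The sketch below is a minimal, hypothetical illustration of such a difficulty-calibrated reward; the function name, target value, and linear shape are assumptions, not the authors' method.

```python
def difficulty_reward(num_correct: int, num_attempts: int, target: float = 0.5) -> float:
    """Hypothetical reward for a synthesized problem, based on solver feedback.

    The solver attempts the problem `num_attempts` times; the reward peaks
    when the empirical pass rate equals `target` (problem sits at the edge
    of the solver's competence) and decays linearly to 0 for problems the
    solver always or never solves.
    """
    if num_attempts <= 0:
        raise ValueError("num_attempts must be positive")
    pass_rate = num_correct / num_attempts
    return max(0.0, 1.0 - 2.0 * abs(pass_rate - target))


# A problem solved half the time gets maximal reward; always- or
# never-solved problems get none.
print(difficulty_reward(4, 8))  # pass rate 0.50 -> 1.0
print(difficulty_reward(8, 8))  # pass rate 1.00 -> 0.0
print(difficulty_reward(0, 8))  # pass rate 0.00 -> 0.0
```

Such a signal can be combined with a validity check (e.g., the problem must have a verifiable answer) before being used to update the generator with a policy-gradient method.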