Recent work on large language models (LLMs) has increasingly focused on post-training and alignment with datasets curated to enhance instruction following, world knowledge, and specialized skills. However, most post-training datasets used in leading open- and closed-source LLMs remain inaccessible to the public, with limited information about how they were constructed. This lack of transparency has motivated the recent development of open-source post-training corpora. While training on these open alternatives can yield performance comparable to that of leading models, systematic comparisons between them remain largely absent because conducting them rigorously at scale is computationally expensive. As a result, when assessing data quality, it remains unclear how specific samples, task types, or curation strategies influence downstream performance. In this work, we conduct the first comprehensive side-by-side analysis of two prominent open post-training datasets: Tulu-3-SFT-Mix and SmolTalk. Using the Magpie framework, we annotate each sample with detailed quality metrics, including turn structure (single-turn vs. multi-turn), task category, input quality, and response quality, and we derive statistics that reveal the structural and qualitative similarities and differences between the two datasets. Based on these insights, we design a principled curation recipe that produces a new data mixture, TuluTalk, which contains 14% fewer samples than either source dataset while matching or exceeding their performance on key benchmarks. Our findings offer actionable insights for constructing more effective post-training datasets that improve model performance within practical resource limits. To support future research, we publicly release both the annotated source datasets and our curated TuluTalk mixture.
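To make the annotate-then-filter workflow described above concrete, the following is a minimal sketch of that kind of pipeline, not the authors' actual method. The Hugging Face dataset IDs, the "messages" chat field, the 1-5 score scale, and the quality heuristics are assumptions for illustration; in the paper, a Magpie-based LLM annotator assigns the quality labels, which this sketch stubs out with a trivial heuristic.

```python
# Minimal sketch of an annotate-then-filter curation pipeline in the
# spirit of the abstract. Dataset IDs, field names, and scoring logic
# are assumptions; Magpie's LLM-judged scores are stubbed out here.
from datasets import concatenate_datasets, load_dataset

def annotate(example):
    msgs = example["messages"]  # assumed format: list of {"role", "content"}
    user_turns = [m for m in msgs if m["role"] == "user"]
    example["is_multi_turn"] = len(user_turns) > 1
    first_input = user_turns[0]["content"] if user_turns else ""
    # Stand-ins for Magpie's LLM-judged quality scores (1-5 scale assumed):
    # a crude length heuristic for input quality, a constant placeholder
    # where an LLM judge would score the response.
    example["input_quality"] = min(5, 1 + len(first_input) // 200)
    example["response_quality"] = 3  # placeholder for an LLM-judged score
    return example

# Load the two source mixtures (Hugging Face IDs assumed) and keep only
# the shared chat column so the schemas line up for concatenation.
tulu = load_dataset("allenai/tulu-3-sft-mixture", split="train")
smol = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
pool = concatenate_datasets(
    [tulu.select_columns(["messages"]), smol.select_columns(["messages"])]
)

# Annotate every sample, then keep those above a quality floor -- a far
# simpler filter than the principled recipe behind TuluTalk.
curated = pool.map(annotate).filter(
    lambda ex: ex["input_quality"] >= 3 and ex["response_quality"] >= 3
)
print(f"kept {len(curated)} of {len(pool)} samples")
```

In practice, the filtering thresholds and the mixture proportions across task categories would be tuned against downstream benchmarks rather than fixed a priori as in this sketch.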