The rapid advancement of large language models (LLMs) in biomedical applications has highlighted a gap between their potential and the limited scale and often low quality of available open-source annotated textual datasets. In addition, the inherent complexity of the biomedical knowledge hierarchy significantly hampers efforts to bridge this gap. Can LLMs themselves play a pivotal role in overcoming this limitation? Motivated by this question, we investigate this challenge in the present study. We propose a framework that automates the distillation of high-quality textual training data from the extensive scientific literature. Our approach self-evaluates and generates questions that are more closely aligned with the biomedical domain, guided by the biomedical knowledge hierarchy through Medical Subject Headings (MeSH). This comprehensive framework establishes an automated workflow, thereby eliminating the need for manual intervention. Furthermore, we conduct comprehensive experiments to evaluate the impact of our framework-generated data on downstream language models of varying sizes. Our approach substantially improves performance on question-answering tasks compared to models pre-trained on life-sciences corpora and powerful closed-source models represented by GPT-4. Notably, the generated AI-ready dataset enabled the Llama3-70B base model to outperform GPT-4 with MedPrompt, despite GPT-4 having many times more parameters. Detailed case studies and ablation experiments underscore the significance of each component within our framework.