The rapid advancement of language models has demonstrated the potential of artificial intelligence in the healthcare industry. However, small language models struggle with specialized domains in low-resource languages such as Persian. Although numerous Persian-language medical websites exist, no curated dataset or corpus has previously been available, making ours the first of its kind. This study introduces a newly curated dataset comprising 20k doctor-patient Q\&A pairs together with 60\% of a 90-million-token corpus crawled from medical magazines. Using a parameter-efficient fine-tuning approach, we enhanced the medical knowledge of the baseline model, aya-expanse-8b. Benchmark evaluations show that the fine-tuned model achieves higher accuracy on medical question answering and passes the September 2023 Iranian Basic Medical Science Entrance Exam (IBSEE), which the baseline model fails. The fine-tuned model also improves accuracy on a Persian-translated MMLU by an average of 2.67\%. This work highlights the potential of leveraging open-access online data to enrich small language models in the medical domain, offering a practical solution for Persian medical AI applications in resource-constrained environments. Future research could explore multimodal input to further improve performance.