Generative Artificial Intelligence in Bioinformatics: A Systematic Review of Models, Applications, and Methodological Advances

Riasad Alvi, Sayeem Been Zaman, Wasimul Karim, Arefin Ittesafun Abian, Mohaimenul Azam Khan Raiaan, Saddam Mukta, Md Rafi Ur Rashid, Md Rafiqul Islam, Yakub Sebastian, Sami Azam

Generative artificial intelligence (GenAI) has become a transformative approach in bioinformatics, enabling advances in genomics, proteomics, transcriptomics, structural biology, and drug discovery. To systematically identify and evaluate these developments, this review poses six research questions (RQs), following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. The objective is to evaluate impactful GenAI strategies in terms of methodological advancement, predictive performance, and specialization, and to identify promising approaches for advanced modeling, data-intensive discovery, and integrative biological analysis. RQ1 highlights diverse applications across multiple bioinformatics subfields (sequence analysis, molecular design, and integrative data modeling), which outperform traditional methods through pattern recognition and output generation. RQ2 reveals that specialized, domain-adapted model architectures outperform general-purpose models, an advantage attributed to targeted pretraining and context-aware strategies. RQ3 identifies significant benefits across bioinformatics domains, particularly in molecular analysis and data integration, where GenAI improves accuracy and reduces errors in complex analyses. RQ4 indicates improvements in structural modeling, functional prediction, and synthetic data generation, validated on established benchmarks. RQ5 identifies the main constraints, such as limited scalability and data biases that impair generalizability, and proposes future directions focused on robust evaluation and biologically grounded modeling. RQ6 finds that molecular datasets (such as UniProtKB and ProteinNet12), cellular datasets (such as CELLxGENE and GTEx), and textual resources (such as PubMedQA and OMIM) broadly support the training and generalization of GenAI models.
