Antibodies are vital proteins that provide robust protection for the human body against pathogens. Pre-trained language models, both general-protein and antibody-specific, have facilitated antibody prediction tasks. However, few studies have comprehensively explored the representation capability of different pre-trained language models across antibody tasks. To investigate this problem, we aim to answer several key questions in this paper, such as how pre-trained language models perform on antibody tasks of varying specificity and how introducing specific biological mechanisms into the pre-training process can benefit the model. Additionally, we evaluate whether the learned antibody pre-trained representations can be applied to real-world antibody problems, such as drug discovery and understanding of the immune process. Previously, the absence of a benchmark largely hindered the study of these questions. To aid our investigation, we provide an AnTibody Understanding Evaluation (ATUE) benchmark. We comprehensively evaluate the performance of protein pre-trained language models through an empirical study, along with conclusions and new insights. Our ATUE and code are released at https://github.com/dqwang122/EATLM.