In this study, we tackle the challenging task of predicting secondary structures from protein primary sequences, a pivotal first step toward tertiary structure prediction that also yields crucial insights into protein activity, relationships, and functions. Existing methods often exploit large sets of unlabeled amino acid sequences; however, they neither explicitly capture nor harness the available protein 3D structural data, which is recognized as a decisive factor in determining protein function. To address this, we construct protein residue graphs and introduce several forms of sequential and structural connections to capture richer spatial information. We combine Graph Neural Networks (GNNs) and Language Models (LMs): a pre-trained transformer-based protein language model encodes the amino acid sequences, while message-passing mechanisms such as GCN and R-GCN capture the geometric characteristics of the protein structures. By convolving over each node's local, relation-aware neighborhood and stacking multiple convolutional layers, the model efficiently learns combined representations from the protein's spatial graph, revealing intricate interconnections and dependencies in its structural arrangement. To assess our model's performance, we use the training dataset provided by NetSurfP-2.0, which annotates secondary structure in 3 and 8 states. Extensive experiments show that our proposed model, SSRGNet, surpasses the baselines on F1-scores.
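The core idea, relation-aware message passing over a residue graph whose node features come from a pre-trained language model, can be illustrated with a minimal sketch. This is not the paper's implementation: the NumPy `rgcn_layer` below is a simplified R-GCN-style layer, the random node features stand in for transformer LM embeddings, and the two toy relations (sequential chain edges and one spatial contact) are hypothetical.

```python
import numpy as np

def rgcn_layer(H, adjs, weights, W_self):
    """One relational graph convolution (R-GCN-style) layer.

    H: (n, d) node features (stand-in for per-residue LM embeddings)
    adjs: list of (n, n) adjacency matrices, one per relation type
          (e.g. sequential neighbours vs. spatial contacts)
    weights: list of (d, d_out) transforms, one per relation type
    W_self: (d, d_out) self-loop transform
    """
    out = H @ W_self
    for A, W in zip(adjs, weights):
        deg = A.sum(axis=1, keepdims=True)
        deg[deg == 0] = 1.0          # avoid division by zero for isolated nodes
        out += (A / deg) @ H @ W     # mean-aggregate neighbours, per relation
    return np.maximum(out, 0.0)      # ReLU non-linearity

# Toy example: 5 residues, 8-dim features, two relation types.
rng = np.random.default_rng(0)
n, d, d_out = 5, 8, 4
H = rng.normal(size=(n, d))                    # hypothetical LM embeddings
A_seq = np.eye(n, k=1) + np.eye(n, k=-1)       # sequential (chain) edges
A_spa = np.zeros((n, n))
A_spa[0, 4] = A_spa[4, 0] = 1.0                # one hypothetical spatial contact
Ws = [rng.normal(size=(d, d_out)) for _ in range(2)]
W_self = rng.normal(size=(d, d_out))

H1 = rgcn_layer(H, [A_seq, A_spa], Ws, W_self)
print(H1.shape)  # (5, 4)
```

Stacking several such layers, as the abstract describes, lets each residue's representation absorb information from progressively larger sequential and spatial neighborhoods before a per-residue classifier predicts the 3- or 8-state secondary structure label.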