Proteins are the essential drivers of biological processes. At the molecular level, they are chains of amino acids, and they can be viewed through a linguistic lens: the twenty standard residues serve as an alphabet whose combinations form a complex language, often called the language of life. To understand this language, we must first identify its fundamental units. Analogous to words, these units are hypothesized to form an intermediate layer between single residues and larger domains. Crucially, because protein diversity arises from evolution, these units should inherently reflect evolutionary relationships. We introduce PUMA (Protein Units via Mutation-Aware Merging) to discover such evolutionarily meaningful units. PUMA employs an iterative merging algorithm guided by substitution matrices to identify protein units and organize them into families linked by plausible mutations. The result is a hierarchical genealogy in which parent units and their mutational variants coexist, so the method yields both a unit vocabulary and the genealogical structure connecting its entries. We validate that PUMA families are biologically meaningful: mutations within a PUMA family correlate with clinically benign variants and with high-scoring mutations in high-throughput assays. Furthermore, the units align with the contextual preferences of protein language models and map to known functional annotations. PUMA's genealogical framework thus provides evolutionarily grounded units and a structured approach to understanding the language of life.
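To make the idea of mutation-aware merging concrete, the following is a minimal sketch and not the authors' implementation. It assumes a BPE-style loop in which the most frequent adjacent pair becomes a parent unit, and any other pair whose string differs from the parent at exactly one position by a well-scoring substitution joins the parent's family. The names `TOY_SCORES`, `is_plausible_variant`, `merge_pair`, and `mutation_aware_merge`, the score table, and the `min_score` threshold are all hypothetical; a real run would presumably use a full substitution matrix.

```python
from collections import Counter

# Toy symmetric substitution scores; a hypothetical stand-in for a
# real substitution matrix. Unlisted residue pairs score -1.
TOY_SCORES = {
    frozenset("IL"): 2, frozenset("IV"): 3, frozenset("LV"): 1,
    frozenset("KR"): 2, frozenset("DE"): 2, frozenset("ST"): 1,
}


def is_plausible_variant(parent, candidate, min_score=1):
    """True if candidate differs from parent at exactly one position
    and that single substitution scores at least min_score."""
    if len(parent) != len(candidate):
        return False
    diffs = [(a, b) for a, b in zip(parent, candidate) if a != b]
    if len(diffs) != 1:
        return False
    return TOY_SCORES.get(frozenset(diffs[0]), -1) >= min_score


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of pair with one token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def mutation_aware_merge(sequences, num_merges=1, min_score=1):
    """Return (families, corpus): families maps each parent unit to its
    variant units; corpus holds the re-tokenized sequences."""
    corpus = [list(seq) for seq in sequences]
    families = {}
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # The most frequent adjacent pair becomes the parent unit.
        parent_pair, _ = pair_counts.most_common(1)[0]
        parent = "".join(parent_pair)
        # Other observed pairs one plausible substitution away from the
        # parent are treated as its mutational variants.
        variant_pairs = [
            p for p in pair_counts
            if p != parent_pair
            and is_plausible_variant(parent, "".join(p), min_score)
        ]
        variants = families.setdefault(parent, [])
        for p in variant_pairs:
            unit = "".join(p)
            if unit not in variants:
                variants.append(unit)
        # Merge the parent first, then each variant, in every sequence.
        for pair in [parent_pair] + variant_pairs:
            corpus = [merge_pair(tokens, pair) for tokens in corpus]
    return families, corpus


if __name__ == "__main__":
    fams, toks = mutation_aware_merge(["MKLVKLA", "GKLVKIA", "KLKIKL"])
    print(fams)
    print(toks)
```

In this sketch the parent and its variants enter the vocabulary in the same step, which mirrors the abstract's description of parent units and mutational variants coexisting. How PUMA actually scores candidates, breaks ties, or handles variants of longer units is not specified in the text above and is left open here.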