Terpene synthases (TPS) are a key family of enzymes responsible for generating the diverse terpene scaffolds that underpin many natural products, including front-line anticancer drugs such as Taxol. However, de novo TPS design through directed evolution is costly and slow. We introduce TpsGPT, a generative model for scalable TPS protein design, built by fine-tuning the protein language model ProtGPT2 on 79k TPS sequences mined from UniProt. TpsGPT generated de novo enzyme candidates in silico, which we evaluated using multiple validation metrics: EnzymeExplorer classification, ESMFold structural confidence (pLDDT), sequence diversity, CLEAN classification, InterPro domain detection, and Foldseek structure alignment. From an initial pool of 28k generated sequences, we identified seven putative TPS enzymes that satisfied all validation criteria. Experimental validation confirmed TPS enzymatic activity in at least two of these sequences. Our results show that fine-tuning a protein language model on a carefully curated, enzyme-class-specific dataset, combined with rigorous filtering, can enable the de novo generation of functional, evolutionarily distant enzymes.
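The "satisfied all validation criteria" step can be read as a conjunctive filter: a generated sequence is kept only if every metric passes. The sketch below illustrates that idea only. The `Candidate` fields, the `Thresholds` defaults, and the use of maximum identity to known sequences as the diversity measure are all hypothetical assumptions chosen for illustration; they are not values or interfaces reported in the abstract, and the scores are assumed to have been computed beforehand by the respective tools.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Candidate:
    """Precomputed scores for one generated sequence (field names are hypothetical)."""
    name: str
    tps_score: float       # assumed EnzymeExplorer TPS probability, 0..1
    plddt: float           # ESMFold mean pLDDT, 0..100
    max_identity: float    # assumed diversity proxy: max identity to known TPS, 0..1
    clean_is_tps: bool     # assumed: CLEAN assigns a TPS-like EC number
    has_tps_domain: bool   # assumed: InterPro reports a TPS domain
    foldseek_tm: float     # assumed: best Foldseek TM-score to a known TPS structure


@dataclass(frozen=True)
class Thresholds:
    """Illustrative cutoffs only; not taken from the paper."""
    min_tps_score: float = 0.5
    min_plddt: float = 70.0
    max_identity: float = 0.6
    min_foldseek_tm: float = 0.5


def passes_all_filters(c: Candidate, t: Thresholds = Thresholds()) -> bool:
    """Return True only if the candidate satisfies every criterion."""
    return (
        c.tps_score >= t.min_tps_score
        and c.plddt >= t.min_plddt
        and c.max_identity <= t.max_identity   # keep evolutionarily distant sequences
        and c.clean_is_tps
        and c.has_tps_domain
        and c.foldseek_tm >= t.min_foldseek_tm
    )


def filter_candidates(
    pool: Iterable[Candidate], t: Thresholds = Thresholds()
) -> List[Candidate]:
    """Keep the candidates that pass all filters, preserving input order."""
    return [c for c in pool if passes_all_filters(c, t)]


if __name__ == "__main__":
    pool = [
        Candidate("cand_1", 0.92, 84.0, 0.41, True, True, 0.71),
        Candidate("cand_2", 0.92, 61.0, 0.41, True, True, 0.71),   # low pLDDT
        Candidate("cand_3", 0.92, 84.0, 0.95, True, True, 0.71),   # too similar
    ]
    print([c.name for c in filter_candidates(pool)])
```

Requiring every criterion to hold, rather than ranking on a combined score, trades recall for precision, which is consistent with a 28k-sequence pool narrowing to a handful of candidates for experimental testing.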