Sparse generative modeling of protein-sequence families

Pairwise Potts models (PM) provide accurate statistical models of families of evolutionarily related protein sequences. Their parameters are the local fields, which describe site-specific patterns of amino-acid conservation, and the two-site couplings, which mirror the coevolution between pairs of distinct sites. This coevolution reflects structural and functional constraints acting on protein sequences during evolution, and couplings can a priori connect any pair of sites, even sites that are distant along the protein chain or in the three-dimensional protein fold. The most conservative way to capture all of the coevolution signal is to include all possible two-site couplings in the PM. This choice, typically made in what is known as Direct Coupling Analysis, has been highly successful in using sequences to predict residue contacts in the three-dimensional structure and mutational effects, and to generate new functional sequences. However, the resulting PM suffers from substantial over-fitting: many couplings are small, noisy and hardly interpretable, and the PM is close to a critical point, meaning that it is highly sensitive to small parameter perturbations. In this work, we introduce a parameter-reduction procedure based on iterative decimation of the least statistically significant couplings. We propose an information-based criterion that identifies couplings that are either weak or statistically unsupported. We show that our procedure allows one to remove more than 90% of the PM couplings while preserving the predictive and generative properties of the original dense PM. The resulting model is far from criticality, meaning that it is more robust to noise, and its couplings are more easily interpretable.
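To make the objects in the abstract concrete, the following is a minimal sketch of a sparse pairwise Potts model and a single decimation step. It rests on several assumptions that are not taken from the paper: the names `potts_energy`, `coupling_strength` and `decimate` are hypothetical, couplings are stored as a dictionary keyed by site pairs, and the ranking score is the Frobenius norm of each coupling block, used here only as a stand-in for the information-based criterion, which the abstract does not specify. An iterative procedure would presumably re-fit the remaining parameters between decimation steps; that re-fitting is not shown.

```python
def potts_energy(seq, fields, couplings):
    """Energy E(a) = -sum_i h_i(a_i) - sum_{(i,j)} J_ij(a_i, a_j).

    seq       : list of integer-encoded residues, one per site
    fields    : fields[i][a] is the local field h_i(a)
    couplings : dict mapping a site pair (i, j), i < j, to a q x q block J_ij;
                pairs absent from the dict contribute nothing (sparse model)
    """
    energy = -sum(fields[i][a] for i, a in enumerate(seq))
    for (i, j), block in couplings.items():
        energy -= block[seq[i]][seq[j]]
    return energy


def coupling_strength(block):
    """Frobenius norm of one coupling block (assumed stand-in score)."""
    return sum(x * x for row in block for x in row) ** 0.5


def decimate(couplings, fraction):
    """Return a copy of `couplings` without the lowest-scoring pairs.

    `fraction` is the share of currently present pairs to remove in this
    step; the number removed is rounded down.
    """
    ranked = sorted(couplings, key=lambda pair: coupling_strength(couplings[pair]))
    removed = set(ranked[: int(fraction * len(ranked))])
    return {pair: block for pair, block in couplings.items() if pair not in removed}
```

Calling `decimate` repeatedly with a small `fraction` would mimic the iterative character of the procedure, and `potts_energy` can be evaluated on the same sequence before and after each step to see how much the model has changed.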
