In recent years, diffusion models trained on equilibrium molecular distributions have proven effective for sampling biomolecules. Beyond direct sampling, the score of such a model can also be used to derive the forces that act on molecular systems. However, while classical diffusion sampling usually recovers the training distribution, the corresponding energy-based interpretation of the learned score is often inconsistent with this distribution, even for low-dimensional toy systems. We trace this inconsistency to inaccuracies of the learned score at very small diffusion timesteps, where the model must capture the correct evolution of the data distribution. In this regime, diffusion models fail to satisfy the Fokker--Planck equation, which governs the evolution of the score. We interpret this deviation as one source of the observed inconsistencies and propose an energy-based diffusion model with a Fokker--Planck-derived regularization term to enforce consistency. We demonstrate our approach by sampling and simulating multiple biomolecular systems, including fast-folding proteins, and by introducing a state-of-the-art transferable Boltzmann emulator for dipeptides that supports simulation and achieves improved consistency and efficient sampling. Our code, model weights, and self-contained JAX and PyTorch notebooks are available at https://github.com/noegroup/ScoreMD.
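To make the consistency requirement concrete, here is a minimal sketch on a toy system. It is not the implementation from the repository. It assumes 1D Gaussian data and a drift-free forward SDE $dx = g\,dW$, for which the log-density must satisfy $\partial_t \log p_t = \frac{g^2}{2}\left(\partial_x^2 \log p_t + (\partial_x \log p_t)^2\right)$. The names `energy`, `score`, `fp_residual`, and `wrong_energy`, and the constants `SIGMA0` and `G`, are hypothetical and chosen only for illustration. The residual of this equation is the kind of quantity a Fokker--Planck-derived regularization term could penalize.

```python
import math

# Assumed toy setup (not from the paper): 1D Gaussian data N(0, SIGMA0**2)
# noised by the drift-free SDE dx = G dW, so p_t = N(0, SIGMA0**2 + G**2 * t).
SIGMA0 = 0.5
G = 1.0


def energy(x, t):
    """Time-dependent energy U(x, t) = -log p_t(x) of the toy system."""
    var = SIGMA0**2 + G**2 * t
    return 0.5 * x * x / var + 0.5 * math.log(2.0 * math.pi * var)


def score(x, t, h=1e-4):
    """Score s = -dU/dx, approximated by a central finite difference."""
    return -(energy(x + h, t) - energy(x - h, t)) / (2.0 * h)


def fp_residual(energy_fn, x, t, h=1e-3):
    """Residual of the log-density Fokker-Planck equation for zero drift.

    With log p = -U, the equation reads
        d/dt log p = (G**2 / 2) * (d2/dx2 log p + (d/dx log p)**2),
    so a consistent energy gives a residual that is numerically close to zero.
    """
    def logp(y, s):
        return -energy_fn(y, s)

    dt = (logp(x, t + h) - logp(x, t - h)) / (2.0 * h)
    dx = (logp(x + h, t) - logp(x - h, t)) / (2.0 * h)
    dxx = (logp(x + h, t) - 2.0 * logp(x, t) + logp(x - h, t)) / (h * h)
    return dt - 0.5 * G**2 * (dxx + dx * dx)


def wrong_energy(x, t):
    """Deliberately inconsistent energy: the variance grows too slowly in t."""
    var = SIGMA0**2 + 0.5 * G**2 * t
    return 0.5 * x * x / var + 0.5 * math.log(2.0 * math.pi * var)


if __name__ == "__main__":
    print("score at t=0:", score(0.7, 0.0))
    print("residual, consistent energy:", fp_residual(energy, 0.7, 0.1))
    print("residual, inconsistent energy:", fp_residual(wrong_energy, 0.7, 0.1))
```

At $t = 0$ the score of a Boltzmann-distributed system is proportional to the physical force, which is why a score that is accurate at small timesteps matters for simulation. In this sketch the exact energy leaves only finite-difference error in the residual, whereas the perturbed energy violates the equation by a clearly nonzero amount. A learned model would replace `energy` with a neural network and the finite differences with automatic differentiation.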