Sparse autoencoders (SAEs) are a mechanistic interpretability technique that has been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate an autoregressive antibody language model, p-IgGen, and steer its generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature-concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs are sufficient for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.
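To make the TopK SAE idea concrete, below is a minimal PyTorch sketch of a sparse autoencoder that keeps only the k largest latent activations per token, plus a toy illustration of steering by adding a decoder direction back into a hidden state. All names, dimensions, and the steering recipe are illustrative assumptions; they are not taken from the p-IgGen paper or its released code.

```python
# Minimal TopK sparse autoencoder sketch (hypothetical architecture and sizes;
# not the paper's actual implementation).
import torch
import torch.nn as nn


class TopKSAE(nn.Module):
    """Sparse autoencoder that keeps only the k largest latent activations."""

    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Dense pre-activations, then zero out everything outside the top-k.
        z = torch.relu(self.encoder(x))
        topk = torch.topk(z, self.k, dim=-1)
        sparse = torch.zeros_like(z)
        sparse.scatter_(-1, topk.indices, topk.values)
        return sparse

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        z = self.encode(x)
        x_hat = self.decoder(z)  # reconstruct the language-model activation
        return x_hat, z


# Toy usage: reconstruct stand-in hidden states from an antibody language model.
sae = TopKSAE(d_model=512, d_latent=8192, k=32)
hidden = torch.randn(4, 512)            # placeholder for real activations
recon, latents = sae(hidden)
loss = torch.nn.functional.mse_loss(recon, hidden)

# Steering sketch (assumed recipe): add a scaled decoder direction for one
# latent feature back into the hidden state before continuing generation.
steered = hidden + 5.0 * sae.decoder.weight[:, 123]
```

The steering line illustrates the general "activation addition" approach often used with SAE features; the specific features, layers, and scaling used in the paper are not specified here.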


