As AI models achieve remarkable capabilities across diverse domains, understanding what representations they learn and how they process information has become increasingly important for both scientific progress and trustworthy deployment. Recent work in mechanistic interpretability has shown that neural networks represent meaningful concepts as directions in their representation spaces and often encode many concepts in superposition. Various sparse dictionary learning (SDL) methods, including sparse autoencoders, transcoders, and crosscoders, address this by training auxiliary models with sparsity constraints to disentangle these superposed concepts into interpretable features. These methods have achieved notable empirical success, but their theoretical understanding remains limited. Existing theoretical work covers only sparse autoencoders with tied-weight constraints, leaving the broader family of SDL methods without formal grounding. In this work, we develop the first unified theoretical framework that casts SDL as a single optimization problem. We demonstrate how diverse methods instantiate this framework and provide a rigorous analysis of the optimization landscape. We give the first theoretical explanations for several empirically observed phenomena, including feature absorption, dead neurons, and the effectiveness of neuron resampling. Finally, we design controlled experiments to validate our theoretical results.
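To make the shared objective concrete, the following is a minimal sketch of the form such a unified SDL problem commonly takes (assuming, for illustration, a one-layer encoder $f_\theta$, a decoder dictionary $D$, and an $\ell_1$ sparsity penalty with weight $\lambda$; the framework developed in this work may differ in its exact formulation):
\[
\min_{\theta,\, D} \; \mathbb{E}_{x}\!\left[ \,\bigl\| y(x) - D f_\theta(x) \bigr\|_2^2 \; + \; \lambda \bigl\| f_\theta(x) \bigr\|_1 \right],
\qquad f_\theta(x) = \mathrm{ReLU}(W_{\mathrm{enc}} x + b_{\mathrm{enc}}),
\]
where the choice of reconstruction target $y(x)$ distinguishes the methods: a sparse autoencoder sets $y(x) = x$ (reconstructing the input activations), a transcoder sets $y(x)$ to the activations of a downstream component, and a crosscoder reconstructs activations drawn from multiple layers or models with a shared dictionary.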