There is growing interest in leveraging mechanistic interpretability and controllability to better understand and influence the internal dynamics of large language models (LLMs). However, current methods face fundamental challenges in reliably localizing and manipulating feature representations. Sparse Autoencoders (SAEs) have recently emerged as a promising direction for feature extraction at scale, yet they, too, are limited by incomplete feature isolation and unreliable monosemanticity. To systematically quantify these limitations, we introduce the Feature Monosemanticity Score (FMS), a novel metric that measures feature monosemanticity in latent representations. Building on these insights, we propose Guided Sparse Autoencoders (G-SAE), a method that conditions latent representations on labeled concepts during training. We demonstrate that reliable localization and disentanglement of target concepts within the latent space improve interpretability, behavior detection, and control. Specifically, our evaluations on toxicity detection, writing style identification, and privacy attribute recognition show that G-SAE not only enhances monosemanticity but also enables more effective and fine-grained steering with less quality degradation. Our findings provide actionable guidelines for measuring and advancing mechanistic interpretability and control of LLMs.
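To make the conditioning idea concrete, the sketch below illustrates one plausible form it could take: a standard SAE objective (reconstruction plus sparsity) augmented with a supervised term that ties a small set of designated latent units to concept labels. The class name `GuidedSAE`, the BCE guidance term, and the loss weights are illustrative assumptions for this sketch, not the implementation described above.

```python
# Minimal sketch (assumed details): an SAE whose first n_concepts latents are
# supervised to align with labeled concepts (e.g., toxicity) during training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedSAE(nn.Module):
    def __init__(self, d_model: int, d_latent: int, n_concepts: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)
        self.n_concepts = n_concepts  # first n_concepts latents are concept-aligned

    def forward(self, x: torch.Tensor):
        z = F.relu(self.encoder(x))   # sparse latent code
        x_hat = self.decoder(z)       # reconstruction of the LLM activation
        return z, x_hat

def training_step(model: GuidedSAE, x, concept_labels, l1_coef=1e-3, guide_coef=1.0):
    """x: LLM activations [batch, d_model]; concept_labels: [batch, n_concepts] in {0, 1}."""
    z, x_hat = model(x)
    recon_loss = F.mse_loss(x_hat, x)      # standard SAE reconstruction loss
    sparsity_loss = z.abs().mean()         # L1 sparsity penalty
    # Guidance term (assumed): push designated latents to fire iff the labeled
    # concept is present, localizing each concept to a known latent unit.
    concept_logits = z[:, : model.n_concepts]
    guide_loss = F.binary_cross_entropy_with_logits(concept_logits, concept_labels.float())
    return recon_loss + l1_coef * sparsity_loss + guide_coef * guide_loss
```

Under this reading, steering would amount to clamping or scaling the concept-aligned latents before decoding, which is what makes localization and disentanglement directly useful for control.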