Audio pretrained models are widely employed to solve various tasks in speech processing, sound event detection, and music information retrieval. However, the representations learned by these models remain unclear, and their analysis is mainly restricted to linear probing of the hidden representations. In this work, we explore the use of Sparse Autoencoders (SAEs) to analyze the hidden representations of pretrained models, focusing on a case study in singing technique classification. We first demonstrate that SAEs retain both information about the original representations and class labels, enabling their internal structure to provide insights into self-supervised learning systems. Furthermore, we show that SAEs enhance the disentanglement of vocal attributes, establishing them as an effective tool for identifying the underlying factors encoded in the representations.
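The Sparse Autoencoder mentioned above can be illustrated with a minimal sketch: a linear encoder with a ReLU non-linearity produces a sparse, non-negative code from a hidden representation, and a linear decoder reconstructs the input. The dimensions, initialization, and L1 penalty weight below are hypothetical placeholders, not values from this work; in practice the parameters would be trained on representations extracted from the pretrained audio model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: hidden dim of the pretrained model, and the
# (typically overcomplete) SAE dictionary size.
d_model, d_sae = 16, 64

# Randomly initialized SAE parameters (in practice these are trained).
W_enc = rng.normal(0.0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0.0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode a hidden vector into a sparse code, then reconstruct it."""
    z = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU keeps the code non-negative and sparse
    x_hat = z @ W_dec + b_dec               # linear decoder reconstructs the input
    return z, x_hat

# A stand-in for one hidden representation from the pretrained model.
x = rng.normal(size=d_model)
z, x_hat = sae_forward(x)

# A typical SAE training objective: reconstruction error plus an L1
# sparsity penalty on the code (coefficient 1e-3 is an arbitrary choice).
recon = ((x - x_hat) ** 2).mean()
l1 = np.abs(z).sum()
loss = recon + 1e-3 * l1
print(f"active units: {int((z > 0).sum())}/{d_sae}")
```

Individual units of the sparse code `z` are the objects analyzed for interpretability: because only a few fire on any given input, each active unit can be inspected for an association with an underlying factor such as a vocal attribute or class label.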