Audio classification is an important task that maps audio samples to their corresponding labels. Recently, transformer models with self-attention mechanisms have been adopted in this field. However, existing audio transformers require large GPU memory and long training time, while relying on pretrained vision models to achieve high performance, which limits their scalability on audio tasks. To address these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure that reduces the model size and training time. It is further combined with a token-semantic module that maps the final outputs into class featuremaps, enabling the model to perform audio event detection (i.e., localization in time). We evaluate HTS-AT on three audio classification datasets, where it achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50, and equals the SOTA on Speech Command V2. It also achieves better event localization performance than previous CNN-based models. Moreover, HTS-AT requires only 35% of the parameters and 15% of the training time of the previous audio transformer. These results demonstrate the high performance and high efficiency of HTS-AT.
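To make the idea of a token-semantic module more concrete, the sketch below illustrates one way final transformer token features could be mapped to per-class, time-wise featuremaps that serve both clip-level classification and event localization. The module name, kernel size, dimensions, and pooling choice are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TokenSemanticHead(nn.Module):
    """Hypothetical sketch: map final token features (B, T, D) to
    framewise class logits (B, T, C), so clip-level classification and
    time localization share one head."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        # 1D convolution along the token (time) axis produces a class
        # "featuremap" per token; kernel size 3 is an assumption.
        self.conv = nn.Conv1d(dim, num_classes, kernel_size=3, padding=1)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, time_tokens, dim)
        framewise = self.conv(tokens.transpose(1, 2)).transpose(1, 2)
        clip_logits = framewise.mean(dim=1)  # pool over time for the clip label
        return clip_logits, framewise

# Usage: 256-dim tokens from a (hypothetical) hierarchical encoder,
# 527 classes as in AudioSet.
head = TokenSemanticHead(dim=256, num_classes=527)
clip, frames = head(torch.randn(2, 64, 256))
print(clip.shape, frames.shape)  # torch.Size([2, 527]) torch.Size([2, 64, 527])
```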