Audio classification is an important task that maps audio samples to their corresponding labels. Recently, transformer models with self-attention mechanisms have been adopted in this field. However, existing audio transformers require large GPU memory and long training time, while relying on pretrained vision models to achieve high performance, which limits their scalability on audio tasks. To address these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure that reduces the model size and training time. It is further combined with a token-semantic module that maps the final outputs into class featuremaps, enabling the model to perform audio event detection (i.e., localization in time). We evaluate HTS-AT on three audio classification datasets, where it achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50, and equals the SOTA on Speech Command V2. It also achieves better event localization performance than previous CNN-based models. Moreover, HTS-AT requires only 35% of the parameters and 15% of the training time of the previous audio transformer. These results demonstrate the high performance and high efficiency of HTS-AT.
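To make the idea of a token-semantic module more concrete, the sketch below illustrates one way final transformer token features could be mapped to per-class, time-wise featuremaps that serve both clip-level classification and event localization. The module name, kernel size, dimensions, and pooling choice are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TokenSemanticHead(nn.Module):
    """Hypothetical sketch: map final token features (B, T, D) to
    framewise class logits (B, T, C), so clip-level classification and
    time localization share one head."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        # 1D convolution along the token (time) axis produces a class
        # "featuremap" per token; kernel size 3 is an assumption.
        self.conv = nn.Conv1d(dim, num_classes, kernel_size=3, padding=1)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, time_tokens, dim)
        framewise = self.conv(tokens.transpose(1, 2)).transpose(1, 2)
        clip_logits = framewise.mean(dim=1)  # pool over time for the clip label
        return clip_logits, framewise

# Usage: 256-dim tokens from a (hypothetical) hierarchical encoder,
# 527 classes as in AudioSet.
head = TokenSemanticHead(dim=256, num_classes=527)
clip, frames = head(torch.randn(2, 64, 256))
print(clip.shape, frames.shape)  # torch.Size([2, 527]) torch.Size([2, 64, 527])
```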