Recent foundation models such as SSAST, EAT, HuBERT, Qwen-Audio, and Audio Flamingo achieve top-tier results across standard audio benchmarks, but their fixed input sample rates and durations limit their reusability. This paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT), a framework trained from scratch that eliminates the dependency on pre-trained weights while supporting arbitrary sample rates and audio lengths. AMAuT integrates four key components: (1) augmentation-driven multiview learning for robustness, (2) a conv1 + conv7 + conv1 one-dimensional CNN bottleneck for stable temporal encoding, (3) dual CLS + TAL tokens for bidirectional context representation, and (4) test-time adaptation/augmentation (TTA^2) to improve inference reliability. Experiments on five public benchmarks, AudioMNIST, SpeechCommands V1 and V2, VocalSound, and CochlScene, show that AMAuT reaches accuracies of up to 99.8% while consuming less than 3% of the GPU hours required by comparable pre-trained models. AMAuT thus offers a highly efficient and flexible alternative to large pre-trained models, making state-of-the-art audio classification accessible in computationally constrained settings.
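To make the bottleneck description concrete, the following is a minimal PyTorch sketch of a conv1 + conv7 + conv1 one-dimensional block operating on a variable-length time axis; the channel widths, normalization layers, and activations are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Conv1d171Bottleneck(nn.Module):
    """Hypothetical sketch of a conv1 + conv7 + conv1 one-dimensional CNN
    bottleneck as named in the abstract; hidden width, norms, and activations
    are assumptions for illustration."""
    def __init__(self, in_channels: int, hidden: int, out_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size=1),        # point-wise projection
            nn.BatchNorm1d(hidden),
            nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=7, padding=3),  # local temporal context, length-preserving
            nn.BatchNorm1d(hidden),
            nn.GELU(),
            nn.Conv1d(hidden, out_channels, kernel_size=1),       # point-wise expansion to embedding width
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_channels, time) -> (batch, out_channels, time)
        return self.net(x)


if __name__ == "__main__":
    # The time axis is unconstrained, so any sample rate or clip length can be encoded.
    features = torch.randn(2, 64, 501)            # (batch, feature bins, frames)
    bottleneck = Conv1d171Bottleneck(64, 256, 768)
    tokens = bottleneck(features).transpose(1, 2)  # (batch, frames, embed dim) for a transformer
    print(tokens.shape)                            # torch.Size([2, 501, 768])
```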