AMAuT: A Flexible and Efficient Multiview Audio Transformer Framework Trained from Scratch

From arXiv. Update notes: 1. CLS + TAL uses the distillation token from DeiT rather than an alternative class token; the content was adjusted to clarify this. 2. Figure 4 presented panels (a) and (b) in the wrong order. 3. Removed an unrelated citation about the VS set. 4. Added a missing citation in Section 4.4 (SSAST [19] was not the correct citation there).

Recent foundation models such as SSAST, EAT, HuBERT, Qwen-Audio, and Audio Flamingo achieve top-tier results across standard audio benchmarks, but they are limited by fixed input sample rates and durations, which hinders their reusability. This paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT), a training-from-scratch framework that eliminates the dependency on pre-trained weights while supporting arbitrary sample rates and audio lengths. AMAuT integrates four key components: (1) augmentation-driven multiview learning for robustness, (2) a conv1 + conv7 + conv1 one-dimensional CNN bottleneck for stable temporal encoding, (3) dual CLS + TAL tokens for bidirectional context representation, and (4) test-time adaptation/augmentation (TTA^2) to improve inference reliability. Experiments on five public benchmarks (AudioMNIST, SpeechCommands V1 & V2, VocalSound, and CochlScene) show that AMAuT achieves accuracies of up to 99.8% while consuming less than 3% of the GPU hours required by comparable pre-trained models. AMAuT thus presents a highly efficient and flexible alternative to large pre-trained models, making state-of-the-art audio classification accessible in computationally constrained settings.
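The abstract does not spell out how TTA^2 works, so the following is only a minimal sketch of the generic test-time augmentation idea: classify several randomly perturbed views of one input and average the class probabilities. Everything in it is an assumption made for illustration and is not taken from the paper: the names `random_view`, `tta_predict`, and `toy_model`, the choice of a circular time shift plus a small gain as the augmentation, and the stand-in classifier. The adaptation half of TTA^2 is not shown.

```python
import math
import random
from typing import Callable, List, Sequence

Waveform = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_view(wave: Waveform, rng: random.Random, max_shift: int = 4) -> Waveform:
    """Hypothetical augmentation: circular time shift plus a small random gain."""
    shift = rng.randint(-max_shift, max_shift)
    gain = rng.uniform(0.9, 1.1)
    n = len(wave)
    return [gain * wave[(i - shift) % n] for i in range(n)]


def tta_predict(
    model: Callable[[Waveform], Sequence[float]],
    wave: Waveform,
    num_views: int = 8,
    seed: int = 0,
) -> List[float]:
    """Average class probabilities over the original input and its augmented views."""
    rng = random.Random(seed)
    views = [wave] + [random_view(wave, rng) for _ in range(num_views)]
    probs = [softmax(model(view)) for view in views]
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]


def toy_model(wave: Waveform) -> List[float]:
    """Stand-in classifier (not AMAuT): two logits derived from mean signal energy."""
    energy = sum(x * x for x in wave) / len(wave)
    return [energy, 1.0 - energy]


if __name__ == "__main__":
    # Inputs of different lengths go through the same code path.
    short_clip = [math.sin(0.1 * i) for i in range(57)]
    long_clip = [math.sin(0.1 * i) for i in range(160)]
    print(tta_predict(toy_model, short_clip))
    print(tta_predict(toy_model, long_clip))
```

Averaging probabilities rather than logits is one common choice; a real pipeline would replace `toy_model` with the trained network and `random_view` with the augmentations used during training.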
