In this report, we describe our submission to Track 5 of the DCASE 2025 Challenge for the task of Audio Question Answering (AQA). Our system leverages the self-supervised learning (SSL) backbone BEATs to extract frame-level audio features, which are processed by a classification head to produce segment-level predictions of acoustic events following the AudioSet ontology. These segment-level predictions are calibrated and then aggregated into event-level predictions. Finally, the event-level predictions are incorporated into a structured prompt together with the question and the candidate answers, and this prompt is fed to a version of Qwen2.5-7B-Instruct fine-tuned with the GRPO algorithm and a simple reward function. Our method achieves an accuracy of 62.6% on the development set, demonstrating the effectiveness of combining acoustic event reasoning with instruction-tuned large language models for AQA.
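The event-to-prompt step of the pipeline can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the segment-merging threshold, hop size, helper names (`segments_to_events`, `build_prompt`), and prompt wording are all assumptions.

```python
def segments_to_events(segment_probs, labels, threshold=0.5, hop_s=0.16):
    """Merge runs of above-threshold segments into (label, start, end) events.

    segment_probs: list of per-segment probability vectors, one entry per label.
    threshold and hop_s are illustrative values, not the system's actual settings.
    """
    events = []
    for j, label in enumerate(labels):
        start = None
        for i, probs in enumerate(segment_probs):
            if probs[j] >= threshold and start is None:
                start = i * hop_s                      # event onset
            elif probs[j] < threshold and start is not None:
                events.append((label, start, i * hop_s))  # event offset
                start = None
        if start is not None:  # event still active at clip end
            events.append((label, start, len(segment_probs) * hop_s))
    return events


def build_prompt(events, question, choices):
    """Pack detected events, the question, and candidate answers into one prompt."""
    event_lines = "\n".join(
        f"- {label}: {start:.1f}s to {end:.1f}s" for label, start, end in events
    )
    choice_lines = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return (
        "Detected acoustic events:\n"
        f"{event_lines}\n\n"
        f"Question: {question}\n"
        f"Choices:\n{choice_lines}\n"
        "Answer with the letter of the correct choice."
    )
```

The resulting text prompt is what the fine-tuned Qwen2.5-7B-Instruct model would consume; keeping the event list in a fixed, line-per-event format makes the input easy for an instruction-tuned LLM to parse.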