We propose Omni-R1, which fine-tunes a recent multi-modal LLM, Qwen2.5-Omni, on an audio question-answering dataset with the reinforcement learning method GRPO. This yields new state-of-the-art performance on the recent MMAU and MMAR benchmarks. Omni-R1 achieves the highest accuracies on the sounds, music, speech, and overall average categories, on both the Test-mini and Test-full splits. To understand the source of the improvement, we evaluated models both with and without audio input and found that much of the gain from GRPO can be attributed to better text-based reasoning. Surprisingly, we also found that fine-tuning on a text-only dataset, without any audio, was effective at improving audio-based performance.
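To make the training objective concrete, below is a minimal sketch of the GRPO (Group Relative Policy Optimization) loss on a single group of sampled answers. This is an illustrative reconstruction, not the paper's implementation: the binary exact-match reward, the group size, and the toy log-probability tensors are assumptions, and the optional KL penalty to a reference policy used in full GRPO is omitted for brevity.

```python
# Minimal GRPO sketch: clipped policy-gradient loss with group-relative
# advantages, assuming a binary exact-match reward for audio QA answers.
import torch

def grpo_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """GRPO loss for one group of G answers sampled for the same question.

    logp_new, logp_old: (G,) summed log-probs of each sampled answer
        under the current policy and the sampling (old) policy.
    rewards: (G,) scalar reward per answer in the group.
    """
    # Group-relative advantage: normalize rewards within the group
    # (zero mean, unit variance), so no learned value function is needed.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)  # importance ratio per answer
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    # PPO-style pessimistic objective, averaged over the group.
    return -torch.minimum(unclipped, clipped).mean()

# Toy usage: a group of G=4 sampled answers, two of them correct.
logp_old = torch.tensor([-12.0, -15.0, -11.0, -14.0])
logp_new = logp_old + 0.1 * torch.randn(4)    # slightly updated policy
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])  # hypothetical exact-match reward
print(grpo_loss(logp_new, logp_old, rewards))
```

In practice the rewards would come from comparing the model's generated answer string against the ground-truth choice for each audio question, and the loss would be averaged over many question groups per batch.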