This paper presents a simple method for enhancing textual pre-trained large language models with speech information when fine-tuning them for a specific classification task. A classical issue when fusing audio embeddings with text is that the audio sequence is much longer than the text one. Our method builds on an existing speech tokenizer trained for Automatic Speech Recognition, which outputs long sequences of tokens from a large vocabulary, making it difficult to integrate into a large language model at low cost. By applying a simple lasso-based feature selection to a multimodal Bag-of-Words representation, we retain only the audio tokens most important for the task, adapt the language model to them with a self-supervised language modeling objective, and then fine-tune it on the downstream task. We show that this improves performance compared to a unimodal model, to a larger SpeechLM, and to integrating audio via a learned representation. We demonstrate the effectiveness of our method on two recent Argumentative Fallacy Detection and Classification tasks where the use of audio was previously believed to be counterproductive, reaching state-of-the-art results. We also provide an in-depth analysis of the method, showing that even a random selection of audio tokens helps enhance the unimodal model. Our code is available [online](https://github.com/salocinc/EACL26SpeechTokFallacy/).
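To make the token-selection step concrete, here is a minimal sketch of lasso-style feature selection over a multimodal Bag-of-Words, using scikit-learn. The variable names (`texts`, `audio`, `labels`) and the `aud_` token prefix are hypothetical stand-ins, and an L1-penalized logistic regression is used as a stand-in for the lasso-based selector; this is an illustration under those assumptions, not the authors' implementation.

```python
# Sketch: select audio tokens via an L1 (lasso-style) penalty on a
# multimodal Bag-of-Words. Data and names below are hypothetical.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins: each sample pairs a transcript with the discrete audio
# tokens (e.g. "aud_17 aud_902 ...") emitted by an ASR speech tokenizer.
texts = ["we must act now", "numbers do not lie"]
audio = ["aud_17 aud_902 aud_17", "aud_44 aud_902 aud_5"]
labels = [1, 0]  # e.g. fallacious vs. non-fallacious

# Multimodal Bag-of-Words: concatenate text and audio tokens per sample.
docs = [t + " " + a for t, a in zip(texts, audio)]
vectorizer = CountVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(docs)

# The L1 penalty drives uninformative feature weights exactly to zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X, labels)

# Keep only the audio tokens whose coefficients survived the penalty;
# these are the tokens added to the language model's vocabulary.
vocab = np.array(vectorizer.get_feature_names_out())
coefs = clf.coef_.ravel()
selected_audio = [tok for tok, w in zip(vocab, coefs)
                  if tok.startswith("aud_") and w != 0.0]
print(selected_audio)
```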