End-to-end speech-to-speech (S2S) dialogue systems have recently attracted growing research attention for their lower latency and more natural integration of nonverbal cues such as emotion and speaker identity. However, these systems face key challenges, particularly in incorporating external knowledge, a capability commonly provided by Retrieval-Augmented Generation (RAG) in text-based large language models (LLMs). The core difficulty lies in the modality gap between input speech and retrieved textual knowledge, which hinders effective integration of information. To address this issue, we propose a novel end-to-end RAG framework that directly retrieves relevant textual knowledge from speech queries. Experimental results demonstrate that our method significantly improves the performance of end-to-end S2S dialogue systems while achieving higher retrieval efficiency. Although the overall performance still lags behind state-of-the-art (SOTA) cascaded models, our framework offers a promising direction for enhancing knowledge integration in end-to-end S2S systems. Our code and dataset are publicly released.
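The retrieval step described above can be pictured as dense retrieval over a shared embedding space: a speech encoder maps the spoken query into the same vector space as the text passages, and the nearest passages by cosine similarity are returned. The sketch below is purely illustrative and is not the paper's implementation; the encoders are replaced by random toy embeddings, and all names (`retrieve`, `normalize`) are hypothetical.

```python
import numpy as np

def normalize(x):
    # L2-normalize along the last axis so dot products become cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(speech_emb, doc_embs, k=2):
    # Rank text-passage embeddings by cosine similarity to the
    # speech-query embedding (shared embedding space assumed) and
    # return the indices of the top-k passages.
    sims = normalize(doc_embs) @ normalize(speech_emb)
    return np.argsort(-sims)[:k]

# Toy stand-ins for encoder outputs (hypothetical data, not real embeddings).
rng = np.random.default_rng(0)
speech_emb = rng.normal(size=64)        # output of a speech encoder
doc_embs = rng.normal(size=(8, 64))     # outputs of a text encoder over 8 passages

top = retrieve(speech_emb, doc_embs, k=2)
print(top)
```

In an actual cross-modal system, the two encoders would be trained (e.g., contrastively) so that a spoken question and the passage answering it land close together; the modality gap the abstract mentions is exactly the difficulty of learning such an alignment.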