The rapid advancement of speech generation technologies in the era of large language models (LLMs) has established discrete speech tokens as a foundational paradigm for speech representation. Being discrete and compact, these tokens are efficient to transmit and store, and they are inherently compatible with the language modeling framework, which allows speech to be integrated seamlessly into text-dominated LLM architectures. Current research divides discrete speech tokens into two principal classes, acoustic tokens and semantic tokens, each of which has evolved into a rich research domain with its own design philosophies and methodological approaches. This survey systematically synthesizes the existing taxonomy and recent innovations in discrete speech tokenization, critically examines the strengths and limitations of each paradigm, and presents systematic experimental comparisons across token types. Furthermore, we identify persistent challenges in the field and propose potential research directions, with the aim of offering actionable insights for the future development and application of discrete speech tokens.