PhraseVAE and PhraseLDM: Latent Diffusion for Full-Song Multitrack Symbolic Music Generation

This technical report presents a new paradigm for full-song symbolic music generation. Existing symbolic models operate on note-attribute tokens and suffer from extremely long sequences, limited context length, and weak support for long-range structure. We address these issues by introducing PhraseVAE and PhraseLDM, the first latent diffusion framework designed for full-song multitrack symbolic music. PhraseVAE compresses variable-length polyphonic note sequences into compact 64-dimensional phrase-level representations with high reconstruction fidelity, enabling efficient training and yielding a well-structured latent space. Built on this latent space, PhraseLDM generates an entire multitrack song in a single pass without any autoregressive components. The system eliminates bar-wise sequential modeling, supports up to 128 bars of music (8 minutes at 64 bpm), and produces complete songs with coherent local texture, idiomatic instrument patterns, and clear global structure. With only 45M parameters, our framework generates a full song within seconds while maintaining competitive musical quality and generation diversity. Together, these results show that phrase-level latent diffusion provides an effective and scalable solution to long-sequence modeling in symbolic music generation. We hope this work encourages future symbolic music research to move beyond note-attribute tokens and to consider phrase-level units as a more effective and musically meaningful modeling target.
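The quoted maximum length of 128 bars and the 8-minute figure can be checked with simple arithmetic. The sketch below assumes 4/4 time (four beats per bar), which the abstract does not state explicitly; the function name `song_duration_minutes` is hypothetical and used only for illustration.

```python
def song_duration_minutes(bars: int, bpm: float, beats_per_bar: int = 4) -> float:
    """Return the duration in minutes of `bars` bars played at `bpm` beats per minute.

    beats_per_bar defaults to 4 (an assumed 4/4 meter, not stated in the abstract).
    """
    total_beats = bars * beats_per_bar  # e.g. 128 bars * 4 beats = 512 beats
    return total_beats / bpm            # beats divided by beats-per-minute gives minutes


if __name__ == "__main__":
    # 128 bars at 64 bpm in the assumed 4/4 meter
    print(song_duration_minutes(128, 64))
```

Under a different meter the same bar count gives a different duration, so the 8-minute figure should be read as tied to this assumption.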
