To address the bottleneck of accurately interpreting user intent in the current video generation community, we present Any2Caption, a novel framework for controllable video generation under any condition. The key idea is to decouple the various condition-interpretation steps from the video synthesis step. By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs--text, images, videos, and specialized cues such as region, motion, and camera poses--into dense, structured captions that provide backbone video generators with better guidance. We also introduce Any2CapIns, a large-scale dataset with 337K instances and 407K conditions for any-condition-to-caption instruction tuning. Comprehensive evaluations demonstrate that our system yields significant improvements in controllability and video quality across various aspects of existing video generation models. Project Page: https://sqwu.top/Any2Cap/
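To make the decoupling concrete, below is a minimal sketch of the two-stage pipeline the abstract describes: an MLLM first condenses heterogeneous conditions into one dense, structured caption, and an unchanged backbone generator then consumes only that caption. All names here (`Conditions`, `interpret_conditions`, `generate_video`, and the `caption_model` / `video_generator` interfaces) are hypothetical illustrations, not APIs from the paper or its codebase.

```python
# Illustrative sketch only; interfaces are assumptions, not the paper's implementation.
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Conditions:
    """Any combination of user-provided conditions."""
    text: Optional[str] = None
    image: Optional[Any] = None        # e.g. a reference frame
    video: Optional[Any] = None        # e.g. a source clip
    extras: Dict[str, Any] = field(default_factory=dict)  # region / motion / camera poses


def interpret_conditions(caption_model: Any, cond: Conditions) -> str:
    """Stage 1: a hypothetical MLLM turns heterogeneous conditions into one
    dense, structured caption (the any-condition-to-caption step)."""
    return caption_model.generate(
        text=cond.text, image=cond.image, video=cond.video, **cond.extras
    )


def generate_video(video_generator: Any, structured_caption: str) -> Any:
    """Stage 2: an off-the-shelf backbone consumes only the caption, keeping
    condition interpretation decoupled from video synthesis."""
    return video_generator.sample(prompt=structured_caption)
```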