We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories, which encode motion, timing, and visibility, with natural language for semantic intent and reference images for visual grounding of object identity. This enables the generation of coherent, controllable events, including multi-agent interactions, object entry and exit, reference-guided appearance, and counterintuitive events. The resulting videos exhibit not only temporal coherence but also emergent consistency, preserving object identity and the scene even across temporary disappearance. By supporting expressive world-event generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: https://worldcanvas.github.io/.
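To make the multimodal prompt concrete, the sketch below illustrates one plausible way to structure such an event specification: a text instruction for semantic intent, a trajectory whose points carry motion, timing, and visibility, and an optional reference image for object identity. This is a minimal illustration under assumed names (`EventPrompt`, `TrajectoryPoint`, `drone_ref.png`), not the WorldCanvas API.

```python
# Hypothetical sketch of a multimodal "world event" prompt; all names are
# illustrative assumptions, not part of the WorldCanvas implementation.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TrajectoryPoint:
    x: float               # normalized image x-coordinate in [0, 1]
    y: float               # normalized image y-coordinate in [0, 1]
    t: int                 # frame index, encoding timing along the trajectory
    visible: bool = True   # whether the object is on-screen at this frame


@dataclass
class EventPrompt:
    text: str                               # natural-language semantic intent
    trajectory: List[TrajectoryPoint]       # motion, timing, and visibility
    reference_image: Optional[str] = None   # image path grounding object identity


# Example: an object enters from off-screen, crosses the frame, and exits.
prompt = EventPrompt(
    text="a red delivery drone flies across the courtyard",
    trajectory=[
        TrajectoryPoint(x=-0.05, y=0.40, t=0, visible=False),   # off-screen entry
        TrajectoryPoint(x=0.30, y=0.45, t=24),
        TrajectoryPoint(x=0.70, y=0.50, t=48),
        TrajectoryPoint(x=1.05, y=0.55, t=72, visible=False),   # off-screen exit
    ],
    reference_image="drone_ref.png",
)
```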