InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation

We introduce InfinityStar, a unified spacetime autoregressive framework for high-resolution image and dynamic video synthesis. Building on the recent success of autoregressive modeling in both vision and language, our purely discrete approach jointly captures spatial and temporal dependencies within a single architecture. This unified design naturally supports a variety of generation tasks, such as text-to-image, text-to-video, image-to-video, and long interactive video synthesis, via straightforward temporal autoregression. Extensive experiments demonstrate that InfinityStar scores 83.74 on VBench, outperforming all autoregressive models by large margins and even surpassing some diffusion competitors such as HunyuanVideo. Without extra optimizations, our model generates a 5-second 720p video approximately 10x faster than leading diffusion-based methods. To our knowledge, InfinityStar is the first discrete autoregressive video generator capable of producing industrial-level 720p videos. We release all code and models to foster further research in efficient, high-quality video generation.
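To illustrate how a single temporal autoregressive loop can cover the tasks listed above, the following is a minimal toy sketch, not the released implementation. Everything in it is an assumption made for illustration: the names `toy_next_token` and `generate`, the vocabulary size, the clip length, and the arithmetic stand-in for a trained model. It reflects only the general idea that each clip of discrete tokens is generated conditioned on the text prompt and on all previously generated clips.

```python
from typing import List, Optional, Sequence

VOCAB_SIZE = 16  # hypothetical size of the discrete visual-token vocabulary


def toy_next_token(context: Sequence[int]) -> int:
    """Hypothetical stand-in for a trained autoregressive model.

    A real model would predict a distribution over tokens; this toy
    simply maps the context to one deterministic token id.
    """
    return (sum(context) * 31 + len(context)) % VOCAB_SIZE


def generate(prompt: str,
             num_clips: int,
             tokens_per_clip: int = 4,
             first_clip: Optional[List[int]] = None) -> List[List[int]]:
    """Generate clips of discrete tokens, each conditioned on all earlier ones."""
    # Toy text conditioning: map characters to token ids (an assumption).
    context = [ord(ch) % VOCAB_SIZE for ch in prompt]
    clips: List[List[int]] = []

    if first_clip is not None:
        # Image-to-video: the given tokens act as the already-generated first clip.
        clips.append(list(first_clip))
        context.extend(first_clip)

    # Temporal autoregression: each new clip sees the prompt and every prior clip.
    while len(clips) < num_clips:
        clip: List[int] = []
        for _ in range(tokens_per_clip):
            token = toy_next_token(context)
            clip.append(token)
            context.append(token)
        clips.append(clip)
    return clips


image = generate("a cat", num_clips=1)                        # text-to-image
video = generate("a cat", num_clips=3)                        # text-to-video
i2v = generate("a cat", num_clips=3, first_clip=image[0])     # image-to-video
```

In this sketch, text-to-image is the one-clip case, text-to-video extends the same loop to more clips, and image-to-video seeds the context with a given first clip. A longer video corresponds to running the loop for more clips.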
