Conditional Panoramic Image Generation via Masked Autoregressive Modeling

Recent progress in panoramic image generation has underscored two critical limitations in existing approaches. First, most methods are built upon diffusion models, which are inherently ill-suited to equirectangular projection (ERP) panoramas because the spherical mapping of ERP violates the independent and identically distributed (i.i.d.) Gaussian noise assumption. Second, these methods often treat text-conditioned generation (text-to-panorama) and image-conditioned generation (panorama outpainting) as separate tasks, relying on distinct architectures and task-specific data. In this work, we propose a unified framework, Panoramic AutoRegressive model (PAR), which leverages masked autoregressive modeling to address these challenges. PAR avoids the i.i.d. assumption constraint and integrates text and image conditioning into a single cohesive architecture, enabling seamless generation across both tasks. To address the inherent discontinuity in existing generative models, we introduce circular padding to enhance spatial coherence and propose a consistency alignment strategy to improve generation quality. Extensive experiments demonstrate competitive performance in text-to-panorama generation and panorama outpainting tasks while showcasing promising scalability and generalization capabilities.
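
The circular padding mentioned above can be illustrated with a minimal sketch. The abstract does not specify where the padding is applied, so this example assumes it wraps only the width (longitude) axis of an ERP-shaped array, where the left and right edges meet on the sphere. The function name `circular_pad_width` is hypothetical and not taken from the paper; the sketch uses NumPy's `np.pad` with `mode="wrap"`.

```python
import numpy as np


def circular_pad_width(x: np.ndarray, pad: int) -> np.ndarray:
    """Wrap-pad the last (width / longitude) axis of an array shaped (..., H, W).

    Hypothetical helper for illustration: columns from the right edge are
    copied to the left side and vice versa, so a filter sliding over the
    padded array sees the panorama as horizontally continuous.
    """
    if pad < 0:
        raise ValueError("pad must be non-negative")
    # Pad only the last axis; all leading axes are left unchanged.
    pad_width = [(0, 0)] * (x.ndim - 1) + [(pad, pad)]
    return np.pad(x, pad_width, mode="wrap")


# Toy 2 x 6 "panorama" whose values make the wrap-around easy to see.
erp = np.arange(2 * 6).reshape(2, 6)
padded = circular_pad_width(erp, 2)
print(padded)
```

In this toy case the first row `[0, 1, 2, 3, 4, 5]` becomes `[4, 5, 0, 1, 2, 3, 4, 5, 0, 1]`. The height axis is left unpadded here because the top and bottom of an ERP image correspond to the poles rather than to each other; that choice is an assumption of this sketch.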
