Content-aware layout generation is a critical task in graphic design automation, focused on creating visually appealing arrangements of elements that blend seamlessly with a given background image. The variety of real-world applications makes it highly challenging to develop a single model that unifies the diverse range of input-constrained generation sub-tasks, such as those conditioned on element types, sizes, or inter-element relationships. Existing methods either address only a subset of these tasks or require separate model parameters for different conditions, and thus fail to offer a truly unified solution. In this paper, we propose UniLayDiff, a Unified Diffusion Transformer that, for the first time, addresses diverse content-aware layout generation tasks with a single, end-to-end trainable model. Specifically, we treat layout constraints as a distinct modality and employ a Multi-Modal Diffusion Transformer framework to capture the complex interplay between the background image, layout elements, and diverse constraints. Moreover, we integrate relation constraints by fine-tuning the model with LoRA after pretraining it on the other tasks. This scheme not only achieves unified conditional generation but also enhances overall layout quality. Extensive experiments demonstrate that UniLayDiff achieves state-of-the-art performance on tasks ranging from unconditional to various conditional generation and, to the best of our knowledge, is the first model to unify the full range of content-aware layout generation tasks.
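Since the abstract stays high-level, the following is a minimal, hypothetical PyTorch sketch of the core idea it describes: layout constraints enter as a third token stream alongside background-image tokens and layout-element tokens, and an MM-DiT-style block applies joint attention over all three streams so their interplay is modeled in a single pass. The class name `JointModalityBlock`, the dimensions, and the per-modality projections are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the abstract's idea: treat layout constraints as a
# distinct modality and run joint attention over image, layout, and constraint
# tokens. All names and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn

class JointModalityBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Separate input projections give each modality its own embedding map.
        self.proj = nn.ModuleDict({
            "image": nn.Linear(dim, dim),
            "layout": nn.Linear(dim, dim),
            "constraint": nn.Linear(dim, dim),
        })
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, image, layout, constraint):
        # Concatenate the three token streams so a single self-attention pass
        # can capture image-layout-constraint interactions.
        tokens = torch.cat([
            self.proj["image"](image),
            self.proj["layout"](layout),
            self.proj["constraint"](constraint),
        ], dim=1)
        h = self.norm1(tokens)
        x = tokens + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        # Return only the updated layout tokens; the split mirrors the input.
        n_img, n_lay = image.shape[1], layout.shape[1]
        return x[:, n_img:n_img + n_lay]

# Toy usage: batch of 2, 16 image tokens, 8 layout elements, 3 constraint tokens.
block = JointModalityBlock()
out = block(torch.randn(2, 16, 256),
            torch.randn(2, 8, 256),
            torch.randn(2, 3, 256))
print(out.shape)  # torch.Size([2, 8, 256])
```

Under the abstract's pretrain-then-LoRA schema, relation constraints would then be incorporated by freezing such a pretrained backbone and fine-tuning low-rank adapters on, for example, the attention projections; those details are likewise assumptions here rather than the paper's specification.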