Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions

Text-to-image models have rapidly evolved from casual creative tools to professional-grade systems, achieving unprecedented levels of image quality and realism. Yet, most models are trained to map short prompts into detailed images, creating a gap between sparse textual input and rich visual outputs. This mismatch reduces controllability, as models often fill in missing details arbitrarily, biasing toward average user preferences and limiting precision for professional use. We address this limitation by training the first open-source text-to-image model on long structured captions, where every training sample is annotated with the same set of fine-grained attributes. This design maximizes expressive coverage and enables disentangled control over visual factors. To process long captions efficiently, we propose DimFusion, a fusion mechanism that integrates intermediate tokens from a lightweight LLM without increasing token length. We also introduce the Text-as-a-Bottleneck Reconstruction (TaBR) evaluation protocol. By assessing how well real images can be reconstructed through a captioning-generation loop, TaBR directly measures controllability and expressiveness, even for very long captions where existing evaluation methods fail. Finally, we demonstrate our contributions by training the large-scale model FIBO, achieving state-of-the-art prompt alignment among open-source models. Model weights are publicly available at https://huggingface.co/briaai/FIBO
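
The captioning-generation loop behind TaBR can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how reconstruction quality is scored, so `tabr_score`, the toy captioners, `toy_generator`, and `attribute_overlap` are all hypothetical stand-ins, and an "image" here is just a dictionary of visual attributes. A real evaluation would presumably plug in an actual captioner, a text-to-image model, and an image similarity measure.

```python
import json
from typing import Callable, Iterable, TypeVar

Image = TypeVar("Image")


def tabr_score(
    images: Iterable[Image],
    captioner: Callable[[Image], str],
    generator: Callable[[str], Image],
    similarity: Callable[[Image, Image], float],
) -> float:
    """Average similarity between each real image and its reconstruction.

    Hypothetical helper: the scoring rule is an assumption, not taken
    from the paper.
    """
    scores = []
    for image in images:
        caption = captioner(image)           # image -> text (the bottleneck)
        reconstruction = generator(caption)  # text -> image
        scores.append(similarity(image, reconstruction))
    if not scores:
        raise ValueError("need at least one image")
    return sum(scores) / len(scores)


# Toy stand-ins (assumptions): an "image" is a dict of visual attributes.
def full_captioner(image: dict) -> str:
    """Structured caption that records every attribute."""
    return json.dumps(image, sort_keys=True)


def short_captioner(image: dict) -> str:
    """Sparse caption that keeps only the subject."""
    return json.dumps({"subject": image["subject"]})


def toy_generator(caption: str) -> dict:
    """Pretend generator: renders exactly what the caption states."""
    return json.loads(caption)


def attribute_overlap(real: dict, recon: dict) -> float:
    """Fraction of the real image's attributes recovered exactly."""
    matches = sum(1 for key, value in real.items() if recon.get(key) == value)
    return matches / len(real)


images = [
    {"subject": "red bicycle", "lighting": "golden hour",
     "camera": "low angle", "background": "brick wall"},
    {"subject": "ceramic mug", "lighting": "soft studio",
     "camera": "top down", "background": "oak table"},
]

print(tabr_score(images, full_captioner, toy_generator, attribute_overlap))
print(tabr_score(images, short_captioner, toy_generator, attribute_overlap))
```

In this toy setting the sparse caption loses most attributes at the text bottleneck, so its reconstruction score is lower than that of the full structured caption. That is the kind of gap the abstract describes between short prompts and rich visual outputs.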
