Text-to-image models have rapidly evolved from casual creative tools to professional-grade systems, achieving unprecedented levels of image quality and realism. Yet most models are trained to map short prompts into detailed images, creating a gap between sparse textual input and rich visual output. This mismatch reduces controllability, as models often fill in missing details arbitrarily, biasing toward average user preferences and limiting precision for professional use. We address this limitation by training the first open-source text-to-image model on long structured captions, where every training sample is annotated with the same set of fine-grained attributes. This design maximizes expressive coverage and enables disentangled control over visual factors. To process long captions efficiently, we propose DimFusion, a fusion mechanism that integrates intermediate tokens from a lightweight LLM without increasing token length. We also introduce the Text-as-a-Bottleneck Reconstruction (TaBR) evaluation protocol. By assessing how well real images can be reconstructed through a captioning-generation loop, TaBR directly measures controllability and expressiveness, even for very long captions where existing evaluation methods fail. Finally, we demonstrate our contributions by training the large-scale model FIBO, achieving state-of-the-art prompt alignment among open-source models. Model weights are publicly available at https://huggingface.co/briaai/FIBO.
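The abstract only outlines the TaBR protocol, so here is a minimal sketch of the captioning-generation loop it describes, assuming hypothetical `caption_model`, `t2i_model`, and `image_similarity` callables (none of these names come from the paper; they stand in for any captioner, text-to-image generator, and image-similarity metric you plug in):

```python
# Minimal TaBR-style sketch: pass each real image through a text bottleneck
# (caption it, regenerate from the caption alone) and score how closely the
# reconstruction matches the original. All component functions are assumptions.

from typing import Callable, Iterable

def tabr_score(
    real_images: Iterable[object],
    caption_model: Callable[[object], str],                # image -> long structured caption
    t2i_model: Callable[[str], object],                    # caption -> reconstructed image
    image_similarity: Callable[[object, object], float],   # higher = more similar
) -> float:
    """Mean reconstruction similarity when text is the only bottleneck.

    A higher score indicates the caption format is expressive enough, and the
    generator controllable enough, to carry the image content through text.
    """
    scores = []
    for image in real_images:
        caption = caption_model(image)        # image -> text (the bottleneck)
        reconstruction = t2i_model(caption)   # text -> image
        scores.append(image_similarity(image, reconstruction))
    return sum(scores) / max(len(scores), 1)
```

Because the comparison happens only on the reconstructed image, this loop evaluates controllability and expressiveness jointly, regardless of how long the intermediate caption is.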