Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but they often suffer from inconsistent styles, limited creativity, and complex style representations. In this paper, we affirm that a style is worth one numerical code by introducing a novel task, code-to-style image generation, which produces images with novel, consistent visual styles conditioned solely on a numerical style code. To date, this task has been explored primarily in industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook on a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. We then train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing it to synthesize novel style embeddings. During inference, the style generator maps a numerical style code to a unique style embedding, which guides the T2I-DM to generate images in the corresponding style. Unlike existing methods, ours offers unparalleled simplicity and diversity, unlocking a vast space of reproducible styles from minimal input. Extensive experiments validate that CoTyle effectively turns a numerical code into a style controller, demonstrating that a style is worth one code.
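To make the inference pipeline concrete, below is a minimal PyTorch sketch of the code-to-style mapping described above: a numerical style code seeds an autoregressive sampler over discrete codebook indices, and the looked-up embeddings are pooled into a style condition for the T2I-DM. Everything here is a hypothetical illustration, not the authors' released implementation: the names (`StyleGenerator`, `code_to_style_embedding`), the codebook size, the embedding dimension, the GRU backbone, and the mean pooling are all assumptions, and the T2I-DM conditioning step is stubbed out.

```python
import torch
import torch.nn as nn

# Hypothetical hyperparameters; the abstract does not specify any of these.
CODEBOOK_SIZE = 1024  # number of discrete style codebook entries
EMBED_DIM = 768       # style embedding dimension
SEQ_LEN = 8           # style tokens sampled per style

class StyleGenerator(nn.Module):
    """Toy stand-in for the autoregressive style generator: it models a
    distribution over codebook indices so novel styles can be sampled."""

    def __init__(self) -> None:
        super().__init__()
        self.token_emb = nn.Embedding(CODEBOOK_SIZE + 1, EMBED_DIM)  # +1: BOS token
        self.rnn = nn.GRU(EMBED_DIM, EMBED_DIM, batch_first=True)
        self.head = nn.Linear(EMBED_DIM, CODEBOOK_SIZE)

    @torch.no_grad()
    def sample(self, style_code: int) -> torch.Tensor:
        """Map a numerical style code to a sequence of codebook indices.
        Seeding the sampler with the code makes the mapping reproducible:
        the same code always selects the same style (given fixed weights)."""
        rng = torch.Generator().manual_seed(style_code)
        tokens, hidden = [CODEBOOK_SIZE], None  # start from the BOS token
        for _ in range(SEQ_LEN):
            x = self.token_emb(torch.tensor([[tokens[-1]]]))
            out, hidden = self.rnn(x, hidden)
            probs = self.head(out[:, -1]).softmax(dim=-1)
            tokens.append(torch.multinomial(probs, 1, generator=rng).item())
        return torch.tensor(tokens[1:])

# In the paper the codebook is learned from an image collection; here it is
# randomly initialized purely so the sketch runs end to end.
codebook = nn.Embedding(CODEBOOK_SIZE, EMBED_DIM)
style_generator = StyleGenerator()  # would be trained on codebook indices

def code_to_style_embedding(style_code: int) -> torch.Tensor:
    """Numerical code -> style embedding that would condition the T2I-DM."""
    indices = style_generator.sample(style_code)
    return codebook(indices).mean(dim=0)  # mean pooling is an assumption

style_emb = code_to_style_embedding(42)
print(style_emb.shape)  # torch.Size([768]); fed to the T2I-DM as a condition
```

In this sketch, reproducibility comes from seeding the sampling generator with the style code while the model weights stay fixed, which mirrors the abstract's claim that each numerical code maps to a unique, reusable style.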