MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation

Recent text-to-image generation models have acquired the ability to perform multi-reference generation and editing: inheriting the appearance of subjects from multiple reference images and re-rendering them in new contexts. However, existing benchmark datasets often focus on generation with a single or a few reference images, which prevents us from measuring how model performance advances, or from pinpointing model weaknesses, under different multi-reference conditions. In addition, their task definitions remain vague, typically limited to axes such as "what to edit" or "how many references are given", and therefore fail to capture the intrinsic difficulty of multi-reference settings. To address this gap, we introduce $\textbf{MultiBanana}$, which is carefully designed to assess the limits of model capabilities by broadly covering multi-reference-specific problems at scale: (1) varying the number of references, (2) domain mismatch among references (e.g., photo vs. anime), (3) scale mismatch between reference and target scenes, (4) references containing rare concepts (e.g., a red banana), and (5) multilingual textual references for rendering. Our analysis across a variety of text-to-image models reveals their strengths, typical failure modes, and areas for improvement. MultiBanana will be released as an open benchmark to push the boundaries and establish a standardized basis for fair comparison in multi-reference image generation. Our data and code are available at https://github.com/matsuolab/multibanana .
