We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and use intermediate images, such as sketches, structural diagrams, or path drawings, to guide their reasoning process. This setup closely mirrors how humans solve complex problems by "drawing to think." To this end, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express in language alone. To ensure high-quality evaluation data, we include 546 multimodal problems, each annotated with intermediate visual images and a final answer. We also propose a unified evaluation protocol for MIRA spanning three levels of input: direct input with the image and question only, text-only CoT input with the image and textual thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority-voting accuracy under different k settings. Experimental results show that existing multimodal large language models, including the strongest proprietary models as well as strong open-weight models, perform poorly when relying solely on textual prompts. When intermediate visual cues are provided, however, performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We further probe this upper bound by expanding the search space and by designing textual prompts aligned with Visual-CoT, but both yield only limited improvements over the Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.
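For reference, a minimal sketch of how the pass@k and majority-voting accuracies above are commonly estimated, assuming the standard unbiased pass@k estimator of Chen et al. (2021) over $n$ sampled responses of which $c$ are correct (the exact procedure used for MIRA may differ):

\[
\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right],
\qquad
\text{maj@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, \mathbb{1}\big(\operatorname{mode}(a_1,\dots,a_k) = a^{*}\big) \,\right],
\]

where $a_1,\dots,a_k$ are the $k$ sampled answers for a problem and $a^{*}$ is its ground-truth answer. Pass@k credits a problem if any of the $k$ samples is correct, while maj@k requires the most frequent sampled answer to be correct.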