How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior studies on language models have reported an inverse scaling effect, where textual distractors lead to longer but less effective reasoning. To investigate whether similar phenomena occur in multimodal settings, we introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions. Our analyses reveal that visual distractors differ fundamentally from textual ones: although inverse scaling persists, adding visual distractors reduces accuracy without increasing reasoning length. We further show that tracking attribute counts within reasoning traces provides key insights into how distractors, reasoning length, and accuracy interact. Finally, we demonstrate that these trends extend to established visual bias benchmarks such as Waterbirds, and we propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.