Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition

Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
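The calibration described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how object and background features are extracted or recombined, so the sketch assumes they are already available as vectors, recombines them by simple addition followed by L2 normalization, and uses a hypothetical weight `alpha` for the background-only term. The function names, the toy 4-dimensional embeddings, and the class names are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def calibrated_scores(obj_feat, bg_feat, alt_contexts, text_embs, alpha=1.0):
    """Hypothetical counterfactual calibration of zero-shot scores.

    obj_feat:     (D,)   object feature (assumed given)
    bg_feat:      (D,)   background feature of the same image (assumed given)
    alt_contexts: (K, D) alternative context features
    text_embs:    (C, D) class text embeddings
    alpha:        assumed weight on the background-only activation
    """
    obj_feat = l2_normalize(obj_feat)
    bg_feat = l2_normalize(bg_feat)
    alt_contexts = l2_normalize(alt_contexts)
    text_embs = l2_normalize(text_embs)

    # Counterfactual embeddings: the same object placed in K other contexts.
    # Additive recombination is an assumption made for this sketch.
    counterfactuals = l2_normalize(obj_feat[None, :] + alt_contexts)  # (K, D)

    # Average class similarity over the sampled contexts.
    expected = (counterfactuals @ text_embs.T).mean(axis=0)  # (C,)

    # Activation that the background alone would produce.
    bg_only = bg_feat @ text_embs.T  # (C,)

    return expected - alpha * bg_only


# Toy setup: dims 0-1 encode the object, dims 2-3 encode the context.
class_names = ["waterbird", "landbird"]
text_embs = np.array([[0.6, 0.0, 1.0, 0.0],   # "waterbird", tied to water
                      [0.0, 0.6, 0.0, 1.0]])  # "landbird", tied to land
obj_feat = np.array([0.0, 1.0, 0.0, 0.0])     # a landbird ...
bg_feat = np.array([0.0, 0.0, 1.0, 0.0])      # ... photographed on water
alt_contexts = np.array([[0.0, 0.0, 1.0, 0.0],
                         [0.0, 0.0, 0.0, 1.0]])

# Uncalibrated score from the full (object + background) embedding.
factual = l2_normalize(obj_feat + bg_feat) @ l2_normalize(text_embs).T
calibrated = calibrated_scores(obj_feat, bg_feat, alt_contexts, text_embs)

print("factual prediction:   ", class_names[int(np.argmax(factual))])
print("calibrated prediction:", class_names[int(np.argmax(calibrated))])
```

In this toy case the text embeddings lean heavily on context, so the uncalibrated score follows the water background, while the calibrated score follows the object. With a real CLIP model the features would come from its image and text encoders, and the choice of `alpha` and of the context pool would need to be validated empirically.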
