Video Temporal Grounding (VTG) aims to localize the temporal segment in a video that corresponds to a natural language query. However, existing VTG models assume that a relevant segment always exists, so they predict a target segment even when the query is irrelevant to the video. While recent approaches attempt to handle irrelevant queries, they can reject only queries that are entirely unrelated to the video and still fail on hard-irrelevant queries that are semantically similar to the video content but not actually relevant. To address this, we propose Refusal-Aware Reinforcement Fine-Tuning (RA-RFT) to effectively refuse hard-irrelevant queries in VTG. Our method builds on the Group Relative Policy Optimization (GRPO) framework and integrates four reward objectives (format, refuse-IoU, explain, and query correction) to improve both relevance discrimination and fine-grained semantic reasoning. In addition, to effectively support RA-RFT, we construct a Hard-Irrelevant VTG (HI-VTG) dataset, which includes hard-irrelevant queries paired with refusal answers. We demonstrate the effectiveness of our method across various relevance-aware VTG scenarios, including hard-irrelevant VTG, simply-shuffled RA-VTG, and human-annotated RA-VTG settings. We also show that the proposed method scales across various LVLM-based VTG models. Our code is available at https://github.com/JINSUBY/RA-RFT.