Effective explanations of video action recognition models should disentangle how movements unfold over time from the surrounding spatial context. However, existing saliency-based methods produce entangled explanations, making it unclear whether predictions rely on motion or on spatial context. Language-based approaches offer structure but often fail to explain motions due to their tacit nature -- intuitively understood but difficult to verbalize. To address these challenges, we propose Disentangled Action aNd Context concept-based Explainable (DANCE) video action recognition, a framework that predicts actions through disentangled concept types: motion dynamics, objects, and scenes. We define motion dynamics concepts as human pose sequences, and we employ a large language model to automatically extract object and scene concepts. Built on an ante-hoc concept bottleneck design, DANCE enforces prediction through these concepts. Experiments on four datasets -- KTH, Penn Action, HAA500, and UCF-101 -- demonstrate that DANCE significantly improves explanation clarity while maintaining competitive performance. We validate the superior interpretability of DANCE through a user study. Experimental results also show that DANCE is beneficial for model debugging, editing, and failure analysis.
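To make the ante-hoc concept bottleneck idea concrete, the sketch below shows one way such a readout could be wired up in PyTorch: pooled video features are mapped to scores over three disentangled concept groups (motion dynamics, objects, scenes), and the action classifier sees only those concept scores. This is a minimal illustration under our own assumptions, not the paper's implementation; all module names, dimensions, and the choice of pooled backbone features are hypothetical.

```python
import torch
import torch.nn as nn

class ConceptBottleneckHead(nn.Module):
    """Minimal sketch of an ante-hoc concept bottleneck for action recognition.

    Video features are first mapped to scores over three disentangled concept
    groups; the action prediction is a linear readout of those concept scores
    only, so every prediction is mediated by interpretable concepts.
    """

    def __init__(self, feat_dim, n_motion, n_object, n_scene, n_actions):
        super().__init__()
        # One projection per concept type keeps the groups disentangled.
        self.motion_head = nn.Linear(feat_dim, n_motion)   # pose-sequence (motion dynamics) concepts
        self.object_head = nn.Linear(feat_dim, n_object)   # object concepts (e.g., LLM-derived)
        self.scene_head = nn.Linear(feat_dim, n_scene)     # scene concepts (e.g., LLM-derived)
        # Bottleneck: the classifier sees concept scores, not raw features.
        self.classifier = nn.Linear(n_motion + n_object + n_scene, n_actions)

    def forward(self, video_feat):
        # video_feat: (batch, feat_dim) pooled features from any video backbone.
        motion = self.motion_head(video_feat)
        objects = self.object_head(video_feat)
        scenes = self.scene_head(video_feat)
        concepts = torch.cat([motion, objects, scenes], dim=-1)
        logits = self.classifier(concepts)
        # Returning concept scores alongside logits enables per-concept explanations.
        return logits, {"motion": motion, "object": objects, "scene": scenes}


# Example usage with arbitrary (hypothetical) sizes.
model = ConceptBottleneckHead(feat_dim=512, n_motion=40, n_object=60, n_scene=30, n_actions=101)
feats = torch.randn(8, 512)
logits, concept_scores = model(feats)
```

Because the classifier operates only on concept scores, inspecting or editing the final linear layer's weights gives a direct handle for the debugging and model-editing use cases mentioned above.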