Multi-agent trajectory prediction is crucial for autonomous systems operating in dense, interactive environments. Existing methods often fail to jointly capture agents' long-term goals and their fine-grained social interactions, which leads to unrealistic multi-agent futures. We propose VISTA, a recursive goal-conditioned transformer for multi-agent trajectory forecasting. VISTA combines (i) a cross-attention fusion module that integrates long-horizon intent with past motion, (ii) a social-token attention mechanism for flexible interaction modeling across agents, and (iii) pairwise attention maps that make social influence patterns interpretable at inference time. Our model extends single-agent goal-conditioned prediction into a coherent multi-agent forecasting framework. Beyond standard displacement metrics, we evaluate trajectory collision rates as a measure of joint realism. On the high-density MADRAS benchmark and on SDD, VISTA achieves state-of-the-art accuracy and substantially fewer collisions. On MADRAS, it lowers the average collision rate from 2.14% for strong baselines to 0.03%, and on SDD it attains zero collisions while improving ADE, FDE, and minFDE. These results show that VISTA generates socially compliant, goal-aware, and interpretable trajectories, making it promising for safety-critical autonomous systems.
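The evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes the common definitions of ADE (mean Euclidean error over all predicted timesteps), FDE (error at the final timestep), and minFDE (best FDE over several sampled futures). The abstract does not define the collision rate, so the pairwise definition below, the `radius` threshold, and all function names are hypothetical choices for illustration, not the authors' evaluation protocol.

```python
import math

# Each trajectory is a list of (x, y) positions, one per timestep.

def ade(pred, gt):
    """Average displacement error: mean Euclidean distance over all timesteps."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

def fde(pred, gt):
    """Final displacement error: Euclidean distance at the last timestep."""
    return math.dist(pred[-1], gt[-1])

def min_fde(samples, gt):
    """Best (lowest) FDE among several sampled predictions for one agent."""
    return min(fde(s, gt) for s in samples)

def collision_rate(trajs, radius=0.2):
    """Percentage of agent pairs that come closer than `radius` at any
    shared timestep. ASSUMPTION: this pairwise definition and the 0.2
    threshold are illustrative; the source does not specify them."""
    n = len(trajs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    collided = sum(
        any(math.dist(a, b) < radius for a, b in zip(trajs[i], trajs[j]))
        for i, j in pairs
    )
    return 100.0 * collided / len(pairs)
```

Under this assumed definition, the collision rate is computed on the jointly predicted trajectories of all agents in a scene, which is why it reflects joint realism rather than per-agent accuracy.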