Multimodal trajectory prediction generates multiple plausible future trajectories to address the uncertainty of vehicle motion arising from intention ambiguity and execution variability. However, HD map-dependent models suffer from costly data acquisition, delayed updates, and vulnerability to corrupted inputs, all of which can cause prediction failures. Map-free approaches lack global context; their pairwise attention over-amplifies straight-driving patterns while suppressing transitional ones, producing motion-intention misalignment. This paper proposes GContextFormer, a plug-and-play encoder-decoder architecture with global context-aware hybrid attention and scaled additive aggregation that achieves intention-aligned multimodal prediction without map reliance. The Motion-Aware Encoder builds a scene-level intention prior via bounded scaled additive aggregation over mode-embedded trajectory tokens and refines per-mode representations under the shared global context, mitigating inter-mode suppression and promoting intention alignment. The Hierarchical Interaction Decoder decomposes social reasoning into dual-pathway cross-attention: a standard pathway ensures uniform geometric coverage over agent-mode pairs, while a neighbor-context-enhanced pathway emphasizes salient interactions, with a gating module mediating their contributions to maintain the coverage-focus balance. Experiments on eight highway-ramp scenarios from the TOD-VT dataset show that GContextFormer outperforms state-of-the-art baselines. Compared with existing transformer models, spatial error distributions show that GContextFormer is more robust and that its improvements concentrate in high-curvature and transition zones. Interpretability arises from motion-mode distinctions and neighbor-context modulation, which expose how predictions are attributed to specific interactions. The modular architecture supports extension toward cross-domain multimodal reasoning tasks. Source: https://fenghy-chen.github.io/sources/.
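To make the encoder's aggregation concrete, the following is a minimal PyTorch sketch of how bounded scaled additive aggregation over mode-embedded trajectory tokens could build a shared scene-level intention prior and refine each mode under that global context. All names (SceneContextEncoder, d_model, the residual refinement MLP) and the specific bounding choice (sigmoid weights scaled by 1/n) are illustrative assumptions, not the authors' implementation.

```python
# Sketch (assumptions, not the authors' code): bounded scaled additive
# aggregation over mode-embedded trajectory tokens, followed by refinement
# of every mode token under the shared scene-level context.
import torch
import torch.nn as nn


class SceneContextEncoder(nn.Module):
    def __init__(self, d_model: int = 128):
        super().__init__()
        self.score = nn.Linear(d_model, 1)           # per-token salience score
        self.refine = nn.Sequential(                  # fuse token with global context
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_modes, d_model) mode-embedded trajectory tokens
        n = tokens.size(1)
        # Bounded weights in (0, 1/n): sigmoid bounds each weight, the 1/n
        # scaling keeps the aggregate's magnitude independent of n and avoids
        # the winner-take-all sharpening of softmax-normalized pairwise attention.
        w = torch.sigmoid(self.score(tokens)) / n             # (batch, n_modes, 1)
        global_ctx = (w * tokens).sum(dim=1, keepdim=True)    # (batch, 1, d_model)
        # Refine every mode token under the same shared global context.
        fused = torch.cat([tokens, global_ctx.expand_as(tokens)], dim=-1)
        return tokens + self.refine(fused)                    # residual refinement
```

Because no mode's weight can dominate the bounded sum, transitional modes are not suppressed by high-frequency straight-driving modes when the scene-level prior is formed, which is the intention-alignment property the abstract describes.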
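Similarly, a hedged sketch of the decoder's dual-pathway cross-attention with a gating module is given below. The standard pathway attends uniformly over all agent-mode pairs, the neighbor-context-enhanced pathway attends over agent tokens enriched with a neighbor-context feature, and a learned gate mediates their contributions. Module names, tensor shapes, and the sigmoid gate are assumptions for illustration only.

```python
# Sketch (assumptions, not the authors' code): dual-pathway cross-attention
# with a gating module balancing geometric coverage and salient focus.
import torch
import torch.nn as nn


class DualPathwayDecoderLayer(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.std_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ctx_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ctx_proj = nn.Linear(2 * d_model, d_model)   # inject neighbor context
        self.gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())

    def forward(self, query, agents, neighbor_ctx):
        # query:        (batch, n_modes, d_model)  ego mode tokens
        # agents:       (batch, n_agents, d_model) surrounding-agent tokens
        # neighbor_ctx: (batch, n_agents, d_model) neighbor-context features
        std_out, _ = self.std_attn(query, agents, agents)          # uniform coverage
        enriched = self.ctx_proj(torch.cat([agents, neighbor_ctx], dim=-1))
        ctx_out, _ = self.ctx_attn(query, enriched, enriched)      # salient focus
        g = self.gate(query)                                        # per-mode gate
        # Gate mediates coverage (standard path) vs. focus (enhanced path).
        return query + g * std_out + (1.0 - g) * ctx_out
```

The gate values also serve the interpretability claim: inspecting g per agent-mode pair exposes how much each prediction relied on the neighbor-context-enhanced pathway versus uniform coverage.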