Integrating event cameras with Multimodal Large Language Models (MLLMs) promises general scene understanding in challenging visual conditions, yet requires navigating a trade-off between preserving the unique advantages of event data and ensuring compatibility with frame-based models. We address this challenge by using reconstruction as a bridge, proposing a straightforward Frame-based Reconstruction and Tokenization (FRT) method and designing an efficient Adaptive Reconstruction and Tokenization (ART) method that leverages event sparsity. For robust evaluation, we introduce EvQA, the first objective, real-world benchmark for event-based MLLMs, comprising 1,000 event-Q&A pairs from 22 public datasets. Our experiments demonstrate that our methods achieve state-of-the-art performance on EvQA, highlighting the significant potential of MLLMs in event-based vision.