Large-scale Vision Language Models (LVLMs) exhibit advanced capabilities in tasks that require visual information, including object detection. These capabilities have promising applications in various industrial domains, such as autonomous driving. For example, LVLMs can generate safety-oriented descriptions of videos captured by road-facing cameras. However, ensuring comprehensive safety also requires monitoring driver-facing views to detect risky events, such as mobile phone use while driving. Thus, the ability to process synchronized inputs from both driver-facing and road-facing cameras is necessary. In this study, we construct a dataset for this task and use it to develop models and investigate the capabilities of LVLMs, evaluating their performance on this dataset. Our experimental results demonstrate that while pre-trained LVLMs have limited effectiveness, fine-tuned LVLMs can generate accurate and safety-aware driving instructions. Nonetheless, several challenges remain, particularly in detecting subtle or complex events in the video. Our findings and error analysis provide valuable insights that can contribute to the improvement of LVLM-based systems in this domain.