The visually impaired population faces significant challenges in daily activities. While prior work employs vision-language models for assistance, most efforts focus on static content and cannot meet the real-time perception needs of complex environments. Recent VideoLLMs enable real-time vision and speech interaction, offering promising potential for assistive tasks. In this work, we conduct the first study evaluating their effectiveness in supporting the daily life of visually impaired individuals. We first conduct a user survey with visually impaired participants to inform the design of VisAssistDaily, a benchmark for evaluating assistance in daily-life scenarios. Using VisAssistDaily, we evaluate popular VideoLLMs and find that GPT-4o achieves the highest task success rate. A further user study reveals shortcomings in the models' hazard perception. To address this, we propose SafeVid, an environment-awareness dataset, and fine-tune VITA-1.5 on it, improving risk recognition accuracy from 25.00% to 76.00%. We hope this work provides valuable insights and inspiration for future research in this field.