We present Ego-EXTRA, a video-language Egocentric Dataset for EXpert-TRAinee assistance. Ego-EXTRA features 50 hours of unscripted egocentric videos of subjects (the trainees) performing procedural activities while real-world experts provide guidance and answer specific questions using natural language. Following a ``Wizard of Oz'' data collection paradigm, the expert enacts a wearable intelligent assistant, observing the activities performed by the trainee exclusively from their egocentric point of view, answering questions when asked by the trainee, or proactively offering suggestions during the procedures. This unique data collection protocol enables Ego-EXTRA to capture high-quality dialogues in which expert-level feedback is provided to the trainee. The two-way dialogues between experts and trainees are recorded, transcribed, and used to create a novel benchmark comprising more than 15k high-quality Visual Question-Answer sets, which we use to evaluate Multimodal Large Language Models. The results show that Ego-EXTRA is challenging and highlight the limitations of current models when used to provide expert-level assistance to the user. The Ego-EXTRA dataset is publicly available to support the benchmarking of egocentric video-language assistants: https://fpv-iplab.github.io/Ego-EXTRA/.