Understanding how humans evaluate robot behavior during human-robot interactions is crucial for developing socially aware robots that behave according to human expectations. While the traditional approach to capturing these evaluations is to conduct a user study, recent work has proposed utilizing machine learning instead. However, existing data-driven methods require large amounts of labeled data, which limits their use in practice. To address this gap, we propose leveraging the few-shot learning capabilities of Large Language Models (LLMs) to improve how well a robot can predict a user's perception of its performance, and we study this idea experimentally in social navigation tasks. To this end, we extend the SEAN TOGETHER dataset with additional real-world human-robot navigation episodes and participant feedback. Using this augmented dataset, we evaluate the ability of several LLMs to predict human perceptions of robot performance from a small number of in-context examples, based on observed spatio-temporal cues of the robot and surrounding human motion. Our results demonstrate that LLMs can match or exceed the performance of traditional supervised learning models while requiring an order of magnitude fewer labeled instances. We further show that prediction performance can improve with more in-context examples, confirming the scalability of our approach. Additionally, we investigate what kind of sensor-based information an LLM relies on to make these inferences by conducting an ablation study on the input features considered for performance prediction. Finally, we explore the novel application of personalized examples for in-context learning, i.e., examples drawn from the same user being evaluated, finding that they further enhance prediction accuracy. This work paves the way toward improving robot behavior in a scalable manner through user-centered feedback.