Household tidying is an important application area, yet current benchmarks neither model user preferences nor support mobility, and they generalize poorly, making it hard to comprehensively assess integrated language-to-action capabilities. To address this, we propose RoboTidy, a unified benchmark for language-guided household tidying that supports Vision-Language-Action (VLA) and Vision-Language-Navigation (VLN) training and evaluation. RoboTidy provides 500 photorealistic 3D Gaussian Splatting (3DGS) household scenes with collision geometry (covering 500 objects and containers), formulates tidying as an "Action (Object, Container)" list, and supplies 6.4k high-quality manipulation demonstration trajectories and 1.5k navigation trajectories to support both few-shot and large-scale training. We also deploy RoboTidy in the real world for object tidying, establishing an end-to-end benchmark for household tidying. RoboTidy offers a scalable platform and bridges a key gap in embodied AI by enabling holistic and realistic evaluation of language-guided robots.
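For concreteness, below is a minimal sketch of how an "Action (Object, Container)" task list might be represented in code. The class, field, and value names (`TidyStep`, `put`, `cabinet_shelf`, etc.) are illustrative assumptions, not RoboTidy's actual API or vocabulary.

```python
# Hypothetical sketch of the "Action (Object, Container)" formulation;
# names are assumptions for illustration, not RoboTidy's real interface.
from dataclasses import dataclass

@dataclass
class TidyStep:
    action: str      # e.g. "put", "hang", "stack"
    obj: str         # object to be tidied, e.g. "mug"
    container: str   # target receptacle, e.g. "cabinet_shelf"

# A tidying episode is an ordered list of such steps, which a
# language-guided robot must ground into navigation and manipulation.
episode = [
    TidyStep("put", "mug", "cabinet_shelf"),
    TidyStep("put", "remote_control", "drawer"),
]

for step in episode:
    print(f"{step.action}({step.obj}, {step.container})")
```

Under this reading, evaluation reduces to checking whether each (object, container) placement in the list is satisfied at the end of an episode.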