Advances in 3D generative AI have enabled the creation of physical objects from text prompts, but challenges remain in creating objects that involve multiple component types. We present a pipeline that integrates 3D generative AI with vision-language models (VLMs) to enable the robotic assembly of multi-component objects from natural language. Our method leverages VLMs for zero-shot, multi-modal reasoning about geometry and functionality to decompose AI-generated meshes into multi-component 3D models built from predefined structural and panel components. We demonstrate that a VLM can determine which mesh regions require panel components in addition to structural components, based on the object's geometry and functionality. In an evaluation across test objects, users preferred the VLM-generated assignments 90.6% of the time, compared with 59.4% for rule-based and 2.5% for random assignment. Lastly, the system allows users to refine component assignments through conversational feedback, enabling greater human control and agency in making physical objects with generative AI and robotics.
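To make the component-assignment step concrete, below is a minimal sketch of how a VLM might be queried to label mesh regions as needing panel components in addition to structural ones. The abstract does not specify the model, prompt wording, or mesh-segmentation method; the model name (`gpt-4o`), the prompt text, the region names, and the output schema here are all illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: querying a VLM to decide which mesh regions need panel
# components in addition to structural components. Model choice, prompt
# wording, and region names are assumptions for illustration only.
import base64
import json

from openai import OpenAI  # any VLM API with image input would work

client = OpenAI()


def encode_image(path: str) -> str:
    """Base64-encode a rendered view of the AI-generated mesh for the VLM."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def assign_components(render_path: str, object_name: str,
                      regions: list[str]) -> dict:
    """Ask the VLM, zero-shot, which regions also need panel components,
    reasoning from the object's geometry and intended function."""
    prompt = (
        f"The image shows an AI-generated mesh of a {object_name}, "
        f"divided into regions: {', '.join(regions)}. "
        "Every region receives structural components. Based on the object's "
        "geometry and function, decide which regions ALSO need panel "
        "components (e.g., flat enclosing or load-bearing surfaces). "
        'Reply with JSON mapping each region to "panel" or "structural".'
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption; the paper does not name its VLM
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{encode_image(render_path)}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)


# Hypothetical usage: a table's top plausibly gets a panel; its legs stay
# structural.
# assignments = assign_components("table_render.png", "table",
#                                 ["top", "leg_1", "leg_2", "leg_3", "leg_4"])
```

The same call pattern extends naturally to the conversational-refinement step: user feedback can be appended as follow-up messages so the VLM revises its region-to-component mapping.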