Task-oriented conversational systems are essential for efficiently addressing diverse user needs, yet their development requires substantial amounts of high-quality conversational data that are challenging and costly to obtain. While large language models (LLMs) have demonstrated potential for generating synthetic conversations, the extent to which such agent-generated interactions can effectively substitute for real human conversations remains unclear. This work presents the first systematic comparison between LLM-simulated users and human users in personalized task-oriented conversations. We propose a comprehensive analytical framework encompassing three key aspects (conversation strategy, interaction style, and conversation evaluation) and ten distinct dimensions for evaluating user behaviors, and we collect parallel conversational datasets from both human users and LLM agent users across four representative scenarios under identical conditions. Our analysis reveals significant behavioral differences between the two user types in problem-solving approaches, question breadth, user engagement, context dependency, feedback polarity and promise, language style, and hallucination awareness. We also find that agent users and human users behave consistently along the depth-first versus breadth-first dimension and the usefulness dimension. These findings provide critical insights for advancing LLM-based user simulation. Our multi-dimensional taxonomy offers a generalizable framework for analyzing user behavior patterns, yielding insights into both LLM agent users and human users. With this work, we provide perspectives for rethinking how user simulation should be employed in future conversational systems.