Large Language Models (LLMs) excel at static interactions, where they answer user queries by retrieving knowledge encoded in their parameters. However, in many real-world settings, such as educational tutoring or medical assistance, the relevant information is not directly available and must be actively acquired through dynamic interaction. An effective interactive agent should recognize its own uncertainty, ask targeted questions, and retain newly acquired knowledge efficiently. Prior work has primarily explored how a teacher can instruct a student effectively, with the teacher identifying gaps in the student's knowledge and providing guidance. In this work, we shift the focus to the student and investigate effective strategies for actively querying the teacher to obtain useful information. Across math and coding benchmarks on which baseline student models start with near-zero performance, we show that student-led approaches consistently yield absolute Pass@k improvements of at least 0.5 over static baselines. To improve question quality, we train students with Direct Preference Optimization (DPO), using preference signals from either the student itself or stronger student models. We find that this guided training enables smaller models to learn to ask better questions, further improving learning efficiency.
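For reference, a common reading of the two quantities named above, assuming the paper follows the standard formulations (the abstract does not spell out its exact variants): the unbiased Pass@k estimator of Chen et al. (2021), where $n$ samples are drawn per problem and $c$ of them pass,

\[
\text{Pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right],
\]

and the DPO objective of Rafailov et al. (2023), here assumed to compare preferred and dispreferred student questions $(y_w, y_l)$ for a context $x$, with trained policy $\pi_\theta$, frozen reference policy $\pi_{\mathrm{ref}}$, and temperature $\beta$:

\[
\mathcal{L}_{\mathrm{DPO}} \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right].
\]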