AIVisor, an agentic retrieval-augmented LLM for student advising, was used to examine how personalization affects system performance across multiple evaluation dimensions. Using twelve authentic advising questions intentionally designed to stress lexical precision, we compared ten system configurations spanning personalized and non-personalized variants, and analyzed the outcomes with a linear mixed-effects model across lexical (BLEU, ROUGE-L), semantic (METEOR, BERTScore), and grounding (RAGAS) metrics. The results showed a consistent trade-off: personalization reliably improved reasoning quality and grounding, yet introduced a significant negative interaction on semantic similarity, driven not by poorer answers but by the limits of current metrics, which penalize meaningful personalized deviations from generic reference texts. This reveals a structural flaw in prevailing LLM evaluation methods: they are ill-suited to assessing user-specific responses. The fully integrated personalized configuration produced the highest overall gains, suggesting that personalization can enhance system effectiveness when evaluated with appropriate multidimensional metrics. Overall, the study demonstrates that personalization produces metric-dependent shifts rather than uniform improvements, and it provides a methodological foundation for more transparent and robust personalization in agentic AI.
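The analysis described above — a linear mixed-effects model with a personalization × metric interaction and a per-question random effect — can be sketched as follows. This is a minimal illustrative sketch on synthetic data, not the paper's actual dataset or model specification: the variable names, the baseline scores, and the injected effect sizes (personalization lowering BERTScore while raising RAGAS) are all assumptions chosen only to mirror the trade-off the abstract reports.

```python
# Sketch of the reported analysis using statsmodels' MixedLM.
# ASSUMPTIONS: synthetic scores, hypothetical effect sizes, and an
# illustrative formula; the paper's real data and model are not shown here.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_questions, n_configs = 12, 10  # 12 advising questions, 10 configurations

rows = []
for q in range(n_questions):
    q_effect = rng.normal(0, 0.05)  # per-question random intercept
    for c in range(n_configs):
        personalized = c >= 5  # assume half the configs are personalized
        for metric in ["BLEU", "BERTScore", "RAGAS"]:
            base = {"BLEU": 0.30, "BERTScore": 0.85, "RAGAS": 0.70}[metric]
            # Hypothetical trade-off: personalization helps grounding (RAGAS)
            # but is penalized on semantic similarity (BERTScore).
            shift = {"BLEU": 0.0, "BERTScore": -0.03, "RAGAS": 0.05}[metric]
            score = base + q_effect + (shift if personalized else 0.0) \
                    + rng.normal(0, 0.02)
            rows.append({"question": q, "config": c,
                         "personalized": int(personalized),
                         "metric": metric, "score": score})
df = pd.DataFrame(rows)

# Fixed effects: personalization, metric, and their interaction
# (the interaction is the term the abstract reports as significant);
# random intercept grouped by question.
model = smf.mixedlm("score ~ personalized * metric", df,
                    groups=df["question"])
result = model.fit()
print(result.summary())
```

With BERTScore as the reference metric level, the `personalized` coefficient captures the semantic-similarity penalty, while the `personalized:metric[T.RAGAS]` interaction captures the relative grounding gain, which is how a metric-dependent shift (rather than a uniform improvement) shows up in this kind of model.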