The personalization of black-box large language models (LLMs) is a critical yet challenging task. Existing approaches predominantly rely on context injection, where user history is embedded into the prompt to directly guide the generation process. However, this single-step paradigm imposes a dual burden on the model: generating accurate content while simultaneously aligning with user-specific styles. The resulting trade-off often compromises output quality and limits precise control. To address this fundamental tension, we propose Reflective Personalization Optimization (RPO), a novel framework that redefines the personalization paradigm by decoupling content generation from preference alignment. RPO operates in two distinct stages: first, a base model generates a high-quality, generic response; then, an external reflection module explicitly rewrites this output to align with the user's preferences. The reflection module is trained in two stages: supervised fine-tuning on structured rewriting trajectories first establishes a core personalized reasoning policy that models the transformation from generic to user-aligned responses, and reinforcement learning then further refines the quality of the personalized outputs. Comprehensive experiments on the LaMP benchmark demonstrate that RPO, by decoupling content generation from personalization, significantly outperforms state-of-the-art baselines. These findings underscore the superiority of explicit response shaping over implicit context injection. Moreover, RPO provides an efficient, model-agnostic personalization layer that can be seamlessly integrated with any underlying base model, opening an effective new direction for user-centric generation.
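As a rough illustration of the decoupled pipeline described above, the following minimal sketch separates the two inference stages into a plain Python function: a base model first answers the query generically, and a reflection module then rewrites that answer against the user's history. The function name `rpo_generate`, the `complete_fn` callable, the model names, and the prompt wording are hypothetical placeholders for exposition, not the paper's implementation.

```python
from typing import Callable

# Minimal sketch of RPO-style decoupled inference (illustrative only).
# `complete_fn` stands in for any black-box LLM call, e.g. an HTTP
# chat-completion API taking (model_name, prompt) and returning text.

def rpo_generate(
    query: str,
    user_history: list[str],
    complete_fn: Callable[[str, str], str],
    base_model: str = "base-llm",             # placeholder name
    reflection_model: str = "reflection-module",  # placeholder name
) -> str:
    # Stage 1: the base model answers the query without personalization,
    # focusing purely on content quality.
    generic_response = complete_fn(base_model, query)

    # Stage 2: the trained reflection module explicitly rewrites the
    # generic response to align with the user's style and preferences.
    reflection_prompt = (
        "User history:\n" + "\n".join(user_history) + "\n\n"
        f"Query: {query}\n\n"
        f"Generic response: {generic_response}\n\n"
        "Rewrite the generic response to match this user's style and "
        "preferences while preserving its content."
    )
    return complete_fn(reflection_model, reflection_prompt)
```

Because the reflection step only consumes the query, the generic response, and the user history, this layer can sit on top of any base model without access to its weights, which is what makes the approach model-agnostic.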