Reinforcement learning (RL) has recently emerged as a promising approach for aligning text-to-image generative models with human preferences. A key challenge, however, lies in designing effective and interpretable rewards. Existing methods often rely on either composite metrics (e.g., CLIP, OCR, and realism scores) with fixed weights or a single scalar reward distilled from human preference models, both of which can limit interpretability and flexibility. We propose RubricRL, a simple and general framework for rubric-based reward design that offers greater interpretability, composability, and user control. Instead of using a black-box scalar signal, RubricRL dynamically constructs a structured rubric tailored to each prompt: a decomposable checklist of fine-grained visual criteria such as object correctness, attribute accuracy, OCR fidelity, and realism. Each criterion is independently evaluated by a multimodal judge (e.g., o4-mini), and a prompt-adaptive weighting mechanism emphasizes the most relevant dimensions. This design not only produces interpretable and modular supervision signals for policy optimization (e.g., GRPO or PPO), but also enables users to directly adjust which aspects to reward or penalize. Experiments with an autoregressive text-to-image model demonstrate that RubricRL improves prompt faithfulness, visual detail, and generalizability, while offering a flexible and extensible foundation for interpretable RL alignment across text-to-image architectures.
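To make the reward structure concrete, the following is a minimal sketch of how a rubric-based reward could be aggregated, assuming each criterion carries a prompt-adaptive weight and the multimodal judge returns a score in [0, 1] per criterion. The names `Criterion`, `judge_score`, and `rubric_reward` are hypothetical illustrations, not the paper's actual interface.

```python
# Hypothetical sketch of rubric-based reward aggregation (not the paper's implementation).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    name: str        # e.g. "object correctness", "OCR fidelity"
    question: str    # graded question posed to the multimodal judge
    weight: float    # prompt-adaptive weight emphasizing relevant dimensions


def rubric_reward(
    image: object,                                 # generated image in whatever form the judge accepts
    rubric: List[Criterion],
    judge_score: Callable[[object, str], float],   # assumed to return a score in [0, 1] per criterion
) -> float:
    """Weighted, normalized sum of per-criterion judge scores."""
    total_weight = sum(c.weight for c in rubric)
    if total_weight == 0.0:
        return 0.0
    weighted = sum(c.weight * judge_score(image, c.question) for c in rubric)
    # The scalar feeds a policy-gradient optimizer (e.g., GRPO or PPO),
    # while the per-criterion scores remain inspectable and user-adjustable.
    return weighted / total_weight
```

Under this reading, interpretability comes from keeping the per-criterion scores exposed, and user control amounts to editing the rubric entries or reweighting them before aggregation.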