Previous methods evaluate reward models on a fixed pairwise ranking test set, but such evaluations typically offer little insight into performance on individual preference dimensions. In this work, we address this evaluation challenge by probing preference representations. To confirm the effectiveness of this evaluation method, we construct the Multi-dimensional Reward Model Benchmark (MRMBench), a collection of six probing tasks covering different preference dimensions. MRMBench is designed to favor reward models that better capture preferences across these dimensions. Furthermore, we introduce an analysis method, inference-time probing, which identifies the dimensions a reward model relies on during reward prediction and thereby improves its interpretability. Through extensive experiments, we find that MRMBench correlates strongly with the alignment performance of large language models (LLMs), making it a reliable reference for developing advanced reward models. Our analysis of MRMBench results reveals that reward models often struggle to capture preferences across multiple dimensions, highlighting the potential of multi-objective optimization in reward modeling. Additionally, we show that the proposed inference-time probing method offers a reliable measure of the confidence of reward predictions, which ultimately improves the alignment of LLMs.
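To make the probing idea concrete, below is a minimal sketch of training one linear probe per preference dimension on frozen reward-model representations. The feature matrix, the dimension names, and the labels are all illustrative placeholders, not the actual MRMBench tasks or data.

```python
# Minimal linear-probing sketch: one probe per preference dimension on
# frozen reward-model representations. Real feature extraction from a
# reward model is assumed; random placeholders are used for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Placeholder representations: in practice, hidden states produced by the
# reward model for each (prompt, response) pair.
n_examples, hidden_dim = 1000, 256
features = rng.normal(size=(n_examples, hidden_dim))

# Hypothetical preference dimensions; MRMBench defines six probing tasks,
# and the names below are illustrative only.
dimensions = ["helpfulness", "harmlessness", "honesty"]
labels = {d: rng.integers(0, 2, size=n_examples) for d in dimensions}

for dim in dimensions:
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels[dim], test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000)  # linear probe on frozen features
    probe.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, probe.predict(X_te))
    print(f"{dim}: probing accuracy = {acc:.3f}")
```

Per-dimension probing accuracy of this kind is one way to read how well a representation separates each preference dimension; the benchmark's actual probing tasks and metrics are defined in the paper.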