Discourse Heuristics For Paradoxically Moral Self-Correction

Moral self-correction has emerged as a promising approach for aligning the output of Large Language Models (LLMs) with human moral values. However, moral self-correction techniques are subject to two primary paradoxes. First, despite empirical and theoretical evidence to support the effectiveness of self-correction, this LLM capability only operates at a superficial level. Second, while LLMs possess the capability of self-diagnosing immoral aspects of their output, they struggle to identify the cause of this moral inconsistency during their self-correction process. To better understand and address these paradoxes, we analyze the discourse constructions in fine-tuning corpora designed to enhance moral self-correction, uncovering the existence of the heuristics underlying effective constructions. We demonstrate that moral self-correction relies on discourse constructions that reflect heuristic shortcuts, and that the presence of these heuristic shortcuts during self-correction leads to inconsistency when attempting to enhance both self-correction and self-diagnosis capabilities jointly. Based on our findings, we propose a solution to improve moral self-correction by leveraging the heuristics of curated datasets. We also highlight the generalization challenges of this capability, particularly in terms of learning from situated context and model scales.
