Humans display significant uncertainty when confronted with moral dilemmas, yet the extent of such uncertainty in machines and AI agents remains underexplored. Recent studies have confirmed that machine-generated responses, particularly those of large language models (LLMs), tend to be overly confident. As these systems are increasingly embedded in ethical decision-making scenarios, understanding their moral reasoning and its inherent uncertainties is essential for building reliable AI systems. This work examines how uncertainty influences moral decisions in the classical trolley problem, analyzing responses from 32 open-source models across 9 distinct moral dimensions. We first find that variance in model confidence is greater across models than across moral dimensions, suggesting that moral uncertainty is predominantly shaped by model architecture and training method. To quantify uncertainty, we measure the binary entropy of model decisions, decomposing total entropy into conditional entropy and mutual information. To examine its effects, we introduce stochasticity into models via "dropout" at inference time. Our findings show that this mechanism increases total entropy, mainly through a rise in mutual information, while conditional entropy remains largely unchanged. Moreover, it significantly improves human-LLM moral alignment, with shifts in mutual information correlating with shifts in alignment scores. Our results highlight the potential to better align model-generated decisions with human preferences by deliberately modulating uncertainty and reducing LLMs' confidence in morally complex scenarios.
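The entropy decomposition described above can be sketched as follows: given per-pass probabilities for the binary trolley decision obtained from repeated stochastic (dropout-enabled) forward passes, the total entropy of the mean prediction splits into the mean per-pass entropy (conditional entropy) plus the mutual information. The function names and probability values below are illustrative, not the paper's implementation.

```python
import math

def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli(p) decision."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def decompose_uncertainty(probs):
    """Decompose predictive uncertainty over stochastic forward passes.

    probs: one P(decision = "pull the lever") per dropout-enabled pass.
    Returns (total, conditional, mutual_information), where
    total = conditional + mutual_information.
    """
    p_mean = sum(probs) / len(probs)
    total = binary_entropy(p_mean)                                     # entropy of the mean prediction
    conditional = sum(binary_entropy(p) for p in probs) / len(probs)   # mean per-pass entropy
    mi = total - conditional                                           # epistemic (disagreement) component
    return total, conditional, mi
```

Passes that disagree sharply (e.g. probabilities 0.1 and 0.9) yield high mutual information while conditional entropy stays moderate, which matches the abstract's observation that inference-time dropout raises total entropy mainly through mutual information.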