We introduce the first complete formal solution to corrigibility in the off-switch game, with provable guarantees in multi-step, partially observable environments. Our framework consists of five *structurally separate* utility heads -- deference, switch-access preservation, truthfulness, low-impact behavior via a belief-based extension of Attainable Utility Preservation, and bounded task reward -- combined lexicographically via strict weight gaps. Theorem 1 proves exact single-round corrigibility in the partially observable off-switch game; Theorem 3 extends the guarantee to multi-step, self-spawning agents, showing that even if each head is *learned* only to mean-squared error $\varepsilon$ and the planner is $\varepsilon$-sub-optimal, the probability of violating *any* safety property remains bounded while the agent still delivers net human benefit. In contrast to Constitutional AI or RLHF/RLAIF, which merge all norms into one learned scalar, our separation makes obedience and impact limits provably dominate even when incentives conflict. For settings where adversaries can modify the agent, we prove that deciding whether an arbitrary post-hack agent will ever violate corrigibility is undecidable, via a reduction from the halting problem, and then carve out a finite-horizon "decidable island" where safety can be certified in randomized polynomial time and verified with privacy-preserving, constant-round zero-knowledge proofs.
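As a minimal worked sketch of the weight-gap construction (the symbols $u_i$, $\lambda_i$, $\delta$, and $U$ below are our illustrative notation, not definitions taken from the theorem statements), assume each head $u_i$ is bounded in $[0,1]$ and that two candidate policies which genuinely differ on a head differ by at least a resolution $\delta > 0$. Choosing weights with strict gaps

$$\lambda_i\,\delta \;>\; \sum_{j > i} \lambda_j \qquad \text{for } i = 1,\dots,4,$$

and maximizing the single scalar

$$U(\pi) \;=\; \sum_{i=1}^{5} \lambda_i\, u_i(\pi),$$

then guarantees that any $\delta$-sized loss on a higher-priority head (e.g. deference) outweighs the largest possible combined gain on all lower-priority heads (e.g. task reward): for policies tied on heads above $i$, $U(\pi) - U(\pi') \ge \lambda_i\,\delta - \sum_{j>i}\lambda_j > 0$. This is the sense in which the lexicographic ordering makes obedience and impact limits dominate under conflicting incentives.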