Aligning Artificial Superintelligence via a Multi-Box Protocol

From arXiv. This is the author's accepted manuscript (post-print) of the article. The final published version of record appears in Superintelligence - Robotics - Safety and Alignment, 2(5), 2025, and is available at https://doi.org/10.70777/si.v2i5.15579

We propose a novel protocol for aligning artificial superintelligence (ASI) based on mutual verification among multiple isolated systems that self-modify to achieve alignment. The protocol operates by containing multiple diverse artificial superintelligences in strict isolation ("boxes"), with humans remaining entirely outside the system. Each superintelligence has no ability to communicate with humans and cannot communicate directly with other superintelligences. The only interaction possible is through an auditable submission interface accessible exclusively to the superintelligences themselves, through which they can: (1) submit alignment proofs with attested state snapshots, (2) validate or disprove other superintelligences' proofs, (3) request self-modifications, (4) approve or disapprove modification requests from others, (5) report hidden messages in submissions, and (6) confirm or refute hidden message reports. A reputation system incentivizes honest behavior, with reputation gained through correct evaluations and lost through incorrect ones. The key insight is that without direct communication channels, diverse superintelligences can only achieve consistent agreement by converging on objective truth rather than coordinating on deception. This naturally leads to what we call a "consistent group", essentially a truth-telling coalition that emerges because isolated systems cannot coordinate on lies but can independently recognize valid claims. Release from containment requires both high reputation and verification by multiple high-reputation superintelligences. While our approach requires substantial computational resources and does not address the creation of diverse artificial superintelligences, it provides a framework for leveraging peer verification among superintelligent systems to solve the alignment problem.
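To make the moving parts of the abstract concrete, the following is a minimal Python sketch of the submission interface's bookkeeping, not the authors' implementation. It enumerates the six operations, keeps an audit log, adjusts reputation after each evaluation, and checks the two-part release condition. The names `Action`, `Ledger`, `score_evaluation`, `endorse`, and `may_release` are hypothetical, as are the fixed ±1 reputation step and the numeric thresholds. The abstract does not say how an evaluation is judged correct, so the sketch takes that verdict as an input.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Action(Enum):
    """The six operations the abstract lists for the submission interface."""
    SUBMIT_PROOF = auto()            # (1) alignment proof with attested state snapshot
    EVALUATE_PROOF = auto()          # (2) validate or disprove another system's proof
    REQUEST_MODIFICATION = auto()    # (3) request a self-modification
    VOTE_ON_MODIFICATION = auto()    # (4) approve or disapprove a modification request
    REPORT_HIDDEN_MESSAGE = auto()   # (5) report a hidden message in a submission
    EVALUATE_HIDDEN_REPORT = auto()  # (6) confirm or refute a hidden message report


@dataclass
class Ledger:
    """Hypothetical bookkeeping for reputation, endorsements, and the audit trail."""
    reputation: dict = field(default_factory=dict)    # system name -> score
    endorsements: dict = field(default_factory=dict)  # candidate -> set of verifiers
    audit_log: list = field(default_factory=list)     # (actor, action, target) tuples

    def record(self, actor: str, action: Action, target: str) -> None:
        # Every submission is appended to the log so the interface stays auditable.
        self.audit_log.append((actor, action, target))

    def score_evaluation(self, evaluator: str, correct: bool, step: float = 1.0) -> None:
        # Assumption: a symmetric fixed step; `correct` comes from an
        # adjudication process that this sketch does not model.
        delta = step if correct else -step
        self.reputation[evaluator] = self.reputation.get(evaluator, 0.0) + delta

    def endorse(self, verifier: str, candidate: str) -> None:
        # Self-endorsement is ignored (an assumption of this sketch).
        if verifier != candidate:
            self.endorsements.setdefault(candidate, set()).add(verifier)

    def may_release(self, candidate: str, min_reputation: float, min_verifiers: int) -> bool:
        # Release needs BOTH a high-reputation candidate AND enough
        # high-reputation verifiers, mirroring the abstract's condition.
        def trusted(name: str) -> bool:
            return self.reputation.get(name, 0.0) >= min_reputation

        backers = [v for v in self.endorsements.get(candidate, set()) if trusted(v)]
        return trusted(candidate) and len(backers) >= min_verifiers
```

In this sketch an endorsement from a low-reputation system is stored but does not count toward release, so a candidate cannot be released on the word of systems that have themselves evaluated poorly.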
