Reliable explainability is not only a technical goal but also a cornerstone of private AI governance. As AI models enter high-stakes sectors, private actors such as auditors, insurers, certification bodies, and procurement agencies require standardized evaluation metrics to assess trustworthiness. However, current metrics for evaluating explainable AI (XAI) remain fragmented and prone to manipulation, undermining accountability and compliance. We argue that standardized metrics can function as governance primitives, embedding auditability and accountability within AI systems to enable effective private oversight. Building on prior work in XAI benchmarking, we identify key limitations in ensuring faithfulness, tamper resistance, and regulatory alignment. Furthermore, interpretability can directly support model alignment by providing a verifiable means of assuring behavioral integrity in General Purpose AI (GPAI) systems. This connection between interpretability and alignment positions XAI metrics as both technical and regulatory instruments that help prevent alignment faking, a growing concern among oversight bodies. We propose a "Governance by Metrics" paradigm that treats explainability evaluation as a central mechanism of private AI governance. Our framework introduces a hierarchical model linking transparency, tamper resistance, scalability, and legal alignment, extending evaluation from model introspection toward systemic accountability. Through conceptual synthesis and alignment with governance standards, we outline a roadmap for integrating explainability metrics into continuous AI assurance pipelines that serve both private oversight and regulatory needs.