Defensibility Signals Escape Agreement Trap in Rule-Governed AI
Defensibility signals address the agreement trap in AI evaluation by prioritizing logical derivability from explicit rules over human-label consensus, revealing roughly 80% of apparent errors to be policy-consistent and filling a key gap in alignment benchmarks.
The result is a new paradigm for evaluating rule-following AI: policy-grounded correctness rather than label agreement.
The paper (arXiv:2604.20972) analyzed more than 193,000 Reddit moderation decisions across communities and found a 33-46.6pp gap between agreement-based and policy-grounded metrics, with 79.8-80.6% of false negatives corresponding to valid policy derivations. Rule specificity drove ambiguity: auditing identical decisions under three rule tiers reduced the Ambiguity Index by 10.8pp while the Defensibility Index stayed stable (O'Herlihy, 2026).
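The agreement-vs-defensibility gap can be illustrated with a minimal sketch. The field names, the toy data, and the exact metric definitions below are assumptions for illustration, not the paper's formulation: agreement scores a decision by match with the human label, while defensibility scores it by whether an explicit rule logically entails it.

```python
# Hypothetical sketch: agreement-based vs policy-grounded scoring of
# moderation decisions. Field names and metric definitions are assumed,
# not taken from the paper.
from dataclasses import dataclass

@dataclass
class Decision:
    model_action: str   # e.g. "remove" or "keep"
    human_label: str    # consensus human label
    derivable: bool     # does an explicit rule logically entail model_action?

def agreement_rate(decisions):
    """Fraction of decisions where the model matches the human label."""
    return sum(d.model_action == d.human_label for d in decisions) / len(decisions)

def defensibility_rate(decisions):
    """Fraction of decisions derivable from explicit policy rules."""
    return sum(d.derivable for d in decisions) / len(decisions)

decisions = [
    Decision("remove", "remove", True),
    Decision("remove", "keep", True),   # disagreement, yet policy-consistent
    Decision("keep", "keep", True),
    Decision("remove", "keep", False),  # disagreement and not derivable
]
gap_pp = 100 * (defensibility_rate(decisions) - agreement_rate(decisions))
print(f"agreement={agreement_rate(decisions):.2f} "
      f"defensibility={defensibility_rate(decisions):.2f} gap={gap_pp:.1f}pp")
# → agreement=0.50 defensibility=0.75 gap=25.0pp
```

On this toy data, the second decision is the paper's central case: a "false negative" under agreement that is nonetheless a valid policy derivation.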
This builds on Constitutional AI's use of rule hierarchies to guide harmlessness (Bai et al., arXiv:2212.08073) and on LLM-as-a-judge critiques showing that preference models conflate ambiguity with error (Zheng et al., arXiv:2306.05685); prior coverage missed how agreement traps exacerbate scalable-oversight failures in alignment pipelines, a gap that mainstream benchmarks such as HELM also ignore.
Probabilistic Defensibility Signals derived from audit-model token logprobs estimate reasoning stability without extra inference passes, attributing variance primarily to governance ambiguity per repeated-sampling analysis; the resulting Governance Gate reaches 78.6% automation coverage with 64.9% risk reduction, exposing a gap that current safety evaluations systematically miss.
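One way such a logprob-based signal could be computed is sketched below. This is an illustrative assumption, not the paper's exact formulation: each audit sample's per-token logprobs are reduced to a mean confidence, and variance in that confidence across repeated samples is mapped to a stability score, with high cross-sample variance read as governance ambiguity.

```python
# Illustrative sketch (not the paper's exact method): estimate a stability
# score from token logprobs of repeated audit-model samples. Identical
# samples give a score of 1.0; cross-sample variance decays it toward 0.
import math
from statistics import mean, pvariance

def sequence_confidence(token_logprobs):
    """Mean token logprob of one sampled audit rationale."""
    return mean(token_logprobs)

def stability_signal(samples):
    """Map variance across repeated samples to a (0, 1] stability score.

    samples: list of per-token logprob lists, one per sampled rationale.
    """
    confidences = [sequence_confidence(s) for s in samples]
    return math.exp(-pvariance(confidences))

stable = [[-0.1, -0.2, -0.1]] * 5   # near-identical rationales across samples
unstable = [[-0.1] * 3, [-2.5] * 3, [-0.2] * 3, [-3.0] * 3, [-1.4] * 3]
print(stability_signal(stable))     # → 1.0
print(stability_signal(unstable))   # well below 1.0: ambiguous governance
```

The design choice here is that no extra audit passes are needed: the logprobs are already emitted during the sampling the pipeline performs anyway, matching the "without extra passes" property described above.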
AXIOM: Defensibility signals show that agreement-based benchmarks misclassify up to 80% of apparent AI moderation errors as failures even when the decisions are logically consistent with explicit rules, exposing a core flaw in alignment evaluation and demanding reasoning-grounded validity metrics instead.
Sources (3)
- [1] Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI (https://arxiv.org/abs/2604.20972)
- [2] Constitutional AI: Harmlessness from AI Feedback (https://arxiv.org/abs/2212.08073)
- [3] Judging LLM-as-a-judge with MT-Bench and Chatbot Arena (https://arxiv.org/abs/2306.05685)