AI · 8 min read · April 26, 2026

Rule-Based AI Needs Policy Grounding, Not Label Agreement

Content moderation systems fail when evaluated by human agreement alone. A new framework measures whether decisions logically follow stated rules instead.

Source: arxiv/cs.AI · Michael O'Herlihy, Rosa Català · open original ↗

Agreement-based evaluation of rule-governed AI systems misclassifies valid decisions as errors; policy-grounded correctness with defensibility signals fixes this.

  • Agreement metrics penalize logically valid decisions when multiple rule-consistent outcomes exist.
  • Defensibility Index measures whether a decision follows from stated policy rules.
  • Ambiguity Index quantifies rule specificity gaps driving disagreement.
  • Probabilistic Defensibility Signal derives reasoning stability from LLM token probabilities without extra audits (see the sketch after this list).
  • A Reddit moderation test found a 33–46.6 percentage-point gap between agreement and policy-grounded scores.
  • 79.8–80.6% of flagged false negatives were actually policy-consistent decisions.
  • Governance Gate automation achieved 78.6% coverage with 64.9% risk reduction.
  • Rule clarity directly reduces measured ambiguity; defensibility remains stable.
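
How such a signal and gate might look in practice: a minimal sketch, assuming the verdict span's per-token log-probabilities are available (as most LLM APIs expose). The geometric-mean aggregation and the 0.8 gate threshold are illustrative assumptions, not the paper's formulas.

```python
import math

def probabilistic_defensibility_signal(token_logprobs: list[float]) -> float:
    """Aggregate per-token log-probabilities of a model's verdict span
    into a stability score in (0, 1].

    Assumed aggregation: length-normalized sequence probability, i.e. the
    geometric mean of token probabilities. The paper's exact formula may
    differ; the point is that the signal comes from token probabilities
    alone, with no extra audit pass.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)

def governance_gate(pds: float, threshold: float = 0.8) -> str:
    """Automate only decisions whose defensibility signal clears a
    threshold; escalate the rest to human review. The threshold here is
    illustrative, not the paper's calibrated value."""
    return "automate" if pds >= threshold else "human_review"

# Example: log-probs for the tokens of a verdict such as "Remove: Rule 3".
verdict_logprobs = [-0.05, -0.20, -0.10, -0.35]
pds = probabilistic_defensibility_signal(verdict_logprobs)
print(f"PDS = {pds:.3f} -> {governance_gate(pds)}")  # PDS = 0.839 -> automate
```

In this framing, the reported 78.6% coverage would correspond to the fraction of decisions the gate automates, and the 64.9% risk reduction to the drop in indefensible decisions that slip through.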

Frequently asked

  • Why do agreement metrics penalize valid decisions? When multiple decisions logically satisfy the same policy, agreement metrics treat valid alternatives as errors. A post might violate Rule A but not Rule B; both interpretations are defensible. Agreement-based evaluation counts this ambiguity as model failure when it actually reflects rule ambiguity. Policy-grounded evaluation asks whether the decision follows from the stated rules, not whether it matches a historical label.
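
To make the distinction concrete, here is a minimal sketch assuming rules can be modeled as boolean predicates over the post text. The rule names, the toy policy, and the ambiguity proxy are hypothetical illustrations, not the paper's definitions of the Defensibility and Ambiguity Indices.

```python
from typing import Callable

# A rule is modeled as a predicate that fires when a post violates it.
Rule = Callable[[str], bool]

def rule_consistent_outcomes(post: str, rules: dict[str, Rule]) -> set[str]:
    """Every outcome the stated policy logically supports: removal under
    each rule that fires, or 'keep' when no rule fires."""
    fired = {name for name, rule in rules.items() if rule(post)}
    return {f"remove:{name}" for name in fired} or {"keep"}

def is_defensible(decision: str, post: str, rules: dict[str, Rule]) -> bool:
    """Policy-grounded correctness: the decision follows from the rules."""
    return decision in rule_consistent_outcomes(post, rules)

def ambiguity(post: str, rules: dict[str, Rule]) -> int:
    """Crude ambiguity proxy: the number of distinct outcomes the policy
    supports for this post (the paper's index is likely more refined)."""
    return len(rule_consistent_outcomes(post, rules))

# Hypothetical toy policy, purely for illustration.
rules: dict[str, Rule] = {
    "A_no_links": lambda p: "http" in p,
    "B_no_spoilers": lambda p: "spoiler" in p.lower(),
}
post = "Spoiler: the twist is explained at http://example.com"
label = "remove:B_no_spoilers"    # historical moderator label
decision = "remove:A_no_links"    # model's decision

print(decision == label)                     # False: agreement counts an error
print(is_defensible(decision, post, rules))  # True: follows from Rule A
print(ambiguity(post, rules))                # 2: two removals are defensible
```

Agreement scores the model wrong because it cited a different rule than the historical label; the defensibility check passes because the decision still follows from the stated policy. This is the kind of gap the 33–46.6 percentage-point figures above quantify.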
