AI · 8 min read · April 26, 2026

Rule-Based AI Needs Policy Grounding, Not Label Agreement

Content moderation systems fail when evaluated by human agreement alone. A new framework measures whether decisions logically follow stated rules instead.

Source: arxiv/cs.AI · Michael O'Herlihy, Rosa Català · open original ↗

Agreement-based evaluation of rule-governed AI systems misclassifies valid decisions as errors; policy-grounded correctness with defensibility signals fixes this.

  • Agreement metrics penalize logically valid decisions when multiple rule-consistent outcomes exist.
  • Defensibility Index measures whether a decision follows from stated policy rules.
  • Ambiguity Index quantifies rule specificity gaps driving disagreement.
  • Probabilistic Defensibility Signal derives reasoning stability from LLM token probabilities without extra audits (see the sketch after this list).
  • A Reddit moderation test found a 33–46.6 percentage-point gap between agreement and policy-grounded scores.
  • 79.8–80.6% of flagged false negatives were actually policy-consistent decisions.
  • Governance Gate automation achieved 78.6% coverage with 64.9% risk reduction.
  • Rule clarity directly reduces measured ambiguity; defensibility remains stable.
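
How such a signal and gate might look in practice: a minimal sketch, assuming the verdict span's per-token log-probabilities are available (as most LLM APIs expose). The geometric-mean aggregation and the 0.8 gate threshold are illustrative assumptions, not the paper's formulas.

```python
import math

def probabilistic_defensibility_signal(token_logprobs: list[float]) -> float:
    """Aggregate per-token log-probabilities of a model's verdict span
    into a stability score in (0, 1].

    Assumed aggregation: length-normalized sequence probability, i.e. the
    geometric mean of token probabilities. The paper's exact formula may
    differ; the point is that the signal comes from token probabilities
    alone, with no extra audit pass.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)

def governance_gate(pds: float, threshold: float = 0.8) -> str:
    """Automate only decisions whose defensibility signal clears a
    threshold; escalate the rest to human review. The threshold here is
    illustrative, not the paper's calibrated value."""
    return "automate" if pds >= threshold else "human_review"

# Example: log-probs for the tokens of a verdict such as "Remove: Rule 3".
verdict_logprobs = [-0.05, -0.20, -0.10, -0.35]
pds = probabilistic_defensibility_signal(verdict_logprobs)
print(f"PDS = {pds:.3f} -> {governance_gate(pds)}")  # PDS = 0.839 -> automate
```

In this framing, the reported 78.6% coverage would correspond to the fraction of decisions the gate automates, and the 64.9% risk reduction to the drop in indefensible decisions that slip through.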

Frequently asked

  • Why do agreement metrics penalize valid decisions? When multiple decisions logically satisfy the same policy, agreement metrics treat valid alternatives as errors. A post might violate Rule A but not Rule B; both interpretations are defensible. Agreement-based evaluation counts this ambiguity as model failure when it actually reflects rule ambiguity. Policy-grounded evaluation asks whether the decision follows from the stated rules, not whether it matches a historical label.
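
To make the distinction concrete, here is a minimal sketch assuming rules can be modeled as boolean predicates over the post text. The rule names, the toy policy, and the ambiguity proxy are hypothetical illustrations, not the paper's definitions of the Defensibility and Ambiguity Indices.

```python
from typing import Callable

# A rule is modeled as a predicate that fires when a post violates it.
Rule = Callable[[str], bool]

def rule_consistent_outcomes(post: str, rules: dict[str, Rule]) -> set[str]:
    """Every outcome the stated policy logically supports: removal under
    each rule that fires, or 'keep' when no rule fires."""
    fired = {name for name, rule in rules.items() if rule(post)}
    return {f"remove:{name}" for name in fired} or {"keep"}

def is_defensible(decision: str, post: str, rules: dict[str, Rule]) -> bool:
    """Policy-grounded correctness: the decision follows from the rules."""
    return decision in rule_consistent_outcomes(post, rules)

def ambiguity(post: str, rules: dict[str, Rule]) -> int:
    """Crude ambiguity proxy: the number of distinct outcomes the policy
    supports for this post (the paper's index is likely more refined)."""
    return len(rule_consistent_outcomes(post, rules))

# Hypothetical toy policy, purely for illustration.
rules: dict[str, Rule] = {
    "A_no_links": lambda p: "http" in p,
    "B_no_spoilers": lambda p: "spoiler" in p.lower(),
}
post = "Spoiler: the twist is explained at http://example.com"
label = "remove:B_no_spoilers"    # historical moderator label
decision = "remove:A_no_links"    # model's decision

print(decision == label)                     # False: agreement counts an error
print(is_defensible(decision, post, rules))  # True: follows from Rule A
print(ambiguity(post, rules))                # 2: two removals are defensible
```

Agreement scores the model wrong because it cited a different rule than the historical label; the defensibility check passes because the decision still follows from the stated policy. This is the kind of gap the 33–46.6 percentage-point figures above quantify.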
