Artificial Intelligence · 8 min read · April 26, 2026
Rule-Based AI Needs Policy Grounding, Not Label Agreement
Content moderation systems fail when evaluated by human agreement alone. A new framework measures whether decisions logically follow stated rules instead.
Agreement-based evaluation of rule-governed AI systems misclassifies valid decisions as errors; policy-grounded correctness with defensibility signals fixes this.
- Agreement metrics penalize logically valid decisions when multiple rule-consistent outcomes exist.
- The Defensibility Index measures whether a decision follows from the stated policy rules.
- The Ambiguity Index quantifies the rule-specificity gaps that drive disagreement.
- The Probabilistic Defensibility Signal derives reasoning stability from LLM token probabilities, without extra audit passes (see the sketch after this list).
- A Reddit moderation test found a 33–46.6 percentage-point gap between agreement and policy-grounded scores.
- 79.8–80.6% of flagged false negatives were actually policy-consistent decisions.
- Governance Gate automation achieved 78.6% coverage with a 64.9% risk reduction.
- Rule clarity directly reduces measured ambiguity, while defensibility remains stable.
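To make these metrics concrete, here is a minimal Python sketch under strong assumptions: rules are modeled as three-valued predicates (violated / not violated / underspecified), decisions are binary remove-or-keep, and the Probabilistic Defensibility Signal is approximated as the geometric-mean token probability of the model's decision rationale. The rule set, example posts, and the exact signal formula are all hypothetical, not the paper's implementation.

```python
import math
from statistics import mean

# Hypothetical rule set: each rule maps a post to True (violated),
# False (not violated), or None (the rule is underspecified here).
RULES = {
    "no_harassment": lambda p: True if "slur" in p["text"]
                     else (None if "sarcasm" in p["text"] else False),
    "no_spam":       lambda p: p["links"] > 3,
}

def consistent_decisions(post):
    """Decisions that logically follow from the stated rules."""
    verdicts = [rule(post) for rule in RULES.values()]
    if any(v is True for v in verdicts):
        return {"remove"}                 # some rule clearly fires
    if all(v is False for v in verdicts):
        return {"keep"}                   # no rule can fire
    return {"remove", "keep"}             # underspecified: both defensible

def defensibility_index(items):
    """Share of decisions entailed by the policy (decision, not label)."""
    return mean(d in consistent_decisions(p) for p, d in items)

def ambiguity_index(items):
    """Share of items where more than one decision is rule-consistent."""
    return mean(len(consistent_decisions(p)) > 1 for p, _ in items)

def probabilistic_defensibility_signal(token_logprobs):
    """Stability proxy from the decision's token log-probabilities:
    geometric-mean token probability; higher means more stable."""
    return math.exp(mean(token_logprobs))

items = [
    ({"text": "buy now!!", "links": 5}, "remove"),        # clear spam
    ({"text": "heavy sarcasm here", "links": 0}, "keep"), # underspecified
]
print(defensibility_index(items))  # 1.0: both decisions rule-consistent
print(ambiguity_index(items))      # 0.5: one item admits both outcomes
print(probabilistic_defensibility_signal([-0.1, -0.3, -0.05]))  # ~0.86
```

Note the design choice this encodes: an item only counts against the model when its decision falls outside the rule-consistent set, so ambiguity is measured separately rather than charged as error.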
Sık sorulanlar
- Why do agreement metrics fail for rule-governed decisions? When multiple decisions logically satisfy the same policy, agreement metrics treat valid alternatives as errors. A post might violate Rule A but not Rule B; both interpretations are defensible. Agreement-based evaluation penalizes this ambiguity as model failure when it actually reflects rule ambiguity. Policy-grounded evaluation instead asks whether the decision follows from the stated rules, not whether it matches a historical label, as the sketch below illustrates.
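As a toy illustration of that distinction, the snippet below scores a single contested post both ways. The rule-consistent set, the gold label, and the model decision are all hypothetical stand-ins for the Rule A / Rule B scenario above.

```python
def agreement_score(decision, gold_label):
    """Classic evaluation: match the historical human label."""
    return float(decision == gold_label)

def policy_grounded_score(decision, rule_consistent):
    """Policy-grounded evaluation: fall within the rule-consistent set."""
    return float(decision in rule_consistent)

rule_consistent = {"remove", "keep"}  # Rule A permits removal; Rule B does not require it
gold_label = "keep"                   # the annotator happened to read Rule B
model_decision = "remove"             # the model read Rule A

print(agreement_score(model_decision, gold_label))             # 0.0: counted as an error
print(policy_grounded_score(model_decision, rule_consistent))  # 1.0: still defensible
```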