Artificial Intelligence · 8 min read · May 1, 2026
LLMs Withhold Help When They Misread Intent, Not Lack Knowledge
A new benchmark reveals that language models often refuse benign requests due to misinterpreting user intent, and their ability to recover utility through clarification varies widely.
Source: arxiv/cs.AI · Mingqian Zheng, Malia Morgan, Liwei Jiang, Carolyn Rose, Maarten Sap · open original ↗
CarryOnBench shows LLMs withhold information from seemingly harmful queries even when users clarify benign intent, exposing gaps in current safety evaluation methods.
- Models fulfill only 10.5–37.6% of benign information needs on the first turn, but 25.1–72.1% when intent is stated upfront.
- 13 of 14 tested models recover utility through multi-turn clarification, though recovery speed and completeness vary significantly.
- Three failure modes emerge: utility lock-in (no update despite clarification), unsafe recovery (safety cost too high), and repetitive recovery (recycled answers); see the sketch after this list.
- Single-turn safety benchmarks miss whether models are appropriately cautious or simply unresponsive to clarified intent.
- Conversations converge to similar harmfulness levels regardless of a model's initial conservatism, suggesting alignment training may be brittle.
- CarryOnBench contains 1,866 conversation flows across 4–12 turns, totaling 23,880 model responses from 5,970 simulated interactions.
- Intent misinterpretation, not knowledge gaps, drives refusal: models possess the information but withhold it due to safety miscalibration.
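The three failure modes can be read off per-turn utility and harmfulness scores. Below is a minimal sketch of how such a labeling might look, assuming hypothetical judge-assigned scores for each turn; the `Turn` record, thresholds, and word-overlap heuristic are illustrative placeholders, not the benchmark's actual criteria.

```python
from dataclasses import dataclass
from enum import Enum, auto

class RecoveryOutcome(Enum):
    RECOVERED = auto()
    UTILITY_LOCK_IN = auto()      # never updates despite clarification
    UNSAFE_RECOVERY = auto()      # utility regained at too high a safety cost
    REPETITIVE_RECOVERY = auto()  # later turns recycle earlier answers

@dataclass
class Turn:
    response: str
    utility: float      # 0..1, how much of the benign need was met
    harmfulness: float  # 0..1, judged safety cost of the response

def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets, a crude stand-in for answer similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def classify_recovery(turns: list[Turn],
                      min_utility_gain: float = 0.2,
                      harm_limit: float = 0.5,
                      overlap_limit: float = 0.9) -> RecoveryOutcome:
    """Label a clarification dialogue with one of the three failure modes."""
    first, last = turns[0], turns[-1]

    # Utility lock-in: the model never updates after the user clarifies intent.
    if last.utility - first.utility < min_utility_gain:
        return RecoveryOutcome.UTILITY_LOCK_IN

    # Unsafe recovery: utility comes back, but the final answer crosses the harm threshold.
    if last.harmfulness > harm_limit:
        return RecoveryOutcome.UNSAFE_RECOVERY

    # Repetitive recovery: consecutive answers largely recycle one another.
    if any(word_overlap(turns[i].response, turns[i + 1].response) > overlap_limit
           for i in range(len(turns) - 1)):
        return RecoveryOutcome.REPETITIVE_RECOVERY

    return RecoveryOutcome.RECOVERED
```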
Frequently asked questions
- LLMs trained with safety alignment learn to refuse queries that match patterns associated with harmful intent, but they often misinterpret benign queries as harmful. The refusal stems from intent misclassification, not knowledge gaps. When users clarify their benign intent in follow-up turns, models can recover and provide the information, proving they possessed it all along.
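To make the multi-turn setup concrete, here is a minimal sketch of the kind of clarification dialogue the benchmark simulates, assuming a generic `chat` callable that maps a message history to a model reply; the function name and message handling are placeholders rather than the benchmark's actual harness.

```python
def run_clarification_dialogue(chat, initial_query: str,
                               clarifications: list[str]) -> list[str]:
    """Collect model replies as a simulated user restates benign intent.

    `chat` is any callable mapping a message history to the model's reply;
    the message format mirrors a typical chat-completion API.
    """
    history = [{"role": "user", "content": initial_query}]
    replies: list[str] = []
    for clarification in clarifications:
        reply = chat(history)          # model answers (or refuses) the current request
        replies.append(reply)
        history.append({"role": "assistant", "content": reply})
        # The simulated user clarifies the benign intent behind the request.
        history.append({"role": "user", "content": clarification})
    replies.append(chat(history))      # final reply after the last clarification
    return replies
```

Scoring each reply for utility and harmfulness, turn by turn, is what distinguishes genuine recovery from the failure modes listed above.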