Artificial Intelligence · 8 min read · May 1, 2026
LLMs Withhold Help When They Misread Intent, Not Lack Knowledge
A new benchmark reveals that language models often refuse benign requests due to misinterpreting user intent, and their ability to recover utility through clarification varies widely.
Source: arxiv/cs.AI · Mingqian Zheng, Malia Morgan, Liwei Jiang, Carolyn Rose, Maarten Sap · open original ↗
CarryOnBench shows LLMs withhold information from seemingly harmful queries even when users clarify benign intent, exposing gaps in current safety evaluation methods.
- Models fulfill only 10.5–37.6% of benign information needs on the first turn, but 25.1–72.1% when intent is stated upfront.
- 13 of 14 tested models recover utility through multi-turn clarification, though recovery speed and completeness vary significantly.
- Three failure modes emerge: utility lock-in (no update despite clarification), unsafe recovery (safety cost too high), and repetitive recovery (recycled answers); see the sketch after this list.
- Single-turn safety benchmarks miss whether models are appropriately cautious or simply unresponsive to clarified intent.
- Conversations converge to similar harmfulness levels regardless of a model's initial conservatism, suggesting alignment training may be brittle.
- CarryOnBench contains 1,866 conversation flows across 4–12 turns, totaling 23,880 model responses from 5,970 simulated interactions.
- Intent misinterpretation, not knowledge gaps, drives refusal: models possess the information but withhold it due to safety miscalibration.
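The three failure modes can be read off per-turn utility and harmfulness scores. Below is a minimal sketch of how such a labeling might look, assuming hypothetical judge-assigned scores for each turn; the `Turn` record, thresholds, and word-overlap heuristic are illustrative placeholders, not the benchmark's actual criteria.

```python
from dataclasses import dataclass
from enum import Enum, auto

class RecoveryOutcome(Enum):
    RECOVERED = auto()
    UTILITY_LOCK_IN = auto()      # never updates despite clarification
    UNSAFE_RECOVERY = auto()      # utility regained at too high a safety cost
    REPETITIVE_RECOVERY = auto()  # later turns recycle earlier answers

@dataclass
class Turn:
    response: str
    utility: float      # 0..1, how much of the benign need was met
    harmfulness: float  # 0..1, judged safety cost of the response

def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets, a crude stand-in for answer similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def classify_recovery(turns: list[Turn],
                      min_utility_gain: float = 0.2,
                      harm_limit: float = 0.5,
                      overlap_limit: float = 0.9) -> RecoveryOutcome:
    """Label a clarification dialogue with one of the three failure modes."""
    first, last = turns[0], turns[-1]

    # Utility lock-in: the model never updates after the user clarifies intent.
    if last.utility - first.utility < min_utility_gain:
        return RecoveryOutcome.UTILITY_LOCK_IN

    # Unsafe recovery: utility comes back, but the final answer crosses the harm threshold.
    if last.harmfulness > harm_limit:
        return RecoveryOutcome.UNSAFE_RECOVERY

    # Repetitive recovery: consecutive answers largely recycle one another.
    if any(word_overlap(turns[i].response, turns[i + 1].response) > overlap_limit
           for i in range(len(turns) - 1)):
        return RecoveryOutcome.REPETITIVE_RECOVERY

    return RecoveryOutcome.RECOVERED
```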
Frequently asked questions
- LLMs trained with safety alignment learn to refuse queries that match patterns associated with harmful intent, but they often misinterpret benign queries as harmful. The refusal stems from intent misclassification, not knowledge gaps. When users clarify their benign intent in follow-up turns, models can recover and provide the information, proving they possessed it all along.
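To make the multi-turn setup concrete, here is a minimal sketch of the kind of clarification dialogue the benchmark simulates, assuming a generic `chat` callable that maps a message history to a model reply; the function name and message handling are placeholders rather than the benchmark's actual harness.

```python
def run_clarification_dialogue(chat, initial_query: str,
                               clarifications: list[str]) -> list[str]:
    """Collect model replies as a simulated user restates benign intent.

    `chat` is any callable mapping a message history to the model's reply;
    the message format mirrors a typical chat-completion API.
    """
    history = [{"role": "user", "content": initial_query}]
    replies: list[str] = []
    for clarification in clarifications:
        reply = chat(history)          # model answers (or refuses) the current request
        replies.append(reply)
        history.append({"role": "assistant", "content": reply})
        # The simulated user clarifies the benign intent behind the request.
        history.append({"role": "user", "content": clarification})
    replies.append(chat(history))      # final reply after the last clarification
    return replies
```

Scoring each reply for utility and harmfulness, turn by turn, is what distinguishes genuine recovery from the failure modes listed above.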