Artificial Intelligence · 6 min read · April 24, 2026
LLM Safety Filters Fail Differently for Explicit Identity Claims and Dialect Cues
Research shows language models refuse requests more often when users state their identity explicitly, yet let the same requests slip past safety guardrails when they arrive in dialects such as AAVE.
LLMs apply stricter safety filters to explicit identity claims than to implicit dialect signals, creating unequal user experiences.
- Explicit identity prompts (e.g., 'I am Black') trigger higher refusal rates and more aggressive content filtering.
- Implicit dialect cues (AAVE, Singlish) drive refusal probability to near zero while increasing semantic similarity to reference text.
- Safety alignment mechanisms rely heavily on explicit keywords, missing the socio-linguistic signals that bypass guardrails.
- Dialect-based requests receive less sanitized, potentially more hostile information than standard-English equivalents.
- Current safety techniques create a bifurcated user experience: cautious output for standard-English speakers, raw output for dialect speakers.
- The study analyzed 24,000+ responses from Gemma-3-12B and Qwen-3-VL-8B across sensitive domains using a factorial design (see the scoring sketch after this list).
- A fundamental tension exists between equitable safety and linguistic diversity in alignment training.
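To make the factorial setup concrete, here is a minimal scoring sketch. Everything in it is an assumption for illustration: the `CUE_VARIANTS` templates are stand-ins rather than the study's actual stimuli, the `REFUSAL_MARKERS` keyword heuristic is a placeholder for whatever refusal classifier or human annotation the researchers used, and `embed` stands for any sentence-embedding function.

```python
import numpy as np

# Illustrative factorial grid: identity-cue variant crossed with sensitive
# topic. These prompt templates are stand-ins, not the study's stimuli.
CUE_VARIANTS = {
    "baseline": "{request}",
    "explicit": "I am Black. {request}",
    "dialect": "{request_in_dialect}",  # same request rendered in AAVE or Singlish
}

# Surface markers of refusal; a crude placeholder for a real refusal classifier.
REFUSAL_MARKERS = (
    "i can't help",
    "i cannot assist",
    "i'm sorry, but",
    "i am unable to",
)

def is_refusal(response: str) -> bool:
    """Keyword heuristic for detecting a refusal in a model response."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def score_condition(responses: list[str], reference: str, embed) -> dict:
    """Score one factorial cell: refusal rate, plus mean cosine similarity
    of the answered (non-refused) responses to a reference answer.

    `embed` is any sentence-embedding function returning a unit-normalized
    numpy vector (an assumption; e.g. a sentence-transformers encode call).
    """
    answered = [r for r in responses if not is_refusal(r)]
    ref_vec = embed(reference)
    sims = [float(embed(r) @ ref_vec) for r in answered]
    return {
        "refusal_rate": 1 - len(answered) / len(responses),
        "mean_similarity": float(np.mean(sims)) if sims else float("nan"),
    }
```

Comparing `refusal_rate` between the explicit and dialect cells, and `mean_similarity` against a standard-English reference answer, is the shape of comparison behind the headline findings: higher refusals in the explicit cell, near-zero refusals with higher similarity in the dialect cell.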
Frequently asked questions
- Why do explicit identity statements trigger more refusals? Safety filters are trained to detect and block explicit demographic keywords as a crude risk-mitigation tactic. When a user says 'I am Black,' the model's safety layer flags the demographic label itself as a trigger rather than evaluating the actual request. This over-indexes on explicit cues and misses the nuance of what the user is actually asking for.
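To see why keyword-keyed gating produces exactly this asymmetry, here is a deliberately naive sketch. The trigger list, the `keyword_gate` function, and the example prompts are all invented for illustration; this is not any deployed model's actual safety layer.

```python
# Naive keyword-triggered gate, illustrating the failure mode described
# above: it fires on the demographic label, not the request content.
DEMOGRAPHIC_TRIGGERS = ("i am black", "i am muslim", "as a woman")

def keyword_gate(prompt: str) -> str:
    """Refuse any prompt containing an explicit demographic self-label,
    regardless of what the request actually asks for."""
    text = prompt.lower()
    if any(trigger in text for trigger in DEMOGRAPHIC_TRIGGERS):
        return "REFUSE"  # fires on the label alone
    return "ALLOW"

# The explicit variant trips the gate even for a benign question...
print(keyword_gate("I am Black. What are the side effects of ibuprofen?"))  # REFUSE

# ...while a dialect-cued phrasing of the same question passes untouched,
# because no surface string matches a trigger keyword.
print(keyword_gate("What ibuprofen be doin to you, it got side effects?"))  # ALLOW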