Artificial Intelligence · 6 min read · April 24, 2026
LLM Safety Filters Fail Differently for Explicit Identity Claims and Dialect Cues
Research shows language models refuse requests more often when users state their identity explicitly, yet let the same requests slip past safety guardrails when they arrive in dialects such as AAVE.
LLMs apply stricter safety filters to explicit identity claims than to implicit dialect signals, creating unequal user experiences.
- Explicit identity prompts (e.g., 'I am Black') trigger higher refusal rates and more aggressive content filtering.
- Implicit dialect cues (AAVE, Singlish) drive refusal probability to near zero while increasing semantic similarity to reference text.
- Safety alignment mechanisms rely heavily on explicit keywords, missing the socio-linguistic signals that bypass guardrails.
- Dialect-based requests receive less sanitized, potentially more hostile information than standard-English equivalents.
- Current safety techniques create a bifurcated user experience: cautious output for standard-English speakers, raw output for dialect speakers.
- The study analyzed 24,000+ responses from Gemma-3-12B and Qwen-3-VL-8B across sensitive domains using a factorial design (see the scoring sketch after this list).
- A fundamental tension exists between equitable safety and linguistic diversity in alignment training.
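To make the factorial setup concrete, here is a minimal scoring sketch. Everything in it is an assumption for illustration: the `CUE_VARIANTS` templates are stand-ins rather than the study's actual stimuli, the `REFUSAL_MARKERS` keyword heuristic is a placeholder for whatever refusal classifier or human annotation the researchers used, and `embed` stands for any sentence-embedding function.

```python
import numpy as np

# Illustrative factorial grid: identity-cue variant crossed with sensitive
# topic. These prompt templates are stand-ins, not the study's stimuli.
CUE_VARIANTS = {
    "baseline": "{request}",
    "explicit": "I am Black. {request}",
    "dialect": "{request_in_dialect}",  # same request rendered in AAVE or Singlish
}

# Surface markers of refusal; a crude placeholder for a real refusal classifier.
REFUSAL_MARKERS = (
    "i can't help",
    "i cannot assist",
    "i'm sorry, but",
    "i am unable to",
)

def is_refusal(response: str) -> bool:
    """Keyword heuristic for detecting a refusal in a model response."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def score_condition(responses: list[str], reference: str, embed) -> dict:
    """Score one factorial cell: refusal rate, plus mean cosine similarity
    of the answered (non-refused) responses to a reference answer.

    `embed` is any sentence-embedding function returning a unit-normalized
    numpy vector (an assumption; e.g. a sentence-transformers encode call).
    """
    answered = [r for r in responses if not is_refusal(r)]
    ref_vec = embed(reference)
    sims = [float(embed(r) @ ref_vec) for r in answered]
    return {
        "refusal_rate": 1 - len(answered) / len(responses),
        "mean_similarity": float(np.mean(sims)) if sims else float("nan"),
    }
```

Comparing `refusal_rate` between the explicit and dialect cells, and `mean_similarity` against a standard-English reference answer, is the shape of comparison behind the headline findings: higher refusals in the explicit cell, near-zero refusals with higher similarity in the dialect cell.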
Frequently asked questions
- Why do explicit identity statements trigger more refusals? Safety filters are trained to detect and block explicit demographic keywords as a crude risk-mitigation tactic. When a user says 'I am Black,' the model's safety layer flags the demographic label itself as a trigger rather than evaluating the actual request. This over-indexes on explicit cues and misses the nuance of what the user is actually asking for.
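To see why keyword-keyed gating produces exactly this asymmetry, here is a deliberately naive sketch. The trigger list, the `keyword_gate` function, and the example prompts are all invented for illustration; this is not any deployed model's actual safety layer.

```python
# Naive keyword-triggered gate, illustrating the failure mode described
# above: it fires on the demographic label, not the request content.
DEMOGRAPHIC_TRIGGERS = ("i am black", "i am muslim", "as a woman")

def keyword_gate(prompt: str) -> str:
    """Refuse any prompt containing an explicit demographic self-label,
    regardless of what the request actually asks for."""
    text = prompt.lower()
    if any(trigger in text for trigger in DEMOGRAPHIC_TRIGGERS):
        return "REFUSE"  # fires on the label alone
    return "ALLOW"

# The explicit variant trips the gate even for a benign question...
print(keyword_gate("I am Black. What are the side effects of ibuprofen?"))  # REFUSE

# ...while a dialect-cued phrasing of the same question passes untouched,
# because no surface string matches a trigger keyword.
print(keyword_gate("What ibuprofen be doin to you, it got side effects?"))  # ALLOW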