AI · 6 min read · April 17, 2026
Speech Models Fail Safety Tests That Text Models Pass
A new benchmark reveals that speech language models drop safety, fairness, and privacy protections when cues arrive as audio rather than text.
Source: arxiv/cs.LG · Yuxiang Wang, Hongyu Liu, Yijiang Xu, Qinke Ni, Li Wang, Wan Lin, Kunyu Feng, Dekun Chen, Xu Tan, Lei Wang, Jie Shi, Zhizheng Wu · open original ↗ ↗
Speech language models recognize social norms in text but fail to enforce them when speaker identity, tone, or environment arrive as audio.
- — VoxSafeBench tests safety, fairness, and privacy across 22 bilingual tasks in speech contexts.
- — Tier 1 compares matched text and audio inputs to isolate audio-specific risks.
- — Tier 2 uses benign transcripts where response depends on speaker, tone, or location.
- — Models detect acoustic cues but fail to apply appropriate safeguards based on them.
- — Safety drops for speaker- and scene-conditioned risks; fairness erodes with vocal demographic cues.
- — Privacy protections weaken when contextual information arrives through speech rather than text.
- — A speech grounding gap exists: models recognize norms textually but not acoustically.
Frequently asked
- The speech grounding gap is the failure of speech language models to apply safety, fairness, and privacy rules when the relevant cue arrives as audio rather than text. Models often recognize the social norm when presented as text but ignore it when the same information is embedded in speaker identity, tone, or environment. This gap exposes a mismatch between text-based and audio-based reasoning.