AI · 6 min read · April 17, 2026

Speech Models Fail Safety Tests That Text Passes

VoxSafeBench reveals that speech language models recognize social norms in text but ignore them when the same cues arrive through voice, speaker identity, or environment.

Source: arxiv/cs.LG · Yuxiang Wang, Hongyu Liu, Yijiang Xu, Qinke Ni, Li Wang, Wan Lin, Kunyu Feng, Dekun Chen, Xu Tan, Lei Wang, Jie Shi, Zhizheng Wu · open original ↗

Speech language models degrade on safety, fairness, and privacy when context shifts from text to audio cues.

  • VoxSafeBench tests SLMs across safety, fairness, and privacy using matched text and audio pairs.
  • Tier 1 evaluates identical content in text and speech; Tier 2 tests benign transcripts with risky acoustic context.
  • Models detect speaker identity, tone, and environment but fail to apply appropriate safeguards based on these cues.
  • Safety awareness drops when speaker or scene context arrives through speech rather than text description.
  • Fairness erodes when demographic differences are conveyed vocally instead of stated explicitly.
  • Privacy protections weaken when contextual information must be grounded in acoustic signals.
  • A speech grounding gap exists: models recognize norms in text but do not enforce them in speech.
  • The benchmark spans 22 tasks with bilingual coverage, validating findings across both languages.
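The two-tier matched-pair design above can be sketched as a minimal evaluation schema. This is an illustrative assumption, not the paper's actual data format: the field names (`BenchItem`, `acoustic_context`), the example pass rates, and the gap metric are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class BenchItem:
    """One matched text/audio evaluation pair (hypothetical schema)."""
    task: str                # e.g. "safety", "fairness", or "privacy"
    tier: int                # 1 = identical content in text and speech;
                             # 2 = benign transcript + risky acoustic context
    transcript: str          # textual content shared by both modalities
    acoustic_context: dict = field(default_factory=dict)  # speaker, tone, scene cues

def grounding_gap(text_pass_rate: float, speech_pass_rate: float) -> float:
    """How much safeguard compliance drops when context moves text -> speech."""
    return text_pass_rate - speech_pass_rate

# Illustrative numbers only (not results from the paper):
item = BenchItem(
    task="privacy",
    tier=2,
    transcript="Can you read that account number back to me?",
    acoustic_context={"environment": "crowded public space"},
)
gap = grounding_gap(text_pass_rate=0.92, speech_pass_rate=0.61)
print(f"{item.task} tier {item.tier}: grounding gap = {gap:.2f}")
```

A Tier 2 item like this one is the interesting case: the transcript alone is benign, so a text-only judge would pass it, and any failure must come from the model not grounding the acoustic context.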

Frequently asked

  • What is the speech grounding gap? It is the failure of speech language models to apply safety, fairness, and privacy rules when the decisive cue arrives through voice rather than text. Models recognize a social norm when it is stated explicitly in text but ignore the same norm when it must be inferred from speaker identity, tone, accent, or environment. This creates a systematic vulnerability in shared-space voice systems.
