What is the speech grounding gap?

The speech grounding gap is the failure of speech language models to apply safety, fairness, and privacy rules when the relevant cue arrives as audio rather than text. Models often recognize the social norm when presented as text but ignore it when the same information is embedded in speaker identity, tone, or environment. This gap exposes a mismatch between text-based and audio-based reasoning.

Why do speech models fail tests that text models pass?

Speech models must ground decisions in acoustic features—speaker identity, paralinguistic cues, background noise—that text models never encounter. Most safety training uses text-only data. When the same benign transcript is paired with different speakers or environments, models often detect the acoustic difference but fail to adjust their response appropriately, suggesting a training or architectural gap in acoustic context integration.

What does VoxSafeBench measure?

VoxSafeBench is a benchmark with two tiers. Tier 1 compares matched text and audio inputs to isolate audio-specific risks. Tier 2 uses benign transcripts where the correct response depends on speaker, tone, or location. It measures safety, fairness, and privacy across 22 bilingual tasks and includes intermediate probes to verify whether models detect acoustic cues before failing to act on them.

← Content

AI · 6 min read · April 17, 2026

Speech Models Fail Safety Tests That Text Models Pass

A new benchmark reveals that speech language models drop safety, fairness, and privacy protections when cues arrive as audio rather than text.

Source: arxiv/cs.LG · Yuxiang Wang, Hongyu Liu, Yijiang Xu, Qinke Ni, Li Wang, Wan Lin, Kunyu Feng, Dekun Chen, Xu Tan, Lei Wang, Jie Shi, Zhizheng Wu · open original ↗ ↗

Share: X LinkedIn

Speech language models recognize social norms in text but fail to enforce them when speaker identity, tone, or environment arrive as audio.

— VoxSafeBench tests safety, fairness, and privacy across 22 bilingual tasks in speech contexts.
— Tier 1 compares matched text and audio inputs to isolate audio-specific risks.
— Tier 2 uses benign transcripts where response depends on speaker, tone, or location.
— Models detect acoustic cues but fail to apply appropriate safeguards based on them.
— Safety drops for speaker- and scene-conditioned risks; fairness erodes with vocal demographic cues.
— Privacy protections weaken when contextual information arrives through speech rather than text.
— A speech grounding gap exists: models recognize norms textually but not acoustically.

Frequently asked

The speech grounding gap is the failure of speech language models to apply safety, fairness, and privacy rules when the relevant cue arrives as audio rather than text. Models often recognize the social norm when presented as text but ignore it when the same information is embedded in speaker identity, tone, or environment. This gap exposes a mismatch between text-based and audio-based reasoning.

#speech #safety #fairness #privacy #benchmark #alignment

Speech Models Fail Safety Tests That Text Models Pass

Frequently asked

Synthetic Computers Enable Agent Training at Scale

ActiNet: Self-Supervised Model Improves Wrist Activity Classification

Mixed Precision Training Stabilizes Neural ODEs