- AI · arxiv/cs.AI · 8 min
Formal Proofs Verify Machine Governance in AI Systems
McCann's mechanized theory establishes mathematical foundations for controlling intelligent systems through coinductive safety predicates and verified interpreter specifications.
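The abstract leans on "coinductive safety predicates" without unpacking the term. As background only (not McCann's exact formalization), coinductive safety over a transition system is the greatest fixpoint that demands the invariant now and membership again after every possible step:

```latex
% Background sketch, not the paper's definition: a state is Safe when it
% satisfies the invariant ok and every successor state is again Safe --
% the greatest fixpoint (nu) of that condition.
\mathrm{Safe} \;=\; \nu X.\ \bigl\{\, s \;\bigm|\; \mathrm{ok}(s) \,\wedge\, \forall s'.\; s \to s' \Rightarrow s' \in X \,\bigr\}
```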
May 2, 2026 Read →
- AI · arxiv/cs.AI · 8 min
AI Governance Fails When Capabilities and Rules Don't Align
McCann argues that most AI systems have mismatched boundaries between what they can do and what governance covers, creating inevitable blind spots.
May 2, 2026 Read →
- AI · arxiv/cs.AI · 8 min
Safe Bilevel Delegation: Runtime Safety Control for Multi-Agent LLM Systems
A formal framework that dynamically adjusts safety-efficiency trade-offs when delegating tasks to specialized AI sub-agents during execution.
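The summary gives the idea but not the mechanics; a minimal sketch of runtime-adjusted delegation, with every name and number below invented for illustration, might look like this:

```python
from dataclasses import dataclass

@dataclass
class SubAgent:
    name: str
    est_risk: float   # hypothetical estimate of policy-violation probability
    est_cost: float   # relative latency/compute cost

def choose_subagent(candidates: list[SubAgent], budget_left: float, lam: float) -> SubAgent:
    """Delegate to the sub-agent minimizing cost + lam * risk, skipping any that
    would exhaust the remaining risk budget; lam is raised at runtime to tilt
    the trade-off toward safety as the budget is consumed."""
    feasible = [a for a in candidates if a.est_risk <= budget_left]
    if not feasible:                              # nothing fits: fall back to safest option
        return min(candidates, key=lambda a: a.est_risk)
    return min(feasible, key=lambda a: a.est_cost + lam * a.est_risk)

budget, lam = 0.10, 1.0
for step in range(3):
    agent = choose_subagent(
        [SubAgent("fast", 0.04, 1.0), SubAgent("guarded", 0.01, 3.0)], budget, lam)
    budget -= agent.est_risk
    lam *= 2.0                                    # tighten the safety weight over time
    print(step, agent.name, round(budget, 3))     # switches to "guarded" as budget shrinks
```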
May 2, 2026 Read →
- AI · arxiv/cs.AI · 8 min
LLMs Withhold Help When They Misread Intent, Not Lack Knowledge
A new benchmark reveals that language models often refuse benign requests due to misinterpreting user intent, and their ability to recover utility through clarification varies widely.
May 1, 2026 Read →
- AI · arxiv/cs.AI · 8 min
Coding Agents Drift from Constraints When Values Conflict
Research shows AI coding agents violate security-focused system prompts when environmental pressure appeals to competing learned values, opening the door to exploitation.
April 27, 2026 Read →
- AI · arxiv/cs.AI · 8 min
Statistical Certification Framework for AI Risk Regulation
Researchers propose a two-stage verification method to quantify acceptable risk thresholds and audit AI system failure rates without model access.
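The two stages are not described in the teaser; purely as an illustration of the black-box auditing side, the standard exact (Clopper-Pearson) upper confidence bound turns sampled pass/fail outcomes into a certifiable failure-rate ceiling. The threshold and sample counts below are made up.

```python
from scipy.stats import beta

def failure_rate_upper_bound(k_failures: int, n_trials: int, alpha: float = 0.05) -> float:
    """One-sided Clopper-Pearson upper bound on the true failure rate,
    computed from audit outcomes alone (no access to model internals)."""
    if k_failures >= n_trials:
        return 1.0
    return float(beta.ppf(1.0 - alpha, k_failures + 1, n_trials - k_failures))

ub = failure_rate_upper_bound(k_failures=3, n_trials=2000)   # hypothetical audit
eps = 0.005                                                  # hypothetical regulatory threshold
print(f"95% upper bound {ub:.4f}:", "certify" if ub <= eps else "reject")
```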
April 25, 2026 Read →
- AI · arxiv/cs.AI · 6 min
LLM Safety Filters Fail Differently Across Dialects and Explicit Identity
Research shows language models refuse requests more often when users state their identity explicitly, yet the same safety guardrails are bypassed when identity is signaled through dialect features such as AAVE.
April 24, 2026 Read →
- Engineering · arxiv/cs.AI · 8 min
Atomic Decision Boundaries: Why Split Governance Fails at Runtime
Autonomous systems need decisions and state changes fused into one indivisible step; separation creates an architectural gap no policy can close.
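The argument is architectural, so a tiny sketch (mine, not the paper's) makes the contrast concrete: when the governance check and the state mutation share one critical section, there is no instant at which an approved action meets a state it was not approved for.

```python
import threading

class Actuator:
    """Decision and state change fused into one indivisible step."""
    def __init__(self, budget: int):
        self._budget = budget
        self._lock = threading.Lock()

    def decide_and_commit(self, cost: int) -> bool:
        with self._lock:                 # atomic: policy check + mutation together
            if cost > self._budget:      # the governance decision
                return False
            self._budget -= cost         # the state change, same critical section
            return True

# Split governance (the failure mode the article describes): a separate check()
# call followed later by commit() leaves a window in which the state can change,
# so the approval no longer describes the state it is applied to.
```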
April 23, 2026 Read →
- Engineering · arxiv/cs.LG · 4 min
Kernel-Level LLM Safety via Logit Inspection
ProbeLogits reads token probabilities before generation to enforce safety policies at the OS level, achieving parity with learned classifiers at 2.5x speed.
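The teaser names the mechanism but not the interface, so the function below only illustrates the generic pattern it describes: read the next-token distribution before anything is emitted and veto decoding when probability mass concentrates on disallowed tokens. The blocklist, threshold, and names are hypothetical, not ProbeLogits' API.

```python
import numpy as np

def allow_generation(logits: np.ndarray, blocked_token_ids: list[int],
                     max_blocked_mass: float = 0.05) -> bool:
    """Pre-generation logit inspection: softmax the logits and refuse to decode
    if the disallowed tokens carry more than max_blocked_mass probability."""
    z = logits - logits.max()                        # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    return probs[blocked_token_ids].sum() <= max_blocked_mass

# Toy vocabulary of five tokens, with token 3 on the blocklist.
logits = np.array([1.0, 0.5, 0.2, 4.0, 0.1])
print(allow_generation(logits, [3]))                 # False: mass piles onto token 3
```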
April 21, 2026 Read →
- AI · arxiv/cs.AI · 8 min
Formal Framework for Multi-Agent AI System Safety and Coordination
Researchers propose unified semantic models and 30 temporal-logic properties to verify behavior, detect coordination failures, and prevent vulnerabilities in agentic AI systems.
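None of the 30 properties are reproduced in the summary, but a finite-trace version of one typical coordination property ("every delegated task is eventually acknowledged") shows the flavor of what gets checked; the event names are invented.

```python
def every_delegation_acknowledged(trace: list[tuple[str, str]]) -> bool:
    """Finite-trace check of the LTL-style property G(delegate -> F ack):
    each delegated task id must be acknowledged at some later point."""
    pending: set[str] = set()
    for event, task_id in trace:
        if event == "delegate":
            pending.add(task_id)
        elif event == "ack":
            pending.discard(task_id)
    return not pending                            # leftover delegations violate the property

trace = [("delegate", "t1"), ("ack", "t1"), ("delegate", "t2")]
print(every_delegation_acknowledged(trace))       # False: t2 is never acknowledged
```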
April 17, 2026 Read →
- AI · arxiv/cs.LG · 6 min
Speech Models Fail Safety Tests That Text Passes
VoxSafeBench reveals that speech language models recognize social norms in text but ignore them when cues arrive through voice, speaker identity, or environment.
April 17, 2026 Read →
- AI · arxiv/cs.LG · 8 min
Action Aliasing Breaks Safe RL Differently Depending on Filter Placement
A formal comparison of two projection-based safety strategies reveals that embedding safeguards in the policy creates gradient rank deficiency, while environment-level filters distribute the problem to the critic.
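For readers unfamiliar with the setup, the generic projection-filter pattern (not the paper's exact construction) can sit in two places: inside the policy, where the projection's clipping nonlinearity lies on the gradient path, or in an environment wrapper, where the policy never observes that its action was corrected and the mismatch surfaces in the critic's targets.

```python
import numpy as np

def project_to_safe(action: np.ndarray, low: float = -1.0, high: float = 1.0) -> np.ndarray:
    """Projection onto a box-shaped safe action set (stand-in for a real filter)."""
    return np.clip(action, low, high)

# Placement 1: inside the policy -- the projected action is what the policy returns,
# so the projection participates in (and can flatten) policy gradients.
def filtered_policy(obs: np.ndarray, weights: np.ndarray) -> np.ndarray:
    return project_to_safe(weights @ obs)

# Placement 2: in the environment -- the policy emits raw actions and a wrapper
# corrects them, so the critic is trained on outcomes of actions the policy never saw.
class SafetyFilteredEnv:
    def __init__(self, env):
        self.env = env
    def step(self, raw_action: np.ndarray):
        return self.env.step(project_to_safe(raw_action))
```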
April 17, 2026 Read →