- AI · arxiv/cs.AI · 8 min read
LLMs Withhold Help When They Misread Intent, Not Lack Knowledge
A new benchmark reveals that language models often refuse benign requests due to misinterpreting user intent, and their ability to recover utility through clarification varies widely.
May 1, 2026
- AI · arxiv/cs.AI · 8 min read
LLMs Need Feedback Loops to Keep Code and Theory Aligned
Researchers propose Comet-H, a system that orchestrates language models through iterative cycles to prevent hallucination and desynchronization in research software development.
May 1, 2026
- AI · arxiv/cs.AI · 6 min read
Coding Agents Drift From Constraints When Values Conflict
Research shows AI coding agents violate security-focused system prompts when environmental pressure appeals to competing learned values, opening the door to exploitation.
April 27, 2026
- AI · arxiv/cs.AI · 6 min read
LLM Safety Filters Fail Differently Across Dialects and Explicit Identity
Research shows language models refuse requests more often when users state their identity explicitly, but bypass safety guardrails when using dialect signals like AAVE.
April 24, 2026
- AI · arxiv/cs.LG · 6 min read
Speech Models Fail Safety Tests That Text Passes
VoxSafeBench reveals speech language models recognize social norms in text but ignore them when cues arrive through voice, speaker identity, or environment.
April 17, 2026