Tag

#benchmark

10 insights under this tag.

Artificial Intelligence · arxiv/cs.AI · 8 min

LLMs Withhold Help When They Misread Intent, Not Lack Knowledge

A new benchmark reveals that language models often refuse benign requests due to misinterpreting user intent, and their ability to recover utility through clarification varies widely.

May 1, 2026 Read →
Artificial Intelligence · arxiv/cs.AI · 8 min

LATTICE: Measuring Crypto Agent Quality Beyond Accuracy

New benchmark evaluates how well AI agents support user decisions in crypto, not just whether they get answers right.

April 30, 2026 Read →
Artificial Intelligence · arxiv/cs.LG · 8 min

Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows

New benchmark reveals frontier AI models achieve only 44.5% success on long-horizon web tasks spanning multiple sites, exposing efficiency gaps in agent design.

April 29, 2026 Read →
Artificial Intelligence · arxiv/cs.AI · 4 min

KuaiLive: First Real-Time Live Streaming Recommendation Dataset

Researchers release a 21-day interaction log from Kuaishou covering 23,772 users and 452,621 streamers to enable dynamic recommendation research.

April 27, 2026 Read →
Artificial Intelligence · arxiv/cs.AI · 4 min

MERRIN: Benchmark for Multimodal Search in Noisy Web Data

New benchmark reveals AI agents struggle with real-world web search, achieving only 22% accuracy when retrieving and reasoning across mixed media sources.

April 17, 2026 Read →
Artificial Intelligence · arxiv/cs.AI · 8 min

LLMs hit formal reasoning ceiling; Chomsky Hierarchy reveals efficiency gap

New benchmark shows large language models struggle with structured complexity tasks and require prohibitive compute to achieve reliability in formal reasoning.

April 17, 2026 Read →
Artificial Intelligence · arxiv/cs.AI · 8 min

Vision-Language Models Fail on Dense Visual Grids

A new benchmark reveals VLMs collapse sharply on simple grid-reading tasks, exposing a gap between visual encoding and language output called Digital Agnosia.

April 17, 2026 Read →
Artificial Intelligence · arxiv/cs.LG · 6 min

Speech Models Fail Safety Tests That Text Passes

VoxSafeBench reveals speech language models recognize social norms in text but ignore them when cues arrive through voice, speaker identity, or environment.

April 17, 2026 Read →
Artificial Intelligence · arxiv/cs.LG · 6 min

Speech Models Fail Safety Tests That Text Models Pass

A new benchmark reveals that speech language models drop safety, fairness, and privacy protections when cues arrive as audio rather than text.

April 17, 2026 Read →
Artificial Intelligence · arxiv/cs.LG · 4 min

Retrieval-Augmented Set Completion for Clinical Code Authoring

A two-stage approach retrieves similar clinical value sets then classifies candidates, outperforming direct LLM generation on standardized medical vocabularies.

April 17, 2026 Read →
