Artificial Intelligence · 4 min read · 17 April 2026
MERRIN: Benchmark for Multimodal Search in Noisy Web Data
New benchmark reveals that AI agents struggle with real-world web search, averaging only 22% accuracy when retrieving and reasoning across mixed media sources.
Source: arxiv/cs.AI · Han Wang, David Wan, Hyunji Lee, Thinh Pham, Mikaela Cankosyan, Weiyuan Chen, Elias Stengel-Eskin, Tu Vu, Mohit Bansal · open original ↗
MERRIN benchmark tests AI agents on retrieving and reasoning over multimodal, conflicting web evidence without explicit modality hints.
- Benchmark uses natural language queries without telling agents which modalities to prioritize.
- Incorporates video and audio alongside text, modalities often overlooked in prior benchmarks.
- Tests three search modes: no search, native search, and agentic search with tool use.
- Best agent achieves 40% accuracy; average across all agents is 22%.
- Strong models like Gemini Deep Research over-explore, wasting resources on conflicting sources.
- Agents rely too heavily on text and select sources inefficiently compared to human performance.
- Benchmark reflects real-world web search: underspecified queries, heterogeneous results, conflicting claims.
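The evaluation setup above (agents scored under three search modes, with per-agent and overall accuracy) can be sketched as a minimal harness. This is an illustrative outline only: the mode names, function signatures, and exact-match scoring are assumptions, not the benchmark's released code.

```python
from statistics import mean

# Hypothetical mode names mirroring the three conditions described
# in the summary; the real benchmark's identifiers may differ.
MODES = ("no_search", "native_search", "agentic_search")

def score(predictions, answers):
    """Exact-match accuracy over paired predictions and gold answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def evaluate(agents, queries, answers):
    """Return per-agent, per-mode accuracy and the overall average.

    `agents` maps a name to a callable taking (query, mode) and
    returning an answer string; this interface is assumed for
    illustration.
    """
    results = {}
    for name, agent in agents.items():
        results[name] = {
            m: score([agent(q, mode=m) for q in queries], answers)
            for m in MODES
        }
    # Average over every agent/mode cell, analogous to the reported
    # 22% average across all agents.
    overall = mean(acc for per_mode in results.values()
                   for acc in per_mode.values())
    return results, overall
```

A harness like this makes the headline numbers concrete: the "best agent" figure is the maximum cell in `results`, while the 22% figure corresponds to `overall`.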
Frequently asked questions
- What is MERRIN and why does it matter? MERRIN is a benchmark that tests AI agents on retrieving and reasoning over multimodal evidence (text, video, audio) from noisy web sources. It matters because real-world search queries are ambiguous and web results often conflict. MERRIN measures whether agents can decide which modalities are relevant and integrate contradictory information, tasks current agents perform poorly at, with the best model reaching only 40% accuracy.