AI · 4 min read · April 17, 2026

MERRIN: Benchmark for Multimodal Search in Noisy Web Data

New benchmark reveals AI agents struggle with real-world web search, averaging just 22% accuracy when retrieving and reasoning across mixed-media sources.

Source: arxiv/cs.AI · Han Wang, David Wan, Hyunji Lee, Thinh Pham, Mikaela Cankosyan, Weiyuan Chen, Elias Stengel-Eskin, Tu Vu, Mohit Bansal

MERRIN benchmark tests AI agents on retrieving and reasoning over multimodal, conflicting web evidence without explicit modality hints.

  • Benchmark uses natural language queries without telling agents which modalities to prioritize.
  • Incorporates video and audio alongside text, modalities often overlooked in prior benchmarks.
  • Tests three search modes: no search, native search, and agentic search with tool use.
  • Best agent achieves 40% accuracy; average across all agents is 22%.
  • Strong models like Gemini Deep Research over-explore, wasting resources on conflicting sources.
  • Agents rely too heavily on text and select sources inefficiently compared to human performance.
  • Benchmark reflects real-world web search: underspecified queries, heterogeneous results, conflicting claims.
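The three search modes above amount to running the same agent under different retrieval settings and comparing accuracy. A minimal sketch of that kind of evaluation harness, in Python, is below; the function names, mode labels, and toy data are illustrative assumptions, not MERRIN's actual API.

```python
from typing import Callable

def evaluate(agent: Callable[[str, str], str],
             examples: list[dict], mode: str) -> float:
    """Accuracy of `agent` on `examples` under one search mode (hypothetical harness)."""
    correct = sum(agent(ex["query"], mode) == ex["answer"] for ex in examples)
    return correct / len(examples)

def toy_agent(query: str, mode: str) -> str:
    # Illustrative agent: only finds the answer when agentic search is allowed,
    # mimicking the gap between no-search and tool-using agents.
    knowledge = {"Which modality demonstrates the repair step?": "video"}
    if mode == "agentic_search":
        return knowledge.get(query, "unknown")
    return "unknown"

examples = [
    {"query": "Which modality demonstrates the repair step?", "answer": "video"},
]

for mode in ("no_search", "native_search", "agentic_search"):
    print(mode, evaluate(toy_agent, examples, mode))
```

A real harness would replace `toy_agent` with model calls and score free-form answers more leniently, but the mode-by-mode comparison structure is the same idea the benchmark reports.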

Frequently asked

  • MERRIN is a benchmark that tests AI agents on retrieving and reasoning over multimodal evidence (text, video, audio) from noisy web sources. It matters because real-world search queries are ambiguous and web results often conflict. MERRIN measures whether agents can decide which modalities are relevant and integrate contradictory information, tasks at which current agents perform poorly: even the best model reaches only 40% accuracy.
