AI · 4 min read · April 17, 2026

MERRIN: Benchmark for Multimodal Search in Noisy Web Data

New benchmark reveals AI agents struggle with real-world web search, averaging just 22% accuracy when retrieving and reasoning across mixed-media sources.

Source: arxiv/cs.AI · Han Wang, David Wan, Hyunji Lee, Thinh Pham, Mikaela Cankosyan, Weiyuan Chen, Elias Stengel-Eskin, Tu Vu, Mohit Bansal

MERRIN benchmark tests AI agents on retrieving and reasoning over multimodal, conflicting web evidence without explicit modality hints.

  • Benchmark uses natural language queries without telling agents which modalities to prioritize.
  • Incorporates video and audio alongside text, modalities often overlooked in prior benchmarks.
  • Tests three search modes: no search, native search, and agentic search with tool use.
  • Best agent achieves 40% accuracy; average across all agents is 22%.
  • Strong models like Gemini Deep Research over-explore, wasting resources on conflicting sources.
  • Agents rely too heavily on text and select sources inefficiently compared to human performance.
  • Benchmark reflects real-world web search: underspecified queries, heterogeneous results, conflicting claims.
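The three search modes above amount to running the same agent under different retrieval settings and comparing accuracy. A minimal sketch of that kind of evaluation harness, in Python, is below; the function names, mode labels, and toy data are illustrative assumptions, not MERRIN's actual API.

```python
from typing import Callable

def evaluate(agent: Callable[[str, str], str],
             examples: list[dict], mode: str) -> float:
    """Accuracy of `agent` on `examples` under one search mode (hypothetical harness)."""
    correct = sum(agent(ex["query"], mode) == ex["answer"] for ex in examples)
    return correct / len(examples)

def toy_agent(query: str, mode: str) -> str:
    # Illustrative agent: only finds the answer when agentic search is allowed,
    # mimicking the gap between no-search and tool-using agents.
    knowledge = {"Which modality demonstrates the repair step?": "video"}
    if mode == "agentic_search":
        return knowledge.get(query, "unknown")
    return "unknown"

examples = [
    {"query": "Which modality demonstrates the repair step?", "answer": "video"},
]

for mode in ("no_search", "native_search", "agentic_search"):
    print(mode, evaluate(toy_agent, examples, mode))
```

A real harness would replace `toy_agent` with model calls and score free-form answers more leniently, but the mode-by-mode comparison structure is the same idea the benchmark reports.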

Frequently asked

  • MERRIN is a benchmark that tests AI agents on retrieving and reasoning over multimodal evidence (text, video, audio) from noisy web sources. It matters because real-world search queries are ambiguous and web results often conflict. MERRIN measures whether agents can decide which modalities are relevant and integrate contradictory information, tasks at which current agents perform poorly: even the best model reaches only 40% accuracy.
