AI · 8 min read · April 29, 2026
Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows
New benchmark reveals frontier AI models achieve only 44.5% success on long-horizon web tasks spanning multiple sites, exposing efficiency gaps in agent design.
Source: arxiv/cs.LG · Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov · open original ↗
Odysseys benchmark tests web agents on realistic multi-site, multi-hour tasks; frontier models succeed 44.5% of the time with poor efficiency.
- Existing benchmarks use short, single-site tasks on which frontier models are already near saturation.
- Real web work requires sustained context across multiple domains over hours.
- Odysseys contains 200 tasks drawn from actual browsing sessions and evaluated on the live web.
- Binary pass/fail metrics are inadequate; rubric-based evaluation, with an average of 6.1 graders per task, provides a finer-grained signal.
- Frontier models achieve only a 44.5% success rate on long-horizon tasks.
- A Trajectory Efficiency metric shows that even successful agents are slow, earning only 1.15% of rubric score per step.
- Efficiency matters as much as correctness for practical agent deployment.
- The benchmark is released with tasks, evaluation code, and results.
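The Trajectory Efficiency idea above is straightforward to sketch: divide the rubric score a run earns by the number of steps it takes. A minimal illustration (the `Trajectory` class, field names, and the example numbers are illustrative assumptions, not code or data from the paper):

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    """One agent run: rubric score achieved and browser steps taken."""
    rubric_score: float  # in [0, 1]; fraction of rubric criteria satisfied
    num_steps: int       # number of actions the agent executed

def trajectory_efficiency(t: Trajectory) -> float:
    """Rubric score earned per step; higher means a more efficient agent."""
    return t.rubric_score / t.num_steps if t.num_steps else 0.0

# Hypothetical run: satisfying 69% of the rubric over 60 steps works out
# to 1.15% of rubric score per step, matching the reported average rate.
run = Trajectory(rubric_score=0.69, num_steps=60)
print(f"{trajectory_efficiency(run):.2%} per step")  # → 1.15% per step
```

Under this framing, two agents with identical success rates can differ sharply in deployment cost, which is why the benchmark reports efficiency alongside correctness.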
Frequently asked
- Why build another web-agent benchmark? Current benchmarks focus on short, single-site tasks where frontier models are already near saturation. Real web work, such as comparing products across domains, planning trips, or synthesizing research, requires sustained context and cross-site reasoning over hours. Short benchmarks miss these long-horizon challenges and don't reveal the efficiency gaps that matter in production.