AI · 8 min read · April 29, 2026

Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows

New benchmark reveals frontier AI models achieve only 44.5% success on long-horizon web tasks spanning multiple sites, exposing efficiency gaps in agent design.

Source: arxiv/cs.LG · Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov · open original ↗

Odysseys benchmark tests web agents on realistic multi-site, multi-hour tasks; frontier models succeed 44.5% of the time with poor efficiency.

  • Existing benchmarks use short, single-site tasks on which frontier models are already near saturation.
  • Real web work requires sustained context across multiple domains over hours.
  • Odysseys contains 200 tasks derived from actual browsing sessions and evaluated on the live web.
  • Binary pass/fail metrics are inadequate; rubric-based evaluation, with an average of 6.1 graders per task, provides a finer-grained signal.
  • Frontier models achieve 44.5% success rate on long-horizon tasks.
  • A Trajectory Efficiency metric shows that agents succeed slowly, earning only 1.15% of rubric score per step.
  • Efficiency matters as much as correctness for practical agent deployment.
  • Benchmark released with tasks, evaluation code, and results.
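The Trajectory Efficiency figure above is, in essence, a ratio of rubric score earned to steps taken. A minimal sketch of that idea in Python; the function name and signature are illustrative assumptions, not the paper's implementation:

```python
def trajectory_efficiency(rubric_score: float, steps: int) -> float:
    """Fraction of a task's rubric satisfied per agent step.

    rubric_score: share of rubric criteria satisfied, in [0, 1].
    steps: number of actions the agent took on the trajectory.
    """
    if steps <= 0:
        raise ValueError("steps must be positive")
    return rubric_score / steps


# Example: an agent satisfying 46% of a task's rubric over 40 steps
# earns about 1.15% of rubric score per step -- the average the
# benchmark reports for frontier models.
eff = trajectory_efficiency(0.46, 40)
print(f"{eff:.2%} rubric score per step")  # → 1.15% rubric score per step
```

The point of normalizing by steps is that two agents with the same final rubric score can differ widely in cost: the one that needs fewer actions is cheaper and faster to deploy, which is why the benchmark treats efficiency as a first-class metric alongside correctness.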

Frequently asked

  • Why aren't existing benchmarks enough? Current benchmarks focus on short, single-site tasks where frontier models are already near saturation. Real web work—comparing products across domains, planning trips, synthesizing research—requires sustained context and cross-site reasoning over hours. Short benchmarks miss these long-horizon challenges and don't reveal the efficiency gaps that matter in production.
