AI · 8 min read · April 29, 2026
Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows
New benchmark reveals frontier AI models achieve only 44.5% success on long-horizon web tasks spanning multiple sites, exposing efficiency gaps in agent design.
Source: arxiv/cs.LG · Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov · open original ↗
Odysseys benchmark tests web agents on realistic multi-site, multi-hour tasks; frontier models succeed 44.5% of the time with poor efficiency.
- Existing benchmarks use short, single-site tasks on which frontier models are already near saturation.
- Real web work requires sustained context across multiple domains over hours.
- Odysseys contains 200 tasks drawn from actual browsing sessions and evaluated on the live web.
- Binary pass/fail metrics are inadequate; rubric-based evaluation, with an average of 6.1 graders per task, provides a finer-grained signal.
- Frontier models achieve only a 44.5% success rate on long-horizon tasks.
- A Trajectory Efficiency metric shows that even successful agents are slow, earning only 1.15% of rubric score per step.
- Efficiency matters as much as correctness for practical agent deployment.
- The benchmark is released with tasks, evaluation code, and results.
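The Trajectory Efficiency idea above is straightforward to sketch: divide the rubric score a run earns by the number of steps it takes. A minimal illustration (the `Trajectory` class, field names, and the example numbers are illustrative assumptions, not code or data from the paper):

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    """One agent run: rubric score achieved and browser steps taken."""
    rubric_score: float  # in [0, 1]; fraction of rubric criteria satisfied
    num_steps: int       # number of actions the agent executed

def trajectory_efficiency(t: Trajectory) -> float:
    """Rubric score earned per step; higher means a more efficient agent."""
    return t.rubric_score / t.num_steps if t.num_steps else 0.0

# Hypothetical run: satisfying 69% of the rubric over 60 steps works out
# to 1.15% of rubric score per step, matching the reported average rate.
run = Trajectory(rubric_score=0.69, num_steps=60)
print(f"{trajectory_efficiency(run):.2%} per step")  # → 1.15% per step
```

Under this framing, two agents with identical success rates can differ sharply in deployment cost, which is why the benchmark reports efficiency alongside correctness.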
Frequently asked
- Why build another web-agent benchmark? Current benchmarks focus on short, single-site tasks where frontier models are already near saturation. Real web work, such as comparing products across domains, planning trips, or synthesizing research, requires sustained context and cross-site reasoning over hours. Short benchmarks miss these long-horizon challenges and don't reveal the efficiency gaps that matter in production.