Tag

#longhorizon

1 insights with this tag.

AI · arxiv/cs.LG · 8 min

Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows

New benchmark reveals frontier AI models achieve only 44.5% success on long-horizon web tasks spanning multiple sites, exposing efficiency gaps in agent design.

April 29, 2026 Read → →

astrobobo

Bu site JavaScript gerektirir. Tarayıcında JavaScript'i etkinleştir.

This site requires JavaScript. Please enable it in your browser.