AI · 8 min read · April 17, 2026
Vision-Language Models Fail on Dense Visual Grids
A new benchmark reveals VLMs collapse sharply on simple grid-reading tasks, exposing a gap between visual encoding and language output called Digital Agnosia.
Source: arxiv/cs.AI · Yunkai Zhang, Linda Li, Yingxin Cui, Xiyuan Ruan, Zeyu Zheng, Kezhen Chen, Yi Zhang, Diji Yang · open original ↗
Vision-language models abruptly fail on dense grid-to-matrix tasks even though their encoders preserve the underlying visual information.
- The Grid2Matrix benchmark tests VLMs on color grids mapped to numbers, scaling visual complexity cleanly.
- Models exhibit a sharp early collapse rather than gradual degradation as grid density increases.
- Visual encoders retain substantially more grid data than final language outputs reveal.
- The failure stems from a gap between recoverable visual features and expressed language, termed Digital Agnosia.
- Errors correlate strongly with how grid cells align with visual patch boundaries.
- Model scaling and multimodal alignment do not fully resolve this failure mode.
- The benchmark applies to real tasks: tables, charts, forms, and GUI interpretation.
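To make the task concrete, here is a minimal sketch of a Grid2Matrix-style item: a color grid a model must transcribe into a digit matrix, with complexity scaled by grid size. The palette, the color-to-digit mapping, and the per-cell scoring are illustrative assumptions, not the benchmark's actual specification.

```python
import random

# Hypothetical color-to-digit mapping; the real benchmark's palette may differ.
PALETTE = ["red", "green", "blue", "yellow"]

def make_item(n, seed=0):
    """Return an n x n color grid and its ground-truth digit matrix."""
    rng = random.Random(seed)
    grid = [[rng.choice(PALETTE) for _ in range(n)] for _ in range(n)]
    truth = [[PALETTE.index(c) for c in row] for row in grid]
    return grid, truth

def cell_accuracy(pred, truth):
    """Fraction of cells transcribed correctly (per-cell, not per-grid)."""
    total = sum(len(row) for row in truth)
    correct = sum(p == t
                  for prow, trow in zip(pred, truth)
                  for p, t in zip(prow, trow))
    return correct / total

grid, truth = make_item(4, seed=1)
print(cell_accuracy(truth, truth))  # perfect transcription scores 1.0
```

Scoring per cell rather than per grid is what lets a benchmark like this distinguish gradual degradation from the sharp collapse the authors report.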
Frequently asked
- What is Digital Agnosia? It is the gap between the visual information preserved in a VLM's encoder and what the model ultimately expresses in language output. The encoder retains grid details, but the language decoder fails to translate them faithfully, which suggests the bottleneck lies not in vision but in the bridge between vision and language.
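The summary also notes that errors correlate with how grid cells align with visual patch boundaries. A quick way to see why alignment varies with grid size is to count how many cell edges cut through a patch. The 224-pixel image and 14-pixel patch are assumed ViT-style defaults, not values taken from the paper.

```python
def misaligned_edges(image_px=224, patch_px=14, n_cells=9):
    """Count interior cell edges that do NOT land on a patch boundary."""
    cell_px = image_px / n_cells
    bad = 0
    for i in range(1, n_cells):       # interior edges along one axis
        edge = i * cell_px
        if edge % patch_px != 0:      # edge cuts through the middle of a patch
            bad += 1
    return bad

print(misaligned_edges(n_cells=8))  # 28px cells align with 14px patches -> 0
print(misaligned_edges(n_cells=9))  # every interior edge falls mid-patch -> 8
```

Under these assumptions, an 8x8 grid tiles the patch lattice exactly while a 9x9 grid misaligns every edge, illustrating how a small density change can abruptly alter what each patch "sees".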