AI · 8 min read · April 29, 2026
Model Architecture Controls Whether Errors Stay Hidden
Transformer architecture, not just training, determines whether mid-layer activations expose token-level decision quality hidden from confidence scores.
- Output confidence absorbs 57.7% of the raw probe signal, masking true decision quality in frozen activations (a probe sketch follows this list).
- 24-layer, 16-head configurations collapse to near-zero observability across parameter scales; other configurations maintain a healthy signal.
- Observability collapse emerges during training despite improving loss, suggesting architectural constraints erase internal signals.
- Qwen 2.5 and Llama differ by 2.9x in observability at a matched 3B scale, with non-overlapping probe distributions.
- Error-detection probes trained on WikiText catch 10.9–13.4% of the errors that confidence misses across downstream tasks.
- Nonlinear probes and layer sweeps fail to recover signal in collapsed configurations.
- Architecture selection functions as a monitoring decision with measurable consequences for error detection.
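To make the probing setup concrete, here is a minimal sketch of training a linear error-detection probe on frozen mid-layer activations. It is not the article's code: the checkpoint name, layer index, and error definition (argmax next-token prediction disagreeing with the actual next token) are illustrative assumptions.

```python
# Sketch only: linear probe on frozen activations for token-level error detection.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "Qwen/Qwen2.5-0.5B"  # assumed checkpoint; any causal LM works
LAYER = 12                   # assumed mid-layer to probe

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def probe_features(texts):
    """Frozen mid-layer activations, max-softmax confidence, and error labels."""
    feats, confs, errs = [], [], []
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids)
        hidden = out.hidden_states[LAYER][0]   # (seq, d_model), frozen activations
        probs = out.logits[0].softmax(-1)      # next-token distributions
        for t in range(ids.shape[1] - 1):
            target = ids[0, t + 1]
            feats.append(hidden[t].numpy())
            confs.append(probs[t].max().item())            # output confidence
            errs.append(int(probs[t].argmax() != target))  # 1 = model errs here
    return feats, confs, errs

# Toy corpus stands in for WikiText; real runs need enough text for both labels.
texts = ["The capital of France is Paris.", "Water boils at 100 degrees Celsius."]
X, conf, y = probe_features(texts)
probe = LogisticRegression(max_iter=1000).fit(X, y)  # linear probe on activations
```

Sweeping `LAYER` over all `hidden_states` indices gives the layer sweep referenced above; per the findings, in collapsed configurations no layer (and no nonlinear probe head) recovers the signal.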
Frequently asked
- How much of the probe signal does output confidence account for? Confidence (max-softmax) and activation norm absorb approximately 57.7% of the raw signal that probes can extract from mid-layer activations. This means a model can be confident in its output while the internal decision-making process, visible only in frozen activations, shows uncertainty or error. Controlling for these factors reveals hidden signal that confidence alone cannot expose.
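One way to operationalize "controlling for" confidence and norm, continuing the variables `X`, `conf`, `y` from the probe sketch above: fit a baseline classifier on confidence and activation norm alone, fit the full probe on activations, and treat the performance gap as the hidden signal. Using AUROC as the metric is an assumption here, not necessarily the article's exact protocol.

```python
# Sketch: how much error-detection signal survives after controlling for
# confidence and activation norm (AUROC gap between baseline and full probe).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X_arr = np.asarray(X)
norm = np.linalg.norm(X_arr, axis=1)       # per-token activation norm
baseline = np.column_stack([conf, norm])   # confidence + norm only

base_clf = LogisticRegression(max_iter=1000).fit(baseline, y)
full_clf = LogisticRegression(max_iter=1000).fit(X_arr, y)

base_auc = roc_auc_score(y, base_clf.predict_proba(baseline)[:, 1])
full_auc = roc_auc_score(y, full_clf.predict_proba(X_arr)[:, 1])

# The residual gap is signal that confidence alone cannot expose.
print(f"baseline {base_auc:.3f}  activations {full_auc:.3f}  hidden {full_auc - base_auc:.3f}")
```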