AI · 8 min read · April 29, 2026

Model Architecture Controls Whether Errors Stay Hidden

Transformer design determines whether internal decision signals remain observable after training, independent of output confidence metrics.

Source: arxiv/cs.LG · Thomas Carmichael · open original ↗

Transformer architecture, not just training, determines whether mid-layer activations expose token-level decision quality hidden from confidence scores.

  • Output confidence absorbs 57.7% of raw probe signal, masking true decision quality in frozen activations.
  • 24-layer 16-head configurations collapse to near-zero observability across parameter scales; other configs maintain healthy signal.
  • Observability collapse emerges during training despite improving loss, suggesting architectural constraints erase internal signals.
  • Qwen 2.5 and Llama differ by 2.9x in observability at a matched 3B scale, with non-overlapping probe distributions.
  • Error-detection probes trained on WikiText catch 10.9–13.4% of the errors that confidence misses across downstream tasks (a sketch of this evaluation follows the list).
  • Nonlinear probes and layer sweeps fail to recover signal in collapsed configurations.
  • Architecture selection functions as a monitoring decision with measurable consequences for error detection.
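
To make the error-detection result concrete, here is a minimal sketch, not the paper's exact recipe: a linear probe is fit on frozen mid-layer activations from one corpus (e.g. a WikiText-style split), then scored on what fraction of confidently-wrong tokens it flags on a downstream task. The layer choice, probe form, and 0.5 thresholds below are assumptions for illustration.

```python
# Hypothetical sketch: probe frozen mid-layer activations for token-level
# errors, then count the confidently-wrong tokens the probe catches.
import numpy as np
from sklearn.linear_model import LogisticRegression


def max_softmax_confidence(logits: np.ndarray) -> np.ndarray:
    """Per-token confidence: probability assigned to the argmax token."""
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs.max(axis=-1)


def fit_error_probe(train_acts: np.ndarray, train_is_error: np.ndarray):
    """Fit a linear probe on frozen activations from the training corpus."""
    return LogisticRegression(max_iter=1000).fit(train_acts, train_is_error)


def fraction_caught_beyond_confidence(probe, acts, logits, is_error,
                                      conf_thr=0.5, probe_thr=0.5):
    """acts: [n_tokens, d_model] mid-layer activations on a downstream task.
    logits: [n_tokens, vocab] output logits; is_error: [n_tokens] bool labels."""
    conf = max_softmax_confidence(logits)
    probe_flags = probe.predict_proba(acts)[:, 1] > probe_thr

    # Errors that confidence misses: the model is confident but wrong.
    conf_missed = is_error & (conf > conf_thr)
    caught = conf_missed & probe_flags
    return caught.sum() / max(conf_missed.sum(), 1)
```

Under this framing, the 10.9–13.4% figure would correspond to the returned fraction across downstream tasks, holding the probe fixed after training on WikiText.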

Frequently asked

  • How much of the internal signal do confidence scores mask? Confidence (max-softmax) and activation norm absorb approximately 57.7% of the raw signal that probes can extract from mid-layer activations. A model can therefore be confident in its output while its internal decision-making process, visible only in frozen activations, shows uncertainty or error. Controlling for these factors reveals hidden signal that confidence alone cannot expose; a minimal sketch of such a control follows.
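
One way to read the 57.7% figure: measure how much error-detection signal a probe gets from the activations alone, how much the controls (confidence and activation norm) carry on their own, and how much survives after projecting the controls out of the activations. The linear residualization and in-sample AUC below are assumptions for illustration, not the paper's exact adjustment.

```python
# Minimal sketch of residualizing confidence and activation norm out of
# frozen activations before probing for token-level errors.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import roc_auc_score


def probe_auc(features: np.ndarray, is_error: np.ndarray) -> float:
    """AUC of a linear probe predicting errors from the given features."""
    clf = LogisticRegression(max_iter=1000).fit(features, is_error)
    return roc_auc_score(is_error, clf.predict_proba(features)[:, 1])


def residual_probe_signal(acts, conf, is_error):
    """acts: [n, d_model] mid-layer activations; conf: [n] max-softmax
    confidence; is_error: [n] bool labels for wrong predictions."""
    norms = np.linalg.norm(acts, axis=1)
    controls = np.column_stack([conf, norms])  # confidence + activation norm

    raw = probe_auc(acts, is_error)            # full signal in activations
    ctrl = probe_auc(controls, is_error)       # signal in the controls alone

    # Project the controls out of the activations, then probe the residual.
    proj = LinearRegression().fit(controls, acts)
    acts_residual = acts - proj.predict(controls)
    hidden = probe_auc(acts_residual, is_error)

    return {"raw": raw, "controls_only": ctrl, "residualized": hidden}
```

If the residualized probe still separates errors well, the activations carry decision-quality information that confidence and norm do not account for; if it collapses toward chance, the architecture is not exposing anything beyond what the output already shows.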
