AI · 8 min read · April 29, 2026
Model Architecture Controls Whether Errors Stay Hidden
Transformer architecture, not just training, determines whether mid-layer activations expose token-level decision quality hidden from confidence scores.
- Output confidence absorbs 57.7% of the raw probe signal, masking true decision quality in frozen activations (a probe sketch follows this list).
- 24-layer, 16-head configurations collapse to near-zero observability across parameter scales; other configurations maintain a healthy signal.
- Observability collapse emerges during training despite improving loss, suggesting architectural constraints erase internal signals.
- Qwen 2.5 and Llama differ by 2.9x in observability at a matched 3B scale, with non-overlapping probe distributions.
- Error-detection probes trained on WikiText catch 10.9–13.4% of the errors that confidence misses across downstream tasks.
- Nonlinear probes and layer sweeps fail to recover signal in collapsed configurations.
- Architecture selection functions as a monitoring decision with measurable consequences for error detection.
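To make the probing setup concrete, here is a minimal sketch of training a linear error-detection probe on frozen mid-layer activations. It is not the article's code: the checkpoint name, layer index, and error definition (argmax next-token prediction disagreeing with the actual next token) are illustrative assumptions.

```python
# Sketch only: linear probe on frozen activations for token-level error detection.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "Qwen/Qwen2.5-0.5B"  # assumed checkpoint; any causal LM works
LAYER = 12                   # assumed mid-layer to probe

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def probe_features(texts):
    """Frozen mid-layer activations, max-softmax confidence, and error labels."""
    feats, confs, errs = [], [], []
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids)
        hidden = out.hidden_states[LAYER][0]   # (seq, d_model), frozen activations
        probs = out.logits[0].softmax(-1)      # next-token distributions
        for t in range(ids.shape[1] - 1):
            target = ids[0, t + 1]
            feats.append(hidden[t].numpy())
            confs.append(probs[t].max().item())            # output confidence
            errs.append(int(probs[t].argmax() != target))  # 1 = model errs here
    return feats, confs, errs

# Toy corpus stands in for WikiText; real runs need enough text for both labels.
texts = ["The capital of France is Paris.", "Water boils at 100 degrees Celsius."]
X, conf, y = probe_features(texts)
probe = LogisticRegression(max_iter=1000).fit(X, y)  # linear probe on activations
```

Sweeping `LAYER` over all `hidden_states` indices gives the layer sweep referenced above; per the findings, in collapsed configurations no layer (and no nonlinear probe head) recovers the signal.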
Frequently asked
- How much of the probe signal does output confidence account for? Confidence (max-softmax) and activation norm absorb approximately 57.7% of the raw signal that probes can extract from mid-layer activations. This means a model can be confident in its output while the internal decision-making process, visible only in frozen activations, shows uncertainty or error. Controlling for these factors reveals hidden signal that confidence alone cannot expose.
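One way to operationalize "controlling for" confidence and norm, continuing the variables `X`, `conf`, `y` from the probe sketch above: fit a baseline classifier on confidence and activation norm alone, fit the full probe on activations, and treat the performance gap as the hidden signal. Using AUROC as the metric is an assumption here, not necessarily the article's exact protocol.

```python
# Sketch: how much error-detection signal survives after controlling for
# confidence and activation norm (AUROC gap between baseline and full probe).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X_arr = np.asarray(X)
norm = np.linalg.norm(X_arr, axis=1)       # per-token activation norm
baseline = np.column_stack([conf, norm])   # confidence + norm only

base_clf = LogisticRegression(max_iter=1000).fit(baseline, y)
full_clf = LogisticRegression(max_iter=1000).fit(X_arr, y)

base_auc = roc_auc_score(y, base_clf.predict_proba(baseline)[:, 1])
full_auc = roc_auc_score(y, full_clf.predict_proba(X_arr)[:, 1])

# The residual gap is signal that confidence alone cannot expose.
print(f"baseline {base_auc:.3f}  activations {full_auc:.3f}  hidden {full_auc - base_auc:.3f}")
```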