AI · 4 min read · April 24, 2026
Cross-Entropy Loss Drives Neural Probe Performance, Not Architecture
Pre-registered study shows cross-entropy training inflates logit norms 15x, accounting for most K-way energy probe gains over softmax baselines.
Cross-entropy loss, not bidirectional inference, drives K-way energy probe performance gains; logit scaling explains two-thirds of the effect.
- The K-way energy probe reduction depends critically on cross-entropy at the output layer.
- Removing cross-entropy halves the probe-softmax gap; MSE training produces 15x smaller logit norms.
- Bidirectional predictive coding shows a probe advantage but lacks the expected increase in latent movement.
- Temperature scaling removes 66% of the probe-softmax gap; the remaining 34% reflects representation ranking quality (see the sketch after this list).
- The study was pre-registered to test the sensitivity of the theoretical reduction to architectural changes.
- Standard predictive coding with MSE replicates the negative result; bPC shows a positive but mechanistically unclear result.
- Logit-scale effects dominate; scale-invariant ranking effects are secondary.
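To illustrate the temperature-scaling analysis behind the 66%/34% split, here is a minimal sketch (not the study's code): a single scalar temperature is fit on held-out logits by minimizing negative log-likelihood, and any gap that survives rescaling is scale-invariant by construction. The names `val_logits`, `val_labels`, and `test_logits` are placeholders, not names from the study.

```python
# Minimal temperature-scaling sketch (assumed setup, not the study's code).
import torch
import torch.nn.functional as F

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor) -> float:
    """Fit a scalar temperature T that minimizes NLL of softmax(logits / T)."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()

# Usage (placeholder tensors): rescaling removes the logit-scale component of the
# probe-softmax gap; whatever gap remains reflects ranking quality alone.
# T = fit_temperature(val_logits, val_labels)
# calibrated_probs = F.softmax(test_logits / T, dim=-1)
```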
Frequently asked questions
- Why does cross-entropy training produce much larger logit norms than MSE or bPC? Cross-entropy keeps penalizing any probability mass assigned to incorrect classes no matter how confident the model already is, so the only way to keep driving the loss down is to produce larger-magnitude outputs. MSE and bidirectional predictive coding do not apply this pressure, which results in smaller logit scales. This is a direct consequence of the loss function's gradient structure, not of the model architecture.
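To make that gradient-pressure argument concrete, here is a toy sketch using synthetic, linearly separable data and a plain linear classifier (not the study's models or data): the same model trained with cross-entropy ends up with much larger logit norms than when trained with MSE against one-hot targets.

```python
# Toy comparison of logit scale under cross-entropy vs MSE (assumed synthetic setup).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(512, 20)
W_true = torch.randn(20, 5)
y = (X @ W_true).argmax(dim=-1)  # linearly separable synthetic labels

def train(loss_name: str, steps: int = 2000) -> float:
    model = torch.nn.Linear(20, 5)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(steps):
        logits = model(X)
        if loss_name == "ce":
            # Cross-entropy keeps penalizing residual mass on wrong classes,
            # so on separable data the logit scale keeps growing.
            loss = F.cross_entropy(logits, y)
        else:
            # MSE on one-hot targets is satisfied once logits reach roughly 0/1,
            # so there is no pressure toward large logit magnitudes.
            loss = F.mse_loss(logits, F.one_hot(y, num_classes=5).float())
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model(X).norm(dim=-1).mean().item()

print("mean logit norm, cross-entropy:", round(train("ce"), 2))
print("mean logit norm, MSE          :", round(train("mse"), 2))
```

The exact ratio depends on the data and optimizer, so this sketch only shows the direction of the effect, not the 15x figure reported in the study.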