AI · 4 min read · April 27, 2026

LLMs use hidden confidence signals to detect and fix their own errors

Research shows large language models maintain a second-order evaluative signal that predicts error detection and self-correction beyond what their output probabilities reveal.

Source: arxiv/cs.LG · Dharshan Kumaran, Viorica Patraucean, Simon Osindero, Petar Velickovic, Nathaniel Daw · open original ↗

LLMs detect errors via internal confidence signals independent of output probabilities, enabling self-correction without external feedback.

  • Models cache a confidence representation at the post-answer newline (PANL) token that drives error detection (see the probe sketch after this list).
  • PANL activations predict which errors the model can correct, outperforming verbal confidence signals.
  • Second-order confidence architecture mirrors decision neuroscience frameworks with independent evaluative signals.
  • Causal interventions show PANL signals rescue error detection when answer information is corrupted.
  • Findings replicate across models (Gemma 3 27B, Qwen 2.5 7B) and tasks (TriviaQA, MNLI).
  • Verbal confidence alone fails to predict correctable errors; internal signals encode fixability.
  • First-order models cannot explain error detection, since a confidence signal derived only from output probabilities would always favor the chosen response.
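
To make the PANL idea concrete, here is a minimal sketch of how one might probe for such a signal: extract the hidden state at the newline token that follows a generated answer and fit a linear probe to predict correctness. This is not the paper's exact setup; the model choice, the Q/A prompt format, the layer, and the `dataset` variable (a list of (question, answer, is_correct) triples, e.g. scored TriviaQA items) are all assumptions for illustration.

```python
# Hypothetical probing sketch: read out a correctness signal from the
# hidden state at the post-answer newline (PANL) token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B", torch_dtype=torch.bfloat16, device_map="auto"
)

def panl_activation(question: str, answer: str, layer: int = -1) -> torch.Tensor:
    """Hidden state at the newline token immediately after the answer."""
    text = f"Q: {question}\nA: {answer}\n"  # trailing \n is the PANL position
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # Last sequence position corresponds to the post-answer newline token.
    return out.hidden_states[layer][0, -1].float().cpu()

# `dataset` is a hypothetical list of (question, answer, is_correct) triples.
X = torch.stack([panl_activation(q, a) for q, a, _ in dataset]).numpy()
y = [int(c) for _, _, c in dataset]
probe = LogisticRegression(max_iter=1000).fit(X, y)
```

A probe like this is the standard way to test whether an internal signal carries information beyond the output distribution: if it predicts errors that token probabilities (or verbal confidence) miss, the signal is genuinely second-order.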

Frequently asked

  • How do LLMs detect their own errors without external feedback? LLMs maintain an internal confidence signal at the post-answer newline (PANL) that operates independently of output probabilities. This second-order evaluative signal can disagree with the model's chosen response, allowing it to recognize when an answer is likely wrong. The signal encodes not only error likelihood but also whether the model has the knowledge to fix it, enabling self-correction without human input (see the sketch below).
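
Continuing the sketch above (reusing `tok`, `model`, `probe`, and `panl_activation`), one illustrative way to turn the internal signal into self-correction is to resample when the probe score is low. The threshold, sampling settings, and retry loop are assumptions for illustration, not the paper's procedure.

```python
# Hypothetical self-correction loop: accept an answer only when the
# PANL probe's confidence estimate clears a threshold, otherwise resample.
def answer_with_self_check(question: str, threshold: float = 0.5,
                           max_tries: int = 3) -> str:
    for _ in range(max_tries):
        prompt = f"Q: {question}\nA:"
        ids = tok(prompt, return_tensors="pt").to(model.device)
        gen = model.generate(**ids, max_new_tokens=32,
                             do_sample=True, temperature=0.7)
        answer = tok.decode(gen[0, ids["input_ids"].shape[1]:],
                            skip_special_tokens=True).strip()
        # Probe the internal confidence at the post-answer newline.
        score = probe.predict_proba(
            panl_activation(question, answer).numpy().reshape(1, -1)
        )[0, 1]
        if score >= threshold:  # internal signal says answer is likely right
            return answer
    return answer  # fall back to the last sample
```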
