AI · 4 min read · April 27, 2026

LLMs use hidden confidence signals to detect and fix their own errors

Research shows large language models maintain a second-order evaluative signal that predicts error detection and self-correction beyond what their output probabilities reveal.

Source: arxiv/cs.LG · Dharshan Kumaran, Viorica Patraucean, Simon Osindero, Petar Velickovic, Nathaniel Daw · open original ↗

LLMs detect errors via internal confidence signals independent of output probabilities, enabling self-correction without external feedback.

  • Models cache a confidence representation at the post-answer newline (PANL) token that drives error detection (see the probe sketch after this list).
  • PANL activations predict which errors the model can correct, outperforming verbal confidence signals.
  • Second-order confidence architecture mirrors decision neuroscience frameworks with independent evaluative signals.
  • Causal interventions show PANL signals rescue error detection when answer information is corrupted.
  • Findings replicate across models (Gemma 3 27B, Qwen 2.5 7B) and tasks (TriviaQA, MNLI).
  • Verbal confidence alone fails to predict correctable errors; internal signals encode fixability.
  • First-order models cannot explain error detection, since a confidence signal derived only from output probabilities would always favor the chosen response.
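
To make the PANL idea concrete, here is a minimal sketch of how one might probe for such a signal: extract the hidden state at the newline token that follows a generated answer and fit a linear probe to predict correctness. This is not the paper's exact setup; the model choice, the Q/A prompt format, the layer, and the `dataset` variable (a list of (question, answer, is_correct) triples, e.g. scored TriviaQA items) are all assumptions for illustration.

```python
# Hypothetical probing sketch: read out a correctness signal from the
# hidden state at the post-answer newline (PANL) token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B", torch_dtype=torch.bfloat16, device_map="auto"
)

def panl_activation(question: str, answer: str, layer: int = -1) -> torch.Tensor:
    """Hidden state at the newline token immediately after the answer."""
    text = f"Q: {question}\nA: {answer}\n"  # trailing \n is the PANL position
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # Last sequence position corresponds to the post-answer newline token.
    return out.hidden_states[layer][0, -1].float().cpu()

# `dataset` is a hypothetical list of (question, answer, is_correct) triples.
X = torch.stack([panl_activation(q, a) for q, a, _ in dataset]).numpy()
y = [int(c) for _, _, c in dataset]
probe = LogisticRegression(max_iter=1000).fit(X, y)
```

A probe like this is the standard way to test whether an internal signal carries information beyond the output distribution: if it predicts errors that token probabilities (or verbal confidence) miss, the signal is genuinely second-order.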

Frequently asked

  • How do LLMs detect their own errors without external feedback? LLMs maintain an internal confidence signal at the post-answer newline (PANL) that operates independently of output probabilities. This second-order evaluative signal can disagree with the model's chosen response, allowing it to recognize when an answer is likely wrong. The signal encodes not only error likelihood but also whether the model has the knowledge to fix it, enabling self-correction without human input (see the sketch below).
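
Continuing the sketch above (reusing `tok`, `model`, `probe`, and `panl_activation`), one illustrative way to turn the internal signal into self-correction is to resample when the probe score is low. The threshold, sampling settings, and retry loop are assumptions for illustration, not the paper's procedure.

```python
# Hypothetical self-correction loop: accept an answer only when the
# PANL probe's confidence estimate clears a threshold, otherwise resample.
def answer_with_self_check(question: str, threshold: float = 0.5,
                           max_tries: int = 3) -> str:
    for _ in range(max_tries):
        prompt = f"Q: {question}\nA:"
        ids = tok(prompt, return_tensors="pt").to(model.device)
        gen = model.generate(**ids, max_new_tokens=32,
                             do_sample=True, temperature=0.7)
        answer = tok.decode(gen[0, ids["input_ids"].shape[1]:],
                            skip_special_tokens=True).strip()
        # Probe the internal confidence at the post-answer newline.
        score = probe.predict_proba(
            panl_activation(question, answer).numpy().reshape(1, -1)
        )[0, 1]
        if score >= threshold:  # internal signal says answer is likely right
            return answer
    return answer  # fall back to the last sample
```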
