AI · 8 min read · April 17, 2026
Token Importance in On-Policy Distillation: Entropy and Disagreement
Research identifies two regions of high-value tokens in knowledge distillation: high-entropy positions and low-entropy positions where student and teacher disagree, enabling 50–80% token reduction.
Source: arxiv/cs.AI · Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard
On-policy distillation learns most effectively from two kinds of tokens: those where the student is uncertain, and those where it is confident but the teacher disagrees.
- Student entropy alone captures ~50% of useful tokens; entropy-based sampling matches full-token training with 47% less peak memory.
- Low-entropy, high-divergence tokens (overconfident errors) carry dense corrective signal despite being rare.
- The TIP framework organizes token importance along two axes: student entropy and teacher–student divergence.
- Type-aware selection combining uncertainty and disagreement outperforms entropy-only rules (see the sketch after this list).
- Experiments on Qwen, Llama, and Qwen2.5 show that retaining <20% of tokens can exceed full-token baselines on math and planning tasks.
- Overconfident-wrong tokens are structurally invisible to entropy-only methods but critical for learning.
- Memory savings enable distillation of larger models under constrained GPU budgets.
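To make the two-axis idea concrete, here is a minimal PyTorch-style sketch of type-aware token selection under the paper's framing: keep a token if the student's entropy is high, or if entropy is low but teacher–student divergence is high. The function name, the threshold values, and the choice of forward KL as the divergence measure are illustrative assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def select_tokens(student_logits, teacher_logits,
                  entropy_thresh=2.0, kl_thresh=1.0):
    """Classify tokens along the two axes described above.

    student_logits, teacher_logits: [seq_len, vocab_size]
    Returns a boolean mask over the sequence (True = train on this token).
    Thresholds are placeholders, not values from the paper.
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    s_p = s_logp.exp()

    # Per-token student entropy: H_t = -sum_v p(v) log p(v)
    entropy = -(s_p * s_logp).sum(dim=-1)

    # Per-token forward KL(teacher || student) as the disagreement signal
    kl = (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1)

    uncertain = entropy > entropy_thresh                # region 1: high entropy
    overconfident_wrong = (entropy <= entropy_thresh) & (kl > kl_thresh)  # region 2

    return uncertain | overconfident_wrong
```

Masking the distillation loss with a selector like this is what enables the token reduction: low-entropy, low-divergence positions contribute little gradient, so skipping them trades almost no signal for large memory savings.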
Frequently asked
- Which token regions carry the most learning signal? High-entropy tokens (where the student is uncertain) and low-entropy, high-divergence tokens (where the student is overconfident but wrong). Together, these regions contain the densest learning signal. Entropy alone captures about 50% of useful tokens, but the second region, overconfident errors, carries corrective information that entropy-only methods miss.
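A toy example (illustrative numbers, not from the paper) shows why such tokens are invisible to entropy-only rules: the student's entropy is tiny even though the teacher flatly disagrees.

```python
import math

# Toy 3-way distributions for a single token position.
# Student is confidently wrong; teacher is confident about a different token.
student = [0.98, 0.01, 0.01]   # low entropy: looks "easy" to entropy-only rules
teacher = [0.01, 0.98, 0.01]   # but the teacher disagrees strongly

entropy = -sum(p * math.log(p) for p in student)
kl = sum(t * math.log(t / s) for t, s in zip(teacher, student))

print(f"student entropy: {entropy:.3f} nats")    # ~0.112: passes under any entropy cutoff
print(f"KL(teacher || student): {kl:.3f} nats")  # ~4.447: strong disagreement
```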