AI · 8 min read · April 17, 2026
Token Importance in On-Policy Distillation: Entropy and Disagreement
Research identifies two regions of high-value tokens in knowledge distillation: high-entropy positions and low-entropy positions where student and teacher disagree, enabling 50–80% token reduction.
Source: arxiv/cs.AI · Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard
On-policy distillation learns most effectively from two kinds of tokens: those where the student is uncertain, and those where it is confident but the teacher disagrees.
- Student entropy alone captures ~50% of useful tokens; entropy-based sampling matches full-token training with 47% less peak memory.
- Low-entropy, high-divergence tokens (overconfident errors) carry dense corrective signal despite being rare.
- The TIP framework organizes token importance along two axes: student entropy and teacher–student divergence.
- Type-aware selection combining uncertainty and disagreement outperforms entropy-only rules (see the sketch after this list).
- Experiments on Qwen, Llama, and Qwen2.5 show that retaining <20% of tokens can exceed full-token baselines on math and planning tasks.
- Overconfident-wrong tokens are structurally invisible to entropy-only methods but critical for learning.
- Memory savings enable distillation of larger models under constrained GPU budgets.
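To make the two-axis idea concrete, here is a minimal PyTorch-style sketch of type-aware token selection under the paper's framing: keep a token if the student's entropy is high, or if entropy is low but teacher–student divergence is high. The function name, the threshold values, and the choice of forward KL as the divergence measure are illustrative assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def select_tokens(student_logits, teacher_logits,
                  entropy_thresh=2.0, kl_thresh=1.0):
    """Classify tokens along the two axes described above.

    student_logits, teacher_logits: [seq_len, vocab_size]
    Returns a boolean mask over the sequence (True = train on this token).
    Thresholds are placeholders, not values from the paper.
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    s_p = s_logp.exp()

    # Per-token student entropy: H_t = -sum_v p(v) log p(v)
    entropy = -(s_p * s_logp).sum(dim=-1)

    # Per-token forward KL(teacher || student) as the disagreement signal
    kl = (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1)

    uncertain = entropy > entropy_thresh                # region 1: high entropy
    overconfident_wrong = (entropy <= entropy_thresh) & (kl > kl_thresh)  # region 2

    return uncertain | overconfident_wrong
```

Masking the distillation loss with a selector like this is what enables the token reduction: low-entropy, low-divergence positions contribute little gradient, so skipping them trades almost no signal for large memory savings.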
Frequently asked
- Which token regions carry the most learning signal? High-entropy tokens (where the student is uncertain) and low-entropy, high-divergence tokens (where the student is overconfident but wrong). Together, these regions contain the densest learning signal. Entropy alone captures about 50% of useful tokens, but the second region, overconfident errors, carries corrective information that entropy-only methods miss.
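A toy example (illustrative numbers, not from the paper) shows why such tokens are invisible to entropy-only rules: the student's entropy is tiny even though the teacher flatly disagrees.

```python
import math

# Toy 3-way distributions for a single token position.
# Student is confidently wrong; teacher is confident about a different token.
student = [0.98, 0.01, 0.01]   # low entropy: looks "easy" to entropy-only rules
teacher = [0.01, 0.98, 0.01]   # but the teacher disagrees strongly

entropy = -sum(p * math.log(p) for p in student)
kl = sum(t * math.log(t / s) for t, s in zip(teacher, student))

print(f"student entropy: {entropy:.3f} nats")    # ~0.112: passes under any entropy cutoff
print(f"KL(teacher || student): {kl:.3f} nats")  # ~4.447: strong disagreement
```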