AI · 8 min read · 17 April 2026

Token Importance in On-Policy Distillation: Entropy and Disagreement

Research identifies two regions of high-value tokens in knowledge distillation: high-entropy positions and low-entropy positions where student and teacher disagree, enabling 50–80% token reduction.

Source: arXiv/cs.AI · Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard · open original ↗

On-policy distillation learns most effectively from tokens where the student is uncertain, or where the student is confident yet disagrees with the teacher.

  • Student entropy alone captures ~50% of useful tokens; entropy-based sampling matches full-token training with 47% less peak memory.
  • Low-entropy, high-divergence tokens (overconfident errors) carry dense corrective signal despite being rare.
  • TIP framework organizes token importance across two axes: student entropy and teacher–student divergence.
  • Type-aware selection combining uncertainty and disagreement outperforms entropy-only rules.
  • Experiments on Qwen, Llama, and Qwen2.5 show <20% token retention can exceed full-token baselines on math and planning tasks.
  • Overconfident-wrong tokens are structurally invisible to entropy-only methods but critical for learning.
  • Memory savings enable distillation of larger models under constrained GPU budgets.
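The two-axis selection rule above can be sketched as follows. This is a minimal illustration, not the paper's exact criterion: the thresholds, the use of KL(teacher ‖ student) as the divergence measure, and the function names are assumptions for the sketch.

```python
import numpy as np

def token_entropy(probs):
    # Shannon entropy of each token's predictive distribution
    # (axis -1 is the vocabulary dimension).
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def kl_divergence(teacher_probs, student_probs):
    # KL(teacher || student) per token position; illustrative
    # choice of divergence, the paper may use a different one.
    t = np.clip(teacher_probs, 1e-12, 1.0)
    s = np.clip(student_probs, 1e-12, 1.0)
    return (t * (np.log(t) - np.log(s))).sum(axis=-1)

def select_tokens(student_probs, teacher_probs,
                  entropy_thresh=0.5, div_thresh=1.0):
    # Keep the two high-value regions:
    #  1) high-entropy tokens (uncertain student), and
    #  2) low-entropy, high-divergence tokens (overconfident errors),
    # which entropy-only selection would miss.
    h = token_entropy(student_probs)
    d = kl_divergence(teacher_probs, student_probs)
    uncertain = h >= entropy_thresh
    overconfident_wrong = (h < entropy_thresh) & (d >= div_thresh)
    return uncertain | overconfident_wrong
```

On a toy vocabulary of two symbols, a 50/50 student token is kept for its entropy, a confident-but-wrong token is kept for its divergence, and a confident token that matches the teacher is dropped; the returned boolean mask can then gate the distillation loss per position.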

Frequently asked questions

  • Which token regions carry the most learning signal? High-entropy tokens (where the student is uncertain) and low-entropy, high-divergence tokens (where the student is overconfident but wrong). Together, these regions contain the densest learning signal. Entropy alone captures about 50% of useful tokens, but the second region, overconfident errors, carries corrective information that entropy-only methods miss.
