AI · 8 min read · April 17, 2026

Distilling Transformers into Mamba via Linearized Attention

A two-stage knowledge transfer method preserves Transformer performance in State Space Models by routing through linearized attention as an intermediate step.

Source: arxiv/cs.LG · Abhinav Moudgil, Ningyuan Huang, Eeshan Gunesh Dhekane, Pau Rodríguez, Luca Zappella, Federico Danieli · open original ↗

Two-stage distillation through linearized attention enables Mamba models to match Transformer performance without hybrid architectures.

  • Naive Transformer-to-Mamba distillation fails; hybrid models combining both architectures were the prior workaround.
  • Principled initialization of Mamba weights during distillation recovers performance loss.
  • Stage one: distill Transformer into linearized attention using kernel trick adaptation.
  • Stage two: distill linearized attention into pure Mamba without any Attention blocks.
  • A distilled 1B Mamba nearly matches the teacher's perplexity (14.11 vs. 13.86) and preserves downstream-task performance.
  • Ablations test sequence mixer variants, model scaling, token allocation, and total distillation budget.
  • Method avoids hybrid solutions, enabling deployment of efficient SSM-only models.
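The stage-one bridge rests on a standard identity: with a positive feature map φ, linearized attention can be computed either in a parallel, attention-like form or as a recurrence over a running state, which is exactly the form an SSM consumes. A minimal NumPy sketch of this equivalence (using the common `elu(x) + 1` feature map; all names here are illustrative, not the paper's code):

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a common positive feature map for linearized attention
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = rng.normal(size=(3, n, d))
phi_q, phi_k = feature_map(Q), feature_map(K)

# Parallel (attention-like) form: O(n^2) causal linear attention
scores = (phi_q @ phi_k.T) * np.tril(np.ones((n, n)))
out_parallel = (scores @ V) / scores.sum(-1, keepdims=True)

# Recurrent (SSM-like) form: O(n) with a running state
S = np.zeros((d, d))          # accumulates phi(k_t) v_t^T
z = np.zeros(d)               # accumulates phi(k_t) for normalization
out_recurrent = np.zeros_like(V)
for t in range(n):
    S += np.outer(phi_k[t], V[t])
    z += phi_k[t]
    out_recurrent[t] = (phi_q[t] @ S) / (phi_q[t] @ z)

print(np.allclose(out_parallel, out_recurrent))  # the two forms agree
```

Because the recurrent form already looks like a state-space update, a student that has learned linearized attention in stage one is a much shorter hop to a pure Mamba student in stage two than the original softmax Transformer would be.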

Frequently asked questions

  • Why does naive Transformer-to-Mamba distillation fail? The architectural mismatch between Attention (which learns token-to-token relationships) and State Space Models (which use recurrent state transitions) means gradients do not flow effectively: the student SSM lacks the inductive bias to capture Attention patterns directly. The two-stage approach solves this by using linearized attention as a bridge, a mathematically compatible intermediate form that both Attention and SSM can learn from.
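The "mathematically compatible" claim can be made concrete. A sketch of the correspondence (notation is ours, not the paper's): causal linear attention maintains a state
\[
S_t = S_{t-1} + \phi(k_t)\, v_t^{\top}, \qquad
z_t = z_{t-1} + \phi(k_t), \qquad
o_t = \frac{\phi(q_t)^{\top} S_t}{\phi(q_t)^{\top} z_t},
\]
while a (selective) SSM layer such as Mamba evolves a hidden state as
\[
h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t .
\]
Up to normalization, linear attention is the special case \(\bar{A}_t = I\), with \(\phi(k_t) v_t^{\top}\) playing the role of the input injection and \(\phi(q_t)\) the role of the readout \(C_t\). This shared recurrence is why an SSM student can absorb a linearized-attention teacher far more directly than a softmax-attention one.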

Related