AI · 8 min read · April 17, 2026

Distilling Transformers into Mamba via Linearized Attention

A two-stage knowledge transfer method preserves Transformer performance in State Space Models by routing through linearized attention as an intermediate step.

Source: arxiv/cs.LG · Abhinav Moudgil, Ningyuan Huang, Eeshan Gunesh Dhekane, Pau Rodríguez, Luca Zappella, Federico Danieli · open original ↗

Two-stage distillation through linearized attention enables Mamba models to match Transformer performance without hybrid architectures.

  • Naive Transformer-to-Mamba distillation fails; hybrid models combining both architectures were the prior workaround.
  • Principled initialization of Mamba weights during distillation recovers performance loss.
  • Stage one: distill Transformer into linearized attention using kernel trick adaptation.
  • Stage two: distill linearized attention into pure Mamba without any Attention blocks.
  • A distilled 1B Mamba nearly matches the teacher's perplexity (14.11 vs. 13.86) and preserves downstream-task performance.
  • Ablations test sequence mixer variants, model scaling, token allocation, and total distillation budget.
  • Method avoids hybrid solutions, enabling deployment of efficient SSM-only models.
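The stage-one bridge rests on a standard identity: with a positive feature map φ, linearized attention can be computed either in a parallel, attention-like form or as a recurrence over a running state, which is exactly the form an SSM consumes. A minimal NumPy sketch of this equivalence (using the common `elu(x) + 1` feature map; all names here are illustrative, not the paper's code):

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a common positive feature map for linearized attention
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = rng.normal(size=(3, n, d))
phi_q, phi_k = feature_map(Q), feature_map(K)

# Parallel (attention-like) form: O(n^2) causal linear attention
scores = (phi_q @ phi_k.T) * np.tril(np.ones((n, n)))
out_parallel = (scores @ V) / scores.sum(-1, keepdims=True)

# Recurrent (SSM-like) form: O(n) with a running state
S = np.zeros((d, d))          # accumulates phi(k_t) v_t^T
z = np.zeros(d)               # accumulates phi(k_t) for normalization
out_recurrent = np.zeros_like(V)
for t in range(n):
    S += np.outer(phi_k[t], V[t])
    z += phi_k[t]
    out_recurrent[t] = (phi_q[t] @ S) / (phi_q[t] @ z)

print(np.allclose(out_parallel, out_recurrent))  # the two forms agree
```

Because the recurrent form already looks like a state-space update, a student that has learned linearized attention in stage one is a much shorter hop to a pure Mamba student in stage two than the original softmax Transformer would be.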

Frequently asked questions

  • Why does naive Transformer-to-Mamba distillation fail? The architectural mismatch between Attention (which learns token-to-token relationships) and State Space Models (which use recurrent state transitions) means gradients do not flow effectively: the student SSM lacks the inductive bias to capture Attention patterns directly. The two-stage approach solves this by using linearized attention as a bridge, a mathematically compatible intermediate form that both Attention and SSM can learn from.
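The "mathematically compatible" claim can be made concrete. A sketch of the correspondence (notation is ours, not the paper's): causal linear attention maintains a state
\[
S_t = S_{t-1} + \phi(k_t)\, v_t^{\top}, \qquad
z_t = z_{t-1} + \phi(k_t), \qquad
o_t = \frac{\phi(q_t)^{\top} S_t}{\phi(q_t)^{\top} z_t},
\]
while a (selective) SSM layer such as Mamba evolves a hidden state as
\[
h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t .
\]
Up to normalization, linear attention is the special case \(\bar{A}_t = I\), with \(\phi(k_t) v_t^{\top}\) playing the role of the input injection and \(\phi(q_t)\) the role of the readout \(C_t\). This shared recurrence is why an SSM student can absorb a linearized-attention teacher far more directly than a softmax-attention one.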

Related