AI · 6 min read · April 20, 2026
OjaKV: Online Low-Rank Compression for LLM Key-Value Caches
A hybrid storage and adaptive subspace method reduces KV cache memory by compressing intermediate tokens while preserving critical attention anchors, and it remains compatible with FlashAttention.
Source: arxiv/cs.AI · Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Pin-Yu Chen · open original ↗
OjaKV compresses LLM key-value caches by selectively preserving first and recent tokens while adaptively compressing intermediate tokens using online PCA.
- KV cache memory often exceeds the model weights: Llama-3.1-8B at a 32K-token context with batch size 4 needs ~16 GB of cache.
- Static, offline-learned compression subspaces degrade when the input distribution shifts.
- OjaKV keeps the first and most recent tokens uncompressed as high-fidelity attention anchors.
- Intermediate tokens undergo low-rank projection with a basis updated incrementally via Oja's algorithm (see the sketch after this list).
- The subspace adapts comprehensively during prompt prefill and lightly during decoding.
- The framework integrates with FlashAttention without model retraining.
- It maintains or improves zero-shot accuracy at high compression ratios on long-context tasks.
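
The two core ideas, exact anchor tokens plus an Oja-updated low-rank subspace for the middle of the cache, can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's implementation: the rank `r`, learning rate `eta`, anchor sizes `a` and `w`, and the per-step re-orthonormalization are all assumed placeholder choices.

```python
import numpy as np

def oja_update(U, x, eta=1e-3):
    """One step of Oja's subspace rule on a new key/value vector x.

    U: (d, r) basis with (near-)orthonormal columns; x: (d,) new vector.
    """
    y = U.T @ x                                   # coefficients of x in the subspace
    U = U + eta * (np.outer(x, y) - U @ np.outer(y, y))
    # Re-orthonormalize (done every step here for simplicity; a real
    # implementation would likely do this only periodically).
    Q, _ = np.linalg.qr(U)
    return Q

def compress_middle(kv, U):
    """Keep only r coefficients per intermediate token: c = U^T k."""
    return kv @ U                                 # (n_mid, d) -> (n_mid, r)

def decompress_middle(coeffs, U):
    """Approximate reconstruction k ~= U c for attention."""
    return coeffs @ U.T                           # (n_mid, r) -> (n_mid, d)

# Hybrid layout: the first `a` and last `w` tokens stay exact (anchors);
# everything between them lives in the low-rank subspace.
d, r, a, w = 128, 32, 4, 64
rng = np.random.default_rng(0)
U = np.linalg.qr(rng.normal(size=(d, r)))[0]      # random initial basis

keys = rng.normal(size=(1000, d))                 # stand-in for one head's keys
for k in keys:                                    # online adaptation during prefill
    U = oja_update(U, k)

anchors = np.concatenate([keys[:a], keys[-w:]])   # stored uncompressed
middle = compress_middle(keys[a:-w], U)           # stored as r coefficients each
approx_middle = decompress_middle(middle, U)      # reconstructed at attention time
```

Under this scheme the cached footprint of each intermediate token drops from d floats to r, while the anchors keep full fidelity where attention is typically concentrated.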
Frequently asked
- Why is the KV cache a memory bottleneck? During autoregressive generation, the model must store the key and value vectors for every token in the context to compute attention, so this storage scales linearly with context length and batch size. For Llama-3.1-8B with a 32K-token prompt and batch size 4, the KV cache alone requires ~16 GB, exceeding the model's parameter memory. This bottleneck limits deployment on resource-constrained devices.
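
As a sanity check on that figure, a back-of-the-envelope calculation using Llama-3.1-8B's published architecture (32 layers, 8 grouped-query KV heads, head dimension 128) and an fp16 cache reproduces the ~16 GB number; the numbers below are an estimate, not a measurement from the paper.

```python
# KV cache size = 2 (K and V) x layers x kv_heads x head_dim x bytes, per token.
layers, kv_heads, head_dim, bytes_per = 32, 8, 128, 2   # Llama-3.1-8B, fp16
per_token = 2 * layers * kv_heads * head_dim * bytes_per
print(per_token)                  # 131072 bytes = 128 KiB per token

total = per_token * 32_768 * 4    # 32K-token context, batch size 4
print(total / 2**30)              # 16.0 GiB
```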