AI · 6 min read · April 20, 2026
OjaKV: Online Low-Rank Compression for LLM Key-Value Caches
A hybrid storage and adaptive subspace method reduces KV cache memory by compressing intermediate tokens while preserving critical attention anchors, and it remains compatible with FlashAttention.
Source: arxiv/cs.AI · Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Pin-Yu Chen · open original ↗
OjaKV compresses LLM key-value caches by selectively preserving first and recent tokens while adaptively compressing intermediate tokens using online PCA.
- KV cache memory often exceeds the model weights: Llama-3.1-8B at a 32K-token context with batch size 4 needs ~16 GB of cache.
- Static, offline-learned compression subspaces degrade when the input distribution shifts.
- OjaKV keeps the first and most recent tokens uncompressed as high-fidelity attention anchors.
- Intermediate tokens undergo low-rank projection with a basis updated incrementally via Oja's algorithm (see the sketch after this list).
- The subspace adapts comprehensively during prompt prefill and lightly during decoding.
- The framework integrates with FlashAttention without model retraining.
- It maintains or improves zero-shot accuracy at high compression ratios on long-context tasks.
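
The two core ideas, exact anchor tokens plus an Oja-updated low-rank subspace for the middle of the cache, can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's implementation: the rank `r`, learning rate `eta`, anchor sizes `a` and `w`, and the per-step re-orthonormalization are all assumed placeholder choices.

```python
import numpy as np

def oja_update(U, x, eta=1e-3):
    """One step of Oja's subspace rule on a new key/value vector x.

    U: (d, r) basis with (near-)orthonormal columns; x: (d,) new vector.
    """
    y = U.T @ x                                   # coefficients of x in the subspace
    U = U + eta * (np.outer(x, y) - U @ np.outer(y, y))
    # Re-orthonormalize (done every step here for simplicity; a real
    # implementation would likely do this only periodically).
    Q, _ = np.linalg.qr(U)
    return Q

def compress_middle(kv, U):
    """Keep only r coefficients per intermediate token: c = U^T k."""
    return kv @ U                                 # (n_mid, d) -> (n_mid, r)

def decompress_middle(coeffs, U):
    """Approximate reconstruction k ~= U c for attention."""
    return coeffs @ U.T                           # (n_mid, r) -> (n_mid, d)

# Hybrid layout: the first `a` and last `w` tokens stay exact (anchors);
# everything between them lives in the low-rank subspace.
d, r, a, w = 128, 32, 4, 64
rng = np.random.default_rng(0)
U = np.linalg.qr(rng.normal(size=(d, r)))[0]      # random initial basis

keys = rng.normal(size=(1000, d))                 # stand-in for one head's keys
for k in keys:                                    # online adaptation during prefill
    U = oja_update(U, k)

anchors = np.concatenate([keys[:a], keys[-w:]])   # stored uncompressed
middle = compress_middle(keys[a:-w], U)           # stored as r coefficients each
approx_middle = decompress_middle(middle, U)      # reconstructed at attention time
```

Under this scheme the cached footprint of each intermediate token drops from d floats to r, while the anchors keep full fidelity where attention is typically concentrated.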
Frequently asked
- Why is the KV cache a memory bottleneck? During autoregressive generation, the model must store the key and value vectors for every token in the context to compute attention, so this storage scales linearly with context length and batch size. For Llama-3.1-8B with a 32K-token prompt and batch size 4, the KV cache alone requires ~16 GB, exceeding the model's parameter memory. This bottleneck limits deployment on resource-constrained devices.
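
As a sanity check on that figure, a back-of-the-envelope calculation using Llama-3.1-8B's published architecture (32 layers, 8 grouped-query KV heads, head dimension 128) and an fp16 cache reproduces the ~16 GB number; the numbers below are an estimate, not a measurement from the paper.

```python
# KV cache size = 2 (K and V) x layers x kv_heads x head_dim x bytes, per token.
layers, kv_heads, head_dim, bytes_per = 32, 8, 128, 2   # Llama-3.1-8B, fp16
per_token = 2 * layers * kv_heads * head_dim * bytes_per
print(per_token)                  # 131072 bytes = 128 KiB per token

total = per_token * 32_768 * 4    # 32K-token context, batch size 4
print(total / 2**30)              # 16.0 GiB
```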