Understanding KV Cache Optimization for LLM Inference

📄 Paper Walkthrough

KV Cache Optimization Strategies for Scalable and Efficient LLM Inference
Yichun Xu, Navjot K. Khaira, Tejinder Singh (Dell Technologies)
arXiv:2603.20397 · March 24, 2026 · 24 pages · 78 references

Why KV cache is a problem

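  • During autoregressive generation, every new token must “see” the K/V vectors of all prior tokens.
  • Without caching, every step recomputes the K/V vectors of the whole prefix from scratch, so the redundant work grows quadratically with sequence length: O(N²).
  • With caching, each step only adds the new token's K/V, but cache memory grows linearly with context length: KV size = 2·H·D·L·B·N elements, where, in the usual reading, 2 counts keys and values, H is the number of attention heads, D the head dimension, L the number of layers, B the batch size, and N the context length.
  • As context windows stretch from 2K to 100K, 1M, and 10M tokens, the KV cache can consume all available GPU memory, making inference slow, expensive, or impractical.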
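
The first two points can be illustrated with a toy single-head attention loop. This is a minimal sketch under stated assumptions: the dimensions, random weights, and helper names (`project_kv`, `attend`, `decode_with_cache`) are hypothetical, and the token vectors are fixed in advance for simplicity. It counts K/V projections only; attention over the cached entries still costs time proportional to the prefix length at each step.

```python
import math
import random

random.seed(0)
DIM, STEPS = 4, 6  # toy sizes, chosen only for illustration

def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]

W_q, W_k, W_v = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(STEPS)]
kv_projections = {"no_cache": 0, "cache": 0}  # counts K/V projections performed

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def project_kv(x, mode):
    kv_projections[mode] += 2  # one key projection and one value projection
    return matvec(W_k, x), matvec(W_v, x)

def attend(q, keys, values):
    """Softmax attention of one query over all given keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

def decode_without_cache():
    outputs = []
    for t in range(STEPS):
        # Recompute K and V for every token seen so far.
        pairs = [project_kv(x, "no_cache") for x in tokens[: t + 1]]
        keys = [k for k, _ in pairs]
        values = [v for _, v in pairs]
        outputs.append(attend(matvec(W_q, tokens[t]), keys, values))
    return outputs

def decode_with_cache():
    keys, values, outputs = [], [], []
    for t in range(STEPS):
        # Project only the newest token and append it to the cache.
        k, v = project_kv(tokens[t], "cache")
        keys.append(k)
        values.append(v)
        outputs.append(attend(matvec(W_q, tokens[t]), keys, values))
    return outputs

slow, fast = decode_without_cache(), decode_with_cache()
same = all(math.isclose(a, b) for s, f in zip(slow, fast) for a, b in zip(s, f))
print(same, kv_projections)  # True {'no_cache': 42, 'cache': 12}
```

For 6 steps the uncached loop performs 2·(1+2+…+6) = 42 projections against 12 with the cache, and both produce the same outputs. The price is that `keys` and `values` keep growing with every generated token.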

[Figure: Transformer self-attention with KV cache]
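
To make the linear growth in the size formula concrete, the sketch below evaluates it for one configuration. The model shape (32 layers, 32 heads, head dimension 128) and the 2 bytes per element for fp16 are assumptions chosen for illustration, not values from the paper. The formula counts elements, so the code multiplies by bytes per element.

```python
def kv_cache_bytes(heads, head_dim, layers, batch, tokens, bytes_per_elem=2):
    """KV size = 2 * H * D * L * B * N elements; the factor 2 covers keys and values."""
    return 2 * heads * head_dim * layers * batch * tokens * bytes_per_elem

GIB = 1024 ** 3

# Hypothetical model shape (assumed, not taken from the paper).
for tokens in (2_048, 131_072, 1_048_576):
    size = kv_cache_bytes(heads=32, head_dim=128, layers=32, batch=1, tokens=tokens)
    print(f"{tokens:>9} tokens -> {size / GIB:6.1f} GiB")
```

Under these assumptions the cache needs 1 GiB at 2K tokens, 64 GiB at 128K, and 512 GiB at 1M for a single request, which is why long contexts outgrow a single GPU.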

Five technique categories

| # | Category | One-line idea | Representative work | Typical gain |
|---|----------|---------------|---------------------|--------------|
| 1 | Cache Eviction | Drop unimportant tokens during generation | H₂O, SnapKV, Ada-KV | ~80% memory cut, near-lossless |
| 2 | Cache Compression | Quantize KV to 2–4 bit or low-rank projection | KIVI, KVQuant, Palu | ×4–×8 memory, lossless or <2% accuracy drop |
| 3 | Hybrid Memory | Move KV to CPU/SSD; keep only hot entries on GPU | vLLM/PagedAttention, FlexGen, Oneiros | Run huge models on one GPU, ×6 batch, ×3–×33 throughput |
| 4 | New Attention | Replace softmax attention: O(N²) → O(N) or O(N log N) | Linear, Log-Linear, Kimi Linear | ×6.3 throughput, 75% memory cut (Kimi) |
| 5 | Combination | Mix the above four | RocketKV, KVzip, ShadowKV, TailorKV | Best overall; no single technique wins everywhere |
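
The cache eviction idea in row 1 can be sketched as a keep-list computed from per-token importance scores. This is a simplified illustration loosely in the spirit of heavy-hitter eviction such as H₂O, not the published algorithm: the function name `tokens_to_keep`, the `budget` and `recent` parameters, and the example scores are all hypothetical.

```python
def tokens_to_keep(scores, budget, recent):
    """Return sorted indices of cached tokens to keep.

    scores: accumulated attention weight per cached token (assumed input).
    budget: maximum number of tokens to keep in the cache.
    recent: number of most recent tokens that are always kept.
    """
    n = len(scores)
    if n <= budget:
        return list(range(n))  # cache fits, nothing to evict
    recent = min(recent, budget)
    recent_ids = set(range(n - recent, n))
    # Rank the older tokens by score and keep the highest-scoring ones.
    older = sorted(range(n - recent), key=lambda i: scores[i], reverse=True)
    heavy_ids = older[: budget - recent]
    return sorted(recent_ids.union(heavy_ids))

# Made-up accumulated attention scores for 8 cached tokens.
scores = [0.9, 0.1, 0.05, 0.6, 0.02, 0.3, 0.2, 0.4]
print(tokens_to_keep(scores, budget=4, recent=2))  # [0, 3, 6, 7]
```

Here the two most recent tokens (6 and 7) survive unconditionally and the remaining budget goes to the two highest-scoring older tokens (0 and 3). Everything else would be dropped from the cache.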

| Scenario | Recommended | Why |
|----------|-------------|-----|
| Ultra-long context (>1M), single request | Eviction + Compression; Kimi Linear | Memory is the bottleneck; must shrink the cache |
| Minimal model modification | Ada-KV, SnapKV, KIVI | All fine-tuning-free, plug-and-play |
| High-throughput datacenter | PagedAttention/vLLM, Oneiros, ShadowKV | Large batches, lossless, multi-tenant |
| Edge / memory-limited devices | InfiniPot, TailorKV | 8B model with 128K context on a 24 GB GPU |
| Multi-turn conversations | RocketKV-MT, KVzip, ShadowKV | Cannot permanently drop tokens the way H₂O does |
| Prefill-heavy (long prompts) | NACL, HashEvict, LayerKV, MiniCache | Focus on TTFT (time to first token) |
| Accuracy-critical reasoning | PagedAttention | Lossless offload; avoid eviction/compression/linear attention |
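
The plug-and-play compression methods recommended above rest on low-bit quantization of the cached values. The sketch below shows plain asymmetric uniform quantization of one group of numbers, as a generic illustration under stated assumptions: the helper names and sample values are hypothetical, and it does not reproduce the grouping or outlier handling of KIVI, KVQuant, or any other specific method.

```python
def quantize_group(values, bits):
    """Asymmetric uniform quantization of one group of floats to `bits` bits."""
    levels = (1 << bits) - 1  # 3 for 2-bit, 15 for 4-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0  # guard constant groups
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate floats."""
    return [c * scale + lo for c in codes]

group = [-1.0, -0.5, 0.0, 0.25, 1.0, 2.0]  # made-up slice of a key or value vector
for bits in (4, 2):
    codes, scale, lo = quantize_group(group, bits)
    restored = dequantize_group(codes, scale, lo)
    worst = max(abs(a - b) for a, b in zip(group, restored))
    print(f"{bits}-bit: codes={codes} max_error={worst:.3f}")
```

Each value is stored as a small integer plus a shared scale and offset per group, and the rounding error is bounded by half the scale. Going from 16-bit values to 4 or 2 bits is where a ×4 to ×8 reduction would come from, before accounting for the stored scale and offset.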

Key conclusion

There is no silver bullet.

  • Ultra-long context (>1M) → eviction + compression dominate.
  • High-throughput serving → hybrid memory dominates (vLLM is still the de facto standard).
  • Bandwidth-bound → standalone compression wins.
  • New attention mechanisms → the future, but require retraining.
  • Future direction: adaptive, multi-stage pipelines that combine techniques dynamically based on context length, load, and hardware.