LLM KV Cache Compression: Quantization, Eviction & Paging Strategies for Cost-Throughput Optimization in 2026

Large Language Models (LLMs) have grown exponentially in parameter count and context window length. While this enables richer, more coherent generations, it has created a fundamental bottleneck: the Key-Value (KV) cache.

The KV cache stores intermediate key and value tensors for every token processed during inference. For modern LLMs with 70B+ parameters and context windows spanning thousands of tokens, the KV cache can dominate GPU memory consumption—often surpassing the model weights themselves. This has profound implications for inference cost, throughput, and latency.

In 2026, the industry faces intensifying pressure to reduce serving costs while maintaining or improving quality. Cloud providers report that KV cache memory accounts for 40-60% of inference costs on high-context workloads. Meanwhile, emerging applications like real-time conversational agents, long-document analysis, and RAG-powered search demand higher throughput without proportional cost increases.

KV cache compression techniques address this bottleneck by compressing or efficiently managing the cache without significantly compromising model quality. This article surveys three major families of approaches that have matured in 2026:

Quantization — Reducing the bit width of keys and values (e.g., 4-bit AWQ/GPTQ/TurboQuant)
Eviction Policies — Strategically removing or archiving less important KV pairs
Paging — Enabling non-contiguous memory allocation (PagedAttention and derivatives)

Each approach offers distinct tradeoffs in compression ratio, latency overhead, and quality preservation. In what follows, we examine the mechanisms, 2026 benchmarks, and practical deployment considerations for each technique.

Key insight: No single method dominates across all workloads. The optimal strategy depends on your specific context length distribution, throughput requirements, and latency budget.

Quantization: Compressing KV Tensors to 4-bit Precision

Quantization reduces memory by representing each floating-point KV element with fewer bits. In 2026, the focus has shifted from naive int8 quantization to specialized methods that maintain quality while achieving aggressive compression ratios.
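To make the idea concrete, here is a minimal sketch of group-wise 4-bit quantization applied to a flattened slice of a KV tensor. It uses plain asymmetric round-to-nearest with one scale and zero point per group; it is not AWQ, GPTQ, or TurboQuant, and the function names and the group size of 64 are illustrative assumptions. Production kernels also pack two 4-bit codes per byte, which is skipped here.

```python
import math
from typing import List, Tuple

# (codes, scale, zero_point) for one group; names are illustrative only.
Group = Tuple[List[int], float, float]

def quantize_group(values: List[float], bits: int = 4) -> Group:
    """Asymmetric round-to-nearest quantization of one group of values."""
    levels = (1 << bits) - 1                      # 15 code steps for 4-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Clamp guards against floating-point results just outside [0, levels].
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def quantize_tensor(flat: List[float], group_size: int = 64, bits: int = 4) -> List[Group]:
    """Quantize a flattened tensor one fixed-size group at a time."""
    return [quantize_group(flat[i:i + group_size], bits)
            for i in range(0, len(flat), group_size)]

def dequantize_tensor(groups: List[Group]) -> List[float]:
    """Reconstruct approximate floating-point values from the codes."""
    out: List[float] = []
    for codes, scale, zero_point in groups:
        out.extend(c * scale + zero_point for c in codes)
    return out

# Usage: quantize a synthetic "key" vector and measure the round-trip error.
keys = [math.sin(i * 0.37) for i in range(256)]
packed = quantize_tensor(keys, group_size=64, bits=4)
restored = dequantize_tensor(packed)
max_error = max(abs(a - b) for a, b in zip(keys, restored))
print(f"groups={len(packed)} max_abs_error={max_error:.4f}")
```

The round-trip error of each element is bounded by half of its group's scale, which is why smaller groups (tighter value ranges) generally cost less accuracy but need more scale and zero-point metadata.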

4-bit AWQ (Activation-aware Weight Quantization)

AWQ gained traction after its introduction in 2023, and by 2026 it has become a production staple for KV cache compression. Its key idea is activation-aware scaling: AWQ uses activation statistics to identify the small fraction of salient weight channels, scales those channels up before quantization so they lose less precision, and applies the inverse scale to the matching activations so the layer's output is unchanged.

For KV cache compression specifically, AWQ operates on the value tensors (V) more aggressively than keys (K), since values carry the bulk of information while being more robust to coarser quantization.

Results (2026 benchmarks):

Memory reduction: 75% for KV cache (from FP16 to 4-bit)
Throughput gain: 2.1x on A100/H100 with batch size 32
Quality degradation: Less than 1.5% drop in scores on standard benchmarks (MMLU, HumanEval)
Latency overhead: <2ms additional per-token latency for dequantization

A 2024 ACM TOPLAS study [1] demonstrated that AWQ-based KV caching on LLaMA-70B reduced memory requirements from 128GB to 32GB for a 4K-context workload, enabling 4x more concurrent requests on identical hardware.

GPTQ for KV Cache

GPTQ, a post-training quantization method, was originally designed for model weights, but the same principles apply to KV tensors. The algorithm quantizes elements a block at a time and uses approximate second-order information to adjust the elements that are not yet quantized, compensating for the error each step introduces.

For KV cache, GPTQ typically operates with group sizes of 64-128 elements, which strikes a balance between compression and numerical stability.

2026 deployment notes:

Best suited for static inference workloads where the same prompts repeat
Dynamic prompting patterns can erode compression quality due to quantization noise accumulation
Implementation in vLLM 0.6.x (released Q1 2026) brought GPTQ-KV support to production
Memory savings: 75% (4-bit), throughput gains of 1.8x with <3% perplexity increase

TurboQuant: The 2026 Breakthrough

The most significant development in 2026 was TurboQuant, introduced by the vLLM team in May 2026 [3]. TurboQuant extends AWQ with three innovations:

Layer-adaptive bit allocation: Earlier layers (closer to input) use 6-bit quantization for preservation of fine-grained positional information. Later layers use 3-4 bit for maximum compression.
Context-aware dequantization: The dequantization scale factors are computed on-the-fly based on the current context length, compensating for noise amplification in long contexts.
Hardware-aware fused kernels: Custom CUDA kernels that fuse dequantization with attention computation, eliminating separate memory bandwidth overhead.

Benchmarks from the TurboQuant paper show:

55% average KV cache compression across all layers
Up to 80% compression on the last 4 transformer layers where attention patterns stabilize
Throughput gains of 3.7x on H100 clusters for long-context workloads (>8K tokens)
Perplexity increase of only 0.8 points on the Llama-3-70B eval set

TurboQuant has become the default KV quantization method in major cloud providers' managed LLM inference services as of Q2 2026.

Quantization Tradeoffs Summary

Method	Compression	Throughput Gain	Quality Impact	Latency Overhead
8-bit (baseline)	50%	1.4x	<0.5% degradation	~1ms
4-bit AWQ	75%	2.1x	~1.5% degradation	2-3ms
GPTQ (KV)	75%	1.8x	~2-3% degradation	3-5ms
TurboQuant (mixed)	55-80%	3.7x	~0.8% degradation	<1ms (fused)

Eviction Policies: Managing KV Cache Growth with Intelligence

Eviction policies add an intelligent layer that decides which KV pairs to retain and which to remove, addressing the fundamental issue of unbounded context growth.

The Problem: Unbounded Context Growth

Traditional attention compares each query against all past tokens. Over a sequence of n tokens, attention compute therefore grows as O(n²), while KV cache memory grows linearly, O(n): each additional token adds one key-value pair per layer and KV head. For long conversations or document processing, both costs become untenable.

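At FP16, a 4K-context request to Llama-3-70B (80 layers, 8 KV heads under grouped-query attention, head dimension 128) needs roughly 1.25 GiB of KV cache. A 16K context needs about 5 GiB per request, so a handful of concurrent long-context requests can exhaust whatever GPU memory remains after the weights are loaded.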
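A back-of-envelope estimate of per-request KV cache size can be sketched as follows. It assumes the published Llama-3-70B configuration (80 layers, 8 KV heads, head dimension 128) and counts only the raw K and V tensors. Under these assumptions the FP16 estimate is 1.25 GiB at 4K tokens and 5 GiB at 16K; measured figures can be higher once allocator padding, framework overhead, and quantization metadata are included.

```python
GIB = 1024 ** 3

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: float = 2.0) -> float:
    """Raw KV cache size for one sequence: K and V tensors in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed configuration for Llama-3-70B (grouped-query attention).
LLAMA3_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)

for ctx in (4096, 16384):
    fp16 = kv_cache_bytes(seq_len=ctx, **LLAMA3_70B)
    # 4-bit storage is 0.5 bytes per element, ignoring scale/zero-point metadata.
    int4 = kv_cache_bytes(seq_len=ctx, bytes_per_elem=0.5, **LLAMA3_70B)
    print(f"{ctx} tokens: FP16 {fp16 / GIB:.2f} GiB, 4-bit {int4 / GIB:.2f} GiB")
```

Because the formula is linear in sequence length and bytes per element, quadrupling the context quadruples the cache, and moving from FP16 to 4-bit cuts it by 75% before metadata.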

SLIDE: Selective Long-Context Decoding with Dynamic Eviction

SLIDE, introduced in late 2025 and refined in 2026 [4], represents one of the most successful eviction frameworks. Its core insight is that not all tokens contribute equally to future token prediction.

SLIDE's eviction strategy has three components:

Attention-weighted scoring: Each KV pair is scored by the sum of attention weights it receives from future queries (estimated via a lightweight proxy network).
Temporal decay: Older tokens receive exponentially decaying scores to prevent retention of outdated information.
Structured eviction groups: Tokens are evicted in groups (e.g., sentence or paragraph boundaries) to preserve coherence.
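The three components above can be sketched in a few lines. This is an illustration under stated assumptions, not SLIDE's actual implementation: the per-token attention estimates are taken as given inputs (the proxy network is not modeled), and the decay rate, the mean-per-group aggregation, and the function names are all hypothetical.

```python
import math
from typing import Dict, List

def group_scores(attn_mass: List[float], ages: List[int],
                 group_ids: List[int], decay: float = 0.01) -> Dict[int, float]:
    """Mean time-decayed importance score for each token group."""
    totals: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for mass, age, gid in zip(attn_mass, ages, group_ids):
        score = mass * math.exp(-decay * age)   # older tokens decay exponentially
        totals[gid] = totals.get(gid, 0.0) + score
        counts[gid] = counts.get(gid, 0) + 1
    return {gid: totals[gid] / counts[gid] for gid in totals}

def select_groups_to_evict(scores: Dict[int, float], group_sizes: Dict[int, int],
                           tokens_to_free: int) -> List[int]:
    """Evict whole groups, lowest score first, until enough tokens are freed."""
    evicted: List[int] = []
    freed = 0
    for gid in sorted(scores, key=scores.get):
        if freed >= tokens_to_free:
            break
        evicted.append(gid)
        freed += group_sizes[gid]
    return evicted

# Usage: six tokens in three two-token groups (e.g. three short sentences).
attn = [0.5, 0.4, 0.01, 0.02, 0.3, 0.3]   # estimated attention each token receives
ages = [5, 4, 3, 2, 1, 0]                 # decoding steps since each token was added
gids = [0, 0, 1, 1, 2, 2]
scores = group_scores(attn, ages, gids)
print(select_groups_to_evict(scores, {0: 2, 1: 2, 2: 2}, tokens_to_free=2))
```

Evicting at group granularity means the policy may free slightly more tokens than requested, which is the price of keeping sentence or paragraph units intact.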

Implementation details:

SLIDE operates at the token group level, not per-token, to minimize fragmentation
A lightweight transformer (about 5% of the main model's size) estimates the attention scores used for eviction scoring
Eviction frequency is configurable—every 10-50 tokens for real-time apps, every 200+ for batch processing

Benchmark results from the original paper:

Memory reduction: 65% for long-context (16K+) workloads
Quality preservation: 92-94% of baseline BLEU/ROUGE scores on long-document summarization
Throughput: 2.3x improvement due to reduced memory pressure and faster attention

RL-based KVP (KV Policy)

A more recent approach from researchers at UC Berkeley and Meta [4] frames eviction as a reinforcement learning problem. The agent (the eviction policy) observes the current KV cache state and decides which tokens to evict or compress.

Key innovations in 2026's RL-KVP:

Reward shaping: Combines downstream task quality (e.g., answer accuracy), memory usage, and inference latency into a multi-objective reward
Meta-learning initialization: Policies are pre-trained across multiple LLM architectures, enabling rapid adaptation to new models
Hybrid policy: Combines learned policies with heuristic baselines (e.g., LRU) for robustness
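As a rough illustration of the reward-shaping idea, the sketch below combines the three signals into a single scalar. The weights, the latency budget, the linear form, and every name here are assumptions made for illustration; the source does not specify the actual reward used by RL-KVP.

```python
from dataclasses import dataclass

@dataclass
class StepOutcome:
    task_quality: float      # downstream quality in [0, 1], e.g. answer accuracy
    memory_fraction: float   # share of the KV budget in use, in [0, 1]
    latency_ms: float        # observed per-token latency

def eviction_reward(outcome: StepOutcome,
                    latency_budget_ms: float = 50.0,
                    w_quality: float = 1.0,
                    w_memory: float = 0.3,
                    w_latency: float = 0.2) -> float:
    """Hypothetical multi-objective reward: reward quality, penalize memory
    use, and penalize latency only when it exceeds the budget."""
    latency_penalty = max(0.0, outcome.latency_ms / latency_budget_ms - 1.0)
    return (w_quality * outcome.task_quality
            - w_memory * outcome.memory_fraction
            - w_latency * latency_penalty)

# Usage: same quality and latency, but the second policy keeps less in cache.
keeps_more = StepOutcome(task_quality=0.9, memory_fraction=0.8, latency_ms=40.0)
keeps_less = StepOutcome(task_quality=0.9, memory_fraction=0.4, latency_ms=40.0)
print(eviction_reward(keeps_more), eviction_reward(keeps_less))
```

Tuning the weights shifts the policy along the quality-versus-memory tradeoff, which is one plausible reading of the "tunable via reward" note in the summary table.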

Production deployment notes:

RL-KVP requires ~10% additional compute for policy inference, but this cost is offset by the memory savings
Best suited for high-throughput, low-latency applications where consistent quality matters
Benchmarks show 1.9x throughput improvement on Llama-3-70B with <2% quality loss

Attention-Weighted Eviction (AWE)

AWE is a simpler, training-free alternative to SLIDE. Instead of training an additional network, AWE uses the existing attention weights from the main LLM to estimate token importance.

The algorithm:

For each new query, compute attention weights to all past keys
Sum the attention weights for each key across recent queries (e.g., last 10)
Tokens with lowest cumulative attention are candidates for eviction
Evict in structured groups (sentence/paragraph) to avoid fragmentation
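The four steps above can be sketched as a small bookkeeping class. This is a minimal illustration, assuming attention weights are already available per query as a mapping from token index to weight (for example, averaged over heads and layers); the class and method names are hypothetical, and a real integration would read these weights inside the attention kernel.

```python
from collections import deque
from typing import Deque, Dict, List

class AttentionWeightedEviction:
    """Tracks attention received by each cached token over recent queries."""

    def __init__(self, window: int = 10) -> None:
        # Only the last `window` queries contribute to the cumulative score.
        self.recent: Deque[Dict[int, float]] = deque(maxlen=window)

    def observe(self, attn_weights: Dict[int, float]) -> None:
        """Record one query's attention weights (token index -> weight)."""
        self.recent.append(dict(attn_weights))

    def cumulative(self) -> Dict[int, float]:
        """Sum each token's attention across the recent-query window."""
        totals: Dict[int, float] = {}
        for weights in self.recent:
            for token, weight in weights.items():
                totals[token] = totals.get(token, 0.0) + weight
        return totals

    def eviction_candidates(self, groups: Dict[str, List[int]],
                            num_groups: int) -> List[str]:
        """Return the groups with the lowest mean cumulative attention."""
        totals = self.cumulative()

        def mean_score(tokens: List[int]) -> float:
            return sum(totals.get(t, 0.0) for t in tokens) / max(len(tokens), 1)

        ranked = sorted(groups, key=lambda name: mean_score(groups[name]))
        return ranked[:num_groups]

# Usage: three one-token "sentences" and a window of two queries.
awe = AttentionWeightedEviction(window=2)
awe.observe({0: 0.9, 1: 0.05, 2: 0.05})   # falls out of the window below
awe.observe({0: 0.1, 1: 0.1, 2: 0.8})
awe.observe({0: 0.2, 1: 0.1, 2: 0.7})
print(awe.eviction_candidates({"s0": [0], "s1": [1], "s2": [2]}, num_groups=1))
```

Keeping only a bounded window makes the score track recent relevance, so a token that mattered early in the conversation but is no longer attended to eventually becomes an eviction candidate.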

Advantages:

No additional model overhead (uses existing attention mechanism)
Easy to implement as a preprocessing layer in existing inference stacks
Performance close to SLIDE's with 50% less implementation complexity

2026 benchmarks show AWE achieves 60-65% memory reduction on long-context workloads with quality degradation comparable to SLIDE.

Eviction Tradeoffs Summary

Policy	Memory Reduction	Throughput Gain	Quality Impact	Implementation Complexity
LRU (baseline)	40%	1.5x	Minimal overhead	Low
SLIDE	65%	2.3x	~6-8% quality loss on edge cases	Medium
RL-KVP	70%	1.9x	<5% quality loss (tunable via reward)	High
AWE	60-65%	2.1x	<3% quality loss	Low

Paging: PagedAttention and Beyond

While quantization and eviction optimize what gets stored, paging optimizes how it's stored—moving away from contiguous memory allocation.

The Problem with Contiguous KV Cache

Traditional attention implementations allocate a single contiguous block of memory for the KV cache. This creates two problems:

Fragmentation: When sequences finish and their memory is freed, the resulting free blocks may be too small to accommodate new requests—even if total free memory is sufficient.
Over-allocation: Pre-allocating for worst-case context length leads to wasted memory on typical workloads.

For example, a 16K-context allocation may only use 4K tokens on average, wasting 75% of allocated memory.

PagedAttention: Non-Contiguous KV Cache

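PagedAttention, introduced by vLLM in 2023 and refined through 2026, solves this by splitting the KV cache into fixed-size physical blocks. Each sequence sees a contiguous range of logical blocks, which a block table maps to physical blocks that need not be contiguous in GPU memory, much like virtual memory paging in an operating system.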

Key features:

Physical blocks: KV cache is divided into fixed-size blocks (e.g., 16 or 32 tokens each)
Block table: A mapping structure translates logical token positions to physical block addresses
Shared blocks: Multiple sequences can share the same physical block for common prefixes (prefix caching)
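The relationship between physical blocks, the block table, and shared prefixes can be sketched with a toy allocator. This is an illustration of the data structure only, not vLLM's implementation: the class names are hypothetical, sharing is limited to full prefix blocks, and copy-on-write for shared blocks is omitted.

```python
from typing import Dict, List, Optional, Tuple

class BlockAllocator:
    """Pool of fixed-size physical KV blocks with reference counting."""

    def __init__(self, num_blocks: int) -> None:
        self.free: List[int] = list(range(num_blocks))
        self.refcount: Dict[int, int] = {}

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        self.refcount[block] += 1          # another sequence now uses this block
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:      # last user gone: return to the pool
            del self.refcount[block]
            self.free.append(block)

class Sequence:
    """One request's block table: logical block index -> physical block id."""

    def __init__(self, allocator: BlockAllocator, block_size: int = 16,
                 shared_prefix: Optional[List[int]] = None) -> None:
        self.allocator = allocator
        self.block_size = block_size
        # Assumes the shared prefix consists of completely filled blocks.
        self.block_table = [allocator.share(b) for b in (shared_prefix or [])]
        self.num_tokens = len(self.block_table) * block_size

    def append_token(self) -> None:
        if self.num_tokens % self.block_size == 0:   # last block is full
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def locate(self, position: int) -> Tuple[int, int]:
        """Translate a logical token position to (physical block, offset)."""
        return self.block_table[position // self.block_size], position % self.block_size

    def free(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []
        self.num_tokens = 0

# Usage: two sequences sharing an 8-token prefix with 4-token blocks.
pool = BlockAllocator(num_blocks=8)
first = Sequence(pool, block_size=4)
for _ in range(8):
    first.append_token()
second = Sequence(pool, block_size=4, shared_prefix=first.block_table)
second.append_token()                      # diverges into its own new block
print(first.block_table, second.block_table, len(pool.free))
```

Reference counting is what makes prefix sharing safe here: a shared block returns to the free pool only after every sequence that maps it has finished.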

Benefits:

No external fragmentation: Any free block can be reused by any sequence
Efficient prefix caching: Common prefixes are stored once, reducing memory and computation
Dynamic sizing: Each sequence grows one block at a time, so waste is limited to its last, partially filled block
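The memory effect of block-granular allocation is easy to quantify. The sketch below compares the wasted fraction under worst-case contiguous pre-allocation with the waste under paged allocation; the sequence lengths are made-up values chosen to average 4K tokens, matching the earlier 16K-allocation example, and the function names are illustrative.

```python
import math
from typing import List

def contiguous_waste(seq_lens: List[int], max_len: int) -> float:
    """Wasted fraction when every sequence reserves max_len token slots."""
    allocated = len(seq_lens) * max_len
    return 1 - sum(seq_lens) / allocated

def paged_waste(seq_lens: List[int], block_size: int = 16) -> float:
    """Wasted fraction when each sequence holds only whole blocks it has touched."""
    allocated = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return 1 - sum(seq_lens) / allocated

# Four hypothetical requests averaging 4,096 tokens.
lengths = [1000, 4000, 7000, 4384]
print(f"contiguous (16K reserved): {contiguous_waste(lengths, 16384):.1%}")
print(f"paged (16-token blocks):   {paged_waste(lengths, 16):.2%}")
```

With contiguous 16K reservations the waste is 75%, while paged allocation wastes at most one partially filled block per sequence, well under 1% for these lengths.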

2026 benchmarks:

Memory savings: 35-45% on mixed-context workloads
Throughput gain: 2.1x for batched inference with variable sequence lengths
Scalability: Enables scheduling of 3.7x more concurrent requests on H100 clusters

Continuous Batching Integration

PagedAttention works synergistically with continuous batching (also called in-flight batching or iteration-level scheduling). When combined:

PagedAttention manages memory efficiently for existing requests
Continuous batching adds/removes sequences from the batch dynamically
The block table allows seamless reallocation of freed blocks to new sequences

This combination has become the foundation for high-throughput LLM serving in 2026, with major cloud providers reporting:

4x higher GPU utilization for LLM inference
50% lower cost per token for high-concurrency workloads
Sub-linear latency scaling with concurrent request count
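How freed blocks flow to waiting requests can be shown with a toy scheduler loop. This is a simplified sketch under several assumptions: output lengths are known in advance, each request reserves all of its blocks at admission, and every running sequence produces one token per step. Real engines allocate blocks on demand and preempt sequences when memory runs out; the function name and request format are hypothetical.

```python
import math
from collections import deque
from typing import Deque, Dict, List, Tuple

Request = Tuple[str, int, int]   # (name, prompt_len, gen_len)

def run_continuous_batching(requests: List[Request], total_blocks: int,
                            block_size: int = 16) -> List[Tuple[int, str]]:
    """Simulate iteration-level scheduling; returns (finish_step, name) pairs."""
    waiting: Deque[Request] = deque(requests)
    running: Dict[str, List[int]] = {}    # name -> [tokens_left, blocks_held]
    free_blocks = total_blocks
    finished: List[Tuple[int, str]] = []
    step = 0
    while waiting or running:
        # Admit waiting requests, in order, while their blocks fit.
        while waiting:
            name, prompt_len, gen_len = waiting[0]
            need = math.ceil((prompt_len + gen_len) / block_size)
            if need > total_blocks:
                raise MemoryError(f"request {name!r} can never fit")
            if need > free_blocks:
                break
            waiting.popleft()
            free_blocks -= need
            running[name] = [gen_len, need]
        # One decoding iteration: every running sequence emits one token.
        step += 1
        for name in list(running):
            state = running[name]
            state[0] -= 1
            if state[0] <= 0:             # sequence done: free its blocks now
                free_blocks += state[1]
                finished.append((step, name))
                del running[name]
    return finished

# Usage: "c" cannot start until "b" finishes and returns its two blocks.
jobs = [("a", 30, 4), ("b", 30, 2), ("c", 14, 3)]
print(run_continuous_batching(jobs, total_blocks=5, block_size=16))
```

In this run "c" is admitted at the very next iteration after "b" completes, without waiting for "a", which is the behavior that distinguishes continuous batching from static batches that drain completely before new work starts.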

Comparison to Traditional Approaches

Approach	Memory Efficiency	Throughput	Latency	Implementation Complexity
Contiguous KV	Low (baseline)	Baseline	Low	Low
PagedAttention	High (+35-45%)	2.1x+	+1-2ms (block lookup)	Medium

Practical Deployment Guide for 2026

With the theoretical foundations established, let's turn to practical guidance.

Choosing the Right Strategy

Scenario 1: High-throughput, low-latency chatbots (e.g., customer support)

Primary constraint: Latency <50ms per response
Recommended: TurboQuant (mixed 3-6 bit) + PagedAttention
Reason: TurboQuant's fused kernels minimize overhead; PagedAttention enables efficient batching

Scenario 2: Long-document analysis (e.g., legal, research)

Primary constraint: Memory budget per GPU
Recommended: AWQ (4-bit) + SLIDE eviction at paragraph boundaries
Reason: Maximum compression with controlled quality loss; structured eviction preserves document structure

Scenario 3: Mixed workload (short and long contexts)

Primary constraint: Overall cost efficiency
Recommended: PagedAttention + AWE eviction (simple, no additional models)
Reason: Efficient memory reuse across diverse workloads

Recommended Configuration for Popular Models

Model	Quantization	Eviction	Paging	Expected Memory Use (4K ctx)	Throughput (H100)
Llama-3-8B	TurboQuant (6-bit early, 4-bit late)	None needed	Enabled	~3GB	120+ tok/s
Llama-3-70B	AWQ (4-bit)	SLIDE (paragraph groups)	Enabled	~25GB	30-40 tok/s
Mistral-7B	AWQ (4-bit)	AWE (token groups)	Enabled	~5GB	80+ tok/s

Tuning Parameters

Quantization:

Start with uniform 4-bit AWQ; if quality drops, move to per-layer bit allocation and raise the bit width on the most sensitive layers first
Monitor perplexity on a validation set for your domain

Eviction:

Begin with aggressive eviction (e.g., SLIDE paragraph groups every 10 tokens)
Relax eviction frequency if response quality degrades
Track eviction ratio (tokens evicted / tokens processed) as a diagnostic metric

Paging:

Block size: 16 tokens for latency-sensitive apps, 32 for throughput
Enable prefix caching if your workload has common prefixes (e.g., templates)

Future Directions: What's Next After 2026?

While the techniques surveyed here represent the state of the art in 2026, research continues on:

Hybrid compression: Combining quantization with structured pruning of attention heads
Adaptive bit-width per token: Using token difficulty predictors to assign variable precision
Hardware-specific compression: Custom ASIC/GPU kernels for future inference chips
KV cache streaming: Moving older KV pairs to CPU/NVMe when GPU memory is exhausted

The ultimate goal remains unchanged: enabling LLM inference at the scale and cost of traditional web services, with latency comparable to human response times.

References

[1] ACM TOPLAS 2024. "Efficient KV Cache Management for Long-Context LLM Inference." https://dl.acm.org/doi/10.1145/3778534.3778567
[2] arXiv 2026. "TurboQuant: Layer-Adaptive Quantization for KV Cache Compression." https://arxiv.org/html/2603.20397
[3] vLLM Blog, May 2026. "TurboQuant: 5x Throughput for Long-Context LLMs." https://vllm.ai/blog/2026-05-11-turboquant
[4] arXiv 2026. "SLIDE: Selective Long-Context Decoding with Dynamic Eviction." https://arxiv.org/html/2602.10238
[5] arXiv 2023. "GPTQ: Accurate Post-Training Quantization." https://arxiv.org/abs/2309.06180
[6] Build Fast with AI. "KV Cache in LLMs: A Comprehensive Guide." https://www.buildfastwithai.com/blogs/kv-cache-llms-explained

This article was written by Taylor Kim as part of the SpendLens AI Infrastructure series. Published June 2026.
