Problem: Your LLM Inference Bill Keeps Growing
You're running LLMs in production — maybe on vLLM, TGI, or a managed API — and the cloud bill is climbing faster than your user count. Memory is the bottleneck, and the KV cache is almost always the reason.
You'll learn:
- What the KV cache actually does and why it dominates memory
- How to right-size and tune it for your specific workload
- How prefix caching eliminates redundant computation for common prompts
Time: 25 min | Level: Intermediate
Why This Happens
Every token a transformer generates requires attention over all previous tokens. The KV (key-value) cache stores each layer's computed key and value projections for those tokens so the model doesn't recompute them from scratch on every forward pass. Without it, inference would be unusably slow.
The problem: KV cache size grows as 2 × sequence_length × num_layers × num_kv_heads × head_dim × bytes_per_element (the factor of 2 covers keys and values) — and quantizing weights to 4-bit doesn't shrink it, because the cache itself is typically kept in FP16. For a 70B model, a single 8,192-token context can consume 4–8 GB of GPU memory. With 50 concurrent users, that's your entire A100 gone.
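To make the scaling concrete, here's a back-of-the-envelope calculator. The dimensions below (32 layers, 8 KV heads via grouped-query attention, head_dim 128) match a Llama-3-8B-class model, but check your own model's config.json before trusting the numbers:

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    # 2x for keys AND values; dtype_bytes=2 because the cache is
    # usually FP16/BF16 even when the weights are quantized
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Illustrative dims for a Llama-3-8B-class model with GQA
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)               # 131072 bytes = 128 KB per token
print(per_token * 8192 / 1e9)  # ~1.07 GB for one full 8,192-token context
```

Models without grouped-query attention (num_kv_heads equal to the full head count) pay far more per token, which is why older architectures hit the memory wall sooner.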
Common symptoms:
- OOM errors under moderate load
- GPU memory sitting at 95%+ even with short prompts
- High cost-per-token despite low average response length
- Slow time-to-first-token during traffic spikes
Solution
Step 1: Measure Your Actual KV Cache Usage
Before tuning, understand your workload. Run this against your vLLM server to get a baseline:
import requests
import statistics
import time

def sample_kv_stats(base_url: str, n_samples: int = 100,
                    interval_s: float = 1.0) -> dict:
    """
    Poll the vLLM Prometheus metrics endpoint to characterize KV cache pressure.
    Run during representative traffic, not peak load.
    """
    usage_samples = []
    for _ in range(n_samples):
        resp = requests.get(f"{base_url}/metrics")
        for line in resp.text.splitlines():
            # vLLM exposes Prometheus metrics at /metrics
            if "vllm:gpu_cache_usage_perc" in line and not line.startswith("#"):
                usage_samples.append(float(line.split()[-1]))
                break
        time.sleep(interval_s)  # space samples out; back-to-back polls see the same state
    return {
        "mean_usage": statistics.mean(usage_samples),
        "p95_usage": sorted(usage_samples)[int(0.95 * len(usage_samples))],
        "max_usage": max(usage_samples),
    }

stats = sample_kv_stats("http://localhost:8000")
print(stats)
# e.g. {'mean_usage': 0.43, 'p95_usage': 0.81, 'max_usage': 0.97}
Expected: You want P95 usage below 80%. If it's higher, vLLM is preempting sequences and evicting cached blocks under normal load.
If it fails:
- Connection refused: vLLM isn't running, or the metrics endpoint is disabled — add --enable-metrics to your startup flags
- No matching line: check your vLLM version; metric names changed in v0.4+
Step 2: Right-Size the Cache Allocation
vLLM lets you control what fraction of GPU memory the KV cache gets:
# Start vLLM with an explicit cache memory budget.
# --gpu-memory-utilization: fraction of GPU RAM for model weights + KV cache
# --max-model-len: cap context length — the biggest lever you have
# --max-num-seqs: concurrent sequences before requests queue
# --block-size: tokens per KV block (16 or 32 work well)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096 \
  --max-num-seqs 64 \
  --block-size 16
The key insight here is --max-model-len. Halving max context length roughly doubles the number of concurrent requests you can serve on the same hardware. If your P99 actual prompt length is 1,200 tokens, there's no reason to allocate for 8,192.
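The arithmetic behind that claim, as a sketch. The 40 GB cache budget and the 128 KB/token figure (a Llama-3-8B-class model with GQA) are illustrative; substitute your own numbers:

```python
def max_concurrent_seqs(cache_bytes: float, max_model_len: int,
                        bytes_per_token: int) -> int:
    """Worst case: every sequence uses the full context window."""
    return int(cache_bytes // (max_model_len * bytes_per_token))

CACHE = 40e9        # 40 GB of GPU memory left over for KV cache
PER_TOKEN = 131072  # ~128 KB/token, illustrative

print(max_concurrent_seqs(CACHE, 8192, PER_TOKEN))  # 37 sequences
print(max_concurrent_seqs(CACHE, 4096, PER_TOKEN))  # 74 — roughly double
```

Real capacity is better than this worst case because vLLM's paged allocation only reserves blocks sequences actually use, but the worst case is what determines when requests start queuing.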
# Analyze your real prompt lengths before setting max-model-len
import json
from pathlib import Path

def analyze_prompt_lengths(log_file: str) -> dict:
    """Read from your request logs, not assumptions."""
    lengths = []
    for line in Path(log_file).read_text().splitlines():
        entry = json.loads(line)
        lengths.append(entry["prompt_tokens"])
    lengths.sort()
    return {
        "p50": lengths[len(lengths) // 2],
        "p95": lengths[int(0.95 * len(lengths))],
        "p99": lengths[int(0.99 * len(lengths))],
        "max": lengths[-1],
    }
Set --max-model-len to your P99 + 20% buffer, not the model's architectural maximum.
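One subtlety worth encoding: max-model-len bounds the prompt plus the generated tokens together, so budget for the completion as well. A minimal sketch — the helper name and defaults are mine, not vLLM's:

```python
def recommend_max_model_len(p99_prompt_tokens: int,
                            max_output_tokens: int = 256,
                            buffer: float = 0.20) -> int:
    # max-model-len covers prompt + completion, so add the output
    # budget before applying the safety buffer
    return int(round((p99_prompt_tokens + max_output_tokens) * (1 + buffer)))

print(recommend_max_model_len(1200, max_output_tokens=200))  # 1680
```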
Expected: After adjusting, P95 cache usage should drop and throughput (tokens/sec) should increase measurably.
Step 3: Enable Prefix Caching
If your requests share a common system prompt or few-shot examples, prefix caching is the highest-ROI optimization available. It stores the KV state for repeated prefixes so subsequent requests skip recomputing them entirely.
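Under the hood, vLLM identifies reusable prefixes by hashing fixed-size token blocks, with each block's hash chained to the one before it. A toy sketch of the idea (not vLLM's actual implementation) shows why reuse stops at the first block that differs:

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block, matching --block-size

def block_hashes(token_ids: list[int]) -> list[bytes]:
    hashes, prev = [], b""
    n_full = len(token_ids) // BLOCK_SIZE * BLOCK_SIZE  # only full blocks are cacheable
    for i in range(0, n_full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        # Chained hash: a block's identity includes everything before it
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes

a = list(range(48))               # 3 full blocks
b = list(range(32)) + [99] * 16   # same first 2 blocks, then diverges
shared = sum(x == y for x, y in zip(block_hashes(a), block_hashes(b)))
print(shared)  # 2 — only the identical leading blocks are reusable
```

Two consequences follow directly: a single changed token invalidates every block after it, and partial blocks at the tail are never cached.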
# Enable automatic prefix caching in vLLM
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-8B-Instruct \
--enable-prefix-caching \
--gpu-memory-utilization 0.90
On the client side, structure your prompts to maximize prefix reuse:
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Put stable content FIRST — the cache key is prefix-based.
# Shared system prompt → cached after the first request.
SYSTEM_PROMPT = """You are a helpful customer support assistant for Acme Corp.
Company policy: Returns accepted within 30 days with receipt.
Tone: Professional but friendly. Keep responses under 150 words."""

def query_llm(user_message: str) -> str:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3-8B-Instruct",
        messages=[
            # System prompt is identical across all requests → prefix cache hit
            {"role": "system", "content": SYSTEM_PROMPT},
            # Only the user message varies
            {"role": "user", "content": user_message},
        ],
        max_tokens=200,
    )
    return response.choices[0].message.content
What changes your cache hit rate:
- System prompt always identical? High hit rate.
- Few-shot examples in the prompt? Put them in the system message, not the user turn.
- Multi-turn conversations? Later turns benefit less — only the shared prefix is cached, not the dialogue history.
If your hit rate is low:
- Check vllm:cpu_prefix_cache_hit_rate in the metrics endpoint
- Ensure the shared portion of your prompt is byte-for-byte identical — even a timestamp difference kills the cache
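The timestamp point deserves emphasis, since it's a common silent cache-killer. A quick illustration (the prompt text is made up):

```python
from datetime import date

# BAD: volatile content at the FRONT changes the very first token block,
# so no later block can ever match the cache
system_bad = f"Today is {date.today()}. You are a support assistant for Acme."

# GOOD: keep the shared prefix byte-stable and push volatile data toward
# the end of the prompt, e.g. into the user turn
system_good = "You are a support assistant for Acme."
user_msg = f"(today: {date.today()}) How do I return an item?"

print(system_good)  # identical bytes on every request → cacheable prefix
```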
Step 4: Choose the Right Eviction Policy for Your Traffic Pattern
When KV cache fills up, vLLM must evict blocks from lower-priority sequences. The default is LRU (least recently used), which works well for most cases. But if you have mixed workloads — short interactive requests plus long batch jobs — you may want to tune sequence prioritization:
# vLLM EngineArgs for custom scheduling (Python API, not CLI)
from vllm.engine.arg_utils import EngineArgs

engine_args = EngineArgs(
    model="meta-llama/Llama-3-8B-Instruct",
    gpu_memory_utilization=0.90,
    enable_prefix_caching=True,
    # Scheduler policy: "fcfs" (default) or "priority"
    # Use "priority" only if you need SLA tiers
    scheduler_policy="fcfs",
    # Preemption mode: "recompute" (cheaper on memory) vs "swap" (faster recovery)
    # Use "recompute" if you have fast GPUs; "swap" if you have NVMe SSDs
    preemption_mode="recompute",
)
For most production deployments: leave scheduler_policy as fcfs and focus on prefix caching and right-sizing before tuning eviction.
Verification
Deploy your changes and watch these metrics for 30 minutes under normal load:
# Prometheus query to check cache efficiency
# Run against your metrics endpoint
curl -s http://localhost:8000/metrics | grep -E "kv_cache|prefix_cache|preemption"
You should see:
- vllm:gpu_cache_usage_perc — stable, not yo-yoing (yo-yoing = eviction thrashing)
- vllm:cpu_prefix_cache_hit_rate — above 0.5 for shared-prompt workloads
- vllm:num_preemptions_total — low and stable, not climbing steadily
Also run a quick load test to confirm throughput improved:
# Install: pip install locust
locust -f load_test.py --headless -u 20 -r 5 --run-time 2m \
--host http://localhost:8000
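If you'd rather not maintain a locustfile for a quick before/after check, a minimal stdlib harness works too. run_load_test and its parameters are my own sketch; point send_request at your server and have it return the completion token count from the response:

```python
import time
import statistics
from concurrent.futures import ThreadPoolExecutor

def run_load_test(send_request, n_users: int = 20,
                  requests_per_user: int = 5) -> dict:
    latencies, token_counts = [], []  # list.append is thread-safe in CPython

    def worker():
        for _ in range(requests_per_user):
            t0 = time.perf_counter()
            tokens = send_request()  # should return completion token count
            latencies.append(time.perf_counter() - t0)
            token_counts.append(tokens)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_users) as pool:
        for _ in range(n_users):
            pool.submit(worker)
    elapsed = time.perf_counter() - start

    return {
        "tokens_per_sec": sum(token_counts) / elapsed,
        "p50_latency_s": statistics.median(latencies),
        "total_requests": len(latencies),
    }
```

Run it once before and once after a config change, with the same prompt mix, and compare tokens_per_sec.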
Expected throughput gain: 20–40% more tokens/sec with prefix caching on shared-prompt workloads; 10–20% from right-sizing max-model-len alone.
What You Learned
- KV cache size scales with sequence length, not prompt complexity — cap max-model-len to your real P99
- Prefix caching is only effective when the shared prefix is identical across requests — structure prompts accordingly
- Preemption thrashing (cache evictions causing recomputation) is usually a sign of under-allocated cache or too-high max-num-seqs, not a scheduling problem
Limitation: Prefix caching doesn't help for highly variable or user-specific prompts with no shared prefix. In those cases, focus entirely on right-sizing and consider quantization (AWQ, GPTQ) to free up headroom.
When NOT to use aggressive cache sizing: If you're running fine-tuned models with LoRA adapters, the adapter weights also consume GPU memory — budget for them before maximizing KV cache allocation.
Tested on vLLM 0.6.x, CUDA 12.4, A100 80GB and L40S. Settings are model-agnostic but numbers will vary by model architecture.