Problem: Your LLM Inference Bill Keeps Growing
You're running LLMs in production — maybe on vLLM, TGI, or a managed API — and the cloud bill is climbing faster than your user count. Memory is the bottleneck, and the KV cache is almost always the reason.
You'll learn:
- What the KV cache actually does and why it dominates memory
- How to right-size and tune it for your specific workload
- How prefix caching eliminates redundant computation for common prompts
Time: 25 min | Level: Intermediate
Why This Happens
Every token a transformer generates requires attention over all previous tokens. The KV (key-value) cache stores each layer's computed key and value projections for those tokens so the model doesn't recompute them from scratch on every forward pass. Without it, inference would be unusably slow.
The problem: KV cache size grows as 2 × sequence_length × num_layers × num_kv_heads × head_dim × bytes_per_element (the factor of 2 covers keys and values) — and quantizing weights to 4-bit doesn't shrink it, because the cache itself is typically kept in FP16. For a 70B model, a single 8,192-token context can consume 4–8 GB of GPU memory. With 50 concurrent users, that's your entire A100 gone.
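To make the scaling concrete, here's a back-of-the-envelope calculator. The dimensions below (32 layers, 8 KV heads via grouped-query attention, head_dim 128) match a Llama-3-8B-class model, but check your own model's config.json before trusting the numbers:

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    # 2x for keys AND values; dtype_bytes=2 because the cache is
    # usually FP16/BF16 even when the weights are quantized
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Illustrative dims for a Llama-3-8B-class model with GQA
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)               # 131072 bytes = 128 KB per token
print(per_token * 8192 / 1e9)  # ~1.07 GB for one full 8,192-token context
```

Models without grouped-query attention (num_kv_heads equal to the full head count) pay far more per token, which is why older architectures hit the memory wall sooner.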
Common symptoms:
- OOM errors under moderate load
- GPU memory sitting at 95%+ even with short prompts
- High cost-per-token despite low average response length
- Slow time-to-first-token during traffic spikes
Solution
Step 1: Measure Your Actual KV Cache Usage
Before tuning, understand your workload. Run this against your vLLM server to get a baseline:
import requests
import statistics
import time

def sample_kv_stats(base_url: str, n_samples: int = 100,
                    interval_s: float = 1.0) -> dict:
    """
    Poll the vLLM Prometheus metrics endpoint to characterize KV cache pressure.
    Run during representative traffic, not peak load.
    """
    usage_samples = []
    for _ in range(n_samples):
        resp = requests.get(f"{base_url}/metrics")
        for line in resp.text.splitlines():
            # vLLM exposes Prometheus metrics at /metrics
            if "vllm:gpu_cache_usage_perc" in line and not line.startswith("#"):
                usage_samples.append(float(line.split()[-1]))
                break
        time.sleep(interval_s)  # space samples out; back-to-back polls see the same state
    return {
        "mean_usage": statistics.mean(usage_samples),
        "p95_usage": sorted(usage_samples)[int(0.95 * len(usage_samples))],
        "max_usage": max(usage_samples),
    }

stats = sample_kv_stats("http://localhost:8000")
print(stats)
# e.g. {'mean_usage': 0.43, 'p95_usage': 0.81, 'max_usage': 0.97}
Expected: You want P95 usage below 80%. If it's higher, vLLM is preempting sequences and evicting cached blocks under normal load.
If it fails:
- Connection refused: vLLM isn't running, or the metrics endpoint is disabled — add --enable-metrics to your startup flags
- No matching line: check your vLLM version; metric names changed in v0.4+
Step 2: Right-Size the Cache Allocation
vLLM lets you control what fraction of GPU memory the KV cache gets:
# Start vLLM with an explicit cache memory budget.
# --gpu-memory-utilization: fraction of GPU RAM for model weights + KV cache
# --max-model-len: cap context length — the biggest lever you have
# --max-num-seqs: concurrent sequences before requests queue
# --block-size: tokens per KV block (16 or 32 work well)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096 \
  --max-num-seqs 64 \
  --block-size 16
The key insight here is --max-model-len. Halving max context length roughly doubles the number of concurrent requests you can serve on the same hardware. If your P99 actual prompt length is 1,200 tokens, there's no reason to allocate for 8,192.
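The arithmetic behind that claim, as a sketch. The 40 GB cache budget and the 128 KB/token figure (a Llama-3-8B-class model with GQA) are illustrative; substitute your own numbers:

```python
def max_concurrent_seqs(cache_bytes: float, max_model_len: int,
                        bytes_per_token: int) -> int:
    """Worst case: every sequence uses the full context window."""
    return int(cache_bytes // (max_model_len * bytes_per_token))

CACHE = 40e9        # 40 GB of GPU memory left over for KV cache
PER_TOKEN = 131072  # ~128 KB/token, illustrative

print(max_concurrent_seqs(CACHE, 8192, PER_TOKEN))  # 37 sequences
print(max_concurrent_seqs(CACHE, 4096, PER_TOKEN))  # 74 — roughly double
```

Real capacity is better than this worst case because vLLM's paged allocation only reserves blocks sequences actually use, but the worst case is what determines when requests start queuing.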
# Analyze your real prompt lengths before setting max-model-len
import json
from pathlib import Path

def analyze_prompt_lengths(log_file: str) -> dict:
    """Read from your request logs, not assumptions."""
    lengths = []
    for line in Path(log_file).read_text().splitlines():
        entry = json.loads(line)
        lengths.append(entry["prompt_tokens"])
    lengths.sort()
    return {
        "p50": lengths[len(lengths) // 2],
        "p95": lengths[int(0.95 * len(lengths))],
        "p99": lengths[int(0.99 * len(lengths))],
        "max": lengths[-1],
    }
Set --max-model-len to your P99 + 20% buffer, not the model's architectural maximum.
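One subtlety worth encoding: max-model-len bounds the prompt plus the generated tokens together, so budget for the completion as well. A minimal sketch — the helper name and defaults are mine, not vLLM's:

```python
def recommend_max_model_len(p99_prompt_tokens: int,
                            max_output_tokens: int = 256,
                            buffer: float = 0.20) -> int:
    # max-model-len covers prompt + completion, so add the output
    # budget before applying the safety buffer
    return int(round((p99_prompt_tokens + max_output_tokens) * (1 + buffer)))

print(recommend_max_model_len(1200, max_output_tokens=200))  # 1680
```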
Expected: After adjusting, P95 cache usage should drop and throughput (tokens/sec) should increase measurably.
Step 3: Enable Prefix Caching
If your requests share a common system prompt or few-shot examples, prefix caching is the highest-ROI optimization available. It stores the KV state for repeated prefixes so subsequent requests skip recomputing them entirely.
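Under the hood, vLLM identifies reusable prefixes by hashing fixed-size token blocks, with each block's hash chained to the one before it. A toy sketch of the idea (not vLLM's actual implementation) shows why reuse stops at the first block that differs:

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block, matching --block-size

def block_hashes(token_ids: list[int]) -> list[bytes]:
    hashes, prev = [], b""
    n_full = len(token_ids) // BLOCK_SIZE * BLOCK_SIZE  # only full blocks are cacheable
    for i in range(0, n_full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        # Chained hash: a block's identity includes everything before it
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes

a = list(range(48))               # 3 full blocks
b = list(range(32)) + [99] * 16   # same first 2 blocks, then diverges
shared = sum(x == y for x, y in zip(block_hashes(a), block_hashes(b)))
print(shared)  # 2 — only the identical leading blocks are reusable
```

Two consequences follow directly: a single changed token invalidates every block after it, and partial blocks at the tail are never cached.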
# Enable automatic prefix caching in vLLM
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-8B-Instruct \
--enable-prefix-caching \
--gpu-memory-utilization 0.90
On the client side, structure your prompts to maximize prefix reuse:
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Put stable content FIRST — the cache key is prefix-based.
# Shared system prompt → cached after the first request.
SYSTEM_PROMPT = """You are a helpful customer support assistant for Acme Corp.
Company policy: Returns accepted within 30 days with receipt.
Tone: Professional but friendly. Keep responses under 150 words."""

def query_llm(user_message: str) -> str:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3-8B-Instruct",
        messages=[
            # System prompt is identical across all requests → prefix cache hit
            {"role": "system", "content": SYSTEM_PROMPT},
            # Only the user message varies
            {"role": "user", "content": user_message},
        ],
        max_tokens=200,
    )
    return response.choices[0].message.content
What changes your cache hit rate:
- System prompt always identical? High hit rate.
- Few-shot examples in the prompt? Put them in the system message, not the user turn.
- Multi-turn conversations? Later turns benefit less — only the shared prefix is cached, not the dialogue history.
If your hit rate is low:
- Check vllm:cpu_prefix_cache_hit_rate in the metrics endpoint
- Ensure the shared portion of your prompt is byte-for-byte identical — even a timestamp difference kills the cache
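The timestamp point deserves emphasis, since it's a common silent cache-killer. A quick illustration (the prompt text is made up):

```python
from datetime import date

# BAD: volatile content at the FRONT changes the very first token block,
# so no later block can ever match the cache
system_bad = f"Today is {date.today()}. You are a support assistant for Acme."

# GOOD: keep the shared prefix byte-stable and push volatile data toward
# the end of the prompt, e.g. into the user turn
system_good = "You are a support assistant for Acme."
user_msg = f"(today: {date.today()}) How do I return an item?"

print(system_good)  # identical bytes on every request → cacheable prefix
```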
Step 4: Choose the Right Eviction Policy for Your Traffic Pattern
When KV cache fills up, vLLM must evict blocks from lower-priority sequences. The default is LRU (least recently used), which works well for most cases. But if you have mixed workloads — short interactive requests plus long batch jobs — you may want to tune sequence prioritization:
# vLLM EngineArgs for custom scheduling (Python API, not CLI)
from vllm.engine.arg_utils import EngineArgs

engine_args = EngineArgs(
    model="meta-llama/Llama-3-8B-Instruct",
    gpu_memory_utilization=0.90,
    enable_prefix_caching=True,
    # Scheduler policy: "fcfs" (default) or "priority"
    # Use "priority" only if you need SLA tiers
    scheduler_policy="fcfs",
    # Preemption mode: "recompute" (cheaper on memory) vs "swap" (faster recovery)
    # Use "recompute" if you have fast GPUs; "swap" if you have NVMe SSDs
    preemption_mode="recompute",
)
For most production deployments: leave scheduler_policy as fcfs and focus on prefix caching and right-sizing before tuning eviction.
Verification
Deploy your changes and watch these metrics for 30 minutes under normal load:
# Prometheus query to check cache efficiency
# Run against your metrics endpoint
curl -s http://localhost:8000/metrics | grep -E "kv_cache|prefix_cache|preemption"
You should see:
- vllm:gpu_cache_usage_perc — stable, not yo-yoing (yo-yoing = eviction thrashing)
- vllm:cpu_prefix_cache_hit_rate — above 0.5 for shared-prompt workloads
- vllm:num_preemptions_total — low and stable, not climbing steadily
Also run a quick load test to confirm throughput improved:
# Install: pip install locust
locust -f load_test.py --headless -u 20 -r 5 --run-time 2m \
--host http://localhost:8000
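If you'd rather not maintain a locustfile for a quick before/after check, a minimal stdlib harness works too. run_load_test and its parameters are my own sketch; point send_request at your server and have it return the completion token count from the response:

```python
import time
import statistics
from concurrent.futures import ThreadPoolExecutor

def run_load_test(send_request, n_users: int = 20,
                  requests_per_user: int = 5) -> dict:
    latencies, token_counts = [], []  # list.append is thread-safe in CPython

    def worker():
        for _ in range(requests_per_user):
            t0 = time.perf_counter()
            tokens = send_request()  # should return completion token count
            latencies.append(time.perf_counter() - t0)
            token_counts.append(tokens)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_users) as pool:
        for _ in range(n_users):
            pool.submit(worker)
    elapsed = time.perf_counter() - start

    return {
        "tokens_per_sec": sum(token_counts) / elapsed,
        "p50_latency_s": statistics.median(latencies),
        "total_requests": len(latencies),
    }
```

Run it once before and once after a config change, with the same prompt mix, and compare tokens_per_sec.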
Expected throughput gain: 20–40% more tokens/sec with prefix caching on shared-prompt workloads; 10–20% from right-sizing max-model-len alone.
What You Learned
- KV cache size scales with sequence length, not prompt complexity — cap max-model-len to your real P99
- Prefix caching is only effective when the shared prefix is identical across requests — structure prompts accordingly
- Preemption thrashing (cache evictions causing recomputation) is usually a sign of under-allocated cache or too-high max-num-seqs, not a scheduling problem
Limitation: Prefix caching doesn't help for highly variable or user-specific prompts with no shared prefix. In those cases, focus entirely on right-sizing and consider quantization (AWQ, GPTQ) to free up headroom.
When NOT to use aggressive cache sizing: If you're running fine-tuned models with LoRA adapters, the adapter weights also consume GPU memory — budget for them before maximizing KV cache allocation.
Tested on vLLM 0.6.x, CUDA 12.4, A100 80GB and L40S. Settings are model-agnostic but numbers will vary by model architecture.