LLM response caching with Redis is the fastest way to cut your OpenAI or Anthropic bill without touching your prompts or switching models. If your app repeatedly asks the LLM the same or similar questions, you are paying full price for responses you already have.
This guide walks through two caching layers: exact-match caching for identical prompts, and semantic caching using embeddings for near-duplicate queries. By the end you'll have a production-ready Python class that slots in front of any LLM call and cuts redundant API spend by 40–60%.
You'll learn:
- How to set up Redis 7.2 as an LLM response cache with TTL expiry
- Exact-match caching using prompt hashing for zero-latency hits
- Semantic caching with OpenAI embeddings + cosine similarity for fuzzy matches
- Cache key design and eviction strategy for multi-tenant apps
- Measuring cache hit rate and estimating monthly USD savings
Time: 25 min | Difficulty: Intermediate | Stack: Python 3.12 · Redis 7.2 · OpenAI SDK 1.x · Docker
Why LLM API Costs Balloon Without Caching
Production LLM apps repeat themselves more than you think. Customer support bots answer the same five questions all day. RAG pipelines re-embed identical queries from different users. Code assistants regenerate boilerplate explanations for every new session.
At GPT-4o pricing ($5 per 1M input tokens, $15 per 1M output tokens as of Q1 2026), a 500-token prompt with a 300-token answer served 10,000 times per day costs $70/day — even if the answer never changes.
Symptoms that caching will help:
- Token spend grows linearly with traffic, not with unique queries
- Logs show repeated identical or near-identical prompts
- Response latency varies wildly (network variance, not logic variance)
- Monthly OpenAI/Anthropic invoices exceed $200 USD with < 10k daily active users
Architecture: Two-Layer Redis Cache
Two-layer cache: exact-match hash lookup first, semantic embedding search second, LLM API only on full miss
The strategy is two cache layers with an API fallback:
- Layer 1 — Exact match: SHA-256 hash of the normalized prompt. ~0ms lookup. Handles repeated verbatim queries.
- Layer 2 — Semantic match: Embed the prompt, search Redis for stored vectors within a cosine similarity threshold (default 0.97). Handles rephrased questions with identical intent.
- Layer 3 — LLM API: Only reached on a full miss. Response stored in both layers before returning.
This design keeps Redis as the single source of truth. No in-process dictionaries, no file caches — just Redis, which you can scale, inspect, and flush independently of your app.
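The fallback order can be sketched synchronously, with plain dicts standing in for the two Redis layers. `lookup_or_call`, `exact`, and `semantic_lookup` are illustrative names for this sketch, not part of the implementation built later in this guide:

```python
def lookup_or_call(prompt, exact, semantic_lookup, call_llm):
    # Layer 1: exact match on the normalized prompt
    normalized = " ".join(prompt.split()).lower()
    if normalized in exact:
        return exact[normalized], "exact"
    # Layer 2: fuzzy match (the real version uses embeddings + cosine similarity)
    hit = semantic_lookup(normalized)
    if hit is not None:
        exact[normalized] = hit  # write back so the next identical call stops at Layer 1
        return hit, "semantic"
    # Layer 3: full miss — pay for the LLM call, then populate the cache
    response = call_llm(prompt)
    exact[normalized] = response
    return response, "miss"
```

The write-back in Layer 2 is the key design choice: a semantic hit promotes the entry into the exact layer, so rephrasings only ever pay the embedding cost once.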
Prerequisites
```bash
# Python 3.12 + uv (recommended over pip for speed)
uv pip install openai "redis[hiredis]" numpy tiktoken
```

Quote `redis[hiredis]` — unquoted square brackets are globbed by zsh and some other shells.

```bash
# Redis 7.2 via Docker — production-ready with persistence
docker run -d \
  --name llm-cache \
  -p 6379:6379 \
  -v redis-data:/data \
  redis:7.2-alpine \
  redis-server --appendonly yes --maxmemory 2gb --maxmemory-policy allkeys-lru
```
The allkeys-lru eviction policy means Redis auto-evicts least-recently-used responses when it hits the 2GB memory cap — no manual TTL management needed for overflow.
Solution
Step 1: Build the Cache Client
Create llm_cache.py. This module owns all Redis reads and writes.
```python
import asyncio  # used by chat() in Step 2
import hashlib
import json

import numpy as np
import redis.asyncio as aioredis
from openai import AsyncOpenAI

SIMILARITY_THRESHOLD = 0.97  # tune down to 0.92 for more aggressive fuzzy hits
CACHE_TTL_SECONDS = 86400    # 24h — prompts older than this are stale


class LLMCache:
    def __init__(self, redis_url: str = "redis://localhost:6379"):
        # ConnectionPool caps concurrent Redis connections — prevents
        # file descriptor exhaustion under load
        pool = aioredis.ConnectionPool.from_url(
            redis_url, max_connections=50, decode_responses=True
        )
        self.redis = aioredis.Redis(connection_pool=pool)
        self.openai = AsyncOpenAI()

    # ── Exact-match layer ──────────────────────────────────────────────────
    def _hash(self, prompt: str) -> str:
        # Normalize whitespace before hashing so "hello  world" == "hello world"
        normalized = " ".join(prompt.split()).lower()
        return hashlib.sha256(normalized.encode()).hexdigest()

    async def get_exact(self, prompt: str) -> str | None:
        key = f"llm:exact:{self._hash(prompt)}"
        return await self.redis.get(key)

    async def set_exact(self, prompt: str, response: str) -> None:
        key = f"llm:exact:{self._hash(prompt)}"
        await self.redis.setex(key, CACHE_TTL_SECONDS, response)

    # ── Semantic layer ─────────────────────────────────────────────────────
    async def _embed(self, text: str) -> list[float]:
        # text-embedding-3-small: $0.02/1M tokens — cheap enough to embed every query
        result = await self.openai.embeddings.create(
            model="text-embedding-3-small", input=text
        )
        return result.data[0].embedding

    async def get_semantic(self, prompt: str) -> str | None:
        query_vec = np.array(await self._embed(prompt))
        # SCAN (non-blocking, unlike KEYS) over all stored vectors — replace
        # with Redis Vector Search in prod for > 50k entries
        best_score, best_id = 0.0, None
        async for key in self.redis.scan_iter(match="llm:sem:vec:*"):
            raw = await self.redis.get(key)
            if not raw:
                continue
            stored_vec = np.array(json.loads(raw))
            score = float(
                np.dot(query_vec, stored_vec)
                / (np.linalg.norm(query_vec) * np.linalg.norm(stored_vec) + 1e-9)
            )
            if score > best_score:
                best_score, best_id = score, key.replace("llm:sem:vec:", "")
        if best_score >= SIMILARITY_THRESHOLD and best_id:
            return await self.redis.get(f"llm:sem:resp:{best_id}")
        return None

    async def set_semantic(self, prompt: str, response: str) -> None:
        vec = await self._embed(prompt)
        entry_id = self._hash(prompt)
        pipe = self.redis.pipeline()
        pipe.setex(f"llm:sem:vec:{entry_id}", CACHE_TTL_SECONDS, json.dumps(vec))
        pipe.setex(f"llm:sem:resp:{entry_id}", CACHE_TTL_SECONDS, response)
        await pipe.execute()
```
Expected behavior: get_exact returns in under 1ms on a local Redis. get_semantic takes 20–80ms because it embeds the query first — still far cheaper than a full LLM round trip at 800–2000ms.
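The Layer-2 decision reduces to one cosine comparison against the threshold. A self-contained sketch with toy 3-dimensional vectors (real embeddings from text-embedding-3-small have 1,536 dimensions):

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.97

def cosine(a: list[float], b: list[float]) -> float:
    # Same formula as get_semantic, with the epsilon guarding divide-by-zero
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

near = cosine([1.0, 0.0, 0.1], [1.0, 0.0, 0.12])  # nearly parallel vectors
far = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])    # orthogonal vectors
print(near >= SIMILARITY_THRESHOLD, far >= SIMILARITY_THRESHOLD)  # → True False
```

At 0.97 only near-parallel vectors count as a hit; orthogonal (unrelated) queries score near zero and fall through to the LLM.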
If it fails:
- `redis.exceptions.ConnectionError` → the Redis container isn't running. Run `docker ps` and confirm port 6379 is bound.
- `AuthenticationError` from OpenAI → the `OPENAI_API_KEY` env var is missing.
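Before moving on, the whitespace normalization inside `_hash` can be sanity-checked in isolation. `prompt_hash` here is a standalone copy of that logic, not a new API:

```python
import hashlib

def prompt_hash(prompt: str) -> str:
    # Same normalization as LLMCache._hash: collapse whitespace, lowercase
    normalized = " ".join(prompt.split()).lower()
    return hashlib.sha256(normalized.encode()).hexdigest()

# Formatting differences collapse to a single cache key:
print(prompt_hash("What is the capital of France?")
      == prompt_hash("  what is THE capital\nof france?  "))  # → True
```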
Step 2: Build the Cached LLM Caller
```python
# In llm_cache.py — add this method to LLMCache
# (requires `import asyncio` at the top of the module)

    async def chat(
        self,
        prompt: str,
        model: str = "gpt-4o-mini",  # $0.15/1M input — a good fit for cacheable workloads
        system: str = "You are a helpful assistant.",
    ) -> tuple[str, str]:
        """Return (response_text, cache_status); cache_status is 'exact' | 'semantic' | 'miss'."""
        # Layer 1: exact match
        if hit := await self.get_exact(prompt):
            return hit, "exact"
        # Layer 2: semantic match
        if hit := await self.get_semantic(prompt):
            # Also write back as exact so the next identical call skips embedding
            await self.set_exact(prompt, hit)
            return hit, "semantic"
        # Layer 3: LLM API — only on a full miss
        completion = await self.openai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": prompt},
            ],
        )
        response = completion.choices[0].message.content or ""
        # Write to both layers concurrently
        await asyncio.gather(
            self.set_exact(prompt, response),
            self.set_semantic(prompt, response),
        )
        return response, "miss"
```
```python
# main.py — minimal usage example
import asyncio

from llm_cache import LLMCache


async def main():
    cache = LLMCache()
    prompts = [
        "What is the capital of France?",
        "Tell me the capital city of France.",  # semantic hit
        "What is the capital of France?",       # exact hit
    ]
    for p in prompts:
        text, status = await cache.chat(p)
        print(f"[{status:8s}] {p[:50]}")
        print(f"  → {text[:80]}\n")


asyncio.run(main())
```
Expected output:

```
[miss    ] What is the capital of France?
  → Paris is the capital of France.

[semantic] Tell me the capital city of France.
  → Paris is the capital of France.

[exact   ] What is the capital of France?
  → Paris is the capital of France.
```
Step 3: Add Hit-Rate Metrics
Tracking hit rate lets you tune SIMILARITY_THRESHOLD with real data.
```python
# Add to LLMCache.__init__
self._stats = {"exact": 0, "semantic": 0, "miss": 0}

# Increment in chat() just before each return, once status is known
self._stats[status] += 1

# New method on LLMCache
def stats(self) -> dict:
    total = sum(self._stats.values()) or 1
    hit_rate = (self._stats["exact"] + self._stats["semantic"]) / total
    return {**self._stats, "hit_rate": round(hit_rate, 3)}
```
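The hit-rate arithmetic can be checked standalone. `hit_rate` here mirrors the `stats()` method, and the counter values are made up for illustration:

```python
def hit_rate(stats: dict) -> float:
    # `or 1` avoids division by zero before any traffic arrives
    total = sum(stats.values()) or 1
    return round((stats["exact"] + stats["semantic"]) / total, 3)

print(hit_rate({"exact": 120, "semantic": 45, "miss": 85}))  # → 0.66
```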
Push self.stats() to your metrics system (Datadog, Prometheus, CloudWatch) every minute. A healthy production cache hits 40–65% after the first day of warm traffic.
Step 4: Docker Compose for Local Dev
```yaml
# docker-compose.yml
services:
  redis:
    image: redis:7.2-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    command: >
      redis-server
      --appendonly yes
      --maxmemory 2gb
      --maxmemory-policy allkeys-lru
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

  app:
    build: .
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - REDIS_URL=redis://redis:6379
    depends_on:
      redis:
        condition: service_healthy

volumes:
  redis-data:
```
```bash
OPENAI_API_KEY=sk-... docker compose up
```
Verification
```bash
# Confirm keys are being written after running main.py
docker exec llm-cache redis-cli KEYS "llm:*"
```

You should see output like:

```
1) "llm:exact:a3f9c2..."
2) "llm:sem:vec:a3f9c2..."
3) "llm:sem:resp:a3f9c2..."
```

```bash
# Check memory usage
docker exec llm-cache redis-cli INFO memory | grep used_memory_human
```

You should see something like `used_memory_human:1.20M` (grows with each unique response cached).
Cache Invalidation Patterns
A common gap: cached responses go stale after you update your system prompt or switch models. Two strategies:
1. Namespace versioning — prefix all keys with a version string. Bump the version on any prompt or model change; old keys expire naturally via TTL.
```python
CACHE_VERSION = "v3"  # increment when the system prompt or model changes

# Updated _hash on LLMCache
def _hash(self, prompt: str) -> str:
    normalized = " ".join(prompt.split()).lower()
    versioned = f"{CACHE_VERSION}:{normalized}"
    return hashlib.sha256(versioned.encode()).hexdigest()
```
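A quick standalone check that bumping the version rotates every cache key. `versioned_hash` mirrors the versioned `_hash` above, with the version passed as a parameter for testability:

```python
import hashlib

def versioned_hash(prompt: str, version: str) -> str:
    normalized = " ".join(prompt.split()).lower()
    # Prefixing the version changes every digest when the version is bumped
    return hashlib.sha256(f"{version}:{normalized}".encode()).hexdigest()

# Old v3 entries can never be served once the code hashes with v4:
assert versioned_hash("capital of France?", "v3") != versioned_hash("capital of France?", "v4")
```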
2. Manual flush — for emergency invalidation (wrong answer went viral):
```bash
docker exec llm-cache redis-cli FLUSHDB
```
Use FLUSHDB (current database only), not FLUSHALL, unless you share Redis with other services.
Redis Cache vs In-Memory Cache for LLM Responses
| | Redis Cache | In-Memory Dict |
|---|---|---|
| Survives restarts | ✅ (with persistence) | ❌ |
| Multi-instance sharing | ✅ | ❌ |
| TTL/eviction built-in | ✅ | ❌ manual |
| Memory cap | Configurable (2GB here) | Grows unbounded |
| Latency | < 1ms local | < 0.1ms |
| Best for | Production, multi-worker | Single-process dev |
In-memory caches work fine during development but break under load balancers where each worker process has its own dict — requests miss the cache even when the answer is already computed by a different worker.
Estimating USD Savings
Use this formula to project monthly savings after caching:
```
monthly_savings_usd = 30 × (
    (daily_requests × hit_rate × avg_input_tokens / 1_000_000 × input_price_per_1m)
  + (daily_requests × hit_rate × avg_output_tokens / 1_000_000 × output_price_per_1m)
)
```
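The same formula as a small helper (`monthly_savings_usd` is a name made up for this sketch), evaluated on the support-bot numbers worked through next:

```python
def monthly_savings_usd(daily_requests, hit_rate, avg_in, avg_out,
                        in_price_per_1m, out_price_per_1m):
    # Daily spend avoided = cached requests × per-request token cost
    daily = daily_requests * hit_rate * (
        avg_in / 1_000_000 * in_price_per_1m
        + avg_out / 1_000_000 * out_price_per_1m
    )
    return daily * 30

# GPT-4o, 10k req/day, 50% hit rate, 500 input + 300 output tokens
print(round(monthly_savings_usd(10_000, 0.50, 500, 300, 5.0, 15.0), 2))  # → 1050.0
```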
Example for a support bot on GPT-4o (input $5/1M, output $15/1M), 10k req/day, 50% hit rate, 500 input + 300 output tokens avg:
```
= (10,000 × 0.50 × 500/1,000,000 × $5) + (10,000 × 0.50 × 300/1,000,000 × $15)
= $12.50/day + $22.50/day = $35/day → ~$1,050/month saved
```
Redis at 2GB on AWS ElastiCache costs roughly $15–25/month (us-east-1 cache.t4g.small). Net saving: ~$1,025/month from one cache layer.
What You Learned
- Exact-match caching using SHA-256 hashing handles verbatim repeats with sub-millisecond latency and zero OpenAI calls
- Semantic caching with cosine similarity over embedding vectors catches rephrased duplicates — tune `SIMILARITY_THRESHOLD` between 0.92 (aggressive) and 0.99 (conservative) based on your domain
- `allkeys-lru` eviction makes Redis self-managing; you set a memory ceiling and let Redis handle the rest
- Namespace versioning is the safest cache invalidation strategy when your prompts evolve
- The linear scan over stored vectors works up to ~50k entries; migrate to Redis Vector Search (`redis-py` `FT.CREATE` with a `VECTOR` field) beyond that
Tested on Python 3.12.3, Redis 7.2.4, OpenAI SDK 1.35, Docker 27 — macOS Sequoia & Ubuntu 24.04
FAQ
Q: Does this work with Anthropic Claude or other LLM providers?
A: Yes. The cache layer is provider-agnostic — swap the openai.chat.completions.create call for anthropic.messages.create or any other SDK. Only the embedding model call uses OpenAI; you can replace that with a local model via Ollama if you want zero external dependencies on the cache path.
Q: What similarity threshold should I start with?
A: Start at 0.97. Lower it to 0.93 only after reviewing a sample of near-misses manually — too low and you'll return wrong cached answers for superficially similar but semantically different queries (e.g. "Paris France capital" vs "Paris Texas population").
Q: Does caching break if two users have different permissions or context?
A: If user-specific context (roles, account data) is injected into the prompt, the hash includes that context and the cache is effectively per-user. To share cache across users, move user context to the system prompt and cache only the user turn — but only if the response is truly context-independent.
Q: How much RAM does Redis need for 1 million cached responses?
A: Each entry stores a 1,536-dimension vector plus the response text (avg 500 chars ≈ 0.5KB). As packed float32 the vector is 6KB, so roughly 6.5KB × 1,000,000 = 6.5GB total — but the JSON encoding used in this guide is several times larger per vector, so pack vectors as bytes (e.g. NumPy `tobytes()`) before caching at this scale. Plan for an 8GB Redis instance on AWS ElastiCache cache.r7g.large ($120/month us-east-1) for 1M entries, or reduce TTL to keep the working set smaller.
Q: Can I use Redis Cloud instead of self-hosting?
A: Redis Cloud free tier supports 30MB — enough for development. Production use starts at $7/month for 100MB on the fixed plan or pay-as-you-go at roughly $0.10/GB-hour in us-east-1. Self-hosting on EC2 or ECS is cheaper at scale but adds ops overhead.