Every 'What are your business hours?' question hits GPT-4o and costs $0.002. A Redis semantic cache means only the first one does.
Your LLM API bill looks like a phone number from the 1990s, and half the queries are asking the same thing in slightly different words. You’re paying for the privilege of re-generating "We're open 9-5" a thousand times. The standard advice is to cache, but a naive GET user:query fails the moment someone asks "When do you open?" instead of "What are your hours?".
This is where a two-tier Redis semantic cache architecture cuts through the noise—and the cost. We’ll layer an O(1) exact-match cache over an embedding-based similarity search, using core Redis data structures and Redis Stack. The goal isn't just caching; it's building a cost-killing machine that knows "business hours" and "open tomorrow" deserve the same cached answer.
Why Your Basic String Cache Is Bleeding Money
You might already be doing something like this with redis-py:
import redis
import hashlib

r = redis.Redis()

def get_llm_response(query):
    # Deterministic key: SHA-256 of the raw query string
    key = hashlib.sha256(query.encode()).hexdigest()
    cached = r.get(key)
    if cached:
        return cached.decode()  # redis-py returns bytes by default
    # Cache miss: pay for the LLM API call, then cache the answer
    response = expensive_llm_call(query)
    r.setex(key, 3600, response)  # TTL 1 hour
    return response
This catches exact duplicates. It’s fast—a GET against a Redis String takes about 0.3ms. But it’s brittle. The following queries all miss the cache and trigger separate, costly LLM calls:
- "What are your business hours?"
- "When do you open?"
- "Are you open right now?"
You need the second tier: semantic matching. But before we get to vectors, we must nail the exact-match layer. It’s our first and fastest line of defense.
Tier 1: The O(1) Exact-Match Guardrail
The exact-match cache is a classic cache-aside pattern using Redis Strings. The key is deterministic hashing of the normalized query.
Real Error & Fix: WRONGTYPE Operation against a key holding the wrong kind of value
This happens if you later try to use HSET on a key that was set with SET. Fix: Namespace your keys clearly or delete the key before changing its type. For a cache, use DEL key if you're switching strategies.
We use SETEX for automatic expiry. A SHA256 hash is fine for the key; collision isn't a practical concern. The real optimization is in the TTL strategy.
127.0.0.1:6379> SETEX "cache:exact:9f86d08..." 3600 "We are open from 9 AM to 5 PM, Monday through Friday."
OK
127.0.0.1:6379> TTL "cache:exact:9f86d08..."
(integer) 3572
127.0.0.1:6379> GET "cache:exact:9f86d08..."
"We are open from 9 AM to 5 PM, Monday through Friday."
TTL Strategy: Not all queries are equal. Use a Sorted Set to manage TTL tiers.
- Short TTL (e.g., 300 seconds): For time-sensitive queries ("news about OpenAI today", "current Bitcoin price"). Use `ZADD ttl:tier:short <timestamp> <cache_key>` to track them.
- Long TTL (e.g., 86400 seconds): For static FAQs ("business hours", "refund policy"). Add them with `ZADD ttl:tier:long <timestamp> <cache_key>`.
A background worker (using BullMQ or Celery) can periodically ZRANGEBYSCORE the "short" tier and delete keys whose TTLs have effectively expired, or update them with new API calls. This is cache warming for dynamic data.
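Here's a minimal redis-py sketch of that worker loop. It assumes each cache key was registered in the tier set at write time with ZADD, and it simply evicts stale short-tier entries; re-warming them would additionally require keeping the original query text somewhere recoverable.

import time

def register_short_tier(r, cache_key):
    r.zadd("ttl:tier:short", {cache_key: time.time()})  # score = last refresh timestamp

def expire_short_tier(r, window=300):
    cutoff = time.time() - window
    stale = r.zrangebyscore("ttl:tier:short", "-inf", cutoff)  # entries older than the short window
    if stale:
        r.delete(*stale)                  # drop the cached responses
        r.zrem("ttl:tier:short", *stale)  # and their tier-tracking members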
Tier 2: Semantic Similarity with Redis Vector Search
When the exact match misses, we move to semantic lookup. This requires turning the user query into a vector embedding (using OpenAI's text-embedding-3-small, Cohere, or an open-source model) and searching for similar cached queries.
Redis Stack with the RediSearch module (and, as of Redis 8, core Redis itself) provides vector search. We store previous queries and their LLM responses. The data structure of choice is a Hash.
Step 1: Create an index for vector search.
FT.CREATE idx:semantic_cache ON HASH PREFIX 1 "cache:semantic:" SCHEMA
query TEXT
query_embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 1536 DISTANCE_METRIC COSINE
response TEXT
created_at NUMERIC SORTABLE
This creates a HNSW index on the query_embedding field for fast approximate nearest neighbor search.
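If you'd rather create the index from application code than from redis-cli, redis-py exposes the same schema through its search commands; this is a sketch assuming redis-py 5.x module paths:

from redis.commands.search.field import TextField, NumericField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

r.ft("idx:semantic_cache").create_index(
    fields=[
        TextField("query"),
        VectorField("query_embedding", "HNSW", {
            "TYPE": "FLOAT32", "DIM": 1536, "DISTANCE_METRIC": "COSINE",
        }),
        TextField("response"),
        NumericField("created_at", sortable=True),
    ],
    definition=IndexDefinition(prefix=["cache:semantic:"], index_type=IndexType.HASH),
)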
Step 2: Store a new query and its response.
# Using redis-py and OpenAI's embedding API
import hashlib
import time

import numpy as np
from openai import OpenAI

client = OpenAI()
embedding = client.embeddings.create(
    input=query, model="text-embedding-3-small"
).data[0].embedding

# Store the query, its embedding, and the LLM response as a Redis Hash
cache_key = f"cache:semantic:{hashlib.sha256(query.encode()).hexdigest()}"
r.hset(cache_key, mapping={
    "query": query,
    "query_embedding": np.array(embedding, dtype=np.float32).tobytes(),  # raw FLOAT32 bytes, as the index expects
    "response": llm_response,
    "created_at": int(time.time())
})
# Set a TTL on the key itself so stale entries age out
r.expire(cache_key, 86400)
Step 3: Search for similar cached queries before calling the LLM.
FT.SEARCH idx:semantic_cache
"(*)=>[KNN 3 @query_embedding $query_vec AS vector_score]"
PARAMS 2 query_vec "<BYTES_OF_EMBEDDING>"
DIALECT 2
SORTBY vector_score ASC
RETURN 3 query response vector_score
You get the top 3 most similar cached queries. The critical step is applying a cosine distance threshold. If the top result has a vector_score below, say, 0.15 (lower score = more similar in COSINE metric), it's a semantic hit. Return the cached response.
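In redis-py, that lookup plus the threshold check might look like the following sketch; it reuses the same `r` connection and the FLOAT32 query vector bytes produced in step 2, and the 0.15 cutoff is a starting point to tune against your own traffic:

from redis.commands.search.query import Query

def semantic_lookup(query_vec_bytes, threshold=0.15):
    q = (
        Query("(*)=>[KNN 3 @query_embedding $query_vec AS vector_score]")
        .sort_by("vector_score")  # ascending: most similar first
        .return_fields("query", "response", "vector_score")
        .dialect(2)
    )
    results = r.ft("idx:semantic_cache").search(q, query_params={"query_vec": query_vec_bytes})
    if results.docs and float(results.docs[0].vector_score) <= threshold:
        return results.docs[0].response  # semantic hit: reuse the cached answer
    return None  # semantic miss: fall through to the LLM call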
Real Error & Fix: OOM command not allowed when used memory > 'maxmemory'
Your Redis instance will scream this when it hits its memory limit with no eviction policy. Fix: For a pure cache like this, set maxmemory-policy allkeys-lru in your redis.conf. This tells Redis to evict least-recently-used keys when memory is full. For a queue (like BullMQ), you'd use noeviction.
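If you'd rather not edit redis.conf, the same policy can be applied at runtime (the 2gb limit here is illustrative):

127.0.0.1:6379> CONFIG SET maxmemory 2gb
OK
127.0.0.1:6379> CONFIG SET maxmemory-policy allkeys-lru
OK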
Performance: Exact Match vs. Semantic Search vs. No Cache
Let's be clear about what each layer costs you. The exact match is virtually free. The semantic search adds latency but saves a massive LLM call. Here’s the breakdown:
| Operation | Latency (approx.) | Cost Implication | When It Happens |
|---|---|---|---|
| Redis GET (Exact Hit) | 0.3 ms | $0.000 | Query hash matches exactly. |
| Redis Vector Search (Semantic Hit) | 5-15 ms | $0.000 | Similar query found within similarity threshold. |
| LLM API Call (Cache Miss) | 500-2000 ms | $0.002 - $0.10 | No similar query found; must call GPT-4o, Claude, etc. |
| Database Query (No Cache) | 50 ms | (Indirect) | Hypothetical fallback without any cache. |
The math is simple: a cache hit rate above 90% means fewer than one in ten queries ever reaches the LLM API. If 90% of queries are served by Redis at sub-15ms and only 10% incur a 1000ms, $0.002 LLM call, you've cut your effective cost per query by roughly 90% and slashed average latency along with it.
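To make that arithmetic explicit, here it is spelled out with the figures from the table above:

# Blended cost and latency at a 90% overall cache hit rate
hit_rate = 0.90
llm_cost, llm_latency_ms = 0.002, 1000   # per-call cost and latency on a miss
cache_latency_ms = 15                    # worst-case semantic-hit latency

blended_cost = (1 - hit_rate) * llm_cost                                             # $0.0002 per query
blended_latency_ms = hit_rate * cache_latency_ms + (1 - hit_rate) * llm_latency_ms   # ~113.5 ms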
Cache Warming and Invalidation: The Operational Playbook
A cold cache is a wasteful cache. At application startup or via a scheduled job, pre-populate the cache with known FAQs.
Cache Warming Script:
// Using ioredis and BullMQ for job scheduling
const Redis = require('ioredis');
const { Queue } = require('bullmq');
const redis = new Redis({ maxRetriesPerRequest: null }); // BullMQ expects this option on ioredis connections
const warmupQueue = new Queue('cache-warmup', { connection: redis });
// Add jobs for each FAQ
const faqs = [
{ q: "What are your business hours?", a: "9-5 M-F" },
{ q: "What is your refund policy?", a: "30 days, no questions asked." }
];
for (const faq of faqs) {
warmupQueue.add('preload-faq', faq);
}
// Worker will process, generate embedding, and store in both exact and semantic caches.
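The worker itself isn't shown above. If the rest of your pipeline is Python, a hedged sketch of what it does per FAQ could look like this, reusing the hashing and embedding steps from earlier; `embed()` stands in for whatever wrapper you have around the embeddings API:

import hashlib
import time
import numpy as np

def warm_faq(r, question, answer, embed, ttl=86400):
    digest = hashlib.sha256(question.encode()).hexdigest()
    # Tier 1: exact-match entry
    r.setex(f"cache:exact:{digest}", ttl, answer)
    # Tier 2: semantic entry with the question's embedding
    semantic_key = f"cache:semantic:{digest}"
    r.hset(semantic_key, mapping={
        "query": question,
        "query_embedding": np.array(embed(question), dtype=np.float32).tobytes(),
        "response": answer,
        "created_at": int(time.time()),
    })
    r.expire(semantic_key, ttl)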
Invalidation on Knowledge Updates: When your underlying data changes (e.g., a product description update in your CMS), you must invalidate related cached responses. This is best done event-driven.
- Emit a Redis Pub/Sub message or add to a Stream on data change: `PUBLISH knowledge_updates "product:123"`.
- Have a cache service subscribed to that channel. It can then:
  - Perform a brute-force `FT.SEARCH` on the semantic index for queries related to "product 123" (using the TEXT field) and `DEL` those keys.
  - More sophisticated: maintain a secondary Redis Set of cache keys tagged by topic for O(1) invalidation (see the sketch after this list).
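A sketch of that subscriber in redis-py follows; it assumes cache keys were added to a per-topic Set such as `tag:product:123` at write time (the tag-set naming is illustrative):

def run_invalidation_listener(r):
    pubsub = r.pubsub()
    pubsub.subscribe("knowledge_updates")
    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        topic = message["data"].decode()      # e.g. "product:123"
        tag_set = f"tag:{topic}"
        keys = r.smembers(tag_set)            # cache keys tagged with this topic at write time
        if keys:
            r.delete(*keys)                   # invalidate both exact and semantic entries
        r.delete(tag_set)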
Monitoring Your Cost-Killing Machine
You can't optimize what you don't measure. Use Redis's built-in INFO command and custom metrics.
- Cache Hit Rate: The golden metric. Track it in your app: `r.incr('stats:cache:hits')` on every cache hit, `r.incr('stats:cache:misses')` on every miss, then `hit_rate = hits / (hits + misses)`.
- Latency: Use `redis-cli --latency` to monitor base Redis performance. Ensure your P99 for `FT.SEARCH` stays under 20ms.
- Memory: Monitor `used_memory` via `INFO memory`. With `allkeys-lru`, memory usage should plateau at your `maxmemory` setting.
- Cost Savings: Log every LLM call. At the end of the month, `Misses * Avg_Cost_Per_Call` is your actual cost; compare it to `Total_Queries * Avg_Cost_Per_Call` to see your savings (see the sketch after this list). A well-tuned system should show at least a 50% reduction in LLM API costs.
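A small sketch of that month-end report, reading the counters back; the $0.002 average is an assumption you'd replace with your own logged figure:

def report_savings(r, avg_cost_per_call=0.002):
    hits = int(r.get("stats:cache:hits") or 0)
    misses = int(r.get("stats:cache:misses") or 0)
    total = hits + misses
    actual_cost = misses * avg_cost_per_call     # only misses reached the LLM
    uncached_cost = total * avg_cost_per_call    # what every query would have cost
    if total:
        print(f"hit rate: {hits / total:.1%}")
        print(f"saved ${uncached_cost - actual_cost:.2f} of ${uncached_cost:.2f}")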
Next Steps: From Prototype to Production
You now have a blueprint. To move this from a script to a system:
- Choose Your Vector Database Layer: If you're all-in on Redis, use Redis Stack, or Redis 8, which brings the query engine (including vector search) into core Redis. Alternatives like Upstash (serverless Redis) or Momento (serverless cache) abstract scaling but may have vector search limitations.
- Benchmark with Pipelining: A single `GET` is fast, but when processing a batch of queries, use Redis pipelining to avoid paying a network round trip per command. The benchmark doesn't lie: roughly 110k ops/sec issuing one command at a time versus over 1 million ops/sec with pipelined batches. Use `ioredis` or `redis-py` with pipeline support (see the sketch after this list).
- Plan for Scale: A single Redis node can handle massive load (1M+ ops/sec). For redundancy, use Redis Sentinel. For horizontal scale and vector indexes too large for a single node's memory, you'll need Redis Cluster. Be aware of the `CLUSTERDOWN Hash slot not served` error; fix it with `redis-cli --cluster fix <host>:<port>` after a node failure.
- Implement Circuit Breakers: If your LLM API is down, what does your cache do? Consider serving stale cached responses (with a flag) for a short period if the semantic search finds a match, even if it's slightly beyond your normal similarity threshold.
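As a pipelining sketch for the exact-match tier, assuming you already have a list of hashed cache keys, redis-py batches the GETs into one round trip:

def batch_exact_lookup(r, keys):
    pipe = r.pipeline(transaction=False)  # no MULTI/EXEC needed for plain reads
    for key in keys:
        pipe.get(key)
    return dict(zip(keys, pipe.execute()))  # key -> cached response, or None on a miss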
The path to a manageable LLM API bill isn't begging for credits; it's engineering smarter gateways. Your Redis semantic cache is that gateway—a layer that respects your budget as much as it respects the semantics of a user's question. Stop paying to answer the same question twice.