Every 'What are your business hours?' question hits GPT-4o and costs $0.002. A Redis semantic cache means only the first one does.
Your LLM API bill looks like a phone number from the 1990s, and half the queries are asking the same thing in slightly different words. You’re paying for the privilege of re-generating "We're open 9-5" a thousand times. The standard advice is to cache, but a naive GET user:query fails the moment someone asks "When do you open?" instead of "What are your hours?".
This is where a two-tier Redis semantic cache architecture cuts through the noise—and the cost. We’ll layer an O(1) exact-match cache over an embedding-based similarity search, using core Redis data structures and Redis Stack. The goal isn't just caching; it's building a cost-killing machine that knows "business hours" and "open tomorrow" deserve the same cached answer.
Why Your Basic String Cache Is Bleeding Money
You might already be doing something like this with redis-py:
import redis
import hashlib

r = redis.Redis()

def get_llm_response(query):
    # Deterministic key: SHA-256 of the raw query string
    key = hashlib.sha256(query.encode()).hexdigest()
    cached = r.get(key)
    if cached:
        return cached.decode()  # redis-py returns bytes by default
    # Cache miss: pay for the LLM API call, then cache the answer
    response = expensive_llm_call(query)
    r.setex(key, 3600, response)  # TTL 1 hour
    return response
This catches exact duplicates. It’s fast—a GET against a Redis String takes about 0.3ms. But it’s brittle. The following queries all miss the cache and trigger separate, costly LLM calls:
- "What are your business hours?"
- "When do you open?"
- "Are you open right now?"
You need the second tier: semantic matching. But before we get to vectors, we must nail the exact-match layer. It’s our first and fastest line of defense.
Tier 1: The O(1) Exact-Match Guardrail
The exact-match cache is a classic cache-aside pattern using Redis Strings. The key is deterministic hashing of the normalized query.
Real Error & Fix: WRONGTYPE Operation against a key holding the wrong kind of value
This happens if you later try to use HSET on a key that was set with SET. Fix: Namespace your keys clearly or delete the key before changing its type. For a cache, use DEL key if you're switching strategies.
We use SETEX for automatic expiry. A SHA256 hash is fine for the key; collision isn't a practical concern. The real optimization is in the TTL strategy.
127.0.0.1:6379> SETEX "cache:exact:9f86d08..." 3600 "We are open from 9 AM to 5 PM, Monday through Friday."
OK
127.0.0.1:6379> TTL "cache:exact:9f86d08..."
(integer) 3572
127.0.0.1:6379> GET "cache:exact:9f86d08..."
"We are open from 9 AM to 5 PM, Monday through Friday."
TTL Strategy: Not all queries are equal. Use a Sorted Set to manage TTL tiers.
- Short TTL (e.g., 300 seconds): For time-sensitive queries ("news about OpenAI today", "current Bitcoin price"). Use `ZADD ttl:tier:short <timestamp> <cache_key>` to track them.
- Long TTL (e.g., 86400 seconds): For static FAQs ("business hours", "refund policy"). Add them with `ZADD ttl:tier:long <timestamp> <cache_key>`.
A background worker (using BullMQ or Celery) can periodically ZRANGEBYSCORE the "short" tier and delete keys whose TTLs have effectively expired, or update them with new API calls. This is cache warming for dynamic data.
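Here's a minimal redis-py sketch of that worker loop. It assumes each cache key was registered in the tier set at write time with ZADD, and it simply evicts stale short-tier entries; re-warming them would additionally require keeping the original query text somewhere recoverable.

import time

def register_short_tier(r, cache_key):
    r.zadd("ttl:tier:short", {cache_key: time.time()})  # score = last refresh timestamp

def expire_short_tier(r, window=300):
    cutoff = time.time() - window
    stale = r.zrangebyscore("ttl:tier:short", "-inf", cutoff)  # entries older than the short window
    if stale:
        r.delete(*stale)                  # drop the cached responses
        r.zrem("ttl:tier:short", *stale)  # and their tier-tracking members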
Tier 2: Semantic Similarity with Redis Vector Search
When the exact match misses, we move to semantic lookup. This requires turning the user query into a vector embedding (using OpenAI's text-embedding-3-small, Cohere, or an open-source model) and searching for similar cached queries.
Redis Stack with the RediSearch module (and, as of Redis 8, core Redis itself) provides vector search. We store previous queries and their LLM responses. The data structure of choice is a Hash.
Step 1: Create an index for vector search.
FT.CREATE idx:semantic_cache ON HASH PREFIX 1 "cache:semantic:" SCHEMA
query TEXT
query_embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 1536 DISTANCE_METRIC COSINE
response TEXT
created_at NUMERIC SORTABLE
This creates a HNSW index on the query_embedding field for fast approximate nearest neighbor search.
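If you'd rather create the index from application code than from redis-cli, redis-py exposes the same schema through its search commands; this is a sketch assuming redis-py 5.x module paths:

from redis.commands.search.field import TextField, NumericField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

r.ft("idx:semantic_cache").create_index(
    fields=[
        TextField("query"),
        VectorField("query_embedding", "HNSW", {
            "TYPE": "FLOAT32", "DIM": 1536, "DISTANCE_METRIC": "COSINE",
        }),
        TextField("response"),
        NumericField("created_at", sortable=True),
    ],
    definition=IndexDefinition(prefix=["cache:semantic:"], index_type=IndexType.HASH),
)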
Step 2: Store a new query and its response.
# Using redis-py and OpenAI's embedding API
import hashlib
import time

import numpy as np
from openai import OpenAI

client = OpenAI()
embedding = client.embeddings.create(
    input=query, model="text-embedding-3-small"
).data[0].embedding

# Store the query, its embedding, and the LLM response as a Redis Hash
cache_key = f"cache:semantic:{hashlib.sha256(query.encode()).hexdigest()}"
r.hset(cache_key, mapping={
    "query": query,
    "query_embedding": np.array(embedding, dtype=np.float32).tobytes(),  # raw FLOAT32 bytes, as the index expects
    "response": llm_response,
    "created_at": int(time.time())
})
# Set a TTL on the key itself so stale entries age out
r.expire(cache_key, 86400)
Step 3: Search for similar cached queries before calling the LLM.
FT.SEARCH idx:semantic_cache
"(*)=>[KNN 3 @query_embedding $query_vec AS vector_score]"
PARAMS 2 query_vec "<BYTES_OF_EMBEDDING>"
DIALECT 2
SORTBY vector_score ASC
RETURN 3 query response vector_score
You get the top 3 most similar cached queries. The critical step is applying a cosine distance threshold. If the top result has a vector_score below, say, 0.15 (lower score = more similar in COSINE metric), it's a semantic hit. Return the cached response.
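In redis-py, that lookup plus the threshold check might look like the following sketch; it reuses the same `r` connection and the FLOAT32 query vector bytes produced in step 2, and the 0.15 cutoff is a starting point to tune against your own traffic:

from redis.commands.search.query import Query

def semantic_lookup(query_vec_bytes, threshold=0.15):
    q = (
        Query("(*)=>[KNN 3 @query_embedding $query_vec AS vector_score]")
        .sort_by("vector_score")  # ascending: most similar first
        .return_fields("query", "response", "vector_score")
        .dialect(2)
    )
    results = r.ft("idx:semantic_cache").search(q, query_params={"query_vec": query_vec_bytes})
    if results.docs and float(results.docs[0].vector_score) <= threshold:
        return results.docs[0].response  # semantic hit: reuse the cached answer
    return None  # semantic miss: fall through to the LLM call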
Real Error & Fix: OOM command not allowed when used memory > 'maxmemory'
Your Redis instance will scream this when it hits its memory limit with no eviction policy. Fix: For a pure cache like this, set maxmemory-policy allkeys-lru in your redis.conf. This tells Redis to evict least-recently-used keys when memory is full. For a queue (like BullMQ), you'd use noeviction.
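If you'd rather not edit redis.conf, the same policy can be applied at runtime (the 2gb limit here is illustrative):

127.0.0.1:6379> CONFIG SET maxmemory 2gb
OK
127.0.0.1:6379> CONFIG SET maxmemory-policy allkeys-lru
OK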
Performance: Exact Match vs. Semantic Search vs. No Cache
Let's be clear about what each layer costs you. The exact match is virtually free. The semantic search adds latency but saves a massive LLM call. Here’s the breakdown:
| Operation | Latency (approx.) | Cost Implication | When It Happens |
|---|---|---|---|
| Redis GET (Exact Hit) | 0.3 ms | $0.000 | Query hash matches exactly. |
| Redis Vector Search (Semantic Hit) | 5-15 ms | $0.000 | Similar query found within similarity threshold. |
| LLM API Call (Cache Miss) | 500-2000 ms | $0.002 - $0.10 | No similar query found; must call GPT-4o, Claude, etc. |
| Database Query (No Cache) | 50 ms | (Indirect) | Hypothetical fallback without any cache. |
The math is simple: a cache hit rate above 90% means fewer than one in ten queries ever reaches the LLM API. If 90% of queries are served by Redis at sub-15ms and only 10% incur a 1000ms, $0.002 LLM call, you've cut your effective cost per query by roughly 90% and slashed average latency along with it.
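To make that arithmetic explicit, here it is spelled out with the figures from the table above:

# Blended cost and latency at a 90% overall cache hit rate
hit_rate = 0.90
llm_cost, llm_latency_ms = 0.002, 1000   # per-call cost and latency on a miss
cache_latency_ms = 15                    # worst-case semantic-hit latency

blended_cost = (1 - hit_rate) * llm_cost                                             # $0.0002 per query
blended_latency_ms = hit_rate * cache_latency_ms + (1 - hit_rate) * llm_latency_ms   # ~113.5 ms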
Cache Warming and Invalidation: The Operational Playbook
A cold cache is a wasteful cache. At application startup or via a scheduled job, pre-populate the cache with known FAQs.
Cache Warming Script:
// Using ioredis and BullMQ for job scheduling
const Redis = require('ioredis');
const { Queue } = require('bullmq');
const redis = new Redis({ maxRetriesPerRequest: null }); // BullMQ expects this option on ioredis connections
const warmupQueue = new Queue('cache-warmup', { connection: redis });
// Add jobs for each FAQ
const faqs = [
{ q: "What are your business hours?", a: "9-5 M-F" },
{ q: "What is your refund policy?", a: "30 days, no questions asked." }
];
for (const faq of faqs) {
warmupQueue.add('preload-faq', faq);
}
// Worker will process, generate embedding, and store in both exact and semantic caches.
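The worker itself isn't shown above. If the rest of your pipeline is Python, a hedged sketch of what it does per FAQ could look like this, reusing the hashing and embedding steps from earlier; `embed()` stands in for whatever wrapper you have around the embeddings API:

import hashlib
import time
import numpy as np

def warm_faq(r, question, answer, embed, ttl=86400):
    digest = hashlib.sha256(question.encode()).hexdigest()
    # Tier 1: exact-match entry
    r.setex(f"cache:exact:{digest}", ttl, answer)
    # Tier 2: semantic entry with the question's embedding
    semantic_key = f"cache:semantic:{digest}"
    r.hset(semantic_key, mapping={
        "query": question,
        "query_embedding": np.array(embed(question), dtype=np.float32).tobytes(),
        "response": answer,
        "created_at": int(time.time()),
    })
    r.expire(semantic_key, ttl)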
Invalidation on Knowledge Updates: When your underlying data changes (e.g., a product description update in your CMS), you must invalidate related cached responses. This is best done event-driven.
- Emit a Redis Pub/Sub message or add to a Stream on data change: `PUBLISH knowledge_updates "product:123"`.
- Have a cache service subscribed to that channel. It can then:
  - Perform a brute-force `FT.SEARCH` on the semantic index for queries related to "product 123" (using the TEXT field) and `DEL` those keys.
  - More sophisticated: maintain a secondary Redis Set of cache keys tagged by topic for O(1) invalidation (see the sketch after this list).
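A sketch of that subscriber in redis-py follows; it assumes cache keys were added to a per-topic Set such as `tag:product:123` at write time (the tag-set naming is illustrative):

def run_invalidation_listener(r):
    pubsub = r.pubsub()
    pubsub.subscribe("knowledge_updates")
    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        topic = message["data"].decode()      # e.g. "product:123"
        tag_set = f"tag:{topic}"
        keys = r.smembers(tag_set)            # cache keys tagged with this topic at write time
        if keys:
            r.delete(*keys)                   # invalidate both exact and semantic entries
        r.delete(tag_set)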
Monitoring Your Cost-Killing Machine
You can't optimize what you don't measure. Use Redis's built-in INFO command and custom metrics.
- Cache Hit Rate: The golden metric. Track it in your app: `r.incr('stats:cache:hits')` on every cache hit, `r.incr('stats:cache:misses')` on every miss, then `hit_rate = hits / (hits + misses)`.
- Latency: Use `redis-cli --latency` to monitor base Redis performance. Ensure your P99 for `FT.SEARCH` stays under 20ms.
- Memory: Monitor `used_memory` via `INFO memory`. With `allkeys-lru`, memory usage should plateau at your `maxmemory` setting.
- Cost Savings: Log every LLM call. At the end of the month, `Misses * Avg_Cost_Per_Call` is your actual cost; compare it to `Total_Queries * Avg_Cost_Per_Call` to see your savings (see the sketch after this list). A well-tuned system should show at least a 50% reduction in LLM API costs.
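A small sketch of that month-end report, reading the counters back; the $0.002 average is an assumption you'd replace with your own logged figure:

def report_savings(r, avg_cost_per_call=0.002):
    hits = int(r.get("stats:cache:hits") or 0)
    misses = int(r.get("stats:cache:misses") or 0)
    total = hits + misses
    actual_cost = misses * avg_cost_per_call     # only misses reached the LLM
    uncached_cost = total * avg_cost_per_call    # what every query would have cost
    if total:
        print(f"hit rate: {hits / total:.1%}")
        print(f"saved ${uncached_cost - actual_cost:.2f} of ${uncached_cost:.2f}")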
Next Steps: From Prototype to Production
You now have a blueprint. To move this from a script to a system:
- Choose Your Vector Database Layer: If you're all-in on Redis, use Redis Stack, or Redis 8, which brings the query engine (including vector search) into core Redis. Alternatives like Upstash (serverless Redis) or Momento (serverless cache) abstract scaling but may have vector search limitations.
- Benchmark with Pipelining: A single `GET` is fast, but when processing a batch of queries, use Redis pipelining to avoid paying a network round trip per command. The benchmark doesn't lie: roughly 110k ops/sec issuing one command at a time versus over 1 million ops/sec with pipelined batches. Use `ioredis` or `redis-py` with pipeline support (see the sketch after this list).
- Plan for Scale: A single Redis node can handle massive load (1M+ ops/sec). For redundancy, use Redis Sentinel. For horizontal scale and vector indexes too large for a single node's memory, you'll need Redis Cluster. Be aware of the `CLUSTERDOWN Hash slot not served` error; fix it with `redis-cli --cluster fix <host>:<port>` after a node failure.
- Implement Circuit Breakers: If your LLM API is down, what does your cache do? Consider serving stale cached responses (with a flag) for a short period if the semantic search finds a match, even if it's slightly beyond your normal similarity threshold.
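As a pipelining sketch for the exact-match tier, assuming you already have a list of hashed cache keys, redis-py batches the GETs into one round trip:

def batch_exact_lookup(r, keys):
    pipe = r.pipeline(transaction=False)  # no MULTI/EXEC needed for plain reads
    for key in keys:
        pipe.get(key)
    return dict(zip(keys, pipe.execute()))  # key -> cached response, or None on a miss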
The path to a manageable LLM API bill isn't begging for credits; it's engineering smarter gateways. Your Redis semantic cache is that gateway—a layer that respects your budget as much as it respects the semantics of a user's question. Stop paying to answer the same question twice.