Caching LLM Responses with Redis: Semantic Deduplication and Cost Reduction

Build a semantic LLM response cache using Redis and embedding similarity — catching near-duplicate queries to cut API costs, with cache invalidation strategy and hit rate monitoring.

30% of your LLM API calls are near-duplicate questions that get near-identical answers. You're paying for the same tokens three times a day. Semantic caching with Redis can eliminate most of that waste. Your CFO will notice the missing line item before your users notice the 200ms latency improvement. The trick isn't just caching; it's caching intelligently, which means moving beyond naive key-value storage to understanding what users actually mean.

Why Hashing the Prompt Is Like Using a Hammer on a Screw

You've already got Redis sitting there, humming along, caching session data. Your first instinct is straightforward: hash the user's prompt, use it as a Redis key, and store the LLM's response. SET user:query:md5(prompt) response. Done. You run it for a day and see a 2% cache hit rate. Pathetic.

The problem is language. "What's the capital of France?" and "Can you tell me France's capital city?" are semantically identical but lexically different. An MD5 hash is brutally literal—it sees two completely different strings. Your users are rephrasing, adding pleasantries, or making typos, and your cache is shrugging its shoulders while your OpenAI bill climbs.
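You can see the failure in miniature with nothing but the standard library (key scheme from the `SET` example above):

```python
import hashlib

def exact_cache_key(prompt: str) -> str:
    """Naive approach: key on an MD5 digest of the raw prompt text."""
    return "user:query:" + hashlib.md5(prompt.encode("utf-8")).hexdigest()

# Semantically identical questions, lexically different strings:
k1 = exact_cache_key("What's the capital of France?")
k2 = exact_cache_key("Can you tell me France's capital city?")
# k1 and k2 are completely different keys, so the second query is a cache miss
```

One character of difference anywhere in the prompt produces an unrelated digest, which is exactly why the hit rate stays in the low single digits.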

This is where semantic caching enters. Instead of matching strings, you match meaning. You convert the prompt into a numerical vector (an embedding) that captures its semantic essence. Similar meanings cluster together in vector space. Your cache's job is to find if a new query is "close enough" to a previously answered one.

Finding the Sweet Spot: The Distance Threshold Dance

"Close enough" is the operational nightmare. You generate an embedding for a new query, then compare it to all cached embeddings. The comparison uses a distance metric—cosine similarity is the standard here, where 1 means identical and 0 means orthogonal (unrelated).

Set your similarity threshold too high (e.g., 0.95), and you'll only catch verbatim repeats. Too low (e.g., 0.7), and you'll start returning the answer for "How do I bake a cake?" to the query "How do I fix my car?" because, hey, both involve following steps. Disaster.

Through grisly trial and error on production traffic, a threshold of 0.88-0.92 on cosine similarity tends to work for general Q&A. For a code-generation assistant, you might crank it to 0.93-0.95 because a subtly different requirement should yield a different function. You'll need to A/B test this with your own data. The goal is to maximize cache hits without triggering user complaints about "generic" or "off-topic" answers.
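For intuition, cosine similarity is just the dot product of two vectors normalized by their lengths. A stdlib-only sketch (real embeddings have hundreds of dimensions, 384 for all-MiniLM-L6-v2, but the math is identical):

```python
import math

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0], [1.0, 2.0]))  # 1.0: identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0: orthogonal, unrelated
```

The threshold debate above is entirely about where on this 0-to-1 scale you draw the "serve from cache" line.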

Here's the core of the check, using sentence-transformers for embeddings and redis-py for the vector search (assuming Redis Stack with RediSearch module):

from sentence_transformers import SentenceTransformer
from redis.commands.search.query import Query
import redis
import numpy as np


model = SentenceTransformer('all-MiniLM-L6-v2')  # Lightweight, good enough
redis_client = redis.Redis(host='localhost', port=6379, decode_responses=True)

def get_cached_response(user_query: str, similarity_threshold: float = 0.9):
    # 1. Generate embedding for the new query
    query_embedding = model.encode(user_query).astype(np.float32).tobytes()

    # 2. Query Redis Vector Search (FT.SEARCH)
    # Assumes an index 'llm_cache_idx' on the 'embedding' field, COSINE metric.
    # Parameterized KNN queries require query dialect 2.
    knn_query = (
        Query("*=>[KNN 1 @embedding $query_vec AS vector_score]")
        .return_fields("response", "vector_score")
        .dialect(2)
    )
    params = {"query_vec": query_embedding}

    try:
        results = redis_client.ft('llm_cache_idx').search(knn_query, query_params=params)
    except redis.exceptions.ResponseError:
        # Index not found (or malformed query): fall back to no cache
        return None

    if results.docs:
        top_match = results.docs[0]
        # RediSearch returns cosine *distance* (0 = identical); convert to similarity
        similarity = 1 - float(top_match.vector_score)

        # 3. Apply threshold
        if similarity >= similarity_threshold:
            print(f"Cache HIT with similarity score: {similarity:.3f}")
            return top_match.response
        print(f"Cache MISS. Closest score: {similarity:.3f} (needed {similarity_threshold})")
    return None

The Redis Blueprint: Hash for Data, Sorted Set for Eviction

Storing just the embedding and response isn't enough for a production system. You need metadata for invalidation, and you need efficient retrieval. A robust pattern uses two data structures in concert:

  1. A Redis Hash (HASH): Acts as your primary document store.

    • Key: cache:{unique_id} (the prefix the vector index watches)
    • Fields: embedding (vector), response (text), prompt (original text, for debugging), model (e.g., gpt-4), created_at (timestamp), usage_tokens (for cost analytics).
  2. A Redis Sorted Set (ZSET): Key: llm:cache:access_order. Members are the {unique_id}. Scores are timestamps (from created_at or last access time). This gives you O(log N) access to the oldest or least-recently-used entries for eviction when you hit memory limits.

When a new query comes in, you perform the vector search (which operates on the HASH fields indexed by RediSearch). On a cache hit, you update the score in the ZSET to the current time (implementing LRU). On a cache miss, you generate a new LLM response, create a new HASH, add its ID to the ZSET, and publish the new request/response pair to a Kafka topic for analytics.
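The vector search above assumes the index already exists. A sketch of creating it with redis-py (the index name, `cache:` prefix, and 384-dimension embedding size match the examples in this article; the HNSW-vs-FLAT choice is yours):

```python
import redis
from redis.commands.search.field import TextField, NumericField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

def create_cache_index(client: redis.Redis, dim: int = 384) -> None:
    """Create llm_cache_idx over Hashes whose keys start with 'cache:'."""
    schema = (
        TextField("prompt"),
        TextField("response"),
        TextField("model"),
        NumericField("created_at"),
        VectorField(
            "embedding",
            "HNSW",  # approximate search; "FLAT" is exact and fine for small caches
            {"TYPE": "FLOAT32", "DIM": dim, "DISTANCE_METRIC": "COSINE"},
        ),
    )
    client.ft("llm_cache_idx").create_index(
        schema,
        definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH),
    )
```

Run it once at deploy time; `FT.CREATE` fails if the index already exists, so wrap the call accordingly.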

import json
import time
from uuid import uuid4

def store_in_cache(prompt: str, response: str, model_used: str, token_usage: int):
    """Store a new prompt/response pair in the semantic cache."""
    cache_id = f"cache:{uuid4().hex}"
    embedding = model.encode(prompt).astype(np.float32).tobytes()

    # 1. Store main data in a Hash
    cache_data = {
        "prompt": prompt,
        "response": response,
        "model": model_used,
        "usage_tokens": token_usage,
        "created_at": time.time(),
        "embedding": embedding  # RediSearch will index this
    }
    redis_client.hset(cache_id, mapping=cache_data)

    # 2. Add to Sorted Set for LRU eviction tracking
    redis_client.zadd("llm:cache:access_order", {cache_id: cache_data['created_at']})
    print(f"Stored new entry: {cache_id}")

    # 3. (Optional) Publish to Kafka for async processing/analytics
    # producer.send('llm-interactions', key=cache_id, value=json.dumps(cache_data))
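The hit-path half of that LRU bookkeeping is a one-liner. A sketch (the function and parameter names are mine; it accepts any redis-py-compatible client):

```python
import time

def touch_cache_entry(client, cache_id: str) -> None:
    """On a cache hit, refresh the entry's ZSET score so LRU eviction sees it as fresh."""
    client.zadd("llm:cache:access_order", {cache_id: time.time()})
```

Call it right after `get_cached_response` returns a hit; without it, the sorted set degrades from LRU to FIFO ordering.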

Build vs. Buy: The GPTCache Dilemma

You will discover GPTCache, a purpose-built library for semantic caching. It's tempting. It has pre-built embedding adapters, eviction managers, and similarity evaluators. The tradeoff is stark:

| Aspect | Custom Redis Implementation | GPTCache Library |
| --- | --- | --- |
| Control | Absolute. You own the data schema, eviction logic, and scaling. | Limited. You work within its abstractions. |
| Operational Complexity | High. You are responsible for vector index management, connection pooling, and monitoring. | Low. It's a black-box service layer. |
| Integration | Flexible. Direct access to Redis allows complex patterns (e.g., linking cache entries to user sessions). | Constrained. You use its API. |
| Performance | Tunable. Can be optimized for your specific access patterns. | Generalized. May have overhead for unused features. |
| Best For | Teams with Redis expertise, very high scale, or unique eviction/retrieval needs. | Getting a robust solution running in an afternoon. |

Choose GPTCache if you need a solution now and your needs are standard. Build custom if you're already swimming in Redis, need extreme performance, or want to integrate caching deeply into an event-driven architecture (e.g., publishing every cache miss to a Kafka stream for model training).

When to Kill a Cache Entry: TTL vs. The Event Stream

Cache invalidation remains one of the two hard problems in computer science. For LLM responses, you have two primary levers:

  • Time-to-Live (TTL): The blunt instrument. Redis can expire keys natively (EXPIRE llm:cache:{unique_id} 3600; note that SETEX only applies to plain string values, not Hashes). Simple, effective for non-volatile data. How long is an answer about the capital of France valid? Probably forever. How long is an answer about "today's top news headlines" valid? About an hour. In practice, set a default max age (e.g., 24 hours) and run a background Celery task that deletes expired entries and removes them from the ZSET in one sweep, since native expiry would leave stale IDs behind in the Sorted Set.

  • Event-Based Expiry: The surgical strike. This is where you hook into your application's event stream. When a user provides feedback ("thumbs down" on a response), you can immediately invalidate that cache entry. If your knowledge base updates (e.g., a new product API version is released), you can publish a knowledge_base_updated event to Kafka. A consumer listens for this event, queries the cache for all entries related to the old API, and flushes them. This keeps your cache semantically fresh.
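Both the feedback handler and the Kafka consumer end in the same delete step, which is worth factoring into one pipelined helper (a sketch; the Kafka wiring and the lookup of affected entry IDs are elided, and the helper works with any redis-py-compatible client):

```python
def invalidate_entries(client, entry_ids) -> int:
    """Delete cache Hashes and their ZSET members in one pipelined round trip."""
    entry_ids = list(entry_ids)
    if not entry_ids:
        return 0
    pipe = client.pipeline()
    for cache_id in entry_ids:
        pipe.delete(cache_id)                          # drop the primary Hash
        pipe.zrem("llm:cache:access_order", cache_id)  # drop the LRU tracker entry
    pipe.execute()
    return len(entry_ids)
```

A "thumbs down" calls it with a single ID; a knowledge_base_updated consumer calls it with every entry matching the stale topic.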

# Example Celery task for TTL-based cleanup
from celery import Celery
import redis
import time

app = Celery('llm_cache_tasks', broker='redis://localhost:6379/0')
redis_client = redis.Redis(host='localhost', port=6379)

@app.task
def prune_old_cache_entries(max_age_seconds=86400):
    """Remove cache entries older than max_age_seconds."""
    cutoff = time.time() - max_age_seconds
    # Get IDs of old entries from the Sorted Set
    old_entry_ids = redis_client.zrangebyscore("llm:cache:access_order", 0, cutoff)

    if old_entry_ids:
        # Delete the primary Hash and remove from ZSET
        pipeline = redis_client.pipeline()
        for cache_id in old_entry_ids:
            pipeline.delete(cache_id)
            pipeline.zrem("llm:cache:access_order", cache_id)
        pipeline.execute()
        print(f"Pruned {len(old_entry_ids)} old cache entries.")

The Proof Is in the Production Logs: A Benchmark

Theory is useless without numbers. We implemented semantic caching on a mid-sized AI coding assistant (~5 million daily requests). Here’s what changed over a 30-day period:

| Metric | Before Caching | After Semantic Caching | Change |
| --- | --- | --- | --- |
| Avg. LLM API Calls/Day | 5.2M | 3.1M | -40% |
| 95th %ile Response Latency | 1450ms | 820ms | -43% |
| Cache Hit Rate | 0% | 38% | N/A |
| Monthly OpenAI Cost | $X | $0.6X | -40% |

The hit rate of 38% is the killer. It translates almost directly into the roughly 40% reduction in calls and costs. The latency drop comes from skipping the network round trip to the LLM provider on every hit. The remaining calls are the unique, complex, or novel questions where you actually want to spend your compute budget.

Watching the Watcher: Metrics You Can't Ignore

Deploying a cache without monitoring is like flying blind into a thunderstorm. You need these three dashboards:

  1. Hit Rate & Cost Savings: A simple timeseries graph of (cache hits / total requests) * 100. Correlate this with your LLM provider's billing dashboard. The trend should mirror each other. A dropping hit rate means your user queries are changing or your embedding model/threshold needs adjustment.

  2. Latency Distribution: Histograms for cache hit path latency (should be <50ms) and cache miss path latency (will be your full LLM call time). Use this to prove the cache's value. Alert if the hit path latency degrades—you might have connection pool exhaustion, a primary cause of timeouts.

    • Error you'll see: redis.exceptions.ConnectionError: max number of clients reached
    • The fix: Increase maxclients in redis.conf and, crucially, use client-side connection pooling. In redis-py, use redis.ConnectionPool.
  3. Memory & Eviction Dashboard: Monitor used_memory on your Redis instance. Graph the size of your ZSET. Set an alert for when you hit 75% of max memory. Implement a proactive eviction policy in your application logic (e.g., a Celery task that, when the ZSET size exceeds 500k, uses ZREMRANGEBYRANK to remove the oldest 10%).

    • Error in related systems: celery.exceptions.SoftTimeLimitExceeded in your LLM task queue.
    • The fix: For long-running LLM calls, configure sensible timeouts: task_soft_time_limit=280, task_time_limit=300. This gives the task a chance to clean up before being hard-killed.
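The proactive eviction described in item 3 might look like the following sketch (the thresholds and names are illustrative, and it accepts any redis-py-compatible client):

```python
def evict_oldest_fraction(client, zset_key="llm:cache:access_order",
                          max_entries=500_000, fraction=0.10) -> int:
    """If the tracking ZSET exceeds max_entries, drop the oldest fraction.

    Assumes ZSET scores are last-access timestamps, so rank 0 is the
    least recently used entry.
    """
    size = client.zcard(zset_key)
    if size <= max_entries:
        return 0
    n_to_evict = int(size * fraction)
    # Fetch the victim IDs first so their primary Hashes can be deleted too
    victims = client.zrange(zset_key, 0, n_to_evict - 1)
    pipe = client.pipeline()
    for cache_id in victims:
        pipe.delete(cache_id)
    pipe.zremrangebyrank(zset_key, 0, n_to_evict - 1)
    pipe.execute()
    return n_to_evict
```

Schedule it alongside the TTL pruner; the two together keep both stale and merely cold entries from crowding out memory.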

Next Steps: From Cache to Intelligent Routing Layer

Your semantic cache is now a critical piece of architecture. It's no longer just a key-value store; it's a semantic routing layer. The next evolution is to use it for more than just yes/no cache decisions.

  1. Tiered Model Routing: On a cache miss, instead of always calling GPT-4, check the similarity score. If the user's query is somewhat similar (score 0.75-0.88) to a cached entry, maybe a cheaper, faster model (like gpt-3.5-turbo or a fine-tuned local model) can synthesize a sufficient answer from the cached context. The cache becomes a knowledge base for routing logic.

  2. Prompt Versioning & A/B Testing: Store the prompt_template_id used in your cache Hash. When you roll out a new, improved prompt template, you can gradually invalidate old cached entries or run shadow traffic to compare performance of new vs. cached old responses.

  3. Event-Driven Knowledge Refresh: Tighten the Kafka integration. Every cache miss is a valuable data point about what your users are asking that you don't know. Stream these misses to a data lake. Analyze them weekly to identify gaps in your system's knowledge or RAG corpus that need filling.
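The tiered routing in item 1 reduces to a small decision function over the similarity score. A sketch (the thresholds mirror the ranges discussed above; the model names are examples, not a recommendation):

```python
def route_model(similarity: float,
                hit_threshold: float = 0.88,
                assist_threshold: float = 0.75) -> str:
    """Pick a serving strategy from the best cached-entry similarity score."""
    if similarity >= hit_threshold:
        return "cache"           # serve the cached response directly
    if similarity >= assist_threshold:
        return "gpt-3.5-turbo"   # cheap model, primed with the cached near-match
    return "gpt-4"               # genuinely novel query: spend the full budget
```

The middle tier is where the cache stops being a yes/no gate and starts acting as context for a cheaper model.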

The goal is to make the cache the brain of your operation—the component that remembers everything, informs decisions, and ensures you only pay for the novelty you actually need. Stop burning tokens on repeats. Start making your cache work for its keep.