Build Faster Apps with OpenAI Prompt Caching: How It Works (2026)

OpenAI prompt caching cuts latency by up to 80% and cost by 50% on repeated context. Learn how cache hits work, when to use it, and how to structure prompts. Python + Node.js tested.

Problem: Every API Call Re-Processes the Same Context

OpenAI prompt caching lets the API reuse a previously computed KV cache for any prompt prefix of at least 1,024 tokens — instead of re-processing the full input on every request.

Without it, a 10,000-token system prompt gets fully tokenized and processed on every single call. At scale, that's wasted compute, ballooning latency, and unnecessary cost.

You'll learn:

  • Exactly how OpenAI's automatic prompt caching works under the hood
  • How to structure prompts to maximize cache hit rate
  • How to verify cache hits in API responses and track savings
  • When caching helps — and when it doesn't

Time: 15 min | Difficulty: Intermediate


How OpenAI Prompt Caching Works

OpenAI enables prompt caching automatically on all supported models — no configuration required. When you send a request, the API checks whether the leading prefix of your prompt matches a cached KV (key-value) state from a recent prior request.

[Figure: OpenAI prompt caching request flow — prefix match, cache hit, and token generation pipeline. On a cache hit, the API skips re-processing the cached prefix tokens and jumps straight to generating new completion tokens.]

Cache hit conditions — all must be true:

  • The prompt prefix is identical to a prior request (byte-for-byte)
  • The cached prefix is at least 1,024 tokens long
  • The cache entry is still warm (within ~5–10 minutes of last use, though OpenAI does not publish a fixed TTL)
  • You are using a supported model (see below)
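
The byte-for-byte prefix rule is worth internalizing: only an identical *leading* run of the prompt counts. A toy sketch (the real matching happens server-side on tokenized prompts, not Python strings — this is illustration only):

```python
# Toy illustration of the prefix rule — not how OpenAI implements it internally.
def shared_prefix_len(a: str, b: str) -> int:
    """Length of the identical leading run of two serialized prompts."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

SYSTEM = "You are a support bot. [long, stable instructions...] "
req_a = SYSTEM + "User question A"
req_b = SYSTEM + "User question B"

# Everything up to the first differing character is reusable.
# Identical content *after* a difference earns nothing — no mid-prompt matching.
print(shared_prefix_len(req_a, req_b))
```

This is why the ordering rules in the next section matter: one early differing byte forfeits everything that follows it.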

When a cache hit occurs, the cached tokens are billed at a 50% discount and processed with significantly reduced latency — OpenAI reports up to 80% faster time-to-first-token on long cached prefixes.

Supported Models (as of 2026)

Model            Prompt Caching   Min Prefix
gpt-4o           Yes              1,024 tokens
gpt-4o-mini      Yes              1,024 tokens
o1               Yes              1,024 tokens
o3-mini          Yes              1,024 tokens
gpt-4-turbo      No               —
gpt-3.5-turbo    No               —

Cache granularity is in 128-token increments above the 1,024-token floor. A 1,100-token prefix caches 1,024 tokens; a 1,200-token prefix caches 1,152 (the first increment above the floor); a 1,500-token prefix caches 1,408.
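
The rounding rule can be sketched as a small helper (my own function, not part of any SDK):

```python
def cached_token_count(prefix_tokens: int) -> int:
    """Tokens served from cache for a matched prefix, per the 1,024 + 128n rule."""
    if prefix_tokens < 1024:
        return 0  # below the floor: nothing is cached
    # Round down to the nearest 128-token increment above 1,024.
    return 1024 + ((prefix_tokens - 1024) // 128) * 128

for n in (1000, 1100, 1200, 1500):
    print(n, "->", cached_token_count(n))  # 0, 1024, 1152, 1408
```

The practical takeaway: tokens between increments buy you nothing, so a prefix just over a boundary caches the same amount as one just under the next.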


Step-by-Step: Structuring Prompts for Maximum Cache Hits

The core rule: stable content goes first, dynamic content goes last.

The API can only match prefixes — it cannot skip around in the middle of a prompt. If you place the user's unique query before your 5,000-token system prompt, you get zero cache benefit.

Step 1: Order Your Prompt Layers

Structure every request in this order:

[1] System prompt (static, rarely changes)
[2] Reference documents / RAG context (semi-static per session)
[3] Conversation history (grows, but oldest turns are stable)
[4] Current user message (always dynamic)
# WHY this order: OpenAI matches the longest stable prefix
# Putting system prompt first maximizes the cached token count

import openai

client = openai.OpenAI()

SYSTEM_PROMPT = """
You are a senior Python engineer at a fintech company.
[... 2000 tokens of style guide, coding standards, tool preferences ...]
"""

def chat(user_message: str, history: list[dict]) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},  # ← cached after first call
        *history,                                        # ← older turns also cached
        {"role": "user", "content": user_message},      # ← always fresh
    ]

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )

    # Inspect cache hit
    usage = response.usage
    cached = usage.prompt_tokens_details.cached_tokens
    total_prompt = usage.prompt_tokens
    print(f"Cached: {cached}/{total_prompt} prompt tokens ({cached/total_prompt*100:.0f}%)")

    return response.choices[0].message.content

Expected output on second call:

Cached: 2048/2156 prompt tokens (95%)

Step 2: Check Cache Hit Data in the Response

OpenAI returns cache statistics inside usage.prompt_tokens_details:

usage = response.usage

print(usage.prompt_tokens)                        # total input tokens billed
print(usage.prompt_tokens_details.cached_tokens)  # tokens served from cache (50% discount)
print(usage.completion_tokens)                    # output tokens (always full price)

A cached_tokens value of 0 on the first request is expected — there is no cache entry yet. On subsequent requests with the same prefix, you will see the count rise.

# WHY check this: cached_tokens = 0 on first call is not a bug
# The cache is populated by the first call and hit by the second

assert usage.prompt_tokens_details.cached_tokens == 0  # first call: cold
# second call with same prefix → cached_tokens > 0

Step 3: Node.js Implementation

// WHY typed response: prompt_tokens_details is nested and easy to miss
import OpenAI from "openai";

const client = new OpenAI();

const SYSTEM = `You are a technical documentation assistant...
[... 1500+ tokens of content ...]`;

async function complete(userMessage: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: SYSTEM },
      { role: "user", content: userMessage },
    ],
  });

  const cached = response.usage?.prompt_tokens_details?.cached_tokens ?? 0;
  const total = response.usage?.prompt_tokens ?? 0;

  console.log(`Cache hit: ${cached}/${total} tokens`);
  return response.choices[0].message.content ?? "";
}

When Prompt Caching Helps (and When It Doesn't)

High-impact use cases

RAG and document Q&A — you load the same 8,000-token document for every user question in a session. The document gets cached after the first query; every follow-up is fast and cheap.

Multi-turn chat with a long system prompt — a coding assistant with a detailed system prompt hits the cache on every turn after the first.

Batch processing with shared context — you process 500 user emails all evaluated against the same 2,000-token policy document. The policy is cached after the first item.

Few-shot prompts — a 20-example few-shot block stays constant; only the test input changes.

Low-impact or zero-impact use cases

Scenario                                        Why caching doesn't help
Short system prompts (< 1,024 tokens)           Below the minimum threshold
Every request has a unique prefix               No stable prefix to match
Low-volume or infrequent calls                  Cache expires before the next request
Streaming with long prefixes on cold start      First request is always uncached

Cost example (USD)

GPT-4o pricing as of early 2026: $2.50 / 1M input tokens, $1.25 / 1M cached input tokens.

A RAG pipeline sending 10,000 input tokens per query at 1,000 queries/day:

                      Without caching   With caching (90% hit)
Daily token cost      $25.00            ~$13.75
Monthly (30 days)     $750/mo           ~$412/mo

Numbers are illustrative. Actual savings depend on cache hit rate and prefix length. Verify current pricing at platform.openai.com/pricing.
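
A back-of-envelope model of that calculation, using the illustrative rates from this section (not live pricing) and treating the hit rate as the cached share of all input tokens:

```python
INPUT_RATE = 2.50 / 1_000_000    # USD per regular input token (gpt-4o, early 2026)
CACHED_RATE = 1.25 / 1_000_000   # USD per cached input token (50% discount)

def daily_cost(tokens_per_query: int, queries_per_day: int, hit_rate: float) -> float:
    """Daily input-token spend; hit_rate = fraction of tokens served from cache."""
    total = tokens_per_query * queries_per_day
    cached = total * hit_rate
    return cached * CACHED_RATE + (total - cached) * INPUT_RATE

print(f"No caching: ${daily_cost(10_000, 1_000, 0.0):.2f}/day")  # $25.00/day
print(f"90% hits:   ${daily_cost(10_000, 1_000, 0.9):.2f}/day")  # $13.75/day
```

Because only input tokens are discounted (and by half), a 90% hit rate translates to a 45% input-cost reduction, not 90% — worth keeping in mind when estimating savings.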


Verification

Run two identical requests back-to-back and compare cached_tokens:

import openai
import time

client = openai.OpenAI()

LONG_SYSTEM = "You are a helpful assistant. " + ("Context. " * 600)  # comfortably above 1,024 tokens

def probe(label: str):
    r = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": LONG_SYSTEM},
            {"role": "user", "content": "Hello"},
        ],
        max_tokens=5,
    )
    cached = r.usage.prompt_tokens_details.cached_tokens
    print(f"{label}: cached_tokens={cached}")

probe("Call 1 (cold)")
time.sleep(1)
probe("Call 2 (warm)")

You should see output like this (the exact warm value depends on the tokenized prefix length, rounded down to a 128-token increment):

Call 1 (cold): cached_tokens=0
Call 2 (warm): cached_tokens=1152

If both show 0, the prefix is under 1,024 tokens — pad your system prompt or use a real document.


What You Learned

  • OpenAI prompt caching is automatic — no API flag, no configuration. Just structure your prompts correctly.
  • The cache matches leading prefixes only — stable content must always come before dynamic content.
  • usage.prompt_tokens_details.cached_tokens is the only reliable way to confirm a cache hit.
  • Caching is most valuable for RAG, long system prompts, and batch jobs — not for short or highly variable prompts.
  • Cache entries are ephemeral — they expire after a few minutes of inactivity. For scheduled batch jobs, keep requests close together.

Tested on gpt-4o (2025-01-01), openai-python 1.x, Node.js 22, macOS & Ubuntu 24.04


FAQ

Q: Do I need to enable prompt caching in the API request? A: No. Caching is automatic on all supported models. There is no parameter to set — OpenAI handles it server-side.

Q: Is cached content stored permanently on OpenAI's servers? A: No. Cache entries are ephemeral and tied to your organization. OpenAI states they are not used for training and expire after a short TTL (approximately 5–10 minutes of inactivity, though this is not guaranteed).

Q: What is the minimum system prompt length to get a cache hit? A: 1,024 tokens is the hard floor. Use tiktoken to count: len(tiktoken.encoding_for_model("gpt-4o").encode(your_prompt)).

Q: Does prompt caching work with streaming responses? A: Yes. Streaming affects only how completion tokens are returned — cache hit behavior on input tokens is identical.

Q: How does OpenAI prompt caching compare to Anthropic's? A: OpenAI discounts cached input tokens by 50% and works fully automatically. Anthropic requires explicit cache_control markers but discounts cache reads more steeply (roughly 90%, with a premium charged on cache writes). Anthropic's cache TTL is longer (5 minutes base, extendable to 1 hour). OpenAI's is shorter but requires zero implementation effort.