Prompt Caching Explained: Saving 80% on Claude 4.5 API Costs

Learn how Claude's prompt caching slashes API costs by up to 80% and cuts latency in half — with working code you can ship today.

Problem: Your Claude API Bill Keeps Growing

You're sending the same system prompt (instructions, context, documents) on every API call, and Claude re-processes it from scratch each time. In many workloads that repeated prefix is 80% or more of your input tokens, all spent on identical work.

You'll learn:

  • How prompt caching works under the hood
  • How to add cache breakpoints to your API calls
  • What to cache (and what not to)

Time: 15 min | Level: Intermediate


Why This Happens

Claude's API is stateless — each request is processed fresh with no memory of previous calls. If your system prompt is 2,000 tokens and you make 1,000 calls a day, you're paying to process 2 million tokens of identical content daily.

Prompt caching solves this by storing processed token states on Anthropic's servers for a defined TTL (time-to-live). Subsequent requests that hit the cache skip reprocessing entirely.

Common symptoms this is costing you:

  • Daily token usage dominated by system prompt tokens, not user content
  • Latency spiking on requests with large context windows
  • API costs scaling with call volume even when user messages are short

Cache pricing vs. standard (Claude Sonnet 4.5):

  • Standard input: $3.00 / million tokens
  • Cache write: $3.75 / million tokens (25% premium)
  • Cache read: $0.30 / million tokens (90% discount)

The math is simple: the write premium is $0.75 per million tokens, while each cache read saves $2.70 per million versus standard input, so one cache write pays for itself after a single re-read.
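
To make the break-even concrete, here's a back-of-the-envelope cost model using the list prices above (a sketch; it assumes every call lands inside the cache TTL, so the prefix is written exactly once per day):

```python
# Claude Sonnet 4.5 list prices, USD per million input tokens
STANDARD = 3.00
CACHE_WRITE = 3.75
CACHE_READ = 0.30

def daily_cost(prompt_tokens: int, calls_per_day: int, cached: bool) -> float:
    """Input-token cost of re-sending the same prompt prefix all day."""
    if not cached:
        return prompt_tokens * calls_per_day * STANDARD / 1_000_000
    # One cache write, then reads for every remaining call
    write = prompt_tokens * CACHE_WRITE / 1_000_000
    reads = prompt_tokens * (calls_per_day - 1) * CACHE_READ / 1_000_000
    return write + reads

# The scenario from above: a 2,000-token prompt, 1,000 calls a day
uncached = daily_cost(2_000, 1_000, cached=False)
cached = daily_cost(2_000, 1_000, cached=True)
print(f"uncached: ${uncached:.2f}/day, cached: ${cached:.2f}/day")
# uncached: $6.00/day, cached: $0.61/day
```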


Solution

Step 1: Mark Your Cache Breakpoints

Add "cache_control": {"type": "ephemeral"} to the last block you want cached. Everything up to and including that block is stored as a single prefix; the API allows up to 4 breakpoints per request.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert technical writer. Follow these rules:\n\n[Your 2000-token style guide here]",
            # This cache_control marks the end of what gets cached
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Write a summary of this meeting transcript: [transcript]"}
    ]
)

Expected: First call has cache_creation_input_tokens > 0. Subsequent calls show cache_read_input_tokens > 0.

If it fails:

  • No cache tokens in response: Check that cache_control sits on the last content block you want cached, not mid-string. Also check length: prefixes below the minimum cacheable size (1,024 tokens on Sonnet 4.5) are silently not cached.
  • Cache never hits: The TTL is 5 minutes by default, refreshed on each hit. If calls are spaced further apart than that, the cache expires and the next call pays the write cost again.

Step 2: Check Cache Usage in the Response

The API response tells you exactly what happened:

# After your API call:
usage = response.usage

print(f"Cache write tokens:  {usage.cache_creation_input_tokens}")
print(f"Cache read tokens:   {usage.cache_read_input_tokens}")
print(f"Regular input tokens: {usage.input_tokens}")
print(f"Output tokens:       {usage.output_tokens}")

# Calculate savings
cache_read = usage.cache_read_input_tokens or 0
savings = cache_read * (3.00 - 0.30) / 1_000_000  # dollars saved vs standard pricing
print(f"Saved on this call:  ${savings:.4f}")

Expected output on a cache hit:

Cache write tokens:  0
Cache read tokens:   2048
Regular input tokens: 12
Output tokens:       156
Saved on this call:  $0.0055
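
You can turn those usage counts into an effective per-call cost with a small helper (illustrative only; input_cost is not an SDK function, and the rates are the Sonnet 4.5 list prices from above):

```python
def input_cost(regular: int, cache_write: int, cache_read: int) -> float:
    """Effective input cost in USD for one call, given usage token counts."""
    return (regular * 3.00 + cache_write * 3.75 + cache_read * 0.30) / 1_000_000

# The cache-hit example above: 12 regular tokens, 0 written, 2,048 read
hit = input_cost(12, 0, 2048)
# The same call with no caching: all 2,060 tokens at the standard rate
no_cache = input_cost(2060, 0, 0)
print(f"with cache: ${hit:.6f}, without: ${no_cache:.6f}")
# with cache: $0.000650, without: $0.006180
```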

Step 3: Cache Large Documents Too

Caching works on the messages array, not just system. Use this for RAG pipelines where you load a document once per session:

# Load the document once; after the first call its tokens are cached
with open("technical_spec.txt") as f:
    document_text = f.read()

def ask_about_doc(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"<document>\n{document_text}\n</document>",
                        # Cache the document: it is only processed once
                        "cache_control": {"type": "ephemeral"}
                    },
                    {
                        # The question varies per call; only these tokens
                        # are processed fresh on a cache hit
                        "type": "text",
                        "text": question
                    }
                ]
            }
        ]
    )
    return response.content[0].text

Why this works: The document block is hashed and cached after the first call. Follow-up questions in the same session hit the cache and skip reprocessing the document entirely.


Step 4: Multi-Turn Conversation Caching

For chatbots, cache the growing conversation history to avoid reprocessing earlier turns:

messages = []

def chat(user_input: str) -> str:
    messages.append({"role": "user", "content": user_input})
    
    # Move the cache breakpoint to the newest user message. Everything
    # before it (the whole conversation so far) reads from cache; each
    # turn pays a small incremental write for the newly added tokens.
    last = messages[-1]
    cacheable_messages = messages[:-1] + [{
        "role": last["role"],
        "content": [
            {
                "type": "text",
                "text": last["content"],
                "cache_control": {"type": "ephemeral"}
            }
        ]
    }]
    
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system="You are a helpful assistant.",
        messages=cacheable_messages
    )
    
    assistant_reply = response.content[0].text
    messages.append({"role": "assistant", "content": assistant_reply})
    
    return assistant_reply
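
The breakpoint logic can be factored into a standalone helper you can sanity-check without making an API call (a sketch; with_breakpoint is a hypothetical name, not part of the SDK):

```python
def with_breakpoint(messages: list[dict]) -> list[dict]:
    """Return a copy of the message list whose last message carries
    the cache breakpoint, leaving the original list untouched."""
    if not messages:
        return messages
    last = messages[-1]
    # Accept either plain-string or block-list content
    text = last["content"] if isinstance(last["content"], str) \
        else last["content"][0]["text"]
    return messages[:-1] + [{
        "role": last["role"],
        "content": [{"type": "text", "text": text,
                     "cache_control": {"type": "ephemeral"}}],
    }]

turn1 = with_breakpoint([{"role": "user", "content": "Hi"}])
print(turn1[-1]["content"][0]["cache_control"])  # {'type': 'ephemeral'}
```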

Verification

Run this test script to confirm caching is working:

import anthropic
import time

client = anthropic.Anthropic()

LARGE_SYSTEM_PROMPT = "You are a helpful assistant. " + ("Context. " * 1000)  # ~2,000 tokens, above the 1,024-token caching minimum

def make_call(label: str):
    start = time.time()
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=50,
        system=[{"type": "text", "text": LARGE_SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": "Say hello."}]
    )
    elapsed = time.time() - start
    u = response.usage
    print(f"{label}: {elapsed:.2f}s | write={u.cache_creation_input_tokens} read={u.cache_read_input_tokens}")

make_call("Call 1 (cache write)")
time.sleep(1)
make_call("Call 2 (cache read)")
make_call("Call 3 (cache read)")

You should see output like this (exact timings and token counts will vary):

Call 1 (cache write): 1.84s | write=2031 read=0
Call 2 (cache read):  0.95s | write=0 read=2031
Call 3 (cache read):  0.92s | write=0 read=2031

Call 1 is slower because it writes the cache. Calls 2 and 3 are roughly 50% faster and 90% cheaper for the cached tokens.


What You Learned

  • cache_control: {type: "ephemeral"} marks the end of what gets stored — everything before it is cached as a unit.
  • Cache TTL is 5 minutes by default, and every cache hit refreshes it. Calls inside that window read the cache; once it expires, the next call re-writes it (charged at the 25% premium again).
  • Cache writes cost 25% more than standard — profitable after just one re-read.
  • Works on tool definitions, system, and messages; the cached prefix is assembled in that order.

When NOT to use caching:

  • Prompts below the minimum cacheable length (1,024 tokens on Sonnet 4.5; 2,048 on Haiku) are silently not cached, so there's nothing to gain.
  • Highly dynamic content that changes every call — you'll pay write costs with no reads.
  • Latency-sensitive single-shot calls where you won't make follow-up requests.
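
A rough go/no-go check based on these trade-offs (thresholds and pricing are the Sonnet 4.5 figures from this article; should_cache is a hypothetical helper, not official guidance):

```python
MIN_CACHEABLE_TOKENS = 1024  # Sonnet 4.5 won't cache shorter prefixes

def should_cache(prefix_tokens: int, expected_reads: int) -> bool:
    """Decide whether caching a prompt prefix is worth it."""
    if prefix_tokens < MIN_CACHEABLE_TOKENS:
        return False  # below the model's minimum; silently not cached
    # Write premium is $0.75/MTok; each read saves $2.70/MTok vs standard
    return expected_reads * 2.70 > 0.75

print(should_cache(2000, 0))   # False: dynamic content, no expected reads
print(should_cache(2000, 5))   # True
print(should_cache(500, 100))  # False: below the minimum cacheable size
```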

Limitation: Caches are scoped to your Anthropic organization and to the exact model. Cached prefixes from one model don't transfer to another, and any change to the prefix before a breakpoint, even a single character, invalidates the cache from that point on.


Tested on Claude Sonnet 4.5, Anthropic Python SDK 0.40+, February 2026