Prompt Caching Explained: Saving 80% on Claude 4.5 API Costs

Learn how Claude's prompt caching slashes API costs by up to 80% and cuts latency in half — with working code you can ship today.

Problem: Your Claude API Bill Keeps Growing

You're sending the same system prompt (instructions, context, documents) on every API call, and Claude re-processes it from scratch each time. In many workloads that repeated prefix is 80% or more of your input tokens, all spent on identical work.

You'll learn:

  • How prompt caching works under the hood
  • How to add cache breakpoints to your API calls
  • What to cache (and what not to)

Time: 15 min | Level: Intermediate


Why This Happens

Claude's API is stateless — each request is processed fresh with no memory of previous calls. If your system prompt is 2,000 tokens and you make 1,000 calls a day, you're paying to process 2 million tokens of identical content daily.

Prompt caching solves this by storing processed token states on Anthropic's servers for a defined TTL (time-to-live). Subsequent requests that hit the cache skip reprocessing entirely.

Common symptoms this is costing you:

  • Daily token usage dominated by system prompt tokens, not user content
  • Latency spiking on requests with large context windows
  • API costs scaling with call volume even when user messages are short

Cache pricing vs. standard (Claude Sonnet 4.5):

  • Standard input: $3.00 / million tokens
  • Cache write: $3.75 / million tokens (25% premium)
  • Cache read: $0.30 / million tokens (90% discount)

The math is simple: the write premium is $0.75 per million tokens, while each cache read saves $2.70 per million versus standard input, so one cache write pays for itself after a single re-read.
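
To make the break-even concrete, here's a back-of-the-envelope cost model using the list prices above (a sketch; it assumes every call lands inside the cache TTL, so the prefix is written exactly once per day):

```python
# Claude Sonnet 4.5 list prices, USD per million input tokens
STANDARD = 3.00
CACHE_WRITE = 3.75
CACHE_READ = 0.30

def daily_cost(prompt_tokens: int, calls_per_day: int, cached: bool) -> float:
    """Input-token cost of re-sending the same prompt prefix all day."""
    if not cached:
        return prompt_tokens * calls_per_day * STANDARD / 1_000_000
    # One cache write, then reads for every remaining call
    write = prompt_tokens * CACHE_WRITE / 1_000_000
    reads = prompt_tokens * (calls_per_day - 1) * CACHE_READ / 1_000_000
    return write + reads

# The scenario from above: a 2,000-token prompt, 1,000 calls a day
uncached = daily_cost(2_000, 1_000, cached=False)
cached = daily_cost(2_000, 1_000, cached=True)
print(f"uncached: ${uncached:.2f}/day, cached: ${cached:.2f}/day")
# uncached: $6.00/day, cached: $0.61/day
```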


Solution

Step 1: Mark Your Cache Breakpoints

Add "cache_control": {"type": "ephemeral"} to the last block you want cached. Everything up to and including that block is stored as a single prefix; the API allows up to 4 breakpoints per request.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert technical writer. Follow these rules:\n\n[Your 2000-token style guide here]",
            # This cache_control marks the end of what gets cached
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Write a summary of this meeting transcript: [transcript]"}
    ]
)

Expected: First call has cache_creation_input_tokens > 0. Subsequent calls show cache_read_input_tokens > 0.

If it fails:

  • No cache tokens in response: Check that cache_control sits on the last content block you want cached, not mid-string. Also check length: prefixes below the minimum cacheable size (1,024 tokens on Sonnet 4.5) are silently not cached.
  • Cache never hits: The TTL is 5 minutes by default, refreshed on each hit. If calls are spaced further apart than that, the cache expires and the next call pays the write cost again.

Step 2: Check Cache Usage in the Response

The API response tells you exactly what happened:

# After your API call:
usage = response.usage

print(f"Cache write tokens:  {usage.cache_creation_input_tokens}")
print(f"Cache read tokens:   {usage.cache_read_input_tokens}")
print(f"Regular input tokens: {usage.input_tokens}")
print(f"Output tokens:       {usage.output_tokens}")

# Calculate savings
cache_read = usage.cache_read_input_tokens or 0
savings = cache_read * (3.00 - 0.30) / 1_000_000  # dollars saved vs standard pricing
print(f"Saved on this call:  ${savings:.4f}")

Expected output on a cache hit:

Cache write tokens:  0
Cache read tokens:   2048
Regular input tokens: 12
Output tokens:       156
Saved on this call:  $0.0055
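
You can turn those usage counts into an effective per-call cost with a small helper (illustrative only; input_cost is not an SDK function, and the rates are the Sonnet 4.5 list prices from above):

```python
def input_cost(regular: int, cache_write: int, cache_read: int) -> float:
    """Effective input cost in USD for one call, given usage token counts."""
    return (regular * 3.00 + cache_write * 3.75 + cache_read * 0.30) / 1_000_000

# The cache-hit example above: 12 regular tokens, 0 written, 2,048 read
hit = input_cost(12, 0, 2048)
# The same call with no caching: all 2,060 tokens at the standard rate
no_cache = input_cost(2060, 0, 0)
print(f"with cache: ${hit:.6f}, without: ${no_cache:.6f}")
# with cache: $0.000650, without: $0.006180
```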

Step 3: Cache Large Documents Too

Caching works on the messages array, not just system. Use this for RAG pipelines where you load a document once per session:

# Load the document once; after the first call its tokens are cached
with open("technical_spec.txt") as f:
    document_text = f.read()

def ask_about_doc(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"<document>\n{document_text}\n</document>",
                        # Cache the document: it is only processed once
                        "cache_control": {"type": "ephemeral"}
                    },
                    {
                        # The question varies per call; only these tokens
                        # are processed fresh on a cache hit
                        "type": "text",
                        "text": question
                    }
                ]
            }
        ]
    )
    return response.content[0].text

Why this works: The document block is hashed and cached after the first call. Follow-up questions in the same session hit the cache and skip reprocessing the document entirely.


Step 4: Multi-Turn Conversation Caching

For chatbots, cache the growing conversation history to avoid reprocessing earlier turns:

messages = []

def chat(user_input: str) -> str:
    messages.append({"role": "user", "content": user_input})
    
    # Move the cache breakpoint to the newest user message. Everything
    # before it (the whole conversation so far) reads from cache; each
    # turn pays a small incremental write for the newly added tokens.
    last = messages[-1]
    cacheable_messages = messages[:-1] + [{
        "role": last["role"],
        "content": [
            {
                "type": "text",
                "text": last["content"],
                "cache_control": {"type": "ephemeral"}
            }
        ]
    }]
    
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system="You are a helpful assistant.",
        messages=cacheable_messages
    )
    
    assistant_reply = response.content[0].text
    messages.append({"role": "assistant", "content": assistant_reply})
    
    return assistant_reply
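
The breakpoint logic can be factored into a standalone helper you can sanity-check without making an API call (a sketch; with_breakpoint is a hypothetical name, not part of the SDK):

```python
def with_breakpoint(messages: list[dict]) -> list[dict]:
    """Return a copy of the message list whose last message carries
    the cache breakpoint, leaving the original list untouched."""
    if not messages:
        return messages
    last = messages[-1]
    # Accept either plain-string or block-list content
    text = last["content"] if isinstance(last["content"], str) \
        else last["content"][0]["text"]
    return messages[:-1] + [{
        "role": last["role"],
        "content": [{"type": "text", "text": text,
                     "cache_control": {"type": "ephemeral"}}],
    }]

turn1 = with_breakpoint([{"role": "user", "content": "Hi"}])
print(turn1[-1]["content"][0]["cache_control"])  # {'type': 'ephemeral'}
```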

Verification

Run this test script to confirm caching is working:

import anthropic
import time

client = anthropic.Anthropic()

LARGE_SYSTEM_PROMPT = "You are a helpful assistant. " + ("Context. " * 1000)  # ~2,000 tokens, above the 1,024-token caching minimum

def make_call(label: str):
    start = time.time()
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=50,
        system=[{"type": "text", "text": LARGE_SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": "Say hello."}]
    )
    elapsed = time.time() - start
    u = response.usage
    print(f"{label}: {elapsed:.2f}s | write={u.cache_creation_input_tokens} read={u.cache_read_input_tokens}")

make_call("Call 1 (cache write)")
time.sleep(1)
make_call("Call 2 (cache read)")
make_call("Call 3 (cache read)")

You should see output like this (exact timings and token counts will vary):

Call 1 (cache write): 1.84s | write=2031 read=0
Call 2 (cache read):  0.95s | write=0 read=2031
Call 3 (cache read):  0.92s | write=0 read=2031

Call 1 is slower because it writes the cache. Calls 2 and 3 are roughly 50% faster and 90% cheaper for the cached tokens.


What You Learned

  • cache_control: {type: "ephemeral"} marks the end of what gets stored — everything before it is cached as a unit.
  • Cache TTL is 5 minutes by default, and every cache hit refreshes it. Calls inside that window read the cache; once it expires, the next call re-writes it (charged at the 25% premium again).
  • Cache writes cost 25% more than standard — profitable after just one re-read.
  • Works on tool definitions, system, and messages; the cached prefix is assembled in that order.

When NOT to use caching:

  • Prompts below the minimum cacheable length (1,024 tokens on Sonnet 4.5; 2,048 on Haiku) are silently not cached, so there's nothing to gain.
  • Highly dynamic content that changes every call — you'll pay write costs with no reads.
  • Latency-sensitive single-shot calls where you won't make follow-up requests.
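
A rough go/no-go check based on these trade-offs (thresholds and pricing are the Sonnet 4.5 figures from this article; should_cache is a hypothetical helper, not official guidance):

```python
MIN_CACHEABLE_TOKENS = 1024  # Sonnet 4.5 won't cache shorter prefixes

def should_cache(prefix_tokens: int, expected_reads: int) -> bool:
    """Decide whether caching a prompt prefix is worth it."""
    if prefix_tokens < MIN_CACHEABLE_TOKENS:
        return False  # below the model's minimum; silently not cached
    # Write premium is $0.75/MTok; each read saves $2.70/MTok vs standard
    return expected_reads * 2.70 > 0.75

print(should_cache(2000, 0))   # False: dynamic content, no expected reads
print(should_cache(2000, 5))   # True
print(should_cache(500, 100))  # False: below the minimum cacheable size
```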

Limitation: Caches are scoped to your Anthropic organization and to the exact model. Cached prefixes from one model don't transfer to another, and any change to the prefix before a breakpoint, even a single character, invalidates the cache from that point on.


Tested on Claude Sonnet 4.5, Anthropic Python SDK 0.40+, February 2026