Problem: Your Claude API Bill Keeps Growing
You're sending the same system prompt — instructions, context, documents — on every API call. Claude re-processes it every time. For prompt-heavy workloads, that can easily be 80% of your tokens wasted on repeated work.
You'll learn:
- How prompt caching works under the hood
- How to add cache breakpoints to your API calls
- What to cache (and what not to)
Time: 15 min | Level: Intermediate
Why This Happens
Claude's API is stateless — each request is processed fresh with no memory of previous calls. If your system prompt is 2,000 tokens and you make 1,000 calls a day, you're paying to process 2 million tokens of identical content daily.
Prompt caching solves this by storing processed token states on Anthropic's servers for a defined TTL (time-to-live). Subsequent requests that hit the cache skip reprocessing entirely.
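The mechanics are easy to picture with a toy model. This is an illustrative sketch only, not Anthropic's actual implementation: the server keys processed prompt state by a hash of the prompt prefix, and identical prefixes seen within the TTL skip reprocessing.

```python
import hashlib
import time

TTL_SECONDS = 300  # the 5-minute ephemeral TTL

_cache: dict[str, float] = {}  # prefix hash -> expiry timestamp

def process_prompt(prefix: str, suffix: str) -> str:
    """Toy prefix cache: hit if the exact same prefix was seen within the TTL."""
    key = hashlib.sha256(prefix.encode()).hexdigest()
    now = time.time()
    if _cache.get(key, 0) > now:
        status = "cache hit: only the suffix is processed"
    else:
        status = "cache miss: full prompt processed, prefix state stored"
    _cache[key] = now + TTL_SECONDS  # a hit also refreshes the TTL
    return status
```

Note the consequence this sketch makes visible: the prefix must be byte-identical across calls, because any change produces a different hash and a full cache miss.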
Common symptoms this is costing you:
- Daily token usage dominated by system prompt tokens, not user content
- Latency spiking on requests with large context windows
- API costs scaling linearly even when user messages are short
Cache pricing vs. standard (Claude Sonnet 4.5):
- Standard input: $3.00 / million tokens
- Cache write: $3.75 / million tokens (25% premium)
- Cache read: $0.30 / million tokens (90% discount)
The math is simple: a write costs $0.75/Mtok more than a standard pass, but every subsequent read saves $2.70/Mtok — one cache write pays for itself after a single re-read.
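Concretely, the break-even can be sketched with the Sonnet 4.5 rates above (this assumes every call after the first lands within the cache TTL):

```python
# Sonnet 4.5 input rates, in dollars per million tokens
STANDARD = 3.00
CACHE_WRITE = 3.75
CACHE_READ = 0.30

def cost_without_cache(prompt_mtok: float, calls: int) -> float:
    """Every call reprocesses the full prompt at the standard rate."""
    return prompt_mtok * STANDARD * calls

def cost_with_cache(prompt_mtok: float, calls: int) -> float:
    """One cache write, then (calls - 1) cache reads."""
    return prompt_mtok * (CACHE_WRITE + (calls - 1) * CACHE_READ)

# A 2,000-token prompt (0.002 Mtok) sent 1,000 times a day:
print(f"${cost_without_cache(0.002, 1000):.2f}/day without caching")  # $6.00/day
print(f"${cost_with_cache(0.002, 1000):.2f}/day with caching")        # $0.61/day
```

At two calls the cached path already wins ($0.0081 vs $0.0120 for this prompt), and the gap widens linearly from there.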
Solution
Step 1: Mark Your Cache Breakpoints
Add "cache_control": {"type": "ephemeral"} to the last block you want cached. Everything up to that point gets stored.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert technical writer. Follow these rules:\n\n[Your 2000-token style guide here]",
            # This cache_control marks the end of what gets cached
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Write a summary of this meeting transcript: [transcript]"}
    ]
)
Expected: The first call reports `cache_creation_input_tokens` > 0. Subsequent calls show `cache_read_input_tokens` > 0.
If it fails:
- No cache tokens in response: Check that `cache_control` is on the last content block you want cached, not mid-string. Also check that the prefix meets the model's minimum cacheable length (1,024 tokens for Sonnet models); shorter prefixes are silently not cached.
- Cache never hits: The TTL is 5 minutes by default, refreshed each time the cache is read. If calls are spaced further apart than that, the cache expires.
Step 2: Check Cache Usage in the Response
The API response tells you exactly what happened:
# After your API call:
usage = response.usage
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens:   {usage.cache_read_input_tokens}")
print(f"Regular input tokens: {usage.input_tokens}")
print(f"Output tokens: {usage.output_tokens}")

# Calculate savings
cache_read = usage.cache_read_input_tokens or 0
savings = cache_read * (3.00 - 0.30) / 1_000_000  # dollars saved vs standard pricing
print(f"Saved on this call: ${savings:.4f}")
Expected output on a cache hit:
Cache write tokens: 0
Cache read tokens: 2048
Regular input tokens: 12
Output tokens: 156
Saved on this call: $0.0055
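You can extend this into a small helper that prices a call's input side from the usage counts alone, so it can be sanity-checked without touching the API (a sketch using the Sonnet 4.5 rates quoted above):

```python
# Sonnet 4.5 input rates, dollars per million tokens
RATES = {"input": 3.00, "cache_write": 3.75, "cache_read": 0.30}

def input_cost(input_tokens: int, cache_write: int, cache_read: int) -> float:
    """Actual input cost of a call, given the three usage counts."""
    return (input_tokens * RATES["input"]
            + cache_write * RATES["cache_write"]
            + cache_read * RATES["cache_read"]) / 1_000_000

def uncached_cost(input_tokens: int, cache_write: int, cache_read: int) -> float:
    """What the same call would cost if nothing were cached."""
    return (input_tokens + cache_write + cache_read) * RATES["input"] / 1_000_000

# The cache-hit example above: 12 regular tokens, 0 written, 2048 read
hit = input_cost(12, 0, 2048)
full = uncached_cost(12, 0, 2048)
print(f"${hit:.6f} vs ${full:.6f} uncached")  # $0.000650 vs $0.006180
```

The difference, $0.0055, matches the savings figure printed in the expected output above.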
Step 3: Cache Large Documents Too
Caching works on the `messages` array, not just `system`. Use this for RAG pipelines where you load a document once per session:
# Load document once, cache it, then ask multiple questions
document_text = open("technical_spec.txt").read()

def ask_about_doc(question: str, conversation_history: list) -> str:
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"<document>\n{document_text}\n</document>",
                    # Cache the document — only process it once
                    "cache_control": {"type": "ephemeral"}
                },
                {
                    "type": "text",
                    "text": question
                }
            ]
        },
        *conversation_history
    ]
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=messages
    )
    return response.content[0].text
Why this works: The document block is hashed and cached after the first call. Follow-up questions in the same session hit the cache and skip reprocessing the document entirely.
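The cache-relevant part is pure data construction, so it can be sketched and verified without calling the API. `build_doc_messages` below is a hypothetical helper mirroring the structure above; the asserts demonstrate the invariant that makes caching work — the document block must be byte-identical across calls:

```python
def build_doc_messages(document_text: str, question: str, history: list) -> list:
    """Build a messages array with the document block marked for caching."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"<document>\n{document_text}\n</document>",
                    "cache_control": {"type": "ephemeral"},  # cached prefix ends here
                },
                {"type": "text", "text": question},  # changes per call, never cached
            ],
        },
        *history,
    ]

# Two different questions about the same document:
m1 = build_doc_messages("spec v1", "What is the SLA?", [])
m2 = build_doc_messages("spec v1", "Who owns deploys?", [])
assert m1[0]["content"][0] == m2[0]["content"][0]  # identical cached block -> cache hit
assert m1[0]["content"][1] != m2[0]["content"][1]  # only the question differs
```

If you templated the document text differently between calls (extra whitespace, a timestamp), the first assert would fail and so would the cache.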
Step 4: Multi-Turn Conversation Caching
For chatbots, cache the growing conversation history to avoid reprocessing earlier turns:
messages = []

def chat(user_input: str) -> str:
    messages.append({"role": "user", "content": user_input})

    # Mark the last user message for caching — this caches the entire
    # conversation prefix up to and including that message
    cacheable_messages = messages.copy()
    last_msg = cacheable_messages[-1]
    cacheable_messages[-1] = {
        **last_msg,
        "content": [
            {
                "type": "text",
                "text": last_msg["content"],  # user turns are plain strings here
                "cache_control": {"type": "ephemeral"}
            }
        ]
    }

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system="You are a helpful assistant.",
        messages=cacheable_messages
    )
    assistant_reply = response.content[0].text
    messages.append({"role": "assistant", "content": assistant_reply})
    return assistant_reply
Verification
Run this test script to confirm caching is working:
import anthropic
import time

client = anthropic.Anthropic()

# Roughly 2,400 tokens — comfortably above the 1,024-token minimum that
# Sonnet models require before a prefix can be cached
LARGE_SYSTEM_PROMPT = "You are a helpful assistant. " + ("Context. " * 1200)

def make_call(label: str):
    start = time.time()
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=50,
        system=[{"type": "text", "text": LARGE_SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": "Say hello."}]
    )
    elapsed = time.time() - start
    u = response.usage
    print(f"{label}: {elapsed:.2f}s | write={u.cache_creation_input_tokens} read={u.cache_read_input_tokens}")

make_call("Call 1 (cache write)")
time.sleep(1)
make_call("Call 2 (cache read)")
make_call("Call 3 (cache read)")

You should see output shaped like this (exact token counts and timings will vary):

Call 1 (cache write): 1.82s | write=2407 read=0
Call 2 (cache read): 0.94s | write=0 read=2407
Call 3 (cache read): 0.91s | write=0 read=2407

Call 1 is slower (writing the cache). Calls 2 and 3 are ~50% faster and 90% cheaper for those tokens.
What You Learned
- `cache_control: {"type": "ephemeral"}` marks the end of what gets stored; everything before it is cached as a unit.
- Cache TTL is 5 minutes, refreshed on each cache read. Calls within that window hit the cache; once it expires, the next call re-writes it (charged at the 25% premium again).
- Cache writes cost 25% more than standard input, but become profitable after just one re-read.
- Works on `system`, `messages`, and tool definitions.
When NOT to use caching:
- Prompts below the model's minimum cacheable length (1,024 tokens on Sonnet models) — `cache_control` on a shorter prefix is silently ignored, so there's nothing to gain.
- Highly dynamic content that changes every call — you'll pay write costs with no reads.
- Latency-sensitive single-shot calls where you won't make follow-up requests.
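These tradeoffs fold into a rough rule of thumb. A sketch using the Sonnet 4.5 rates above, where `expected_reads` is your estimate of cache hits landing within the TTL:

```python
# Rough heuristic: is caching worth it for this prompt?
MIN_CACHEABLE_TOKENS = 1024  # Sonnet-family minimum; shorter prefixes aren't cached
STANDARD, CACHE_WRITE, CACHE_READ = 3.00, 3.75, 0.30  # $/Mtok, Sonnet 4.5

def should_cache(prefix_tokens: int, expected_reads: int) -> bool:
    if prefix_tokens < MIN_CACHEABLE_TOKENS:
        return False  # below the minimum, cache_control is ignored anyway
    cached = CACHE_WRITE + expected_reads * CACHE_READ
    uncached = (1 + expected_reads) * STANDARD
    return cached < uncached

print(should_cache(2000, 0))   # False: one write, no reads, pure premium
print(should_cache(2000, 1))   # True: pays for itself after a single read
print(should_cache(500, 10))   # False: below the minimum cacheable length
```

The token counts here are per-call prefix sizes; in practice you'd measure them from `usage.cache_creation_input_tokens` on a real request.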
Limitation: The cache is scoped per model and per organization. Cached prefixes from one model don't transfer to another, and any change to the cached prefix — even a single character — forces a full re-write.
Tested on Claude Sonnet 4.5, Anthropic Python SDK 0.40+, February 2026