Problem: Standard RAG Loses Context When Chunks Are Split
Contextual retrieval is Anthropic's technique for fixing the silent failure mode in every standard RAG pipeline — chunks that are semantically meaningless without the surrounding document context.
Here's the situation: you split a 50-page PDF into 512-token chunks, embed them, and store them in a vector DB. A user asks a question. Your retriever pulls the top-5 chunks by cosine similarity. Three of those chunks say things like "As described above, this approach…" or "The following table summarizes…" — stripped of the context that makes them useful.
Anthropic published results showing how much this degrades retrieval accuracy. Their fix: prepend each chunk with a short, Claude-generated summary that anchors it to the source document before embedding. In Anthropic's benchmarks, contextual embeddings combined with contextual BM25 cut the retrieval failure rate by 49%.
You'll learn:
- Why standard chunking breaks retrieval for long documents
- How to implement Anthropic's contextual retrieval in Python 3.12
- How to combine it with BM25 hybrid search for maximum recall
- How to keep Claude API costs under control at scale
Time: 25 min | Difficulty: Intermediate
Why Standard RAG Fails on Long Documents
Most RAG pipelines treat chunking as a solved problem. Split on token count, maybe add overlap, done. The issue isn't the split itself — it's that the resulting chunks are orphaned from their source document during retrieval.
Symptoms:
- Retrieved chunks reference "the section above" or "as mentioned earlier" — meaning missing from the chunk
- Questions about document-wide themes return low-confidence results
- Chunks from the middle of dense technical docs score poorly despite being highly relevant
- Rerankers can't rescue chunks that lost critical context during splitting
Root cause: Embedding models encode the chunk in isolation. If the chunk says "This method reduces latency by 40%", the embedding has no signal for which method, which system, or which benchmark — because that context lived 800 tokens earlier in the document.
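To make this concrete, here is a hypothetical chunk before and after enrichment. The prefix and source details below are invented for illustration; in the real pipeline the prefix is generated by Claude:

```python
chunk = "This method reduces latency by 40% under the production workload."

# A short situating prefix restores the missing referents before embedding.
# This prefix is hand-written for illustration only.
prefix = (
    "From the 'Request Coalescing' section of the Gateway v2 design doc, "
    "describing the write-batching method benchmarked against v1."
)
enriched = f"{prefix}\n\n{chunk}"
```

Now the embedding carries "Gateway v2" and "write-batching" as signal, so a query about how Gateway v2 cut latency can actually match this chunk.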
How Contextual Retrieval Works
Each chunk gets a Claude-generated context prefix before embedding. Both the dense vector index and the BM25 sparse index receive the enriched chunk.
The pipeline has three stages that differ from standard RAG:
Stage 1 — Contextual chunk enrichment. Before embedding, each chunk is sent to Claude with the full document and a prompt asking for a 1–2 sentence situating summary. That summary is prepended to the chunk text.
Stage 2 — Dual indexing. The enriched chunk goes into both a dense vector store (for semantic search) and a BM25 sparse index (for keyword search). Querying both and merging the results is hybrid search.
Stage 3 — Reciprocal Rank Fusion. Results from both indexes are merged using RRF scoring, then passed to a reranker (Cohere or a local cross-encoder). The reranker re-scores on semantic relevance before the top-k results reach the LLM.
This three-stage approach is what drives the accuracy improvement. The contextual prefix fixes embedding quality. Hybrid search fixes recall gaps. RRF + reranking fixes precision.
Implementation
Step 1: Install dependencies
# Python 3.12 recommended — later steps use built-in generic annotations (list[...], dict[...])
# Install into a virtual environment rather than the system Python:
python3 -m venv .venv && source .venv/bin/activate
pip install anthropic langchain langchain-community \
    langchain-anthropic chromadb rank-bm25 \
    cohere tiktoken pypdf
Set your API keys:
export ANTHROPIC_API_KEY="sk-ant-..."
export COHERE_API_KEY="..." # Optional — swap for a local cross-encoder if preferred
Step 2: Build the contextual chunk enricher
This is the core of the technique. For every chunk, call Claude with the full document text and ask it to situate the chunk.
import anthropic
from dataclasses import dataclass
client = anthropic.Anthropic()
CONTEXT_PROMPT = """\
<document>
{document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Please give a short succinct context (1-2 sentences) to situate this chunk \
within the overall document for the purposes of improving search retrieval. \
Answer only with the succinct context and nothing else.
"""
@dataclass
class EnrichedChunk:
text: str # context prefix + original chunk text
original: str # original chunk text (for display)
source: str # document filename or ID
chunk_index: int
def enrich_chunk(document_text: str, chunk_text: str, source: str, idx: int) -> EnrichedChunk:
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=200, # context prefix is short — cap tokens to control cost
messages=[{
"role": "user",
"content": CONTEXT_PROMPT.format(document=document_text, chunk=chunk_text)
}]
)
context_prefix = response.content[0].text.strip()
enriched_text = f"{context_prefix}\n\n{chunk_text}"
return EnrichedChunk(
text=enriched_text,
original=chunk_text,
source=source,
chunk_index=idx
)
Note on cost: Each enrich_chunk call sends the full document in the prompt. For a 50-page PDF (~25k tokens) with 100 chunks, that's 2.5M input tokens. At Claude Sonnet's current pricing of $3/MTok input, that's ~$7.50 per document. Use prompt caching (see Step 5) to reduce this by ~90%.
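The arithmetic above generalizes. Here is a rough back-of-envelope helper; the pricing constants are assumptions based on the rates quoted in this section, with cache writes billed at a small premium and cache reads at roughly a tenth of the base input rate:

```python
def estimate_enrichment_cost(
    doc_tokens: int,
    n_chunks: int,
    price_per_mtok: float = 3.0,     # Sonnet input price quoted above
    cache_write_mult: float = 1.25,  # premium for writing the cache entry
    cache_read_mult: float = 0.1,    # discounted rate for cache hits
) -> tuple[float, float]:
    """Rough input-token cost (USD) for enriching one document's chunks,
    without and with prompt caching of the document text."""
    uncached = doc_tokens * n_chunks / 1e6 * price_per_mtok
    cached = (doc_tokens * cache_write_mult
              + doc_tokens * (n_chunks - 1) * cache_read_mult) / 1e6 * price_per_mtok
    return uncached, cached
```

For the 25k-token, 100-chunk example, `estimate_enrichment_cost(25_000, 100)` gives roughly (7.5, 0.84), matching the ~$7.50 and under-$1 figures quoted here and in Step 5.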
Step 3: Chunk documents and enrich in batch
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
def load_and_enrich(pdf_path: str) -> list[EnrichedChunk]:
loader = PyPDFLoader(pdf_path)
pages = loader.load()
full_text = "\n\n".join(p.page_content for p in pages)
splitter = RecursiveCharacterTextSplitter(
chunk_size=512,
chunk_overlap=64, # overlap preserves sentence boundaries
separators=["\n\n", "\n", ". ", " "]
)
raw_chunks = splitter.split_text(full_text)
enriched = []
for idx, chunk in enumerate(raw_chunks):
ec = enrich_chunk(full_text, chunk, pdf_path, idx)
enriched.append(ec)
print(f" Enriched chunk {idx+1}/{len(raw_chunks)}")
return enriched
Expected output after enriching chunk 1:
Enriched chunk 1/87
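The loop above makes one blocking API call per chunk. For large documents you can run the calls concurrently; this is a sketch, not part of the original pipeline (`enrich_all`, the pluggable `enrich` parameter, and the worker count of 4 are choices made here, and a low worker count keeps you inside API rate limits):

```python
from concurrent.futures import ThreadPoolExecutor

def enrich_all(full_text: str, raw_chunks: list[str], source: str,
               enrich=None, max_workers: int = 4) -> list:
    """Run chunk enrichment concurrently; results come back in chunk order.

    `enrich` takes (document_text, chunk_text, source, idx) and defaults to
    enrich_chunk from Step 2 (swap in enrich_chunk_cached from Step 5).
    """
    if enrich is None:
        enrich = enrich_chunk  # defined in Step 2
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(enrich, full_text, chunk, source, idx)
                   for idx, chunk in enumerate(raw_chunks)]
        return [f.result() for f in futures]  # f.result() re-raises API errors
```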
Step 4: Index into ChromaDB (dense) and BM25 (sparse)
import chromadb
from langchain_community.embeddings import HuggingFaceEmbeddings
from rank_bm25 import BM25Okapi
# Use a local embedding model to avoid per-embed API costs
embedder = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection("contextual_rag")
def build_indexes(chunks: list[EnrichedChunk]):
    # Dense index — ChromaDB. embed_documents batches the texts in one call
    # instead of embedding chunks one at a time with embed_query.
    collection.add(
        ids=[f"{c.source}_{c.chunk_index}" for c in chunks],
        documents=[c.text for c in chunks],
        embeddings=embedder.embed_documents([c.text for c in chunks]),
        metadatas=[{"source": c.source, "original": c.original} for c in chunks]
    )
    # Sparse index — BM25 on tokenized enriched text
    tokenized = [c.text.lower().split() for c in chunks]
    bm25 = BM25Okapi(tokenized)
    return bm25, [c.text for c in chunks]  # return corpus alongside index for score lookup
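One caveat on the tokenization above: `.lower().split()` leaves punctuation glued to words, so an indexed token like `latency.` never matches a query token `latency`. A slightly better tokenizer (a sketch; `bm25_tokenize` is a name introduced here, and it must be used for both the index and the query so the two sides agree):

```python
import re

def bm25_tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; strips the punctuation that
    str.split() leaves attached to words."""
    return re.findall(r"[a-z0-9]+", text.lower())

bm25_tokenize("Reduces latency by 40%.")  # ['reduces', 'latency', 'by', '40']
```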
Step 5: Enable prompt caching to cut enrichment cost by ~90%
Anthropic's API supports prompt caching via cache_control. For contextual retrieval, the document stays constant across all chunk enrichment calls — mark it for caching.
def enrich_chunk_cached(document_text: str, chunk_text: str, source: str, idx: int) -> EnrichedChunk:
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=200,
system=[{
"type": "text",
"text": "You are a helpful assistant that situates document chunks for retrieval.",
"cache_control": {"type": "ephemeral"} # cache the system prompt
}],
messages=[{
"role": "user",
"content": [
{
"type": "text",
"text": f"<document>\n{document_text}\n</document>",
"cache_control": {"type": "ephemeral"} # cache the full document
},
{
"type": "text",
"text": (
f"Here is the chunk to situate:\n<chunk>\n{chunk_text}\n</chunk>\n\n"
"Give a 1-2 sentence context to situate this chunk. Answer only with the context."
)
}
]
}]
)
context_prefix = response.content[0].text.strip()
return EnrichedChunk(
text=f"{context_prefix}\n\n{chunk_text}",
original=chunk_text,
source=source,
chunk_index=idx
)
The first call caches the document. Every subsequent call for the same document hits the cache — input tokens for the document cost ~10x less. For a 100-chunk document, effective cost drops from ~$7.50 to under $1.
Step 6: Hybrid retrieval with Reciprocal Rank Fusion
def reciprocal_rank_fusion(
    dense_results: list[dict],
    sparse_results: list[dict],
    k: int = 60
) -> list[dict]:
    """Merge dense and sparse rankings using RRF scoring.

    Both lists use the same "id" key, so a chunk that appears in both
    rankings accumulates score from each.
    """
    scores: dict[str, float] = {}
    doc_map: dict[str, dict] = {}
    for results in (dense_results, sparse_results):
        for rank, result in enumerate(results):
            doc_id = result["id"]
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
            doc_map.setdefault(doc_id, result)
    return sorted(doc_map.values(), key=lambda r: scores[r["id"]], reverse=True)

def retrieve(query: str, bm25: BM25Okapi, corpus: list[str], top_k: int = 20) -> list[dict]:
    query_embedding = embedder.embed_query(query)
    # Dense retrieval
    dense_raw = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    dense_results = [
        {"id": dense_raw["ids"][0][i], "text": dense_raw["documents"][0][i],
         "metadata": dense_raw["metadatas"][0][i]}
        for i in range(len(dense_raw["ids"][0]))
    ]
    # Sparse retrieval — BM25. Map each corpus index to the ChromaDB id of
    # the same chunk so RRF can merge the two rankings by id.
    text_to_id = {r["text"]: r["id"] for r in dense_results}
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    top_sparse = sorted(enumerate(bm25_scores), key=lambda x: x[1], reverse=True)[:top_k]
    sparse_results = [
        {"id": text_to_id.get(corpus[i], f"bm25_{i}"), "text": corpus[i], "metadata": {}}
        for i, _ in top_sparse
    ]
    return reciprocal_rank_fusion(dense_results, sparse_results)
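To see what the `k=60` constant does, here is the RRF scoring in isolation: a chunk ranked moderately in both lists outscores a chunk ranked well in only one, which is the behavior that makes hybrid search pay off.

```python
def rrf_score(ranks: list[int], k: int = 60) -> float:
    """RRF score for a chunk given its 0-based rank in each list it appears in."""
    return sum(1 / (k + r + 1) for r in ranks)

both = rrf_score([0, 2])   # 1st in dense, 3rd in sparse
single = rrf_score([1])    # 2nd in dense only
assert both > single       # agreement across rankings beats one good rank
```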
Step 7: Wire it into a full RAG chain
from langchain_anthropic import ChatAnthropic
import cohere
co = cohere.Client() # Cohere reranker — swap for a local cross-encoder to avoid API dependency
llm = ChatAnthropic(model="claude-sonnet-4-20250514", max_tokens=1024)
def answer(query: str, bm25: BM25Okapi, corpus: list[str]) -> str:
# 1. Hybrid retrieve
candidates = retrieve(query, bm25, corpus, top_k=20)
# 2. Rerank with Cohere
rerank_response = co.rerank(
query=query,
documents=[c["text"] for c in candidates],
top_n=5, # final context window gets top 5 after reranking
model="rerank-english-v3.0"
)
top_chunks = [candidates[r.index]["text"] for r in rerank_response.results]
# 3. Generate answer
context = "\n\n---\n\n".join(top_chunks)
prompt = f"""Answer the question using only the context below.
<context>
{context}
</context>
Question: {query}
Answer:"""
response = llm.invoke(prompt)
return response.content
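The Cohere call in `answer()` can be swapped for the local cross-encoder mentioned above. A sketch, assuming `sentence-transformers` is installed (`rerank_local` and `top_by_score` are names introduced here):

```python
def top_by_score(scores, candidates: list[dict], top_n: int = 5) -> list[str]:
    """Return the texts of the top_n highest-scoring candidates."""
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [c["text"] for _, c in ranked[:top_n]]

def rerank_local(query: str, candidates: list[dict], top_n: int = 5) -> list[str]:
    # Lazy import so the rest of the pipeline runs without the package.
    from sentence_transformers import CrossEncoder
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    # A cross-encoder scores each (query, chunk) pair jointly, which is
    # what makes it more precise than the bi-encoder used for retrieval.
    scores = model.predict([(query, c["text"]) for c in candidates])
    return top_by_score(scores, candidates, top_n)
```

In `answer()`, replace the `co.rerank` call and the line after it with `top_chunks = rerank_local(query, candidates)`.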
Verification
Run a quick end-to-end test with a sample document:
if __name__ == "__main__":
chunks = load_and_enrich("sample_report.pdf")
bm25, corpus = build_indexes(chunks)
result = answer("What are the key latency optimizations described?", bm25, corpus)
print(result)
You should see: A coherent answer that correctly attributes the optimization to the right system — not a vague "the method described above reduces latency."
If the answer is still vague, check:
- `Error: Invalid API Key` → re-export `ANTHROPIC_API_KEY`
- Chunks too short → raise `chunk_size` to 768 or 1024. Very short chunks lose meaning even with context prepended.
- BM25 always wins → your document has rare keywords dominating scores. Add a weight parameter to RRF: multiply the dense score contribution by 1.3.
Contextual Retrieval vs Standard RAG
| | Standard RAG | Contextual Retrieval |
|---|---|---|
| Chunk embedding | Raw chunk only | Context prefix + chunk |
| Retrieval method | Dense vector only | Dense + BM25 hybrid |
| Merging strategy | Top-k by score | Reciprocal Rank Fusion |
| Reranking | Optional | Recommended |
| Cost per document | Low | Medium (offset by caching) |
| Retrieval failure rate | Baseline | ~49% lower (Anthropic benchmark) |
| Best for | Short focused docs | Long technical docs, contracts, reports |
Use standard RAG if: Your documents are short (under 5 pages), your queries are highly specific keyword searches, or you need zero additional API cost.
Use contextual retrieval if: You're working with long technical documents, legal contracts, research papers, or any corpus where chunks frequently reference earlier content.
What You Learned
- Standard RAG fails because chunks lose surrounding context during splitting — the fix is prepending a Claude-generated summary before embedding
- Prompt caching on the document text cuts enrichment costs by ~90% and is essential for production use
- BM25 hybrid search recovers keyword-match cases that dense retrieval misses, especially for proper nouns and exact terms
- Reciprocal Rank Fusion is a parameter-free merge strategy — it outperforms weighted sum in most RAG benchmarks without tuning
- The reranker is the final quality gate — it re-scores on semantic relevance after retrieval, not before
Tested on Python 3.12, anthropic SDK 0.40, chromadb 0.5, rank-bm25 0.2.2, macOS and Ubuntu 24.04
FAQ
Q: Does contextual retrieval work with OpenAI embeddings instead of a local model?
A: Yes. Replace HuggingFaceEmbeddings with OpenAIEmbeddings from langchain-openai. The enrichment logic is independent of the embedding model. Local models like BAAI/bge-small-en-v1.5 perform comparably on retrieval benchmarks and cost nothing per query.
Q: What is the minimum context window needed to enrich chunks from long documents?
A: Claude Sonnet supports 200k tokens. For most PDFs under 150 pages (~75k tokens), a single call handles the full document. For larger documents, split by section and enrich each section's chunks against only that section's text.
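The section-splitting fallback can be sketched as follows. This is a rough heuristic: the 4-chars-per-token estimate and the blank-line boundaries are assumptions made here, not part of Anthropic's recipe.

```python
def split_into_sections(full_text: str, max_tokens: int = 150_000,
                        chars_per_token: int = 4) -> list[str]:
    """Greedily pack paragraphs into pieces that fit the context window.
    Each piece is then used as the document_text for its own chunks."""
    limit = max_tokens * chars_per_token
    sections: list[str] = []
    current: list[str] = []
    size = 0
    for para in full_text.split("\n\n"):
        if current and size + len(para) > limit:
            sections.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para) + 2  # +2 for the blank line restored on re-join
    if current:
        sections.append("\n\n".join(current))
    return sections
```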
Q: Can I skip the reranker and just use RRF output directly?
A: Yes, but accuracy drops. RRF improves recall; the reranker improves precision. If Cohere is too expensive (~$1/1k searches on the paid tier), use a local cross-encoder like cross-encoder/ms-marco-MiniLM-L-6-v2 via sentence-transformers — free and nearly as accurate.
Q: How does contextual retrieval handle duplicate or near-duplicate chunks?
A: It doesn't deduplicate by default. Add a post-retrieval step that compares chunk embeddings and drops any result with cosine similarity > 0.95 to a higher-ranked result. ChromaDB doesn't expose this natively — run it in Python after retrieval.
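That deduplication step can be sketched like this (`drop_near_duplicates` is a name introduced here; it assumes you have fetched each result's embedding, e.g. with ChromaDB's `include=["embeddings"]` option):

```python
import numpy as np

def drop_near_duplicates(results: list[dict], embeddings: list,
                         threshold: float = 0.95) -> list[dict]:
    """Keep results in rank order, dropping any whose embedding has cosine
    similarity above `threshold` to an already-kept, higher-ranked result."""
    kept: list[dict] = []
    kept_vecs: list[np.ndarray] = []
    for result, vec in zip(results, embeddings):
        v = np.asarray(vec, dtype=float)
        v = v / np.linalg.norm(v)  # normalize so dot product = cosine
        if all(float(v @ kv) <= threshold for kv in kept_vecs):
            kept.append(result)
            kept_vecs.append(v)
    return kept
```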
Q: What chunk size works best with contextual retrieval?
A: Anthropic's own tests used 512-token chunks with a 50-token context prefix. Smaller chunks (256 tokens) benefit more from context enrichment because they lose more coherence on their own. Larger chunks (1024 tokens) need less enrichment but slow reranking.