Problem: Standard RAG Loses Context When Chunks Are Split
Contextual retrieval is Anthropic's technique for fixing the silent failure mode in every standard RAG pipeline — chunks that are semantically meaningless without the surrounding document context.
Here's the situation: you split a 50-page PDF into 512-token chunks, embed them, and store them in a vector DB. A user asks a question. Your retriever pulls the top-5 chunks by cosine similarity. Three of those chunks say things like "As described above, this approach…" or "The following table summarizes…" — stripped of the context that makes them useful.
Anthropic published results showing how much this degrades retrieval accuracy. Their fix: prepend each chunk with a short, Claude-generated summary that anchors it to the source document before embedding. In Anthropic's benchmarks, contextual embeddings combined with contextual BM25 cut the retrieval failure rate by 49%.
You'll learn:
- Why standard chunking breaks retrieval for long documents
- How to implement Anthropic's contextual retrieval in Python 3.12
- How to combine it with BM25 hybrid search for maximum recall
- How to keep Claude API costs under control at scale
Time: 25 min | Difficulty: Intermediate
Why Standard RAG Fails on Long Documents
Most RAG pipelines treat chunking as a solved problem. Split on token count, maybe add overlap, done. The issue isn't the split itself — it's that the resulting chunks are orphaned from their source document during retrieval.
Symptoms:
- Retrieved chunks reference "the section above" or "as mentioned earlier" — meaning missing from the chunk
- Questions about document-wide themes return low-confidence results
- Chunks from the middle of dense technical docs score poorly despite being highly relevant
- Rerankers can't rescue chunks that lost critical context during splitting
Root cause: Embedding models encode the chunk in isolation. If the chunk says "This method reduces latency by 40%", the embedding has no signal for which method, which system, or which benchmark — because that context lived 800 tokens earlier in the document.
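To make this concrete, here is a hypothetical chunk before and after enrichment. The prefix and source details below are invented for illustration; in the real pipeline the prefix is generated by Claude:

```python
chunk = "This method reduces latency by 40% under the production workload."

# A short situating prefix restores the missing referents before embedding.
# This prefix is hand-written for illustration only.
prefix = (
    "From the 'Request Coalescing' section of the Gateway v2 design doc, "
    "describing the write-batching method benchmarked against v1."
)
enriched = f"{prefix}\n\n{chunk}"
```

Now the embedding carries "Gateway v2" and "write-batching" as signal, so a query about how Gateway v2 cut latency can actually match this chunk.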
How Contextual Retrieval Works
Each chunk gets a Claude-generated context prefix before embedding. Both the dense vector index and the BM25 sparse index receive the enriched chunk.
The pipeline has three stages that differ from standard RAG:
Stage 1 — Contextual chunk enrichment. Before embedding, each chunk is sent to Claude with the full document and a prompt asking for a 1–2 sentence situating summary. That summary is prepended to the chunk text.
Stage 2 — Dual indexing. The enriched chunk goes into both a dense vector store (for semantic search) and a BM25 sparse index (for keyword search). Querying both and merging the results is hybrid search.
Stage 3 — Reciprocal Rank Fusion. Results from both indexes are merged using RRF scoring, then passed to a reranker (Cohere or a local cross-encoder). The reranker re-scores on semantic relevance before the top-k results reach the LLM.
This three-stage approach is what drives the accuracy improvement. The contextual prefix fixes embedding quality. Hybrid search fixes recall gaps. RRF + reranking fixes precision.
Implementation
Step 1: Install dependencies
# Python 3.12 recommended — later steps use built-in generic annotations (list[...], dict[...])
# Install into a virtual environment rather than the system Python:
python3 -m venv .venv && source .venv/bin/activate
pip install anthropic langchain langchain-community \
    langchain-anthropic chromadb rank-bm25 \
    cohere tiktoken pypdf
Set your API keys:
export ANTHROPIC_API_KEY="sk-ant-..."
export COHERE_API_KEY="..." # Optional — swap for a local cross-encoder if preferred
Step 2: Build the contextual chunk enricher
This is the core of the technique. For every chunk, call Claude with the full document text and ask it to situate the chunk.
import anthropic
from dataclasses import dataclass
client = anthropic.Anthropic()
CONTEXT_PROMPT = """\
<document>
{document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Please give a short succinct context (1-2 sentences) to situate this chunk \
within the overall document for the purposes of improving search retrieval. \
Answer only with the succinct context and nothing else.
"""
@dataclass
class EnrichedChunk:
text: str # context prefix + original chunk text
original: str # original chunk text (for display)
source: str # document filename or ID
chunk_index: int
def enrich_chunk(document_text: str, chunk_text: str, source: str, idx: int) -> EnrichedChunk:
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=200, # context prefix is short — cap tokens to control cost
messages=[{
"role": "user",
"content": CONTEXT_PROMPT.format(document=document_text, chunk=chunk_text)
}]
)
context_prefix = response.content[0].text.strip()
enriched_text = f"{context_prefix}\n\n{chunk_text}"
return EnrichedChunk(
text=enriched_text,
original=chunk_text,
source=source,
chunk_index=idx
)
Note on cost: Each enrich_chunk call sends the full document in the prompt. For a 50-page PDF (~25k tokens) with 100 chunks, that's 2.5M input tokens. At Claude Sonnet's current pricing of $3/MTok input, that's ~$7.50 per document. Use prompt caching (see Step 5) to reduce this by ~90%.
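The arithmetic above generalizes. Here is a rough back-of-envelope helper; the pricing constants are assumptions based on the rates quoted in this section, with cache writes billed at a small premium and cache reads at roughly a tenth of the base input rate:

```python
def estimate_enrichment_cost(
    doc_tokens: int,
    n_chunks: int,
    price_per_mtok: float = 3.0,     # Sonnet input price quoted above
    cache_write_mult: float = 1.25,  # premium for writing the cache entry
    cache_read_mult: float = 0.1,    # discounted rate for cache hits
) -> tuple[float, float]:
    """Rough input-token cost (USD) for enriching one document's chunks,
    without and with prompt caching of the document text."""
    uncached = doc_tokens * n_chunks / 1e6 * price_per_mtok
    cached = (doc_tokens * cache_write_mult
              + doc_tokens * (n_chunks - 1) * cache_read_mult) / 1e6 * price_per_mtok
    return uncached, cached
```

For the 25k-token, 100-chunk example, `estimate_enrichment_cost(25_000, 100)` gives roughly (7.5, 0.84), matching the ~$7.50 and under-$1 figures quoted here and in Step 5.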
Step 3: Chunk documents and enrich in batch
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
def load_and_enrich(pdf_path: str) -> list[EnrichedChunk]:
loader = PyPDFLoader(pdf_path)
pages = loader.load()
full_text = "\n\n".join(p.page_content for p in pages)
splitter = RecursiveCharacterTextSplitter(
chunk_size=512,
chunk_overlap=64, # overlap preserves sentence boundaries
separators=["\n\n", "\n", ". ", " "]
)
raw_chunks = splitter.split_text(full_text)
enriched = []
for idx, chunk in enumerate(raw_chunks):
ec = enrich_chunk(full_text, chunk, pdf_path, idx)
enriched.append(ec)
print(f" Enriched chunk {idx+1}/{len(raw_chunks)}")
return enriched
Expected output after enriching chunk 1:
Enriched chunk 1/87
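The loop above makes one blocking API call per chunk. For large documents you can run the calls concurrently; this is a sketch, not part of the original pipeline (`enrich_all`, the pluggable `enrich` parameter, and the worker count of 4 are choices made here, and a low worker count keeps you inside API rate limits):

```python
from concurrent.futures import ThreadPoolExecutor

def enrich_all(full_text: str, raw_chunks: list[str], source: str,
               enrich=None, max_workers: int = 4) -> list:
    """Run chunk enrichment concurrently; results come back in chunk order.

    `enrich` takes (document_text, chunk_text, source, idx) and defaults to
    enrich_chunk from Step 2 (swap in enrich_chunk_cached from Step 5).
    """
    if enrich is None:
        enrich = enrich_chunk  # defined in Step 2
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(enrich, full_text, chunk, source, idx)
                   for idx, chunk in enumerate(raw_chunks)]
        return [f.result() for f in futures]  # f.result() re-raises API errors
```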
Step 4: Index into ChromaDB (dense) and BM25 (sparse)
import chromadb
from langchain_community.embeddings import HuggingFaceEmbeddings
from rank_bm25 import BM25Okapi
# Use a local embedding model to avoid per-embed API costs
embedder = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection("contextual_rag")
def build_indexes(chunks: list[EnrichedChunk]):
    # Dense index — ChromaDB. embed_documents batches the texts in one call
    # instead of embedding chunks one at a time with embed_query.
    collection.add(
        ids=[f"{c.source}_{c.chunk_index}" for c in chunks],
        documents=[c.text for c in chunks],
        embeddings=embedder.embed_documents([c.text for c in chunks]),
        metadatas=[{"source": c.source, "original": c.original} for c in chunks]
    )
    # Sparse index — BM25 on tokenized enriched text
    tokenized = [c.text.lower().split() for c in chunks]
    bm25 = BM25Okapi(tokenized)
    return bm25, [c.text for c in chunks]  # return corpus alongside index for score lookup
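One caveat on the tokenization above: `.lower().split()` leaves punctuation glued to words, so an indexed token like `latency.` never matches a query token `latency`. A slightly better tokenizer (a sketch; `bm25_tokenize` is a name introduced here, and it must be used for both the index and the query so the two sides agree):

```python
import re

def bm25_tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; strips the punctuation that
    str.split() leaves attached to words."""
    return re.findall(r"[a-z0-9]+", text.lower())

bm25_tokenize("Reduces latency by 40%.")  # ['reduces', 'latency', 'by', '40']
```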
Step 5: Enable prompt caching to cut enrichment cost by ~90%
Anthropic's API supports prompt caching via cache_control. For contextual retrieval, the document stays constant across all chunk enrichment calls — mark it for caching.
def enrich_chunk_cached(document_text: str, chunk_text: str, source: str, idx: int) -> EnrichedChunk:
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=200,
system=[{
"type": "text",
"text": "You are a helpful assistant that situates document chunks for retrieval.",
"cache_control": {"type": "ephemeral"} # cache the system prompt
}],
messages=[{
"role": "user",
"content": [
{
"type": "text",
"text": f"<document>\n{document_text}\n</document>",
"cache_control": {"type": "ephemeral"} # cache the full document
},
{
"type": "text",
"text": (
f"Here is the chunk to situate:\n<chunk>\n{chunk_text}\n</chunk>\n\n"
"Give a 1-2 sentence context to situate this chunk. Answer only with the context."
)
}
]
}]
)
context_prefix = response.content[0].text.strip()
return EnrichedChunk(
text=f"{context_prefix}\n\n{chunk_text}",
original=chunk_text,
source=source,
chunk_index=idx
)
The first call caches the document. Every subsequent call for the same document hits the cache — input tokens for the document cost ~10x less. For a 100-chunk document, effective cost drops from ~$7.50 to under $1.
Step 6: Hybrid retrieval with Reciprocal Rank Fusion
def reciprocal_rank_fusion(
    dense_results: list[dict],
    sparse_results: list[dict],
    k: int = 60
) -> list[dict]:
    """Merge dense and sparse rankings using RRF scoring.

    Both lists use the same "id" key, so a chunk that appears in both
    rankings accumulates score from each.
    """
    scores: dict[str, float] = {}
    doc_map: dict[str, dict] = {}
    for results in (dense_results, sparse_results):
        for rank, result in enumerate(results):
            doc_id = result["id"]
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
            doc_map.setdefault(doc_id, result)
    return sorted(doc_map.values(), key=lambda r: scores[r["id"]], reverse=True)

def retrieve(query: str, bm25: BM25Okapi, corpus: list[str], top_k: int = 20) -> list[dict]:
    query_embedding = embedder.embed_query(query)
    # Dense retrieval
    dense_raw = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    dense_results = [
        {"id": dense_raw["ids"][0][i], "text": dense_raw["documents"][0][i],
         "metadata": dense_raw["metadatas"][0][i]}
        for i in range(len(dense_raw["ids"][0]))
    ]
    # Sparse retrieval — BM25. Map each corpus index to the ChromaDB id of
    # the same chunk so RRF can merge the two rankings by id.
    text_to_id = {r["text"]: r["id"] for r in dense_results}
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    top_sparse = sorted(enumerate(bm25_scores), key=lambda x: x[1], reverse=True)[:top_k]
    sparse_results = [
        {"id": text_to_id.get(corpus[i], f"bm25_{i}"), "text": corpus[i], "metadata": {}}
        for i, _ in top_sparse
    ]
    return reciprocal_rank_fusion(dense_results, sparse_results)
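To see what the `k=60` constant does, here is the RRF scoring in isolation: a chunk ranked moderately in both lists outscores a chunk ranked well in only one, which is the behavior that makes hybrid search pay off.

```python
def rrf_score(ranks: list[int], k: int = 60) -> float:
    """RRF score for a chunk given its 0-based rank in each list it appears in."""
    return sum(1 / (k + r + 1) for r in ranks)

both = rrf_score([0, 2])   # 1st in dense, 3rd in sparse
single = rrf_score([1])    # 2nd in dense only
assert both > single       # agreement across rankings beats one good rank
```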
Step 7: Wire it into a full RAG chain
from langchain_anthropic import ChatAnthropic
import cohere
co = cohere.Client() # Cohere reranker — swap for a local cross-encoder to avoid API dependency
llm = ChatAnthropic(model="claude-sonnet-4-20250514", max_tokens=1024)
def answer(query: str, bm25: BM25Okapi, corpus: list[str]) -> str:
# 1. Hybrid retrieve
candidates = retrieve(query, bm25, corpus, top_k=20)
# 2. Rerank with Cohere
rerank_response = co.rerank(
query=query,
documents=[c["text"] for c in candidates],
top_n=5, # final context window gets top 5 after reranking
model="rerank-english-v3.0"
)
top_chunks = [candidates[r.index]["text"] for r in rerank_response.results]
# 3. Generate answer
context = "\n\n---\n\n".join(top_chunks)
prompt = f"""Answer the question using only the context below.
<context>
{context}
</context>
Question: {query}
Answer:"""
response = llm.invoke(prompt)
return response.content
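The Cohere call in `answer()` can be swapped for the local cross-encoder mentioned above. A sketch, assuming `sentence-transformers` is installed (`rerank_local` and `top_by_score` are names introduced here):

```python
def top_by_score(scores, candidates: list[dict], top_n: int = 5) -> list[str]:
    """Return the texts of the top_n highest-scoring candidates."""
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [c["text"] for _, c in ranked[:top_n]]

def rerank_local(query: str, candidates: list[dict], top_n: int = 5) -> list[str]:
    # Lazy import so the rest of the pipeline runs without the package.
    from sentence_transformers import CrossEncoder
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    # A cross-encoder scores each (query, chunk) pair jointly, which is
    # what makes it more precise than the bi-encoder used for retrieval.
    scores = model.predict([(query, c["text"]) for c in candidates])
    return top_by_score(scores, candidates, top_n)
```

In `answer()`, replace the `co.rerank` call and the line after it with `top_chunks = rerank_local(query, candidates)`.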
Verification
Run a quick end-to-end test with a sample document:
if __name__ == "__main__":
chunks = load_and_enrich("sample_report.pdf")
bm25, corpus = build_indexes(chunks)
result = answer("What are the key latency optimizations described?", bm25, corpus)
print(result)
You should see: A coherent answer that correctly attributes the optimization to the right system — not a vague "the method described above reduces latency."
If the answer is still vague, check:
- `Error: Invalid API Key` → re-export `ANTHROPIC_API_KEY`
- Chunks too short → raise `chunk_size` to 768 or 1024. Very short chunks lose meaning even with context prepended.
- BM25 always wins → your document has rare keywords dominating scores. Add a weight parameter to RRF: multiply the dense score contribution by 1.3.
Contextual Retrieval vs Standard RAG
| | Standard RAG | Contextual Retrieval |
|---|---|---|
| Chunk embedding | Raw chunk only | Context prefix + chunk |
| Retrieval method | Dense vector only | Dense + BM25 hybrid |
| Merging strategy | Top-k by score | Reciprocal Rank Fusion |
| Reranking | Optional | Recommended |
| Cost per document | Low | Medium (offset by caching) |
| Retrieval failure rate | Baseline | ~49% lower (Anthropic benchmark) |
| Best for | Short focused docs | Long technical docs, contracts, reports |
Use standard RAG if: Your documents are short (under 5 pages), your queries are highly specific keyword searches, or you need zero additional API cost.
Use contextual retrieval if: You're working with long technical documents, legal contracts, research papers, or any corpus where chunks frequently reference earlier content.
What You Learned
- Standard RAG fails because chunks lose surrounding context during splitting — the fix is prepending a Claude-generated summary before embedding
- Prompt caching on the document text cuts enrichment costs by ~90% and is essential for production use
- BM25 hybrid search recovers keyword-match cases that dense retrieval misses, especially for proper nouns and exact terms
- Reciprocal Rank Fusion is a parameter-free merge strategy — it outperforms weighted sum in most RAG benchmarks without tuning
- The reranker is the final quality gate — it re-scores on semantic relevance after retrieval, not before
Tested on Python 3.12, anthropic SDK 0.40, chromadb 0.5, rank-bm25 0.2.2, macOS and Ubuntu 24.04
FAQ
Q: Does contextual retrieval work with OpenAI embeddings instead of a local model?
A: Yes. Replace HuggingFaceEmbeddings with OpenAIEmbeddings from langchain-openai. The enrichment logic is independent of the embedding model. Local models like BAAI/bge-small-en-v1.5 perform comparably on retrieval benchmarks and cost nothing per query.
Q: What is the minimum context window needed to enrich chunks from long documents?
A: Claude Sonnet supports 200k tokens. For most PDFs under 150 pages (~75k tokens), a single call handles the full document. For larger documents, split by section and enrich each section's chunks against only that section's text.
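The section-splitting fallback can be sketched as follows. This is a rough heuristic: the 4-chars-per-token estimate and the blank-line boundaries are assumptions made here, not part of Anthropic's recipe.

```python
def split_into_sections(full_text: str, max_tokens: int = 150_000,
                        chars_per_token: int = 4) -> list[str]:
    """Greedily pack paragraphs into pieces that fit the context window.
    Each piece is then used as the document_text for its own chunks."""
    limit = max_tokens * chars_per_token
    sections: list[str] = []
    current: list[str] = []
    size = 0
    for para in full_text.split("\n\n"):
        if current and size + len(para) > limit:
            sections.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para) + 2  # +2 for the blank line restored on re-join
    if current:
        sections.append("\n\n".join(current))
    return sections
```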
Q: Can I skip the reranker and just use RRF output directly?
A: Yes, but accuracy drops. RRF improves recall; the reranker improves precision. If Cohere is too expensive (~$1/1k searches on the paid tier), use a local cross-encoder like cross-encoder/ms-marco-MiniLM-L-6-v2 via sentence-transformers — free and nearly as accurate.
Q: How does contextual retrieval handle duplicate or near-duplicate chunks?
A: It doesn't deduplicate by default. Add a post-retrieval step that compares chunk embeddings and drops any result with cosine similarity > 0.95 to a higher-ranked result. ChromaDB doesn't expose this natively — run it in Python after retrieval.
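That deduplication step can be sketched like this (`drop_near_duplicates` is a name introduced here; it assumes you have fetched each result's embedding, e.g. with ChromaDB's `include=["embeddings"]` option):

```python
import numpy as np

def drop_near_duplicates(results: list[dict], embeddings: list,
                         threshold: float = 0.95) -> list[dict]:
    """Keep results in rank order, dropping any whose embedding has cosine
    similarity above `threshold` to an already-kept, higher-ranked result."""
    kept: list[dict] = []
    kept_vecs: list[np.ndarray] = []
    for result, vec in zip(results, embeddings):
        v = np.asarray(vec, dtype=float)
        v = v / np.linalg.norm(v)  # normalize so dot product = cosine
        if all(float(v @ kv) <= threshold for kv in kept_vecs):
            kept.append(result)
            kept_vecs.append(v)
    return kept
```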
Q: What chunk size works best with contextual retrieval?
A: Anthropic's own tests used 512-token chunks with a 50-token context prefix. Smaller chunks (256 tokens) benefit more from context enrichment because they lose more coherence on their own. Larger chunks (1024 tokens) need less enrichment but slow reranking.