Manage RAG Context Windows: Chunk Strategy Guide 2026

Master RAG context window management with proven chunk strategies. Fixed-size, semantic, and recursive chunking compared with Python code. Tested on LangChain + pgvector.

Problem: Your RAG Pipeline Retrieves the Wrong Context

RAG context window management determines whether your retrieval pipeline surfaces the right information — or silently returns irrelevant chunks that confuse the LLM. A poorly chunked corpus is the most common cause of hallucinations in otherwise well-architected RAG systems.

This guide walks through every major chunk strategy, when to use each, and how to implement them in Python with LangChain and pgvector.

You'll learn:

  • Why chunk size and overlap directly control retrieval accuracy
  • How to choose between fixed-size, recursive, semantic, and document-aware chunking
  • How to tune chunk_size and chunk_overlap for your specific use case

Time: 20 min | Difficulty: Intermediate


Why This Happens

Context windows in LLMs are finite. GPT-4o has a 128K token window; Claude 3.5 Sonnet has 200K. But your retrieval step runs before the LLM — and the top-k chunks you fetch must fit inside the prompt alongside your system instructions, conversation history, and output space.

Most RAG failures trace back to one of three chunking mistakes:

  • Chunks too large — retrieved text buries the relevant sentence in noise; the LLM attends to the wrong part
  • Chunks too small — relevant context is split across chunk boundaries; neither chunk alone answers the question
  • No overlap — a sentence at the end of chunk N and the start of chunk N+1 loses its relationship entirely
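The boundary problem is easy to reproduce. Here is a minimal character-based splitter (toy sentence and tiny chunk sizes chosen purely for illustration) showing how a fact gets severed when overlap is zero:

```python
text = "The discount applies only to annual plans. Monthly plans are excluded."

def fixed_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    # Naive fixed-size splitter: advance by (size - overlap) characters each step
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

no_overlap = fixed_chunks(text, 40)
with_overlap = fixed_chunks(text, 40, overlap=10)

# Without overlap, "annual plans" and "Monthly plans are excluded" land in
# different chunks, so neither chunk alone answers "who gets the discount?"
for c in no_overlap:
    print(repr(c))
```

With overlap, the boundary region is repeated in adjacent chunks, so the qualifier survives in at least one retrievable unit.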

Symptoms:

  • LLM says "I don't have enough information" when the answer is clearly in your docs
  • Retrieval returns the right document but wrong section
  • Answers are technically correct but miss key qualifiers (e.g., version constraints, pricing tiers)

[Figure: End-to-end RAG chunking pipeline: document → splitter → vector store → retriever → LLM context window]


Solution

Step 1: Start With RecursiveCharacterTextSplitter

For most use cases, RecursiveCharacterTextSplitter is the correct default. It tries separators in order (\n\n, then \n, then spaces, then individual characters), preserving paragraph, line, and word boundaries before falling back to raw character splits.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,      # with length_function=len this is 512 chars (~128 tokens)
    chunk_overlap=64,    # 12.5% overlap prevents context loss at boundaries
    length_function=len, # counts characters; swap in tiktoken (see Step 2) for exact tokens
)

docs = splitter.create_documents([your_text])

Expected output: A list of Document objects, each with .page_content no longer than 512 characters. Pass a metadatas list to create_documents if each chunk should carry its source in .metadata.

If it fails:

  • ImportError: langchain → pip install langchain-text-splitters --break-system-packages
  • Chunks are empty strings → your input has Windows line endings; normalize with text.replace("\r\n", "\n") first

Step 2: Pick the Right Chunk Size for Your Document Type

There is no universal chunk_size. The right value depends on your document type and query pattern.

Document type          | chunk_size       | chunk_overlap | Reasoning
-----------------------|------------------|---------------|----------
API reference / docs   | 256–512 tokens   | 32–64         | Each function is self-contained; small chunks = precise retrieval
Long-form prose / PDFs | 512–1024 tokens  | 100–128       | Paragraphs need context from surrounding text
Support tickets / logs | 128–256 tokens   | 16–32         | Short entries; larger chunks add noise
Legal / compliance     | 1024–2048 tokens | 256           | Clause meaning depends on surrounding clauses
Code files             | 256–512 tokens   | 64            | Function-level granularity; overlap preserves imports

# Exact token-based sizing with tiktoken (recommended for OpenAI models)
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

enc = tiktoken.get_encoding("cl100k_base")  # encoding for text-embedding-3-*; GPT-4o itself uses o200k_base

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    length_function=lambda text: len(enc.encode(text)),  # count tokens, not chars
)

Why token-based sizing matters: OpenAI's text-embedding-3-* models cap input at 8191 tokens. If your chunks measure length in characters, a dense technical passage can blow past that limit; depending on your client, the input is either rejected or quietly truncated, and truncation degrades retrieval with no visible error.
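A cheap pre-flight check catches over-long chunks before they reach the embedding API. This sketch uses the rough 4-chars-per-token heuristic so it needs no dependencies; swap in tiktoken for exact counts:

```python
EMBED_TOKEN_LIMIT = 8191  # input limit for text-embedding-3-* models

def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 chars per token for English; use tiktoken for exact counts
    return len(text) // 4

def oversized_chunks(chunks: list[str], limit: int = EMBED_TOKEN_LIMIT) -> list[int]:
    """Indices of chunks whose approximate token count exceeds the embedding limit."""
    return [i for i, c in enumerate(chunks) if approx_tokens(c) > limit]

chunks = ["short chunk", "x" * 40_000]  # second chunk ≈ 10,000 tokens
print(oversized_chunks(chunks))  # → [1]
```

Run this over your full chunk list before indexing; any flagged index needs a smaller chunk_size or a token-based length_function.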


Step 3: Use Semantic Chunking for High-Precision Use Cases

Fixed-size splitting ignores meaning. Semantic chunking groups sentences that are topically related, producing chunks that align with how humans organize information.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",  # split where cosine distance exceeds the 95th percentile
    breakpoint_threshold_amount=95,
)

docs = chunker.create_documents([your_text])

When to use semantic chunking:

  • Your corpus mixes topics within single documents (e.g., internal wikis, meeting notes)
  • Retrieval precision matters more than indexing speed
  • You're seeing topic bleed — chunks about pricing containing unrelated product details

When NOT to use it:

  • Documents with tight token budgets (semantic chunks vary in size; some will be very large)
  • Indexing millions of documents — the embedding call per chunk boundary is expensive
  • Structured reference docs where fixed-size works perfectly well
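The index-time cost concern is easy to quantify. SemanticChunker embeds roughly one sentence group per sentence to locate breakpoints, on top of the per-chunk embeddings you pay for either way. A back-of-envelope sketch (the corpus shape numbers are assumptions):

```python
def semantic_chunking_overhead(n_docs: int, sentences_per_doc: int, chunks_per_doc: int) -> float:
    """Ratio of embedding calls: semantic chunking vs fixed-size splitting.

    Fixed-size: one embedding per final chunk.
    Semantic: roughly one embedding per sentence (to find breakpoints),
    plus one per final chunk.
    """
    fixed = n_docs * chunks_per_doc
    semantic = n_docs * (sentences_per_doc + chunks_per_doc)
    return semantic / fixed

# A 1M-doc corpus with ~80 sentences and ~10 final chunks per document:
print(f"{semantic_chunking_overhead(1_000_000, 80, 10):.0f}x embedding calls")  # → 9x
```

An order of magnitude more embedding calls is usually acceptable for a curated knowledge base and prohibitive for a raw million-document crawl.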

Step 4: Add Document-Aware Splitting for PDFs and Code

For structured formats, use format-specific splitters that respect document semantics.

For Markdown / HTML (structure-aware):

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False,  # keep headers in chunk text for retrieval context
)

docs = splitter.split_text(markdown_text)
# Each doc.metadata now contains {"h1": "...", "h2": "..."} — filter by section at query time
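To make the metadata mechanics concrete, here is a minimal, library-free sketch of header-aware splitting (simplified: ATX headers up to ###, no fenced-code handling, not a drop-in replacement for MarkdownHeaderTextSplitter):

```python
import re

def split_by_headers(md: str) -> list[dict]:
    """Minimal header-aware splitter: each section carries its h1/h2/h3 path."""
    sections, meta, buf = [], {}, []

    def flush():
        if buf:
            sections.append({"metadata": dict(meta), "content": "\n".join(buf).strip()})
            buf.clear()

    for line in md.splitlines():
        m = re.match(r"^(#{1,3})\s+(.*)", line)
        if m:
            flush()  # close the previous section under its old header path
            level = len(m.group(1))
            meta[f"h{level}"] = m.group(2)
            # a new h2 invalidates any remembered h3, and so on
            for deeper in range(level + 1, 4):
                meta.pop(f"h{deeper}", None)
        buf.append(line)
    flush()
    return sections

docs = split_by_headers("# Guide\n## Setup\nInstall deps.")
print(docs[-1]["metadata"])  # → {'h1': 'Guide', 'h2': 'Setup'}
```

The key design point, which the LangChain splitter shares, is that header state is hierarchical: each chunk inherits the full path of headers above it, which is what makes query-time section filtering possible.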

For Python code (syntax-aware separators, e.g. class and def boundaries):

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=512,
    chunk_overlap=64,
)

# Splits on class/function boundaries before falling back to lines
code_docs = code_splitter.create_documents([python_source_code])

Step 5: Store and Retrieve With pgvector

Once your chunks are sized correctly, store them with metadata for filtered retrieval.

from langchain_community.vectorstores import PGVector
from langchain_openai import OpenAIEmbeddings

CONNECTION_STRING = "postgresql+psycopg2://user:pass@localhost:5432/ragdb"

db = PGVector.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
    connection_string=CONNECTION_STRING,
    collection_name="docs_512",  # name encodes chunk strategy — makes A/B testing easy
)

# Filtered MMR retrieval: diversity + relevance, scoped to one doc section
retriever = db.as_retriever(
    search_type="mmr",           # Max Marginal Relevance — reduces redundant chunks
    search_kwargs={
        "k": 6,                  # fetch 6 chunks; fits ~3K tokens in most prompts
        "fetch_k": 20,           # MMR candidate pool before re-ranking
        "filter": {"h2": "API Reference"},  # metadata filter from MarkdownHeaderTextSplitter
    },
)

Why MMR matters for context windows: similarity_search often returns 5 nearly identical chunks from the same paragraph. MMR explicitly penalizes redundancy — you get 6 diverse chunks instead of 6 copies of the same sentence, which uses your context window budget far more efficiently.
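MMR itself is simple enough to sketch in a few lines. A minimal, dependency-free version over toy embedding vectors (vectors assumed unit-length; lambda_mult=0.5 matches LangChain's default balance):

```python
def mmr(query_vec: list[float], doc_vecs: list[list[float]],
        k: int = 2, lambda_mult: float = 0.5) -> list[int]:
    """Greedy Max Marginal Relevance: trade query relevance against
    similarity to documents already selected."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = dot(query_vec, doc_vecs[i])
            redundancy = max((dot(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Doc 1 duplicates doc 0; MMR picks doc 0, then skips the duplicate for the diverse doc 2
docs = [[0.8, 0.6], [0.8, 0.6], [0.8, -0.6]]
print(mmr([1.0, 0.0], docs, k=2))  # → [0, 2]
```

Setting lambda_mult=1.0 disables the redundancy penalty and reduces this to plain similarity ranking, which is exactly the failure mode described above.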


Verification

Run this script to confirm your chunk distribution is healthy before indexing at scale:

import statistics

chunk_lengths = [len(doc.page_content) for doc in docs]

print(f"Total chunks:  {len(chunk_lengths)}")
print(f"Mean length:   {statistics.mean(chunk_lengths):.0f} chars")
print(f"Median length: {statistics.median(chunk_lengths):.0f} chars")
print(f"Max length:    {max(chunk_lengths)} chars")
print(f"Min length:    {min(chunk_lengths)} chars")
print(f"Chunks > 2000 chars: {sum(1 for l in chunk_lengths if l > 2000)}")  # flag oversized
print(f"Chunks < 50 chars:   {sum(1 for l in chunk_lengths if l < 50)}")    # flag fragments

You should see:

  • Mean close to your target chunk_size
  • Fewer than 5% of chunks flagged as oversized or fragments
  • Max length under 2× your target (a spike means a paragraph with no natural split points)
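These three checks can be codified into a small gate you run before indexing at scale. A sketch: the target and cutoffs mirror the numbers above and assume character-based lengths, so adjust them if you size chunks in tokens:

```python
import statistics

def chunk_health(lengths: list[int], target: int = 512) -> tuple[bool, dict]:
    """Apply the three health checks to a list of chunk lengths (in chars)."""
    flagged = sum(1 for n in lengths if n > 2000 or n < 50)  # oversized or fragment
    checks = {
        "mean_near_target": abs(statistics.mean(lengths) - target) <= 0.25 * target,
        "flagged_under_5pct": flagged / len(lengths) < 0.05,
        "max_under_2x_target": max(lengths) < 2 * target,
    }
    return all(checks.values()), checks

ok, report = chunk_health([480, 500, 510, 520, 540])
print(ok, report)  # a tight distribution around 512 passes all three checks
```

Wire this into CI so a change to your splitter config cannot silently degrade the chunk distribution.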

What You Learned

  • RecursiveCharacterTextSplitter with chunk_size=512 and chunk_overlap=64 is the correct starting point for most RAG pipelines
  • Token-based length functions prevent silent truncation by embedding models like text-embedding-3-large
  • Semantic chunking improves precision on mixed-topic corpora but costs more at index time
  • Structure-aware splitters (MarkdownHeaderTextSplitter, from_language) preserve document hierarchy as metadata you can filter on at query time
  • MMR retrieval maximizes context window efficiency by penalizing redundant chunks

Tested on LangChain 0.3, LangChain Experimental 0.3, Python 3.12, pgvector 0.7, macOS & Ubuntu 24.04


FAQ

Q: What chunk size should I start with if I have no idea what my documents look like? A: Start with chunk_size=512 and chunk_overlap=64. Index 100 documents, run 20 representative queries, and inspect which chunks get retrieved. Adjust from there — this is always an empirical process.

Q: Does chunk overlap cost extra on embedding APIs? A: Yes — overlap increases total tokens ingested. At chunk_overlap=64 on a 512-token chunk, you're adding roughly 12% to your embedding bill. For most corpora this is negligible; for millions of documents, consider reducing overlap to 32 tokens.
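The ~12% figure above is overlap as a share of each chunk (64/512). Measured against the raw corpus the overhead is slightly higher, because each chunk contributes only chunk_size − overlap new tokens. A quick sketch of the arithmetic:

```python
def overlap_overhead(chunk_size: int, overlap: int) -> float:
    """Extra tokens embedded relative to the raw corpus: each stride of
    (chunk_size - overlap) new tokens re-embeds `overlap` old tokens."""
    return overlap / (chunk_size - overlap)

print(f"{overlap_overhead(512, 64):.1%}")  # → 14.3% extra vs the raw corpus
print(f"{overlap_overhead(512, 32):.1%}")  # → 6.7% after halving overlap
```

Either way the conclusion holds: negligible for most corpora, worth trimming at multi-million-document scale.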

Q: Can I mix chunk strategies in the same vector store? A: Yes, but store them in separate collections (e.g., docs_512_fixed vs docs_semantic). Mixed-strategy collections make A/B testing and debugging much harder.

Q: What's the minimum VRAM needed to run a self-hosted embedding model for chunking? A: nomic-embed-text via Ollama runs on 4GB VRAM and produces 768-dimensional embeddings suitable for most RAG workloads. For production on AWS us-east-1, a g4dn.xlarge (16GB VRAM, ~$0.526/hour USD) handles embedding throughput for most mid-size corpora.

Q: How does chunk strategy affect reranking? A: Rerankers like Cohere Rerank ($1.00 per 1,000 searches USD) operate on your retrieved top-k chunks. Smaller, more precise chunks give the reranker cleaner signal — oversized chunks tend to score inconsistently because the relevant sentence is diluted by surrounding noise.