# Problem: Legal Corpora Break Naive RAG
You've got 10,000 pages of legal documents — contracts, statutes, case law — and a naive vector search chatbot that returns wrong sections, misses context, and hallucinates citations.
You'll learn:
- How to chunk legal documents without destroying meaning
- Why hierarchical retrieval beats flat vector search at scale
- How to build a production-ready pipeline with LangChain, pgvector, and Claude
Time: 60 min | Level: Advanced
## Why This Happens
Legal text breaks standard RAG in three ways. First, a single clause often spans multiple pages with references to earlier definitions — chunk at 512 tokens and you lose the meaning. Second, 10,000 pages means ~5 million tokens in your index; brute-force similarity search becomes slow and noisy. Third, legal language is precise: "shall" vs "may" matters, and embedding models trained on general text don't always capture that distinction.
Common symptoms:
- Chatbot returns a correct-sounding paragraph from the wrong contract
- Answers miss critical exceptions buried in subsections
- Retrieval latency spikes past 3 seconds at query time
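The first failure mode is easy to reproduce. Here is a minimal sketch (the clause text, the `naive_chunks` helper, and the 120-character window are made up for illustration) showing how fixed-size windows separate a defined term from the definition it depends on:

```python
# Naive fixed-size chunking: no awareness of legal structure
def naive_chunks(text: str, size: int) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

clause = (
    '"Confidential Information" means any non-public information disclosed '
    "by either party, excluding information that is independently developed. "
    "The Receiving Party shall not disclose Confidential Information to any "
    "third party without prior written consent of the Disclosing Party."
)

chunks = naive_chunks(clause, size=120)

# The obligation ("shall not disclose") and the definition it depends on
# ("means any non-public information...") land in different chunks, so a
# query about disclosure duties retrieves text whose key term is undefined.
definition_chunk = next(c for c in chunks if "means" in c)
obligation_chunk = next(c for c in chunks if "shall not" in c)
print(chunks.index(definition_chunk), chunks.index(obligation_chunk))  # → 0 1
```

Scale the window up to 512 tokens and the cut still lands wherever the count runs out, just less often; structure-aware splitting (Step 2) fixes the cause rather than the frequency.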
## Solution
### Step 1: Set Up the Environment
```shell
pip install langchain langchain-anthropic langchain-postgres \
  pgvector psycopg2-binary pypdf "unstructured[pdf]" tiktoken
```
You need PostgreSQL 15+ with the pgvector extension enabled:
```sql
CREATE EXTENSION IF NOT EXISTS vector;
```
Expected: CREATE EXTENSION with no errors.
If it fails:
- "could not open extension control file": install pgvector first: `sudo apt install postgresql-15-pgvector`
- Permission denied: run as the postgres superuser
### Step 2: Build a Legal-Aware Document Chunker
Standard recursive chunking ignores document structure. Instead, chunk by legal hierarchy: sections → subsections → paragraphs. This keeps clauses intact and preserves cross-references.
```python
from dataclasses import dataclass
from typing import List

from langchain.text_splitter import RecursiveCharacterTextSplitter
from unstructured.partition.pdf import partition_pdf


@dataclass
class LegalChunk:
    text: str
    doc_id: str
    section: str      # e.g. "Article 12.3(b)"
    page_range: str   # e.g. "pp. 47-48"
    chunk_type: str   # "definition", "obligation", "exception", ...


def partition_legal_pdf(path: str, doc_id: str) -> List[LegalChunk]:
    elements = partition_pdf(
        filename=path,
        strategy="hi_res",  # OCR fallback for scanned docs
        infer_table_structure=True,
    )
    splitter = RecursiveCharacterTextSplitter(
        separators=[
            "\n## ",   # H2 sections
            "\n### ",  # H3 subsections
            "\n\n",    # Paragraphs
            ". ",      # Sentences — last resort
        ],
        chunk_size=800,     # ~150 words keeps most clauses intact
        chunk_overlap=100,  # Overlap catches cross-paragraph references
        length_function=len,
    )
    chunks: List[LegalChunk] = []
    current_section = "Preamble"
    for el in elements:
        # Track section headers for metadata
        if el.category == "Title":
            current_section = el.text.strip()
            continue
        for text in splitter.split_text(el.text):
            chunks.append(LegalChunk(
                text=text,
                doc_id=doc_id,
                section=current_section,
                page_range=str(getattr(el.metadata, "page_number", "?")),
                chunk_type=classify_chunk(text),  # defined below
            ))
    return chunks


def classify_chunk(text: str) -> str:
    # Lightweight keyword classifier — no LLM call needed
    text_lower = text.lower()
    if any(w in text_lower for w in ["means", "defined as", "shall mean"]):
        return "definition"
    if any(w in text_lower for w in ["shall not", "must not", "is prohibited"]):
        return "prohibition"
    # Check exceptions before plain obligations: a clause containing
    # "unless" usually matters because of the carve-out, not the duty
    if any(w in text_lower for w in ["except", "unless", "notwithstanding"]):
        return "exception"
    if any(w in text_lower for w in ["shall", "must", "is required"]):
        return "obligation"
    return "general"
```
Why 800 characters: that's roughly 150 words, enough to hold a typical clause in one piece. The separators matter as much as the size: a fixed 512-token window cuts wherever the count runs out, often mid-clause, while the hierarchy-aware separators above break at structural boundaries first. Going much larger (1,500+ characters) packs several clauses into one chunk and dilutes the retrieval signal.
### Step 3: Index into pgvector with Rich Metadata
Metadata filtering is what separates production RAG from a demo. You want to retrieve only from the relevant contract, time period, or jurisdiction — not all 10,000 pages.
```python
from langchain_postgres import PGVector
from langchain_voyageai import VoyageAIEmbeddings  # pip install langchain-voyageai
from langchain.schema import Document

# Anthropic has no embeddings API; it points users to Voyage AI.
# Use voyage-law-2 if budget allows — it's trained on legal text.
# Otherwise, OpenAI's text-embedding-3-large is a solid general-purpose fallback.
embeddings = VoyageAIEmbeddings(model="voyage-law-2")

vectorstore = PGVector(
    connection="postgresql://user:pass@localhost:5432/legaldb",
    embeddings=embeddings,
    collection_name="legal_chunks",
    use_jsonb=True,  # Enables fast metadata filtering
)
```
```python
def index_chunks(chunks: List[LegalChunk]) -> None:
    docs = [
        Document(
            page_content=chunk.text,
            metadata={
                "doc_id": chunk.doc_id,
                "section": chunk.section,
                "page_range": chunk.page_range,
                "chunk_type": chunk.chunk_type,
            },
        )
        for chunk in chunks
    ]
    # Batch in groups of 200 — avoids embedding API rate limits
    batch_size = 200
    for i in range(0, len(docs), batch_size):
        vectorstore.add_documents(docs[i : i + batch_size])
        print(f"Indexed {min(i + batch_size, len(docs))}/{len(docs)}")
```
For 10,000 pages (roughly 30,000 chunks at 800 characters with 100-character overlap), parsing plus embedding can take a few hours on a standard API tier once hi_res OCR is involved. Run it overnight.
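A quick back-of-envelope check on corpus sizing; the 2,000 characters-per-page figure is an assumption for dense single-column legal text, so adjust it for your corpus:

```python
pages = 10_000
chars_per_page = 2_000   # assumption: dense single-column legal text
chunk_size = 800
overlap = 100

total_chars = pages * chars_per_page
# With overlap, each new chunk advances only (size - overlap) characters
effective_step = chunk_size - overlap
n_chunks = total_chars // effective_step
print(f"{n_chunks:,} chunks")  # → 28,571 chunks
```

Chunk count drives both embedding cost and index size, so it's worth running this arithmetic before committing to a chunking scheme.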
### Step 4: Build a Two-Stage Retriever
Flat vector search over tens of thousands of chunks returns noisy results. Use a two-stage approach: broad semantic search → cross-encoder reranking. This cuts irrelevant results by ~60% in legal benchmarks.
```python
from typing import Optional

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder


def build_retriever(doc_id: Optional[str] = None):
    # Stage 1: Vector search — broad, fast, top-20
    search_kwargs = {"k": 20}
    if doc_id:
        # Filter to a specific document if provided
        search_kwargs["filter"] = {"doc_id": doc_id}
    base_retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs=search_kwargs,
    )
    # Stage 2: Cross-encoder reranking — precise, top-5
    # ms-marco-MiniLM-L-6-v2 is fast and good enough for legal text
    reranker = CrossEncoderReranker(
        model=HuggingFaceCrossEncoder(
            model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
        ),
        top_n=5,
    )
    return ContextualCompressionRetriever(
        base_compressor=reranker,
        base_retriever=base_retriever,
    )
```
Why reranking matters here: Embedding similarity is good at topic matching but poor at legal precision. "The lessor shall maintain" and "the lessee shall maintain" have nearly identical embeddings. The cross-encoder reads both query and document together, catching that distinction.
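A toy illustration of that failure mode, using word-count vectors as a crude stand-in for a bi-encoder embedding (real embedding models are dense and contextual, but the token-overlap effect is the same):

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words vectors
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb)


lessor = Counter("the lessor shall maintain the premises".split())
lessee = Counter("the lessee shall maintain the premises".split())

# Opposite parties, near-identical vectors
print(round(cosine(lessor, lessee), 3))  # → 0.875
```

A single-vector similarity score has no mechanism to weight the one token that flips who bears the duty; a cross-encoder, attending over query and passage jointly, can.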
### Step 5: Wire Up the RAG Chain with Citation Enforcement
Legal chatbots must cite sources. Build citation enforcement into the prompt — don't rely on the model to volunteer it.
```python
from typing import Optional

from langchain_anthropic import ChatAnthropic
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

LEGAL_RAG_PROMPT = PromptTemplate.from_template("""
You are a precise legal research assistant. Answer using ONLY the provided context.
Rules:
- Cite every claim using the [Doc: ..., Section: ..., p. ...] header shown above each excerpt
- If the context doesn't answer the question, say "Not found in provided documents"
- Never infer or extrapolate beyond the text
- Flag obligations vs exceptions explicitly
Context:
{context}
Question: {question}
Answer:
""")

# Prefix each retrieved chunk with its metadata, so the model has
# something concrete to cite — page_content alone carries no doc_id
DOCUMENT_PROMPT = PromptTemplate.from_template(
    "[Doc: {doc_id}, Section: {section}, p. {page_range}]\n{page_content}"
)

llm = ChatAnthropic(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    temperature=0,  # Zero temp for factual legal work
)


def build_rag_chain(doc_id: Optional[str] = None):
    retriever = build_retriever(doc_id)
    return RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # Stuff all reranked chunks into one context
        retriever=retriever,
        return_source_documents=True,
        chain_type_kwargs={
            "prompt": LEGAL_RAG_PROMPT,
            "document_prompt": DOCUMENT_PROMPT,
        },
    )


# Usage
chain = build_rag_chain(doc_id="master-services-agreement-2024")
result = chain.invoke({"query": "What are the indemnification limits?"})
print(result["result"])
print("\n--- Sources ---")
for doc in result["source_documents"]:
    print(f"  {doc.metadata['section']} ({doc.metadata['page_range']})")
```
### Step 6: Add a Query Router for Multi-Document Queries
When users ask cross-document questions ("Does this NDA conflict with the MSA?"), you need to query multiple indexes and synthesize. A simple intent classifier handles routing.
```python
import json

from langchain_anthropic import ChatAnthropic
from langchain.schema import HumanMessage

router_llm = ChatAnthropic(model="claude-haiku-4-5-20251001", max_tokens=100)


def route_query(question: str, available_docs: list[str]) -> list[str]:
    """Returns the list of doc_ids to query."""
    response = router_llm.invoke([
        HumanMessage(content=f"""
Given this question: "{question}"
Available documents: {available_docs}
Return a JSON array of document IDs most relevant to answer this.
Return ALL documents if the question is cross-document. Max 3.
Return only the JSON array, nothing else.
""")
    ])
    # Strip markdown fences in case the model wraps its output anyway
    raw = response.content.strip().strip("`").removeprefix("json").strip()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return available_docs[:3]  # Fallback: query top 3


def query_multi_doc(question: str, available_docs: list[str]) -> dict:
    target_docs = route_query(question, available_docs)
    all_results = []
    for doc_id in target_docs:
        chain = build_rag_chain(doc_id)
        result = chain.invoke({"query": question})
        all_results.append({
            "doc_id": doc_id,
            "answer": result["result"],
            "sources": result["source_documents"],
        })
    # Synthesize if multiple docs were queried
    if len(all_results) > 1:
        return synthesize_cross_doc(question, all_results)
    return all_results[0]
```
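`synthesize_cross_doc` is left undefined above. One possible sketch, assuming the result-dict shape from `query_multi_doc` and reusing the `llm` instance from Step 5 (the prompt wording is an illustration, not a tested template):

```python
def build_synthesis_prompt(question: str, results: list[dict]) -> str:
    # Collate the per-document answers into one labeled context block
    sections = [f"--- {r['doc_id']} ---\n{r['answer']}" for r in results]
    return (
        "Synthesize one answer to the question below from these per-document "
        "findings. Note agreements, conflicts, and gaps. Keep all citations.\n\n"
        + "\n\n".join(sections)
        + f"\n\nQuestion: {question}\n\nSynthesis:"
    )


def synthesize_cross_doc(question: str, results: list[dict]) -> dict:
    prompt = build_synthesis_prompt(question, results)
    response = llm.invoke(prompt)  # `llm` is the ChatAnthropic from Step 5
    return {
        "doc_id": [r["doc_id"] for r in results],
        "answer": response.content,
        "sources": [s for r in results for s in r["sources"]],
    }
```

Keeping the prompt-building separate from the API call makes the synthesis step testable without network access.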
## Verification
```shell
python -c "
from your_module import build_rag_chain
chain = build_rag_chain()
result = chain.invoke({'query': 'What is the governing law?'})
print(result['result'])
print(f'Sources: {len(result[\"source_documents\"])} chunks retrieved')
"
```
You should see: A cited answer with section references and 3-5 source chunks. Query latency should be under 4 seconds (2s retrieval + 2s generation).
If latency is high:
- Reranking slow: switch to `FlashrankRerank` — 5x faster than the cross-encoder with minimal quality loss
- Embedding calls slow: cache query embeddings with Redis; queries repeat more than you'd expect
- pgvector slow: add an HNSW index (langchain_postgres stores vectors in the `langchain_pg_embedding` table):

```sql
CREATE INDEX ON langchain_pg_embedding USING hnsw (embedding vector_cosine_ops);
```
## What You Learned
- Chunk at 800 chars using legal structure separators — not fixed token windows
- Store `doc_id`, `section`, `page_range`, and `chunk_type` in metadata — they're what make precise filtering and citations possible
- Two-stage retrieval (vector + reranker) is non-negotiable once you're past tens of thousands of chunks
- Citation enforcement belongs in the prompt, not in post-processing
- Use `temperature=0` for legal work — every time
Limitations: This pipeline handles English-language, digitally native PDFs well. Scanned documents rely on the `strategy="hi_res"` OCR path and need an OCR quality check step. Non-English corpora need jurisdiction-specific embedding models.
When NOT to use this: If your corpus is under 500 pages, a simple vector store with no reranker is fast enough and cheaper to run.
Tested on Python 3.12, LangChain 0.3, pgvector 0.7, Claude Sonnet 4.6, Ubuntu 24.04