Problem: Your RAG Retrieval Is Imprecise
You've built a RAG pipeline, but retrieval keeps pulling irrelevant chunks. Semantic similarity alone isn't enough — your chatbot answers questions about Q3 reports with data from Q1, or returns docs from the wrong department entirely.
The fix isn't a better embedding model. It's metadata filtering.
You'll learn:
- Why semantic search alone fails at scale
- How to prompt an LLM to extract structured metadata from any document
- How to attach that metadata to vector store payloads for precise pre-filtering
Time: 20 min | Level: Intermediate
Why This Happens
Vector similarity finds semantically close chunks — but "close" is relative. When your corpus has 10,000 documents across five departments and three years, a query like "What was the refund policy?" might match chunks from 2021, 2023, and 2025 with nearly identical scores.
Metadata filtering solves this by letting you scope retrieval before the similarity search runs. Instead of searching all 10,000 chunks, you search only the 200 that match department=support and year=2025.
Common symptoms without metadata filtering:
- Top-k retrieval returns chunks from wrong time periods or departments
- Increasing `k` makes answers worse, not better
- Users have to over-specify queries to get relevant results
The problem is that metadata is tedious to write manually — so most teams skip it. That's where LLMs come in.
Solution
Step 1: Define Your Metadata Schema
Before prompting anything, decide what fields actually matter for filtering. Keep it to 5–8 fields maximum — more than that creates noise.
```python
# metadata_schema.py
from dataclasses import dataclass
from typing import Optional

@dataclass
class DocumentMetadata:
    # Core filter fields — these drive retrieval precision
    doc_type: str      # "policy", "report", "faq", "contract"
    department: str    # "hr", "finance", "legal", "support"
    date_period: str   # "2024-Q3", "2025-01", "2026"
    audience: str      # "internal", "customer", "executive"

    # Secondary fields — useful for ranking, not filtering
    topic_tags: list[str]  # ["refunds", "billing", "cancellation"]
    sensitivity: str       # "public", "internal", "confidential"

    # Optional — only if your corpus needs it
    product: Optional[str] = None  # "pro-plan", "enterprise"
    region: Optional[str] = None   # "us", "eu", "apac"
```
Why this schema: doc_type and department are the highest-leverage filters. They eliminate 70–80% of irrelevant chunks before similarity even runs. Date period lets you answer "latest policy" queries correctly.
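One practical benefit of keeping the schema as a dataclass: it serializes straight into a vector-store payload with `dataclasses.asdict`. A minimal sketch (using a trimmed copy of the schema above; the field values are invented for illustration):

```python
from dataclasses import dataclass, field, asdict
from typing import Optional

# Trimmed copy of the schema above, just enough to show serialization
@dataclass
class DocumentMetadata:
    doc_type: str
    department: str
    date_period: str
    audience: str
    topic_tags: list[str] = field(default_factory=list)
    sensitivity: str = "internal"
    product: Optional[str] = None
    region: Optional[str] = None

meta = DocumentMetadata(
    doc_type="policy",
    department="hr",
    date_period="2025",
    audience="internal",
    topic_tags=["parental-leave", "benefits"],
)

payload = asdict(meta)  # plain dict, ready to attach as a vector-store payload
```

Typos in field names fail at construction time instead of silently producing an unfilterable payload, which is the point of defining the schema in code at all.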
Step 2: Build the Metadata Extraction Prompt
The prompt is the core of this approach. You need structured output, strict field constraints, and a fallback for unknown values.
```python
# extractor.py
import json
from anthropic import Anthropic

client = Anthropic()

EXTRACTION_PROMPT = """Extract structured metadata from this document chunk.
Return ONLY valid JSON matching this exact schema — no explanation, no markdown:

{
  "doc_type": "<one of: policy|report|faq|contract|guide|other>",
  "department": "<one of: hr|finance|legal|support|engineering|marketing|other>",
  "date_period": "<YYYY-QN or YYYY-MM or YYYY or 'unknown'>",
  "audience": "<one of: internal|customer|executive|public>",
  "topic_tags": ["<tag1>", "<tag2>"],
  "sensitivity": "<one of: public|internal|confidential>",
  "product": "<product name or null>",
  "region": "<one of: us|eu|apac|global|null>"
}

Rules:
- Use "other" or "unknown" when you can't determine a field — never guess
- topic_tags: 2–5 lowercase tags describing the core subject matter
- Infer date_period from content clues (fiscal year mentions, version numbers, etc.)

Document chunk:
"""

def extract_metadata(chunk: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": EXTRACTION_PROMPT + chunk
        }]
    )
    raw = response.content[0].text.strip()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: return safe defaults rather than crashing the pipeline
        return {
            "doc_type": "other",
            "department": "other",
            "date_period": "unknown",
            "audience": "internal",
            "topic_tags": [],
            "sensitivity": "internal",
            "product": None,
            "region": None,
        }
```
Why structured constraints matter: Open-ended prompts produce inconsistent values ("HR" vs "human-resources" vs "hr"). Enums in the prompt enforce a vocabulary your filter queries can rely on.
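Even with enums in the prompt, the model occasionally emits an out-of-vocabulary value. A cheap post-extraction guard catches these before they reach the index — a sketch using a hypothetical `validate_metadata` helper (not part of the extractor above) that lowercases each enum field and coerces anything unexpected back to the safe defaults:

```python
# Allowed vocabularies mirror the enums in the extraction prompt
ALLOWED = {
    "doc_type": {"policy", "report", "faq", "contract", "guide", "other"},
    "department": {"hr", "finance", "legal", "support",
                   "engineering", "marketing", "other"},
    "audience": {"internal", "customer", "executive", "public"},
    "sensitivity": {"public", "internal", "confidential"},
}
FALLBACK = {"doc_type": "other", "department": "other",
            "audience": "internal", "sensitivity": "internal"}

def validate_metadata(meta: dict) -> dict:
    """Lowercase enum fields and coerce out-of-vocabulary values to defaults."""
    cleaned = dict(meta)
    for field_name, allowed in ALLOWED.items():
        value = str(cleaned.get(field_name, "")).strip().lower()
        cleaned[field_name] = value if value in allowed else FALLBACK[field_name]
    return cleaned
```

Run extractions through this before indexing; it turns "HR" into "hr" and "human-resources" into "other" so your filter queries never miss on a casing or synonym mismatch.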
Step 3: Attach Metadata to Your Vector Store
This example uses Qdrant, but the pattern works with Pinecone, Weaviate, or pgvector — they all support payload/metadata filtering.
```python
# indexer.py
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams, Distance
import uuid

client_qdrant = QdrantClient(host="localhost", port=6333)

def index_document_with_metadata(
    chunk: str,
    embedding: list[float],
    metadata: dict,
    source_file: str
) -> str:
    point_id = str(uuid.uuid4())

    # Merge extracted metadata with source info
    payload = {
        **metadata,
        "source_file": source_file,
        "chunk_text": chunk,  # Store for retrieval display
    }

    client_qdrant.upsert(
        collection_name="documents",
        points=[
            PointStruct(
                id=point_id,
                vector=embedding,
                payload=payload  # Qdrant stores this alongside the vector
            )
        ]
    )
    return point_id
```
Now retrieval can pre-filter before similarity search:
```python
# retriever.py
from qdrant_client.models import Filter, FieldCondition, MatchValue
from indexer import client_qdrant  # reuse the Qdrant client from indexer.py

def filtered_search(
    query_embedding: list[float],
    department: str,
    doc_type: str,
    top_k: int = 5
) -> list[dict]:
    results = client_qdrant.search(
        collection_name="documents",
        query_vector=query_embedding,
        query_filter=Filter(
            must=[
                # Only search within matching department + doc_type
                FieldCondition(key="department", match=MatchValue(value=department)),
                FieldCondition(key="doc_type", match=MatchValue(value=doc_type)),
            ]
        ),
        limit=top_k,
        with_payload=True
    )
    return [
        {"text": r.payload["chunk_text"], "score": r.score, "meta": r.payload}
        for r in results
    ]
```
Expected: Filtered queries run faster and return tighter results. On a 50K-chunk corpus, filtering by department + doc_type typically reduces the search space by 85–95%.
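The filter-then-rank pattern is easy to see without a vector store at all. A toy in-memory sketch (the points, IDs, and 2-dimensional embeddings are made up, and dot product stands in for real cosine similarity):

```python
def filtered_search_demo(points, query_vec, department, doc_type, top_k=2):
    """Toy pre-filtering: drop non-matching payloads first,
    then rank only the survivors by dot-product similarity."""
    candidates = [
        p for p in points
        if p["payload"]["department"] == department
        and p["payload"]["doc_type"] == doc_type
    ]
    candidates.sort(
        key=lambda p: sum(a * b for a, b in zip(p["vector"], query_vec)),
        reverse=True,
    )
    return candidates[:top_k]

points = [
    {"vector": [0.9, 0.1], "payload": {"department": "support", "doc_type": "policy", "id": "A"}},
    {"vector": [0.8, 0.2], "payload": {"department": "finance", "doc_type": "report", "id": "B"}},
    {"vector": [0.1, 0.9], "payload": {"department": "support", "doc_type": "policy", "id": "C"}},
]

hits = filtered_search_demo(points, query_vec=[1.0, 0.0],
                            department="support", doc_type="policy")
print([h["payload"]["id"] for h in hits])  # ['A', 'C']
```

Point B never enters the similarity ranking despite a high score against the query vector — exactly what the Qdrant `query_filter` does at scale, just with an index instead of a list scan.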
Step 4: Auto-Detect Filter Intent from User Queries
You don't want users to manually specify department and doc_type. Use a second lightweight LLM call to infer filter intent from the query itself.
```python
# query_parser.py
import json
from extractor import client  # reuse the Anthropic client from extractor.py

FILTER_INTENT_PROMPT = """Given this user query, extract filter intent for document retrieval.
Return ONLY valid JSON:

{
  "department": "<hr|finance|legal|support|engineering|marketing|other|null>",
  "doc_type": "<policy|report|faq|contract|guide|other|null>",
  "date_preference": "<latest|specific_period|any>"
}

Use null when the query gives no signal for a field.

Query: """

def parse_query_intent(query: str) -> dict:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Fast + cheap for intent parsing
        max_tokens=128,
        messages=[{"role": "user", "content": FILTER_INTENT_PROMPT + query}]
    )
    try:
        return json.loads(response.content[0].text.strip())
    except json.JSONDecodeError:
        return {"department": None, "doc_type": None, "date_preference": "any"}
```
Use Haiku here — it's fast enough for latency-sensitive query paths and the task is simple classification.
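To wire intent parsing into retrieval, build filter conditions only from fields that carry signal. A sketch with a hypothetical `intent_to_conditions` helper (not defined above); null or "other" means "don't filter on this", so a vague query falls back to unscoped search rather than being wrongly scoped:

```python
def intent_to_conditions(intent: dict) -> list[tuple[str, str]]:
    """Turn parsed query intent into (field, value) filter pairs,
    skipping fields that carry no usable signal."""
    conditions = []
    for field_name in ("department", "doc_type"):
        value = intent.get(field_name)
        if value and value not in ("other", "null", "unknown"):
            conditions.append((field_name, value))
    return conditions

print(intent_to_conditions({"department": "hr", "doc_type": None,
                            "date_preference": "latest"}))  # [('department', 'hr')]
```

Each returned pair maps to one `FieldCondition` in the `must` clause of the filtered search from Step 3; an empty list means run the search with no filter at all.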
Verification
Run this end-to-end test with a known document:
```bash
python -c "
from extractor import extract_metadata

sample = '''
This HR policy document outlines the 2025 parental leave benefits
for all full-time employees in the US region. Effective January 2025.
'''

result = extract_metadata(sample)
print(result)
"
```
You should see:
```json
{
  "doc_type": "policy",
  "department": "hr",
  "date_period": "2025",
  "audience": "internal",
  "topic_tags": ["parental-leave", "benefits", "hr-policy"],
  "sensitivity": "internal",
  "product": null,
  "region": "us"
}
```
If it fails:
- JSONDecodeError on good chunks: Add `temperature=0` to your API call — higher temps produce inconsistent formatting
- All fields return "other": Your chunks may be too short. Aim for 300–800 tokens per chunk before extracting metadata
- Wrong department detected: Add 2–3 few-shot examples to your prompt showing your specific domain vocabulary
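Adding few-shot examples is plain string assembly: insert them between the instructions and the chunk so the model sees the output shape before the input. A sketch with an invented finance example and a hypothetical `build_prompt` helper:

```python
FEW_SHOT_EXAMPLES = """Example:
Document chunk: "FY2024 budget variance summary for the controller's office."
Output: {"doc_type": "report", "department": "finance", "date_period": "2024", "audience": "internal", "topic_tags": ["budget", "variance"], "sensitivity": "internal", "product": null, "region": null}

"""

def build_prompt(base_prompt: str, chunk: str, examples: str = FEW_SHOT_EXAMPLES) -> str:
    """Insert few-shot examples between the instructions and the chunk."""
    return base_prompt + examples + chunk

prompt = build_prompt("Extract structured metadata.\n\n", "Chunk text here.")
```

Two or three examples drawn from your actual corpus usually fix domain-vocabulary misses; more than that mostly adds tokens without adding accuracy.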
What You Learned
- Metadata filtering scopes retrieval before similarity search runs — it's a multiplier on retrieval quality, not a replacement for good embeddings
- Enum constraints in extraction prompts are non-negotiable — open fields create retrieval bugs you won't catch until production
- Use a fast model (Haiku) for query intent parsing and a capable model (Opus/Sonnet) for document extraction — they have different accuracy requirements
- The `"unknown"` fallback is critical: unfiltered results are better than wrong filters blocking valid chunks
Limitation: This approach works best when your corpus has consistent domain vocabulary. Mixed-language corpora or heavily technical jargon may need domain-specific few-shot examples in the extraction prompt.
When NOT to use this: If your corpus is small (<1K documents) and homogeneous, semantic search alone is probably sufficient. Add metadata when you start seeing retrieval confusion at scale.
Tested on Python 3.12, anthropic SDK 0.40+, Qdrant 1.9, Claude Haiku 4.5 and Opus 4.6