Use AI to Auto-Generate Metadata for Better RAG Filtering

Stop manually tagging documents. Learn how to use LLMs to auto-generate structured metadata that makes RAG retrieval dramatically more accurate.

Problem: Your RAG Retrieval Is Imprecise

You've built a RAG pipeline, but retrieval keeps pulling irrelevant chunks. Semantic similarity alone isn't enough — your chatbot answers questions about Q3 reports with data from Q1, or returns docs from the wrong department entirely.

The fix isn't a better embedding model. It's metadata filtering.

You'll learn:

  • Why semantic search alone fails at scale
  • How to prompt an LLM to extract structured metadata from any document
  • How to attach that metadata to vector store payloads for precise pre-filtering

Time: 20 min | Level: Intermediate


Why This Happens

Vector similarity finds semantically close chunks — but "close" is relative. When your corpus has 10,000 documents across five departments and three years, a query like "What was the refund policy?" might match chunks from 2021, 2023, and 2025 with nearly identical scores.

Metadata filtering solves this by letting you scope retrieval before the similarity search runs. Instead of searching all 10,000 chunks, you search only the 200 that match department=support and year=2025.

Common symptoms without metadata filtering:

  • Top-k retrieval returns chunks from wrong time periods or departments
  • Increasing k makes answers worse, not better
  • Users have to over-specify queries to get relevant results

The problem is that metadata is tedious to write manually — so most teams skip it. That's where LLMs come in.
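Here's the pre-filtering idea in miniature — a toy sketch with hand-made two-dimensional vectors, using a dot product as a stand-in for a real embedding model:

```python
# Toy corpus: two refund-policy chunks from different years score nearly
# identically on similarity alone; a metadata filter resolves the tie.
chunks = [
    {"year": 2021, "dept": "support", "vec": [0.90, 0.10]},  # old refund policy
    {"year": 2025, "dept": "support", "vec": [0.88, 0.12]},  # current refund policy
    {"year": 2025, "dept": "hr",      "vec": [0.10, 0.90]},  # unrelated HR doc
]

def search(query_vec, year=None, dept=None, k=2):
    # Pre-filter by metadata, then rank only the survivors by similarity
    pool = [c for c in chunks
            if (year is None or c["year"] == year)
            and (dept is None or c["dept"] == dept)]
    score = lambda c: sum(q * v for q, v in zip(query_vec, c["vec"]))
    return sorted(pool, key=score, reverse=True)[:k]

print([c["year"] for c in search([1.0, 0.0])])
# → [2021, 2025]: both policies surface, near-tied at 0.90 vs 0.88
print([c["year"] for c in search([1.0, 0.0], year=2025, dept="support")])
# → [2025]: the filter removed the ambiguity before ranking even ran
```

The real pipeline below does exactly this, with an LLM generating the `year`/`dept`-style fields and a vector store applying the filter.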


Solution

Step 1: Define Your Metadata Schema

Before prompting anything, decide what fields actually matter for filtering. Keep it to 5–8 fields maximum — more than that creates noise.

# metadata_schema.py
from dataclasses import dataclass
from typing import Optional

@dataclass
class DocumentMetadata:
    # Core filter fields — these drive retrieval precision
    doc_type: str          # "policy", "report", "faq", "contract"
    department: str        # "hr", "finance", "legal", "support"
    date_period: str       # "2024-Q3", "2025-01", "2026"
    audience: str          # "internal", "customer", "executive"
    
    # Secondary fields — useful for ranking, not filtering
    topic_tags: list[str]  # ["refunds", "billing", "cancellation"]
    sensitivity: str       # "public", "internal", "confidential"
    
    # Optional — only if your corpus needs it
    product: Optional[str] = None   # "pro-plan", "enterprise"
    region: Optional[str] = None    # "us", "eu", "apac"

Why this schema: doc_type and department are the highest-leverage filters. They eliminate 70–80% of irrelevant chunks before similarity even runs. Date period lets you answer "latest policy" queries correctly.


Step 2: Build the Metadata Extraction Prompt

The prompt is the core of this approach. You need structured output, strict field constraints, and a fallback for unknown values.

# extractor.py
import json
from anthropic import Anthropic

client = Anthropic()

EXTRACTION_PROMPT = """Extract structured metadata from this document chunk.

Return ONLY valid JSON matching this exact schema — no explanation, no markdown:
{
  "doc_type": "<one of: policy|report|faq|contract|guide|other>",
  "department": "<one of: hr|finance|legal|support|engineering|marketing|other>",
  "date_period": "<YYYY-QN or YYYY-MM or YYYY or 'unknown'>",
  "audience": "<one of: internal|customer|executive|public>",
  "topic_tags": ["<tag1>", "<tag2>"],
  "sensitivity": "<one of: public|internal|confidential>",
  "product": "<product name or null>",
  "region": "<one of: us|eu|apac|global|null>"
}

Rules:
- Use "other" or "unknown" when you can't determine a field — never guess
- topic_tags: 2–5 lowercase tags describing the core subject matter
- Infer date_period from content clues (fiscal year mentions, version numbers, etc.)

Document chunk:
"""

def extract_metadata(chunk: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": EXTRACTION_PROMPT + chunk
        }]
    )
    
    raw = response.content[0].text.strip()
    # Strip markdown fences if the model adds them despite the prompt
    if raw.startswith("```"):
        raw = raw.strip("`").removeprefix("json").strip()
    
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: return safe defaults rather than crashing the pipeline
        return {
            "doc_type": "other",
            "department": "other", 
            "date_period": "unknown",
            "audience": "internal",
            "topic_tags": [],
            "sensitivity": "internal",
            "product": None,
            "region": None
        }

Why structured constraints matter: Open-ended prompts produce inconsistent values ("HR" vs "human-resources" vs "hr"). Enums in the prompt enforce a vocabulary your filter queries can rely on.
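Even with enums in the prompt, models occasionally drift ("HR", trailing whitespace, a capitalized tag). A cheap post-extraction validator can coerce out-of-vocabulary values back to safe defaults — a sketch assuming the Step 1 schema:

```python
# validate.py — normalize extracted values against the schema vocabulary.
# Each field maps to (allowed values, fallback default).
ALLOWED = {
    "doc_type":    ({"policy", "report", "faq", "contract", "guide", "other"}, "other"),
    "department":  ({"hr", "finance", "legal", "support", "engineering", "marketing", "other"}, "other"),
    "audience":    ({"internal", "customer", "executive", "public"}, "internal"),
    "sensitivity": ({"public", "internal", "confidential"}, "internal"),
}

def validate_metadata(meta: dict) -> dict:
    cleaned = dict(meta)
    for field, (allowed, default) in ALLOWED.items():
        value = str(cleaned.get(field, "")).strip().lower()
        cleaned[field] = value if value in allowed else default
    # Normalize tags too: lowercase strings, capped at 5 per the prompt rules
    cleaned["topic_tags"] = [str(t).lower() for t in cleaned.get("topic_tags") or []][:5]
    return cleaned
```

Run this on every `extract_metadata` result before indexing, so filter queries only ever see vocabulary values.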


Step 3: Attach Metadata to Your Vector Store

This example uses Qdrant, but the pattern works with Pinecone, Weaviate, or pgvector — they all support payload/metadata filtering.

# indexer.py
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams, Distance
import uuid

client_qdrant = QdrantClient(host="localhost", port=6333)

def index_document_with_metadata(
    chunk: str,
    embedding: list[float],
    metadata: dict,
    source_file: str
) -> str:
    point_id = str(uuid.uuid4())
    
    # Merge extracted metadata with source info
    payload = {
        **metadata,
        "source_file": source_file,
        "chunk_text": chunk,  # Store for retrieval display
    }
    
    client_qdrant.upsert(
        collection_name="documents",
        points=[
            PointStruct(
                id=point_id,
                vector=embedding,
                payload=payload  # Qdrant stores this alongside the vector
            )
        ]
    )
    
    return point_id
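
Wiring Steps 2 and 3 together is a simple batch loop. A sketch with the callables injected — in this tutorial they'd be `extract_metadata`, your embedding call, and `index_document_with_metadata` — so the loop stays independent of any particular model or store:

```python
# pipeline.py — glue between extraction (Step 2) and indexing (Step 3).
from typing import Callable

def index_corpus(
    chunks: list[str],
    source_file: str,
    extract_fn: Callable[[str], dict],
    embed_fn: Callable[[str], list[float]],
    index_fn: Callable[[str, list[float], dict, str], str],
) -> list[str]:
    point_ids = []
    for chunk in chunks:
        metadata = extract_fn(chunk)   # one LLM call per chunk
        embedding = embed_fn(chunk)    # one embedding call per chunk
        point_ids.append(index_fn(chunk, embedding, metadata, source_file))
    return point_ids
```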

Now retrieval can pre-filter before similarity search:

# retriever.py
from qdrant_client.models import Filter, FieldCondition, MatchValue

from indexer import client_qdrant  # reuse the client from Step 3

def filtered_search(
    query_embedding: list[float],
    department: str,
    doc_type: str,
    top_k: int = 5
) -> list[dict]:
    
    results = client_qdrant.search(
        collection_name="documents",
        query_vector=query_embedding,
        query_filter=Filter(
            must=[
                # Only search within matching department + doc_type
                FieldCondition(key="department", match=MatchValue(value=department)),
                FieldCondition(key="doc_type", match=MatchValue(value=doc_type)),
            ]
        ),
        limit=top_k,
        with_payload=True
    )
    
    return [
        {"text": r.payload["chunk_text"], "score": r.score, "meta": r.payload}
        for r in results
    ]

Expected: Filtered queries run faster and return tighter results. On a 50K-chunk corpus, filtering by department + doc_type typically reduces the search space by 85–95%.


Step 4: Auto-Detect Filter Intent from User Queries

You don't want users to manually specify department and doc_type. Use a second lightweight LLM call to infer filter intent from the query itself.

# query_parser.py
import json

from extractor import client  # reuse the Anthropic client from Step 2
FILTER_INTENT_PROMPT = """Given this user query, extract filter intent for document retrieval.

Return ONLY valid JSON:
{
  "department": "<hr|finance|legal|support|engineering|marketing|other|null>",
  "doc_type": "<policy|report|faq|contract|guide|other|null>",
  "date_preference": "<latest|specific_period|any>"
}

Use null when the query gives no signal for a field.

Query: """

def parse_query_intent(query: str) -> dict:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Fast + cheap for intent parsing
        max_tokens=128,
        messages=[{"role": "user", "content": FILTER_INTENT_PROMPT + query}]
    )
    
    try:
        return json.loads(response.content[0].text.strip())
    except json.JSONDecodeError:
        return {"department": None, "doc_type": None, "date_preference": "any"}

Use Haiku here — it's fast enough for latency-sensitive query paths and the task is simple classification.
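The intent output then plugs into Step 3's filter. A sketch that skips any field the parser couldn't resolve, expressed as plain dicts so the shape is store-agnostic — for Qdrant, map each entry to `FieldCondition(key=..., match=MatchValue(value=...))` as in Step 3:

```python
# query_pipeline.py — turn parsed intent into filter conditions, skipping
# unresolved fields. An empty list means "search unfiltered", which is
# better than a wrong filter blocking valid chunks.
def build_filter_conditions(intent: dict) -> list[dict]:
    return [
        {"key": field, "match": {"value": intent[field]}}
        for field in ("department", "doc_type")
        if intent.get(field) not in (None, "null", "other")
    ]
```

When the list comes back empty, pass `query_filter=None` and fall back to pure semantic search.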


Verification

Run this end-to-end test with a known document:

python -c "
from extractor import extract_metadata
sample = '''
This HR policy document outlines the 2025 parental leave benefits 
for all full-time employees in the US region. Effective January 2025.
'''
result = extract_metadata(sample)
print(result)
"

You should see output close to this (exact topic tags may vary between runs):

{
  "doc_type": "policy",
  "department": "hr",
  "date_period": "2025",
  "audience": "internal",
  "topic_tags": ["parental-leave", "benefits", "hr-policy"],
  "sensitivity": "internal",
  "product": null,
  "region": "us"
}

If it fails:

  • JSONDecodeError on good chunks: Add temperature=0 to your API call — higher temps produce inconsistent formatting
  • All fields return "other": Your chunks may be too short. Aim for 300–800 tokens per chunk before extracting metadata
  • Wrong department detected: Add 2–3 few-shot examples to your prompt showing your specific domain vocabulary
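
For that last case, few-shot examples can be spliced in programmatically rather than hand-edited into the prompt string. A sketch — the example chunk and labels below are made-up placeholders; substitute labeled documents from your own corpus:

```python
# few_shot.py — insert labeled examples just before the final
# "Document chunk:" marker of the extraction prompt.
import json

def prompt_with_examples(base_prompt: str, examples: list[tuple[str, dict]]) -> str:
    shots = "\n\n".join(
        f"Example:\nDocument chunk: {text}\nMetadata: {json.dumps(label)}"
        for text, label in examples
    )
    return base_prompt.replace("Document chunk:", shots + "\n\nDocument chunk:")

augmented = prompt_with_examples(
    "Rules:\n- Use \"other\" when unsure\n\nDocument chunk:\n",
    [("Our SOC 2 audit scope for FY2025 covers all production systems.",
      {"doc_type": "report", "department": "legal", "date_period": "2025"})],
)
```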

What You Learned

  • Metadata filtering scopes retrieval before similarity search runs — it's a multiplier on retrieval quality, not a replacement for good embeddings
  • Enum constraints in extraction prompts are non-negotiable — open fields create retrieval bugs you won't catch until production
  • Use a fast model (Haiku) for query intent parsing and a capable model (Opus/Sonnet) for document extraction — they have different accuracy requirements
  • The "unknown" fallback is critical: unfiltered results are better than wrong filters blocking valid chunks

Limitation: This approach works best when your corpus has consistent domain vocabulary. Mixed-language corpora or heavily technical jargon may need domain-specific few-shot examples in the extraction prompt.

When NOT to use this: If your corpus is small (<1K documents) and homogeneous, semantic search alone is probably sufficient. Add metadata when you start seeing retrieval confusion at scale.


Tested on Python 3.12, anthropic SDK 0.40+, Qdrant 1.9, Claude Haiku 4.5 and Opus 4.6