Build Agentic RAG: Self-Querying and Adaptive Retrieval 2026

Build agentic RAG with self-querying and adaptive retrieval in Python 3.12 + LangChain. Covers SelfQueryRetriever, query rewriting, and multi-step retrieval loops.

Problem: Your RAG Pipeline Returns Irrelevant Chunks

Agentic RAG with self-querying and adaptive retrieval fixes the core failure of naive RAG: a single static vector search that can't handle multi-faceted questions, filter conditions, or follow-up reasoning.

Here's the symptom. You ask "What are the cheapest PostgreSQL-compatible databases under $50/month with SOC 2 compliance?" and your retriever returns generic database overview chunks. The LLM then hallucinates the rest. This happens because naive RAG treats every question as a pure semantic similarity problem and ignores structured metadata entirely.

You'll learn:

  • How to implement SelfQueryRetriever so the LLM decomposes questions into semantic search + metadata filters automatically
  • How to build an adaptive retrieval loop that re-queries when confidence is low
  • How to wire a LangGraph agent that routes between retrieval strategies based on query type

Time: 25 min | Difficulty: Advanced


Why Naive RAG Breaks on Complex Queries

Static top-k vector search has two hard limits. First, it can't apply structured filters — dates, prices, categories, compliance flags — without you manually pre-filtering before the search. Second, it runs once and hands whatever it found to the LLM, with no feedback loop to detect when retrieval failed.

Symptoms:

  • Returned chunks answer a related question, not the actual one asked
  • LLM says "based on the provided context" and then ignores it — it's hallucinating because context was unhelpful
  • Multi-hop questions (A → B → C) always fail because a single retrieval pass can't chain facts
  • Filter conditions in the question ("before 2024", "under $100", "enterprise tier only") are silently ignored

The fix is to make retrieval agentic: give the LLM control over what to search for, what filters to apply, and whether to search again.
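To see the failure concretely, here's a toy sketch in plain Python — hypothetical documents, vendors, and similarity scores, no real embeddings — contrasting a pure similarity ranking with one that applies the metadata predicate first:

```python
# Toy illustration: hypothetical docs with made-up similarity scores.
# Naive RAG ranks by score alone; the "$50/month" constraint is invisible to it.
docs = [
    {"text": "Overview of managed PostgreSQL offerings", "price_usd_month": 120, "score": 0.91},
    {"text": "Neon serverless Postgres pricing: Launch tier", "price_usd_month": 19, "score": 0.84},
    {"text": "Acme Cloud Postgres pricing: Pro tier", "price_usd_month": 75, "score": 0.89},
]

# Naive RAG: the highest-similarity chunk wins, budget ignored.
naive_top = max(docs, key=lambda d: d["score"])
print(naive_top["text"])  # → Overview of managed PostgreSQL offerings

# Self-querying: apply the structured filter first, then rank by similarity.
filtered = [d for d in docs if d["price_usd_month"] <= 50]
filtered_top = max(filtered, key=lambda d: d["score"])
print(filtered_top["text"])  # → Neon serverless Postgres pricing: Launch tier
```

The filtered path is exactly what SelfQueryRetriever automates: the LLM extracts the `<= 50` predicate from the question instead of you hard-coding it.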

Agentic RAG loop: query analysis → self-querying retriever → confidence check → adaptive re-retrieval → grounded generation


Solution

Step 1: Install Dependencies

# Python 3.12 + uv recommended — avoids pip resolver conflicts
uv pip install langchain==0.3.18 langchain-openai==0.2.14 langchain-community==0.3.18 \
  langgraph==0.2.68 chromadb==0.6.3 pydantic==2.10.6

Expected output: Successfully installed langchain-0.3.18 ...

If it fails:

  • ERROR: pip's dependency resolver → use uv pip install instead of bare pip
  • chromadb requires sqlite >= 3.35 → run on Ubuntu 22.04+ or macOS 13+; SQLite on Ubuntu 20.04 is too old
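To check the SQLite version your Python links against before installing:

```shell
# chromadb needs SQLite >= 3.35 — print the version Python was built against
python3 -c "import sqlite3; print(sqlite3.sqlite_version)"
```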

Step 2: Define Your Document Schema with Metadata

Self-querying only works when your vector store documents have structured metadata the LLM can filter against. Define that schema explicitly — this becomes the LLM's "filter vocabulary."

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever

# Describe every filterable field — the LLM reads these descriptions
# to decide which filters to apply. Be precise.
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The product or documentation source name, e.g. 'aws-rds', 'neon', 'planetscale'",
        type="string",
    ),
    AttributeInfo(
        name="price_usd_month",
        description="Monthly price in USD for the base tier. Use for cost comparisons.",
        type="integer",
    ),
    AttributeInfo(
        name="compliance",
        description="List of compliance certifications: SOC2, HIPAA, GDPR, PCI-DSS",
        type="list[string]",
    ),
    AttributeInfo(
        name="year",
        description="Year the document or pricing page was last updated",
        type="integer",
    ),
]

document_content_description = "Technical documentation and pricing pages for cloud database services"

The AttributeInfo descriptions are injected directly into the LLM prompt. Vague descriptions produce vague filters — write them as if briefing a junior analyst. One caveat: Chroma stores only scalar metadata values (string, int, float, bool), so a list-typed field like compliance may need to be flattened to a delimited string (e.g. "SOC2,HIPAA") before indexing, or swapped for a vector store with native list filters.
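Before indexing anything, it's worth sanity-checking that every record actually carries the fields the schema promises. A quick hypothetical validator (plain Python, independent of LangChain — the record shapes and vendor names are illustrative) might look like:

```python
# Hypothetical pre-indexing check: every record must carry the metadata
# fields declared in metadata_field_info, with the matching Python types.
SCHEMA = {"source": str, "price_usd_month": int, "compliance": list, "year": int}

records = [
    {"page_content": "Neon Launch tier: $19/month, SOC2 certified.",
     "metadata": {"source": "neon", "price_usd_month": 19,
                  "compliance": ["SOC2"], "year": 2025}},
]

def validate(record: dict) -> list[str]:
    """Return a list of schema violations for one record's metadata."""
    errors = []
    for field, expected_type in SCHEMA.items():
        value = record["metadata"].get(field)
        if value is None:
            errors.append(f"missing field: {field}")
        elif not isinstance(value, expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(value).__name__}")
    return errors

for r in records:
    assert validate(r) == [], validate(r)
print("all records match the filter schema")
```

Records that drift from the schema don't raise errors at query time — the LLM's filters just silently match nothing — so catching this at ingestion is much cheaper than debugging empty retrievals later.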


Step 3: Build the SelfQueryRetriever

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Load or create your Chroma collection — persistent storage for production
vectorstore = Chroma(
    collection_name="database_docs",
    embedding_function=embeddings,
    persist_directory="./chroma_db",  # Omit for in-memory dev testing
)

llm = ChatOpenAI(model="gpt-4o", temperature=0)

retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    verbose=True,  # Logs the structured query — essential for debugging filter generation
    search_kwargs={"k": 6},  # Return the top 6 chunks per retrieval pass
)

When verbose=True, you'll see the structured query the LLM generates. For the question "PostgreSQL-compatible under $50/month with SOC 2", it should output something like:

query='PostgreSQL compatible database'
filter=Operation(operator=AND, arguments=[
  Comparison(comparator=lte, attribute='price_usd_month', value=50),
  Comparison(comparator=contain, attribute='compliance', value='SOC2')
])

If you see filter=None for a question that should have filters, tighten your AttributeInfo descriptions or switch from gpt-4o-mini to gpt-4o.
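To build intuition for how that structured query is applied, here's a minimal hand-rolled evaluator for the lte/contain/AND tree shown above — a plain-Python sketch with a hypothetical dict encoding; in practice LangChain's per-store translators convert the tree into your vector store's native filter syntax:

```python
# Minimal sketch: evaluate a structured-query filter tree against document
# metadata dicts, mirroring the lte / contain / AND semantics above.
def matches(filter_node: dict, metadata: dict) -> bool:
    if filter_node["op"] == "and":
        return all(matches(arg, metadata) for arg in filter_node["args"])
    value = metadata.get(filter_node["attribute"])
    if filter_node["op"] == "lte":
        return value is not None and value <= filter_node["value"]
    if filter_node["op"] == "contain":
        return value is not None and filter_node["value"] in value
    raise ValueError(f"unsupported operator: {filter_node['op']}")

# The filter the LLM generated for "under $50/month with SOC 2":
query_filter = {"op": "and", "args": [
    {"op": "lte", "attribute": "price_usd_month", "value": 50},
    {"op": "contain", "attribute": "compliance", "value": "SOC2"},
]}

print(matches(query_filter, {"price_usd_month": 19, "compliance": ["SOC2", "GDPR"]}))  # → True
print(matches(query_filter, {"price_usd_month": 120, "compliance": ["SOC2"]}))         # → False
```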


Step 4: Add a Confidence Gate for Adaptive Re-Retrieval

Static retrieval hands results to the LLM unconditionally. Adaptive retrieval checks whether the retrieved chunks actually contain evidence for the answer — and re-queries with a rewritten question if they don't.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from pydantic import BaseModel, Field

class RetrievalGrade(BaseModel):
    """Grade the relevance of retrieved documents to the question."""
    relevant: bool = Field(description="True if documents contain evidence to answer the question")
    reason: str = Field(description="One sentence explaining the grade")

# Grading is an easy task; a cheap model keeps the quality gate affordable
grader_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(RetrievalGrade)

grader_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are grading whether retrieved documents are relevant to a question. Be strict."),
    ("human", "Question: {question}\n\nDocuments:\n{documents}\n\nAre these documents relevant?"),
])

grader_chain = grader_prompt | grader_llm


def grade_retrieval(question: str, docs: list) -> RetrievalGrade:
    doc_text = "\n\n---\n\n".join(d.page_content for d in docs[:4])  # Grade top 4 only
    return grader_chain.invoke({"question": question, "documents": doc_text})

Step 5: Query Rewriting for Failed Retrievals

When the grader returns relevant=False, rewrite the query before trying again. Rewriting expands abbreviations, removes negations, and surfaces implicit intent.

rewriter_prompt = ChatPromptTemplate.from_messages([
    ("system", (
        "You rewrite user questions to improve vector search retrieval. "
        "Remove negations. Expand abbreviations. Make implicit requirements explicit. "
        "Output only the rewritten question — no explanation."
    )),
    ("human", "Original question: {question}\n\nRewritten question:"),
])

rewriter_chain = rewriter_prompt | ChatOpenAI(model="gpt-4o", temperature=0) | StrOutputParser()


def rewrite_query(question: str) -> str:
    return rewriter_chain.invoke({"question": question})

Example: "cheapest PG-compatible DB, not AWS" rewrites to "lowest cost PostgreSQL compatible cloud database excluding Amazon RDS" — a much cleaner semantic search target.


Step 6: Wire the Adaptive Retrieval Loop with LangGraph

LangGraph manages the retrieval state machine: retrieve → grade → rewrite and retry (max 2 retries) → generate.

from langgraph.graph import StateGraph, END
from typing import TypedDict

class AgenticRAGState(TypedDict):
    question: str
    documents: list
    generation: str
    retry_count: int
    _grade_relevant: bool  # Written by grade_node, read by the router

def retrieve_node(state: AgenticRAGState) -> AgenticRAGState:
    docs = retriever.invoke(state["question"])
    return {**state, "documents": docs}

def grade_node(state: AgenticRAGState) -> AgenticRAGState:
    grade = grade_retrieval(state["question"], state["documents"])
    # Store grade result in state for router to read
    return {**state, "_grade_relevant": grade.relevant}

def rewrite_node(state: AgenticRAGState) -> AgenticRAGState:
    new_question = rewrite_query(state["question"])
    return {**state, "question": new_question, "retry_count": state["retry_count"] + 1}

def generate_node(state: AgenticRAGState) -> AgenticRAGState:
    context = "\n\n".join(d.page_content for d in state["documents"])
    generate_prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer using only the provided context. If the context doesn't contain the answer, say so."),
        ("human", "Context:\n{context}\n\nQuestion: {question}"),
    ])
    chain = generate_prompt | ChatOpenAI(model="gpt-4o", temperature=0) | StrOutputParser()
    answer = chain.invoke({"context": context, "question": state["question"]})
    return {**state, "generation": answer}

def should_retry(state: AgenticRAGState) -> str:
    # Max 2 retries to avoid runaway loops — critical for production cost control
    if state.get("_grade_relevant", False):
        return "generate"
    if state.get("retry_count", 0) >= 2:
        return "generate"  # Generate with what we have after max retries
    return "rewrite"

# Build the graph
workflow = StateGraph(AgenticRAGState)
workflow.add_node("retrieve", retrieve_node)
workflow.add_node("grade", grade_node)
workflow.add_node("rewrite", rewrite_node)
workflow.add_node("generate", generate_node)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade")
workflow.add_conditional_edges("grade", should_retry, {
    "generate": "generate",
    "rewrite": "rewrite",
})
workflow.add_edge("rewrite", "retrieve")  # Loop back after rewrite
workflow.add_edge("generate", END)

app = workflow.compile()
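The compiled graph is equivalent to this hand-rolled loop — a sketch in which the four lambdas are hypothetical stand-ins for `retriever.invoke`, `grade_retrieval`, `rewrite_query`, and the generation chain:

```python
# Same control flow as the LangGraph above, written as a plain loop.
def run_adaptive_rag(question, retrieve, grade, rewrite, generate, max_retries=2):
    retry_count = 0
    while True:
        docs = retrieve(question)
        # Generate if the docs pass the grader OR we've hit the retry cap.
        if grade(question, docs) or retry_count >= max_retries:
            return generate(question, docs), retry_count
        question = rewrite(question)
        retry_count += 1

# Stub run: the grader rejects the negated query, accepts the rewrite.
answer, retries = run_adaptive_rag(
    "databases that aren't expensive",
    retrieve=lambda q: [q],
    grade=lambda q, d: "aren't" not in q,  # reject queries containing negation
    rewrite=lambda q: q.replace("aren't expensive", "are low cost"),
    generate=lambda q, d: f"answer for: {q}",
)
print(retries)  # → 1
```

LangGraph adds what the loop lacks in production: checkpointing, streaming, and per-node observability — but keeping the plain-loop mental model makes the graph wiring easier to debug.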

Step 7: Run the Agentic RAG Pipeline

result = app.invoke({
    "question": "Which PostgreSQL-compatible databases cost under $50/month and have SOC 2 compliance?",
    "documents": [],
    "generation": "",
    "retry_count": 0,
})

print(result["generation"])

Expected output: A grounded answer listing specific database products with their prices and compliance status — sourced directly from your indexed documents.

If it fails:

  • openai.AuthenticationError → set OPENAI_API_KEY environment variable
  • chromadb.errors.InvalidCollectionException → collection doesn't exist yet; add documents first with vectorstore.add_documents(docs)
  • SelfQueryRetriever produces no results → check that documents in Chroma actually have the metadata fields you defined; run vectorstore.get() to inspect stored metadata

Comparison: Naive RAG vs Self-Querying vs Agentic RAG

                            Naive RAG    Self-Querying    Agentic RAG
Handles metadata filters    ❌           ✅               ✅
Retries on poor retrieval   ❌           ❌               ✅
Multi-hop reasoning         ❌           ❌               ✅ (with subgraph)
Cost per query              Low          Medium           Medium–High
Setup complexity            Simple       Moderate         Complex
Best for                    Prototypes   Structured data  Production Q&A

Self-querying alone gets you 80% of the benefit at 40% of the complexity. Add the adaptive loop when you're seeing >15% irrelevant retrievals in production (track this with LangSmith).


Verification

Run this smoke test against your pipeline to confirm all three layers are working:

# Test 1 — filter extraction
test_result = app.invoke({
    "question": "Show me HIPAA-compliant databases updated after 2023",
    "documents": [], "generation": "", "retry_count": 0,
})
assert "HIPAA" in test_result["generation"] or "context" in test_result["generation"].lower()

# Test 2 — rewrite triggered (use a question with negation)
test_result2 = app.invoke({
    "question": "databases that aren't expensive and don't lack SOC2",
    "documents": [], "generation": "", "retry_count": 0,
})
print("Retries:", test_result2.get("retry_count", 0))  # Expect > 0

You should see: Retries: 1 or Retries: 2 for the negation test — confirming the grader caught the mismatch and triggered a rewrite.


What You Learned

  • SelfQueryRetriever translates natural language into vector search + structured filters in a single LLM call — no manual filter logic required
  • Relevance grading before generation is the most cost-effective quality gate; a cheap gpt-4o-mini grader call saves expensive regenerations downstream
  • The retry cap (retry_count >= 2) is non-negotiable in production — unbounded loops will blow your OpenAI budget in minutes under adversarial or ambiguous queries
  • Agentic RAG is overkill for FAQ-style Q&A with clean, uniform documents; use it when queries have filter conditions, multiple intents, or require chaining facts across documents

Tested on LangChain 0.3.18, LangGraph 0.2.68, ChromaDB 0.6.3, Python 3.12, Ubuntu 22.04


FAQ

Q: Does SelfQueryRetriever work with vector stores other than Chroma? A: Yes — LangChain ships SelfQueryRetriever adapters for Pinecone, Weaviate, Qdrant, PGVector, and Milvus. The metadata_field_info schema is portable; only the vectorstore constructor changes.

Q: What is the difference between self-querying and HyDE (Hypothetical Document Embeddings)? A: Self-querying extracts structured filters and a semantic query from the question; HyDE generates a hypothetical answer and embeds that instead of the question. They solve different problems — use self-querying when your documents have rich metadata; use HyDE when queries are short and semantically sparse.

Q: How much does running the full agentic loop cost in USD? A: A single query with one retry costs roughly $0.003–$0.008 with gpt-4o (grader + rewriter + generator). At 10,000 queries/month that's $30–$80/month — well within AWS us-east-1 free tier for the hosting side.

Q: Can I run this fully self-hosted without OpenAI? A: Yes. Replace ChatOpenAI with ChatOllama pointing at a local Llama 3.3 70B instance. Self-querying filter quality degrades on models smaller than 13B — test your filter extraction before deploying.

Q: What minimum context window does the LLM need for the grader step? A: The grader concatenates up to 4 document chunks. With 512-token chunks that's ~2,048 tokens of context plus the prompt — any model with an 8K context window handles this comfortably.