Problem: Your RAG Pipeline Returns Irrelevant Chunks
Agentic RAG with self-querying and adaptive retrieval fixes the core failure of naive RAG: a single static vector search that can't handle multi-faceted questions, filter conditions, or follow-up reasoning.
Here's the symptom. You ask "What are the cheapest PostgreSQL-compatible databases under $50/month with SOC 2 compliance?" and your retriever returns generic database overview chunks. The LLM then hallucinates the rest. This happens because naive RAG treats every question as a pure semantic similarity problem and ignores structured metadata entirely.
You'll learn:
- How to implement `SelfQueryRetriever` so the LLM decomposes questions into semantic search + metadata filters automatically
- How to build an adaptive retrieval loop that re-queries when confidence is low
- How to wire a LangGraph agent that routes between retrieval strategies based on query type
Time: 25 min | Difficulty: Advanced
Why Naive RAG Breaks on Complex Queries
Static top-k vector search has two hard limits. First, it can't apply structured filters — dates, prices, categories, compliance flags — without you manually pre-filtering before the search. Second, it runs once and hands whatever it found to the LLM, with no feedback loop to detect when retrieval failed.
Symptoms:
- Returned chunks answer a related question, not the actual one asked
- LLM says "based on the provided context" and then ignores it — it's hallucinating because context was unhelpful
- Multi-hop questions (A → B → C) always fail because a single retrieval pass can't chain facts
- Filter conditions in the question ("before 2024", "under $100", "enterprise tier only") are silently ignored
The fix is to make retrieval agentic: give the LLM control over what to search for, what filters to apply, and whether to search again.
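The control flow this tutorial builds can be sketched in plain Python before bringing in any framework. The `retrieve`, `grade`, and `rewrite` functions below are stand-in stubs, not library APIs; Steps 3 to 6 replace them with real LangChain and LangGraph components.

```python
# Plain-Python sketch of the agentic retrieval loop: retrieve, check
# relevance, and re-query with a rewritten question, up to a retry cap.
# retrieve/grade/rewrite here are stubs for the real components.

def agentic_retrieve(question, retrieve, grade, rewrite, max_retries=2):
    """Return (documents, retries) after adaptive re-retrieval."""
    retries = 0
    docs = retrieve(question)
    while not grade(question, docs) and retries < max_retries:
        question = rewrite(question)  # e.g. expand abbreviations, drop negations
        docs = retrieve(question)
        retries += 1
    return docs, retries

# Toy demo: a "store" where only the rewritten phrasing matches anything
store = {"postgresql compatible database": ["doc about Neon pricing"]}
retrieve = lambda q: store.get(q.lower(), [])
grade = lambda q, docs: len(docs) > 0
rewrite = lambda q: "postgresql compatible database"

docs, retries = agentic_retrieve("cheapest PG-compatible DB", retrieve, grade, rewrite)
print(docs, retries)  # ['doc about Neon pricing'] 1
```

The cap on `retries` is the same production safety valve implemented later in `should_retry`.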
Agentic RAG loop: query analysis → self-querying retriever → confidence check → adaptive re-retrieval → grounded generation
Solution
Step 1: Install Dependencies
# Python 3.12 + uv recommended — avoids pip resolver conflicts
uv pip install langchain==0.3.18 langchain-openai==0.2.14 langchain-community==0.3.18 \
langgraph==0.2.68 chromadb==0.6.3 pydantic==2.10.6
Expected output: Successfully installed langchain-0.3.18 ...
If it fails:
- `ERROR: pip's dependency resolver` → use `uv pip install` instead of bare `pip`
- `chromadb requires sqlite >= 3.35` → run on Ubuntu 22.04+ or macOS 13+; SQLite on Ubuntu 20.04 is too old
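To check the SQLite requirement before installing, inspect the version bundled with your Python build (a quick stdlib check):

```python
# Verify the SQLite linked into this Python meets chromadb's >= 3.35 requirement.
import sqlite3

major, minor, *_ = map(int, sqlite3.sqlite_version.split("."))
print(sqlite3.sqlite_version)
assert (major, minor) >= (3, 35), "SQLite too old for chromadb; upgrade OS or Python build"
```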
Step 2: Define Your Document Schema with Metadata
Self-querying only works when your vector store documents have structured metadata the LLM can filter against. Define that schema explicitly — this becomes the LLM's "filter vocabulary."
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
# Describe every filterable field — the LLM reads these descriptions
# to decide which filters to apply. Be precise.
metadata_field_info = [
AttributeInfo(
name="source",
description="The product or documentation source name, e.g. 'aws-rds', 'neon', 'planetscale'",
type="string",
),
AttributeInfo(
name="price_usd_month",
description="Monthly price in USD for the base tier. Use for cost comparisons.",
type="integer",
),
AttributeInfo(
name="compliance",
description="List of compliance certifications: SOC2, HIPAA, GDPR, PCI-DSS",
type="list[string]",
),
AttributeInfo(
name="year",
description="Year the document or pricing page was last updated",
type="integer",
),
]
document_content_description = "Technical documentation and pricing pages for cloud database services"
The AttributeInfo descriptions are injected directly into the LLM prompt. Vague descriptions produce vague filters — write them as if briefing a junior analyst.
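The documents you ingest must actually carry this metadata. Below is a sketch with made-up records (the products, prices, and text are illustrative); in the real pipeline each record becomes a `Document(page_content=..., metadata=...)` passed to `vectorstore.add_documents`.

```python
# Hypothetical records whose metadata matches the schema defined above.
# Plain dicts are used here for clarity; wrap each in a langchain_core
# Document before calling vectorstore.add_documents.

docs = [
    {
        "page_content": "Neon is a serverless PostgreSQL platform with branching...",
        "metadata": {"source": "neon", "price_usd_month": 19,
                     "compliance": ["SOC2", "GDPR"], "year": 2024},
    },
    {
        "page_content": "Amazon RDS for PostgreSQL offers managed instances...",
        "metadata": {"source": "aws-rds", "price_usd_month": 60,
                     "compliance": ["SOC2", "HIPAA", "PCI-DSS"], "year": 2023},
    },
]

# Every filterable field must be present with the declared type, or the
# LLM's generated filters will silently match nothing.
for d in docs:
    assert {"source", "price_usd_month", "compliance", "year"} <= d["metadata"].keys()
```

One caveat: Chroma restricts metadata values to scalars (str, int, float, bool), so a list-valued field like `compliance` typically gets flattened to a delimited string at ingestion time, depending on your vector store.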
Step 3: Build the SelfQueryRetriever
import os
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Load or create your Chroma collection — persistent storage for production
vectorstore = Chroma(
collection_name="database_docs",
embedding_function=embeddings,
persist_directory="./chroma_db", # Omit for in-memory dev testing
)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
retriever = SelfQueryRetriever.from_llm(
llm=llm,
vectorstore=vectorstore,
document_contents=document_content_description,
metadata_field_info=metadata_field_info,
verbose=True, # Logs the structured query — essential for debugging filter generation
search_kwargs={"k": 6}, # Fetch 6 candidates before re-ranking
)
When verbose=True, you'll see the structured query the LLM generates. For the question "PostgreSQL-compatible under $50/month with SOC 2", it should output something like:
query='PostgreSQL compatible database'
filter=Operation(operator=AND, arguments=[
Comparison(comparator=lte, attribute='price_usd_month', value=50),
Comparison(comparator=contain, attribute='compliance', value='SOC2')
])
If you see filter=None for a question that should have filters, tighten your AttributeInfo descriptions or switch from gpt-4o-mini to gpt-4o.
Step 4: Add a Confidence Gate for Adaptive Re-Retrieval
Static retrieval hands results to the LLM unconditionally. Adaptive retrieval checks whether the retrieved chunks actually contain evidence for the answer — and re-queries with a rewritten question if they don't.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from pydantic import BaseModel, Field
class RetrievalGrade(BaseModel):
"""Grade the relevance of retrieved documents to the question."""
relevant: bool = Field(description="True if documents contain evidence to answer the question")
reason: str = Field(description="One sentence explaining the grade")
grader_llm = ChatOpenAI(model="gpt-4o", temperature=0).with_structured_output(RetrievalGrade)
grader_prompt = ChatPromptTemplate.from_messages([
("system", "You are grading whether retrieved documents are relevant to a question. Be strict."),
("human", "Question: {question}\n\nDocuments:\n{documents}\n\nAre these documents relevant?"),
])
grader_chain = grader_prompt | grader_llm
def grade_retrieval(question: str, docs: list) -> RetrievalGrade:
doc_text = "\n\n---\n\n".join(d.page_content for d in docs[:4]) # Grade top 4 only
return grader_chain.invoke({"question": question, "documents": doc_text})
Step 5: Query Rewriting for Failed Retrievals
When the grader returns relevant=False, rewrite the query before trying again. Rewriting expands abbreviations, removes negations, and surfaces implicit intent.
rewriter_prompt = ChatPromptTemplate.from_messages([
("system", (
"You rewrite user questions to improve vector search retrieval. "
"Remove negations. Expand abbreviations. Make implicit requirements explicit. "
"Output only the rewritten question — no explanation."
)),
("human", "Original question: {question}\n\nRewritten question:"),
])
rewriter_chain = rewriter_prompt | ChatOpenAI(model="gpt-4o", temperature=0) | StrOutputParser()
def rewrite_query(question: str) -> str:
return rewriter_chain.invoke({"question": question})
Example: "cheapest PG-compatible DB, not AWS" rewrites to "lowest cost PostgreSQL compatible cloud database excluding Amazon RDS" — a much cleaner semantic search target.
Step 6: Wire the Adaptive Retrieval Loop with LangGraph
LangGraph manages the retrieval state machine: retrieve → grade → rewrite and retry (max 2 retries) → generate.
from langgraph.graph import StateGraph, END
from typing import TypedDict

class AgenticRAGState(TypedDict):
    question: str
    documents: list
    generation: str
    retry_count: int
    _grade_relevant: bool  # Set by grade_node, read by the should_retry router
def retrieve_node(state: AgenticRAGState) -> AgenticRAGState:
docs = retriever.invoke(state["question"])
return {**state, "documents": docs}
def grade_node(state: AgenticRAGState) -> AgenticRAGState:
grade = grade_retrieval(state["question"], state["documents"])
# Store grade result in state for router to read
return {**state, "_grade_relevant": grade.relevant}
def rewrite_node(state: AgenticRAGState) -> AgenticRAGState:
new_question = rewrite_query(state["question"])
return {**state, "question": new_question, "retry_count": state["retry_count"] + 1}
def generate_node(state: AgenticRAGState) -> AgenticRAGState:
context = "\n\n".join(d.page_content for d in state["documents"])
generate_prompt = ChatPromptTemplate.from_messages([
("system", "Answer using only the provided context. If the context doesn't contain the answer, say so."),
("human", "Context:\n{context}\n\nQuestion: {question}"),
])
chain = generate_prompt | ChatOpenAI(model="gpt-4o", temperature=0) | StrOutputParser()
answer = chain.invoke({"context": context, "question": state["question"]})
return {**state, "generation": answer}
def should_retry(state: AgenticRAGState) -> str:
# Max 2 retries to avoid runaway loops — critical for production cost control
if state.get("_grade_relevant", False):
return "generate"
if state.get("retry_count", 0) >= 2:
return "generate" # Generate with what we have after max retries
return "rewrite"
# Build the graph
workflow = StateGraph(AgenticRAGState)
workflow.add_node("retrieve", retrieve_node)
workflow.add_node("grade", grade_node)
workflow.add_node("rewrite", rewrite_node)
workflow.add_node("generate", generate_node)
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade")
workflow.add_conditional_edges("grade", should_retry, {
"generate": "generate",
"rewrite": "rewrite",
})
workflow.add_edge("rewrite", "retrieve") # Loop back after rewrite
workflow.add_edge("generate", END)
app = workflow.compile()
Step 7: Run the Agentic RAG Pipeline
result = app.invoke({
"question": "Which PostgreSQL-compatible databases cost under $50/month and have SOC 2 compliance?",
"documents": [],
"generation": "",
"retry_count": 0,
})
print(result["generation"])
Expected output: A grounded answer listing specific database products with their prices and compliance status — sourced directly from your indexed documents.
If it fails:
- `openai.AuthenticationError` → set the `OPENAI_API_KEY` environment variable
- `chromadb.errors.InvalidCollectionException` → the collection doesn't exist yet; add documents first with `vectorstore.add_documents(docs)`
- `SelfQueryRetriever` produces no results → check that documents in Chroma actually have the metadata fields you defined; run `vectorstore.get()` to inspect stored metadata
Comparison: Naive RAG vs Self-Querying vs Agentic RAG
| | Naive RAG | Self-Querying | Agentic RAG |
|---|---|---|---|
| Handles metadata filters | ❌ | ✅ | ✅ |
| Retries on poor retrieval | ❌ | ❌ | ✅ |
| Multi-hop reasoning | ❌ | ❌ | ✅ (with subgraph) |
| Cost per query | Low | Medium | Medium–High |
| Setup complexity | Simple | Moderate | Complex |
| Best for | Prototypes | Structured data | Production Q&A |
Self-querying alone gets you 80% of the benefit at 40% of the complexity. Add the adaptive loop when you're seeing >15% irrelevant retrievals in production (track this with LangSmith).
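The ">15% irrelevant retrievals" threshold can be computed from logged grader verdicts. A minimal sketch, assuming you export each query's `relevant` flag from LangSmith feedback or your own logs:

```python
# Production signal: fraction of queries whose retrieved chunks were
# graded irrelevant. grades is a list of booleans, one per query
# (True = grader judged the retrieval relevant).

def irrelevant_rate(grades: list[bool]) -> float:
    """Fraction of queries with an irrelevant-retrieval verdict."""
    if not grades:
        return 0.0
    return sum(1 for g in grades if not g) / len(grades)

grades = [True] * 83 + [False] * 17   # illustrative: 17 failures in 100 queries
rate = irrelevant_rate(grades)
print(f"{rate:.0%}")                  # 17%
if rate > 0.15:
    print("Enable the adaptive re-retrieval loop")
```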
Verification
Run this smoke test against your pipeline to confirm all three layers are working:
# Test 1 — filter extraction
test_result = app.invoke({
"question": "Show me HIPAA-compliant databases updated after 2023",
"documents": [], "generation": "", "retry_count": 0,
})
assert "HIPAA" in test_result["generation"] or "context" in test_result["generation"].lower()
# Test 2 — rewrite triggered (use a question with negation)
test_result2 = app.invoke({
"question": "databases that aren't expensive and don't lack SOC2",
"documents": [], "generation": "", "retry_count": 0,
})
print("Retries:", test_result2.get("retry_count", 0)) # Expect > 0
You should see: Retries: 1 or Retries: 2 for the negation test — confirming the grader caught the mismatch and triggered a rewrite.
What You Learned
- `SelfQueryRetriever` translates natural language into vector search + structured filters in a single LLM call — no manual filter logic required
- Relevance grading before generation is the most cost-effective quality gate; a cheap `gpt-4o-mini` grader call saves expensive regenerations downstream
- The retry cap (`retry_count >= 2`) is non-negotiable in production — unbounded loops will blow your OpenAI budget in minutes under adversarial or ambiguous queries
- Agentic RAG is overkill for FAQ-style Q&A with clean, uniform documents; use it when queries have filter conditions, multiple intents, or require chaining facts across documents
Tested on LangChain 0.3.18, LangGraph 0.2.68, ChromaDB 0.6.3, Python 3.12, Ubuntu 22.04
FAQ
Q: Does SelfQueryRetriever work with vector stores other than Chroma?
A: Yes — LangChain ships SelfQueryRetriever adapters for Pinecone, Weaviate, Qdrant, PGVector, and Milvus. The metadata_field_info schema is portable; only the vectorstore constructor changes.
Q: What is the difference between self-querying and HyDE (Hypothetical Document Embeddings)?
A: Self-querying extracts structured filters and a semantic query from the question; HyDE generates a hypothetical answer and embeds that instead of the question. They solve different problems — use self-querying when your documents have rich metadata; use HyDE when queries are short and semantically sparse.
Q: How much does running the full agentic loop cost in USD?
A: A single query with one retry costs roughly $0.003–$0.008 with gpt-4o (grader + rewriter + generator). At 10,000 queries/month that's $30–$80/month — well within AWS us-east-1 free tier for the hosting side.
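That estimate can be reproduced with back-of-envelope arithmetic. All token counts and per-token prices below are illustrative assumptions (check current OpenAI pricing); one way to land inside the $0.003 to $0.008 range is to run the grader and rewriter on gpt-4o-mini, as the grading note above suggests, with only the generator on gpt-4o.

```python
# Back-of-envelope cost model for one agentic query with one retry.
# All prices and token counts are assumptions; substitute current values.

GPT4O_IN, GPT4O_OUT = 2.50 / 1_000_000, 10.00 / 1_000_000   # assumed USD/token
MINI_IN, MINI_OUT = 0.15 / 1_000_000, 0.60 / 1_000_000      # assumed USD/token

def call_cost(tokens_in: int, tokens_out: int, p_in: float, p_out: float) -> float:
    return tokens_in * p_in + tokens_out * p_out

cost = (
    2 * call_cost(1_000, 30, MINI_IN, MINI_OUT)   # grader runs twice (retry path)
    + call_cost(200, 60, MINI_IN, MINI_OUT)       # rewriter
    + call_cost(1_500, 300, GPT4O_IN, GPT4O_OUT)  # generator reads context, answers
)
print(f"${cost:.4f} per query, ~${cost * 10_000:.0f} per 10k queries")
# $0.0072 per query, ~$72 per 10k queries
```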
Q: Can I run this fully self-hosted without OpenAI?
A: Yes. Replace ChatOpenAI with ChatOllama pointing at a local Llama 3.3 70B instance. Self-querying filter quality degrades on models smaller than 13B — test your filter extraction before deploying.
Q: What minimum context window does the LLM need for the grader step?
A: The grader concatenates up to 4 document chunks. With 512-token chunks that's ~2,048 tokens of context plus the prompt — any model with an 8K context window handles this comfortably.
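The arithmetic behind that answer, with an assumed fixed prompt overhead:

```python
# Context-budget check for the grader step; prompt_overhead is an assumption.
chunks = 4
tokens_per_chunk = 512
prompt_overhead = 120   # assumed system prompt + question tokens
budget = chunks * tokens_per_chunk + prompt_overhead
print(budget)           # 2168
assert budget < 8_192   # fits comfortably in an 8K context window
```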