Problem: Your RAG Pipeline Answers Confidently—Even When It Shouldn't
Standard RAG retrieves documents and generates a response. It doesn't check whether those documents actually answer the question. The result: confident-sounding hallucinations when retrieval misfires.
You'll learn:
- How Self-RAG adds a critique loop to standard retrieval
- How to implement retrieval relevance and response faithfulness checks
- When to skip retrieval entirely vs. when to fall back to generation
Time: 30 min | Level: Intermediate
Why This Happens
Standard RAG is a one-way pipeline: query → retrieve → generate. There's no feedback loop. If the top-k retrieved chunks are tangentially related, the LLM still uses them—and confidently invents details to fill the gaps.
Self-RAG (introduced by Asai et al., 2023) fixes this by inserting reflection tokens at key decision points. The model asks itself: Is retrieval even needed? Are these chunks relevant? Is my answer grounded in them?
Common symptoms of standard RAG failure:
- Answers that blend real retrieved content with invented facts
- Correct-sounding responses that cite the wrong source
- Retrieval that returns semantically adjacent but factually wrong chunks
Standard RAG passes retrieved docs straight to generation. Self-RAG adds three critique checkpoints.
Solution
We'll build a Self-RAG pipeline in Python using LangChain and an OpenAI-compatible model. The pipeline has three self-check gates:
- Retrieval gate — Does this query need external docs at all?
- Relevance gate — Are the retrieved chunks actually relevant?
- Faithfulness gate — Is the generated answer grounded in the chunks?
Step 1: Install Dependencies
```bash
pip install langchain langchain-openai chromadb tiktoken
```
Step 2: Build the Critique Chain
The critique chain is a lightweight LLM call that scores each checkpoint. Keep it separate from your main generation model so grades don't get mixed into the answer.
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

critique_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Relevance check: does this chunk help answer the question?
relevance_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a grader. Answer only 'yes' or 'no'."),
    ("human", "Question: {question}\n\nChunk: {chunk}\n\nIs this chunk relevant to the question?"),
])

# Faithfulness check: is the answer grounded in the provided context?
faithfulness_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a grader. Answer only 'yes' or 'no'."),
    ("human", "Context: {context}\n\nAnswer: {answer}\n\nIs the answer fully supported by the context?"),
])

relevance_chain = relevance_prompt | critique_llm | StrOutputParser()
faithfulness_chain = faithfulness_prompt | critique_llm | StrOutputParser()
```
Why temperature=0 on the critique model: Grading needs determinism. You want the same chunk to get the same grade on repeated calls.
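Even at temperature 0, graders sometimes reply with "Yes." or a short sentence instead of a bare "yes". Rather than comparing raw output with `== "yes"`, a small normalizer can make the gates fail safe. A minimal sketch (the helper name `parse_grade` is our own, not part of LangChain):

```python
def parse_grade(raw: str) -> bool:
    """Normalize a grader reply to a boolean, failing safe.

    Handles replies like "Yes.", " yes\n", or "yes, it is relevant".
    Anything that doesn't clearly start with "yes" counts as "no", so an
    ambiguous grade rejects the chunk rather than letting it through.
    """
    return raw.strip().lower().rstrip(".").startswith("yes")
```

You would then write `parse_grade(relevance_chain.invoke(...))` instead of the inline `.strip().lower() == "yes"` comparison.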
Step 3: Implement the Self-RAG Pipeline
```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.vectorstores import Chroma

# Main generation model
gen_llm = ChatOpenAI(model="gpt-4o", temperature=0.2)

generate_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context. If you cannot answer from context, say so."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])
generate_chain = generate_prompt | gen_llm | StrOutputParser()
```
```python
def self_rag(question: str, vectorstore: Chroma, k: int = 4) -> dict:
    # Gate 1: Does this query need retrieval?
    if not needs_retrieval(question):
        # Answer from parametric knowledge directly
        direct_answer = gen_llm.invoke(question).content
        return {"answer": direct_answer, "grounded": None, "source": "parametric"}

    # Gate 2: Retrieve, then keep only chunks the critique model grades relevant
    raw_docs = vectorstore.similarity_search(question, k=k)
    relevant_docs = [
        doc for doc in raw_docs
        if relevance_chain.invoke(
            {"question": question, "chunk": doc.page_content}
        ).strip().lower() == "yes"
    ]
    if not relevant_docs:
        # Retrieval found nothing useful: flag it rather than hallucinate
        return {
            "answer": "I couldn't find relevant information to answer this.",
            "grounded": False,
            "source": "none",
        }

    context = "\n\n".join(doc.page_content for doc in relevant_docs)

    # Generate answer
    answer = generate_chain.invoke({"context": context, "question": question})

    # Gate 3: Is the answer grounded in the retrieved context?
    grounded = faithfulness_chain.invoke(
        {"context": context, "answer": answer}
    ).strip().lower() == "yes"
    return {
        "answer": answer,
        "grounded": grounded,
        "source": "retrieval",
        "docs_used": len(relevant_docs),
    }


def needs_retrieval(question: str) -> bool:
    # Simple heuristic: skip retrieval for math, greetings, definitions
    # Replace with an LLM call for production use
    no_retrieval_keywords = ["what is 2+2", "hello", "who are you"]
    return not any(kw in question.lower() for kw in no_retrieval_keywords)
```
If the faithfulness gate returns False: Don't serve the answer. Either retry with a stricter generation prompt or return the "couldn't find relevant information" fallback. Serving an ungrounded answer is worse than admitting uncertainty.
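That retry loop can be factored into a small wrapper. A sketch with the generation and grading steps injected as callables (the function name and signature are our own, chosen so the control flow is testable independently of any LLM):

```python
from typing import Callable

def generate_with_grounding_retry(
    question: str,
    context: str,
    generate_fn: Callable[[str, str], str],  # (context, question) -> answer
    grade_fn: Callable[[str, str], bool],    # (context, answer) -> grounded?
    max_retries: int = 1,
    fallback: str = "I couldn't find relevant information to answer this.",
) -> dict:
    """Retry generation when the faithfulness gate fails, then fall back."""
    for attempt in range(max_retries + 1):
        answer = generate_fn(context, question)
        if grade_fn(context, answer):
            return {"answer": answer, "grounded": True, "attempts": attempt + 1}
    # Every attempt came back ungrounded: admit uncertainty instead of serving it
    return {"answer": fallback, "grounded": False, "attempts": max_retries + 1}
```

In the pipeline above, `generate_fn` would wrap `generate_chain.invoke` (ideally swapping in a stricter system prompt on retries) and `grade_fn` would wrap the `faithfulness_chain` call.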
Step 4: Wire It to a Vector Store
```python
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load and chunk your docs
loader = DirectoryLoader("./docs", glob="**/*.md")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(docs)

# Build the vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# Run Self-RAG
result = self_rag("What deployment strategies does the platform support?", vectorstore)
print(result)
```
Expected output (a Python dict, so `True` rather than JSON's `true`):

```python
{'answer': 'The platform supports blue-green and canary deployments...',
 'grounded': True,
 'source': 'retrieval',
 'docs_used': 3}
```
A grounded answer includes the docs_used count and source field for auditability.
Verification
```bash
python -c "
from your_module import self_rag, vectorstore

# Should return grounded: True
result = self_rag('What is the refund policy?', vectorstore)
assert result['grounded'] is True, 'Answer not grounded!'
print('Pass:', result['answer'][:80])
"
```
You should see: Pass: followed by the first 80 characters of a grounded answer. If you see AssertionError, the faithfulness gate caught an ungrounded response—check your chunk size (too large = diluted relevance).
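If you want a cheap signal before spending another LLM call on grading, a crude lexical-overlap check can flag obviously ungrounded answers. A sketch (the helper `overlap_ratio` is illustrative, not a LangChain API, and it is a pre-filter only; the LLM faithfulness grader stays the final gate):

```python
import re

def overlap_ratio(answer: str, context: str) -> float:
    """Fraction of the answer's content words that also appear in the context.

    Very low overlap usually means the answer was invented rather than
    drawn from the retrieved chunks. Words of 4+ letters stand in for
    "content words"; short function words are ignored.
    """
    tokenize = lambda s: set(re.findall(r"[a-z]{4,}", s.lower()))
    answer_words = tokenize(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & tokenize(context)) / len(answer_words)
```

An answer scoring near 0 almost certainly fails the faithfulness gate, so you can short-circuit straight to the fallback and save a grading call.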
What You Learned
- Self-RAG adds three critique gates that prevent confident hallucination
- Separating the critique model from the generation model keeps grades uncontaminated
- The faithfulness gate is your last line of defense: treat a False result as a retrieval failure, not a generation prompt problem
- Limitation: Each Self-RAG call makes 2–5 LLM calls instead of 1. Latency roughly doubles. Cache critique results for repeated queries in production.
- When NOT to use this: Real-time applications where latency matters more than factual precision. For chatbots, a simpler citation-grounding check is often enough.
Tested on Python 3.12, LangChain 0.3, OpenAI gpt-4o + gpt-4o-mini, ChromaDB 0.5