Self-Reflective RAG: Making Your AI Check Its Own Sources

Build a Self-RAG pipeline that validates retrieved documents before answering—cut hallucinations and boost response accuracy in under 30 minutes.

Problem: Your RAG Pipeline Answers Confidently—Even When It Shouldn't

Standard RAG retrieves documents and generates a response. It doesn't check whether those documents actually answer the question. The result: confident-sounding hallucinations when retrieval misfires.

You'll learn:

  • How Self-RAG adds a critique loop to standard retrieval
  • How to implement retrieval relevance and response faithfulness checks
  • When to skip retrieval entirely vs. when to refuse rather than serve an ungrounded answer

Time: 30 min | Level: Intermediate


Why This Happens

Standard RAG is a one-way pipeline: query → retrieve → generate. There's no feedback loop. If the top-k retrieved chunks are tangentially related, the LLM still uses them—and confidently invents details to fill the gaps.

Self-RAG (introduced by Asai et al., 2023) fixes this by inserting reflection tokens at key decision points. The model asks itself: Is retrieval even needed? Are these chunks relevant? Is my answer grounded in them? The original paper fine-tunes a model to emit these tokens inline; the pipeline below approximates the same checks with separate grader calls.

Common symptoms of standard RAG failure:

  • Answers that blend real retrieved content with invented facts
  • Correct-sounding responses that cite the wrong source
  • Retrieval that returns semantically adjacent but factually wrong chunks

[Diagram: standard RAG vs. Self-RAG flow. Standard RAG passes retrieved docs straight to generation; Self-RAG adds three critique checkpoints.]


Solution

We'll build a Self-RAG pipeline in Python using LangChain and an OpenAI-compatible model. The pipeline has three self-check gates:

  1. Retrieval gate — Does this query need external docs at all?
  2. Relevance gate — Are the retrieved chunks actually relevant?
  3. Faithfulness gate — Is the generated answer grounded in the chunks?

Step 1: Install Dependencies

pip install langchain langchain-openai langchain-community chromadb tiktoken "unstructured[md]"

Step 2: Build the Critique Chain

The critique chain is a lightweight LLM call that scores each checkpoint. Keep it separate from your main generation model so grades don't get mixed into the answer.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

critique_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Relevance check: does this chunk help answer the question?
relevance_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a grader. Answer only 'yes' or 'no'."),
    ("human", "Question: {question}\n\nChunk: {chunk}\n\nIs this chunk relevant to the question?")
])

# Faithfulness check: is the answer grounded in the provided context?
faithfulness_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a grader. Answer only 'yes' or 'no'."),
    ("human", "Context: {context}\n\nAnswer: {answer}\n\nIs the answer fully supported by the context?")
])

relevance_chain = relevance_prompt | critique_llm | StrOutputParser()
faithfulness_chain = faithfulness_prompt | critique_llm | StrOutputParser()

Why temperature=0 on the critique model: grading needs to be repeatable. Greedy sampling means the same chunk gets the same grade on repeated calls (up to minor API-side nondeterminism).
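One brittle spot worth guarding: the string comparison against the grader's output. Even at temperature=0, a model can return "Yes." or "YES" instead of a bare "yes". A small normalizing helper (hypothetical, not part of LangChain) makes the gates tolerant of that:

```python
def parse_grade(raw: str) -> bool:
    """Normalize a yes/no grader response to a boolean.

    Treats any response that starts with 'yes' (after stripping
    whitespace, surrounding quotes, and trailing punctuation) as a pass.
    """
    cleaned = raw.strip().strip("\"'").rstrip(".!").lower()
    return cleaned.startswith("yes")
```

You would then test the gate with `parse_grade(relevance_chain.invoke({...}))` rather than comparing against the literal string "yes".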

Step 3: Implement the Self-RAG Pipeline

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Main generation model
gen_llm = ChatOpenAI(model="gpt-4o", temperature=0.2)

generate_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context. If you cannot answer from context, say so."),
    ("human", "Context:\n{context}\n\nQuestion: {question}")
])

generate_chain = generate_prompt | gen_llm | StrOutputParser()

def self_rag(question: str, vectorstore: Chroma, k: int = 4) -> dict:
    # Gate 1: Does this query need retrieval?
    retrieval_needed = needs_retrieval(question)
    if not retrieval_needed:
        # Answer from parametric knowledge directly
        direct_answer = gen_llm.invoke(question).content
        return {"answer": direct_answer, "grounded": None, "source": "parametric"}

    # Gate 2: Retrieve and filter relevant chunks
    raw_docs = vectorstore.similarity_search(question, k=k)
    relevant_docs = [
        doc for doc in raw_docs
        if relevance_chain.invoke({"question": question, "chunk": doc.page_content}).strip().lower() == "yes"
    ]

    if not relevant_docs:
        # Retrieval found nothing useful—flag it rather than hallucinate
        return {"answer": "I couldn't find relevant information to answer this.", "grounded": False, "source": "none"}

    context = "\n\n".join(doc.page_content for doc in relevant_docs)

    # Generate answer
    answer = generate_chain.invoke({"context": context, "question": question})

    # Gate 3: Is the answer grounded in the retrieved context?
    grounded = faithfulness_chain.invoke({"context": context, "answer": answer}).strip().lower() == "yes"

    return {
        "answer": answer,
        "grounded": grounded,
        "source": "retrieval",
        "docs_used": len(relevant_docs)
    }

def needs_retrieval(question: str) -> bool:
    # Simple heuristic: skip retrieval for math, greetings, definitions
    # Replace with an LLM call for production use
    no_retrieval_keywords = ["what is 2+2", "hello", "who are you"]
    return not any(kw in question.lower() for kw in no_retrieval_keywords)
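The keyword heuristic above is a placeholder. A production retrieval gate would itself be an LLM call; one way to sketch it, with the grader passed in as a plain callable so it can wrap `critique_llm` or be stubbed in tests (the function name and prompt wording here are illustrative):

```python
def needs_retrieval_llm(question: str, grader) -> bool:
    """Ask a grader model whether the question needs external documents.

    `grader` is any callable that takes a prompt string and returns a
    short yes/no string -- e.g. a thin wrapper around critique_llm.
    """
    prompt = (
        "Does answering this question require looking up external "
        "documents, rather than general knowledge? "
        "Answer only 'yes' or 'no'.\n\n"
        f"Question: {question}"
    )
    return grader(prompt).strip().lower().startswith("yes")
```

With LangChain this could be wired as `needs_retrieval_llm(q, lambda p: critique_llm.invoke(p).content)`.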

If the faithfulness gate returns False: Don't serve the answer. Either retry with a stricter generation prompt or return the "couldn't find relevant information" fallback. Serving an ungrounded answer is worse than admitting uncertainty.
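The retry path can be sketched as a small wrapper. Here `generate` and `is_grounded` stand in for `generate_chain` and the faithfulness check; both are hypothetical signatures, dependency-injected so the loop is easy to test:

```python
FALLBACK = "I couldn't find relevant information to answer this."

def generate_with_retry(context, question, generate, is_grounded, max_retries=1):
    """Generate, check grounding, and retry with a stricter prompt.

    `generate(context, question, strict)` returns an answer string;
    `is_grounded(context, answer)` returns a bool. If no attempt passes
    the faithfulness gate, serve the fallback instead of the answer.
    """
    for attempt in range(max_retries + 1):
        # First attempt uses the normal prompt; retries flip on strict mode
        answer = generate(context, question, strict=(attempt > 0))
        if is_grounded(context, answer):
            return {"answer": answer, "grounded": True}
    return {"answer": FALLBACK, "grounded": False}
```

In the real pipeline, `strict=True` might map to a system prompt like "Quote only sentences that appear verbatim in the context."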

Step 4: Wire It to a Vector Store

from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load and chunk your docs
loader = DirectoryLoader("./docs", glob="**/*.md")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(docs)

# Build vectorstore
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# Run Self-RAG
result = self_rag("What deployment strategies does the platform support?", vectorstore)
print(result)

Expected:

{'answer': 'The platform supports blue-green and canary deployments...', 'grounded': True, 'source': 'retrieval', 'docs_used': 3}

[Screenshot: terminal showing Self-RAG output with grounded: True. A grounded answer includes the docs_used count and source field for auditability.]


Verification

python -c "
from your_module import self_rag, vectorstore

# Should return grounded: True
result = self_rag('What is the refund policy?', vectorstore)
assert result['grounded'] == True, 'Answer not grounded!'
print('Pass:', result['answer'][:80])
"

You should see: Pass: followed by the first 80 characters of a grounded answer. If you see AssertionError, the faithfulness gate caught an ungrounded response—check your chunk size (too large = diluted relevance).


What You Learned

  • Self-RAG adds three critique gates that prevent confident hallucination
  • Separating the critique model from the generation model keeps grades uncontaminated
  • The faithfulness gate is your last line of defense—treat a False result as a retrieval failure, not a generation prompt problem
  • Limitation: Each Self-RAG call makes 2–5 LLM calls instead of 1. Latency roughly doubles. Cache critique results for repeated queries in production.
  • When NOT to use this: Real-time applications where latency matters more than factual precision. For chatbots, a simpler citation-grounding check is often enough.
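For the caching suggestion above, a minimal in-memory sketch using functools.lru_cache (the wrapper name and shape are illustrative, not a library API):

```python
from functools import lru_cache

def make_cached_grader(grade_fn, maxsize=1024):
    """Wrap a (question, chunk) -> bool grader in an LRU cache so
    repeated queries don't re-pay the critique LLM call.

    Arguments must be hashable; plain strings are fine here.
    """
    @lru_cache(maxsize=maxsize)
    def cached(question: str, chunk: str) -> bool:
        return grade_fn(question, chunk)
    return cached
```

In the pipeline you would wrap the relevance check once at startup and call the wrapper inside the filter loop; identical (question, chunk) pairs then cost a dict lookup instead of an LLM round trip.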

Tested on Python 3.12, LangChain 0.3, OpenAI gpt-4o + gpt-4o-mini, ChromaDB 0.5