Problem: Dense Retrieval Returns Irrelevant Chunks
Cross-encoder reranking with a BGE Reranker fixes the single biggest failure mode in production RAG — your vector search returns the top-k chunks by embedding similarity, but similarity ≠ relevance. The wrong passages reach the LLM, and hallucinations follow.
This happens because bi-encoder embeddings compress meaning into a fixed vector. They're fast, but they can't model the interaction between a query and a document. A cross-encoder reads both together and scores their relevance directly — no compression, no approximation.
You'll learn:
- Why bi-encoder retrieval fails at precision and when a reranker fixes it
- How to install and run BAAI/bge-reranker-v2-m3 locally with FlagEmbedding
- How to wire a BGE Reranker into a LangChain RAG pipeline with a ContextualCompressionRetriever
- How to benchmark reranked vs. non-reranked retrieval on your own dataset
Time: 20 min | Difficulty: Intermediate
Why Dense Retrieval Loses Precision
Vector search works by converting your query and every stored chunk into embeddings, then finding the nearest neighbors by cosine distance. At retrieval time, the query and each document are encoded independently — the model never sees them side by side.
That independence is the tradeoff. Embeddings are fast to compute and cheap to store, but they can't capture fine-grained lexical overlap, negation, or conditional relevance. A chunk about "Python memory management" scores high for "Python garbage collection errors" even if it doesn't answer the question.
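To make that concrete, here is a toy sketch (hand-picked 3-dimensional vectors, not real model output) of how cosine similarity rewards topical closeness without checking whether a chunk actually answers the question:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: both chunks sit near the query in "Python memory" space,
# even though only one of them answers the question.
query            = [0.9, 0.3, 0.1]    # "Python garbage collection errors"
answers_question = [0.8, 0.4, 0.2]    # chunk that explains the GC error
merely_related   = [0.85, 0.35, 0.1]  # chunk about memory management generally

# The merely-related chunk scores slightly higher here, despite not
# answering the question -- similarity is not relevance.
print(cosine(query, answers_question))
print(cosine(query, merely_related))
```

Real embeddings live in hundreds of dimensions, but the failure mode is the same: nothing in the dot product encodes "does this chunk answer that question."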
Symptoms:
- LLM answer cites passages that are topically related but factually off
- Increasing top_k improves recall but floods the context window with noise
- Retrieval eval MRR@10 looks good but end-to-end answer quality is poor
The fix is a two-stage pipeline: fast bi-encoder retrieval for recall, then a cross-encoder reranker for precision.
How BGE Reranker Works
Stage 1 (bi-encoder): retrieve top-20 candidates fast. Stage 2 (cross-encoder): score each query–chunk pair jointly, return top-3 to the LLM.
A cross-encoder takes a query–document pair as a single input sequence — [CLS] query [SEP] document [SEP] — and outputs a scalar relevance score. Because both texts are processed together through every transformer layer, attention can model their interaction directly.
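A minimal illustration of that input packing, using toy whitespace tokens and placeholder special tokens in place of a real subword tokenizer:

```python
def pack_pair(query: str, document: str) -> list[str]:
    """Build the single [CLS] query [SEP] document [SEP] sequence a
    cross-encoder consumes. Real models use subword tokenizers; plain
    whitespace splitting stands in for that here."""
    return ["[CLS]", *query.split(), "[SEP]", *document.split(), "[SEP]"]

seq = pack_pair("python gc", "cyclic collector in cpython")
print(seq)
# Every transformer layer attends across this whole sequence, so query
# tokens and document tokens interact directly -- unlike a bi-encoder,
# which encodes each side separately and only compares the final vectors.
```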
BAAI/bge-reranker-v2-m3 is one of the strongest open-weight rerankers for English and multilingual workloads. It's built on the multilingual BGE-M3 backbone, small enough to run on CPU for batches under 100 pairs and fast on a single GPU for larger workloads.
| Model | Params | BEIR nDCG@10 | Latency (CPU, batch=16) |
|---|---|---|---|
| bge-reranker-base | 278M | 49.2 | ~120ms |
| bge-reranker-v2-m3 | 278M | 51.8 | ~130ms |
| bge-reranker-large | 560M | 52.1 | ~310ms |
| Cohere Rerank 3 (API) | N/A | 53.4 | ~200ms (network) |
Choose bge-reranker-v2-m3 for self-hosted, cost-free reranking — it matches Cohere's quality at zero API cost. Choose Cohere Rerank 3 if you're already paying for Cohere's API ($1.00/1k searches) and want to avoid GPU infra.
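A rough break-even sketch built on the $1.00/1k figure above; the self-hosted monthly cost is a placeholder assumption, so substitute your own infra numbers:

```python
def monthly_api_cost(searches_per_month: int, usd_per_1k: float = 1.00) -> float:
    # Cohere-style metered pricing: dollars per thousand rerank calls
    return searches_per_month / 1000 * usd_per_1k

# Assumed placeholder: ~$250/month for a small dedicated GPU instance.
SELF_HOSTED_USD = 250.0

for searches in (50_000, 250_000, 1_000_000):
    api = monthly_api_cost(searches)
    cheaper = "self-hosted" if SELF_HOSTED_USD < api else "API"
    print(f"{searches:>9,} searches/mo -> API ${api:,.0f} "
          f"vs self-hosted ${SELF_HOSTED_USD:,.0f} ({cheaper} cheaper)")
```

The crossover moves with your actual GPU cost and search volume; the point is only that metered pricing scales linearly while self-hosting is roughly flat.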
Solution
Step 1: Install Dependencies
# FlagEmbedding provides FlagReranker — the inference wrapper for all BGE models
pip install FlagEmbedding langchain langchain-community chromadb sentence-transformers --break-system-packages
Verify the install:
python -c "from FlagEmbedding import FlagReranker; print('FlagEmbedding OK')"
Expected output: FlagEmbedding OK
If it fails:
- ModuleNotFoundError: FlagEmbedding → pip install FlagEmbedding --upgrade --break-system-packages
- RuntimeError: CUDA not available → set use_fp16=False (runs on CPU)
Step 2: Score Query–Document Pairs with FlagReranker
from FlagEmbedding import FlagReranker
reranker = FlagReranker(
    "BAAI/bge-reranker-v2-m3",
    use_fp16=True,  # fp16 halves VRAM; set False on CPU-only machines
)
query = "How does Python garbage collection handle circular references?"
# Simulated bi-encoder top-5 candidates
candidates = [
    "Python uses reference counting as its primary memory management strategy.",
    "The gc module handles cyclic garbage collection in CPython using a generational algorithm.",
    "Garbage collection in Java relies on the JVM heap and generational GC.",
    "CPython's cyclic GC runs in three generations; gen0 collects most frequently.",
    "Python 3.12 introduced incremental garbage collection to reduce GC pause times.",
]
pairs = [[query, doc] for doc in candidates]
scores = reranker.compute_score(
    pairs,
    batch_size=16,   # increase to 32–64 on GPU; keep at 8–16 on CPU
    normalize=True,  # sigmoid-normalize to [0, 1] for easier threshold filtering
)
ranked = sorted(zip(scores, candidates), reverse=True)
for score, doc in ranked:
    print(f"{score:.4f} | {doc[:80]}")
Expected output:
0.9821 | CPython's cyclic GC runs in three generations; gen0 collects most frequently.
0.9634 | The gc module handles cyclic garbage collection in CPython using a generational...
0.8912 | Python 3.12 introduced incremental garbage collection to reduce GC pause times.
0.3201 | Python uses reference counting as its primary memory management strategy.
0.0741 | Garbage collection in Java relies on the JVM heap and generational GC.
The Java doc drops to the bottom: it's topically close to "garbage collection," but the cross-encoder correctly penalizes the Python/Java mismatch.
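Since normalize=True is just a sigmoid over the reranker's raw logits (per the comment in the scoring call above), you can reproduce the mapping yourself, e.g. to translate a threshold chosen on normalized scores back to logit space:

```python
import math

def sigmoid(x: float) -> float:
    # The mapping normalize=True applies to the reranker's raw logits
    return 1.0 / (1.0 + math.exp(-x))

# A raw logit of 0 maps to exactly 0.5; relevant pairs push logits
# well positive, irrelevant ones well negative.
print(sigmoid(4.0))   # high-relevance region, close to 1
print(sigmoid(-2.5))  # low-relevance region, close to 0
```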
Step 3: Build the BGE Reranker as a LangChain Document Compressor
LangChain's ContextualCompressionRetriever wraps any base retriever and a DocumentCompressor. You implement the compressor once and swap it in front of any retriever — Chroma, Pinecone, pgvector, FAISS.
from typing import Optional, Sequence
from langchain_core.documents import Document
from langchain_core.callbacks.manager import Callbacks
from langchain.retrievers.document_compressors.base import BaseDocumentCompressor
from FlagEmbedding import FlagReranker
from pydantic import Field
class BGERerankerCompressor(BaseDocumentCompressor):
    """LangChain document compressor backed by BAAI/bge-reranker-v2-m3."""

    model_name: str = Field(default="BAAI/bge-reranker-v2-m3")
    top_n: int = Field(default=3)  # how many docs to keep after reranking
    score_threshold: float = Field(default=0.1)  # drop anything below this score
    _reranker: Optional[FlagReranker] = None  # private attr, set in model_post_init

    def model_post_init(self, __context):
        # Load the model once, after pydantic has validated the fields
        self._reranker = FlagReranker(self.model_name, use_fp16=True)

    def compress_documents(
        self,
        documents: Sequence[Document],
        query: str,
        callbacks: Optional[Callbacks] = None,
    ) -> Sequence[Document]:
        if not documents:
            return []
        pairs = [[query, doc.page_content] for doc in documents]
        scores = self._reranker.compute_score(pairs, normalize=True)
        if not isinstance(scores, list):
            scores = [scores]  # compute_score returns a bare float for a single pair
        # Attach score to metadata so the LLM chain can inspect it later
        scored_docs = [
            (score, doc)
            for score, doc in zip(scores, documents)
            if score >= self.score_threshold
        ]
        scored_docs.sort(key=lambda x: x[0], reverse=True)
        results = []
        for score, doc in scored_docs[: self.top_n]:
            doc.metadata["rerank_score"] = round(score, 4)
            results.append(doc)
        return results
Step 4: Wire Into a Full RAG Pipeline
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.retrievers import ContextualCompressionRetriever
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
# --- Embeddings + vector store (swap Chroma for pgvector, Pinecone, etc.) ---
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
)
# Stage 1: retrieve top-20 by cosine similarity (high recall, lower precision)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
# Stage 2: rerank, keep top-3 (high precision, low noise to LLM)
reranker_compressor = BGERerankerCompressor(top_n=3, score_threshold=0.15)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker_compressor,
    base_retriever=base_retriever,
)
# --- RAG chain ---
prompt = ChatPromptTemplate.from_template(
    """Answer the question using only the context below.
If the context doesn't contain the answer, say "I don't know."

Context:
{context}

Question: {question}
"""
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
def format_docs(docs):
    return "\n\n".join(
        f"[score={d.metadata.get('rerank_score', '?')}] {d.page_content}"
        for d in docs
    )
rag_chain = (
    {
        "context": compression_retriever | format_docs,
        "question": lambda x: x,
    }
    | prompt
    | llm
    | StrOutputParser()
)
answer = rag_chain.invoke("How does Python garbage collection handle circular references?")
print(answer)
The rerank_score in the formatted context is optional but useful — it lets you spot-check which chunks the LLM is relying on.
Step 5: Run BGE Reranker in Docker (Optional, Self-Hosted)
If you want the reranker isolated from your app or running on a dedicated GPU node:
# Dockerfile
FROM python:3.12-slim
WORKDIR /app
RUN pip install FlagEmbedding fastapi uvicorn --break-system-packages
COPY reranker_server.py .
CMD ["uvicorn", "reranker_server:app", "--host", "0.0.0.0", "--port", "8080"]
# reranker_server.py
from fastapi import FastAPI
from pydantic import BaseModel
from FlagEmbedding import FlagReranker
app = FastAPI()
reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)
class RerankRequest(BaseModel):
    query: str
    documents: list[str]
    top_n: int = 3

@app.post("/rerank")
def rerank(req: RerankRequest):
    pairs = [[req.query, doc] for doc in req.documents]
    scores = reranker.compute_score(pairs, normalize=True)
    if not isinstance(scores, list):
        scores = [scores]  # compute_score returns a bare float for a single pair
    ranked = sorted(zip(scores, req.documents), reverse=True)[:req.top_n]
    return [{"score": float(s), "document": d} for s, d in ranked]
docker build -t bge-reranker .
docker run -p 8080:8080 bge-reranker
Expected output: Uvicorn running on http://0.0.0.0:8080
Test it:
curl -X POST http://localhost:8080/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Python garbage collection circular references",
    "documents": ["CPython gc module uses generational GC", "Java uses JVM heap"],
    "top_n": 1
  }'
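The same request from Python, using only the standard library; the payload shape mirrors the RerankRequest model in reranker_server.py, and the live call only succeeds while the container is running:

```python
import json
import urllib.request

def build_payload(query: str, documents: list[str], top_n: int = 3) -> dict:
    # Mirrors the RerankRequest model in reranker_server.py
    return {"query": query, "documents": documents, "top_n": top_n}

def rerank_remote(query: str, documents: list[str], top_n: int = 3,
                  url: str = "http://localhost:8080/rerank") -> list[dict]:
    """POST to the /rerank endpoint defined in reranker_server.py."""
    data = json.dumps(build_payload(query, documents, top_n)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())

try:
    print(rerank_remote(
        "Python garbage collection circular references",
        ["CPython gc module uses generational GC", "Java uses JVM heap"],
        top_n=1,
    ))
except OSError as exc:  # connection refused, timeout, etc.
    print(f"rerank endpoint unreachable (is the container running?): {exc}")
```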
Verification
Run this end-to-end smoke test to confirm the full pipeline works:
from langchain_core.documents import Document
test_docs = [
    Document(page_content="CPython's gc module uses generational garbage collection."),
    Document(page_content="Java's JVM uses a mark-and-sweep GC by default."),
    Document(page_content="Python 3.12 introduced incremental GC to reduce pause times."),
]
compressor = BGERerankerCompressor(top_n=2)
results = compressor.compress_documents(
    test_docs,
    query="How does Python handle circular reference garbage collection?",
)
assert len(results) == 2
assert results[0].metadata["rerank_score"] > results[1].metadata["rerank_score"]
assert "Java" not in results[0].page_content # Java doc should not rank first
print("All assertions passed.")
for r in results:
    print(r.metadata["rerank_score"], r.page_content[:60])
You should see: All assertions passed. followed by two Python-related chunks in descending score order.
What You Learned
- Two-stage retrieval (bi-encoder recall → cross-encoder reranking) is the production pattern for RAG precision. Bi-encoders are fast; cross-encoders are accurate.
- FlagReranker.compute_score accepts raw string pairs — no tokenization boilerplate. Pass normalize=True to get [0, 1] scores you can threshold.
- BGERerankerCompressor slots into ContextualCompressionRetriever and works with any LangChain vector store — no retriever rewrites needed.
- Don't over-retrieve. A top_k=20 → top_n=3 ratio is a good default. Going wider than 50 candidates adds latency without meaningfully improving recall.
- When NOT to use a reranker: sub-100ms latency requirements where you can't afford the cross-encoder pass, or when your retrieval corpus is small enough that bi-encoder MRR@5 already exceeds 0.85.
Tested on Python 3.12, FlagEmbedding 1.2.9, LangChain 0.3.x, BAAI/bge-reranker-v2-m3, macOS (M2 Max) and Ubuntu 22.04 + RTX 4090
FAQ
Q: Does BGE Reranker work without a GPU?
A: Yes. Set use_fp16=False when initializing FlagReranker. On CPU, expect ~130ms per 16-pair batch — acceptable for batches under 50 candidates. Above that, latency stacks up quickly and a GPU becomes worth it.
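Because CPU batches run sequentially, that ~130ms-per-16-pair figure composes linearly; a quick estimator (built on the quoted number, so measure on your own hardware before trusting it):

```python
import math

def estimated_cpu_latency_ms(n_candidates: int, batch_size: int = 16,
                             ms_per_batch: float = 130.0) -> float:
    # Total time is (number of sequential batches) x (time per batch)
    return math.ceil(n_candidates / batch_size) * ms_per_batch

for n in (16, 50, 100):
    print(f"{n:>3} candidates -> ~{estimated_cpu_latency_ms(n):.0f} ms")
```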
Q: What is the difference between bge-reranker-v2-m3 and bge-reranker-large?
A: v2-m3 is multilingual and scores 51.8 nDCG@10 on BEIR at 278M params. bge-reranker-large adds ~2 nDCG@10 points but doubles latency at 560M params. Unless your eval shows a measurable quality gap on your dataset, v2-m3 is the better default.
Q: How many candidates should I pass to the reranker?
A: 10–30 is the practical range. Below 10, you may miss relevant chunks that the bi-encoder ranked low. Above 50, cross-encoder latency grows linearly (~8ms per pair on CPU) and the marginal recall gain is minimal.
Q: Can this work with Pinecone or pgvector instead of Chroma?
A: Yes. ContextualCompressionRetriever wraps any BaseRetriever. Swap vectorstore.as_retriever() for your Pinecone or pgvector retriever — the BGERerankerCompressor is retriever-agnostic.
Q: What does compute_score batch_size control?
A: batch_size sets how many query–document pairs are forwarded through the cross-encoder in a single tensor operation. Larger batches improve GPU utilization but increase peak VRAM. Start at 16 on GPU (uses ~1.2GB VRAM for v2-m3) and scale up until VRAM is the bottleneck.