Problem: Your RAG Pipeline Returns the Wrong Chunks
RAG reranking with Cohere and FlashRank fixes the most common failure mode in production retrieval pipelines — high cosine similarity scores that still return semantically off-target chunks.
Vector similarity is fast, but it ranks by embedding proximity, not by actual relevance to the user's question. A chunk mentioning "transformer architecture" scores highly for "how do I fix a slow API?" because the embeddings overlap. Reranking adds a second pass: a cross-encoder model that reads both the query and the chunk together and scores true semantic fit.
You'll learn:
- Why vector similarity alone fails at precision and when to add a reranker
- How to integrate Cohere Rerank API ($0.10 per 1,000 searches) into a LangChain FAISS pipeline
- How to swap in FlashRank for fully local, zero-cost reranking with no API key
- How to benchmark both and choose based on your latency and cost constraints
Time: 20 min | Difficulty: Intermediate
Why Vector Search Alone Fails at Precision
Embedding models compress meaning into fixed-size vectors. Two chunks can land near the same point in vector space for shallow lexical reasons — shared technical vocabulary, domain overlap, or topic adjacency — even when only one actually answers the query.
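You can see the trap with a toy example. The three-dimensional vectors below are hand-made stand-ins, not real embeddings; the axes and numbers are invented purely to show how shared vocabulary inflates cosine similarity:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Invented axes: [ml-jargon, api-vocabulary, latency-vocabulary]
query     = [0.1, 1.0, 1.0]  # "how do I fix a slow API?"
right_doc = [0.0, 1.0, 0.9]  # connection-pooling chunk (actually answers it)
wrong_doc = [0.6, 0.9, 0.8]  # transformer chunk that shares perf vocabulary

print(f"right: {cosine(query, right_doc):.3f}")  # high
print(f"wrong: {cosine(query, wrong_doc):.3f}")  # also above 0.9 -- the trap
```

Both chunks land above 0.9 because they share two of the three vocabulary axes, even though only one answers the question.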
Symptoms you've already seen:
- Top-3 retrieved chunks include one or two that feel "almost right" but miss the point
- LLM answers are vague or hedge heavily because the context is ambiguous
- Similarity score for the wrong chunk is 0.87 while the right chunk scores 0.84
A cross-encoder reranker avoids this by reading the full (query, chunk) pair simultaneously. Because it never embeds query and chunk separately, it is far harder to fool with surface vocabulary overlap.
Two-stage retrieval: FAISS fetches candidates fast, the reranker promotes the genuinely relevant ones
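The two-stage shape can be sketched in a few lines. The scoring functions here are toy stand-ins (in the real pipeline, FAISS plays the fast role and the cross-encoder the exact role); only the pipeline structure is the point:

```python
def two_stage_retrieve(query, corpus, fast_score, exact_score, k=6, top_n=3):
    # Stage 1: cheap score over the whole corpus (vector search's job)
    candidates = sorted(corpus, key=lambda d: fast_score(query, d), reverse=True)[:k]
    # Stage 2: expensive score over only the k survivors (the reranker's job)
    return sorted(candidates, key=lambda d: exact_score(query, d), reverse=True)[:top_n]

corpus = ["api latency pooling", "transformer attention", "latency retries", "attention heads"]
fast = lambda q, d: len(set(q.split()) & set(d.split()))  # token overlap: recall-oriented
exact = lambda q, d: 2.0 if "latency" in d else 0.0       # toy "cross-encoder": precision

print(two_stage_retrieve("reduce api latency", corpus, fast, exact, k=3, top_n=2))
# ['api latency pooling', 'latency retries']
```

The expensive scorer only ever sees k documents, which is why a slow-but-accurate model stays affordable in stage 2.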
Setup
Step 1: Install dependencies
# uv is faster than pip for dependency resolution
uv pip install langchain langchain-community langchain-cohere langchain-openai \
    faiss-cpu flashrank cohere python-dotenv openai
Python 3.10 or newer is recommended; the examples in this guide were tested on Python 3.12.
Expected output: Successfully installed flashrank-0.2.x ...
Step 2: Build a baseline FAISS retriever
Start with a plain FAISS retriever so you can measure what reranking actually improves.
# baseline_retriever.py
import os
from dotenv import load_dotenv
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
load_dotenv()
# Sample corpus — replace with your actual documents
docs_raw = [
"Transformer attention scales quadratically with sequence length.",
"To reduce API latency, add connection pooling and retry with exponential backoff.",
"Self-attention allows each token to attend to all others in the sequence.",
"Rate limiting in FastAPI uses slowapi with Redis as the backend store.",
"The feedforward sublayer in a transformer uses two linear projections with a ReLU.",
"Connection pooling with asyncpg cuts Postgres query overhead by up to 40%.",
"Multi-head attention splits the embedding into H independent attention heads.",
"Use httpx.AsyncClient with limits=Limits(max_connections=100) for async HTTP pooling.",
]
splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
docs = splitter.create_documents(docs_raw)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(docs, embeddings)
vectorstore.save_local("faiss_index")  # persist so later steps can reuse this index
retriever = vectorstore.as_retriever(search_kwargs={"k": 6})  # fetch 6, rerank to top 3
query = "how do I reduce API response latency?"
results = retriever.invoke(query)
print("=== Baseline FAISS top-6 ===")
for i, doc in enumerate(results):
    print(f"{i+1}. {doc.page_content[:80]}")
Run this first. You'll notice transformer-related chunks appear in the top 6 despite being irrelevant. That's the problem reranking solves.
Step 3: Add Cohere Rerank
Cohere's Rerank API is the lowest-friction option. You get a managed cross-encoder, no GPU needed, with pricing starting at $0.10 per 1,000 searches (USD).
# cohere_rerank.py
import os
from dotenv import load_dotenv
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
load_dotenv() # COHERE_API_KEY must be set in .env
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)  # reuse index from Step 2
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 6})
reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=3,  # return only the 3 most relevant chunks after reranking
    cohere_api_key=os.environ["COHERE_API_KEY"],
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever,
)
query = "how do I reduce API response latency?"
reranked = compression_retriever.invoke(query)
print("=== Cohere Reranked top-3 ===")
for i, doc in enumerate(reranked):
    print(f"{i+1}. {doc.page_content[:80]}")
Expected output:
=== Cohere Reranked top-3 ===
1. To reduce API latency, add connection pooling and retry with exponential backoff.
2. Connection pooling with asyncpg cuts Postgres query overhead by up to 40%.
3. Use httpx.AsyncClient with limits=Limits(max_connections=100) for async HTTP pooling.
The transformer chunks are gone. top_n=3 is the key parameter — set it lower than k so the reranker has candidates to re-sort.
If it fails:
- CohereAPIError: 401 → COHERE_API_KEY missing or wrong in .env
- ImportError: langchain_cohere → run uv pip install langchain-cohere
Step 4: Add FlashRank for local reranking
FlashRank runs a quantized cross-encoder entirely on CPU. No API key, no network call, no usage bill. Latency on a MacBook M2 is ~40ms for 10 candidates — acceptable for most applications.
# flashrank_rerank.py
from flashrank import Ranker, RerankRequest
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
# ms-marco-MiniLM-L-12-v2 is 64MB — downloads once and caches locally
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/tmp/flashrank_cache")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 6})
query = "how do I reduce API response latency?"
candidates = base_retriever.invoke(query)
rerank_request = RerankRequest(
    query=query,
    passages=[{"id": i, "text": doc.page_content} for i, doc in enumerate(candidates)],
)
results = ranker.rerank(rerank_request)
top3 = results[:3] # take top 3 after reranking
print("=== FlashRank local top-3 ===")
for r in top3:
    print(f"score={r['score']:.4f} | {r['text'][:80]}")
Expected output:
=== FlashRank local top-3 ===
score=0.9821 | To reduce API latency, add connection pooling and retry with exponential backoff.
score=0.9104 | Connection pooling with asyncpg cuts Postgres query overhead by up to 40%.
score=0.8773 | Use httpx.AsyncClient with limits=Limits(max_connections=100) for async HTTP pooling.
Step 5: Wire the reranker into an end-to-end RAG chain
# rag_with_reranker.py
import os
from dotenv import load_dotenv
from langchain.chains import RetrievalQA
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
load_dotenv()
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 8}) # wider net for reranker
reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=3,
    cohere_api_key=os.environ["COHERE_API_KEY"],
)
retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever,
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
)
result = qa_chain.invoke({"query": "how do I reduce API response latency?"})
print(result["result"])
print("\nSources used:")
for doc in result["source_documents"]:
    print(f"  - {doc.page_content[:60]}")
Cohere vs FlashRank: Which Should You Use?
| | Cohere Rerank v3 | FlashRank (ms-marco-MiniLM) |
|---|---|---|
| Deployment | Managed API | Local CPU |
| Latency (10 candidates) | ~200–400ms (network) | ~30–60ms |
| Cost | $0.10 / 1k searches (USD) | Free |
| Setup | API key only | 64MB model download |
| Accuracy (BEIR benchmark) | Higher (large model) | Slightly lower |
| Self-hosted / air-gapped | ❌ | ✅ |
| Best for | Production apps, highest precision | Local dev, cost-sensitive, HIPAA/air-gapped |
Choose Cohere Rerank if: you need maximum retrieval precision and are comfortable with per-query billing.
Choose FlashRank if: you're building a self-hosted or air-gapped system, running on a tight budget, or need sub-100ms local latency.
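The latency figures in the table will vary on your hardware, so measure them yourself. A minimal harness is sketched below; `rerank_fn` is whatever callable wraps your reranker (a Cohere API call or FlashRank's `ranker.rerank`), and the dummy function here exists only so the sketch runs standalone:

```python
import statistics
import time

def bench(rerank_fn, query, candidates, runs=20):
    """Return (median_ms, max_ms) over `runs` calls to rerank_fn."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        rerank_fn(query, candidates)
        times.append((time.perf_counter() - start) * 1000)
    return statistics.median(times), max(times)

# Dummy stand-in so the harness is runnable without either reranker
dummy = lambda q, docs: sorted(docs, key=len)
p50, worst = bench(dummy, "reduce api latency", ["chunk one", "chunk two longer"] * 5)
print(f"p50={p50:.2f}ms  max={worst:.2f}ms")
```

Run the same queries through both wrappers and compare medians, not single calls: network jitter dominates Cohere's worst case.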
Verification
Run this quick sanity check after wiring either reranker:
python -c "
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
import os
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')
vs = FAISS.load_local('faiss_index', embeddings, allow_dangerous_deserialization=True)
results = vs.similarity_search('API latency', k=3)
for r in results:
    print(r.page_content[:60])
"
You should see three latency-related chunks, confirming the index loads and embeds correctly. To verify the reranking step itself, re-run cohere_rerank.py or flashrank_rerank.py and check that the top 3 contain no transformer chunks.
What You Learned
- Vector similarity retrieval is fast but imprecise — retrieval recall is high, precision is low without reranking
- Cohere Rerank uses a server-side cross-encoder; FlashRank runs ms-marco-MiniLM-L-12-v2 locally on CPU
- Always set k (vector fetch count) higher than top_n (reranked output count) — the reranker needs candidates to work with; a good ratio is 2–3x
- FlashRank is good enough for most use cases and eliminates API dependency entirely
Tested on Python 3.12, LangChain 0.3.x, flashrank 0.2.x, faiss-cpu 1.8.x, macOS Sequoia and Ubuntu 24.04
FAQ
Q: How many candidates should I pass to the reranker?
A: Fetch 2–3x your desired top_n. If you want 3 final chunks, fetch 6–9 from FAISS. Diminishing returns kick in past 20 candidates.
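That rule of thumb is easy to encode. `fetch_k` is a name invented for this sketch, not a library parameter; it simply derives the vector-store `k` from your desired `top_n`:

```python
def fetch_k(top_n, ratio=3, cap=20):
    # fetch ratio x top_n candidates, capped where returns diminish
    return min(top_n * ratio, cap)

print(fetch_k(3))   # 9  -> use as search_kwargs={"k": fetch_k(3)}
print(fetch_k(10))  # 20 -> capped
```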
Q: Does FlashRank support non-English documents?
A: Use ms-marco-MultiBERT-L-12 for multilingual corpora. English-only ms-marco-MiniLM-L-12-v2 degrades noticeably on non-English text.
Q: What is the minimum RAM to run FlashRank locally?
A: The ms-marco-MiniLM-L-12-v2 model needs roughly 200MB RAM at inference. It runs fine on any machine with 4GB+ available.
Q: Can I use FlashRank inside a LangChain ContextualCompressionRetriever?
A: Not directly — FlashRank doesn't implement the BaseDocumentCompressor interface yet. Wrap it in a custom compressor class or call it manually after the base retriever, as shown in Step 4.
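Here is a sketch of that wrapper approach. To stay self-contained it uses a plain class and a stand-in scoring callable; in real code you would subclass LangChain's BaseDocumentCompressor, hold a flashrank.Ranker, and score Document objects via their page_content:

```python
class FlashRankCompressor:
    """Pattern sketch only: in production, subclass LangChain's
    BaseDocumentCompressor and call a flashrank.Ranker inside
    compress_documents instead of a plain score function."""

    def __init__(self, score_fn, top_n=3):
        self.score_fn = score_fn  # (query, texts) -> list of floats
        self.top_n = top_n

    def compress_documents(self, documents, query):
        scores = self.score_fn(query, documents)
        ranked = sorted(zip(scores, documents), key=lambda p: p[0], reverse=True)
        return [doc for _, doc in ranked[: self.top_n]]

# Stand-in scorer; a real one would build a RerankRequest and call ranker.rerank
toy = lambda q, docs: [1.0 if "latency" in d else 0.0 for d in docs]
comp = FlashRankCompressor(toy, top_n=2)
print(comp.compress_documents(
    ["cut API latency", "attention heads", "latency retries"], "slow api"))
# ['cut API latency', 'latency retries']
```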
Q: Is Cohere Rerank available on AWS us-east-1?
A: Cohere's API is region-agnostic — your request routes to the nearest endpoint automatically. For data-residency requirements in the US, check Cohere's enterprise tier which supports dedicated deployments in AWS us-east-1.