Problem: Your RAG Pipeline Returns the Wrong Chunks
RAG reranking with Cohere and FlashRank fixes the most common failure mode in production retrieval pipelines — high cosine similarity scores that still return semantically off-target chunks.
Vector similarity is fast, but it ranks by embedding proximity, not by actual relevance to the user's question. A chunk mentioning "transformer architecture" scores highly for "how do I fix a slow API?" because the embeddings overlap. Reranking adds a second pass: a cross-encoder model that reads both the query and the chunk together and scores true semantic fit.
You'll learn:
- Why vector similarity alone fails at precision and when to add a reranker
- How to integrate Cohere Rerank API ($0.10 per 1,000 searches) into a LangChain FAISS pipeline
- How to swap in FlashRank for fully local, zero-cost reranking with no API key
- How to benchmark both and choose based on your latency and cost constraints
Time: 20 min | Difficulty: Intermediate
Why Vector Search Alone Fails at Precision
Embedding models compress meaning into fixed-size vectors. Two chunks can land near the same point in vector space for shallow lexical reasons — shared technical vocabulary, domain overlap, or topic adjacency — even when only one actually answers the query.
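You can see the trap with a toy example. The three-dimensional vectors below are hand-made stand-ins, not real embeddings; the axes and numbers are invented purely to show how shared vocabulary inflates cosine similarity:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Invented axes: [ml-jargon, api-vocabulary, latency-vocabulary]
query     = [0.1, 1.0, 1.0]  # "how do I fix a slow API?"
right_doc = [0.0, 1.0, 0.9]  # connection-pooling chunk (actually answers it)
wrong_doc = [0.6, 0.9, 0.8]  # transformer chunk that shares perf vocabulary

print(f"right: {cosine(query, right_doc):.3f}")  # high
print(f"wrong: {cosine(query, wrong_doc):.3f}")  # also above 0.9 -- the trap
```

Both chunks land above 0.9 because they share two of the three vocabulary axes, even though only one answers the question.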
Symptoms you've already seen:
- Top-3 retrieved chunks include one or two that feel "almost right" but miss the point
- LLM answers are vague or hedge heavily because the context is ambiguous
- Similarity score for the wrong chunk is 0.87 while the right chunk scores 0.84
A cross-encoder reranker avoids this by reading the full (query, chunk) pair simultaneously. Because it never embeds query and chunk separately, it is far harder to fool with surface vocabulary overlap.
Two-stage retrieval: FAISS fetches candidates fast, the reranker promotes the genuinely relevant ones
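The two-stage shape can be sketched in a few lines. The scoring functions here are toy stand-ins (in the real pipeline, FAISS plays the fast role and the cross-encoder the exact role); only the pipeline structure is the point:

```python
def two_stage_retrieve(query, corpus, fast_score, exact_score, k=6, top_n=3):
    # Stage 1: cheap score over the whole corpus (vector search's job)
    candidates = sorted(corpus, key=lambda d: fast_score(query, d), reverse=True)[:k]
    # Stage 2: expensive score over only the k survivors (the reranker's job)
    return sorted(candidates, key=lambda d: exact_score(query, d), reverse=True)[:top_n]

corpus = ["api latency pooling", "transformer attention", "latency retries", "attention heads"]
fast = lambda q, d: len(set(q.split()) & set(d.split()))  # token overlap: recall-oriented
exact = lambda q, d: 2.0 if "latency" in d else 0.0       # toy "cross-encoder": precision

print(two_stage_retrieve("reduce api latency", corpus, fast, exact, k=3, top_n=2))
# ['api latency pooling', 'latency retries']
```

The expensive scorer only ever sees k documents, which is why a slow-but-accurate model stays affordable in stage 2.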
Setup
Step 1: Install dependencies
# uv is faster than pip for dependency resolution
uv pip install langchain langchain-community langchain-cohere langchain-openai \
    faiss-cpu flashrank cohere python-dotenv openai
Python 3.10 or newer is recommended; the examples in this guide were tested on Python 3.12.
Expected output: Successfully installed flashrank-0.2.x ...
Step 2: Build a baseline FAISS retriever
Start with a plain FAISS retriever so you can measure what reranking actually improves.
# baseline_retriever.py
import os
from dotenv import load_dotenv
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
load_dotenv()
# Sample corpus — replace with your actual documents
docs_raw = [
"Transformer attention scales quadratically with sequence length.",
"To reduce API latency, add connection pooling and retry with exponential backoff.",
"Self-attention allows each token to attend to all others in the sequence.",
"Rate limiting in FastAPI uses slowapi with Redis as the backend store.",
"The feedforward sublayer in a transformer uses two linear projections with a ReLU.",
"Connection pooling with asyncpg cuts Postgres query overhead by up to 40%.",
"Multi-head attention splits the embedding into H independent attention heads.",
"Use httpx.AsyncClient with limits=Limits(max_connections=100) for async HTTP pooling.",
]
splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
docs = splitter.create_documents(docs_raw)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(docs, embeddings)
vectorstore.save_local("faiss_index")  # persist so later steps can reuse this index
retriever = vectorstore.as_retriever(search_kwargs={"k": 6})  # fetch 6, rerank to top 3
query = "how do I reduce API response latency?"
results = retriever.invoke(query)
print("=== Baseline FAISS top-6 ===")
for i, doc in enumerate(results):
    print(f"{i+1}. {doc.page_content[:80]}")
Run this first. You'll notice transformer-related chunks appear in the top 6 despite being irrelevant. That's the problem reranking solves.
Step 3: Add Cohere Rerank
Cohere's Rerank API is the lowest-friction option. You get a managed cross-encoder, no GPU needed, with pricing starting at $0.10 per 1,000 searches (USD).
# cohere_rerank.py
import os
from dotenv import load_dotenv
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
load_dotenv() # COHERE_API_KEY must be set in .env
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)  # reuse index from Step 2
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 6})
reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=3,  # return only the 3 most relevant chunks after reranking
    cohere_api_key=os.environ["COHERE_API_KEY"],
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever,
)
query = "how do I reduce API response latency?"
reranked = compression_retriever.invoke(query)
print("=== Cohere Reranked top-3 ===")
for i, doc in enumerate(reranked):
    print(f"{i+1}. {doc.page_content[:80]}")
Expected output:
=== Cohere Reranked top-3 ===
1. To reduce API latency, add connection pooling and retry with exponential backoff.
2. Connection pooling with asyncpg cuts Postgres query overhead by up to 40%.
3. Use httpx.AsyncClient with limits=Limits(max_connections=100) for async HTTP pooling.
The transformer chunks are gone. top_n=3 is the key parameter — set it lower than k so the reranker has candidates to re-sort.
If it fails:
- CohereAPIError: 401 → COHERE_API_KEY missing or wrong in .env
- ImportError: langchain_cohere → run uv pip install langchain-cohere
Step 4: Add FlashRank for local reranking
FlashRank runs a quantized cross-encoder entirely on CPU. No API key, no network call, no usage bill. Latency on a MacBook M2 is ~40ms for 10 candidates — acceptable for most applications.
# flashrank_rerank.py
from flashrank import Ranker, RerankRequest
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
# ms-marco-MiniLM-L-12-v2 is 64MB — downloads once and caches locally
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/tmp/flashrank_cache")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 6})
query = "how do I reduce API response latency?"
candidates = base_retriever.invoke(query)
rerank_request = RerankRequest(
    query=query,
    passages=[{"id": i, "text": doc.page_content} for i, doc in enumerate(candidates)],
)
results = ranker.rerank(rerank_request)
top3 = results[:3] # take top 3 after reranking
print("=== FlashRank local top-3 ===")
for r in top3:
    print(f"score={r['score']:.4f} | {r['text'][:80]}")
Expected output:
=== FlashRank local top-3 ===
score=0.9821 | To reduce API latency, add connection pooling and retry with exponential backoff.
score=0.9104 | Connection pooling with asyncpg cuts Postgres query overhead by up to 40%.
score=0.8773 | Use httpx.AsyncClient with limits=Limits(max_connections=100) for async HTTP pooling.
Step 5: Wire the reranker into an end-to-end RAG chain
# rag_with_reranker.py
import os
from dotenv import load_dotenv
from langchain.chains import RetrievalQA
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
load_dotenv()
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 8}) # wider net for reranker
reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=3,
    cohere_api_key=os.environ["COHERE_API_KEY"],
)
retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever,
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
)
result = qa_chain.invoke({"query": "how do I reduce API response latency?"})
print(result["result"])
print("\nSources used:")
for doc in result["source_documents"]:
    print(f"  - {doc.page_content[:60]}")
Cohere vs FlashRank: Which Should You Use?
| | Cohere Rerank v3 | FlashRank (ms-marco-MiniLM) |
|---|---|---|
| Deployment | Managed API | Local CPU |
| Latency (10 candidates) | ~200–400ms (network) | ~30–60ms |
| Cost | $0.10 / 1k searches (USD) | Free |
| Setup | API key only | 64MB model download |
| Accuracy (BEIR benchmark) | Higher (large model) | Slightly lower |
| Self-hosted / air-gapped | ❌ | ✅ |
| Best for | Production apps, highest precision | Local dev, cost-sensitive, HIPAA/air-gapped |
Choose Cohere Rerank if: you need maximum retrieval precision and are comfortable with per-query billing.
Choose FlashRank if: you're building a self-hosted or air-gapped system, running on a tight budget, or need sub-100ms local latency.
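The latency figures in the table will vary on your hardware, so measure them yourself. A minimal harness is sketched below; `rerank_fn` is whatever callable wraps your reranker (a Cohere API call or FlashRank's `ranker.rerank`), and the dummy function here exists only so the sketch runs standalone:

```python
import statistics
import time

def bench(rerank_fn, query, candidates, runs=20):
    """Return (median_ms, max_ms) over `runs` calls to rerank_fn."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        rerank_fn(query, candidates)
        times.append((time.perf_counter() - start) * 1000)
    return statistics.median(times), max(times)

# Dummy stand-in so the harness is runnable without either reranker
dummy = lambda q, docs: sorted(docs, key=len)
p50, worst = bench(dummy, "reduce api latency", ["chunk one", "chunk two longer"] * 5)
print(f"p50={p50:.2f}ms  max={worst:.2f}ms")
```

Run the same queries through both wrappers and compare medians, not single calls: network jitter dominates Cohere's worst case.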
Verification
Run this quick sanity check after wiring either reranker:
python -c "
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
import os
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')
vs = FAISS.load_local('faiss_index', embeddings, allow_dangerous_deserialization=True)
results = vs.similarity_search('API latency', k=3)
for r in results:
    print(r.page_content[:60])
"
You should see three latency-related chunks, confirming the index loads and embeds correctly. To verify the reranking step itself, re-run cohere_rerank.py or flashrank_rerank.py and check that the top 3 contain no transformer chunks.
What You Learned
- Vector similarity retrieval is fast but imprecise — retrieval recall is high, precision is low without reranking
- Cohere Rerank uses a server-side cross-encoder; FlashRank runs ms-marco-MiniLM-L-12-v2 locally on CPU
- Always set k (vector fetch count) higher than top_n (reranked output count) — the reranker needs candidates to work with; a good ratio is 2–3x
- FlashRank is good enough for most use cases and eliminates API dependency entirely
Tested on Python 3.12, LangChain 0.3.x, flashrank 0.2.x, faiss-cpu 1.8.x, macOS Sequoia and Ubuntu 24.04
FAQ
Q: How many candidates should I pass to the reranker?
A: Fetch 2–3x your desired top_n. If you want 3 final chunks, fetch 6–9 from FAISS. Diminishing returns kick in past 20 candidates.
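That rule of thumb is easy to encode. `fetch_k` is a name invented for this sketch, not a library parameter; it simply derives the vector-store `k` from your desired `top_n`:

```python
def fetch_k(top_n, ratio=3, cap=20):
    # fetch ratio x top_n candidates, capped where returns diminish
    return min(top_n * ratio, cap)

print(fetch_k(3))   # 9  -> use as search_kwargs={"k": fetch_k(3)}
print(fetch_k(10))  # 20 -> capped
```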
Q: Does FlashRank support non-English documents?
A: Use ms-marco-MultiBERT-L-12 for multilingual corpora. English-only ms-marco-MiniLM-L-12-v2 degrades noticeably on non-English text.
Q: What is the minimum RAM to run FlashRank locally?
A: The ms-marco-MiniLM-L-12-v2 model needs roughly 200MB RAM at inference. It runs fine on any machine with 4GB+ available.
Q: Can I use FlashRank inside a LangChain ContextualCompressionRetriever?
A: Not directly — FlashRank doesn't implement the BaseDocumentCompressor interface yet. Wrap it in a custom compressor class or call it manually after the base retriever, as shown in Step 4.
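Here is a sketch of that wrapper approach. To stay self-contained it uses a plain class and a stand-in scoring callable; in real code you would subclass LangChain's BaseDocumentCompressor, hold a flashrank.Ranker, and score Document objects via their page_content:

```python
class FlashRankCompressor:
    """Pattern sketch only: in production, subclass LangChain's
    BaseDocumentCompressor and call a flashrank.Ranker inside
    compress_documents instead of a plain score function."""

    def __init__(self, score_fn, top_n=3):
        self.score_fn = score_fn  # (query, texts) -> list of floats
        self.top_n = top_n

    def compress_documents(self, documents, query):
        scores = self.score_fn(query, documents)
        ranked = sorted(zip(scores, documents), key=lambda p: p[0], reverse=True)
        return [doc for _, doc in ranked[: self.top_n]]

# Stand-in scorer; a real one would build a RerankRequest and call ranker.rerank
toy = lambda q, docs: [1.0 if "latency" in d else 0.0 for d in docs]
comp = FlashRankCompressor(toy, top_n=2)
print(comp.compress_documents(
    ["cut API latency", "attention heads", "latency retries"], "slow api"))
# ['cut API latency', 'latency retries']
```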
Q: Is Cohere Rerank available on AWS us-east-1?
A: Cohere's API is region-agnostic — your request routes to the nearest endpoint automatically. For data-residency requirements in the US, check Cohere's enterprise tier which supports dedicated deployments in AWS us-east-1.