Problem: Your RAG Pipeline Ignores the Best Evidence
You're running a RAG setup — vector search looks great, the right chunks are retrieved — but the LLM's answers are still vague or miss critical details sitting right there in the context.
The culprit is "lost in the middle" syndrome: LLMs perform significantly worse on information placed in the middle of a long context window compared to content at the start or end.
You'll learn:
- Why LLMs systematically underweight middle context
- How to reorder retrieved chunks so critical content lands at the edges
- How to add a cross-encoder reranker to surface the highest-signal documents
Time: 20 min | Level: Intermediate
Why This Happens
Transformer attention patterns aren't uniform across the context window. Research from Stanford and elsewhere ("Lost in the Middle", Liu et al., 2023) shows that recall degrades sharply for tokens positioned in the middle of long prompts — even when the model technically "sees" them.
In a RAG pipeline, your vector store retrieves, say, 8 chunks ranked by cosine similarity. They're typically inserted into the prompt in rank order: chunk 1 first, chunk 8 last. The most relevant chunk (rank 1) lands at position 0. Chunks 2–7 drift into the middle. The LLM leans on chunks 1 and 8, the rest fade.
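That rank-order insertion is easy to see in a minimal sketch (`retrieved` stands in for whatever your vector store returns):

```python
# Naive prompt assembly: chunks joined in similarity-rank order, so the
# mid-ranked chunks (often still highly relevant) land mid-context.
retrieved = [f"chunk {i}" for i in range(1, 9)]  # rank 1 first, rank 8 last

context = "\n\n---\n\n".join(retrieved)
print(context.splitlines()[0])   # rank 1 sits at the very start
print(context.splitlines()[-1])  # rank 8 sits at the very end
```

Everything between those two lines is the middle of the window — exactly where attention is weakest.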
Common symptoms:
- Answers are superficially correct but miss specific figures, names, or dates you know are in the retrieved context
- Longer context windows make things worse, not better
- Retrieval eval scores look fine but generation eval scores lag
Solution
Step 1: Diagnose With a Position Test
Before changing anything, verify you're actually hitting this problem.
```python
# position_test.py
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage

def test_position_sensitivity(context_chunks: list[str], query: str) -> dict:
    """
    Place the known-answer chunk (assumed to be at index 0) in the first,
    middle, and last position. If accuracy drops in the middle, you have
    the syndrome.
    """
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    results = {}
    for position in ["first", "middle", "last"]:
        chunks = context_chunks.copy()
        # Move the known-answer chunk to the test position
        answer_chunk = chunks.pop(0)
        if position == "first":
            ordered = [answer_chunk] + chunks
        elif position == "last":
            ordered = chunks + [answer_chunk]
        else:
            mid = len(chunks) // 2
            ordered = chunks[:mid] + [answer_chunk] + chunks[mid:]
        context = "\n\n---\n\n".join(ordered)
        prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer concisely."
        response = llm.invoke([HumanMessage(content=prompt)])
        results[position] = response.content
    return results
```
Expected: if answers are noticeably worse in the middle position, you've confirmed the issue.
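If you want to eyeball the three orderings without burning API calls, here is the placement logic in isolation (same assumption: the known-answer chunk starts at index 0):

```python
# Offline sketch of the three orderings the position test builds.
def order_for(position: str, context_chunks: list[str]) -> list[str]:
    chunks = context_chunks.copy()
    answer_chunk = chunks.pop(0)  # known-answer chunk assumed at index 0
    if position == "first":
        return [answer_chunk] + chunks
    if position == "last":
        return chunks + [answer_chunk]
    mid = len(chunks) // 2
    return chunks[:mid] + [answer_chunk] + chunks[mid:]

docs = ["ANSWER", "a", "b", "c", "d"]
for pos in ("first", "middle", "last"):
    print(pos, "-> index", order_for(pos, docs).index("ANSWER"))
```

Expected indices: 0, 2 (middle of five), and 4.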
Step 2: Add a Cross-Encoder Reranker
Vector similarity finds related chunks. A cross-encoder reads both the query and each chunk together and scores their actual relevance. It's slower but far more accurate.
```python
# reranker.py
from dataclasses import dataclass

from sentence_transformers import CrossEncoder

@dataclass
class RankedChunk:
    text: str
    original_rank: int
    rerank_score: float

def rerank_chunks(
    query: str,
    chunks: list[str],
    model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_k: int = 5,
) -> list[RankedChunk]:
    """
    Score each chunk against the query with a cross-encoder.
    Returns the top_k chunks sorted by relevance score, descending.
    """
    encoder = CrossEncoder(model_name)
    # A cross-encoder scores each (query, chunk) pair directly
    pairs = [(query, chunk) for chunk in chunks]
    scores = encoder.predict(pairs)
    ranked = [
        RankedChunk(text=chunk, original_rank=i, rerank_score=float(score))
        for i, (chunk, score) in enumerate(zip(chunks, scores))
    ]
    # Sort by score, take top_k
    ranked.sort(key=lambda x: x.rerank_score, reverse=True)
    return ranked[:top_k]
```
Install the dependency:

```shell
pip install sentence-transformers
```

If it fails:

- CUDA out of memory: switch to `cross-encoder/ms-marco-TinyBERT-L-2-v2` for a smaller model
- Slow inference: run on batches of 8–16 chunks max; anything bigger needs a GPU
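For the batching case, `CrossEncoder.predict` accepts a `batch_size` argument directly. If you need to batch around some other scoring call, a minimal manual sketch (the lambda below is a stand-in scorer, not a real model):

```python
# Score query/chunk pairs in fixed-size slices so a CPU-bound scorer never
# sees an oversized input list at once.
def batched(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def score_in_batches(predict, pairs, batch_size=16):
    scores = []
    for batch in batched(pairs, batch_size):
        scores.extend(predict(batch))
    return scores

# Usage with a stand-in scorer (swap in encoder.predict for the real thing):
fake_predict = lambda batch: [len(q) + len(c) for q, c in batch]
pairs = [("query", f"chunk {i}") for i in range(40)]
print(len(score_in_batches(fake_predict, pairs)))  # 40 scores, computed 16 at a time
```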
Step 3: Reorder for Edge Placement
After reranking, don't just insert chunks highest-score-first. Use a "U-shape" ordering that puts the strongest evidence at the beginning and end of the context window, keeping weaker chunks in the middle where attention is lowest anyway.
```python
# rag_pipeline.py
def ushape_order(ranked_chunks: list[RankedChunk]) -> list[str]:
    """
    Place the highest-scored chunks at the edges of the context.
    The lowest-scored chunks land in the middle.

    Example with 5 chunks ranked [A, B, C, D, E] by score:
        Result order: [A, C, E, D, B]  <- A and B at the edges, E in the middle
    """
    texts = [c.text for c in ranked_chunks]
    result = []
    # Walk from lowest- to highest-scored, alternating front/back, so the
    # strongest chunks are placed last and end up at the edges
    toggle = True
    for chunk in reversed(texts):
        if toggle:
            result.insert(0, chunk)  # Push to front
        else:
            result.append(chunk)  # Push to back
        toggle = not toggle
    return result

def build_rag_prompt(query: str, chunks: list[str]) -> str:
    reranked = rerank_chunks(query, chunks, top_k=5)
    ordered = ushape_order(reranked)
    context = "\n\n---\n\n".join(ordered)
    return (
        f"Use only the provided context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        f"Answer:"
    )
```
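To sanity-check the placement without the dataclass plumbing, here is the same U-shape logic as a standalone sketch on plain strings ("A" strongest, "E" weakest):

```python
# Standalone U-shape ordering: iterate weakest-first, alternating front/back,
# so the strongest items are placed last and land at the edges.
def ushape(items: list[str]) -> list[str]:
    result = []
    toggle = True
    for item in reversed(items):  # weakest first
        if toggle:
            result.insert(0, item)
        else:
            result.append(item)
        toggle = not toggle
    return result

print(ushape(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```

"A" and "B" sit at the edges; "E", the weakest, lands in the middle where attention is lowest anyway.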
Step 4: Wire It Into Your Pipeline
```python
# full_pipeline.py
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.schema import HumanMessage

def rag_query(query: str, vectorstore: Chroma, llm: ChatOpenAI) -> str:
    # Retrieve more than you need — the reranker will trim
    raw_docs = vectorstore.similarity_search(query, k=10)
    raw_chunks = [doc.page_content for doc in raw_docs]
    # Rerank, reorder, and build the final prompt
    prompt = build_rag_prompt(query, raw_chunks)
    response = llm.invoke([HumanMessage(content=prompt)])
    return response.content

# Usage
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(embedding_function=embeddings, persist_directory="./chroma_db")
llm = ChatOpenAI(model="gpt-4o", temperature=0)

answer = rag_query("What were Q3 revenue figures?", vectorstore, llm)
print(answer)
```
Verification
Run your existing eval suite, or use this quick sanity check:
```shell
python position_test.py
```
You should see: Answer quality consistent across first/middle/last positions. The cross-encoder scores should surface chunks with actual answer content even if their cosine similarity score was mediocre.
A quick metric to track:
```python
# Before the fix: track how often the chunk containing the correct answer sat
# in a middle position. After the fix, the edge fraction below should rise,
# because the reranker + U-shape ordering now place the strongest chunks at
# the edges.
def score_position_hits(results: list[dict]) -> float:
    """Return the fraction of correct answers that came from edge-position chunks."""
    edge_hits = sum(1 for r in results if r["correct_position"] in ("first", "last"))
    return edge_hits / len(results)
```
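A toy run of that metric, with the function inlined and hypothetical result records (the `correct_position` field is an assumption about your eval output shape):

```python
# Fraction of correct answers that came from edge-position chunks.
def score_position_hits(results: list[dict]) -> float:
    edge_hits = sum(1 for r in results if r["correct_position"] in ("first", "last"))
    return edge_hits / len(results)

sample = [
    {"correct_position": "first"},
    {"correct_position": "middle"},
    {"correct_position": "last"},
    {"correct_position": "first"},
]
print(score_position_hits(sample))  # 0.75
```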
What You Learned
- LLM attention is U-shaped: strong at the start and end, weak in the middle of the context
- Vector similarity alone doesn't rank chunks by answer relevance — cross-encoders do
- Reordering chunks into a U-shape after reranking recovers significant accuracy without changing your retrieval infrastructure
- Retrieve more candidates than you need (10–20) and let the reranker filter to 4–6
Limitation: Cross-encoders add ~100–300ms latency per query depending on hardware. Cache reranker results for repeated queries. For real-time applications, run the encoder on GPU or use a hosted reranking API (Cohere Rerank, Jina Reranker).
When NOT to use this: If your context fits in under ~2,000 tokens, lost-in-the-middle effects are minimal. Focus optimization effort on retrieval quality instead.
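A rough way to gate the optimization on context length (heuristic assumption: about 4 characters per token for English text; use a real tokenizer like tiktoken for exact counts):

```python
# Crude length gate: skip reranking/reordering when the assembled context is
# short enough that lost-in-the-middle barely applies.
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # ~4 chars/token heuristic for English

def needs_reordering(context: str, threshold: int = 2000) -> bool:
    return approx_tokens(context) >= threshold

print(needs_reordering("short context"))  # False
print(needs_reordering("x" * 12000))      # True (~3000 tokens)
```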
Tested on Python 3.12, sentence-transformers 3.x, LangChain 0.3+, OpenAI GPT-4o