Problem: Your RAG Pipeline Ignores the Best Evidence
You're running a RAG setup — vector search looks great, the right chunks are retrieved — but the LLM's answers are still vague or miss critical details sitting right there in the context.
The culprit is "lost in the middle" syndrome: LLMs perform significantly worse on information placed in the middle of a long context window compared to content at the start or end.
You'll learn:
- Why LLMs systematically underweight middle context
- How to reorder retrieved chunks so critical content lands at the edges
- How to add a cross-encoder reranker to surface the highest-signal documents
Time: 20 min | Level: Intermediate
Why This Happens
Transformer attention patterns aren't uniform across the context window. Research from Stanford and elsewhere ("Lost in the Middle", Liu et al., 2023) shows that recall degrades sharply for tokens positioned in the middle of long prompts — even when the model technically "sees" them.
In a RAG pipeline, your vector store retrieves, say, 8 chunks ranked by cosine similarity. They're typically inserted into the prompt in rank order: chunk 1 first, chunk 8 last. The most relevant chunk (rank 1) lands at position 0. Chunks 2–7 drift into the middle. The LLM leans on chunks 1 and 8, the rest fade.
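That rank-order insertion is easy to see in a minimal sketch (`retrieved` stands in for whatever your vector store returns):

```python
# Naive prompt assembly: chunks joined in similarity-rank order, so the
# mid-ranked chunks (often still highly relevant) land mid-context.
retrieved = [f"chunk {i}" for i in range(1, 9)]  # rank 1 first, rank 8 last

context = "\n\n---\n\n".join(retrieved)
print(context.splitlines()[0])   # rank 1 sits at the very start
print(context.splitlines()[-1])  # rank 8 sits at the very end
```

Everything between those two lines is the middle of the window — exactly where attention is weakest.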
Common symptoms:
- Answers are superficially correct but miss specific figures, names, or dates you know are in the retrieved context
- Longer context windows make things worse, not better
- Retrieval eval scores look fine but generation eval scores lag
Solution
Step 1: Diagnose With a Position Test
Before changing anything, verify you're actually hitting this problem.
```python
# position_test.py
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage

def test_position_sensitivity(context_chunks: list[str], query: str) -> dict:
    """
    Place the known-answer chunk (assumed to be at index 0) in the first,
    middle, and last position. If accuracy drops in the middle, you have
    the syndrome.
    """
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    results = {}
    for position in ["first", "middle", "last"]:
        chunks = context_chunks.copy()
        # Move the known-answer chunk to the test position
        answer_chunk = chunks.pop(0)
        if position == "first":
            ordered = [answer_chunk] + chunks
        elif position == "last":
            ordered = chunks + [answer_chunk]
        else:
            mid = len(chunks) // 2
            ordered = chunks[:mid] + [answer_chunk] + chunks[mid:]
        context = "\n\n---\n\n".join(ordered)
        prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer concisely."
        response = llm.invoke([HumanMessage(content=prompt)])
        results[position] = response.content
    return results
```
Expected: if answers are noticeably worse in the middle position, you've confirmed the issue.
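If you want to eyeball the three orderings without burning API calls, here is the placement logic in isolation (same assumption: the known-answer chunk starts at index 0):

```python
# Offline sketch of the three orderings the position test builds.
def order_for(position: str, context_chunks: list[str]) -> list[str]:
    chunks = context_chunks.copy()
    answer_chunk = chunks.pop(0)  # known-answer chunk assumed at index 0
    if position == "first":
        return [answer_chunk] + chunks
    if position == "last":
        return chunks + [answer_chunk]
    mid = len(chunks) // 2
    return chunks[:mid] + [answer_chunk] + chunks[mid:]

docs = ["ANSWER", "a", "b", "c", "d"]
for pos in ("first", "middle", "last"):
    print(pos, "-> index", order_for(pos, docs).index("ANSWER"))
```

Expected indices: 0, 2 (middle of five), and 4.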
Step 2: Add a Cross-Encoder Reranker
Vector similarity finds related chunks. A cross-encoder reads both the query and each chunk together and scores their actual relevance. It's slower but far more accurate.
```python
# reranker.py
from dataclasses import dataclass

from sentence_transformers import CrossEncoder

@dataclass
class RankedChunk:
    text: str
    original_rank: int
    rerank_score: float

def rerank_chunks(
    query: str,
    chunks: list[str],
    model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_k: int = 5,
) -> list[RankedChunk]:
    """
    Score each chunk against the query with a cross-encoder.
    Returns the top_k chunks sorted by relevance score, descending.
    """
    encoder = CrossEncoder(model_name)
    # A cross-encoder scores each (query, chunk) pair directly
    pairs = [(query, chunk) for chunk in chunks]
    scores = encoder.predict(pairs)
    ranked = [
        RankedChunk(text=chunk, original_rank=i, rerank_score=float(score))
        for i, (chunk, score) in enumerate(zip(chunks, scores))
    ]
    # Sort by score, take top_k
    ranked.sort(key=lambda x: x.rerank_score, reverse=True)
    return ranked[:top_k]
```
Install the dependency:

```shell
pip install sentence-transformers
```

If it fails:

- CUDA out of memory: switch to `cross-encoder/ms-marco-TinyBERT-L-2-v2` for a smaller model
- Slow inference: run on batches of 8–16 chunks max; anything bigger needs a GPU
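For the batching case, `CrossEncoder.predict` accepts a `batch_size` argument directly. If you need to batch around some other scoring call, a minimal manual sketch (the lambda below is a stand-in scorer, not a real model):

```python
# Score query/chunk pairs in fixed-size slices so a CPU-bound scorer never
# sees an oversized input list at once.
def batched(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def score_in_batches(predict, pairs, batch_size=16):
    scores = []
    for batch in batched(pairs, batch_size):
        scores.extend(predict(batch))
    return scores

# Usage with a stand-in scorer (swap in encoder.predict for the real thing):
fake_predict = lambda batch: [len(q) + len(c) for q, c in batch]
pairs = [("query", f"chunk {i}") for i in range(40)]
print(len(score_in_batches(fake_predict, pairs)))  # 40 scores, computed 16 at a time
```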
Step 3: Reorder for Edge Placement
After reranking, don't just insert chunks highest-score-first. Use a "U-shape" ordering that puts the strongest evidence at the beginning and end of the context window, keeping weaker chunks in the middle where attention is lowest anyway.
```python
# rag_pipeline.py
def ushape_order(ranked_chunks: list[RankedChunk]) -> list[str]:
    """
    Place the highest-scored chunks at the edges of the context.
    The lowest-scored chunks land in the middle.

    Example with 5 chunks ranked [A, B, C, D, E] by score:
        Result order: [A, C, E, D, B]  <- A and B at the edges, E in the middle
    """
    texts = [c.text for c in ranked_chunks]
    result = []
    # Walk from lowest- to highest-scored, alternating front/back, so the
    # strongest chunks are placed last and end up at the edges
    toggle = True
    for chunk in reversed(texts):
        if toggle:
            result.insert(0, chunk)  # Push to front
        else:
            result.append(chunk)  # Push to back
        toggle = not toggle
    return result

def build_rag_prompt(query: str, chunks: list[str]) -> str:
    reranked = rerank_chunks(query, chunks, top_k=5)
    ordered = ushape_order(reranked)
    context = "\n\n---\n\n".join(ordered)
    return (
        f"Use only the provided context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        f"Answer:"
    )
```
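To sanity-check the placement without the dataclass plumbing, here is the same U-shape logic as a standalone sketch on plain strings ("A" strongest, "E" weakest):

```python
# Standalone U-shape ordering: iterate weakest-first, alternating front/back,
# so the strongest items are placed last and land at the edges.
def ushape(items: list[str]) -> list[str]:
    result = []
    toggle = True
    for item in reversed(items):  # weakest first
        if toggle:
            result.insert(0, item)
        else:
            result.append(item)
        toggle = not toggle
    return result

print(ushape(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```

"A" and "B" sit at the edges; "E", the weakest, lands in the middle where attention is lowest anyway.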
Step 4: Wire It Into Your Pipeline
```python
# full_pipeline.py
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.schema import HumanMessage

def rag_query(query: str, vectorstore: Chroma, llm: ChatOpenAI) -> str:
    # Retrieve more than you need — the reranker will trim
    raw_docs = vectorstore.similarity_search(query, k=10)
    raw_chunks = [doc.page_content for doc in raw_docs]
    # Rerank, reorder, and build the final prompt
    prompt = build_rag_prompt(query, raw_chunks)
    response = llm.invoke([HumanMessage(content=prompt)])
    return response.content

# Usage
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(embedding_function=embeddings, persist_directory="./chroma_db")
llm = ChatOpenAI(model="gpt-4o", temperature=0)

answer = rag_query("What were Q3 revenue figures?", vectorstore, llm)
print(answer)
```
Verification
Run your existing eval suite, or use this quick sanity check:
```shell
python position_test.py
```
You should see: Answer quality consistent across first/middle/last positions. The cross-encoder scores should surface chunks with actual answer content even if their cosine similarity score was mediocre.
A quick metric to track:
```python
# Before the fix: track how often the chunk containing the correct answer sat
# in a middle position. After the fix, the edge fraction below should rise,
# because the reranker + U-shape ordering now place the strongest chunks at
# the edges.
def score_position_hits(results: list[dict]) -> float:
    """Return the fraction of correct answers that came from edge-position chunks."""
    edge_hits = sum(1 for r in results if r["correct_position"] in ("first", "last"))
    return edge_hits / len(results)
```
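A toy run of that metric, with the function inlined and hypothetical result records (the `correct_position` field is an assumption about your eval output shape):

```python
# Fraction of correct answers that came from edge-position chunks.
def score_position_hits(results: list[dict]) -> float:
    edge_hits = sum(1 for r in results if r["correct_position"] in ("first", "last"))
    return edge_hits / len(results)

sample = [
    {"correct_position": "first"},
    {"correct_position": "middle"},
    {"correct_position": "last"},
    {"correct_position": "first"},
]
print(score_position_hits(sample))  # 0.75
```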
What You Learned
- LLM attention is U-shaped: strong at the start and end, weak in the middle of the context
- Vector similarity alone doesn't rank chunks by answer relevance — cross-encoders do
- Reordering chunks into a U-shape after reranking recovers significant accuracy without changing your retrieval infrastructure
- Retrieve more candidates than you need (10–20) and let the reranker filter to 4–6
Limitation: Cross-encoders add ~100–300ms latency per query depending on hardware. Cache reranker results for repeated queries. For real-time applications, run the encoder on GPU or use a hosted reranking API (Cohere Rerank, Jina Reranker).
When NOT to use this: If your context fits in under ~2,000 tokens, lost-in-the-middle effects are minimal. Focus optimization effort on retrieval quality instead.
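A rough way to gate the optimization on context length (heuristic assumption: about 4 characters per token for English text; use a real tokenizer like tiktoken for exact counts):

```python
# Crude length gate: skip reranking/reordering when the assembled context is
# short enough that lost-in-the-middle barely applies.
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # ~4 chars/token heuristic for English

def needs_reordering(context: str, threshold: int = 2000) -> bool:
    return approx_tokens(context) >= threshold

print(needs_reordering("short context"))  # False
print(needs_reordering("x" * 12000))      # True (~3000 tokens)
```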
Tested on Python 3.12, sentence-transformers 3.x, LangChain 0.3+, OpenAI GPT-4o