Hacking the Context Window: How to Feed 10M Tokens to Gemini

Learn how to structure, chunk, and send up to 10 million tokens to Gemini 1.5 Pro without hitting limits or losing coherence.

Problem: You Have More Data Than Your LLM Can Handle

Most developers hit the context wall fast. You want to analyze an entire codebase, a year of logs, or a 500-page PDF — but your LLM chokes, hallucinates, or just silently drops half the input.

Gemini 1.5 Pro has a 1 million token context window. Gemini 1.5 Pro 002 pushes that to 2 million. And with the right batching strategy, you can go further — feeding effectively 10M+ tokens of context across a structured session.

You'll learn:

  • How Gemini's context window actually works under the hood
  • How to chunk, compress, and sequence massive inputs without losing coherence
  • A working Python pattern to send multi-million token payloads via the API

Time: 25 min | Level: Advanced


Why This Happens

Gemini's context window is real — but naive usage kills it. Paste a 900k-token document and you'll hit one of three failure modes: the API returns a 400 error, the model silently truncates your input, or attention degrades so badly the output is useless.

The deeper issue is that "context window" describes maximum capacity, not optimal capacity. Empirically, retrieval quality degrades past ~60-70% utilization on most models. At 95% fill, you're essentially gambling.

Common symptoms:

  • 400 INVALID_ARGUMENT: Request payload size exceeds the limit
  • Model answers questions based on the first 20% of your document
  • Summaries that miss critical sections in the middle of long inputs
  • Inconsistent results when you re-run identical prompts
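The 70% guideline is easy to enforce with a small pre-flight check. A minimal sketch — the window size and threshold here are assumptions drawn from the discussion above, not SDK constants:

```python
WINDOW_TOKENS = 2_000_000   # Gemini 1.5 Pro 002 window (assumed)
SAFE_UTILIZATION = 0.70     # empirical reliability threshold

def within_budget(token_count: int, window: int = WINDOW_TOKENS) -> bool:
    """True if the payload stays under the safe utilization threshold."""
    return token_count / window <= SAFE_UTILIZATION

print(within_budget(1_200_000))  # 60% of the window: safe
print(within_budget(1_900_000))  # 95% of the window: gambling territory
```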

[Diagram: token utilization vs. response quality curve. Quality degrades non-linearly; stay under 70% for reliable results.]


Solution

Step 1: Measure What You're Actually Sending

Don't guess token counts. Use the official tokenizer before you send anything.

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-002")

def count_tokens(text: str) -> int:
    # Use the model's own tokenizer — never estimate with word count
    response = model.count_tokens(text)
    return response.total_tokens

with open("large_document.txt", "r") as f:
    content = f.read()

token_count = count_tokens(content)
print(f"Document: {token_count:,} tokens")
print(f"Window utilization: {token_count / 2_000_000:.1%}")

Expected: A token count and utilization percentage. If you're over 70%, move to Step 2.

If it fails:

  • Error: google.api_core.exceptions.InvalidArgument: Your text contains unsupported characters — strip non-UTF-8 bytes first
  • Slow count for huge files: Tokenize in 100k-character chunks and sum the results
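The chunked-count workaround can be sketched as below. The helper takes the counting function as a parameter so you can pass in the count_tokens helper from above; note that a token straddling a chunk boundary may be split and counted twice, so the sum can slightly over-count:

```python
def count_tokens_chunked(text: str, counter, chunk_chars: int = 100_000) -> int:
    """Sum token counts over fixed-size character chunks.

    counter: any str -> int counting function, e.g. count_tokens from above.
    """
    return sum(
        counter(text[i:i + chunk_chars])
        for i in range(0, len(text), chunk_chars)
    )
```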

Step 2: Compress Before You Send

Raw text is wasteful. Code, logs, and structured data can be compressed 40-60% before tokenization without losing meaning.

import re

def compress_for_context(text: str) -> str:
    # Collapse runs of blank lines: extra newlines are tokens eating your budget
    text = re.sub(r'\n{3,}', '\n\n', text)
    
    # Collapse repeated whitespace
    text = re.sub(r'[ \t]{2,}', ' ', text)
    
    # Strip comment-only lines in code (optional — keep if comments carry meaning)
    text = re.sub(r'^\s*#.*\n', '', text, flags=re.MULTILINE)
    
    return text.strip()

def compress_logs(log_text: str, keep_levels: tuple = ("ERROR", "WARN")) -> str:
    # For log files: keep only relevant severity levels
    # (an immutable default avoids Python's mutable-default-argument pitfall)
    lines = log_text.splitlines()
    filtered = [
        line for line in lines
        if any(level in line for level in keep_levels)
    ]
    return "\n".join(filtered)

Test the savings:

original = open("app.log").read()
compressed = compress_logs(original, keep_levels=["ERROR", "WARN", "CRITICAL"])

before = count_tokens(original)
after = count_tokens(compressed)
print(f"Reduced from {before:,} to {after:,} tokens ({(1 - after/before):.0%} savings)")

Expected: 30-60% token reduction on typical log files.


Step 3: Structure Your Context With a Map

When sending large inputs, always prepend a structural map. Gemini (like all transformers) weighs the beginning of context heavily — use that to your advantage.

def build_context_payload(sections: dict[str, str]) -> str:
    """
    sections: {"Section Name": "content...", ...}
    Returns a structured string optimized for large context retrieval.
    """
    
    map_header = "## CONTEXT MAP\n"
    map_header += "This context contains the following sections:\n"
    
    for i, name in enumerate(sections.keys(), 1):
        token_est = count_tokens(sections[name])
        map_header += f"{i}. {name} (~{token_est:,} tokens)\n"
    
    map_header += "\n---\n\n"
    
    # Build the full payload
    body = ""
    for name, content in sections.items():
        body += f"## SECTION: {name}\n\n"
        body += content
        body += "\n\n---\n\n"
    
    return map_header + body

# Example usage
sections = {
    "API Contracts": open("api_contracts.md").read(),
    "Database Schema": open("schema.sql").read(),
    "Application Logs (last 24h)": compress_logs(open("app.log").read()),
    "Recent Git Commits": open("git_log.txt").read(),
}

payload = build_context_payload(sections)
print(f"Total payload: {count_tokens(payload):,} tokens")

Why this works: The context map gives the model a table of contents it can reference when building attention patterns. Without it, retrieval from section 8 of a 12-section document is unreliable.

[Diagram: context map structure with labeled sections. The map header anchors the model's attention before the dense content begins.]


Step 4: Send With the Right Configuration

Large context requests need specific API parameters to behave reliably.

import google.generativeai as genai
from google.generativeai.types import GenerationConfig

def query_with_large_context(
    payload: str,
    question: str,
    model_name: str = "gemini-1.5-pro-002"
) -> str:
    
    model = genai.GenerativeModel(
        model_name=model_name,
        generation_config=GenerationConfig(
            temperature=0.1,      # Low temp = more deterministic on factual retrieval
            max_output_tokens=8192,
            candidate_count=1,    # Never request multiple candidates on large context
        )
    )
    
    # System instruction goes here, NOT in the context payload
    system = (
        "You are analyzing a structured context. "
        "Always cite the section name when referencing specific information. "
        "If information is not present in the context, say so explicitly."
    )
    
    prompt = f"{system}\n\n{payload}\n\n## QUESTION\n\n{question}"
    
    token_count = count_tokens(prompt)
    print(f"Sending {token_count:,} tokens ({token_count/2_000_000:.1%} of window)")
    
    # Stream the response — don't wait for full completion on large outputs
    response = model.generate_content(prompt, stream=True)
    
    result = ""
    for chunk in response:
        result += chunk.text
        print(chunk.text, end="", flush=True)
    
    return result

If it fails:

  • ResourceExhausted error: You've hit rate limits — add exponential backoff, or switch to the Batch API for non-interactive workloads
  • Empty response chunks: The model hit max_output_tokens — increase it or split your question into smaller sub-questions
  • Model ignores sections: Your payload exceeds reliable attention range — use Step 5
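For the rate-limit case, a generic backoff wrapper looks roughly like this. A sketch only: in real code, catch google.api_core.exceptions.ResourceExhausted rather than a bare Exception.

```python
import random
import time

def with_backoff(fn, max_retries: int = 5, base_delay: float = 2.0):
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:  # narrow to ResourceExhausted in production
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)

# Usage:
# answer = with_backoff(lambda: query_with_large_context(payload, question))
```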

Step 5: Go Beyond 2M With Sequential Context Chaining

For truly massive datasets (10M+ effective tokens), use a map-reduce pattern across multiple API calls.

from dataclasses import dataclass

@dataclass
class ChunkSummary:
    chunk_id: int
    source: str
    summary: str
    key_facts: list[str]

def summarize_chunk(chunk: str, chunk_id: int, source: str) -> ChunkSummary:
    """Compress one chunk into a structured summary."""
    model = genai.GenerativeModel("gemini-1.5-pro-002")
    
    prompt = f"""Analyze this content and respond in this exact format:

SUMMARY: [2-3 sentence summary]
KEY_FACTS:
- [fact 1]
- [fact 2]
- [fact 3]
(up to 10 facts)

CONTENT:
{chunk}"""
    
    response = model.generate_content(prompt)
    text = response.text
    
    # Parse the structured response
    summary_match = re.search(r'SUMMARY:\s*(.+?)(?=KEY_FACTS:)', text, re.DOTALL)
    facts_match = re.findall(r'- (.+)', text)
    
    return ChunkSummary(
        chunk_id=chunk_id,
        source=source,
        summary=summary_match.group(1).strip() if summary_match else "",
        key_facts=facts_match
    )

def process_massive_dataset(documents: dict[str, str], question: str) -> str:
    """
    Process more data than fits in one context window.
    Map: summarize each document independently
    Reduce: combine summaries and answer the question
    """
    
    summaries = []
    
    for source, content in documents.items():
        # Chunk documents that are themselves too large
        tokens = count_tokens(content)
        
        if tokens < 800_000:  # Fits in one call at safe utilization
            summary = summarize_chunk(content, len(summaries), source)
            summaries.append(summary)
        else:
            # Split into word-based chunks with overlap
            words = content.split()
            chunk_size = 300_000  # words: roughly 400k tokens at ~0.75 words/token
            overlap = 10_000      # words shared between consecutive chunks
            
            for i in range(0, len(words), chunk_size - overlap):
                chunk = " ".join(words[i:i + chunk_size])
                summary = summarize_chunk(chunk, len(summaries), f"{source} (part {i//chunk_size + 1})")
                summaries.append(summary)
    
    # Build the reduce payload from summaries
    reduce_context = "## DOCUMENT SUMMARIES\n\n"
    for s in summaries:
        reduce_context += f"### {s.source}\n"
        reduce_context += f"{s.summary}\n\n"
        reduce_context += "Key facts:\n"
        for fact in s.key_facts:
            reduce_context += f"- {fact}\n"
        reduce_context += "\n"
    
    reduce_context += f"\n## QUESTION\n\n{question}"
    
    model = genai.GenerativeModel("gemini-1.5-pro-002")
    response = model.generate_content(reduce_context)
    return response.text

[Diagram: map-reduce pattern across multiple Gemini API calls. The map phase summarizes chunks independently; the reduce phase synthesizes across all summaries.]


Verification

Run this end-to-end test with a known dataset:

# Test with documents where you know the answer
test_docs = {
    "Q3 Report": "...revenue was $4.2M in Q3, up 18% YoY...",
    "Q4 Report": "...revenue was $5.1M in Q4, up 21% YoY...",
}

result = process_massive_dataset(
    test_docs,
    "What was total revenue across Q3 and Q4, and which quarter grew faster?"
)

print(result)

You should see: The model correctly identifies $9.3M total and Q4 as the faster-growing quarter, citing both source documents.

If it answers incorrectly, your chunk summaries are losing critical facts — lower the compression ratio in summarize_chunk and increase max_output_tokens for the map phase.


What You Learned

  • Token count before you send — never guess, always measure with count_tokens()
  • Structural maps dramatically improve retrieval from large contexts
  • Stay under 70% window utilization for reliable attention
  • Map-reduce unlocks effectively unlimited context by chaining summarization calls

Limitations to know: Map-reduce loses inter-document relationships that span chunks. If your question requires synthesizing two facts that are in different chunks, the map phase may drop one of them. Mitigate with generous chunk overlap and explicit instructions to preserve cross-references.
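The overlap mitigation can be sketched as a word-level chunker where consecutive chunks share a tail, so any fact that straddles a boundary appears whole in at least one chunk. Sizes below are illustrative, not tuned:

```python
def overlapping_chunks(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split a word list into chunks where adjacent chunks share `overlap` words."""
    step = size - overlap
    return [words[i:i + size] for i in range(0, len(words), step)]

chunks = overlapping_chunks(list("abcdefghij"), size=4, overlap=2)
# The last two words of each chunk reappear at the start of the next,
# so a boundary-straddling fact survives intact somewhere.
```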

When NOT to use this: If your question only requires one section of a large document, use RAG (retrieval-augmented generation) instead — it's cheaper and faster. This pattern is for when you genuinely need the model to reason across the entire corpus.


Tested on Gemini 1.5 Pro 002, Python 3.12, google-generativeai SDK 0.8.x