Problem: You Have More Data Than Your LLM Can Handle
Most developers hit the context wall fast. You want to analyze an entire codebase, a year of logs, or a 500-page PDF — but your LLM chokes, hallucinates, or just silently drops half the input.
Gemini 1.5 Pro has a 1 million token context window. Gemini 1.5 Pro 002 pushes that to 2 million. And with the right batching strategy, you can go further — feeding effectively 10M+ tokens of context across a structured session.
You'll learn:
- How Gemini's context window actually works under the hood
- How to chunk, compress, and sequence massive inputs without losing coherence
- A working Python pattern to send multi-million token payloads via the API
Time: 25 min | Level: Advanced
Why This Happens
Gemini's context window is real — but naive usage kills it. Paste a 900k-token document and you'll hit one of three failure modes: the API returns a 400 error, the model silently truncates your input, or attention degrades so badly the output is useless.
The deeper issue is that "context window" describes maximum capacity, not optimal capacity. Empirically, retrieval quality degrades past ~60-70% utilization on most models. At 95% fill, you're essentially gambling.
Common symptoms:
- 400 INVALID_ARGUMENT: Request payload size exceeds the limit
- Model answers questions based on the first 20% of your document
- Summaries that miss critical sections in the middle of long inputs
- Inconsistent results when you re-run identical prompts
Quality degrades non-linearly — stay under 70% for reliable results
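That 70% rule of thumb is worth encoding directly. A minimal sketch of the budgeting check (the 0.70 target is the heuristic from above, not an API limit — names here are illustrative):

```python
def context_budget(window_tokens: int, target_utilization: float = 0.70) -> int:
    """Largest payload (in tokens) that stays inside the reliable-attention zone."""
    return int(window_tokens * target_utilization)

def fits_reliably(payload_tokens: int, window_tokens: int = 2_000_000) -> bool:
    # True when the payload stays under ~70% utilization of the window
    return payload_tokens <= context_budget(window_tokens)

# 1.3M tokens in a 2M window is 65% utilization: inside the safe zone
print(fits_reliably(1_300_000))  # True
print(fits_reliably(1_900_000))  # False: 95% fill, the gambling zone
```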
Solution
Step 1: Measure What You're Actually Sending
Don't guess token counts. Use the official tokenizer before you send anything.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-002")

def count_tokens(text: str) -> int:
    # Use the model's own tokenizer — never estimate with word count
    response = model.count_tokens(text)
    return response.total_tokens

with open("large_document.txt", "r") as f:
    content = f.read()

token_count = count_tokens(content)
print(f"Document: {token_count:,} tokens")
print(f"Window utilization: {token_count / 2_000_000:.1%}")
Expected: A token count and utilization percentage. If you're over 70%, move to Step 2.
If it fails:
- Error google.api_core.exceptions.InvalidArgument: your text contains unsupported characters — strip non-UTF-8 bytes first
- Slow count for huge files: tokenize in 100k-character chunks and sum the results
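The chunked-count workaround from that last bullet can be sketched like this. The splitting logic is tokenizer-agnostic, so `counter` stands in for the `count_tokens` helper above:

```python
from typing import Callable

def count_tokens_chunked(text: str, counter: Callable[[str], int],
                         chunk_chars: int = 100_000) -> int:
    """Tokenize in 100k-character chunks and sum the results.

    May slightly overcount when a token straddles a chunk boundary,
    which is fine here: for budgeting you want a conservative estimate.
    """
    total = 0
    for start in range(0, len(text), chunk_chars):
        total += counter(text[start:start + chunk_chars])
    return total

# Usage with the helper above: count_tokens_chunked(content, count_tokens)
```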
Step 2: Compress Before You Send
Raw text is wasteful. Code, logs, and structured data can be compressed 40-60% before tokenization without losing meaning.
import re

def compress_for_context(text: str) -> str:
    # Remove blank lines — they're invisible tokens eating your budget
    text = re.sub(r'\n{3,}', '\n\n', text)
    # Collapse repeated whitespace
    text = re.sub(r'[ \t]{2,}', ' ', text)
    # Strip comment-only lines in code (optional — keep if comments carry meaning)
    text = re.sub(r'^\s*#.*\n', '', text, flags=re.MULTILINE)
    return text.strip()

def compress_logs(log_text: str, keep_levels: tuple = ("ERROR", "WARN")) -> str:
    # For log files: keep only relevant severity levels
    # (tuple default avoids Python's mutable-default-argument pitfall)
    lines = log_text.splitlines()
    filtered = [
        line for line in lines
        if any(level in line for level in keep_levels)
    ]
    return "\n".join(filtered)
Test the savings:
original = open("app.log").read()
compressed = compress_logs(original, keep_levels=["ERROR", "WARN", "CRITICAL"])
before = count_tokens(original)
after = count_tokens(compressed)
print(f"Reduced from {before:,} to {after:,} tokens ({(1 - after/before):.0%} savings)")
Expected: 30-60% token reduction on typical log files.
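One more lever in the same spirit, not covered by the filters above: log files are full of identical consecutive lines, and those compress almost for free. A sketch that collapses runs of duplicates into a single line with a repeat count (the `(xN)` suffix is a convention chosen here, not part of any standard):

```python
from itertools import groupby

def dedupe_log_lines(log_text: str) -> str:
    """Collapse runs of identical consecutive lines into 'line (xN)'."""
    out = []
    for line, run in groupby(log_text.splitlines()):
        n = sum(1 for _ in run)
        out.append(f"{line} (x{n})" if n > 1 else line)
    return "\n".join(out)

log = "ERROR db timeout\nERROR db timeout\nERROR db timeout\nWARN slow query"
print(dedupe_log_lines(log))
# ERROR db timeout (x3)
# WARN slow query
```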
Step 3: Structure Your Context With a Map
When sending large inputs, always prepend a structural map. Gemini (like all transformers) weighs the beginning of context heavily — use that to your advantage.
def build_context_payload(sections: dict[str, str]) -> str:
    """
    sections: {"Section Name": "content...", ...}
    Returns a structured string optimized for large context retrieval.
    """
    map_header = "## CONTEXT MAP\n"
    map_header += "This context contains the following sections:\n"
    for i, name in enumerate(sections.keys(), 1):
        token_est = count_tokens(sections[name])
        map_header += f"{i}. {name} (~{token_est:,} tokens)\n"
    map_header += "\n---\n\n"

    # Build the full payload
    body = ""
    for name, content in sections.items():
        body += f"## SECTION: {name}\n\n"
        body += content
        body += "\n\n---\n\n"
    return map_header + body

# Example usage
sections = {
    "API Contracts": open("api_contracts.md").read(),
    "Database Schema": open("schema.sql").read(),
    "Application Logs (last 24h)": compress_logs(open("app.log").read()),
    "Recent Git Commits": open("git_log.txt").read(),
}
payload = build_context_payload(sections)
print(f"Total payload: {count_tokens(payload):,} tokens")
Why this works: The context map gives the model a table of contents it can reference when building attention patterns. Without it, retrieval from section 8 of a 12-section document is unreliable.
The map header anchors the model's attention before the dense content begins
Step 4: Send With the Right Configuration
Large context requests need specific API parameters to behave reliably.
import google.generativeai as genai
from google.generativeai.types import GenerationConfig

def query_with_large_context(
    payload: str,
    question: str,
    model_name: str = "gemini-1.5-pro-002"
) -> str:
    model = genai.GenerativeModel(
        model_name=model_name,
        # System instruction goes here, NOT in the context payload
        system_instruction=(
            "You are analyzing a structured context. "
            "Always cite the section name when referencing specific information. "
            "If information is not present in the context, say so explicitly."
        ),
        generation_config=GenerationConfig(
            temperature=0.1,  # Low temp = more deterministic on factual retrieval
            max_output_tokens=8192,
            candidate_count=1,  # Never request multiple candidates on large context
        )
    )
    prompt = f"{payload}\n\n## QUESTION\n\n{question}"
    token_count = count_tokens(prompt)
    print(f"Sending {token_count:,} tokens ({token_count/2_000_000:.1%} of window)")

    # Stream the response — don't wait for full completion on large outputs
    response = model.generate_content(prompt, stream=True)
    result = ""
    for chunk in response:
        result += chunk.text
        print(chunk.text, end="", flush=True)
    return result
If it fails:
- ResourceExhausted error: you've hit rate limits — add exponential backoff, or switch to the Batch API for non-interactive workloads
- Empty response chunks: the model hit max_output_tokens — increase it or split your question into smaller sub-questions
- Model ignores sections: your payload exceeds reliable attention range — use Step 5
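For the ResourceExhausted case, a generic exponential-backoff wrapper is enough for interactive workloads. A sketch under stated assumptions: the retry count and base delay are illustrative, and in real use you'd catch google.api_core.exceptions.ResourceExhausted rather than the broad Exception default:

```python
import random
import time

def with_backoff(fn, retries: int = 5, base_delay: float = 1.0,
                 retryable=(Exception,)):
    """Call fn(), retrying on retryable exceptions with exponential backoff.

    Delays grow as base_delay * 2**attempt, with jitter folded in to avoid
    synchronized retries. Re-raises after the final attempt fails.
    """
    for attempt in range(retries):
        try:
            return fn()
        except retryable:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt + random.random()))

# Usage sketch:
# response = with_backoff(lambda: model.generate_content(prompt))
```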
Step 5: Go Beyond 2M With Sequential Context Chaining
For truly massive datasets (10M+ effective tokens), use a map-reduce pattern across multiple API calls.
from dataclasses import dataclass

@dataclass
class ChunkSummary:
    chunk_id: int
    source: str
    summary: str
    key_facts: list[str]

def summarize_chunk(chunk: str, chunk_id: int, source: str) -> ChunkSummary:
    """Compress one chunk into a structured summary."""
    model = genai.GenerativeModel("gemini-1.5-pro-002")
    prompt = f"""Analyze this content and respond in this exact format:

SUMMARY: [2-3 sentence summary]
KEY_FACTS:
- [fact 1]
- [fact 2]
- [fact 3]
(up to 10 facts)

CONTENT:
{chunk}"""
    response = model.generate_content(prompt)
    text = response.text

    # Parse the structured response
    summary_match = re.search(r'SUMMARY:\s*(.+?)(?=KEY_FACTS:)', text, re.DOTALL)
    facts_match = re.findall(r'- (.+)', text)

    return ChunkSummary(
        chunk_id=chunk_id,
        source=source,
        summary=summary_match.group(1).strip() if summary_match else "",
        key_facts=facts_match
    )
def process_massive_dataset(documents: dict[str, str], question: str) -> str:
    """
    Process more data than fits in one context window.
    Map: summarize each document independently
    Reduce: combine summaries and answer the question
    """
    summaries = []
    for source, content in documents.items():
        # Chunk documents that are themselves too large
        tokens = count_tokens(content)
        if tokens < 800_000:  # Fits in one call at safe utilization
            summary = summarize_chunk(content, len(summaries), source)
            summaries.append(summary)
        else:
            # Split into word-based chunks with overlap
            # (~300k words is roughly 400k tokens at ~1.3 tokens per word)
            words = content.split()
            chunk_size = 300_000  # words per chunk
            overlap = 10_000      # words shared between adjacent chunks
            for i in range(0, len(words), chunk_size - overlap):
                chunk = " ".join(words[i:i + chunk_size])
                summary = summarize_chunk(chunk, len(summaries), f"{source} (part {i//chunk_size + 1})")
                summaries.append(summary)

    # Build the reduce payload from summaries
    reduce_context = "## DOCUMENT SUMMARIES\n\n"
    for s in summaries:
        reduce_context += f"### {s.source}\n"
        reduce_context += f"{s.summary}\n\n"
        reduce_context += "Key facts:\n"
        for fact in s.key_facts:
            reduce_context += f"- {fact}\n"
        reduce_context += "\n"
    reduce_context += f"\n## QUESTION\n\n{question}"

    model = genai.GenerativeModel("gemini-1.5-pro-002")
    response = model.generate_content(reduce_context)
    return response.text
Map phase summarizes chunks independently; reduce phase synthesizes across all summaries
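One practical note on the map phase: each summarize_chunk call is independent, so the calls can run concurrently instead of sequentially. A generic sketch with a thread pool (the worker count is illustrative; keep it low enough to respect your rate limits, ideally combined with backoff):

```python
from concurrent.futures import ThreadPoolExecutor

def map_in_parallel(items, fn, max_workers: int = 4) -> list:
    """Apply fn to each item concurrently, preserving input order.

    executor.map yields results in submission order, so chunk ids stay
    aligned with their sources even when later chunks finish first.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fn, items))

# Usage sketch inside the map phase, with (chunk, chunk_id, source) tuples:
# summaries = map_in_parallel(chunk_args, lambda args: summarize_chunk(*args))
```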
Verification
Run this end-to-end test with a known dataset:
# Test with documents where you know the answer
test_docs = {
    "Q3 Report": "...revenue was $4.2M in Q3, up 18% YoY...",
    "Q4 Report": "...revenue was $5.1M in Q4, up 21% YoY...",
}

result = process_massive_dataset(
    test_docs,
    "What was total revenue across Q3 and Q4, and which quarter grew faster?"
)
print(result)
You should see: The model correctly identifies $9.3M total and Q4 as the faster-growing quarter, citing both source documents.
If it answers incorrectly, your chunk summaries are losing critical facts — lower the compression ratio in summarize_chunk and increase max_output_tokens for the map phase.
What You Learned
- Token count before you send — never guess, always measure with count_tokens()
- Structural maps dramatically improve retrieval from large contexts
- Stay under 70% window utilization for reliable attention
- Map-reduce unlocks effectively unlimited context by chaining summarization calls
Limitations to know: Map-reduce loses inter-document relationships that span chunks. If your question requires synthesizing two facts that are in different chunks, the map phase may drop one of them. Mitigate with generous chunk overlap and explicit instructions to preserve cross-references.
When NOT to use this: If your question only requires one section of a large document, use RAG (retrieval-augmented generation) instead — it's cheaper and faster. This pattern is for when you genuinely need the model to reason across the entire corpus.
Tested on Gemini 1.5 Pro 002, Python 3.12, google-generativeai SDK 0.8.x