Problem: Your Dataset Is Too Big for Most AI Tools
You have a 500k-row CSV, a sprawling codebase, or thousands of customer support tickets — and every AI tool you've tried either truncates the data or forces you to chunk it manually. The insights you need require seeing everything at once.
Gemini 2.0 Pro's 2M-token context window changes that. You can load entire datasets, multi-file codebases, or hours of transcripts in a single prompt and ask questions across all of it.
You'll learn:
- How to estimate token count before sending large payloads
- How to structure prompts that get accurate answers from massive inputs
- How to extract structured outputs (JSON, tables) from unstructured bulk data
Time: 20 min | Level: Intermediate
Why This Matters
Most LLMs cap out at 128k–200k tokens. Gemini 2.0 Pro's 2M-token window (roughly 1.5 million words) means you can fit an entire novel, a year of logs, or a large database export into a single API call — no chunking, no vector databases, no retrieval pipelines for many use cases.
Common use cases:
- Querying large CSVs or JSON exports without a database
- Auditing an entire codebase for security issues or patterns
- Summarizing and cross-referencing thousands of documents
- Finding anomalies across full log files
The tradeoff: Large contexts cost more tokens and take longer to process. Use this approach when the data genuinely requires holistic analysis, not just keyword search.
Solution
Step 1: Estimate Your Token Count
Before sending, check that your payload fits within the limit. A rough rule: 1 token ≈ 4 characters of English text. For structured data (CSV, JSON), it's closer to 3 characters per token because of punctuation and delimiters.
# Quick local estimate — no API call needed
def estimate_tokens(text: str) -> int:
    # Conservative estimate: 3 chars per token for structured data
    return len(text) // 3

with open("dataset.csv", "r") as f:
    content = f.read()

estimated = estimate_tokens(content)
print(f"Estimated tokens: {estimated:,}")
print(f"Fits in 2M window: {estimated < 2_000_000}")
Expected: Token estimate printed. If you're over 1.8M, trim columns you don't need before sending.
If it fails:
- MemoryError on large files: Stream the file in chunks to estimate; don't load it all at once.
- Estimate seems wrong: Use the official google-generativeai SDK's count_tokens() for an exact count (it costs nothing — no generation happens).
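The streaming fix above can be sketched with nothing but the standard library; the chunk size is arbitrary, and the 3-chars-per-token ratio is the same heuristic used in the estimator:

```python
def estimate_tokens_streaming(path: str, chunk_size: int = 1 << 20) -> int:
    """Estimate tokens without loading the whole file into memory."""
    total_chars = 0
    with open(path, "r", errors="ignore") as f:
        # Read ~1 MiB at a time so memory stays flat regardless of file size
        while chunk := f.read(chunk_size):
            total_chars += len(chunk)
    return total_chars // 3  # same 3-chars-per-token heuristic as above
```

This trades a little precision for constant memory use; for the go/no-go decision of "does this fit in 2M tokens," the approximation is enough.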
Step 2: Load and Send the Data
Use the official Google Generative AI SDK. Install it first:
pip install google-generativeai
Then structure your call to keep the data and the instruction clearly separated:
import google.generativeai as genai
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-pro-exp") # Use latest Pro model
with open("sales_data.csv", "r") as f:
    csv_content = f.read()
# Keep instruction short and explicit — the model reads the data, not you
prompt = f"""You are a data analyst. Below is a CSV of sales records for 2025.
DATA:
{csv_content}
TASK:
1. Identify the top 5 products by total revenue.
2. Flag any months where revenue dropped more than 20% month-over-month.
3. Return results as JSON with keys: top_products, revenue_drops.
"""
response = model.generate_content(prompt)
print(response.text)
Expected: A JSON block with top_products and revenue_drops arrays.
If it fails:
- ResourceExhausted error: You've hit rate limits. Add time.sleep(30) and retry.
- Truncated response: Add generation_config=genai.GenerationConfig(max_output_tokens=8192) to the model init.
- Model hallucinating data: Add "Only use data from the CSV. Do not infer or invent values." to your prompt.
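The sleep-and-retry advice for ResourceExhausted can be wrapped in a small generic helper. This is a sketch: the exception tuple defaults to the broad `Exception` for illustration, and in real code you would pass the SDK's actual rate-limit exception instead:

```python
import time

def with_retries(fn, *, retries=3, base_delay=30, transient=(Exception,)):
    """Call fn(), retrying transient failures with a growing delay."""
    for attempt in range(retries):
        try:
            return fn()
        except transient:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * (attempt + 1))  # 30s, 60s, ...

# Usage sketch (exception class depends on your SDK version):
# response = with_retries(lambda: model.generate_content(prompt))
```

Passing a lambda keeps the helper independent of the SDK, so the same wrapper works for any flaky call in the pipeline.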
Step 3: Parse the Structured Output
Gemini will often wrap JSON in markdown fences. Strip them before parsing:
import json
import re

def extract_json(text: str) -> dict:
    # Strip markdown fences if present
    cleaned = re.sub(r"```(?:json)?\n?", "", text).strip()
    return json.loads(cleaned)

result = extract_json(response.text)
top_products = result["top_products"]
revenue_drops = result["revenue_drops"]

print(f"Top 5 products: {[p['name'] for p in top_products]}")
print(f"Revenue drops detected: {len(revenue_drops)} months")
Expected: Clean Python dicts ready for further processing or export.
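Stripping fences fails when the model also wraps the JSON in explanatory prose. A more defensive fallback (a sketch, using only the standard library) is to scan for the first decodable object with `json.JSONDecoder.raw_decode`:

```python
import json

def extract_json_anywhere(text: str) -> dict:
    """Parse the first JSON object embedded anywhere in free-form text."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)  # parse from this brace
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)  # try the next brace
    raise ValueError("No JSON object found in response")
```

Because raw_decode stops at the end of the first valid object, trailing prose like "Hope that helps!" is ignored automatically.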
Step 4: Handle Multi-File Inputs
For codebases or multi-document analysis, concatenate files with clear delimiters so the model knows where each file starts and ends:
import os

def load_codebase(directory: str, extensions: list[str]) -> str:
    parts = []
    for root, _, files in os.walk(directory):
        for file in files:
            if any(file.endswith(ext) for ext in extensions):
                path = os.path.join(root, file)
                with open(path, "r", errors="ignore") as f:
                    content = f.read()
                # Clear delimiter — the model uses this to track file context
                parts.append(f"=== FILE: {path} ===\n{content}\n")
    return "\n".join(parts)

codebase = load_codebase("./src", [".py", ".ts", ".go"])
estimated = estimate_tokens(codebase)
print(f"Codebase is ~{estimated:,} tokens")
prompt = f"""Audit the following codebase for SQL injection vulnerabilities.
For each issue found, return: file path, line number (approximate), and severity.
CODEBASE:
{codebase}
Return findings as a JSON array.
"""
response = model.generate_content(prompt)
Expected: A JSON array of vulnerability findings across all files.
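If the concatenated codebase overshoots the window, one option (a sketch; the 1.8M budget matches the safety margin from Step 1, and the function name is ours) is to drop the largest file blocks until the estimate fits:

```python
def trim_to_budget(parts: list[str], budget_tokens: int = 1_800_000) -> list[str]:
    """Drop the largest file blocks until the total token estimate fits."""
    kept = sorted(parts, key=len)  # smallest files first
    # Same 3-chars-per-token heuristic as Step 1
    while kept and sum(len(p) for p in kept) // 3 > budget_tokens:
        kept.pop()  # discard the current largest block
    return kept
```

Dropping the biggest files first preserves the most file boundaries, but you may prefer to filter by directory or extension instead if the large files are the ones you actually need audited.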
Verification
Run a sanity check against known values in your dataset:
# Add a verification question to your prompt
verification_prompt = f"""
{csv_content}
What is the total number of rows in this dataset (excluding the header)?
Also, what is the exact revenue for product 'Widget-A' in March 2025?
"""
check = model.generate_content(verification_prompt)
print(check.text)
# Compare against your ground truth
You should see: exact values matching what you'd get from pandas or a SQL query. If they don't match, your data may have formatting issues that confuse the model.
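Ground truth for both verification questions can come straight from the standard library; the column names `product` and `revenue` here are assumptions about your CSV schema, so adapt them to your headers:

```python
import csv
import io

def ground_truth(csv_text: str, product: str) -> tuple[int, float]:
    """Return (row count excluding header, total revenue for one product)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    revenue = sum(float(r["revenue"]) for r in rows if r["product"] == product)
    return len(rows), revenue

# Usage sketch:
# n_rows, widget_a_revenue = ground_truth(csv_content, "Widget-A")
```

A mismatch between these numbers and the model's answer usually points at malformed rows (stray delimiters, quoted commas) rather than a model failure, so inspect the offending rows before re-prompting.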
What You Learned
- Token estimation lets you sanity-check payloads before spending on API calls.
- Clear data/instruction separation in prompts improves accuracy on large inputs.
- Structured output (JSON) with explicit schema instructions makes responses parseable.
- Multi-file inputs need delimiters so the model tracks context across files.
Limitation: At full 2M context, latency can reach 30–90 seconds per request. For real-time applications, this isn't the right tool. Use this for batch analysis jobs.
When NOT to use this: If you're running the same query across thousands of separate documents, a vector database + RAG pipeline will be cheaper and faster. The 2M window shines when you need cross-document reasoning — finding patterns, contradictions, or relationships that span the full dataset.
Tested on Gemini 2.0 Pro (Experimental), Python 3.12, google-generativeai 0.8.x, Ubuntu 24.04 & macOS Sequoia