Problem: Ollama Cuts Off Long Prompts and Loses Context
Ollama context length defaults to 2048 tokens on every model — even when the underlying weights support 128k. If you paste a long document, a large codebase, or a multi-turn chat history and the model starts forgetting earlier content or silently truncating your input, this is why.
You'll learn:
- Why the 2048-token default exists and when it hurts you
- How to raise num_ctx per-request, per-session, and permanently via a Modelfile
- How to calculate safe context sizes for your available VRAM or RAM
- How to verify the active context window at runtime
Time: 20 min | Difficulty: Intermediate
Why This Happens
Ollama ships with num_ctx: 2048 as a conservative default. The goal is to boot instantly on any hardware — including laptops with 8 GB of unified memory — without requiring the user to know their VRAM budget upfront.
The problem is that many modern models support far larger windows:
| Model | Trained context | Ollama default | Gap |
|---|---|---|---|
| Llama 3.3 70B | 128k tokens | 2048 tokens | 64× |
| Qwen2.5 72B | 128k tokens | 2048 tokens | 64× |
| Mistral 7B v0.3 | 32k tokens | 2048 tokens | 16× |
| Gemma 3 27B | 128k tokens | 2048 tokens | 64× |
| DeepSeek-R1 8B | 64k tokens | 2048 tokens | 32× |
Ollama silently clips any input that exceeds num_ctx. You won't see an error — the model simply never receives the truncated tokens. For RAG pipelines, long system prompts, or multi-document analysis, this produces wrong answers that are hard to debug.
Symptoms:
- Model answers questions about the end of a document but not the beginning (or vice versa)
- Chat history "forgets" context from more than a few exchanges back
- ollama ps shows a context window far smaller than the model card advertises
- Long code files get silently truncated and the model produces incomplete completions
num_ctx sets the KV-cache allocation in llama.cpp. Tokens beyond this limit are dropped before inference.
How Context Length Affects VRAM
Before raising num_ctx, know the cost. The KV cache grows linearly with context length and is the main reason large contexts require more memory.
A rough formula for KV cache size in bytes (for models with grouped-query attention, num_heads is the KV-head count, which is smaller than the query-head count):
kv_cache_bytes = num_ctx × num_layers × 2 × head_dim × num_heads × bytes_per_element
In practice, use these estimates per 1k tokens of context:
| Model size | KV cache per 1k tokens (fp16) |
|---|---|
| 7B (32 layers) | ~200 MB |
| 13B (40 layers) | ~320 MB |
| 70B (80 layers) | ~1.3 GB |
So raising a 7B model from 2k to 32k context costs roughly 6 GB extra VRAM. Plan accordingly before setting num_ctx: 131072 on a 70B model.
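To budget before you raise the limit, the formula above drops straight into a small helper. This is a rough sketch: the layer, head, and dimension numbers in the example are typical 7B-class values used for illustration, and num_kv_heads should be the KV-head count, since grouped-query models keep fewer KV heads than query heads (which is why practical per-model figures vary).

```python
def kv_cache_bytes(num_ctx: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_element: int = 2) -> int:
    """Approximate KV cache size: one K and one V vector per layer per token."""
    return num_ctx * num_layers * 2 * num_kv_heads * head_dim * bytes_per_element

# Example: a 7B-class model with 32 layers, 8 KV heads, head_dim 128, fp16 cache
gib = kv_cache_bytes(32768, 32, 8, 128) / 1024**3
print(f"KV cache at 32k context: {gib:.1f} GiB")  # 4.0 GiB
```

The default bytes_per_element of 2 assumes an fp16 cache, matching the table above; a quantized KV cache would shrink this proportionally.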
Solution
Step 1: Check the Current Active Context Window
Before changing anything, confirm what Ollama is actually using.
# Pull model status including active context size
ollama ps
Expected output:
NAME ID SIZE PROCESSOR UNTIL
llama3.3:70b abc123def456 47 GB 100% GPU 4 minutes from now
The SIZE field reflects loaded weights but not the KV cache. To see the active num_ctx, query the running model via the API:
curl http://localhost:11434/api/show -d '{"name": "llama3.3:70b"}' | jq '.parameters'
Look for num_ctx in the output. If it shows 2048 (or nothing — meaning the default is in effect), proceed to the next steps.
Step 2: Set num_ctx Per-Request via the API
The fastest way to test a larger context — no model changes needed.
# Single generation with 32k context
curl http://localhost:11434/api/generate -d '{
"model": "llama3.3:70b",
"prompt": "Summarize the following document: ...",
"options": {
"num_ctx": 32768
}
}'
For the chat endpoint (used by most OpenAI-compatible clients):
curl http://localhost:11434/api/chat -d '{
"model": "llama3.3:70b",
"messages": [{"role": "user", "content": "Your long prompt here..."}],
"options": {
"num_ctx": 32768
}
}'
If it fails:
- error: model requires more memory than available → reduce num_ctx or offload fewer layers. See Step 5.
- exit status 1 in Ollama logs → your system ran out of RAM/VRAM mid-generation. Lower context or use a smaller quantization.
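Rather than hardcoding one value, a client can size num_ctx from the prompt itself before each request. A minimal standard-library sketch: pick_num_ctx and its 4-characters-per-token heuristic are my own rough conventions, not part of the Ollama API.

```python
import json
import urllib.request

def pick_num_ctx(prompt: str, reply_budget: int = 1024,
                 floor: int = 2048, ceiling: int = 131072) -> int:
    """Round the estimated token count up to the next power-of-two context size."""
    est_tokens = len(prompt) // 4 + reply_budget  # ~4 chars per token for English text
    n = floor
    while n < est_tokens and n < ceiling:
        n *= 2
    return n

def generate(prompt: str, model: str = "llama3.3:70b") -> str:
    """Call /api/generate with a context window sized to fit the prompt."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": pick_num_ctx(prompt)},
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Keep the ceiling at or below the model's trained context, and avoid oscillating values: Ollama typically reloads the model when num_ctx changes between requests.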
Step 3: Set num_ctx Permanently with a Modelfile
Per-request options require every client to pass num_ctx. A Modelfile bakes it in as the default for that model variant.
# Create a Modelfile for a high-context variant
cat > Modelfile << 'EOF'
FROM llama3.3:70b
# Set context window to 32k tokens
# Safe for 80GB VRAM (A100/H100) or 64GB unified memory (M2 Ultra)
PARAMETER num_ctx 32768
# Optional: raise repeat penalty window to match context
PARAMETER repeat_last_n 32768
EOF
Build and tag the new variant:
ollama create llama3.3-32k -f Modelfile
Verify it loaded correctly:
ollama show llama3.3-32k --parameters
Expected output:
num_ctx 32768
repeat_last_n 32768
Now any client that calls llama3.3-32k gets 32k context without any extra options.
Step 4: Set num_ctx for the OpenAI-Compatible Endpoint
If you're using Ollama's OpenAI-compatible API at /v1/chat/completions (common with LangChain, LlamaIndex, and Open WebUI), pass context size through the num_ctx option in the request body — it maps directly:
import openai
client = openai.OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama", # value is ignored but required by the SDK
)
response = client.chat.completions.create(
model="llama3.3:70b",
messages=[{"role": "user", "content": long_document_prompt}],
extra_body={
"options": {
"num_ctx": 32768 # passes through to llama.cpp n_ctx
}
}
)
For LangChain's ChatOllama:
from langchain_ollama import ChatOllama
llm = ChatOllama(
model="llama3.3:70b",
num_ctx=32768, # direct parameter — no extra_body needed
)
Step 5: Set VRAM Offloading to Support Larger Contexts
If you're on a machine with both GPU and CPU RAM (e.g., a workstation with 24 GB VRAM and 64 GB system RAM), Ollama can keep some transformer layers — and their share of the KV cache — in system RAM to fit larger contexts. Cap the number of GPU-resident layers with num_gpu and let the remaining layers and cache spill to RAM:
# Keep 60 transformer layers on GPU, let KV cache spill to RAM
OLLAMA_NUM_GPU=60 ollama run llama3.3:70b
Or set it system-wide in the Ollama service environment:
# /etc/systemd/system/ollama.service.d/override.conf (Linux systemd)
[Service]
Environment="OLLAMA_NUM_GPU=60"
# 20 GB — reserve 4 GB headroom
Environment="OLLAMA_MAX_VRAM=20000000000"
Reload and restart:
sudo systemctl daemon-reload && sudo systemctl restart ollama
If it fails:
- Generation speed drops to < 1 token/sec → too many layers on CPU. Raise OLLAMA_NUM_GPU until GPU utilization returns to near 100%.
- OOM on CPU RAM → your num_ctx is still too large for combined VRAM + RAM. Use a smaller quantization (Q4_K_M instead of Q8_0) to free headroom.
Step 6: Choose the Right Context Size for Your Hardware
Use this reference table as a starting point. Values assume Q4_K_M quantization and fp16 KV cache.
| Hardware | Model | Safe max num_ctx |
|---|---|---|
| 8 GB VRAM (RTX 3070 / M1 Pro) | 7B | 16,384 |
| 16 GB VRAM (RTX 4080 / M2 Pro) | 7B | 65,536 |
| 16 GB VRAM | 13B | 16,384 |
| 24 GB VRAM (RTX 4090 / A10G) | 7B | 131,072 |
| 24 GB VRAM | 13B | 32,768 |
| 24 GB VRAM | 70B | 4,096 |
| 48 GB VRAM (2× RTX 4090 / A6000) | 70B | 16,384 |
| 80 GB VRAM (A100 / H100) | 70B | 65,536 |
| 64 GB unified (M2 Ultra / M3 Max) | 70B | 32,768 |
These are practical ceilings — stay 10–20% below to avoid OOM during long generations.
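For scripting, the table can be encoded directly with the headroom already applied. A sketch: the dictionary keys and the 15% default margin are my own choices, not Ollama settings.

```python
# (memory_gb, model_size) -> practical ceiling from the table above.
# The 64 GB entry is the unified-memory row (Apple Silicon); the rest are discrete VRAM.
CEILING = {
    (8, "7b"): 16_384,   (16, "7b"): 65_536,  (16, "13b"): 16_384,
    (24, "7b"): 131_072, (24, "13b"): 32_768, (24, "70b"): 4_096,
    (48, "70b"): 16_384, (80, "70b"): 65_536, (64, "70b"): 32_768,
}

def safe_num_ctx(memory_gb: int, model_size: str, headroom: float = 0.15) -> int:
    """Apply headroom below the ceiling and round down to a 1024-token multiple."""
    ceiling = CEILING[(memory_gb, model_size)]
    return int(ceiling * (1 - headroom)) // 1024 * 1024

print(safe_num_ctx(24, "13b"))  # 27648
```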
Verification
Confirm the context window is active and tokens are not being truncated:
# Generate a prompt that requires reading from > 2048 tokens back
# and check the response references early content correctly
python3 - << 'EOF'
import requests, json
# Build a prompt with a unique marker near the start and a question at the end
marker = "UNIQUE_MARKER_XJ7Q"
filler = "This is filler text to pad the context. " * 400  # roughly 3,600 tokens, well past the 2048 default
question = "What was the unique marker mentioned at the very beginning of this text?"
prompt = f"Remember this: {marker}. {filler} {question}"
resp = requests.post("http://localhost:11434/api/generate", json={
"model": "llama3.3:70b",
"prompt": prompt,
"stream": False,
"options": {"num_ctx": 8192}
})
print(resp.json()["response"])
EOF
You should see: The model correctly returns UNIQUE_MARKER_XJ7Q in its answer. If it says it doesn't know or invents a different marker, num_ctx is still too low or the model isn't loaded with the new setting.
Also verify via the token-count fields in the API response:
curl -s http://localhost:11434/api/generate -d '{
"model": "llama3.3:70b",
"prompt": "Hello",
"stream": false,
"options": {"num_ctx": 32768}
}' | jq '{prompt_eval_count, eval_count}'
prompt_eval_count shows the number of tokens actually processed. If it's capped at 2048 regardless of input length, the new num_ctx hasn't taken effect — restart Ollama and re-run.
What You Learned
- Ollama defaults to num_ctx: 2048 on every model regardless of what the weights support — this is intentional hardware safety, not a bug
- Raising num_ctx increases KV cache allocation linearly; a 7B model at 32k context costs ~6 GB of VRAM versus ~400 MB at 2k
- Modelfiles are the right tool for permanent per-model defaults; options.num_ctx in the API body is right for per-request overrides
- OLLAMA_NUM_GPU lets you offload KV cache to CPU RAM when VRAM is the bottleneck
- Always verify with a marker test — silent truncation is the most common failure mode and produces no error messages
Tested on Ollama 0.6.x, Llama 3.3 70B Q4_K_M, Qwen2.5 14B Q8_0 — Ubuntu 24.04 (CUDA 12.4) and macOS Sequoia (Metal / M2 Ultra)
FAQ
Q: Does raising num_ctx slow down generation speed?
A: Prefill (reading your input) scales with context length, so long prompts take longer to process. Token generation speed (tokens/sec) is mostly unaffected unless the KV cache overflows to CPU RAM, which can cut throughput by 10–50×.
Q: What is the maximum num_ctx Ollama supports?
A: Ollama passes num_ctx directly to llama.cpp as n_ctx. The practical ceiling is the model's trained context length — 128k for Llama 3.3, 32k for Mistral 7B v0.3. Going above the trained context produces degraded quality regardless of hardware.
Q: Can I set num_ctx globally for all models at once?
A: Not via OLLAMA_NUM_CTX — that variable does not exist. Recent Ollama builds do support OLLAMA_CONTEXT_LENGTH as a server-wide default (e.g., OLLAMA_CONTEXT_LENGTH=8192 ollama serve); if your version predates it, set num_ctx per Modelfile for each model you use regularly, or always pass options.num_ctx in your API client.
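The per-Modelfile approach is easy to script. A sketch that writes one Modelfile per model and prints the ollama create command to run; the model list and the -32k suffix are examples of mine, not an Ollama convention.

```python
def modelfile_text(base: str, num_ctx: int) -> str:
    """Render a Modelfile that pins num_ctx and a matching repeat window."""
    return (
        f"FROM {base}\n"
        f"PARAMETER num_ctx {num_ctx}\n"
        f"PARAMETER repeat_last_n {num_ctx}\n"
    )

# Example list; substitute the models you actually use
for base in ["llama3.3:70b", "qwen2.5:14b"]:
    fname = "Modelfile." + base.replace(":", "-")
    with open(fname, "w") as f:
        f.write(modelfile_text(base, 32768))
    variant = base.split(":")[0] + "-32k"
    print(f"ollama create {variant} -f {fname}")
```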
Q: Does num_ctx affect the context window shown in Open WebUI?
A: Yes. Open WebUI reads the active model parameters at session start. If you created a Modelfile variant with a higher num_ctx, select that model name in the UI and it will use the new window size automatically.
Q: Why is my generation slow even with enough VRAM for the larger context?
A: Flash Attention must be enabled for large context windows to be efficient. Ollama enables it automatically when running on CUDA. On Metal (Apple Silicon), attention performance at very long contexts (> 64k) can still be slower than CUDA — this is a known limitation as of early 2026.