Extend Ollama Context Length Beyond Default Limits 2026

Extend Ollama context length beyond the 2048-token default using num_ctx, Modelfiles, and API parameters. Tested on Llama 3.3, Qwen2.5, and Mistral with CUDA and Metal.

Problem: Ollama Cuts Off Long Prompts and Loses Context

Ollama context length defaults to 2048 tokens on every model — even when the underlying weights support 128k. If you paste a long document, a large codebase, or a multi-turn chat history and the model starts forgetting earlier content or silently truncating your input, this is why.

You'll learn:

  • Why the 2048-token default exists and when it hurts you
  • How to raise num_ctx per-request, per-session, and permanently via a Modelfile
  • How to calculate safe context sizes for your available VRAM or RAM
  • How to verify the active context window at runtime

Time: 20 min | Difficulty: Intermediate


Why This Happens

Ollama ships with num_ctx: 2048 as a conservative default. The goal is to boot instantly on any hardware — including laptops with 8 GB of unified memory — without requiring the user to know their VRAM budget upfront.

The problem is that many modern models support far larger windows:

| Model           | Trained context | Ollama default | Gap |
|-----------------|-----------------|----------------|-----|
| Llama 3.3 70B   | 128k tokens     | 2048 tokens    | 63× |
| Qwen2.5 72B     | 128k tokens     | 2048 tokens    | 63× |
| Mistral 7B v0.3 | 32k tokens      | 2048 tokens    | 16× |
| Gemma 3 27B     | 128k tokens     | 2048 tokens    | 63× |
| DeepSeek-R1 8B  | 64k tokens      | 2048 tokens    | 32× |

Ollama silently clips any input that exceeds num_ctx. You won't see an error — the model simply never receives the truncated tokens. For RAG pipelines, long system prompts, or multi-document analysis, this produces wrong answers that are hard to debug.

Symptoms:

  • Model answers questions about the end of a document but not the beginning (or vice versa)
  • Chat history "forgets" context from more than a few exchanges back
  • ollama ps shows a context window far smaller than the model card advertises
  • Long code files get silently truncated and the model produces incomplete completions

How num_ctx maps through llama.cpp to the KV cache: num_ctx sets the KV-cache allocation in llama.cpp, and tokens beyond this limit are dropped before inference.


How Context Length Affects VRAM

Before raising num_ctx, know the cost. The KV cache grows linearly with context length and is the main reason large contexts require more memory.

A rough formula for KV-cache size in bytes (the factor of 2 covers the separate K and V tensors):

kv_cache_bytes = num_ctx × num_layers × 2 × num_heads × head_dim × bytes_per_element

For models with grouped-query attention, use the KV-head count rather than the full attention-head count.

In practice, use these estimates per 1k tokens of context:

| Model size      | KV cache per 1k tokens (fp16) |
|-----------------|-------------------------------|
| 7B (32 layers)  | ~200 MB                       |
| 13B (40 layers) | ~320 MB                       |
| 70B (80 layers) | ~1.3 GB                       |

So raising a 7B model from 2k to 32k context costs roughly 6 GB extra VRAM. Plan accordingly before setting num_ctx: 131072 on a 70B model.
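The formula above is easy to turn into a quick estimator. A minimal sketch in plain Python; the model dimensions are illustrative values for a 7B-class model with full multi-head attention, and `kv_cache_bytes` is a hypothetical helper name:

```python
def kv_cache_bytes(num_ctx, num_layers, num_heads, head_dim, bytes_per_element=2):
    """Estimate KV-cache size: the factor of 2 covers the K and V tensors."""
    return num_ctx * num_layers * 2 * num_heads * head_dim * bytes_per_element

# 7B-class dimensions: 32 layers, 32 heads, head_dim 128, fp16 (2 bytes)
per_1k = kv_cache_bytes(1024, 32, 32, 128)
print(f"{per_1k / 2**20:.0f} MiB per 1k tokens")  # → 512 MiB per 1k tokens
```

The full-MHA result is higher than the ~200 MB figure in the table because modern 7B models use grouped-query attention; substitute the (much smaller) KV-head count for `num_heads` to estimate those.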


Solution

Step 1: Check the Current Active Context Window

Before changing anything, confirm what Ollama is actually using.

# Pull model status including active context size
ollama ps

Expected output:

NAME                    ID              SIZE      PROCESSOR    UNTIL
llama3.3:70b            abc123def456    47 GB     100% GPU     4 minutes from now

The SIZE field reflects loaded weights but not the KV cache. To see the active num_ctx, query the running model via the API:

curl http://localhost:11434/api/show -d '{"name": "llama3.3:70b"}' | jq '.parameters'

Look for num_ctx in the output. If it shows 2048 (or nothing — meaning the default is in effect), proceed to the next steps.


Step 2: Set num_ctx Per-Request via the API

The fastest way to test a larger context — no model changes needed.

# Single generation with 32k context
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3:70b",
  "prompt": "Summarize the following document: ...",
  "options": {
    "num_ctx": 32768
  }
}'

For the chat endpoint (used by most OpenAI-compatible clients):

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.3:70b",
  "messages": [{"role": "user", "content": "Your long prompt here..."}],
  "options": {
    "num_ctx": 32768
  }
}'

If it fails:

  • error: model requires more memory than available → reduce num_ctx or offload fewer layers. See Step 5.
  • exit status 1 in Ollama logs → your system ran out of RAM/VRAM mid-generation. Lower context or use a smaller quantization.
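If you script against the API, one pragmatic pattern is to retry with a halved num_ctx whenever the server reports a memory error. A sketch under assumptions: the endpoint and request shape follow the curl examples above, the memory-error text contains the word "memory" as in the error shown, and `fallback_sizes` / `generate_with_fallback` are hypothetical helper names:

```python
import json
import urllib.error
import urllib.request

def fallback_sizes(num_ctx, floor=2048):
    """Halving sequence to try, e.g. 32768 -> [32768, 16384, 8192, 4096, 2048]."""
    sizes = []
    while num_ctx >= floor:
        sizes.append(num_ctx)
        num_ctx //= 2
    return sizes

def generate_with_fallback(model, prompt, num_ctx=32768,
                           url="http://localhost:11434/api/generate"):
    """Try each context size until one fits in available memory."""
    for ctx in fallback_sizes(num_ctx):
        body = json.dumps({
            "model": model, "prompt": prompt, "stream": False,
            "options": {"num_ctx": ctx},
        }).encode()
        req = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"})
        try:
            with urllib.request.urlopen(req) as resp:
                return ctx, json.load(resp)["response"]
        except urllib.error.HTTPError as e:
            if "memory" not in e.read().decode(errors="replace").lower():
                raise  # unrelated failure: surface it
    raise RuntimeError("prompt does not fit even at the smallest context")
```

This trades answer quality (a smaller window truncates more input) for availability, so log the context size actually used.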

Step 3: Set num_ctx Permanently with a Modelfile

Per-request options require every client to pass num_ctx. A Modelfile bakes it in as the default for that model variant.

# Create a Modelfile for a high-context variant
cat > Modelfile << 'EOF'
FROM llama3.3:70b

# Set context window to 32k tokens
# Safe for 80GB VRAM (A100/H100) or 64GB unified memory (M2 Ultra)
PARAMETER num_ctx 32768

# Optional: raise repeat penalty window to match context
PARAMETER repeat_last_n 32768
EOF

Build and tag the new variant:

ollama create llama3.3-32k -f Modelfile

Verify it loaded correctly:

ollama show llama3.3-32k --parameters

Expected output:

num_ctx                        32768
repeat_last_n                  32768

Now any client that calls llama3.3-32k gets 32k context without any extra options.
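You can also verify the variant from code. A sketch assuming /api/show returns `parameters` as a newline-separated text block in the same shape as the ollama show output above; `get_num_ctx` and `parse_num_ctx` are illustrative helper names:

```python
import json
import urllib.request

def parse_num_ctx(parameters_text):
    """Parse 'name  value' lines; fall back to the 2048 default if absent."""
    for line in parameters_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "num_ctx":
            return int(parts[1])
    return 2048

def get_num_ctx(model, host="http://localhost:11434"):
    """Read num_ctx from a model's parameters via /api/show."""
    req = urllib.request.Request(
        f"{host}/api/show",
        data=json.dumps({"name": model}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        info = json.load(resp)
    return parse_num_ctx(info.get("parameters", ""))

# e.g. get_num_ctx("llama3.3-32k") should report 32768 for the variant built above
```

Calling this at client startup and failing fast on an unexpected value catches the "forgot to select the -32k variant" mistake early.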


Step 4: Set num_ctx for the OpenAI-Compatible Endpoint

If you're using Ollama's OpenAI-compatible API at /v1/chat/completions (common with LangChain, LlamaIndex, and Open WebUI), pass context size through the num_ctx option in the request body — it maps directly:

import openai

client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # value is ignored but required by the SDK
)

response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": long_document_prompt}],
    extra_body={
        "options": {
            "num_ctx": 32768  # passes through to llama.cpp n_ctx
        }
    }
)

For LangChain's ChatOllama:

from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.3:70b",
    num_ctx=32768,  # direct parameter — no extra_body needed
)

Step 5: Set VRAM Offloading to Support Larger Contexts

If you're on a machine with both GPU and CPU RAM (e.g., a workstation with 24 GB VRAM and 64 GB system RAM), Ollama can offload KV cache layers to CPU to fit larger contexts. Set num_gpu to reserve GPU layers for weights and let the KV cache overflow to RAM:

# Keep 60 transformer layers on GPU, let KV cache spill to RAM
OLLAMA_NUM_GPU=60 ollama run llama3.3:70b

Or set it system-wide in the Ollama service environment:

# /etc/systemd/system/ollama.service.d/override.conf  (Linux systemd)
[Service]
Environment="OLLAMA_NUM_GPU=60"
Environment="OLLAMA_MAX_VRAM=20000000000"  # 20 GB — reserve 4 GB headroom

Reload and restart:

sudo systemctl daemon-reload && sudo systemctl restart ollama

If it fails:

  • Generation speed drops to < 1 token/sec → too many layers on CPU. Raise OLLAMA_NUM_GPU until GPU utilization returns to near 100%.
  • OOM on CPU RAM → your num_ctx is still too large for combined VRAM + RAM. Use a smaller quantization (Q4_K_M instead of Q8_0) to free headroom.

Step 6: Choose the Right Context Size for Your Hardware

Use this reference table as a starting point. Values assume Q4_K_M quantization and fp16 KV cache.

| Hardware                          | Model | Safe max num_ctx |
|-----------------------------------|-------|------------------|
| 8 GB VRAM (RTX 3070 / M1 Pro)     | 7B    | 16,384           |
| 16 GB VRAM (RTX 4080 / M2 Pro)    | 7B    | 65,536           |
| 16 GB VRAM                        | 13B   | 16,384           |
| 24 GB VRAM (RTX 4090 / A10G)      | 7B    | 131,072          |
| 24 GB VRAM                        | 13B   | 32,768           |
| 24 GB VRAM                        | 70B   | 4,096            |
| 48 GB VRAM (2× RTX 4090 / A6000)  | 70B   | 16,384           |
| 80 GB VRAM (A100 / H100)          | 70B   | 65,536           |
| 64 GB unified (M2 Ultra / M3 Max) | 70B   | 32,768           |

These are practical ceilings — stay 10–20% below to avoid OOM during long generations.
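This budgeting can be approximated programmatically. A rough sketch: `max_safe_ctx` is a hypothetical helper, the 15% headroom mirrors the "stay 10–20% below" advice, and the 128 MB-per-1k figure for a grouped-query-attention 7B model is an assumption, not a measured value:

```python
def max_safe_ctx(vram_gb, weights_gb, mb_per_1k_tokens, headroom=0.15):
    """Largest num_ctx (rounded down to a multiple of 2048) that fits the budget."""
    budget_mb = (vram_gb * (1 - headroom) - weights_gb) * 1024
    if budget_mb <= 0:
        return 0  # weights alone already exceed the usable budget
    tokens = int(budget_mb / mb_per_1k_tokens * 1024)
    return (tokens // 2048) * 2048

# 24 GB card, 7B Q4_K_M weights ~4.5 GB, ~128 MB KV cache per 1k tokens (assumed)
print(max_safe_ctx(24, 4.5, 128))  # → 129024
```

The result lands close to the 131,072 row in the table; for fp16-KV full-MHA models, plug in the larger per-1k figures from the earlier section and the safe ceiling drops sharply.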


Verification

Confirm the context window is active and tokens are not being truncated:

# Generate a prompt that requires reading from > 2048 tokens back
# and check the response references early content correctly
python3 - << 'EOF'
import requests, json

# Build a prompt with a unique marker near the start and a question at the end
marker = "UNIQUE_MARKER_XJ7Q"
filler = "This is filler text to pad the context. " * 300  # ~2,700 tokens, past the 2048 default
question = "What was the unique marker mentioned at the very beginning of this text?"

prompt = f"Remember this: {marker}. {filler} {question}"

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.3:70b",
    "prompt": prompt,
    "stream": False,
    "options": {"num_ctx": 8192}
})

print(resp.json()["response"])
EOF

You should see: The model correctly returns UNIQUE_MARKER_XJ7Q in its answer. If it says it doesn't know or invents a different marker, num_ctx is still too low or the model isn't loaded with the new setting.

Also verify via the metrics field in the API response:

curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.3:70b",
  "prompt": "Hello",
  "stream": false,
  "options": {"num_ctx": 32768}
}' | jq '{prompt_eval_count, eval_count}'

prompt_eval_count shows the number of tokens actually processed. If it's capped at 2048 regardless of input length, the new num_ctx hasn't taken effect — restart Ollama and re-run.


What You Learned

  • Ollama defaults to num_ctx: 2048 on every model regardless of what the weights support — this is intentional hardware safety, not a bug
  • Raising num_ctx increases KV cache allocation linearly; a 7B model at 32k context costs ~6 GB of VRAM versus ~400 MB at 2k
  • Modelfiles are the right tool for permanent per-model defaults; options.num_ctx in the API body is right for per-request overrides
  • OLLAMA_NUM_GPU lets you offload KV cache to CPU RAM when VRAM is the bottleneck
  • Always verify with a marker test — silent truncation is the most common failure mode and produces no error messages

Tested on Ollama 0.6.x, Llama 3.3 70B Q4_K_M, Qwen2.5 14B Q8_0 — Ubuntu 24.04 (CUDA 12.4) and macOS Sequoia (Metal / M2 Ultra)


FAQ

Q: Does raising num_ctx slow down generation speed? A: Mostly no. Prefill (reading your input) scales with input length, so long prompts take longer to process, but token generation speed (tokens/sec) is largely unaffected unless the KV cache overflows to CPU RAM, which can cut throughput by 10–50×.

Q: What is the maximum num_ctx Ollama supports? A: Ollama passes num_ctx directly to llama.cpp as n_ctx. The practical ceiling is the model's trained context length — 128k for Llama 3.3, 32k for Mistral 7B v0.3. Going above the trained context produces degraded quality regardless of hardware.

Q: Can I set num_ctx globally for all models at once? A: Not via a single environment variable. The cleanest approach is to set it per Modelfile for each model you use regularly, or always pass options.num_ctx in your API client. A global OLLAMA_NUM_CTX env var is not supported as of Ollama 0.6.

Q: Does num_ctx affect the context window shown in Open WebUI? A: Yes. Open WebUI reads the active model parameters at session start. If you created a Modelfile variant with a higher num_ctx, select that model name in the UI and it will use the new window size automatically.

Q: Why is my generation slow even with enough VRAM for the larger context? A: Flash Attention must be enabled for large context windows to be efficient. Ollama enables it automatically when running on CUDA. On Metal (Apple Silicon), attention performance at very long contexts (> 64k) can still be slower than CUDA — this is a known limitation as of early 2026.