Problem: Ollama Unloads Your Model Between Requests
Ollama keep-alive memory management is the fastest way to stop wasting 10–30 seconds on model reload latency every time your app sends a request after a short idle. By default, Ollama unloads a model from GPU memory 5 minutes after its last use. If your agent, API, or dev workflow sends requests intermittently, you pay that cold-start tax on every gap.
You'll learn:
- How `keep_alive` works at the request and server level
- How to pin a model in VRAM permanently, or evict it instantly on demand
- How to tune memory residency for multi-model setups and production Docker deployments
Time: 15 min | Difficulty: Intermediate
Why Ollama Evicts Models from Memory
Ollama loads a model into GPU (or CPU) memory on first inference and starts a countdown timer. When the keep_alive window expires with no new requests, Ollama calls the unload routine — freeing VRAM so another model can load.
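The lifecycle above can be sketched as a toy countdown timer. This is an illustrative model, not Ollama's actual code: each request resets the clock, and the model unloads once the keep_alive window passes with no new requests.

```python
# Toy sketch of the residency lifecycle (not Ollama's implementation):
# a request resets the expiry clock; -1 disables expiry entirely.
class LoadedModel:
    def __init__(self, name, keep_alive_s, now=0.0):
        self.name = name
        self.keep_alive_s = keep_alive_s  # seconds; -1 means never expire
        self.last_use = now

    def touch(self, now):
        self.last_use = now  # each inference request resets the countdown

    def expired(self, now):
        if self.keep_alive_s == -1:
            return False  # pinned: eviction timer never fires
        return now - self.last_use > self.keep_alive_s

m = LoadedModel("llama3.2", keep_alive_s=300)  # the default 5-minute window
m.touch(now=10.0)
idle = m.expired(now=200.0)    # 190 s idle: still resident
evict = m.expired(now=400.0)   # 390 s idle: past the window, would unload
```

The key property to notice: only the gap since the *last* request matters, so steady traffic never pays the reload penalty, while bursty traffic with gaps longer than the window pays it on every burst.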
This default is sensible for local dev where you switch models often. It becomes a problem when:
- Your LLM API backend has uneven traffic (bursts + idle gaps)
- You're running an AI agent loop with think-time between calls
- You want sub-second first-token latency on every request, not just the hot ones
Symptoms:
- First request after idle takes 15–45 seconds (model reloading from disk)
- GPU memory spikes then drops to near-zero during idle
- Logs show `model loaded` repeatedly for the same model
Ollama's model lifecycle: the keep_alive window is the only thing standing between hot inference and a full reload penalty
Solution
Step 1: Set keep_alive Per Request
The fastest way to control residency is in the request body itself. The keep_alive field accepts a Go duration string or a plain integer (seconds).
# Keep the model loaded for 30 minutes after this request
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3.2",
"prompt": "Explain VRAM paging",
"keep_alive": "30m"
}'
# Pin indefinitely — model stays until you explicitly unload or restart Ollama
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3.2",
"prompt": "Explain VRAM paging",
"keep_alive": -1
}'
# Evict immediately after this request completes — frees VRAM right away
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3.2",
"prompt": "Explain VRAM paging",
"keep_alive": 0
}'
Valid duration formats: "5m", "1h", "30s", "-1" (infinite), 0 (immediate evict)
Expected output: Response streams normally. Check GPU memory with nvidia-smi or sudo powermetrics --samplers gpu_power on Apple Silicon — the model stays resident after the -1 request.
If it fails:
- `unknown field keep_alive` → You're hitting an Ollama version older than 0.1.24. Run `ollama --version` and upgrade with `curl -fsSL https://ollama.com/install.sh | sh`.
- Model still evicts after `"keep_alive": -1` → The server-level `OLLAMA_KEEP_ALIVE` env var may be overriding per-request values on some builds. Set both (Step 2).
Step 2: Set the Server-Level Default with OLLAMA_KEEP_ALIVE
Per-request keep_alive controls individual calls. For a persistent server where every request should keep the model hot, set the environment variable instead of touching every API call.
# systemd — edit the Ollama service drop-in
sudo systemctl edit ollama.service
[Service]
Environment="OLLAMA_KEEP_ALIVE=1h"
sudo systemctl daemon-reload && sudo systemctl restart ollama
For Docker deployments (the most common production pattern in 2026):
# docker-compose.yml
services:
ollama:
image: ollama/ollama:latest
runtime: nvidia # requires nvidia-container-toolkit
environment:
- OLLAMA_KEEP_ALIVE=24h
- OLLAMA_NUM_PARALLEL=4 # handle concurrent requests without unloading
volumes:
- ollama_data:/root/.ollama
ports:
- "11434:11434"
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
volumes:
ollama_data:
docker compose up -d
Expected output: ollama container starts, GPU memory shows the model resident after the first warm-up request.
If it fails:
- `OLLAMA_KEEP_ALIVE` has no effect → Confirm with `docker exec ollama env | grep OLLAMA`. If missing, check YAML indentation: `environment` must be a list under the service, not under `deploy`.
- OOM on GPU → You set `keep_alive=-1` on a model that takes most of your VRAM, then triggered a second model load. See Step 4.
Step 3: Pre-Load a Model at Server Startup
OLLAMA_KEEP_ALIVE only keeps a model in memory after its first inference. To load the model before any real request arrives, send a warm-up request with an empty prompt at startup (an empty prompt loads the model without generating tokens).
# Warm up llama3.2 immediately — model stays loaded per OLLAMA_KEEP_ALIVE setting
curl -s http://localhost:11434/api/generate \
-d '{"model": "llama3.2", "prompt": "", "keep_alive": -1}' \
> /dev/null
Add this to your Docker command or a systemd ExecStartPost so every restart auto-warms the model:
# In a shell script run after Ollama starts
#!/bin/bash
sleep 3 # give Ollama time to bind the port
curl -s http://localhost:11434/api/generate \
-d '{"model": "llama3.2", "prompt": "", "keep_alive": -1}' \
> /dev/null
echo "Model warm."
Expected output: nvidia-smi shows VRAM allocated for the model before any real traffic hits.
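The `sleep 3` above is a guess at how long Ollama takes to bind its port. A sketch of a more robust warm-up, polling with retries instead of a fixed sleep (stdlib only; assumes the default endpoint `http://localhost:11434`):

```python
"""Warm-up sketch: retry until Ollama answers, then pin the model."""
import json
import time
import urllib.error
import urllib.request

OLLAMA = "http://localhost:11434"

def build_warmup_payload(model, keep_alive=-1):
    # An empty prompt loads the model without generating any tokens.
    return json.dumps(
        {"model": model, "prompt": "", "keep_alive": keep_alive}
    ).encode()

def warm_up(model, retries=10, delay=1.0):
    for _ in range(retries):
        try:
            req = urllib.request.Request(
                f"{OLLAMA}/api/generate",
                data=build_warmup_payload(model),
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(req, timeout=300).read()
            print(f"{model} warm.")
            return
        except (urllib.error.URLError, ConnectionError):
            time.sleep(delay)  # server not up yet; retry
    raise RuntimeError(f"Ollama did not answer after {retries} attempts")

# warm_up("llama3.2")  # call this after starting the Ollama service
```

The generous request timeout matters: the warm-up call blocks while the model loads from disk, which can take minutes for large quantizations.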
Step 4: Manage Multiple Models and VRAM Budgets
When you run two or more models, Ollama evicts the least-recently-used model when a new one needs VRAM. But a model pinned with keep_alive: -1 can't be evicted, so pinning two models whose combined footprint exceeds your VRAM causes an out-of-memory failure instead of a graceful swap.
The production pattern: pin your primary model to infinite, and let secondary models use a finite window.
# Primary model — always hot (8–12 GB VRAM on a 24 GB card)
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.2", "prompt": "", "keep_alive": -1}' \
> /dev/null
# Secondary model — 20-minute window covers most session lengths
curl http://localhost:11434/api/generate \
-d '{"model": "nomic-embed-text", "prompt": "", "keep_alive": "20m"}' \
> /dev/null
VRAM budget planning for common 2026 setups:
| Model | Quant | VRAM | Recommended keep_alive |
|---|---|---|---|
| llama3.2 3B | Q4_K_M | ~2.4 GB | -1 (always-on) |
| llama3.3 70B | Q4_K_M | ~43 GB | "1h" (session-scoped) |
| mistral-nemo 12B | Q4_K_M | ~8 GB | -1 on 24 GB card |
| nomic-embed-text | F16 | ~0.3 GB | "30m" |
| qwen2.5-coder 32B | Q4_K_M | ~20 GB | "2h" on A100 |
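The pin-or-window decision in the table can be automated. A hypothetical planning helper (the sizes, headroom value, and fallback window are illustrative assumptions, not measurements):

```python
def plan_residency(models, vram_gb, headroom_gb=1.0):
    """Decide keep_alive per model given a VRAM budget.

    models: list of (name, est_size_gb, pin_requested) in priority order.
    Reserves headroom_gb for KV cache growth; models that want a pin but
    don't fit fall back to a finite 30-minute window.
    """
    plan, used = {}, headroom_gb
    for name, size_gb, pin in models:
        if pin and used + size_gb <= vram_gb:
            plan[name] = "-1"   # fits alongside earlier pins: pin it
            used += size_gb
        else:
            plan[name] = "30m"  # finite window (by request or by necessity)
    return plan

# Rough footprints from the table above, on a 24 GB card
plan = plan_residency(
    [
        ("llama3.2", 2.4, True),
        ("qwen2.5-coder:32b", 20.0, True),
        ("nomic-embed-text", 0.3, False),
    ],
    vram_gb=24.0,
)
```

Walking through it: with 1 GB headroom, llama3.2 (2.4 GB) and qwen2.5-coder (20 GB) both fit under 24 GB and get pinned; the embedder keeps its finite window. Swap the card for a 16 GB one and the helper demotes qwen2.5-coder to `"30m"` instead of letting two pins collide into an OOM.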
Step 5: Force-Unload a Model on Demand
Sometimes you need to free VRAM immediately — before a heavier model call, or to hot-swap to a different quantization. Set keep_alive: 0 on any request to trigger immediate eviction.
# Force unload — useful before loading a large model or after a heavy batch job
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.2", "prompt": "", "keep_alive": 0}'
You can also list currently loaded models to verify what's resident:
curl http://localhost:11434/api/ps
Expected output:
{
"models": [
{
"name": "llama3.2:latest",
"size_vram": 2516582400,
"expires_at": "0001-01-01T00:00:00Z"
}
]
}
"expires_at": "0001-01-01..." means the model is pinned with keep_alive: -1. A real timestamp means it will evict at that time.
Verification
# Check what's loaded and when it expires
curl -s http://localhost:11434/api/ps | python3 -m json.tool
You should see: Your pinned model with expires_at showing the zero value (infinite) or a future timestamp matching your duration setting.
# Real-time VRAM usage (NVIDIA)
watch -n 2 nvidia-smi --query-gpu=memory.used,memory.free --format=csv
# Apple Silicon
sudo powermetrics --samplers gpu_power -i 2000 -n 1
You should see: VRAM stays allocated during idle if keep_alive is non-zero. It drops only after expiry or an explicit keep_alive: 0 call.
What You Learned
- `keep_alive` is a per-request field that overrides the server default for that call's residency window
- `OLLAMA_KEEP_ALIVE` sets the server-wide default: `-1` pins all models, `0` evicts immediately after every request
- `/api/ps` is your live view into which models are resident and when they expire
- Pinning multiple large models with `keep_alive: -1` on a single GPU causes OOM; budget VRAM explicitly and use finite windows for secondary models
- Pre-loading with an empty-prompt `curl` at startup eliminates cold-start latency for production traffic
Tested on Ollama 0.6.2, CUDA 12.4, Docker 27, Ubuntu 24.04 LTS, and macOS Sequoia (M2 Max)
FAQ
Q: Does keep_alive: -1 survive an Ollama restart?
A: No. Infinite pinning is runtime state only. After a restart, the model is unloaded from memory — you need a warm-up request or startup script to reload it. See Step 3.
Q: What is the difference between keep_alive and num_keep?
A: They control different things. keep_alive is a duration: how long the model stays in memory after its last use. num_keep is a token count, set under the request's options: it tells Ollama how many tokens from the start of the prompt (typically your system prompt) to retain when the context window overflows and older tokens are discarded. Both reduce latency, but they solve separate problems.
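To make the distinction concrete, here is a sketch of a request body using both fields. The field placement follows Ollama's API convention of keep_alive at the top level and model options nested under "options"; the specific values are illustrative:

```python
import json

body = {
    "model": "llama3.2",
    "prompt": "Summarize the conversation so far.",
    "keep_alive": "30m",           # residency: stay loaded 30 min after use
    "options": {"num_keep": 64},   # context: retain the first 64 prompt tokens
}
request_json = json.dumps(body)
```

Note the asymmetry: keep_alive sits beside model and prompt, while num_keep goes inside options with the other sampling and context parameters.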
Q: Can I set different keep_alive values for different models on the same server?
A: Yes, per-request keep_alive is model-scoped. Set keep_alive: -1 in your primary model's warm-up call and a shorter duration in the secondary model's call — Ollama tracks expiry per loaded model independently.
Q: What happens if VRAM fills up while a model is pinned with keep_alive: -1?
A: Ollama will fail to load the new model and return an error rather than evict the pinned one. Use /api/ps to check current residency and either increase your VRAM budget, use a smaller quantization, or explicitly unload the pinned model with keep_alive: 0 first.
Q: Does OLLAMA_KEEP_ALIVE work on the Ollama Docker image from Docker Hub?
A: Yes — pass it as an environment variable in your docker run -e OLLAMA_KEEP_ALIVE=1h or in the environment block of docker-compose.yml. Confirm it's set with docker exec <container> env | grep OLLAMA_KEEP_ALIVE.