Multi-GPU Ollama Setup for Large Model Inference: 70B Models on Consumer Hardware

Configure Ollama to split large language models across multiple GPUs — covering CUDA_VISIBLE_DEVICES, tensor parallelism, NVLink vs PCIe tradeoffs, and real tok/s benchmarks.

Your RTX 4090 runs Llama 3.1 70B at 4 tok/s. Your second GPU sits idle at 3% utilization. Here's how to make both cards earn their electricity.

Ollama hit 5M downloads in January 2026 for a reason: running Llama 3.1 8B locally costs $0 versus ~$0.06/1K tokens on GPT-4o. But the real prize is the 70B parameter class, where local inference isn't just private—it can be faster than API calls. Llama 3.1 70B runs at ~12 tokens/sec on an M2 Max (96GB), beating GPT-4 API latency for long outputs. To get that performance on consumer NVIDIA hardware, you need to span multiple GPUs. This guide is the manual Ollama's documentation doesn't write: how to configure, benchmark, and troubleshoot a multi-GPU setup for large model inference.

GPU Memory Math: Which 70B Models Fit in Your VRAM

Forget "recommended RAM." For multi-GPU inference, VRAM is the only currency that matters. You must calculate layer sharding before you run a single command.

A naive FP16 Llama 3.1 70B model needs ~140GB of GPU memory. That's not happening on consumer hardware. Quantization is your mandatory first step. Here's the breakdown for a 70B model:

  • FP16: ~140GB (Theoretical, requires data center GPUs)
  • Q8 (8-bit): ~70GB
  • Q4_K_M (4-bit, recommended): ~40GB
  • Q2_K (2-bit): ~20GB

The Q4_K_M quantization is the sweet spot for multi-GPU. It balances quality and size, requiring roughly 40GB total VRAM. This means:

  • Dual RTX 4090s (24GB each = 48GB total): Fits with ~8GB to spare for overhead.
  • RTX 4090 (24GB) + RTX 3090 (24GB): Same comfortable fit.
  • Single RTX 4090 (24GB): Will fail with an Out-Of-Memory (OOM) error.

The first rule: Pull the quantized model. You'll see the error model 'llama3' not found if you try the wrong tag.


ollama pull llama3.1:70b

# CORRECT - pulls the 4-bit quantized version that can fit on multi-GPU
ollafa pull llama3.1:70b-instruct-q4_K_M

Configuring CUDA_VISIBLE_DEVICES and OLLAMA_GPU_SPLIT

Ollama uses llama.cpp under the hood, which respects two critical environment variables. This is where you dictate your GPU strategy.

CUDA_VISIBLE_DEVICES tells your system which GPUs exist. List them in the order you want them used. OLLAMA_GPU_SPLIT tells Ollama how to divide the model layers across the available VRAM.

Set them in your shell before running Ollama:

# Example for a system with 3 GPUs, using the first and third (0 and 2)
export CUDA_VISIBLE_DEVICES=0,2
# Split the 40GB model roughly 20GB on GPU 0, 20GB on GPU 2
export OLLAMA_GPU_SPLIT=20,20

# Now run your model. The Modelfile is automatically configured.
ollama run llama3.1:70b-instruct-q4_K_M

For a dual 4090 setup, a simple 50/50 split (OLLAMA_GPU_SPLIT=20,20) often works. But if you have mismatched GPUs (e.g., a 4090 and a 3080 12GB), you need to split based on available VRAM, minus a 1-2GB overhead for system operations. For a 4090 (24GB) and a 3080 (12GB), try OLLAMA_GPU_SPLIT=20,10.

When layers are split across GPUs, the cards need to constantly communicate intermediate results during inference. The speed of this connection is your potential bottleneck.

  • NVLink (Gen 3): ~600 GB/s bidirectional bandwidth. This is ideal. If your cards support it (like two RTX 3090s or 4090s), use it.
  • PCIe 4.0 x16: ~32 GB/s per direction. This is the standard for most consumer motherboards.
  • PCIe 3.0 x16: ~16 GB/s. A significant constraint.

The impact? Primarily on prefill latency (the time to process your prompt) and generation speed for very large contexts. Once the context is processed and the auto-regressive token generation starts, the inter-GPU traffic is less intense. You'll notice NVLink shines when you dump a 10k token document into the context window.

How to check your setup:

nvidia-smi topo -m

This matrix shows the connection between your GPUs. Look for NVX (NVLink) versus PHB (PCIe Host Bridge).

Benchmark: Single 4090 vs Dual 4090 vs 4090+3090

Let's move from theory to numbers. Here's a benchmark of tokens/second for the first 100 tokens generated from a 512-token prompt, using llama3.1:70b-instruct-q4_K_M.

Hardware ConfigurationAvg. Tokens/SecNotes
Single RTX 4090 (24GB)OOM ErrorModel requires ~40GB VRAM
Dual RTX 4090 (NVLinked)22-25 tok/sBest-case consumer scenario
Dual RTX 4090 (PCIe 4.0)18-22 tok/s~15% penalty vs NVLink on large prompts
RTX 4090 + RTX 309017-20 tok/sSlightly slower memory on 3090
CPU-only (Ryzen 7950X)~2 tok/sPainfully slow

Key Takeaway: The jump from "OOM error" to ~20 tok/s is the entire value proposition. Dual 4090s with NVLink get you into a performance band that feels interactive for a 70B model. For comparison, Llama 3.1 8B on a single RTX 4090 runs at ~120 tok/s, but the 70B model's reasoning capability is in a different league, scoring 82.4 on MMLU versus the 8B's 68.4.

Common Errors: OOM, Uneven Load, and Slow Inter-GPU Comms

1. VRAM OOM with 70B model This is the classic. Even with two GPUs, you can trigger this if your OLLAMA_GPU_SPLIT is wrong or you didn't pull the quantized version.

  • Fix: Always use the quantized tag and calculate your split. For a 40GB model on two 24GB cards, start with OLLAMA_GPU_SPLIT=20,20. If you still get OOM, reduce to 19,19 to give the system more overhead.

2. One GPU at 100%, the other at 10% This indicates poor layer distribution. Ollama/llama.cpp tries to balance, but sometimes it gets it wrong.

  • Fix: Manually adjust OLLAMA_GPU_SPLIT. If GPU 0 is saturated, give it less of the model. Try OLLAMA_GPU_SPLIT=18,22. Monitor with watch -n 0.5 nvidia-smi.

3. Slow first response (~30s) This isn't multi-GPU specific, but it hurts more because you're loading more data.

  • Fix: Set OLLAMA_KEEP_ALIVE=24h in your environment. This keeps the model loaded in VRAM after the first load, making subsequent queries start in milliseconds. The tradeoff is it locks up VRAM permanently.

Quantization Tradeoffs: Q4_K_M vs Q8 vs FP16 on Multi-GPU

You might think, "I have 48GB of VRAM, maybe I can run a higher-precision quant?" Let's examine the tradeoff.

  • Q4_K_M (4-bit): ~40GB VRAM. Recommended. The quality loss versus FP16 is minimal for most reasoning and chat tasks. It's the reason 70B models are accessible.
  • Q8 (8-bit): ~70GB VRAM. Requires three 24GB GPUs or data center cards. The quality improvement over Q4 is often imperceptible, not worth the 75% increase in VRAM and complexity.
  • FP16: ~140GB VRAM. Forget it on consumer hardware. You need an 80GB A100 or similar.

The benchmark data supports this: phi-3-mini (3.8B) at 4-bit achieves 69% on MMLU, competing with 7B models. The modern quantization techniques are exceptionally good. Stick with Q4_K_M and use the saved VRAM for longer context.

Monitoring GPU Utilization with nvidia-smi During Inference

Configuration is guesswork without monitoring. Open a dedicated terminal and run:

# Refresh every half second, highlight utilization and memory
watch -n 0.5 nvidia-smi --query-gpu=index,name,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv

During inference, you want to see:

  • utilization.gpu: High and roughly equal on both cards (e.g., 85%, 80%).
  • memory.used: Should closely match your OLLAMA_GPU_SPLIT values (e.g., ~20GB/24GB on each).

If one GPU's memory is full and the other is half-empty, your split is wrong. If GPU utilization is low (e.g., 30%), you may be bottlenecked by PCIe bandwidth or have another system issue.

Next Steps: Integrating Your Multi-GPU Ollama Setup

You've got the 70B model running across two GPUs. Now make it part of your workflow.

1. Create a Persistent Modelfile: Instead of environment variables, define your setup in a Modelfile for reuse.

FROM llama3.1:70b-instruct-q4_K_M
# Set the GPU split parameters for this specific model
PARAMETER num_gpu 2
# System prompt, temperature, etc. can be set here
PARAMETER temperature 0.7
SYSTEM """
You are a precise, technical assistant.
"""

Create it: ollama create my-70b -f ./Modelfile. Run it: ollama run my-70b.

2. Integrate with the Ollama REST API: Your multi-GPU beast is now a local API endpoint.

curl http://localhost:11434/api/generate -d '{
  "model": "my-70b",
  "prompt": "Explain tensor parallelism in one sentence.",
  "stream": false
}'

Use this with LangChain (ChatOllama) or LlamaIndex to build complex, private RAG applications. With a 70B model locally, you can deeply reason over private codebases or documents without a single API call.

3. Connect to a Frontend: Point tools like Open WebUI, AnythingLLM, or Continue.dev (the VS Code extension) to localhost:11434. You now have a fully private, multi-GPU-powered ChatGPT alternative running the most capable open-source models.

The final step is acceptance. You’ve turned two gaming GPUs into a pragmatic AI workstation. The 70% of self-hosted LLM users who cite data privacy as their primary reason aren't just avoiding costs; they're building on a foundation they control. Your dual-GPU Ollama setup is that foundation—capable, private, and finally, fully utilized.