# Problem: DeepSeek R1 Reasoning Model Won't Fit Your GPU
DeepSeek R1 full is 671B parameters — completely out of reach for consumer hardware. But the R1-Distill-Qwen-7B variant brings the same chain-of-thought reasoning capability down to a size that fits on a 6GB GPU. The catch: most guides skip the quantization and GPU offload details that make the difference between a usable and unusable experience.
You'll learn:
- Which quantization format to use for 6GB, 8GB, and 12GB VRAM
- How to run DeepSeek R1-Distill-Qwen-7B via Ollama with persistent GPU config
- How to verify reasoning output is working (the `<think>` traces)
Time: 20 min | Difficulty: Intermediate
## Why R1-Distill-Qwen-7B Is Worth Running Locally
DeepSeek distilled the reasoning behavior of R1 into smaller base models. The 7B variant uses Qwen2.5-7B as the base and was trained on R1-generated chain-of-thought data. You get step-by-step reasoning traces inside `<think>` tags, the same pattern as the full model, at a fraction of the cost.
Comparison of DeepSeek R1 variants:
| Model | Parameters | Approx. VRAM (fp16 weights) | Recommended hardware |
|---|---|---|---|
| R1 Full | 671B (MoE, 37B active) | ~1.3TB | Data center only |
| R1-Distill-Llama-70B | 70B | ~140GB | 2× A100 80GB |
| R1-Distill-Qwen-14B | 14B | ~28GB | RTX 3090/4090 (quantized) |
| R1-Distill-Qwen-7B | 7B | ~14GB | RTX 3060 12GB (Q8) |
| R1-Distill-Qwen-7B Q4 | 7B (Q4_K_M, ~4.5GB) | ~4.5GB | GTX 1660 / RX 6600 |
The Q4_K_M quantization of the 7B model runs on GPUs as old as a GTX 1060 6GB.
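These VRAM figures follow directly from parameter count times bits per weight. A back-of-envelope sketch (the bit-widths are typical GGUF averages, and the ~10% runtime overhead is my assumption, not an Ollama figure):

```python
def model_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Rough VRAM estimate: weights plus ~10% for runtime buffers (assumed)."""
    bytes_total = params_b * 1e9 * bits_per_weight / 8
    return round(bytes_total * 1.1 / 1e9, 1)

# Q4_K_M averages ~4.5 bits/weight, Q8_0 ~8.5, fp16 is exactly 16
for name, bits in [("Q4_K_M", 4.5), ("Q8_0", 8.5), ("fp16", 16)]:
    print(name, model_size_gb(7.0, bits), "GB")
# Q4_K_M 4.3 GB
# Q8_0 8.2 GB
# fp16 15.4 GB
```

The estimate tracks the table to within a few hundred megabytes; the KV cache grows with the context window on top of this.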
## Solution

### Step 1: Install Ollama
```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows (PowerShell, run as admin)
winget install Ollama.Ollama

# Verify
ollama --version
```
Expected output:
```
ollama version 0.5.x
```
If it fails:
- `command not found` → Restart your terminal after install; PATH wasn't updated
- Windows: `winget not found` → Install App Installer from the Microsoft Store first
### Step 2: Pull the Model
Ollama hosts DeepSeek R1 distill variants directly. Pull the 7B model:
```bash
# Default pull: Ollama picks Q4_K_M automatically (~4.7GB download)
ollama pull deepseek-r1:7b

# Verify the model is present
ollama list
```
Expected output:
```
NAME              ID              SIZE      MODIFIED
deepseek-r1:7b    a42b25d8c10a    4.7 GB    Just now
```
If you need higher quality output and have 8–12GB VRAM, pull the Q8 variant instead:
```bash
# Q8_0: better quality, ~7.7GB, needs 8GB+ VRAM
ollama pull deepseek-r1:7b-qwen-distill-q8_0
```
### Step 3: Run a Reasoning Test
```bash
ollama run deepseek-r1:7b "What is 17 × 24? Show your work."
```
You should see a `<think>` block before the answer; this confirms the reasoning traces are active:
```
<think>
17 × 24
= 17 × 20 + 17 × 4
= 340 + 68
= 408
</think>

17 × 24 = **408**
```
If you only see the final answer with no `<think>` block, the wrong model variant was pulled. Re-pull with the explicit tag above.
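When scripting against the API rather than reading terminal output, the same check can be automated. A minimal sketch (the `has_reasoning` helper is my name, not part of Ollama):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def has_reasoning(text: str) -> bool:
    """True only if the response contains a non-empty <think> block."""
    match = THINK_RE.search(text)
    return bool(match and match.group(1).strip())

print(has_reasoning("<think>17 x 24 = 408</think>\n408"))  # True
print(has_reasoning("408"))                                # False
```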
### Step 4: Configure GPU Layer Offloading
By default, Ollama detects your GPU and offloads as many layers as VRAM allows. If you have a low-VRAM card or want to pin a specific layer count, create a Modelfile:
```bash
cat > Modelfile << 'EOF'
FROM deepseek-r1:7b
# num_gpu: number of layers offloaded to the GPU
# 29 = all layers for this model (Qwen2.5-7B has 28 transformer layers plus the output layer)
# Reduce to 24 if you see OOM errors on 6GB; 29 is fine on 8GB+
PARAMETER num_gpu 29
# Context window: 4096 is safe on 6GB; raise to 8192 on 8GB
PARAMETER num_ctx 4096
EOF

ollama create deepseek-r1-local -f Modelfile
ollama run deepseek-r1-local
```

Note that `keep_alive` is not a valid Modelfile `PARAMETER`. To keep the model loaded between prompts and avoid cold-start lag, set the `OLLAMA_KEEP_ALIVE=-1` environment variable before starting the Ollama server, or pass `keep_alive` in API requests.
VRAM reference for layer count:
| VRAM | Recommended `num_gpu` | Context window |
|---|---|---|
| 6GB | 24–29 | 4096 |
| 8GB | 29 (all) | 8192 |
| 12GB | 29 (all) | 16384 |
If it fails:
- `Error: model requires more VRAM` → Drop `num_gpu` by 4 and retry
- Runs but very slow → You're hitting RAM offload; lower `num_ctx` to 2048
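If you generate Modelfiles for several machines, the table above folds into a small helper. A sketch assuming the Qwen2.5-7B base (28 transformer layers, so 29 offloads everything); `gpu_settings` is my name, not an Ollama API:

```python
def gpu_settings(vram_gb: int) -> dict:
    """Map card VRAM to the recommended num_gpu / num_ctx pairs."""
    if vram_gb >= 12:
        return {"num_gpu": 29, "num_ctx": 16384}
    if vram_gb >= 8:
        return {"num_gpu": 29, "num_ctx": 8192}
    # Conservative 6GB default; raise num_gpu toward 29 if you see no OOM
    return {"num_gpu": 24, "num_ctx": 4096}

print(gpu_settings(6))   # {'num_gpu': 24, 'num_ctx': 4096}
```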
### Step 5: Use the OpenAI-Compatible API
Ollama exposes an OpenAI-compatible endpoint. This means you can swap DeepSeek R1 into any app that already uses OpenAI:
```python
from openai import OpenAI

# Point to local Ollama instead of api.openai.com
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required by the client library; the value is ignored
)

response = client.chat.completions.create(
    model="deepseek-r1-local",
    messages=[
        {"role": "user", "content": "Explain how backpropagation works in 3 steps."}
    ],
    temperature=0.6,  # R1 distill models work well at 0.5-0.7
)

# The <think> block is inside the content; strip it if you don't need it
print(response.choices[0].message.content)
```
Strip reasoning traces in Python:
```python
import re

def strip_think(text: str) -> str:
    # Remove <think>...</think> blocks, including multiline ones
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

answer = strip_think(response.choices[0].message.content)
```
## Verification
Run a multi-step reasoning task that will expose any GPU memory issues:
```bash
ollama run deepseek-r1-local "A train leaves at 9am traveling 60mph. Another leaves the same station at 10am at 90mph. When does the second train catch the first?"
```
You should see a `<think>` block working through the algebra, followed by a clean answer. Total response time should be under 30 seconds on a 6GB GPU.
Check active GPU usage:
```bash
# NVIDIA
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv -l 2

# AMD (Linux)
rocm-smi --showmemuse

# Apple Silicon
sudo powermetrics --samplers gpu_power -i 1000 -n 3
```
For the Q4_K_M 7B model with `num_ctx` 4096, expect ~4.2–4.8GB of VRAM in use on NVIDIA hardware.
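To log usage over time instead of eyeballing it, the CSV output of `nvidia-smi` is easy to parse; with `--format=csv,noheader` each row is `name, used, total`. A small sketch (`parse_smi_csv` is my helper, not part of any library):

```python
def parse_smi_csv(line: str) -> tuple[str, float]:
    """Return (GPU name, fraction of VRAM in use) for one CSV row."""
    name, used, total = [field.strip() for field in line.split(",")]
    used_mib = float(used.split()[0])    # e.g. "4500 MiB" -> 4500.0
    total_mib = float(total.split()[0])
    return name, round(used_mib / total_mib, 2)

print(parse_smi_csv("NVIDIA GeForce GTX 1660 SUPER, 4500 MiB, 6144 MiB"))
# ('NVIDIA GeForce GTX 1660 SUPER', 0.73)
```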
## What You Learned
- R1-Distill-Qwen-7B at Q4_K_M fits in 6GB VRAM, and the `<think>` reasoning traces still work at this quantization
- `num_gpu` and `num_ctx` in a Modelfile give you reliable, reproducible GPU configuration
- The Ollama OpenAI-compatible endpoint makes it a drop-in replacement for API-dependent tools
Limitation: At Q4_K_M, complex multi-step math and code reasoning degrades compared to fp16. For production coding tasks, run Q8_0 on an 8GB card instead. The 14B distill model is a significant step up in accuracy if your GPU supports it.
*Tested on Ollama 0.5.4, RTX 3060 12GB and GTX 1660 Super 6GB, Ubuntu 24.04 and Windows 11*