Run DeepSeek R1-Distill-Qwen-7B on Consumer GPU with Ollama

Set up DeepSeek R1-Distill-Qwen-7B on a 6–8GB consumer GPU using Ollama. Covers install, quantization, GPU config, and benchmarks.

Problem: DeepSeek R1 Reasoning Model Won't Fit Your GPU

DeepSeek R1 full is 671B parameters — completely out of reach for consumer hardware. But the R1-Distill-Qwen-7B variant brings the same chain-of-thought reasoning capability down to a size that fits on a 6GB GPU. The catch: most guides skip the quantization and GPU offload details that make the difference between a usable and unusable experience.

You'll learn:

  • Which quantization format to use for 6GB, 8GB, and 12GB VRAM
  • How to run DeepSeek R1-Distill-Qwen-7B via Ollama with persistent GPU config
  • How to verify reasoning output is working (the <think> traces)

Time: 20 min | Difficulty: Intermediate


Why R1-Distill-Qwen-7B Is Worth Running Locally

DeepSeek distilled the reasoning behavior of R1 into smaller base models. The 7B variant uses Qwen2.5-7B as the base and was trained on R1-generated chain-of-thought data. You get step-by-step reasoning traces inside <think> tags — the same pattern as the full model — at a fraction of the cost.

Comparison of DeepSeek R1 variants:

Model                     Parameters     Min VRAM (fp16)   Recommended
R1 Full                   671B           ~1.3TB            Data center only
R1-Distill-Llama-70B      70B            ~140GB            2× A100 80GB
R1-Distill-Qwen-14B       14B            ~28GB             RTX 3090/4090
R1-Distill-Qwen-7B        7B             ~14GB             RTX 3060 12GB
R1-Distill-Qwen-7B Q4     7B (Q4_K_M)    ~4.5GB            GTX 1660 / RX 6600

The Q4_K_M quantization of the 7B model runs on GPUs as old as a GTX 1060 6GB.
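These VRAM figures follow from simple arithmetic: parameter count × bits per weight, plus runtime overhead. A minimal sketch — the 20% overhead factor is an assumption, and real usage varies with context length and runtime:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: model weights plus ~20% for KV cache and buffers.

    The overhead factor is an assumption; actual usage depends on
    context length, batch size, and the inference runtime.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9  # decimal GB

# fp16 is 16 bits/weight; Q4_K_M averages roughly 4.5 bits/weight
print(f"7B fp16:   ~{estimate_vram_gb(7, 16):.1f} GB")
print(f"7B Q4_K_M: ~{estimate_vram_gb(7, 4.5):.1f} GB")
```

The Q4_K_M estimate lands right around the ~4.5GB in the table above, which is why the model squeezes onto 6GB cards with room left for context.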


Solution

Step 1: Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows (PowerShell, run as admin)
winget install Ollama.Ollama

# Verify
ollama --version

Expected output:

ollama version 0.5.x

If it fails:

  • command not found → Restart your terminal after install; PATH wasn't updated
  • Windows: winget not found → Install App Installer from the Microsoft Store first

Step 2: Pull the Model

Ollama hosts DeepSeek R1 distill variants directly. Pull the 7B model:

# Default pull — Ollama picks Q4_K_M automatically (~4.7GB download)
ollama pull deepseek-r1:7b

# Verify the model is present
ollama list

Expected output:

NAME                    ID              SIZE    MODIFIED
deepseek-r1:7b          a42b25d8c10a    4.7 GB  Just now

If you need higher quality output and have 8–12GB VRAM, pull the Q8 variant instead:

# Q8_0 — better quality, ~7.7GB, needs 8GB+ VRAM
ollama pull deepseek-r1:7b-qwen-distill-q8_0

Step 3: Run a Reasoning Test

ollama run deepseek-r1:7b "What is 17 × 24? Show your work."

You should see a <think> block before the answer — this confirms the reasoning traces are active:

<think>
17 × 24
= 17 × 20 + 17 × 4
= 340 + 68
= 408
</think>

17 × 24 = **408**

If you only see the final answer with no <think> block, the wrong model variant was pulled. Re-pull with the explicit tag above.


Step 4: Configure GPU Layer Offloading

By default, Ollama detects your GPU and offloads as many layers as VRAM allows. If you have a low-VRAM card or want to pin a specific layer count, create a Modelfile:

cat > Modelfile << 'EOF'
FROM deepseek-r1:7b

# num_gpu: layers offloaded to GPU
# The Qwen2.5-7B base has 28 transformer layers; 29 offloads
# everything (28 blocks + output layer) — fits on 6GB with Q4_K_M.
# Reduce by a few if you see OOM errors.
PARAMETER num_gpu 29

# Context window — 4096 is safe on 6GB; raise to 8192 on 8GB
PARAMETER num_ctx 4096
EOF

ollama create deepseek-r1-local -f Modelfile
ollama run deepseek-r1-local

To keep the model loaded between prompts (avoiding cold-start lag), set the server environment variable OLLAMA_KEEP_ALIVE=-1 — keep-alive is a server/API setting, not a Modelfile PARAMETER.

VRAM reference for layer count:

VRAM    Recommended num_gpu    Context window
6GB     24–29                  4096
8GB     29 (all)               8192
12GB    29 (all)               16384

If it fails:

  • Error: model requires more VRAM → Drop num_gpu by 4 and retry
  • Runs but very slow → You're hitting RAM offload; lower num_ctx to 2048
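If you cycle through several layer/context combinations while tuning, generating the Modelfile programmatically saves retyping. A minimal sketch — `make_modelfile` is a hypothetical helper, not part of Ollama:

```python
def make_modelfile(base: str, num_gpu: int, num_ctx: int) -> str:
    """Build Modelfile text for a given GPU layer count and context size."""
    return (
        f"FROM {base}\n"
        f"PARAMETER num_gpu {num_gpu}\n"
        f"PARAMETER num_ctx {num_ctx}\n"
    )

# Write a 6GB-friendly config, then: ollama create deepseek-r1-local -f Modelfile
with open("Modelfile", "w") as f:
    f.write(make_modelfile("deepseek-r1:7b", num_gpu=28, num_ctx=4096))
```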

Step 5: Use the OpenAI-Compatible API

Ollama exposes an OpenAI-compatible endpoint. This means you can swap DeepSeek R1 into any app that already uses OpenAI:

from openai import OpenAI

# Point to local Ollama instead of api.openai.com
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required by the client library, value is ignored
)

response = client.chat.completions.create(
    model="deepseek-r1-local",
    messages=[
        {"role": "user", "content": "Explain how backpropagation works in 3 steps."}
    ],
    temperature=0.6,  # R1 distill models work well at 0.5–0.7
)

# The <think> block is inside the content — strip it if you don't need it
print(response.choices[0].message.content)

Strip reasoning traces in Python:

import re

def strip_think(text: str) -> str:
    # Remove <think>...</think> blocks including multiline
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

answer = strip_think(response.choices[0].message.content)

Verification

Run a multi-step reasoning task that will expose any GPU memory issues:

ollama run deepseek-r1-local "A train leaves at 9am traveling 60mph. Another leaves the same station at 10am at 90mph. When does the second train catch the first?"

You should see: a <think> block working through the algebra, followed by a clean answer. Total response time should be under 30 seconds on a 6GB GPU.
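The expected answer is easy to verify by hand, so you can grade the model's reasoning — at 10am the first train is 60 miles ahead, and the gap closes at 30mph:

```python
# First train: departs 9am at 60 mph. Second: departs 10am at 90 mph.
head_start_miles = 60 * 1        # first train's lead when the second departs
closing_speed = 90 - 60          # mph
hours_to_catch = head_start_miles / closing_speed
print(f"Caught {hours_to_catch:.0f} hours after 10am -> 12 noon")  # 2 hours
```

If the model's <think> block reaches the same noon answer, quantization hasn't broken its arithmetic on this class of problem.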

Check active GPU usage:

# NVIDIA
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv -l 2

# AMD (Linux)
rocm-smi --showmemuse

# Apple Silicon
sudo powermetrics --samplers gpu_power -i 1000 -n 3

For the Q4_K_M 7B model with num_ctx 4096, expect ~4.2–4.8GB VRAM in use on NVIDIA hardware.
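To watch for memory creep in a script rather than eyeballing the console, you can parse the nvidia-smi CSV rows. A sketch assuming the default `--format=csv` layout for the query above (comma-separated fields, `MiB` units):

```python
def parse_gpu_mem(csv_line: str) -> tuple[int, int]:
    """Parse one nvidia-smi CSV data row into (used_mib, total_mib).

    Assumes the query name,memory.used,memory.total with default csv
    formatting, e.g. "NVIDIA GeForce RTX 3060, 4500 MiB, 12288 MiB".
    """
    _, used, total = [field.strip() for field in csv_line.split(",")]
    return int(used.split()[0]), int(total.split()[0])

used, total = parse_gpu_mem("NVIDIA GeForce RTX 3060, 4500 MiB, 12288 MiB")
print(f"{used}/{total} MiB ({100 * used / total:.0f}% of VRAM)")
```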


What You Learned

  • R1-Distill-Qwen-7B at Q4_K_M fits in 6GB VRAM — the <think> reasoning traces work at this quantization
  • num_gpu and num_ctx in a Modelfile give you reliable, reproducible GPU configuration
  • The Ollama OpenAI-compatible endpoint makes it a drop-in replacement for API-dependent tools

Limitation: At Q4_K_M, complex multi-step math and code reasoning degrade compared to fp16. For production coding tasks, run Q8_0 on an 8GB card instead. The 14B distill model is a significant step up in accuracy if your GPU supports it.

Tested on Ollama 0.5.4, RTX 3060 12GB and GTX 1660 Super 6GB, Ubuntu 24.04 and Windows 11