Deploy Qwen2.5-VL Locally: Vision Language Model Setup 2026

Run Qwen2.5-VL 7B or 72B locally with Ollama or vLLM for image understanding, OCR, and visual reasoning. Tested on Python 3.12, CUDA 12, and Apple Silicon.

Problem: Running a Local Vision Language Model That Actually Works

Qwen2.5-VL local deployment is the fastest path to private, production-grade image understanding without paying $0.01–$0.03 per image to cloud APIs. Whether you need OCR on scanned PDFs, visual QA on product screenshots, or structured data extraction from tables and charts, Qwen2.5-VL handles it on your own hardware.

You'll learn:

  • How to pull and run Qwen2.5-VL 7B via Ollama (CPU + GPU offloading)
  • How to serve Qwen2.5-VL 72B with vLLM for production throughput
  • How to send image inputs through the OpenAI-compatible API in Python

Time: 25 min | Difficulty: Intermediate


Why Qwen2.5-VL Beats the Previous Generation

Qwen2.5-VL is Alibaba's current flagship vision-language model family, released in early 2025. It replaces Qwen-VL and Qwen2-VL with a significantly improved visual encoder and a dynamic resolution system: it processes images at their native resolution, up to 1,280 visual tokens of 28×28 pixels each, instead of forcing everything into a fixed 448px crop.

Why this matters for local use:

  • Native resolution input means OCR accuracy on dense documents is far better than LLaVA or the original Qwen-VL
  • The 7B model fits in 8 GB VRAM with Q4 quantization — a single RTX 3080 or 4070 handles it
  • The 72B model matches GPT-4o on several visual benchmarks at zero API cost
  • Permissive licensing — the 7B model is Apache 2.0; the 3B (Qwen Research License) and 72B (Qwen license) carry extra terms, so check the Hugging Face model card before commercial use

Model sizes available:

Model            Full precision VRAM   Q4_K_M VRAM   Best for
Qwen2.5-VL-3B    ~7 GB                 ~2.5 GB       Edge, Raspberry Pi 5, testing
Qwen2.5-VL-7B    ~16 GB                ~5 GB         Single consumer GPU, M2/M3
Qwen2.5-VL-72B   ~144 GB               ~45 GB        Multi-GPU server, production
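The VRAM figures above follow a simple rule of thumb: parameter count times bits per weight, plus a margin for the KV cache, activations, and the vision encoder. A quick sketch — the ~4.5 bits/weight average for Q4_K_M and the 20% overhead factor are rough working assumptions, not official numbers:

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float, overhead: float = 0.2) -> float:
    """Rule-of-thumb VRAM: weight storage plus a fractional margin for
    KV cache, activations, and the vision encoder."""
    weights_gb = params_b * bits_per_weight / 8  # billions of params -> GB
    return round(weights_gb * (1 + overhead), 1)

print(estimate_vram_gb(7, 4.5))   # ~4.7 GB, close to the ~5 GB in the table
print(estimate_vram_gb(72, 4.5))  # ~48.6 GB
print(estimate_vram_gb(7, 16))    # BF16: ~16.8 GB
```

Real usage also scales with context length, so treat these as lower bounds when planning hardware.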

Prerequisites

Before you start, confirm:

  • GPU path: NVIDIA GPU with CUDA 12.1+ or Apple Silicon (M1/M2/M3/M4)
  • CPU-only path: Works, but expect roughly 3–8 tokens/sec on the 7B Q4 model, depending on your CPU
  • Docker 25+ installed (optional but recommended for vLLM)
  • Python 3.11 or 3.12 for the API client examples

Check your CUDA version:

nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
nvcc --version

Expected output:

NVIDIA GeForce RTX 4080, 16376 MiB
nvcc: release 12.3
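That nvidia-smi line is easy to parse if you want a script to choose a model size automatically. A small sketch — the `pick_model` helper and its VRAM thresholds are illustrative, derived from the sizing table earlier in the article, not from anything Ollama provides:

```python
def pick_model(vram_mib: int) -> str:
    # Thresholds roughly follow the Q4_K_M column of the sizing table
    if vram_mib >= 48_000:
        return "qwen2.5vl:72b"
    if vram_mib >= 6_000:
        return "qwen2.5vl:7b"
    return "qwen2.5vl:3b"

# Parse the CSV line printed by nvidia-smi --query-gpu=name,memory.total
line = "NVIDIA GeForce RTX 4080, 16376 MiB"
vram_mib = int(line.rsplit(",", 1)[1].strip().split()[0])
print(pick_model(vram_mib))  # qwen2.5vl:7b
```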

Solution

Step 1: Pull Qwen2.5-VL with Ollama

Ollama 0.5+ ships with Qwen2.5-VL support built in. Install or update first:

# macOS / Linux — installs to /usr/local/bin/ollama
curl -fsSL https://ollama.com/install.sh | sh

# Verify version — must be 0.5.0 or later for VL models
ollama --version

Pull the 7B model with Q4_K_M quantization — this is the best quality-to-VRAM ratio for single-GPU use:

# Q4_K_M = ~5 GB download, runs on 8 GB VRAM or 16 GB unified memory
ollama pull qwen2.5vl:7b

# For 72B on a multi-GPU machine with 48+ GB VRAM total
ollama pull qwen2.5vl:72b

Start the Ollama server if it isn't already running:

ollama serve

Expected output:

time=2026-03-11T08:00:00Z level=INFO msg="Listening on 127.0.0.1:11434"
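Once the server is up, you can confirm the model is registered before wiring up any clients. Ollama's native /api/tags endpoint lists pulled models; a minimal check, using a hardcoded sample payload in place of a live response:

```python
import json
from urllib.request import urlopen

def model_available(tags: dict, name: str) -> bool:
    # /api/tags returns {"models": [{"name": "qwen2.5vl:7b", ...}, ...]}
    return any(m["name"].startswith(name) for m in tags.get("models", []))

# Against a live server you would fetch the real payload:
#   tags = json.load(urlopen("http://localhost:11434/api/tags"))
sample = {"models": [{"name": "qwen2.5vl:7b"}, {"name": "llama3.2:3b"}]}
print(model_available(sample, "qwen2.5vl"))  # True
```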

Step 2: Send Your First Image Prompt

Test the model directly from the CLI with a local image:

# Include a local file path in the prompt; Ollama detects it and
# base64-encodes the image automatically (there is no --image flag)
ollama run qwen2.5vl:7b "Describe what is in this image: ./screenshot.png"

For a remote image, download it first, since the CLI only reads local paths:

curl -sL -o demo.png https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png
ollama run qwen2.5vl:7b "Extract all text from this image: ./demo.png"

If it fails:

  • Error: model requires vision support → Your Ollama version is below 0.5.0. Run ollama --version and reinstall.
  • Error: out of memory → Switch to the 3B model: ollama pull qwen2.5vl:3b

Step 3: Call the OpenAI-Compatible API in Python

Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1. Use the openai SDK unchanged; only the base_url (and a dummy api_key) differ from the hosted API.

Install the SDK:

pip install openai --quiet

Send an image from a local file:

import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama ignores the key — any non-empty string works
)

def encode_image(path: str) -> str:
    # Base64-encode so the model receives raw bytes, not a file path
    return base64.b64encode(Path(path).read_bytes()).decode("utf-8")

image_data = encode_image("invoice.png")

response = client.chat.completions.create(
    model="qwen2.5vl:7b",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_data}"
                    },
                },
                {
                    "type": "text",
                    "text": "Extract the invoice number, date, and total amount as JSON.",
                },
            ],
        }
    ],
    max_tokens=512,
    temperature=0.1,  # Low temp for structured extraction — reduces hallucination on numbers
)

print(response.choices[0].message.content)

Expected output:

{
  "invoice_number": "INV-2026-0042",
  "date": "2026-03-10",
  "total_amount": "$1,240.00"
}
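In practice the model sometimes wraps that JSON in markdown fences or adds a sentence of prose around it, so parse defensively rather than calling json.loads on the raw reply. One way to do it:

```python
import json
import re

def extract_json(reply: str) -> dict:
    """Grab the first {...} span and parse it, tolerating markdown fences
    or prose the model wraps around the JSON."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    return json.loads(match.group(0))

reply = 'Here is the extracted data: {"invoice_number": "INV-2026-0042", "total_amount": "$1,240.00"}'
data = extract_json(reply)
print(data["invoice_number"])  # INV-2026-0042
```

If extraction still fails intermittently, retrying once with temperature 0 usually settles it.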

Step 4: Serve Qwen2.5-VL 72B with vLLM (Production)

For higher throughput — multiple concurrent users, batch jobs, or latency under 500 ms — use vLLM instead of Ollama. vLLM handles continuous batching and PagedAttention, which Ollama does not.

Pull the model from Hugging Face and serve it:

# Requires 2x A100 80GB or 4x RTX 4090 for the full 72B BF16 model.
# --tensor-parallel-size must match your GPU count; --max-model-len 32768
# leaves room for large images plus long prompts.
docker run --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-VL-72B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --dtype bfloat16

For the 7B model on a single A10G (24 GB, common on AWS g5.xlarge at ~$1.01/hr us-east-1):

docker run --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --max-model-len 16384 \
  --dtype bfloat16

Expected output:

INFO:     Started server process
INFO:     Uvicorn running on http://0.0.0.0:8000

Point the same Python client at vLLM by changing base_url:

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM also ignores the key value
)
# Change model name to match the HF repo ID
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    ...
)
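vLLM's continuous batching only pays off if you actually send requests concurrently; a sequential loop leaves the GPU idle between calls. A minimal fan-out sketch, where `ask` is a stand-in for a function wrapping client.chat.completions.create:

```python
from concurrent.futures import ThreadPoolExecutor

def classify_batch(ask, prompts: list[str], workers: int = 8) -> list[str]:
    """Fan prompts out concurrently so vLLM can merge the in-flight
    requests on the GPU; the result order matches the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ask, prompts))

# With a live server, `ask` would call the API and return
# response.choices[0].message.content. A stub shows the shape:
results = classify_batch(lambda p: p.upper(), ["a", "b", "c"])
print(results)  # ['A', 'B', 'C']
```

Start with 8 workers and raise it until latency climbs; past a point, extra concurrency just queues inside vLLM.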

Step 5: Multi-Image and Document Understanding

Qwen2.5-VL accepts multiple images in a single turn. This is useful for comparing before/after screenshots, multi-page documents, or visual diff tasks.

def make_image_message(paths: list[str], prompt: str) -> dict:
    content = []

    for path in paths:
        encoded = encode_image(path)
        ext = Path(path).suffix.lstrip(".").lower()  # png, jpg, webp — all supported
        mime = "jpeg" if ext == "jpg" else ext  # image/jpg is not a valid MIME type
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/{mime};base64,{encoded}"}
        })

    content.append({"type": "text", "text": prompt})
    return {"role": "user", "content": content}

response = client.chat.completions.create(
    model="qwen2.5vl:7b",
    messages=[
        make_image_message(
            paths=["page1.png", "page2.png", "page3.png"],
            prompt="Summarize the key figures from each page of this report."
        )
    ],
    max_tokens=1024,
)

Token budget note: Each 1024×1024 image consumes roughly 1,280 tokens with Qwen2.5-VL's dynamic resolution encoder. At num_ctx=8192 (Ollama default), you can fit ~5 full-resolution images plus a detailed prompt. Raise num_ctx if you need more:

# In Modelfile or via API options
PARAMETER num_ctx 16384
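The arithmetic behind that budget is easy to check. A sketch of the token accounting — PATCH and MAX_TOKENS reflect the dynamic-resolution figures cited above, and the exact numbers are approximations of what the serving layer does:

```python
import math

PATCH = 28          # one visual token per 28x28-pixel patch
MAX_TOKENS = 1280   # default per-image token cap

def image_tokens(width: int, height: int) -> int:
    """Tokens an image consumes: one per patch, capped by the serving layer."""
    return min(math.ceil(width / PATCH) * math.ceil(height / PATCH), MAX_TOKENS)

def images_that_fit(num_ctx: int, reserved: int) -> int:
    """How many full-resolution images fit after reserving prompt + completion tokens."""
    return max((num_ctx - reserved) // MAX_TOKENS, 0)

print(image_tokens(1024, 1024))     # 1280 (37 x 37 = 1369 patches, capped)
print(images_that_fit(8192, 1000))  # 5 at the Ollama default num_ctx
```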

Verification

Run this end-to-end smoke test to confirm the stack is working:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5vl:7b",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"}},
          {"type": "text", "text": "What colors are in this image?"}
        ]
      }
    ],
    "max_tokens": 100
  }'

You should see: A JSON response with choices[0].message.content describing colors in the image. If content is null or you get model does not support vision, re-pull with ollama pull qwen2.5vl:7b.


What You Learned

  • Qwen2.5-VL uses native-resolution image encoding, so send images as-is; resizing them first degrades OCR accuracy
  • The Ollama endpoint is OpenAI-compatible: swap base_url and model and your existing GPT-4 vision code works immediately
  • For batch jobs processing 100+ images, vLLM on an AWS g5.xlarge (A10G, us-east-1, ~$1.01/hr on-demand) is more cost-efficient than calling GPT-4o at $0.01–$0.03 per image
  • num_ctx controls the token window — raise it for multi-image tasks, lower it to save VRAM on constrained hardware

Tested on Qwen2.5-VL-7B-Instruct (Ollama 0.5.4), vLLM 0.6.x, Python 3.12, CUDA 12.3, Ubuntu 22.04 and macOS 15 (M3 Max)


FAQ

Q: Does Qwen2.5-VL work on CPU only without a GPU? A: Yes — Ollama runs it on CPU using llama.cpp's GGUF backend. Expect 3–8 tokens/sec on a modern 8-core CPU with the 7B Q4 model. The 72B model is not practical without a GPU.

Q: What image formats does Qwen2.5-VL accept? A: PNG, JPEG, WEBP, and GIF (first frame only). Send images as base64 data URIs or public URLs. Local file paths are not accepted by the HTTP API — you must encode them first.

Q: What is the maximum image resolution Qwen2.5-VL supports? A: The model supports dynamic resolution up to 1,280 visual tokens per image, at 28×28 pixels per token, which works out to roughly one megapixel (about 1000×1000). Larger images are downscaled automatically by the serving layer.

Q: Can Qwen2.5-VL run alongside another Ollama model at the same time? A: Ollama loads one model into VRAM at a time by default. Set OLLAMA_MAX_LOADED_MODELS=2 if you have enough VRAM to keep both resident and want to avoid cold-start latency when switching.

Q: Is Qwen2.5-VL suitable for HIPAA or SOC 2 workloads? A: Running it locally on your own infrastructure means no data leaves your environment, which is a prerequisite for HIPAA and SOC 2. You are still responsible for access controls, audit logging, and encryption at rest on your infra — the model itself does not provide compliance guarantees.