Problem: Running a Local Vision Language Model That Actually Works
Qwen2.5-VL local deployment is the fastest path to private, production-grade image understanding without paying $0.01–$0.03 per image to cloud APIs. Whether you need OCR on scanned PDFs, visual QA on product screenshots, or structured data extraction from tables and charts, Qwen2.5-VL handles it on your own hardware.
You'll learn:
- How to pull and run Qwen2.5-VL 7B via Ollama (CPU + GPU offloading)
- How to serve Qwen2.5-VL 72B with vLLM for production throughput
- How to send image inputs through the OpenAI-compatible API in Python
Time: 25 min | Difficulty: Intermediate
Why Qwen2.5-VL Beats the Previous Generation
Qwen2.5-VL is Alibaba's current flagship vision-language model family, released in early 2025. It replaces Qwen-VL and Qwen2-VL with a significantly improved visual encoder and a dynamic resolution system — it processes images at their native resolution, up to roughly 1,280 visual tokens per image (each token covering a 28×28-pixel patch), instead of forcing everything into a fixed 448px crop.
Why this matters for local use:
- Native resolution input means OCR accuracy on dense documents is far better than LLaVA or the original Qwen-VL
- The 7B model fits in 8 GB VRAM with Q4 quantization — a single RTX 3080 or 4070 handles it
- The 72B model matches GPT-4o on several visual benchmarks at zero API cost
- Permissive licensing on the 7B flagship — Qwen2.5-VL-7B is Apache 2.0, so commercial use is allowed; the 3B and 72B checkpoints ship under Qwen-specific licenses, so check their model cards before commercial deployment
Model sizes available:
| Model | Full precision VRAM | Q4_K_M VRAM | Best for |
|---|---|---|---|
| Qwen2.5-VL-3B | ~7 GB | ~2.5 GB | Edge, Raspberry Pi 5, testing |
| Qwen2.5-VL-7B | ~16 GB | ~5 GB | Single consumer GPU, M2/M3 |
| Qwen2.5-VL-72B | ~144 GB | ~45 GB | Multi-GPU server, production |
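The table above can double as a quick tag picker. A minimal sketch using the Q4_K_M figures from the table — the thresholds are rounded assumptions with headroom for KV cache and vision-encoder activations, not official requirements:

```python
def pick_qwen_vl_tag(free_vram_gb: float) -> str:
    """Map available VRAM (GiB) to an Ollama tag, using the Q4_K_M
    footprints from the table above plus a few GB of headroom
    (rough assumption) for KV cache and activations."""
    if free_vram_gb >= 48:
        return "qwen2.5vl:72b"   # ~45 GB Q4_K_M
    if free_vram_gb >= 8:
        return "qwen2.5vl:7b"    # ~5 GB Q4_K_M
    return "qwen2.5vl:3b"        # ~2.5 GB Q4_K_M

print(pick_qwen_vl_tag(16.0))  # → qwen2.5vl:7b
```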
Prerequisites
Before you start, confirm:
- GPU path: NVIDIA GPU with CUDA 12.1+ or Apple Silicon (M1/M2/M3/M4)
- CPU-only path: Works, but expect roughly 3–8 tokens/sec on the 7B Q4 model depending on your CPU
- Docker 25+ installed (optional but recommended for vLLM)
- Python 3.11 or 3.12 for the API client examples
Check your CUDA version:
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
nvcc --version
Expected output:
NVIDIA GeForce RTX 4080, 16376 MiB
nvcc: release 12.3
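If you want a script to branch on the GPU check, the nvidia-smi CSV line parses cleanly. A small sketch, assuming the csv,noheader format used above (the helper names are my own):

```python
import subprocess

def parse_gpu_line(line: str) -> tuple[str, int]:
    """Parse one 'name, memory.total' CSV line, e.g.
    'NVIDIA GeForce RTX 4080, 16376 MiB' -> (name, 16376)."""
    name, mem = line.strip().rsplit(", ", 1)
    return name, int(mem.removesuffix(" MiB"))

def gpu_info() -> tuple[str, int]:
    """Run the same nvidia-smi query shown above. Raises
    FileNotFoundError on machines without the NVIDIA driver."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader"],
        text=True,
    )
    return parse_gpu_line(out.splitlines()[0])
```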
Solution
Step 1: Pull Qwen2.5-VL with Ollama
Recent Ollama releases ship with Qwen2.5-VL support built in. Install or update first:
# macOS / Linux — installs to /usr/local/bin/ollama
curl -fsSL https://ollama.com/install.sh | sh
# Verify the install — update to the latest release if the VL pull below fails
ollama --version
Pull the 7B model with Q4_K_M quantization — this is the best quality-to-VRAM ratio for single-GPU use:
# Q4_K_M = ~5 GB download, runs on 8 GB VRAM or 16 GB unified memory
ollama pull qwen2.5vl:7b
# For 72B on a multi-GPU machine with 48+ GB VRAM total
ollama pull qwen2.5vl:72b
Start the Ollama server if it isn't already running:
ollama serve
Expected output:
time=2026-03-11T08:00:00Z level=INFO msg="Listening on 127.0.0.1:11434"
Step 2: Send Your First Image Prompt
Test the model directly from the CLI with a local image. The Ollama CLI has no separate image flag — include the file path in the prompt and Ollama detects and base64-encodes it automatically:
ollama run qwen2.5vl:7b "Describe what is in this image: ./screenshot.png"
The CLI only accepts local paths, so download URL-referenced images first:
curl -sLo demo.png https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png
ollama run qwen2.5vl:7b "Extract all text from this image: ./demo.png"
If it fails:
- Error: model requires vision support → Your Ollama build is too old for the qwen2.5vl models. Check ollama --version and update to the latest release.
- Error: out of memory → Switch to the 3B model: ollama pull qwen2.5vl:3b
Step 3: Call the OpenAI-Compatible API in Python
Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1. Use the openai SDK without changing anything except the base_url.
Install the SDK:
pip install openai --quiet
Send an image from a local file:
import base64
from pathlib import Path
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama", # Ollama ignores the key — any non-empty string works
)
def encode_image(path: str) -> str:
# Base64-encode so the model receives raw bytes, not a file path
return base64.b64encode(Path(path).read_bytes()).decode("utf-8")
image_data = encode_image("invoice.png")
response = client.chat.completions.create(
model="qwen2.5vl:7b",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": f"data:image/png;base64,{image_data}"
},
},
{
"type": "text",
"text": "Extract the invoice number, date, and total amount as JSON.",
},
],
}
],
max_tokens=512,
temperature=0.1, # Low temp for structured extraction — reduces hallucination on numbers
)
print(response.choices[0].message.content)
Expected output:
{
"invoice_number": "INV-2026-0042",
"date": "2026-03-10",
"total_amount": "$1,240.00"
}
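In practice the model sometimes wraps the JSON in markdown fences or adds a sentence of preamble, so parse defensively. A hedged helper — the fence-stripping heuristic is my own convention, not part of any API:

```python
import json
import re

def extract_json(reply: str) -> dict:
    """Pull the first JSON object out of a model reply, tolerating
    ```json fences and surrounding prose."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object found in: {reply!r}")
    return json.loads(match.group(0))

reply = '```json\n{"invoice_number": "INV-2026-0042"}\n```'
print(extract_json(reply)["invoice_number"])  # prints INV-2026-0042
```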
Step 4: Serve Qwen2.5-VL 72B with vLLM (Production)
For higher throughput — multiple concurrent users, batch jobs, or latency under 500 ms — use vLLM instead of Ollama. vLLM handles continuous batching and PagedAttention, which Ollama does not.
Pull the model from Hugging Face and serve it:
# Full-precision BF16 weights alone are ~144 GB, so plan on at least
# 2x A100/H100 80 GB. 4x RTX 4090 (96 GB total) is only enough for a
# quantized AWQ/GPTQ variant, not BF16.
# Set --tensor-parallel-size to your GPU count; 32K context leaves
# headroom for large images plus long prompts.
docker run --gpus all --ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model Qwen/Qwen2.5-VL-72B-Instruct \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--dtype bfloat16
For the 7B model on a single A10G (24 GB, common on AWS g5.xlarge at ~$1.01/hr us-east-1):
docker run --gpus all --ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model Qwen/Qwen2.5-VL-7B-Instruct \
--max-model-len 16384 \
--dtype bfloat16
Expected output:
INFO: Started server process
INFO: Uvicorn running on http://0.0.0.0:8000
Point the same Python client at vLLM by changing base_url:
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="EMPTY", # vLLM also ignores the key value
)
# Change model name to match the HF repo ID
response = client.chat.completions.create(
model="Qwen/Qwen2.5-VL-7B-Instruct",
...
)
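Since both backends speak the same protocol, the only per-backend differences are base_url and the model name. One way to keep a single code path is a small settings helper — a sketch using a hypothetical QWEN_BACKEND env var (my own convention, not an Ollama or vLLM one):

```python
import os

def backend_settings() -> dict:
    """Return OpenAI-client kwargs plus model name for either backend.
    Both Ollama and vLLM ignore the api_key value but require one."""
    if os.environ.get("QWEN_BACKEND", "ollama") == "vllm":
        return {"base_url": "http://localhost:8000/v1",
                "api_key": "EMPTY",
                "model": "Qwen/Qwen2.5-VL-7B-Instruct"}
    return {"base_url": "http://localhost:11434/v1",
            "api_key": "ollama",
            "model": "qwen2.5vl:7b"}

cfg = backend_settings()
# client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
# ... then pass model=cfg["model"] to chat.completions.create
```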
Step 5: Multi-Image and Document Understanding
Qwen2.5-VL accepts multiple images in a single turn. This is useful for comparing before/after screenshots, multi-page documents, or visual diff tasks.
def make_image_message(paths: list[str], prompt: str) -> dict:
content = []
for path in paths:
encoded = encode_image(path)
ext = Path(path).suffix.lstrip(".").lower().replace("jpg", "jpeg")  # MIME subtype is "jpeg", not "jpg"; png/webp pass through
content.append({
"type": "image_url",
"image_url": {"url": f"data:image/{ext};base64,{encoded}"}
})
content.append({"type": "text", "text": prompt})
return {"role": "user", "content": content}
response = client.chat.completions.create(
model="qwen2.5vl:7b",
messages=[
make_image_message(
paths=["page1.png", "page2.png", "page3.png"],
prompt="Summarize the key figures from each page of this report."
)
],
max_tokens=1024,
)
Token budget note: Each 1024×1024 image consumes roughly 1,280 tokens with Qwen2.5-VL's dynamic resolution encoder. At num_ctx=8192 (Ollama default), you can fit ~5 full-resolution images plus a detailed prompt. Raise num_ctx if you need more:
# In Modelfile or via API options
PARAMETER num_ctx 16384
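The OpenAI-compatible endpoint has no num_ctx field, but Ollama's native /api/chat route accepts it under options. A minimal payload builder — the field names follow Ollama's native API, while the helper itself is my own sketch:

```python
def ollama_chat_payload(model: str, prompt: str,
                        images_b64: list[str], num_ctx: int) -> dict:
    """Build a request body for Ollama's native POST /api/chat.
    Unlike the OpenAI route, images go in a per-message `images`
    list as raw base64 strings (no data: URI prefix)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt,
                      "images": images_b64}],
        "options": {"num_ctx": num_ctx},  # overrides the 8192 default
        "stream": False,
    }
```

POST the resulting dict as JSON to http://localhost:11434/api/chat; the same options block also accepts parameters such as temperature and num_predict.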
Verification
Run this end-to-end smoke test to confirm the stack is working:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5vl:7b",
"messages": [
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"}},
{"type": "text", "text": "What colors are in this image?"}
]
}
],
"max_tokens": 100
}'
You should see: A JSON response with choices[0].message.content describing colors in the image. If content is null or you get model does not support vision, re-pull with ollama pull qwen2.5vl:7b.
What You Learned
- Qwen2.5-VL uses native-resolution image encoding — do not resize images before sending, it degrades accuracy
- The Ollama endpoint is OpenAI-compatible: swap base_url and model and your existing GPT-4 vision code works immediately
- For batch jobs processing 100+ images, vLLM on an AWS g5.xlarge (A10G, us-east-1, ~$1.01/hr on-demand) is more cost-efficient than calling GPT-4o at $0.01–$0.03 per image
- num_ctx controls the token window — raise it for multi-image tasks, lower it to save VRAM on constrained hardware
Tested on Qwen2.5-VL-7B-Instruct (Ollama 0.5.4), vLLM 0.6.x, Python 3.12, CUDA 12.3, Ubuntu 22.04 and macOS 15 (M3 Max)
FAQ
Q: Does Qwen2.5-VL work on CPU only without a GPU? A: Yes — Ollama runs it on CPU using llama.cpp's GGUF backend. Expect 3–8 tokens/sec on a modern 8-core CPU with the 7B Q4 model. The 72B model is not practical without a GPU.
Q: What image formats does Qwen2.5-VL accept? A: PNG, JPEG, WEBP, and GIF (first frame only). Send images as base64 data URIs or public URLs. Local file paths are not accepted by the HTTP API — you must encode them first.
Q: What is the maximum image resolution Qwen2.5-VL supports? A: The model supports dynamic resolution up to 1,280 visual tokens per image, which corresponds roughly to a 1024×1024-pixel image. Larger images are downscaled automatically by the serving layer.
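A back-of-envelope way to budget per-image token cost, assuming one visual token per 28×28-pixel patch capped at the 1,280-token per-image budget — an approximation consistent with the figures quoted earlier, not the model's exact resizing algorithm:

```python
import math

def approx_image_tokens(width: int, height: int, patch: int = 28) -> int:
    """Rough visual-token estimate: one token per 28x28-pixel patch,
    capped at the model's ~1,280-token per-image budget (approximation)."""
    tokens = math.ceil(width / patch) * math.ceil(height / patch)
    return min(tokens, 1280)

print(approx_image_tokens(1024, 1024))  # prints 1280 (hits the cap)
```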
Q: Can Qwen2.5-VL run alongside another Ollama model at the same time? A: Ollama loads one model into VRAM at a time by default. Set OLLAMA_MAX_LOADED_MODELS=2 if you have enough VRAM to keep both resident and want to avoid cold-start latency when switching.
Q: Is Qwen2.5-VL suitable for HIPAA or SOC 2 workloads? A: Running it locally on your own infrastructure means no data leaves your environment, which is a prerequisite for HIPAA and SOC 2. You are still responsible for access controls, audit logging, and encryption at rest on your infra — the model itself does not provide compliance guarantees.