Problem: Serving LLMs at Production Throughput with an OpenAI-Compatible API
vLLM production deployment gives you an OpenAI-compatible /v1/chat/completions endpoint backed by PagedAttention — the same technique that powers many hosted LLM APIs at scale. If you've hit throughput walls with Ollama or llama.cpp, or need to swap in a self-hosted model behind existing OpenAI SDK clients without touching application code, vLLM is the right tool.
You'll learn:
- Run vLLM as a Docker container with GPU passthrough on a single or multi-GPU machine
- Configure tensor parallelism, quantization (AWQ / GPTQ / FP8), and an API key for production use
- Point any OpenAI SDK client — Python, Node.js, or curl — at your vLLM server with zero code changes
Time: 25 min | Difficulty: Intermediate
Why vLLM Outperforms Naive LLM Serving
Standard inference servers reserve a contiguous KV-cache region per request, sized for the maximum sequence length, leaving GPU memory fragmented and throughput low. vLLM uses PagedAttention, which manages the KV cache like virtual memory pages — fixed-size blocks are allocated on demand and freed per request rather than reserved upfront. The result is near-zero KV-cache waste and continuous batching across concurrent requests.
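The paging idea can be sketched as a toy allocator in Python (illustrative only, not vLLM's actual implementation):

```python
# Toy paged KV-cache allocator (illustrative only): fixed-size blocks are
# handed out on demand and returned to a free pool when a request finishes.
class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size          # tokens per physical block
        self.free = list(range(num_blocks))   # free physical block IDs
        self.tables = {}                      # request_id -> list of block IDs
        self.lengths = {}                     # request_id -> tokens written

    def append_token(self, req: str) -> None:
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:          # current block full: grab a new page
            if not self.free:
                raise MemoryError("KV cache exhausted; request must wait")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req: str) -> None:      # request done: pages return to the pool
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)
```

Naive serving would instead reserve `max_seq_len // block_size` blocks per request at admission, whether or not the request ever uses them; here, unused pages stay in the pool for other concurrent requests.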
vLLM's request lifecycle: incoming prompts are batched, KV-cache pages are allocated dynamically, and responses are streamed back via the OpenAI-compatible `/v1` route.
Symptoms that send you to vLLM:
- Ollama throughput drops below 20 tok/s under concurrent load (3+ simultaneous users)
- You need a `/v1/chat/completions` endpoint that existing OpenAI SDK clients hit without modification
- Model VRAM exceeds a single GPU — you need tensor parallelism across 2 or 4 GPUs
Prerequisites
Before starting, confirm:
# Verify CUDA driver (needs ≥ 12.1 for vLLM 0.4+)
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
# Verify Docker with GPU support
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
Minimum specs:
- GPU: 16 GB VRAM for 7B models (FP16), 24 GB for 13B, 80 GB for 70B (or 2× 40 GB with tensor parallelism)
- CUDA driver: ≥ 525.85 (ships with CUDA 12.1+)
- Docker: 24.x with `nvidia-container-toolkit` installed
- Disk: 15–140 GB depending on model size
Solution
Step 1: Pull the Official vLLM Docker Image
vLLM publishes CUDA-matched images to avoid driver/library mismatches — the single most common source of startup crashes.
# Match your CUDA driver. For CUDA 12.4 hosts:
docker pull vllm/vllm-openai:latest
# Pin to a specific release for reproducible production deploys
docker pull vllm/vllm-openai:v0.6.4.post1
Expected output:
v0.6.4.post1: Pulling from vllm/vllm-openai
...
Status: Downloaded newer image for vllm/vllm-openai:v0.6.4.post1
If it fails:
- `docker: Error response from daemon: could not select device driver "nvidia"` → `sudo apt install nvidia-container-toolkit && sudo systemctl restart docker`
Step 2: Run vLLM with a Single GPU
Start with a 7B model to validate the stack before scaling.
docker run -d \
--name vllm-server \
--gpus all \
--ipc=host \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HUGGING_FACE_HUB_TOKEN="${HF_TOKEN}" \
vllm/vllm-openai:v0.6.4.post1 \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--dtype bfloat16 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 # Reserve 10% for CUDA ops — prevents OOM on long contexts
Flags explained:
- `--ipc=host` — Required for PyTorch shared memory between processes. Missing this causes silent hangs.
- `--gpu-memory-utilization 0.90` — vLLM pre-allocates this fraction of VRAM for the KV cache. Set lower (0.80) if you share the GPU.
- `--max-model-len 32768` — Caps the context window. Higher values consume more KV cache pages.
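A back-of-the-envelope way to see what `--gpu-memory-utilization` leaves for KV pages (a rough estimate; the 14 GB weight figure for Mistral 7B in BF16 is approximate, and activation overhead is ignored):

```python
def kv_cache_budget_gb(vram_gb: float, utilization: float, weights_gb: float) -> float:
    """Approximate VRAM left for KV-cache pages: vLLM pre-allocates
    utilization * total VRAM, and the model weights live inside that budget."""
    return utilization * vram_gb - weights_gb

# Mistral 7B in BF16 is roughly 14 GB of weights on a 24 GB GPU:
print(round(kv_cache_budget_gb(24, 0.90, 14.0), 1))  # ≈ 7.6 GB left for KV pages
```

If this number comes out near zero, lower `--max-model-len` or raise `--gpu-memory-utilization` cautiously.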
Expected output (docker logs vllm-server):
INFO: Started server process [1]
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Step 3: Add an API Key for Production Auth
By default the server is unauthenticated — anyone on the network can call it. Add a static key:
docker run -d \
--name vllm-server \
--gpus all \
--ipc=host \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HUGGING_FACE_HUB_TOKEN="${HF_TOKEN}" \
-e VLLM_API_KEY="sk-my-internal-key-change-this" \
vllm/vllm-openai:v0.6.4.post1 \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--api-key sk-my-internal-key-change-this \
--dtype bfloat16 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90
Any request missing Authorization: Bearer sk-my-internal-key-change-this now returns 401 Unauthorized.
Step 4: Scale to Multiple GPUs with Tensor Parallelism
For 70B models (or any model that doesn't fit on one GPU), split across GPUs:
docker run -d \
--name vllm-server-70b \
--gpus all \
--ipc=host \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HUGGING_FACE_HUB_TOKEN="${HF_TOKEN}" \
-e VLLM_API_KEY="sk-my-internal-key-change-this" \
vllm/vllm-openai:v0.6.4.post1 \
--model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 4 \
--dtype bfloat16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.92 \
--api-key sk-my-internal-key-change-this
--tensor-parallel-size must be a divisor of the model's attention head count. Llama 3.3 70B has 64 heads — valid values are 1, 2, 4, 8.
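A quick helper to enumerate valid values for a given model (a sketch; the attention head count comes from the model's `config.json`):

```python
def valid_tp_sizes(num_attention_heads: int, max_gpus: int = 8) -> list[int]:
    """GPU counts (up to max_gpus) that divide the attention head count evenly."""
    return [n for n in range(1, max_gpus + 1) if num_attention_heads % n == 0]

# Llama 3.3 70B has 64 attention heads:
print(valid_tp_sizes(64))  # [1, 2, 4, 8]
```

Passing an invalid size makes vLLM fail at startup, so checking the divisor rule before provisioning GPUs saves a restart cycle.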
Step 5: Enable Quantization to Reduce VRAM Usage
Run a 70B model on 2× A6000 (48 GB each) instead of 4× A100 by using AWQ 4-bit quantization:
docker run -d \
--name vllm-server-awq \
--gpus all \
--ipc=host \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HUGGING_FACE_HUB_TOKEN="${HF_TOKEN}" \
vllm/vllm-openai:v0.6.4.post1 \
--model casperhansen/llama-3.3-70b-instruct-awq \
--quantization awq \
--tensor-parallel-size 2 \
--dtype float16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.92
Two details to note: the model here is a pre-quantized AWQ checkpoint, and AWQ requires `--dtype float16`, not bfloat16.
Quantization options supported by vLLM:
| Method | VRAM reduction | Quality loss | Best for |
|---|---|---|---|
| `awq` (4-bit) | ~60% | Low | Production serving, throughput priority |
| `gptq` (4-bit) | ~60% | Low–medium | Widely quantized model checkpoints |
| `fp8` (8-bit) | ~30% | Minimal | H100 / H200 — uses native FP8 tensor cores |
| None (BF16) | 0% | None | A100/H100 with full VRAM budget |
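Weight footprints under each option can be estimated from parameter count times bits per weight (weights only; the KV cache and the 1–2 GB runtime overhead come on top):

```python
def weight_vram_gb(params_b: float, bits: int) -> float:
    """Approximate weight footprint in GB: parameters (billions) * bits / 8."""
    return params_b * 1e9 * bits / 8 / 1e9

for label, bits in [("bf16", 16), ("fp8", 8), ("awq 4-bit", 4)]:
    print(f"70B @ {label}: ~{weight_vram_gb(70, bits):.0f} GB")
```

This is why a 70B model needs 4× 40 GB in BF16 (~140 GB of weights) but fits on 2× 48 GB with AWQ (~35 GB of weights plus KV cache).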
Step 6: Call the Server from Your Application
The server is OpenAI API-compatible. Swap base_url in any existing client:
Python (openai SDK):
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1", # Point at vLLM instead of api.openai.com
api_key="sk-my-internal-key-change-this",
)
response = client.chat.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.3", # Must match --model arg exactly
messages=[{"role": "user", "content": "Explain PagedAttention in one paragraph."}],
temperature=0.7,
max_tokens=512,
stream=True,
)
for chunk in response:
print(chunk.choices[0].delta.content or "", end="", flush=True)
Node.js (openai SDK):
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:8000/v1",
apiKey: "sk-my-internal-key-change-this",
});
const stream = await client.chat.completions.create({
model: "mistralai/Mistral-7B-Instruct-v0.3",
messages: [{ role: "user", content: "Summarize the vLLM paper." }],
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
curl (quick smoke test):
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-my-internal-key-change-this" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 64
}'
Verification
# Check server health
curl http://localhost:8000/health
# List loaded models
curl http://localhost:8000/v1/models \
-H "Authorization: Bearer sk-my-internal-key-change-this"
# Check live GPU utilization while sending requests
watch -n 1 nvidia-smi
You should see:
- `/health` → `{"status":"ok"}`
- `/v1/models` → JSON with your model ID
- `nvidia-smi` → GPU utilization spiking to 80–99% during inference, not sitting at 0%
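The same checks can be scripted (a minimal sketch using only the standard library; the URL, key, and model name are the ones used in this guide and must match your deployment):

```python
import json
import urllib.request

def is_healthy(health_status: int, models_body: str, expected_model: str) -> bool:
    """True when /health returned 200 and /v1/models lists our model."""
    if health_status != 200:
        return False
    ids = [m["id"] for m in json.loads(models_body).get("data", [])]
    return expected_model in ids

if __name__ == "__main__":
    health = urllib.request.urlopen("http://localhost:8000/health").status
    req = urllib.request.Request(
        "http://localhost:8000/v1/models",
        headers={"Authorization": "Bearer sk-my-internal-key-change-this"},
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read().decode()
    print(is_healthy(health, body, "mistralai/Mistral-7B-Instruct-v0.3"))
```

Run it on a cron or as a container healthcheck to catch a crashed or wrongly configured server before clients do.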
Benchmark throughput (optional):
# vLLM ships a benchmark script — run inside the container
docker exec vllm-server python3 /app/benchmarks/benchmark_serving.py \
--backend vllm \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--num-prompts 200 \
--request-rate 10 # 10 requests/sec simulated concurrency
Expected: 400–900 tok/s on a single A10G (24 GB) for Mistral 7B FP16.
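If you'd rather not run the script inside the container, a rough client-side load test looks like this (a sketch assuming the Step 3 server with its API key is running on localhost:8000; numbers will be lower than the dedicated benchmark because this measures end-to-end latency):

```python
# Rough concurrent load test against the vLLM server (sketch).
import concurrent.futures
import json
import time
import urllib.request

URL = "http://localhost:8000/v1/chat/completions"
KEY = "sk-my-internal-key-change-this"

def throughput_tok_s(total_tokens: int, elapsed_s: float) -> float:
    """Aggregate generation throughput across all concurrent requests."""
    return total_tokens / elapsed_s

def completion_tokens(prompt: str) -> int:
    body = json.dumps({
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }).encode()
    req = urllib.request.Request(URL, data=body, headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {KEY}",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["usage"]["completion_tokens"]

if __name__ == "__main__":
    start = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        total = sum(pool.map(completion_tokens, ["Explain paging."] * 32))
    print(f"{throughput_tok_s(total, time.time() - start):.0f} tok/s")
```

Because vLLM batches continuously, aggregate tok/s should rise sharply as `max_workers` grows, which is exactly the behavior Ollama lacks.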
Production Docker Compose Setup
For persistent deployments with automatic restarts, use Compose:
# docker-compose.yml
services:
vllm:
image: vllm/vllm-openai:v0.6.4.post1
container_name: vllm-server
restart: unless-stopped
ipc: host
ports:
- "8000:8000"
volumes:
- huggingface_cache:/root/.cache/huggingface
environment:
HUGGING_FACE_HUB_TOKEN: ${HF_TOKEN}
VLLM_API_KEY: ${VLLM_API_KEY}
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
command: >
--model mistralai/Mistral-7B-Instruct-v0.3
--dtype bfloat16
--max-model-len 32768
--gpu-memory-utilization 0.90
--api-key ${VLLM_API_KEY}
volumes:
huggingface_cache:
# .env
HF_TOKEN=hf_your_token_here
VLLM_API_KEY=sk-my-internal-key-change-this
docker compose up -d
docker compose logs -f vllm
What You Learned
- vLLM's PagedAttention eliminates KV-cache fragmentation — this is why it outperforms naive serving under concurrent load
- `--tensor-parallel-size` splits model weights across GPUs; it must divide the model's attention head count evenly
- AWQ 4-bit quantization reduces VRAM by ~60% with minimal quality loss, making 70B models viable on 2× 48 GB GPUs
- The vLLM server is a drop-in OpenAI API replacement — `base_url` is the only change required in existing clients
- Don't use vLLM for single-user local inference with no concurrency — Ollama has lower overhead and simpler setup for that use case
Tested on vLLM v0.6.4.post1, CUDA 12.4, Python 3.12, Docker 27.x, Ubuntu 22.04 LTS
FAQ
Q: What is the difference between vLLM and Ollama for production use?
A: Ollama is optimized for single-user local inference with a simple CLI. vLLM is built for multi-user concurrent serving — it batches requests automatically and achieves 5–20× higher throughput under load. Use Ollama for local development, vLLM when you need an API that serves multiple users simultaneously.
Q: Does vLLM work without a Hugging Face token?
A: Yes, for public models. You only need `HF_TOKEN` for gated models like Llama 3 or Mistral that require accepting a license on huggingface.co first.
Q: What is the minimum VRAM needed to run vLLM?
A: 16 GB for a 7B model in BF16, 8 GB with AWQ 4-bit quantization. vLLM itself adds roughly 1–2 GB of overhead on top of model weights.
Q: Can vLLM serve multiple models at the same time?
A: Not from a single server instance — one vLLM process serves one model. Run separate containers on different ports, or use a router like LiteLLM (free to self-host) in front of multiple vLLM instances to present a unified API.
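A minimal client-side alternative to a full router is a lookup table mapping model IDs to instances (the port layout here is hypothetical; adjust to your deployment):

```python
# Map each model ID to the vLLM container that serves it (hypothetical ports).
BACKENDS = {
    "mistralai/Mistral-7B-Instruct-v0.3": "http://localhost:8000/v1",
    "meta-llama/Llama-3.3-70B-Instruct": "http://localhost:8001/v1",
}

def base_url_for(model: str) -> str:
    """Pick the vLLM instance that serves the requested model."""
    try:
        return BACKENDS[model]
    except KeyError:
        raise ValueError(f"no vLLM backend serves {model}") from None
```

Feed the returned URL into the OpenAI client's `base_url` exactly as in Step 6; each request then lands on the right container.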
Q: Does vLLM's OpenAI-compatible API support function calling and tool use?
A: Yes, for models that ship a tool-call chat template (Llama 3.x, Mistral, Qwen 2.5), provided the server is launched with `--enable-auto-tool-choice` and a matching `--tool-call-parser`. Pass tools and tool_choice in the request body exactly as you would to the OpenAI API.