Problem: Serving LLMs at Production Throughput with an OpenAI-Compatible API
vLLM production deployment gives you an OpenAI-compatible /v1/chat/completions endpoint backed by PagedAttention — the same technique that powers many hosted LLM APIs at scale. If you've hit throughput walls with Ollama or llama.cpp, or need to swap in a self-hosted model behind existing OpenAI SDK clients without touching application code, vLLM is the right tool.
You'll learn:
- Run vLLM as a Docker container with GPU passthrough on a single or multi-GPU machine
- Configure tensor parallelism, quantization (AWQ / GPTQ / FP8), and an API key for production use
- Point any OpenAI SDK client — Python, Node.js, or curl — at your vLLM server with zero code changes
Time: 25 min | Difficulty: Intermediate
Why vLLM Outperforms Naive LLM Serving
Standard inference servers reserve a contiguous KV-cache region per request, sized for the maximum sequence length, leaving GPU memory fragmented and throughput low. vLLM uses PagedAttention, which manages the KV cache like virtual memory pages — fixed-size blocks are allocated on demand and freed per request rather than reserved upfront. The result is near-zero KV-cache waste and continuous batching across concurrent requests.
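The paging idea can be sketched as a toy allocator in Python (illustrative only, not vLLM's actual implementation):

```python
# Toy paged KV-cache allocator (illustrative only): fixed-size blocks are
# handed out on demand and returned to a free pool when a request finishes.
class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size          # tokens per physical block
        self.free = list(range(num_blocks))   # free physical block IDs
        self.tables = {}                      # request_id -> list of block IDs
        self.lengths = {}                     # request_id -> tokens written

    def append_token(self, req: str) -> None:
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:          # current block full: grab a new page
            if not self.free:
                raise MemoryError("KV cache exhausted; request must wait")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req: str) -> None:      # request done: pages return to the pool
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)
```

Naive serving would instead reserve `max_seq_len // block_size` blocks per request at admission, whether or not the request ever uses them; here, unused pages stay in the pool for other concurrent requests.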
vLLM's request lifecycle: incoming prompts are batched, KV-cache pages are allocated dynamically, and responses are streamed back via the OpenAI-compatible `/v1` route.
Symptoms that send you to vLLM:
- Ollama throughput drops below 20 tok/s under concurrent load (3+ simultaneous users)
- You need a `/v1/chat/completions` endpoint that existing OpenAI SDK clients hit without modification
- Model VRAM exceeds a single GPU — you need tensor parallelism across 2 or 4 GPUs
Prerequisites
Before starting, confirm:
# Verify CUDA driver (needs ≥ 12.1 for vLLM 0.4+)
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
# Verify Docker with GPU support
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
Minimum specs:
- GPU: 16 GB VRAM for 7B models (FP16), 24 GB for 13B, 80 GB for 70B (or 2× 40 GB with tensor parallelism)
- CUDA driver: ≥ 525.85 (ships with CUDA 12.1+)
- Docker: 24.x with `nvidia-container-toolkit` installed
- Disk: 15–140 GB depending on model size
Solution
Step 1: Pull the Official vLLM Docker Image
vLLM publishes CUDA-matched images to avoid driver/library mismatches — the single most common source of startup crashes.
# Match your CUDA driver. For CUDA 12.4 hosts:
docker pull vllm/vllm-openai:latest
# Pin to a specific release for reproducible production deploys
docker pull vllm/vllm-openai:v0.6.4.post1
Expected output:
v0.6.4.post1: Pulling from vllm/vllm-openai
...
Status: Downloaded newer image for vllm/vllm-openai:v0.6.4.post1
If it fails:
- `docker: Error response from daemon: could not select device driver "nvidia"` → `sudo apt install nvidia-container-toolkit && sudo systemctl restart docker`
Step 2: Run vLLM with a Single GPU
Start with a 7B model to validate the stack before scaling.
docker run -d \
--name vllm-server \
--gpus all \
--ipc=host \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HUGGING_FACE_HUB_TOKEN="${HF_TOKEN}" \
vllm/vllm-openai:v0.6.4.post1 \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--dtype bfloat16 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 # Reserve 10% for CUDA ops — prevents OOM on long contexts
Flags explained:
- `--ipc=host` — Required for PyTorch shared memory between processes. Missing this causes silent hangs.
- `--gpu-memory-utilization 0.90` — vLLM pre-allocates this fraction of VRAM for the KV cache. Set lower (0.80) if you share the GPU.
- `--max-model-len 32768` — Caps the context window. Higher values consume more KV cache pages.
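A back-of-the-envelope way to see what `--gpu-memory-utilization` leaves for KV pages (a rough estimate; the 14 GB weight figure for Mistral 7B in BF16 is approximate, and activation overhead is ignored):

```python
def kv_cache_budget_gb(vram_gb: float, utilization: float, weights_gb: float) -> float:
    """Approximate VRAM left for KV-cache pages: vLLM pre-allocates
    utilization * total VRAM, and the model weights live inside that budget."""
    return utilization * vram_gb - weights_gb

# Mistral 7B in BF16 is roughly 14 GB of weights on a 24 GB GPU:
print(round(kv_cache_budget_gb(24, 0.90, 14.0), 1))  # ≈ 7.6 GB left for KV pages
```

If this number comes out near zero, lower `--max-model-len` or raise `--gpu-memory-utilization` cautiously.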
Expected output (docker logs vllm-server):
INFO: Started server process [1]
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Step 3: Add an API Key for Production Auth
By default the server is unauthenticated — anyone on the network can call it. Add a static key:
docker run -d \
--name vllm-server \
--gpus all \
--ipc=host \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HUGGING_FACE_HUB_TOKEN="${HF_TOKEN}" \
-e VLLM_API_KEY="sk-my-internal-key-change-this" \
vllm/vllm-openai:v0.6.4.post1 \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--api-key sk-my-internal-key-change-this \
--dtype bfloat16 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90
Any request missing Authorization: Bearer sk-my-internal-key-change-this now returns 401 Unauthorized.
Step 4: Scale to Multiple GPUs with Tensor Parallelism
For 70B models (or any model that doesn't fit on one GPU), split across GPUs:
docker run -d \
--name vllm-server-70b \
--gpus all \
--ipc=host \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HUGGING_FACE_HUB_TOKEN="${HF_TOKEN}" \
-e VLLM_API_KEY="sk-my-internal-key-change-this" \
vllm/vllm-openai:v0.6.4.post1 \
--model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 4 \
--dtype bfloat16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.92 \
--api-key sk-my-internal-key-change-this
--tensor-parallel-size must be a divisor of the model's attention head count. Llama 3.3 70B has 64 heads — valid values are 1, 2, 4, 8.
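A quick helper to enumerate valid values for a given model (a sketch; the attention head count comes from the model's `config.json`):

```python
def valid_tp_sizes(num_attention_heads: int, max_gpus: int = 8) -> list[int]:
    """GPU counts (up to max_gpus) that divide the attention head count evenly."""
    return [n for n in range(1, max_gpus + 1) if num_attention_heads % n == 0]

# Llama 3.3 70B has 64 attention heads:
print(valid_tp_sizes(64))  # [1, 2, 4, 8]
```

Passing an invalid size makes vLLM fail at startup, so checking the divisor rule before provisioning GPUs saves a restart cycle.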
Step 5: Enable Quantization to Reduce VRAM Usage
Run a 70B model on 2× A6000 (48 GB each) instead of 4× A100 by using AWQ 4-bit quantization:
docker run -d \
--name vllm-server-awq \
--gpus all \
--ipc=host \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HUGGING_FACE_HUB_TOKEN="${HF_TOKEN}" \
vllm/vllm-openai:v0.6.4.post1 \
--model casperhansen/llama-3.3-70b-instruct-awq \
--quantization awq \
--tensor-parallel-size 2 \
--dtype float16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.92
Two details to note: the model here is a pre-quantized AWQ checkpoint, and AWQ requires `--dtype float16`, not bfloat16.
Quantization options supported by vLLM:
| Method | VRAM reduction | Quality loss | Best for |
|---|---|---|---|
| `awq` (4-bit) | ~60% | Low | Production serving, throughput priority |
| `gptq` (4-bit) | ~60% | Low–medium | Widely quantized model checkpoints |
| `fp8` (8-bit) | ~30% | Minimal | H100 / H200 — uses native FP8 tensor cores |
| None (BF16) | 0% | None | A100/H100 with full VRAM budget |
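Weight footprints under each option can be estimated from parameter count times bits per weight (weights only; the KV cache and the 1–2 GB runtime overhead come on top):

```python
def weight_vram_gb(params_b: float, bits: int) -> float:
    """Approximate weight footprint in GB: parameters (billions) * bits / 8."""
    return params_b * 1e9 * bits / 8 / 1e9

for label, bits in [("bf16", 16), ("fp8", 8), ("awq 4-bit", 4)]:
    print(f"70B @ {label}: ~{weight_vram_gb(70, bits):.0f} GB")
```

This is why a 70B model needs 4× 40 GB in BF16 (~140 GB of weights) but fits on 2× 48 GB with AWQ (~35 GB of weights plus KV cache).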
Step 6: Call the Server from Your Application
The server is OpenAI API-compatible. Swap base_url in any existing client:
Python (openai SDK):
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1", # Point at vLLM instead of api.openai.com
api_key="sk-my-internal-key-change-this",
)
response = client.chat.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.3", # Must match --model arg exactly
messages=[{"role": "user", "content": "Explain PagedAttention in one paragraph."}],
temperature=0.7,
max_tokens=512,
stream=True,
)
for chunk in response:
print(chunk.choices[0].delta.content or "", end="", flush=True)
Node.js (openai SDK):
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:8000/v1",
apiKey: "sk-my-internal-key-change-this",
});
const stream = await client.chat.completions.create({
model: "mistralai/Mistral-7B-Instruct-v0.3",
messages: [{ role: "user", content: "Summarize the vLLM paper." }],
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
curl (quick smoke test):
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-my-internal-key-change-this" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 64
}'
Verification
# Check server health
curl http://localhost:8000/health
# List loaded models
curl http://localhost:8000/v1/models \
-H "Authorization: Bearer sk-my-internal-key-change-this"
# Check live GPU utilization while sending requests
watch -n 1 nvidia-smi
You should see:
- `/health` → `{"status":"ok"}`
- `/v1/models` → JSON with your model ID
- `nvidia-smi` → GPU utilization spiking to 80–99% during inference, not sitting at 0%
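The same checks can be scripted (a minimal sketch using only the standard library; the URL, key, and model name are the ones used in this guide and must match your deployment):

```python
import json
import urllib.request

def is_healthy(health_status: int, models_body: str, expected_model: str) -> bool:
    """True when /health returned 200 and /v1/models lists our model."""
    if health_status != 200:
        return False
    ids = [m["id"] for m in json.loads(models_body).get("data", [])]
    return expected_model in ids

if __name__ == "__main__":
    health = urllib.request.urlopen("http://localhost:8000/health").status
    req = urllib.request.Request(
        "http://localhost:8000/v1/models",
        headers={"Authorization": "Bearer sk-my-internal-key-change-this"},
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read().decode()
    print(is_healthy(health, body, "mistralai/Mistral-7B-Instruct-v0.3"))
```

Run it on a cron or as a container healthcheck to catch a crashed or wrongly configured server before clients do.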
Benchmark throughput (optional):
# vLLM ships a benchmark script — run inside the container
docker exec vllm-server python3 /app/benchmarks/benchmark_serving.py \
--backend vllm \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--num-prompts 200 \
--request-rate 10 # 10 requests/sec simulated concurrency
Expected: 400–900 tok/s on a single A10G (24 GB) for Mistral 7B FP16.
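If you'd rather not run the script inside the container, a rough client-side load test looks like this (a sketch assuming the Step 3 server with its API key is running on localhost:8000; numbers will be lower than the dedicated benchmark because this measures end-to-end latency):

```python
# Rough concurrent load test against the vLLM server (sketch).
import concurrent.futures
import json
import time
import urllib.request

URL = "http://localhost:8000/v1/chat/completions"
KEY = "sk-my-internal-key-change-this"

def throughput_tok_s(total_tokens: int, elapsed_s: float) -> float:
    """Aggregate generation throughput across all concurrent requests."""
    return total_tokens / elapsed_s

def completion_tokens(prompt: str) -> int:
    body = json.dumps({
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }).encode()
    req = urllib.request.Request(URL, data=body, headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {KEY}",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["usage"]["completion_tokens"]

if __name__ == "__main__":
    start = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        total = sum(pool.map(completion_tokens, ["Explain paging."] * 32))
    print(f"{throughput_tok_s(total, time.time() - start):.0f} tok/s")
```

Because vLLM batches continuously, aggregate tok/s should rise sharply as `max_workers` grows, which is exactly the behavior Ollama lacks.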
Production Docker Compose Setup
For persistent deployments with automatic restarts, use Compose:
# docker-compose.yml
services:
vllm:
image: vllm/vllm-openai:v0.6.4.post1
container_name: vllm-server
restart: unless-stopped
ipc: host
ports:
- "8000:8000"
volumes:
- huggingface_cache:/root/.cache/huggingface
environment:
HUGGING_FACE_HUB_TOKEN: ${HF_TOKEN}
VLLM_API_KEY: ${VLLM_API_KEY}
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
command: >
--model mistralai/Mistral-7B-Instruct-v0.3
--dtype bfloat16
--max-model-len 32768
--gpu-memory-utilization 0.90
--api-key ${VLLM_API_KEY}
volumes:
huggingface_cache:
# .env
HF_TOKEN=hf_your_token_here
VLLM_API_KEY=sk-my-internal-key-change-this
docker compose up -d
docker compose logs -f vllm
What You Learned
- vLLM's PagedAttention eliminates KV-cache fragmentation — this is why it outperforms naive serving under concurrent load
- `--tensor-parallel-size` splits model weights across GPUs; it must divide the model's attention head count evenly
- AWQ 4-bit quantization reduces VRAM by ~60% with minimal quality loss, making 70B models viable on 2× 48 GB GPUs
- The vLLM server is a drop-in OpenAI API replacement — `base_url` is the only change required in existing clients
- Don't use vLLM for single-user local inference with no concurrency — Ollama has lower overhead and simpler setup for that use case
Tested on vLLM v0.6.4.post1, CUDA 12.4, Python 3.12, Docker 27.x, Ubuntu 22.04 LTS
FAQ
Q: What is the difference between vLLM and Ollama for production use?
A: Ollama is optimized for single-user local inference with a simple CLI. vLLM is built for multi-user concurrent serving — it batches requests automatically and achieves 5–20× higher throughput under load. Use Ollama for local development, vLLM when you need an API that serves multiple users simultaneously.
Q: Does vLLM work without a Hugging Face token?
A: Yes, for public models. You only need `HF_TOKEN` for gated models like Llama 3 or Mistral that require accepting a license on huggingface.co first.
Q: What is the minimum VRAM needed to run vLLM?
A: 16 GB for a 7B model in BF16, 8 GB with AWQ 4-bit quantization. vLLM itself adds roughly 1–2 GB of overhead on top of model weights.
Q: Can vLLM serve multiple models at the same time?
A: Not from a single server instance — one vLLM process serves one model. Run separate containers on different ports, or use a router like LiteLLM (free to self-host) in front of multiple vLLM instances to present a unified API.
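A minimal client-side alternative to a full router is a lookup table mapping model IDs to instances (the port layout here is hypothetical; adjust to your deployment):

```python
# Map each model ID to the vLLM container that serves it (hypothetical ports).
BACKENDS = {
    "mistralai/Mistral-7B-Instruct-v0.3": "http://localhost:8000/v1",
    "meta-llama/Llama-3.3-70B-Instruct": "http://localhost:8001/v1",
}

def base_url_for(model: str) -> str:
    """Pick the vLLM instance that serves the requested model."""
    try:
        return BACKENDS[model]
    except KeyError:
        raise ValueError(f"no vLLM backend serves {model}") from None
```

Feed the returned URL into the OpenAI client's `base_url` exactly as in Step 6; each request then lands on the right container.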
Q: Does vLLM's OpenAI-compatible API support function calling and tool use?
A: Yes, for models that ship a tool-call chat template (Llama 3.x, Mistral, Qwen 2.5), provided the server is launched with `--enable-auto-tool-choice` and a matching `--tool-call-parser`. Pass tools and tool_choice in the request body exactly as you would to the OpenAI API.