You want to run language models locally but keep using OpenAI's API format for your existing code. Here's how to set up a drop-in replacement that works with your current applications.
You'll learn:
- Why local model serving beats cloud APIs for development
- How to set up an OpenAI-compatible endpoint in under 15 minutes
- Which tools work best for different hardware setups
Time: 15 min | Level: Intermediate
Problem: Cloud APIs Are Slow and Expensive for Development
You're building with OpenAI's API, but every request costs money and takes 2-5 seconds. You want to iterate faster with local models while keeping your existing code unchanged.
Common needs:
- Test prompt changes without API costs
- Work offline or with sensitive data
- Faster iteration during development
- Full control over model versions
Why OpenAI-Compatible APIs Matter
Most LLM tools expect OpenAI's request format. An OpenAI-compatible server lets you swap `https://api.openai.com` for `http://localhost:8000` without changing your application code.
What "compatible" means:
- Same `/v1/chat/completions` endpoint
- Same JSON request/response structure
- Works with OpenAI SDKs and libraries
- Drop-in replacement for existing code
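In practice, "same structure" looks like this (a minimal sketch; real responses carry extra metadata such as `id`, `usage`, and `finish_reason`, and the reply text below is illustrative):

```python
# Core request/response shape shared by OpenAI and local servers.
request = {
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello"}],
}

# A response nests the reply the same way on both:
response = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hi there!"}}
    ],
}

# Client code reads the reply identically regardless of which server answered.
reply = response["choices"][0]["message"]["content"]
print(reply)
```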
Solution
Step 1: Choose Your Serving Tool
Pick based on your hardware:
For NVIDIA GPUs (8GB+ VRAM):
# vLLM - fastest for batch inference
pip install vllm --break-system-packages
For Apple Silicon (M1/M2/M3):
# llama.cpp server - optimized for Metal
brew install llama.cpp
For CPU-only or small GPUs:
# Ollama - easiest setup, good performance
curl -fsSL https://ollama.com/install.sh | sh
Expected: Installation completes in 2-5 minutes depending on your connection.
Step 2: Download a Model
Using Ollama (recommended for beginners):
# Pull a capable small model (the default llama3.2 tag is 3B, ~2GB download)
ollama pull llama3.2
# Verify it works
ollama run llama3.2 "Say hello"
Using llama.cpp:
# Download GGUF format model
cd ~/models
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf
Using vLLM:
# Models download automatically on first run
# vLLM uses HuggingFace format directly
If download fails:
- Timeout errors: Use a download manager like `aria2c` for large files
- Disk space: 7B models need 4-8GB, 13B models need 8-16GB
- HuggingFace auth: Some models require `huggingface-cli login` first
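The disk-space check is easy to automate before a long download. A small sketch using the rough sizes from this guide (the 4GB figure for a 7B Q4 model is an approximation):

```python
import shutil

def has_space_for(path: str, needed_gb: float) -> bool:
    """Check that the filesystem containing `path` has at least `needed_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb

# e.g. before pulling a 7B Q4 model (~4GB per the note above):
if not has_space_for(".", 4):
    print("Not enough disk space for a 7B Q4 model")
```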
Step 3: Start the OpenAI-Compatible Server
Ollama (automatic OpenAI endpoint):
# Server starts automatically with ollama
# OpenAI-compatible endpoint at http://localhost:11434/v1
llama.cpp server:
llama-server \
--model ~/models/llama-2-7b.Q4_K_M.gguf \
--port 8080 \
--ctx-size 4096 \
--n-gpu-layers 35 # Adjust based on your VRAM
vLLM:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-hf \
--port 8000 \
--dtype float16
Expected: The server starts in 10-30 seconds; vLLM, for example, prints `Uvicorn running on http://0.0.0.0:8000`
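Scripts that launch the server and immediately send requests can race the model load. A small readiness poll fixes that (a sketch using only the standard library; adjust the port for your tool: 11434 for Ollama, 8080 for llama.cpp):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll GET {base_url}/v1/models until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after the interval
        time.sleep(interval)
    return False

# wait_for_server("http://localhost:8000")  # returns True once the server is ready
```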
Step 4: Test the API
# Check server health (examples use port 8000; swap in 11434 for Ollama, 8080 for llama.cpp)
curl http://localhost:8000/v1/models
# Make a chat completion request
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "Explain recursion briefly"}],
"max_tokens": 100
}'
You should see: A JSON response with the model output in `choices[0].message.content`
If it fails:
- Connection refused: Server isn't running, check the Terminal for errors
- Model not found: Use the exact model name from Step 2 (`ollama list` to verify)
- Out of memory: Reduce `--ctx-size` or use a smaller quantization (Q4 → Q3)
Step 5: Update Your Application Code
Python (OpenAI SDK):
from openai import OpenAI
# Point to local server instead of OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed" # Local servers ignore this
)
response = client.chat.completions.create(
model="llama3.2", # Use your local model name
messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
JavaScript/TypeScript:
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:8000/v1',
apiKey: 'not-needed'
});
const response = await client.chat.completions.create({
model: 'llama3.2',
messages: [{ role: 'user', content: 'Hello' }]
});
cURL (for testing):
# Save as test.sh for quick checks
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Test"}]}'
Why this works: OpenAI SDKs are just HTTP clients. Changing `base_url` redirects all requests to your local server with no other code changes needed.
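To see that there is no magic in the SDK, here is the same call made with only the standard library (a sketch; error handling and the `api_key` header, which local servers ignore, are omitted):

```python
import json
import urllib.request

def chat(base_url: str, model: str, content: str) -> str:
    """POST a chat completion and return the assistant's reply text."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    return data["choices"][0]["message"]["content"]

# chat("http://localhost:8000", "llama3.2", "Hello")
```

Swapping between OpenAI and a local server is literally just changing the `base_url` string passed in.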
Verification
Test streaming responses:
stream = client.chat.completions.create(
model="llama3.2",
messages=[{"role": "user", "content": "Count to 5"}],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end='')
You should see: Numbers appear one at a time, confirming streaming works.
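Under the hood, streamed responses arrive as Server-Sent Events: one `data: {json}` line per chunk, terminated by `data: [DONE]`. A minimal parser over raw SSE lines (the sample payloads below are illustrative, not captured server output):

```python
import json

def collect_stream(sse_lines):
    """Concatenate delta content from the raw 'data:' lines of a chat stream."""
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):  # first chunk often carries only the role
            parts.append(delta["content"])
    return "".join(parts)

sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "1 "}}]}',
    'data: {"choices": [{"delta": {"content": "2"}}]}',
    'data: [DONE]',
]
print(collect_stream(sample))  # 1 2
```

This is exactly what the SDK's `stream=True` iterator does for you.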
Benchmark performance:
time curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 50}'
Expected: First response in 0.5-2 seconds (depending on hardware), subsequent requests faster due to prompt caching.
What You Learned
- OpenAI-compatible servers let you swap cloud APIs for local models with zero code changes
- Ollama is easiest for getting started, vLLM is fastest for production workloads
- Same endpoints and SDKs work with both OpenAI and local servers
Limitations:
- Local models (7-13B) are less capable than GPT-4 but are good enough for development/testing
- Function calling support varies by serving tool (vLLM has best compatibility)
- GPU memory limits model size (4GB VRAM → max 7B models with quantization)
When NOT to use this:
- Production apps needing GPT-4 quality (use cloud APIs)
- Shared team environments (consider hosted solutions like Together AI instead)
- Mobile apps (models are too large, use API calls)
Quick Reference
| Tool | Best For | GPU Support | Setup Time |
|---|---|---|---|
| Ollama | Beginners, Mac users | NVIDIA, Metal | 5 min |
| vLLM | High throughput, batch jobs | NVIDIA only | 10 min |
| llama.cpp | Low resource usage, CPU | All (CPU, CUDA, Metal) | 8 min |
Port defaults:
- Ollama: `11434`
- vLLM: `8000`
- llama.cpp: `8080` (configurable)
Model size guide:
- 7B Q4: ~4GB VRAM, good quality
- 13B Q4: ~8GB VRAM, better reasoning
- 34B Q4: ~20GB VRAM, approaches GPT-3.5
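The sizes above follow from simple arithmetic: memory ≈ parameter count × bits per weight ÷ 8, plus overhead for the KV cache and activations. A rough estimator (the 4.5 bits/weight for Q4_K_M quantization and the 20% overhead factor are approximations, not exact figures):

```python
def approx_vram_gb(params_billions: float,
                   bits_per_weight: float = 4.5,
                   overhead: float = 1.2) -> float:
    """Rough VRAM estimate: quantized weight bytes plus ~20% for KV cache."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

for size in (7, 13, 34):
    print(f"{size}B Q4: ~{approx_vram_gb(size)} GB")
```

The estimates land close to the guide's figures; longer context windows grow the KV cache and push real usage higher.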
Tested on vLLM 0.6.3, Ollama 0.5.2, llama.cpp b3950, NVIDIA RTX 4090 & Apple M3 Max