You want to run language models locally but keep using OpenAI's API format for your existing code. Here's how to set up a drop-in replacement that works with your current applications.
You'll learn:
- Why local model serving beats cloud APIs for development
- How to set up an OpenAI-compatible endpoint in under 15 minutes
- Which tools work best for different hardware setups
Time: 15 min | Level: Intermediate
Problem: Cloud APIs Are Slow and Expensive for Development
You're building with OpenAI's API, but every request costs money and takes 2-5 seconds. You want to iterate faster with local models while keeping your existing code unchanged.
Common needs:
- Test prompt changes without API costs
- Work offline or with sensitive data
- Faster iteration during development
- Full control over model versions
Why OpenAI-Compatible APIs Matter
Most LLM tools expect OpenAI's request format. An OpenAI-compatible server lets you swap `https://api.openai.com` for `http://localhost:8000` without changing your application code.
What "compatible" means:
- Same `/v1/chat/completions` endpoint
- Same JSON request/response structure
- Works with OpenAI SDKs and libraries
- Drop-in replacement for existing code
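In practice, "same structure" looks like this (a minimal sketch; real responses carry extra metadata such as `id`, `usage`, and `finish_reason`, and the reply text below is illustrative):

```python
# Core request/response shape shared by OpenAI and local servers.
request = {
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello"}],
}

# A response nests the reply the same way on both:
response = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hi there!"}}
    ],
}

# Client code reads the reply identically regardless of which server answered.
reply = response["choices"][0]["message"]["content"]
print(reply)
```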
Solution
Step 1: Choose Your Serving Tool
Pick based on your hardware:
For NVIDIA GPUs (8GB+ VRAM):
# vLLM - fastest for batch inference
pip install vllm --break-system-packages
For Apple Silicon (M1/M2/M3):
# llama.cpp server - optimized for Metal
brew install llama.cpp
For CPU-only or small GPUs:
# Ollama - easiest setup, good performance
curl -fsSL https://ollama.com/install.sh | sh
Expected: Installation completes in 2-5 minutes depending on your connection.
Step 2: Download a Model
Using Ollama (recommended for beginners):
# Pull a capable small model (the default llama3.2 tag is 3B, ~2GB download)
ollama pull llama3.2
# Verify it works
ollama run llama3.2 "Say hello"
Using llama.cpp:
# Download GGUF format model
cd ~/models
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf
Using vLLM:
# Models download automatically on first run
# vLLM uses HuggingFace format directly
If download fails:
- Timeout errors: Use a download manager like `aria2c` for large files
- Disk space: 7B models need 4-8GB, 13B models need 8-16GB
- HuggingFace auth: Some models require `huggingface-cli login` first
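The disk-space check is easy to automate before a long download. A small sketch using the rough sizes from this guide (the 4GB figure for a 7B Q4 model is an approximation):

```python
import shutil

def has_space_for(path: str, needed_gb: float) -> bool:
    """Check that the filesystem containing `path` has at least `needed_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb

# e.g. before pulling a 7B Q4 model (~4GB per the note above):
if not has_space_for(".", 4):
    print("Not enough disk space for a 7B Q4 model")
```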
Step 3: Start the OpenAI-Compatible Server
Ollama (automatic OpenAI endpoint):
# Server starts automatically with ollama
# OpenAI-compatible endpoint at http://localhost:11434/v1
llama.cpp server:
llama-server \
--model ~/models/llama-2-7b.Q4_K_M.gguf \
--port 8080 \
--ctx-size 4096 \
--n-gpu-layers 35 # Adjust based on your VRAM
vLLM:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-hf \
--port 8000 \
--dtype float16
Expected: The server starts in 10-30 seconds; vLLM, for example, prints `Uvicorn running on http://0.0.0.0:8000`
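Scripts that launch the server and immediately send requests can race the model load. A small readiness poll fixes that (a sketch using only the standard library; adjust the port for your tool: 11434 for Ollama, 8080 for llama.cpp):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll GET {base_url}/v1/models until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after the interval
        time.sleep(interval)
    return False

# wait_for_server("http://localhost:8000")  # returns True once the server is ready
```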
Step 4: Test the API
# Check server health (examples use port 8000; swap in 11434 for Ollama, 8080 for llama.cpp)
curl http://localhost:8000/v1/models
# Make a chat completion request
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "Explain recursion briefly"}],
"max_tokens": 100
}'
You should see: A JSON response with the model output in `choices[0].message.content`
If it fails:
- Connection refused: Server isn't running, check the Terminal for errors
- Model not found: Use the exact model name from Step 2 (`ollama list` to verify)
- Out of memory: Reduce `--ctx-size` or use a smaller quantization (Q4 → Q3)
Step 5: Update Your Application Code
Python (OpenAI SDK):
from openai import OpenAI
# Point to local server instead of OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed" # Local servers ignore this
)
response = client.chat.completions.create(
model="llama3.2", # Use your local model name
messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
JavaScript/TypeScript:
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:8000/v1',
apiKey: 'not-needed'
});
const response = await client.chat.completions.create({
model: 'llama3.2',
messages: [{ role: 'user', content: 'Hello' }]
});
cURL (for testing):
# Save as test.sh for quick checks
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Test"}]}'
Why this works: OpenAI SDKs are just HTTP clients. Changing `base_url` redirects all requests to your local server with no other code changes needed.
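To see that there is no magic in the SDK, here is the same call made with only the standard library (a sketch; error handling and the `api_key` header, which local servers ignore, are omitted):

```python
import json
import urllib.request

def chat(base_url: str, model: str, content: str) -> str:
    """POST a chat completion and return the assistant's reply text."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    return data["choices"][0]["message"]["content"]

# chat("http://localhost:8000", "llama3.2", "Hello")
```

Swapping between OpenAI and a local server is literally just changing the `base_url` string passed in.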
Verification
Test streaming responses:
stream = client.chat.completions.create(
model="llama3.2",
messages=[{"role": "user", "content": "Count to 5"}],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end='')
You should see: Numbers appear one at a time, confirming streaming works.
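Under the hood, streamed responses arrive as Server-Sent Events: one `data: {json}` line per chunk, terminated by `data: [DONE]`. A minimal parser over raw SSE lines (the sample payloads below are illustrative, not captured server output):

```python
import json

def collect_stream(sse_lines):
    """Concatenate delta content from the raw 'data:' lines of a chat stream."""
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):  # first chunk often carries only the role
            parts.append(delta["content"])
    return "".join(parts)

sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "1 "}}]}',
    'data: {"choices": [{"delta": {"content": "2"}}]}',
    'data: [DONE]',
]
print(collect_stream(sample))  # 1 2
```

This is exactly what the SDK's `stream=True` iterator does for you.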
Benchmark performance:
time curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 50}'
Expected: First response in 0.5-2 seconds (depending on hardware), subsequent requests faster due to prompt caching.
What You Learned
- OpenAI-compatible servers let you swap cloud APIs for local models with zero code changes
- Ollama is easiest for getting started, vLLM is fastest for production workloads
- Same endpoints and SDKs work with both OpenAI and local servers
Limitations:
- Local models (7-13B) are less capable than GPT-4 but are good enough for development/testing
- Function calling support varies by serving tool (vLLM has best compatibility)
- GPU memory limits model size (4GB VRAM → max 7B models with quantization)
When NOT to use this:
- Production apps needing GPT-4 quality (use cloud APIs)
- Shared team environments (consider hosted solutions like Together AI instead)
- Mobile apps (models are too large, use API calls)
Quick Reference
| Tool | Best For | GPU Support | Setup Time |
|---|---|---|---|
| Ollama | Beginners, Mac users | NVIDIA, Metal | 5 min |
| vLLM | High throughput, batch jobs | NVIDIA only | 10 min |
| llama.cpp | Low resource usage, CPU | All (CPU, CUDA, Metal) | 8 min |
Port defaults:
- Ollama: `11434`
- vLLM: `8000`
- llama.cpp: `8080` (configurable)
Model size guide:
- 7B Q4: ~4GB VRAM, good quality
- 13B Q4: ~8GB VRAM, better reasoning
- 34B Q4: ~20GB VRAM, approaches GPT-3.5
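The sizes above follow from simple arithmetic: memory ≈ parameter count × bits per weight ÷ 8, plus overhead for the KV cache and activations. A rough estimator (the 4.5 bits/weight for Q4_K_M quantization and the 20% overhead factor are approximations, not exact figures):

```python
def approx_vram_gb(params_billions: float,
                   bits_per_weight: float = 4.5,
                   overhead: float = 1.2) -> float:
    """Rough VRAM estimate: quantized weight bytes plus ~20% for KV cache."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

for size in (7, 13, 34):
    print(f"{size}B Q4: ~{approx_vram_gb(size)} GB")
```

The estimates land close to the guide's figures; longer context windows grow the KV cache and push real usage higher.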
Tested on vLLM 0.6.3, Ollama 0.5.2, llama.cpp b3950, NVIDIA RTX 4090 & Apple M3 Max