Serve Local LLMs via OpenAI API in 15 Minutes

Replace OpenAI API calls with local models using Ollama, vLLM, or llama.cpp. Same code, zero changes, faster iteration.

You want to run language models locally but keep using OpenAI's API format for your existing code. Here's how to set up a drop-in replacement that works with your current applications.

You'll learn:

  • Why local model serving beats cloud APIs for development
  • How to set up an OpenAI-compatible endpoint in under 15 minutes
  • Which tools work best for different hardware setups

Time: 15 min | Level: Intermediate


Problem: Cloud APIs Are Slow and Expensive for Development

You're building with OpenAI's API, but every request costs money and takes 2-5 seconds. You want to iterate faster with local models while keeping your existing code unchanged.

Common needs:

  • Test prompt changes without API costs
  • Work offline or with sensitive data
  • Faster iteration during development
  • Full control over model versions

Why OpenAI-Compatible APIs Matter

Most LLM tools expect OpenAI's request format. An OpenAI-compatible server lets you swap https://api.openai.com for http://localhost:8000 without changing your application code.

What "compatible" means:

  • Same /v1/chat/completions endpoint
  • Same JSON request/response structure
  • Works with OpenAI SDKs and libraries
  • Drop-in replacement for existing code
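
In practice, "same JSON structure" means every server in this guide accepts the request body sketched below. The `chat_payload` helper is our own illustration, not part of any SDK:

```python
import json

def chat_payload(model: str, user_message: str, max_tokens: int = 100) -> dict:
    """Build a minimal OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }

# The same body works against OpenAI, Ollama, vLLM, and llama.cpp
body = json.dumps(chat_payload("llama3.2", "Say hello"))
```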

Solution

Step 1: Choose Your Serving Tool

Pick based on your hardware:

For NVIDIA GPUs (8GB+ VRAM):

# vLLM - fastest for batch inference
# (install inside a virtual environment rather than forcing --break-system-packages)
python3 -m venv vllm-env && source vllm-env/bin/activate
pip install vllm

For Apple Silicon (M1/M2/M3):

# llama.cpp server - optimized for Metal
brew install llama.cpp

For CPU-only or small GPUs:

# Ollama - easiest setup, good performance
curl -fsSL https://ollama.com/install.sh | sh

Expected: Installation completes in 2-5 minutes depending on your connection.
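
The decision above can be sketched as a tiny heuristic. Assume `suggest_tool` and the `nvidia-smi` check are illustrative conventions, not part of any of these tools:

```python
import platform
import shutil

def suggest_tool(has_nvidia_gpu: bool, machine: str = "") -> str:
    """Mirror the guide's advice: vLLM for NVIDIA GPUs,
    llama.cpp for Apple Silicon, Ollama everywhere else."""
    machine = machine or platform.machine()
    if has_nvidia_gpu:
        return "vllm"
    if platform.system() == "Darwin" and machine == "arm64":
        return "llama.cpp"
    return "ollama"

# Having nvidia-smi on PATH is a cheap proxy for an NVIDIA GPU
print(suggest_tool(shutil.which("nvidia-smi") is not None))
```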


Step 2: Download a Model

Using Ollama (recommended for beginners):

# Pull a capable 3B model (~2GB download)
ollama pull llama3.2

# Verify it works
ollama run llama3.2 "Say hello"

Using llama.cpp:

# Download GGUF format model
cd ~/models
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf

Using vLLM:

# Models download automatically on first run
# vLLM uses HuggingFace format directly

If download fails:

  • Timeout errors: Use a download manager like aria2c for large files
  • Disk space: 7B models need 4-8GB, 13B models need 8-16GB
  • HuggingFace auth: Some models require huggingface-cli login first

Step 3: Start the OpenAI-Compatible Server

Ollama (automatic OpenAI endpoint):

# The server starts automatically with Ollama
# OpenAI-compatible endpoint: http://localhost:11434/v1

llama.cpp server:

llama-server \
  --model ~/models/llama-2-7b.Q4_K_M.gguf \
  --port 8080 \
  --ctx-size 4096 \
  --n-gpu-layers 35  # Adjust based on your VRAM

vLLM:

# Recent vLLM releases also accept the shorter: vllm serve <model>
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-hf \
  --port 8000 \
  --dtype float16

Expected: The server is ready in 10-30 seconds. vLLM prints Uvicorn running on http://0.0.0.0:8000; llama.cpp logs that its HTTP server is listening on the chosen port.
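
Rather than eyeballing the log, a script can poll the models endpoint until the server answers. A sketch (the `wait_for_server` helper is our own, built on the standard library):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str, attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll GET /v1/models until the server responds or we give up."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=2):
                return True
        except (urllib.error.URLError, OSError):
            time.sleep(delay)
    return False

# e.g. wait_for_server("http://localhost:8000") before sending requests
```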


Step 4: Test the API

# Check server health (swap the port: 11434 for Ollama, 8080 for llama.cpp)
curl http://localhost:8000/v1/models

# Make a chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Explain recursion briefly"}],
    "max_tokens": 100
  }'

You should see: JSON response with model output in choices[0].message.content
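
If you're scripting this check, the reply text always sits at the same path in the response body. A sketch against a hand-written, abbreviated sample payload:

```python
def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response body."""
    return response["choices"][0]["message"]["content"]

# Abbreviated sample of what the server returns
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "Recursion is ..."},
         "finish_reason": "stop"}
    ]
}
print(extract_reply(sample))  # → Recursion is ...
```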

If it fails:

  • Connection refused: The server isn't running or you're hitting the wrong port; check the terminal for errors
  • Model not found: Use the exact model name from Step 2 (run ollama list to verify)
  • Out of memory: Reduce --ctx-size or use a smaller quantization (Q4_K_M → Q3_K_M)

Step 5: Update Your Application Code

Python (OpenAI SDK):

from openai import OpenAI

# Point to local server instead of OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Local servers ignore this
)

response = client.chat.completions.create(
    model="llama3.2",  # Use your local model name
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
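
To keep one code path for both backends, the base URL and key can come from the environment. A sketch (`LLM_BASE_URL` is our own convention, not an OpenAI SDK variable):

```python
import os

def client_config() -> dict:
    """Pick local vs. cloud settings from environment variables."""
    return {
        # Defaults to the local server; set LLM_BASE_URL to
        # https://api.openai.com/v1 to go back to the cloud
        "base_url": os.environ.get("LLM_BASE_URL", "http://localhost:8000/v1"),
        # Local servers ignore the key; OpenAI requires a real one
        "api_key": os.environ.get("OPENAI_API_KEY", "not-needed"),
    }

# client = OpenAI(**client_config())
```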

JavaScript/TypeScript:

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8000/v1',
  apiKey: 'not-needed'
});

const response = await client.chat.completions.create({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Hello' }]
});

cURL (for testing):

# Save as test.sh for quick checks
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Test"}]}'

Why this works: OpenAI SDKs are just HTTP clients. Changing base_url redirects all requests to your local server with no other code changes needed.


Verification

Test streaming responses:

stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='')

You should see: Numbers appear one at a time, confirming streaming works.
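
Under the hood, each streamed chunk carries a small delta; assembling them yourself (here against a hand-built fake stream) shows why the loop above concatenates `delta.content`:

```python
def assemble(chunks: list) -> str:
    """Concatenate the content fragments from streaming chunks."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):
            parts.append(delta["content"])
    return "".join(parts)

# Fake stream: the first chunk sets the role, the last carries no content
fake_stream = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "1 2 "}}]},
    {"choices": [{"delta": {"content": "3 4 5"}}]},
    {"choices": [{"delta": {}}]},
]
print(assemble(fake_stream))  # → 1 2 3 4 5
```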

Benchmark performance:

time curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 50}'

Expected: First response in 0.5-2 seconds (depending on hardware), subsequent requests faster due to prompt caching.


What You Learned

  • OpenAI-compatible servers let you swap cloud APIs for local models with zero code changes
  • Ollama is easiest for getting started; vLLM is fastest for production workloads
  • Same endpoints and SDKs work with both OpenAI and local servers

Limitations:

  • Local models (7-13B) are less capable than GPT-4 but fine for development and testing
  • Function calling support varies by serving tool (vLLM has best compatibility)
  • GPU memory limits model size (4GB VRAM → max 7B models with quantization)

When NOT to use this:

  • Production apps needing GPT-4 quality (use cloud APIs)
  • Shared team environments (consider hosted solutions like Together AI instead)
  • Mobile apps (models are too large, use API calls)

Quick Reference

Tool      | Best For                    | GPU Support            | Setup Time
Ollama    | Beginners, Mac users        | NVIDIA, Metal          | 5 min
vLLM      | High throughput, batch jobs | NVIDIA only            | 10 min
llama.cpp | Low resource usage, CPU     | All (CPU, CUDA, Metal) | 8 min

Port defaults:

  • Ollama: 11434
  • vLLM: 8000
  • llama.cpp: 8080 (configurable)

Model size guide:

  • 7B Q4: ~4GB VRAM, good quality
  • 13B Q4: ~8GB VRAM, better reasoning
  • 34B Q4: ~20GB VRAM, approaches GPT-3.5
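
The sizes above follow a simple rule of thumb: parameter count times bits per weight, with Q4_K_M averaging roughly 4.5 bits. A sketch (the ~4.5-bit figure is an approximation, and the KV cache adds 1-2 GB on top):

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Estimate on-disk/VRAM size of quantized weights in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for size in (7, 13, 34):
    print(f"{size}B Q4 ≈ {quantized_size_gb(size):.1f} GB")  # ≈ 3.9, 7.3, 19.1
```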

Tested on vLLM 0.6.3, Ollama 0.5.2, llama.cpp b3950, NVIDIA RTX 4090 & Apple M3 Max