Problem: OpenClaw API Costs Add Up Fast
You've set up OpenClaw with Claude or GPT-4 API and watched costs climb as your AI agent runs commands, edits files, and automates tasks. At $15-30 per million tokens, a productive week can cost $50-100.
You'll learn:
- How to run Llama 4 Scout locally with vLLM for zero API costs
- Connect OpenClaw to your local model in 5 minutes
- When local deployment beats cloud APIs
Time: 20 min | Level: Intermediate
Why This Matters
OpenClaw is an autonomous AI agent that executes Terminal commands and manipulates files. Cloud APIs work great but create two problems: recurring costs that scale with usage, and sensitive data leaving your infrastructure.
Common triggers for going local:
- Processing proprietary code or internal documentation
- Running continuous automation tasks (monitoring, testing, deployments)
- Monthly API bills exceeding $200
- Regulatory requirements for data residency
Prerequisites
Hardware requirements:
- NVIDIA GPU with 24GB+ VRAM (RTX 4090, A6000, or better)
- 32GB+ system RAM
- 100GB free storage for model weights
Software:
Alternative: Use cloud GPU providers like RunPod ($0.40/hr for 4x RTX 4090) if you lack local hardware.
Solution
Step 1: Install vLLM
vLLM is the production-grade inference engine that makes Llama 4 actually usable. It handles efficient memory management through PagedAttention and supports multi-GPU parallelism.
# Create a virtual environment
python3 -m venv vllm-env
source vllm-env/bin/activate
# Install vLLM (requires CUDA)
pip install vllm --break-system-packages
Expected: Installation completes in 2-3 minutes. You'll see CUDA compatibility checks during install.
If it fails:
- Error: "CUDA not found": Install NVIDIA drivers and CUDA toolkit 12.1+
- Error: "Unsupported GPU": vLLM requires compute capability 7.0+ (Volta architecture or newer)
Step 2: Download and Serve Llama 4 Scout
Llama 4 Scout is a 17B parameter model (16 experts, only ~17B active) optimized for coding and agent tasks. It fits on a single 24GB GPU with FP8 quantization.
# Set environment variable to avoid compilation cache issues
export VLLM_DISABLE_COMPILE_CACHE=1
# Serve Llama 4 Scout with OpenAI-compatible API
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tensor-parallel-size 1 \
--max-model-len 32000 \
--port 8000 \
--api-key local-openclaw-key
Why these flags:
--tensor-parallel-size 1: Uses single GPU (increase for multi-GPU setups)--max-model-len 32000: Context window size (32K tokens = ~24K words)--api-key: Secures your endpoint (use a strong random string in production)
Expected: Model downloads from HuggingFace (38GB), loads in 3-5 minutes, then shows:
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000
Test the endpoint:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer local-openclaw-key" \
-d '{
"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
"prompt": "Write a Python function to reverse a string:",
"max_tokens": 100
}'
You should see a completion response in JSON format.
Step 3: Install OpenClaw
OpenClaw is the agent framework that connects to your local LLM and provides the execution sandbox.
# Install OpenClaw globally
npm install -g openclaw
# Run the setup wizard
openclaw onboard
During setup:
- When asked "How do you want to hatch your bot?", select "Open the Web UI"
- Choose "local" mode for the gateway
- When prompted for model configuration, select "Skip" (we'll configure manually)
Expected: Web UI opens at http://localhost:18789 with a token-based login.
Step 4: Configure OpenClaw for Local vLLM
OpenClaw needs to know where your local model is running. We'll edit the config file directly.
# Open OpenClaw config
nano ~/.openclaw/openclaw.json
Add this configuration:
{
"models": {
"providers": {
"local-vllm": {
"baseUrl": "http://localhost:8000/v1",
"apiKey": "local-openclaw-key",
"api": "openai-completions",
"models": [
{
"id": "llama4-scout",
"name": "Llama 4 Scout Local",
"reasoning": true,
"input": ["text"],
"cost": {
"input": 0,
"output": 0
},
"contextWindow": 32000,
"maxTokens": 8192
}
]
}
}
},
"agents": {
"defaults": {
"model": {
"primary": "local-vllm/llama4-scout"
},
"workspace": "/home/yourusername/.openclaw/workspace",
"compaction": {
"mode": "safeguard"
},
"maxConcurrent": 4
}
}
}
Critical fields:
baseUrl: Must match vLLM's serving address and portapiKey: Must match the--api-keyyou set in Step 2contextWindow: Match--max-model-lenfrom vLLM configprimary: Format isprovider/model-id
Save and restart OpenClaw:
openclaw restart
Step 5: Verify the Setup
Test OpenClaw's connection to your local model through the Web UI.
# Check OpenClaw status
openclaw status
Expected output:
✓ Config OK: ~/.openclaw/openclaw.json
✓ Workspace OK: ~/.openclaw/workspace
✓ Gateway: Running on http://localhost:18789
✓ Model: local-vllm/llama4-scout (connected)
Test with a simple task:
In the OpenClaw Web UI, send this message:
Create a Python script that prints the Fibonacci sequence up to n=10
You should see:
- OpenClaw acknowledges the request
- It creates a file in
~/.openclaw/workspace/ - The agent executes the script and shows output
- All of this happens using your local Llama 4 model (check vLLM logs to confirm requests)
If it fails:
- Error: "Model not responding": Check vLLM is running (
curl http://localhost:8000/health) - Error: "Authentication failed": Verify
apiKeymatches in both configs - Slow responses (>30s): Llama 4 Scout needs 24GB VRAM minimum; check
nvidia-smifor memory usage
Performance Optimization
For Single GPU (24GB VRAM)
Use FP8 quantization for 2x speedup:
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct-FP8 \
--tensor-parallel-size 1 \
--max-model-len 32000 \
--gpu-memory-utilization 0.95
For Multi-GPU Setup (2x 24GB)
Enable tensor parallelism:
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tensor-parallel-size 2 \
--max-model-len 128000 \
--gpu-memory-utilization 0.90
This increases context window to 128K tokens (~96K words).
Memory Issues?
If you're running out of VRAM:
# Reduce context window
--max-model-len 16000
# Lower GPU memory utilization
--gpu-memory-utilization 0.80
# Use CPU offloading (much slower)
--cpu-offload-gb 8
Cost Comparison
Cloud API (Claude 4.5 Sonnet):
- Input: $3/M tokens
- Output: $15/M tokens
- Typical OpenClaw task (500K tokens/day): ~$270/month
Local vLLM (Single RTX 4090):
- Hardware: $1,599 one-time
- Power: ~$30/month (450W × 24hrs × $0.12/kWh)
- Break-even: 6 months
Cloud GPU (RunPod 4x RTX 4090):
- $0.40/hour for on-demand
- 8 hours/day × 30 days = $96/month
- Good for testing before buying hardware
What You Learned
- vLLM serves Llama 4 Scout with an OpenAI-compatible API that OpenClaw understands
- FP8 quantization makes 17B models viable on consumer GPUs
- Local deployment eliminates API costs and keeps sensitive data private
Limitations:
- Llama 4 Scout is competent but not Claude 4.5 Sonnet quality (expect ~70% task completion vs ~85%)
- Single GPU limits concurrent requests (vLLM batching helps but caps at ~4 concurrent)
- Model updates require manual redownloading (40GB)
When NOT to use local:
- Simple, low-volume tasks (<100K tokens/month) where API costs are negligible
- Need for absolute state-of-the-art performance (Claude Opus 4.5, GPT-4.5)
- No GPU access and unwilling to rent cloud GPUs
Alternative: Ollama for Simpler Setup
If you want dead-simple local LLM serving without vLLM's complexity:
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Pull Llama 4 Scout (when available in Ollama)
ollama pull llama3.1:70b # Use Llama 3.1 70B until Llama 4 is added
# Serve with OpenAI compatibility
ollama serve
OpenClaw config for Ollama:
{
"models": {
"providers": {
"ollama": {
"baseUrl": "http://localhost:11434/v1",
"apiKey": "ollama",
"api": "openai-completions",
"models": [
{
"id": "llama3.1:70b",
"contextWindow": 128000
}
]
}
}
},
"agents": {
"defaults": {
"model": {
"primary": "ollama/llama3.1:70b"
}
}
}
}
Trade-off: Ollama is easier but vLLM is 5-10x faster for multi-request workloads.
Security Considerations
Your local LLM is private, but the agent isn't automatically secure:
- Workspace isolation: OpenClaw runs commands inside
~/.openclaw/workspaceby default, not your entire filesystem - Docker sandbox (recommended): Run OpenClaw in Docker for true isolation:
docker run -d \
-v ~/.openclaw:/root/.openclaw \
-p 18789:18789 \
--network host \
openclaw/openclaw:latest
- Network exposure: Never expose
http://0.0.0.0:8000publicly without authentication - Model access: Even local models can execute malicious code if the agent is compromised (use
safeguardcompaction mode)
In January 2026, researchers found 42,665 exposed OpenClaw instances. 93.4% were vulnerable to command injection.
Troubleshooting
vLLM won't start:
# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"
# Check GPU memory
nvidia-smi
# Try older vLLM version if latest fails
pip install vllm==0.7.0 --break-system-packages
OpenClaw can't connect:
# Test vLLM endpoint directly
curl http://localhost:8000/v1/models
# Check OpenClaw logs
openclaw logs --follow
# Verify config syntax
python -m json.tool ~/.openclaw/openclaw.json
Out of memory errors:
# Reduce batch size
--max-num-seqs 16
# Use smaller context
--max-model-len 16000
# Try Llama 3.1 8B instead (fits in 16GB)
vllm serve meta-llama/Llama-3.1-8B-Instruct
Tested on vLLM 0.8.3+, Llama 4 Scout, OpenClaw 2026.2.6, Ubuntu 22.04 with RTX 4090 (24GB)