Problem: OpenClaw API Costs Add Up Fast

You've set up OpenClaw with Claude or GPT-4 API and watched costs climb as your AI agent runs commands, edits files, and automates tasks. At $15-30 per million tokens, a productive week can cost $50-100.

You'll learn:

How to run Llama 4 Scout locally with vLLM for zero API costs
Connect OpenClaw to your local model in 5 minutes
When local deployment beats cloud APIs

Time: 20 min | Level: Intermediate

Why This Matters

OpenClaw is an autonomous AI agent that executes Terminal commands and manipulates files. Cloud APIs work great but create two problems: recurring costs that scale with usage, and sensitive data leaving your infrastructure.

Common triggers for going local:

Processing proprietary code or internal documentation
Running continuous automation tasks (monitoring, testing, deployments)
Monthly API bills exceeding $200
Regulatory requirements for data residency

Prerequisites

Hardware requirements:

NVIDIA GPU with 24GB+ VRAM (RTX 4090, A6000, or better)
32GB+ system RAM
100GB free storage for model weights

Software:

Ubuntu 22.04+ or Docker
Python 3.10+
Node.js 22+ (for OpenClaw)
CUDA 12.1+

Alternative: Use cloud GPU providers like RunPod ($0.40/hr for 4x RTX 4090) if you lack local hardware.

Solution

Step 1: Install vLLM

vLLM is the production-grade inference engine that makes Llama 4 actually usable. It handles efficient memory management through PagedAttention and supports multi-GPU parallelism.

# Create a virtual environment
python3 -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM (requires CUDA)
pip install vllm --break-system-packages

Expected: Installation completes in 2-3 minutes. You'll see CUDA compatibility checks during install.

If it fails:

Error: "CUDA not found": Install NVIDIA drivers and CUDA toolkit 12.1+
Error: "Unsupported GPU": vLLM requires compute capability 7.0+ (Volta architecture or newer)

Step 2: Download and Serve Llama 4 Scout

Llama 4 Scout is a 17B parameter model (16 experts, only ~17B active) optimized for coding and agent tasks. It fits on a single 24GB GPU with FP8 quantization.

# Set environment variable to avoid compilation cache issues
export VLLM_DISABLE_COMPILE_CACHE=1

# Serve Llama 4 Scout with OpenAI-compatible API
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 32000 \
  --port 8000 \
  --api-key local-openclaw-key

Why these flags:

--tensor-parallel-size 1: Uses single GPU (increase for multi-GPU setups)
--max-model-len 32000: Context window size (32K tokens = ~24K words)
--api-key: Secures your endpoint (use a strong random string in production)

Expected: Model downloads from HuggingFace (38GB), loads in 3-5 minutes, then shows:

INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000

Test the endpoint:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer local-openclaw-key" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "prompt": "Write a Python function to reverse a string:",
    "max_tokens": 100
  }'

You should see a completion response in JSON format.

Step 3: Install OpenClaw

OpenClaw is the agent framework that connects to your local LLM and provides the execution sandbox.

# Install OpenClaw globally
npm install -g openclaw

# Run the setup wizard
openclaw onboard

During setup:

When asked "How do you want to hatch your bot?", select "Open the Web UI"
Choose "local" mode for the gateway
When prompted for model configuration, select "Skip" (we'll configure manually)

Expected: Web UI opens at http://localhost:18789 with a token-based login.

Step 4: Configure OpenClaw for Local vLLM

OpenClaw needs to know where your local model is running. We'll edit the config file directly.

# Open OpenClaw config
nano ~/.openclaw/openclaw.json

Add this configuration:

{
  "models": {
    "providers": {
      "local-vllm": {
        "baseUrl": "http://localhost:8000/v1",
        "apiKey": "local-openclaw-key",
        "api": "openai-completions",
        "models": [
          {
            "id": "llama4-scout",
            "name": "Llama 4 Scout Local",
            "reasoning": true,
            "input": ["text"],
            "cost": {
              "input": 0,
              "output": 0
            },
            "contextWindow": 32000,
            "maxTokens": 8192
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "local-vllm/llama4-scout"
      },
      "workspace": "/home/yourusername/.openclaw/workspace",
      "compaction": {
        "mode": "safeguard"
      },
      "maxConcurrent": 4
    }
  }
}

Critical fields:

baseUrl: Must match vLLM's serving address and port
apiKey: Must match the --api-key you set in Step 2
contextWindow: Match --max-model-len from vLLM config
primary: Format is provider/model-id

Save and restart OpenClaw:

openclaw restart

Step 5: Verify the Setup

Test OpenClaw's connection to your local model through the Web UI.

# Check OpenClaw status
openclaw status

Expected output:

✓ Config OK: ~/.openclaw/openclaw.json
✓ Workspace OK: ~/.openclaw/workspace
✓ Gateway: Running on http://localhost:18789
✓ Model: local-vllm/llama4-scout (connected)

Test with a simple task:

In the OpenClaw Web UI, send this message:

Create a Python script that prints the Fibonacci sequence up to n=10

You should see:

OpenClaw acknowledges the request
It creates a file in ~/.openclaw/workspace/
The agent executes the script and shows output
All of this happens using your local Llama 4 model (check vLLM logs to confirm requests)

If it fails:

Error: "Model not responding": Check vLLM is running (curl http://localhost:8000/health)
Error: "Authentication failed": Verify apiKey matches in both configs
Slow responses (>30s): Llama 4 Scout needs 24GB VRAM minimum; check nvidia-smi for memory usage

Performance Optimization

For Single GPU (24GB VRAM)

Use FP8 quantization for 2x speedup:

vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct-FP8 \
  --tensor-parallel-size 1 \
  --max-model-len 32000 \
  --gpu-memory-utilization 0.95

For Multi-GPU Setup (2x 24GB)

Enable tensor parallelism:

vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 128000 \
  --gpu-memory-utilization 0.90

This increases context window to 128K tokens (~96K words).

Memory Issues?

If you're running out of VRAM:

# Reduce context window
--max-model-len 16000

# Lower GPU memory utilization
--gpu-memory-utilization 0.80

# Use CPU offloading (much slower)
--cpu-offload-gb 8

Cost Comparison

Cloud API (Claude 4.5 Sonnet):

Input: $3/M tokens
Output: $15/M tokens
Typical OpenClaw task (500K tokens/day): ~$270/month

Local vLLM (Single RTX 4090):

Hardware: $1,599 one-time
Power: ~$30/month (450W × 24hrs × $0.12/kWh)
Break-even: 6 months

Cloud GPU (RunPod 4x RTX 4090):

$0.40/hour for on-demand
8 hours/day × 30 days = $96/month
Good for testing before buying hardware

What You Learned

vLLM serves Llama 4 Scout with an OpenAI-compatible API that OpenClaw understands
FP8 quantization makes 17B models viable on consumer GPUs
Local deployment eliminates API costs and keeps sensitive data private

Limitations:

Llama 4 Scout is competent but not Claude 4.5 Sonnet quality (expect ~70% task completion vs ~85%)
Single GPU limits concurrent requests (vLLM batching helps but caps at ~4 concurrent)
Model updates require manual redownloading (40GB)

When NOT to use local:

Simple, low-volume tasks (<100K tokens/month) where API costs are negligible
Need for absolute state-of-the-art performance (Claude Opus 4.5, GPT-4.5)
No GPU access and unwilling to rent cloud GPUs

Alternative: Ollama for Simpler Setup

If you want dead-simple local LLM serving without vLLM's complexity:

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull Llama 4 Scout (when available in Ollama)
ollama pull llama3.1:70b  # Use Llama 3.1 70B until Llama 4 is added

# Serve with OpenAI compatibility
ollama serve

OpenClaw config for Ollama:

{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://localhost:11434/v1",
        "apiKey": "ollama",
        "api": "openai-completions",
        "models": [
          {
            "id": "llama3.1:70b",
            "contextWindow": 128000
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/llama3.1:70b"
      }
    }
  }
}

Trade-off: Ollama is easier but vLLM is 5-10x faster for multi-request workloads.

Security Considerations

Your local LLM is private, but the agent isn't automatically secure:

Workspace isolation: OpenClaw runs commands inside ~/.openclaw/workspace by default, not your entire filesystem
Docker sandbox (recommended): Run OpenClaw in Docker for true isolation:

docker run -d \
  -v ~/.openclaw:/root/.openclaw \
  -p 18789:18789 \
  --network host \
  openclaw/openclaw:latest

Network exposure: Never expose http://0.0.0.0:8000 publicly without authentication
Model access: Even local models can execute malicious code if the agent is compromised (use safeguard compaction mode)

In January 2026, researchers found 42,665 exposed OpenClaw instances. 93.4% were vulnerable to command injection.

Troubleshooting

vLLM won't start:

# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"

# Check GPU memory
nvidia-smi

# Try older vLLM version if latest fails
pip install vllm==0.7.0 --break-system-packages

OpenClaw can't connect:

# Test vLLM endpoint directly
curl http://localhost:8000/v1/models

# Check OpenClaw logs
openclaw logs --follow

# Verify config syntax
python -m json.tool ~/.openclaw/openclaw.json

Out of memory errors:

# Reduce batch size
--max-num-seqs 16

# Use smaller context
--max-model-len 16000

# Try Llama 3.1 8B instead (fits in 16GB)
vllm serve meta-llama/Llama-3.1-8B-Instruct

Tested on vLLM 0.8.3+, Llama 4 Scout, OpenClaw 2026.2.6, Ubuntu 22.04 with RTX 4090 (24GB)