Reduce OpenClaw Latency in 15 Minutes: 5 Proven Optimizations

Cut OpenClaw response time by 50% with these battle-tested configurations. Fix slow API calls, optimize thinking modes, and reduce WebSocket throttling delays.

Problem: OpenClaw Takes 2-3 Seconds Per Response

Your OpenClaw AI agent feels sluggish. Simple queries take 2.2-3.4 seconds when they should be instant. Voice interactions lag noticeably. The gateway adds 10+ seconds of overhead compared to direct API calls.

You'll learn:

  • Why default thinking modes kill voice interaction speed
  • How to eliminate WebSocket throttling delays
  • Which config changes cut latency by 50%+

Time: 15 min | Level: Intermediate


Why This Happens

OpenClaw's default configuration prioritizes safety and feature completeness over raw speed. The gateway includes intentional delays (150ms WebSocket throttle), verbose thinking modes, and conservative prompt caching that work against real-time use cases.

Common symptoms:

  • Voice agent responses feel "sluggish" (>1s feels slow for conversation)
  • Gateway adds 10s overhead vs direct Ollama/API calls
  • API responses arrive fast but delivery to chat apps lags
  • Multi-turn conversations slow down over time

Solution

Step 1: Disable Verbose Thinking Mode

The thinkingDefault setting controls how much internal reasoning the model performs before responding. For real-time interactions, verbose thinking adds unnecessary latency.

Edit ~/.openclaw/openclaw.json:

{
  "agents": {
    "list": [
      {
        "id": "main",
        "thinkingDefault": "minimal"
      }
    ]
  }
}

Why this works: Reduces model processing time from ~2.2s to ~1.1s by skipping extensive chain-of-thought. The model responds faster with "just answer" behavior instead of showing its work.

Trade-off: Complex reasoning tasks may be less reliable. For critical decisions, override with openclaw agent --thinking high when needed.
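If you manage more than one machine, the edit can be scripted. A minimal Python sketch, assuming the config schema shown above (the helper name is illustrative, not part of OpenClaw):

```python
import json
from pathlib import Path

# Default config location, as referenced above.
CONFIG_PATH = Path.home() / ".openclaw" / "openclaw.json"

def set_thinking_default(config: dict, level: str = "minimal") -> dict:
    """Set thinkingDefault on every agent entry, adding the key if absent."""
    for agent in config.get("agents", {}).get("list", []):
        agent["thinkingDefault"] = level
    return config

# Usage (uncomment to apply in place):
# config = json.loads(CONFIG_PATH.read_text())
# CONFIG_PATH.write_text(json.dumps(set_thinking_default(config), indent=2))
```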


Step 2: Reduce WebSocket Delta Throttle

OpenClaw throttles streaming token delivery at 150ms by default to prevent UI flooding. This is fine for web chat but terrible for voice WebSockets.

Set environment variable before starting gateway:

export OPENCLAW_WS_DELTA_THROTTLE_MS=20
openclaw gateway --port 18789

Or add to your shell profile (~/.zshrc or ~/.bashrc):

echo 'export OPENCLAW_WS_DELTA_THROTTLE_MS=20' >> ~/.zshrc
source ~/.zshrc

Expected: Tokens stream every 20ms instead of 150ms, so voice consumers such as Deepgram receive deltas roughly 7x more often.

If it fails:

  • Gateway ignores the variable: restart the gateway completely with openclaw service restart
  • Still slow: Check you're using OpenClaw v2026.2.1+ (env var support added recently)
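To see why the interval matters, model the throttle as one flush per interval. A rough Python sketch; it assumes one token per flush, which overstates the effect for long responses since larger intervals batch more tokens together:

```python
import math

def delivery_time_ms(n_tokens: int, throttle_ms: int, tokens_per_flush: int = 1) -> int:
    """Worst-case gateway delivery time when deltas are flushed once per
    throttle interval and the model outpaces the flush cadence."""
    flushes = math.ceil(n_tokens / tokens_per_flush)
    return flushes * throttle_ms

# The first token waits up to one interval: 150 ms vs 20 ms.
# 100 single-token flushes: 15,000 ms at the default vs 2,000 ms at 20 ms (7.5x).
```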

Step 3: Enable Anthropic Prompt Caching

For multi-turn conversations, prompt cache invalidation adds latency. OpenClaw now prunes context only after 5 minutes of idle time to preserve cache hits.

Check your OpenClaw version:

openclaw --version

If it's below v2026.2.0, update to the latest release:

npm install -g openclaw@latest
# or
pnpm add -g openclaw@latest

Why this works: Anthropic's prompt cache gives you 90% cost reduction AND faster responses when context is reused. Previous versions pruned too aggressively, forcing full context rebuilds.

Limitation: Only applies to Anthropic models (Claude). Other providers don't support prompt caching yet.
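To estimate the savings, you can model input cost as a blend of fresh and cached tokens. A back-of-envelope Python sketch; the price and hit ratio are hypothetical, and the 90% discount is the figure cited above:

```python
def cached_input_cost(tokens: int, price_per_mtok: float,
                      cache_hit_ratio: float, cached_discount: float = 0.90) -> float:
    """Estimated input cost in dollars when part of the prompt hits the cache."""
    cached = tokens * cache_hit_ratio
    fresh = tokens - cached
    return (fresh + cached * (1 - cached_discount)) * price_per_mtok / 1_000_000

# 100K-token context at a hypothetical $3/M input price:
# no cache hits: $0.30 per turn; 90% hit ratio: about $0.057 per turn.
```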


Step 4: Optimize Model Routing

Don't use Opus/Sonnet for every request. Route simple tasks to faster models.

Add to openclaw.json:

{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-sonnet-4-20250514",
        "fallbacks": [
          "anthropic/claude-haiku-4-20251001"
        ]
      }
    },
    "list": [
      {
        "id": "main",
        "heartbeat": {
          "model": "google/gemini-2.5-flash-lite"
        },
        "subAgents": {
          "model": "deepseek/deepseek-v3.2"
        }
      }
    ]
  }
}

Why this works:

  • Heartbeats (periodic "still alive" checks): use an ultra-fast free model instead of Opus
  • Sub-agents (parallel background work): route to DeepSeek ($0.55/M tokens) instead of Opus ($15/M)
  • Fallbacks: switch to Haiku when Sonnet is rate-limited (a separate model, so a separate rate-limit bucket)

Trade-off: Sub-tasks may lose some quality on cheaper models; the main conversation still gets Sonnet-level output.
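The cost side of this routing is easy to sanity-check. A quick Python sketch using the per-million-token prices quoted above (the monthly volume is hypothetical):

```python
# $/M tokens, from the figures quoted above.
PRICES = {"opus": 15.00, "deepseek": 0.55}

def routing_cost(mtokens: float, model: str) -> float:
    """Dollar cost of sending a given number of million tokens to a model."""
    return mtokens * PRICES[model]

# 50M sub-agent tokens per month:
# Opus:     50 * 15.00 = $750.00
# DeepSeek: 50 * 0.55  = $27.50  (roughly 27x cheaper)
```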


Step 5: Reduce Context Window Size (Local Models Only)

If using Ollama or local models, smaller context = faster inference.

Set these options via the Ollama API (or the equivalent PARAMETER lines in a Modelfile):

{
  "num_ctx": 4096,
  "num_batch": 512,
  "num_thread": 8,
  "num_gpu": 35
}

Why this works: There is less context memory to read for each generated token. A MacBook M2 goes from 3.2 t/s to ~8 t/s when context drops from 8K to 4K.

Trade-off: OpenClaw will "forget" earlier messages in long conversations.

Note for cloud users: This step doesn't apply to Anthropic/OpenAI APIs; those providers manage context server-side.
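If you prefer per-request tuning over a Modelfile, Ollama's /api/generate endpoint accepts these settings under an options key. A Python sketch; the model name is a placeholder, and num_gpu is Ollama's name for the number of GPU-offloaded layers:

```python
import json
import urllib.request

def build_generate_payload(model: str, prompt: str) -> dict:
    """Request body for Ollama's /api/generate with the speed-oriented options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_ctx": 4096,   # smaller context window -> faster inference
            "num_batch": 512,
            "num_thread": 8,
            "num_gpu": 35,     # layers offloaded to GPU
        },
    }

# Usage (uncomment when an Ollama server is running locally):
# req = urllib.request.Request(
#     "http://localhost:11434/api/generate",
#     data=json.dumps(build_generate_payload("llama3.1", "What's 2+2?")).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(json.load(urllib.request.urlopen(req))["response"])
```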


Verification

Test latency with this command:

time openclaw agent --message "What's 2+2?" --thinking minimal

You should see:

  • First token in <500ms (was 1-2s before)
  • Total response <1.5s (was 2-3s before)

For voice integration:

Set up Deepgram Voice Agent API or similar, then measure end-to-end turn latency. Target <1s for conversational feel.


What You Learned

  • thinkingDefault: minimal is the single biggest latency win for real-time use
  • WebSocket throttle defaults are tuned for UI, not voice
  • Model routing slashes costs AND latency for background tasks
  • Prompt caching requires recent OpenClaw version (v2026.2+)

Limitations:

  • These optimizations favor speed over reasoning depth
  • Local model optimizations don't apply to cloud APIs
  • Voice-grade latency (<1s) may require GPU hardware for local models

Tested on OpenClaw v2026.2.1, Claude Sonnet 4.5, Deepgram Voice API, macOS Sequoia