Problem: OpenClaw Takes 2-3 Seconds Per Response
Your OpenClaw AI agent feels sluggish. Simple queries take 2.2-3.4 seconds when they should be instant. Voice interactions lag noticeably. The gateway adds 10+ seconds of overhead compared to direct API calls.
You'll learn:
- Why default thinking modes kill voice interaction speed
- How to eliminate WebSocket throttling delays
- Which config changes cut latency by 50%+
Time: 15 min | Level: Intermediate
Why This Happens
OpenClaw's default configuration prioritizes safety and feature completeness over raw speed. The gateway includes intentional delays (150ms WebSocket throttle), verbose thinking modes, and conservative prompt caching that work against real-time use cases.
Common symptoms:
- Voice agent responses feel "sluggish" (>1s feels slow for conversation)
- Gateway adds 10s overhead vs direct Ollama/API calls
- API responses arrive fast but delivery to chat apps lags
- Multi-turn conversations slow down over time
Solution
Step 1: Disable Verbose Thinking Mode
The thinkingDefault setting controls how much internal reasoning the model performs before responding. For real-time interactions, verbose thinking adds unnecessary latency.
Edit ~/.openclaw/openclaw.json:
{
  "agents": {
    "list": [
      {
        "id": "main",
        "thinkingDefault": "minimal"
      }
    ]
  }
}
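If you'd rather patch the config programmatically than hand-edit JSON, a small sketch like the following works. It assumes only the `agents.list` schema shown above; the demo writes to a scratch file rather than your live `~/.openclaw/openclaw.json`.

```python
import json
import tempfile
from pathlib import Path

def set_thinking(config_path: str, agent_id: str = "main",
                 mode: str = "minimal") -> dict:
    """Merge thinkingDefault into an agent entry, preserving the rest of the config."""
    path = Path(config_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    agents = config.setdefault("agents", {}).setdefault("list", [])
    for agent in agents:
        if agent.get("id") == agent_id:
            agent["thinkingDefault"] = mode
            break
    else:
        # No matching agent entry yet: create one.
        agents.append({"id": agent_id, "thinkingDefault": mode})
    path.write_text(json.dumps(config, indent=2))
    return config

# Demo on a temporary file, not the real config
with tempfile.TemporaryDirectory() as tmp:
    cfg = set_thinking(f"{tmp}/openclaw.json")
    print(cfg["agents"]["list"][0]["thinkingDefault"])  # minimal
```

Point it at `~/.openclaw/openclaw.json` once you've confirmed the output looks right.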
Why this works: Reduces model processing time from ~2.2s to ~1.1s by skipping extensive chain-of-thought. The model responds faster with "just answer" behavior instead of showing its work.
Trade-off: Complex reasoning tasks may be less reliable. For critical decisions, override with openclaw agent --thinking high when needed.
Step 2: Reduce WebSocket Delta Throttle
OpenClaw throttles streaming token delivery at 150ms by default to prevent UI flooding. This is fine for web chat but terrible for voice WebSockets.
Set environment variable before starting gateway:
export OPENCLAW_WS_DELTA_THROTTLE_MS=20
openclaw gateway --port 18789
Or add to your shell profile (~/.zshrc or ~/.bashrc):
echo 'export OPENCLAW_WS_DELTA_THROTTLE_MS=20' >> ~/.zshrc
source ~/.zshrc
Expected: Tokens stream every 20ms instead of 150ms. Voice consumers like Deepgram see responses 7x faster.
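The arithmetic behind that "7x" figure: the throttle interval sets how often batched token deltas are flushed to the socket.

```python
# One batched delta flush per throttle interval.
default_ms, tuned_ms = 150, 20
flushes_default = 1000 / default_ms   # ~6.7 flushes/sec at the default
flushes_tuned = 1000 / tuned_ms       # 50 flushes/sec after tuning
speedup = default_ms / tuned_ms       # 7.5x more frequent delivery
print(f"{flushes_default:.1f} -> {flushes_tuned:.0f} flushes/sec ({speedup:.1f}x)")
```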
If it fails:
- Gateway ignores the variable: restart the gateway service completely with openclaw service restart
- Still slow: check that you're running OpenClaw v2026.2.1+ (env var support was added recently)
Step 3: Enable Anthropic Prompt Caching
For multi-turn conversations, prompt cache invalidation adds latency. OpenClaw now prunes context only after 5 minutes of idle time to preserve cache hits.
Check your OpenClaw version:
openclaw --version
If it's below v2026.2.0, update to the latest release:
npm install -g openclaw@latest
# or
pnpm add -g openclaw@latest
Why this works: Anthropic's prompt cache gives you 90% cost reduction AND faster responses when context is reused. Previous versions pruned too aggressively, forcing full context rebuilds.
Limitation: Only applies to Anthropic models (Claude). Other providers don't support prompt caching yet.
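To see what the 90% figure means per turn, here's a back-of-envelope sketch. The prices are illustrative (a Sonnet-class $3/M input rate, with cache reads billed at 10% of the base rate per the reduction claim above); check current Anthropic pricing before budgeting.

```python
# Illustrative numbers only, not current list prices.
input_price_per_m = 3.00      # $ per million input tokens (Sonnet-class)
cache_read_discount = 0.10    # cache reads at 10% of base = 90% reduction
context_tokens = 50_000       # a mid-sized multi-turn conversation

uncached = context_tokens / 1e6 * input_price_per_m
cached = uncached * cache_read_discount
print(f"per turn: ${uncached:.3f} uncached vs ${cached:.4f} cached")
```

Every turn that misses the cache pays the full rate to re-read the context, which is why aggressive pruning hurt both cost and latency.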
Step 4: Optimize Model Routing
Don't use Opus/Sonnet for every request. Route simple tasks to faster models.
Add to openclaw.json:
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-sonnet-4-20250514",
        "fallbacks": [
          "anthropic/claude-haiku-4-20251001"
        ]
      }
    },
    "list": [
      {
        "id": "main",
        "heartbeat": {
          "model": "google/gemini-2.5-flash-lite"
        },
        "subAgents": {
          "model": "deepseek/deepseek-v3.2"
        }
      }
    ]
  }
}
Why this works:
- Heartbeats (periodic "still alive" checks): Use ultra-fast free model instead of Opus
- Sub-agents (parallel background work): Route to DeepSeek ($0.55/M tokens) vs Opus ($15/M)
- Fallbacks: Switch to Haiku if Sonnet is rate-limited (different provider = faster failover)
Trade-off: Sub-tasks use cheaper models. Main conversation still gets Sonnet quality.
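Using the per-token rates quoted above, the savings on background work are easy to quantify:

```python
# Rates quoted in this article ($ per million tokens); verify current pricing.
opus_per_m, deepseek_per_m = 15.00, 0.55
tokens_m = 1.0  # one million tokens of sub-agent traffic

savings_factor = opus_per_m / deepseek_per_m  # ~27x
print(f"${opus_per_m * tokens_m:.2f} vs ${deepseek_per_m * tokens_m:.2f} "
      f"(~{savings_factor:.0f}x cheaper)")
```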
Step 5: Reduce Context Window Size (Local Models Only)
If using Ollama or local models, smaller context = faster inference.
Edit the Ollama Modelfile or set options via the API (note: Ollama's GPU-offload option is num_gpu; n_gpu_layers is the raw llama.cpp name):
{
  "num_ctx": 4096,
  "num_batch": 512,
  "num_thread": 8,
  "num_gpu": 35
}
Why this works: Less memory to read per token generation. A MacBook M2 goes from 3.2 t/s to ~8 t/s by cutting context from 8K to 4K.
Trade-off: OpenClaw will "forget" earlier messages in long conversations.
Note for cloud users: this step doesn't apply to Anthropic/OpenAI APIs; the provider handles context server-side.
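What those throughput numbers mean for a voice reply, using the article's M2 MacBook figures (quoted values, not fresh benchmarks):

```python
# Time to generate a typical 150-token spoken reply at each throughput.
reply_tokens = 150
for label, tps in [("8K context", 3.2), ("4K context", 8.0)]:
    print(f"{label}: {reply_tokens / tps:.1f}s for {reply_tokens} tokens")
```

Even after the improvement, tens of seconds per reply is far from voice-grade; this is why the limitations below note that sub-second local latency usually needs GPU hardware.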
Verification
Test latency with this command:
time openclaw agent --message "What's 2+2?" --thinking minimal
You should see:
- First token in <500ms (was 1-2s before)
- Total response <1.5s (was 2-3s before)
For voice integration:
Set up Deepgram Voice Agent API or similar, then measure end-to-end turn latency. Target <1s for conversational feel.
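If you want repeatable numbers rather than a one-off `time` run, a small timing harness like this works. The demo times a portable placeholder command; swap in your actual `openclaw agent` invocation (shown in the comment) to measure the real thing.

```python
import subprocess
import sys
import time

def time_command(cmd: list[str]) -> float:
    """Wall-clock a command end to end; returns elapsed seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start

# Replace with e.g.:
#   ["openclaw", "agent", "--message", "What's 2+2?", "--thinking", "minimal"]
elapsed = time_command([sys.executable, "-c", "print('ok')"])
print(f"{elapsed:.3f}s total")
```

Run it a few times and compare medians before and after each config change, since single samples are noisy.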
What You Learned
- thinkingDefault: minimal is the single biggest latency win for real-time use
- WebSocket throttle defaults are tuned for UI, not voice
- Model routing slashes costs AND latency for background tasks
- Prompt caching requires recent OpenClaw version (v2026.2+)
Limitations:
- These optimizations favor speed over reasoning depth
- Local model optimizations don't apply to cloud APIs
- Voice-grade latency (<1s) may require GPU hardware for local models
Tested on OpenClaw v2026.2.1, Claude Sonnet 4.5, Deepgram Voice API, macOS Sequoia