Problem: Your OpenClaw Assistant Only Types
You have OpenClaw running but can only interact through text. You want natural voice conversations where you speak to your assistant and it responds with audio.
You'll learn:
- Enable speech-to-text (STT) with Whisper for transcribing your voice messages
- Configure text-to-speech (TTS) with ElevenLabs or OpenAI for audio responses
- Set up voice conversations via Telegram voice notes
Time: 15 min | Level: Intermediate
Why This Matters
OpenClaw's voice capabilities transform it from a text chatbot into a conversational assistant. Voice removes friction for mobile interactions and makes your assistant feel more natural when you're away from a keyboard.
Common use cases:
- Hands-free queries while driving or walking
- Quick voice memos to your assistant on mobile
- Audio responses you can listen to without reading
- Natural conversation flow on messaging apps
Solution
Step 1: Enable Audio Transcription (Speech-to-Text)
OpenClaw auto-detects available transcription providers in this order: local Whisper CLI, then provider APIs (OpenAI, Deepgram).
Option A: Auto-Detection (Easiest)
Audio transcription is enabled by default. OpenClaw tries local CLIs first, then provider APIs.
# Check if Whisper CLI is installed
which whisper
Expected: Path to Whisper binary, or nothing if not installed.
Option B: Use OpenAI Transcription API
{
"tools": {
"media": {
"audio": {
"enabled": true,
"maxBytes": 20971520,
"models": [
{
"provider": "openai",
"model": "gpt-4o-mini-transcribe"
}
]
}
}
}
}
Add this to your openclaw.json config file (typically at ~/.openclaw/openclaw.json).
Why this works: OpenAI's hosted transcription API handles audio without a local Whisper install, and gpt-4o-mini-transcribe is cost-effective for short voice notes.
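As a quick sanity check, the maxBytes value in the config above is just 20 MiB written out in bytes:

```shell
# maxBytes = 20 MiB expressed in bytes
echo $((20 * 1024 * 1024))   # prints 20971520
```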
Option C: Install Local Whisper CLI
# Install Python Whisper
pip install --break-system-packages openai-whisper
# Test transcription
whisper test_audio.mp3 --model base
If it fails:
- Error: "whisper: command not found": Ensure Python's bin directory is in your PATH
- Slow transcription (30-60s): Normal for CPU-only transcription. Use the base model for speed, or a GPU for faster results
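If the whisper command isn't found after a pip user install, the console script usually lands in Python's user-base bin directory. This sketch prints the directory you would add to PATH (the exact location varies by platform and Python version):

```shell
# Print the user-base bin directory where pip places console scripts
echo "$(python3 -m site --user-base)/bin"
```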
Step 2: Configure Text-to-Speech (TTS)
OpenClaw supports three TTS providers: ElevenLabs (best quality), OpenAI (good quality), and Edge TTS (free fallback).
Option A: ElevenLabs (Recommended for Quality)
# Set API key as environment variable
export ELEVENLABS_API_KEY="your_api_key_here"
Add to your shell profile (~/.zshrc or ~/.bashrc) to persist:
echo 'export ELEVENLABS_API_KEY="your_api_key_here"' >> ~/.zshrc
source ~/.zshrc
Get your ElevenLabs API key: Sign up at elevenlabs.io → Settings → API Key
Option B: OpenAI TTS
# OpenAI API key (if not already set)
export OPENAI_API_KEY="your_api_key_here"
Option C: Edge TTS (Free, No API Key)
Edge TTS works automatically without configuration. OpenClaw uses it as fallback when no API keys are available.
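If you want to pin the free fallback explicitly instead of relying on auto-selection, a minimal sketch reusing the messages.tts shape from Step 3 (assumed to accept "edge" as a provider value, matching the provider list shown by /tts status):

```json
{
  "messages": {
    "tts": {
      "provider": "edge"
    }
  }
}
```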
Step 3: Enable TTS in OpenClaw
TTS is off by default. Enable it per session or globally:
# Enable TTS for current session (via chat)
/tts on
# Or enable in config (openclaw.json)
{
"messages": {
"tts": {
"auto": "always",
"provider": "elevenlabs",
"elevenlabs": {
"voice": "rachel"
}
}
}
}
Provider priority: OpenClaw prefers OpenAI when its API key is set, then ElevenLabs, and falls back to Edge TTS when neither key is available.
Common voice IDs:
- ElevenLabs: rachel, clyde, domi
- OpenAI: alloy, echo, nova
If it fails:
- No audio response: Check provider API key is set correctly
- Wrong voice: Specify a voice in config, or use /tts provider elevenlabs
- Audio too long: Set a maxTextLength limit in config
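To make the character cap persistent rather than per-session, a config sketch using the maxTextLength key from the full examples later in this guide (the value 1500 here is illustrative):

```json
{
  "messages": {
    "tts": {
      "maxTextLength": 1500
    }
  }
}
```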
Step 4: Test Voice Conversation (Telegram Example)
OpenClaw's voice features work best on Telegram with voice notes.
Test STT (Speech-to-Text):
- Open Telegram chat with your OpenClaw bot
- Hold microphone button, speak: "What's the weather today?"
- Release to send voice note
- OpenClaw transcribes and responds
Expected: Bot replies with text showing it understood your speech.
Test TTS (Text-to-Speech):
# Enable TTS via chat command
/tts on
# Or test audio generation directly
/tts audio Hello from OpenClaw
Expected: Bot sends audio file you can play. On Telegram, it appears as a round voice note.
Verification
Full voice loop test:
# Check TTS status
/tts status
You should see:
- Auto-TTS mode (off/inbound/always/tagged)
- Current provider (openai/elevenlabs/edge)
- Voice configuration
- Character limits
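Before the end-to-end test, it can help to confirm which provider keys your shell actually exports. A small sketch that prints only set/missing, never the key values:

```shell
# Report whether each TTS/STT provider key is set in this shell
for key in OPENAI_API_KEY ELEVENLABS_API_KEY; do
  eval "val=\${$key:-}"
  if [ -n "$val" ]; then echo "$key: set"; else echo "$key: missing"; fi
done
```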
Send a voice note asking a question. Bot should:
- Transcribe your speech to text
- Process the request
- Reply with audio response (if TTS is on)
Configuration Examples
Telegram Voice Notes with ElevenLabs
{
"tools": {
"media": {
"audio": {
"enabled": true,
"models": [
{
"provider": "openai",
"model": "gpt-4o-mini-transcribe"
}
]
}
}
},
"messages": {
"tts": {
"auto": "inbound",
"provider": "elevenlabs",
"maxTextLength": 2000,
"elevenlabs": {
"voice": "rachel",
"model": "eleven_turbo_v2_5"
}
}
}
}
This config:
- Uses OpenAI Whisper for transcription
- Auto-responds with audio when you send voice notes (auto: "inbound")
- Uses the ElevenLabs Rachel voice
- Limits TTS to 2000 characters (auto-summarizes longer responses)
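A malformed openclaw.json silently breaks both features, so it's worth validating the file before restarting. This sketch uses Python's stdlib json.tool against a throwaway copy in /tmp (point it at your real config in practice):

```shell
# Write a sample config fragment and confirm it parses as JSON
cat > /tmp/openclaw-check.json <<'EOF'
{"messages": {"tts": {"auto": "inbound", "provider": "elevenlabs"}}}
EOF
python3 -m json.tool /tmp/openclaw-check.json > /dev/null && echo "valid JSON"
```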
Local Whisper with OpenAI TTS
{
"tools": {
"media": {
"audio": {
"enabled": true,
"models": [
{
"type": "cli",
"command": "whisper",
"args": ["--model", "base", "{{MediaPath}}"],
"timeoutSeconds": 45
}
]
}
}
},
"messages": {
"tts": {
"auto": "always",
"provider": "openai",
"openai": {
"voice": "nova",
"model": "tts-1"
}
}
}
}
This config:
- Uses local Whisper CLI (free, no API calls)
- Always responds with audio (auto: "always")
- Uses the OpenAI Nova voice with the tts-1 model
Advanced: Voice-Only Mode
For true hands-free experience, disable text responses:
# Enable auto-TTS for all replies
/tts always
# Set short TTS limit to prevent long text conversion
/tts limit 1000
Result: All replies under 1000 chars become voice notes. Longer replies remain text (or get auto-summarized if configured).
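The two chat commands above can also be persisted in openclaw.json; a sketch built from keys already used in this guide (auto and maxTextLength):

```json
{
  "messages": {
    "tts": {
      "auto": "always",
      "maxTextLength": 1000
    }
  }
}
```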
What You Learned
- OpenClaw auto-detects transcription providers (Whisper CLI → Provider APIs)
- ElevenLabs provides highest quality TTS, OpenAI is balanced, Edge is free
- TTS modes: off, inbound (reply to voice with voice), always, tagged
- Telegram voice notes work seamlessly with 48 kHz Opus encoding
Limitations:
- Local Whisper is CPU-intensive (30-60s for short clips)
- ElevenLabs requires paid subscription for high volume
- Voice quality varies by model (ElevenLabs > OpenAI > Edge)
Troubleshooting
"No audio transcription"
# Check audio config
grep -A 10 '"audio"' ~/.openclaw/openclaw.json
# Verify API keys
echo $OPENAI_API_KEY
Fix: Ensure tools.media.audio.enabled is not false in config.
"TTS not working"
# Check TTS status
/tts status
# Test manual audio generation
/tts audio test message
Fix: Verify provider API key is set. Try /tts provider edge to test with free fallback.
"Transcription takes too long"
Solution: Use provider API instead of local Whisper:
{
"tools": {
"media": {
"audio": {
"models": [
{"provider": "openai", "model": "gpt-4o-mini-transcribe"}
]
}
}
}
}
Why: Cloud APIs are optimized for speed. Local Whisper on CPU can take 30-60s per voice note.
Tested on OpenClaw 2026.1.24+, macOS & Linux with Telegram voice notes