Build a Local Voice Assistant with Whisper v4 and Llama 4

Run a fully private voice assistant on your machine using OpenAI Whisper v4 for speech-to-text and Llama 4 for responses. No cloud required.

Problem: Building a Private Voice Assistant That Runs Locally

You want a voice assistant that doesn't send your audio to the cloud. The major assistants (Siri, Alexa, Google Assistant) typically send voice requests to remote servers; this one doesn't. You'll run speech-to-text with Whisper v4 and inference with Llama 4 entirely on your machine.

You'll learn:

  • How to set up Whisper v4 for accurate local transcription
  • How to run Llama 4 with Ollama for fast local inference
  • How to wire audio capture, transcription, and LLM response into a real-time loop

Time: 45 min | Level: Intermediate


Why This Happens

Cloud assistants need a constant internet connection and log your data. The local alternative has historically been too slow or too inaccurate to be useful. Whisper v4 (released late 2025) closes the accuracy gap with cloud STT, and Llama 4 Scout in 4-bit quantization responds fast enough for real conversation, provided the quantized weights fit in your GPU memory.

Requirements:

  • Python 3.11+
  • 16GB RAM minimum (32GB recommended)
  • GPU with 8GB+ VRAM (NVIDIA preferred) — CPU fallback works but is slow
  • macOS, Linux, or Windows with WSL2

Solution

Step 1: Install Dependencies

# Create isolated environment
python -m venv voice-assistant
source voice-assistant/bin/activate  # Windows: voice-assistant\Scripts\activate

# Core packages
pip install openai-whisper sounddevice numpy scipy

# Install Ollama for Llama 4
curl -fsSL https://ollama.com/install.sh | sh
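ollama pull llama4:scout  # Mixture-of-experts model (17B active, 109B total parameters), 4-bit quantized

Expected: Ollama downloads the quantized model weights. The exact size is listed on the model's page in the Ollama library, so check it against your free disk space and memory before pulling. Whisper installs its dependencies, including torch.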

If it fails:

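  • CUDA not found: Install CUDA 12.x from nvidia.com before running pip install
  • sounddevice error on Linux: Run sudo apt install portaudio19-dev first
  • ffmpeg not found during transcription: openai-whisper decodes audio files through the ffmpeg command-line tool, so install it (sudo apt install ffmpeg on Debian/Ubuntu, brew install ffmpeg on macOS)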
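The checks above can be automated with a short script. This is a minimal sketch, assuming the two command-line tools the rest of the tutorial relies on are named ffmpeg and ollama on your PATH; the function name check_prerequisites is made up for illustration.

```python
import shutil
import sys


def check_prerequisites(tools=("ffmpeg", "ollama")) -> dict[str, bool]:
    """Map each required command-line tool to whether it is found on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


if __name__ == "__main__":
    # The tutorial asks for Python 3.11 or newer.
    if sys.version_info < (3, 11):
        print("Python 3.11+ is required")
    for tool, found in check_prerequisites().items():
        print(f"{tool}: {'found' if found else 'MISSING'}")
```

Run it inside the activated virtual environment so that the Python version check reflects the interpreter the assistant will actually use.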

Step 2: Capture Audio from Microphone

# audio_capture.py
import sounddevice as sd
import numpy as np
from scipy.io.wavfile import write
import tempfile
import os

SAMPLE_RATE = 16000  # Whisper expects 16kHz
SILENCE_THRESHOLD = 0.01  # Amplitude below this = silence
SILENCE_DURATION = 1.5  # Seconds of silence before cutting off

def record_until_silence() -> str:
    """Record audio, stop after user goes quiet. Returns path to wav file."""
    chunks = []
    silence_counter = 0
    chunk_size = int(SAMPLE_RATE * 0.1)  # 100ms chunks for responsive silence detection

    print("Listening... (speak now)")

    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype='float32') as stream:
        while True:
            chunk, _ = stream.read(chunk_size)
            chunks.append(chunk.copy())

            amplitude = np.abs(chunk).mean()

            if amplitude < SILENCE_THRESHOLD:
                silence_counter += chunk_size / SAMPLE_RATE
            else:
                silence_counter = 0  # Reset on voice activity

            if silence_counter >= SILENCE_DURATION and len(chunks) > 10:
                # Minimum 1 second of audio before cutting off
                break

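    audio = np.concatenate(chunks, axis=0)
    # mkstemp creates the file securely; tempfile.mktemp is deprecated and race-prone
    fd, tmp = tempfile.mkstemp(suffix=".wav")
    os.close(fd)  # scipy reopens the file by path
    write(tmp, SAMPLE_RATE, audio)
    return tmp

if __name__ == "__main__":
    print(f"Saved recording to {record_until_silence()}")

Expected: Running this module directly (python audio_capture.py) should print "Listening...", then the path of a temp wav file once you stop speaking.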
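To reason about the stop rule without a microphone, the same logic can be replayed over a list of per-chunk amplitudes. The sketch below is an illustration, not part of audio_capture.py: the function name chunks_until_silence is hypothetical, and it counts silent chunks as integers instead of accumulating fractions of a second, which avoids floating-point drift.

```python
def chunks_until_silence(levels, chunk_seconds=0.1, threshold=0.01,
                         silence_seconds=1.5, min_chunks=10):
    """Return how many chunks are read before recording would stop.

    `levels` holds one mean absolute amplitude per chunk.
    Returns None if the stop condition is never met.
    """
    needed = round(silence_seconds / chunk_seconds)  # silent chunks in a row
    silent = 0
    for count, level in enumerate(levels, start=1):
        if level < threshold:
            silent += 1
        else:
            silent = 0  # voice activity resets the run of silence
        if silent >= needed and count > min_chunks:
            return count
    return None


if __name__ == "__main__":
    # Two seconds of speech followed by silence.
    print(chunks_until_silence([0.2] * 20 + [0.0] * 30))
```

One consequence worth noticing: silence at the very start counts too, so a recording that begins with about 1.5 seconds of quiet stops before anyone speaks. The main loop tolerates this because it skips empty transcriptions.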


Step 3: Transcribe with Whisper v4

# transcribe.py
import whisper

# Loaded lazily on first use, then cached; loading is expensive (~3s on GPU)
_model = None

def get_model():
    global _model
    if _model is None:
        # "turbo" is Whisper v4's fastest accurate mode
        _model = whisper.load_model("turbo")
    return _model

def transcribe(audio_path: str) -> str:
    """Transcribe audio file, return text. Returns empty string on silence."""
    model = get_model()
    result = model.transcribe(
        audio_path,
        language="en",  # Skip language detection for ~20% speed boost
        fp16=True,       # Use half precision on GPU (set False for CPU)
        condition_on_previous_text=False  # Prevents hallucinations on short clips
    )

    text = result["text"].strip()
    return text

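Why condition_on_previous_text=False: By default, Whisper feeds the text it has already decoded back in as a prompt for the next audio window. If an earlier window was misrecognized, that conditioning can lock the model into repeating the same text. Disabling it prevents these ghost transcriptions on short audio.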
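The transcribe() docstring promises an empty string on silence, but Whisper can still emit short phrases for quiet clips. One possible guard, sketched here, is to drop segments that Whisper itself marks as probably not speech. It assumes the result dictionary carries a "segments" list whose entries have "text" and "no_speech_prob" keys, as openai-whisper returns; the helper name speech_text and the 0.6 cutoff are assumptions to tune.

```python
def speech_text(result: dict, max_no_speech_prob: float = 0.6) -> str:
    """Join the text of segments that are likely to contain speech."""
    kept = [
        segment["text"].strip()
        for segment in result.get("segments", [])
        # Segments without the field are kept (treated as speech).
        if segment.get("no_speech_prob", 0.0) <= max_no_speech_prob
    ]
    return " ".join(part for part in kept if part)


if __name__ == "__main__":
    fake_result = {
        "segments": [
            {"text": " Hello there.", "no_speech_prob": 0.02},
            {"text": " Thank you.", "no_speech_prob": 0.91},
        ]
    }
    print(speech_text(fake_result))
```

In transcribe.py this would replace result["text"].strip() with speech_text(result).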


Step 4: Generate Response with Llama 4

# llm.py
import requests
import json

OLLAMA_URL = "http://localhost:11434/api/generate"
SYSTEM_PROMPT = """You are a helpful voice assistant. Keep responses concise — 
2-3 sentences maximum. You're being spoken aloud, so avoid markdown, 
bullet points, or special characters."""

def ask_llama(user_input: str, history: list[dict]) -> str:
    """Send prompt to Llama 4 via Ollama, return response text."""
    
    # Build conversation context from history
    context = "\n".join([
        f"{m['role'].capitalize()}: {m['content']}" 
        for m in history[-6:]  # Last 3 exchanges = 6 messages
    ])
    
    prompt = f"{context}\nUser: {user_input}\nAssistant:"

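    response = requests.post(OLLAMA_URL, json={
        "model": "llama4:scout",
        "prompt": prompt,
        "system": SYSTEM_PROMPT,
        "stream": False,
        "options": {
            "temperature": 0.7,
            "num_predict": 150  # Caps response length for faster replies
        }
    }, timeout=120)  # Assumed limit; raise it if the first request is slow while the model loads
    response.raise_for_status()  # Surface HTTP errors such as a model that was never pulled

    return response.json()["response"].strip()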
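The prompt assembly in ask_llama is easier to check when it is pulled out into a pure function. The following is a sketch under that assumption; build_prompt is a hypothetical helper, and unlike the inline version it does not emit a leading blank line when the history is empty.

```python
def build_prompt(history: list[dict], user_input: str, max_messages: int = 6) -> str:
    """Format the most recent messages plus the new user turn as plain text."""
    # history[-0:] would return the whole list, so handle zero explicitly.
    recent = history[-max_messages:] if max_messages > 0 else []
    lines = [f"{m['role'].capitalize()}: {m['content']}" for m in recent]
    lines.append(f"User: {user_input}")
    lines.append("Assistant:")  # Cue the model to answer as the assistant
    return "\n".join(lines)


if __name__ == "__main__":
    history = [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ]
    print(build_prompt(history, "What time is it?"))
```

Keeping this separate from the HTTP call means the context format can be tested without Ollama running.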

Step 5: Wire It Together

# main.py
import os
import subprocess
from audio_capture import record_until_silence
from transcribe import transcribe
from llm import ask_llama

def speak(text: str):
    """Text-to-speech using system TTS."""
    # Pass the text as an argument list so quotes in the response can't break the command
    # macOS
    subprocess.run(["say", text], check=False)
    # Linux: subprocess.run(["espeak", text], check=False)
    # Windows: use pyttsx3

def main():
    print("Voice Assistant ready. Press Ctrl+C to quit.\n")
    history = []

    while True:
        try:
            # 1. Capture voice
            audio_path = record_until_silence()

            # 2. Transcribe
            try:
                user_text = transcribe(audio_path)
            finally:
                os.unlink(audio_path)  # Clean up temp file even if transcription fails

            if not user_text or len(user_text) < 3:
                continue  # Skip empty captures

            print(f"You: {user_text}")

            # 3. Get LLM response
            response = ask_llama(user_text, history)
            print(f"Assistant: {response}\n")

            # 4. Speak response
            speak(response)

            # 5. Update history
            history.append({"role": "user", "content": user_text})
            history.append({"role": "assistant", "content": response})

        except KeyboardInterrupt:
            print("\nGoodbye.")
            break

if __name__ == "__main__":
    main()
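In the loop above, only the last six messages ever reach the model, but the history list itself keeps growing for the whole session. A small helper can trim it as it is updated. This is a sketch; the name remember and the default of six messages (three exchanges, matching llm.py) are assumptions.

```python
def remember(history: list[dict], user_text: str, response: str,
             max_messages: int = 6) -> None:
    """Append one exchange and drop the oldest messages beyond max_messages.

    max_messages must be positive.
    """
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": response})
    if len(history) > max_messages:
        del history[:-max_messages]  # Trim in place so callers keep the same list


if __name__ == "__main__":
    history: list[dict] = []
    for i in range(5):
        remember(history, f"question {i}", f"answer {i}")
    print(len(history), history[0]["content"])
```

A plain list is kept on purpose: ask_llama slices the history, and collections.deque does not support slicing.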

Verification

Start Ollama in one terminal, then run the assistant in another. If Ollama is already running as a background service, skip the first step.

# Terminal 1
ollama serve

# Terminal 2
source voice-assistant/bin/activate
python main.py
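If the assistant fails on its first request, it can help to confirm that the server is reachable and the model was pulled. The sketch below queries Ollama's /api/tags endpoint, which lists local models. It assumes the default port 11434 and a response shaped like {"models": [{"name": ...}, ...]}; the function names are hypothetical.

```python
import json
import urllib.request


def model_is_listed(tags_payload: dict, model: str) -> bool:
    """Return True if `model` appears in a parsed /api/tags response."""
    return any(entry.get("name") == model
               for entry in tags_payload.get("models", []))


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally available models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        print("llama4:scout pulled:", model_is_listed(fetch_tags(), "llama4:scout"))
    except OSError as exc:  # Covers connection refused and timeouts
        print("Ollama is not reachable:", exc)
```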

You should see:

Voice Assistant ready. Press Ctrl+C to quit.

Listening... (speak now)
You: What's the capital of France?
Assistant: The capital of France is Paris, which has served as the country's capital for centuries.

End-to-end latency should be under 3 seconds on a modern GPU (transcription ~0.5s, inference ~1.5s, TTS ~0.5s).
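To see where the time goes on your own hardware, each stage can be wrapped in a small timer. This is a minimal sketch using only the standard library; the timed helper and the stage names are made up for illustration.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage: str, timings: dict[str, float]):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


if __name__ == "__main__":
    timings: dict[str, float] = {}
    with timed("demo", timings):
        time.sleep(0.05)  # Stand-in for transcribe(), ask_llama(), or speak()
    for stage, seconds in timings.items():
        print(f"{stage}: {seconds:.2f}s")
```

In main.py, the calls to transcribe, ask_llama, and speak would each go inside their own with timed(...) block, and the timings dictionary can be printed after every exchange.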

[Figure: Terminal showing the voice assistant processing and responding in real time]


Performance Tuning

If responses are slow, try these in order:

# In transcribe.py — use smaller model for faster STT
_model = whisper.load_model("small")  # Less accurate but 3x faster

# In llm.py: shrink the context window by changing history[-6:] to
history[-2:]  # Only the last exchange instead of 3

# In audio_capture.py: shorten the silence duration
SILENCE_DURATION = 0.8  # More responsive cutoff

For CPU-only machines, set fp16=False in the transcribe call and keep llama4:scout in 4-bit. Expect response times of 8-15 seconds.
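Instead of editing fp16 by hand for each machine, the transcription options can be derived from whether a GPU is present. The sketch below keeps the decision in a pure function; whisper_options is a hypothetical helper, and the commented usage assumes torch is importable, which it is once openai-whisper is installed.

```python
def whisper_options(use_gpu: bool) -> dict:
    """Keyword arguments for model.transcribe(), with fp16 tied to GPU use."""
    return {
        "language": "en",
        "fp16": use_gpu,  # Half precision only makes sense on a GPU
        "condition_on_previous_text": False,
    }


# Usage inside transcribe.py (assumption: CUDA is the only GPU backend in use):
#   import torch
#   result = model.transcribe(audio_path, **whisper_options(torch.cuda.is_available()))

if __name__ == "__main__":
    print(whisper_options(use_gpu=False))
```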


What You Learned

  • Whisper v4's turbo model balances accuracy and speed for real-time use
  • condition_on_previous_text=False prevents hallucinations on short clips
  • Capping num_predict in Ollama keeps voice responses appropriately short
  • History trimming prevents context from growing unbounded over long sessions

Limitation: Whisper v4 struggles with heavy accents and technical jargon. For specialized domains, fine-tune on domain-specific audio or use large-v3 at the cost of ~2x latency.

When NOT to use this: If you need sub-1-second response times, cloud STT + local LLM is faster. Local Whisper adds ~500ms that cloud APIs skip.


Tested on Python 3.12, Whisper v4 turbo, Ollama 0.5, Llama 4 Scout — Ubuntu 24.04 and macOS 15.3