Problem: One Model Can't Do Everything Well
You're building a custom AI app and routing every request through the same general-purpose LLM. Code questions, legal summaries, and casual chat all hit the same endpoint. It's slow, expensive, and the outputs feel generic.
Mixture of Experts (MoE) routing fixes this by sending each request to the model best suited for it. MoE architectures like Mixtral apply the same idea internally, routing each token to specialized sub-networks; here you apply it one level up, across whole models, and under your control.
You'll learn:
- How to classify user intent and route it to specialized experts
- How to build a lightweight Python router without a framework
- How to add fallback logic and confidence thresholds
Time: 45 min | Level: Advanced
Why This Happens
General-purpose LLMs are trained to be decent at everything. That generality costs you at inference time — you're paying for capability you don't need on most requests.
MoE solves this with a gating network: a fast classifier that reads the input and decides which expert (model, prompt config, or API endpoint) handles it. Only the winning expert runs. The rest don't.
Common symptoms that tell you to add routing:
- Latency varies wildly across request types
- Costs scale with request volume, not complexity
- Output quality is inconsistent — great for some tasks, mediocre for others
Router classifies intent → selects expert → expert generates response
Solution
Step 1: Define Your Experts
Start by mapping your use cases to specialists. Each expert is a config object that defines the model, system prompt, and the kinds of queries it handles.
# experts.py
from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    model: str
    system_prompt: str
    keywords: list[str]
    temperature: float = 0.7

EXPERTS: list[Expert] = [
    Expert(
        name="code",
        model="claude-opus-4-6",
        system_prompt="You are a senior software engineer. Give concise, working code with explanations.",
        keywords=["code", "function", "bug", "error", "implement", "python", "typescript", "debug"],
        temperature=0.2,  # Low temp = deterministic code output
    ),
    Expert(
        name="legal",
        model="claude-opus-4-6",
        system_prompt="You are a legal analyst. Summarize clearly, flag risks, avoid giving direct legal advice.",
        keywords=["contract", "liability", "clause", "legal", "compliance", "regulation", "lawsuit"],
        temperature=0.3,
    ),
    Expert(
        name="creative",
        model="claude-sonnet-4-6",
        system_prompt="You are a creative writing partner. Be imaginative, engaging, and vivid.",
        keywords=["write", "story", "poem", "creative", "imagine", "describe", "draft"],
        temperature=0.9,  # High temp = more creative variation
    ),
    Expert(
        name="general",
        model="claude-haiku-4-5-20251001",
        system_prompt="You are a helpful assistant. Answer clearly and concisely.",
        keywords=[],  # Fallback — catches everything else
        temperature=0.7,
    ),
]
Expected: Four expert configs covering code, legal, creative, and general queries.
If it fails:
- Import error on dataclass: you need Python 3.10+. Run python --version to check.
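One subtle failure mode worth guarding against: the tokenizer in the next step lowercases the query, so an uppercase or multi-word keyword can never match. A small startup check catches bad configs early. This is a sketch with the Expert dataclass inlined so it runs standalone; validate_experts is a hypothetical helper, not part of the files above.

```python
from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    model: str
    system_prompt: str
    keywords: list[str]
    temperature: float = 0.7

def validate_experts(experts: list[Expert]) -> None:
    # Exactly one fallback: the expert with an empty keyword list
    fallbacks = [e for e in experts if not e.keywords]
    assert len(fallbacks) == 1, "need exactly one fallback expert"
    for e in experts:
        for kw in e.keywords:
            # Keywords must be lowercase single tokens, or tokenize() output can never match them
            assert kw == kw.lower(), f"{e.name}: keyword {kw!r} is not lowercase"
            assert " " not in kw, f"{e.name}: keyword {kw!r} contains whitespace"

experts = [
    Expert("code", "claude-opus-4-6", "You are an engineer.", ["code", "debug"], 0.2),
    Expert("general", "claude-haiku-4-5-20251001", "You are helpful.", []),
]
validate_experts(experts)  # Raises AssertionError on a bad config, silent otherwise
```

Call this once at import time so a misconfigured expert fails loudly at startup rather than silently never winning a route.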
Step 2: Build the Router
The router scores each expert against the incoming query and picks the best match. This version uses plain keyword overlap; a higher-accuracy embedding-based variant is covered under Going Further.
# router.py
import re

from experts import Expert, EXPERTS

def tokenize(text: str) -> set[str]:
    # Lowercase, strip punctuation, split on whitespace
    return set(re.sub(r"[^\w\s]", "", text.lower()).split())

def score_expert(query_tokens: set[str], expert: Expert) -> float:
    if not expert.keywords:
        return 0.0  # General fallback always scores zero — selected only if others also zero
    keyword_set = set(expert.keywords)
    overlap = query_tokens & keyword_set
    # Coverage score: fraction of this expert's keywords that appear in the query
    return len(overlap) / len(keyword_set)

def route(query: str, threshold: float = 0.05) -> Expert:
    tokens = tokenize(query)
    scores = [(score_expert(tokens, expert), expert) for expert in EXPERTS]
    # Sort descending by score
    scores.sort(key=lambda x: x[0], reverse=True)
    best_score, best_expert = scores[0]
    if best_score < threshold:
        # Nothing matched well enough — use the general fallback
        return next(e for e in EXPERTS if e.name == "general")
    return best_expert
Why threshold matters: without it, a query that shares only a single incidental token with some keyword list (or no tokens at all, leaving every score tied at zero) still gets assigned to whichever expert happens to sort first. The threshold enforces a minimum confidence before committing to a specialist; anything below it goes to the general fallback.
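To see the scoring concretely, here is a self-contained sketch that inlines the same tokenize and scoring logic (so it runs without the other files) and walks the first test query against the code expert's keywords:

```python
import re

def tokenize(text: str) -> set[str]:
    # Same logic as router.py: lowercase, strip punctuation, split on whitespace
    return set(re.sub(r"[^\w\s]", "", text.lower()).split())

def score(query_tokens: set[str], keywords: list[str]) -> float:
    if not keywords:
        return 0.0
    keyword_set = set(keywords)
    return len(query_tokens & keyword_set) / len(keyword_set)

code_keywords = ["code", "function", "bug", "error", "implement", "python", "typescript", "debug"]
tokens = tokenize("Can you debug this Python function for me?")
# Matches: debug, python, function -> 3 of 8 keywords
print(score(tokens, code_keywords))  # 0.375
```

With the default threshold of 0.05, a 0.375 score routes confidently to the code expert, while a query matching nothing scores 0.0 and falls through to general.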
Step 3: Call the Winning Expert
Now wire the router to the Anthropic SDK. The router's output tells you which model and system prompt to use.
# app.py
import anthropic

from router import route

client = anthropic.Anthropic()

def ask(query: str) -> dict:
    expert = route(query)
    print(f"[Router] Selected expert: {expert.name} ({expert.model})")
    response = client.messages.create(
        model=expert.model,
        max_tokens=1024,
        system=expert.system_prompt,
        messages=[{"role": "user", "content": query}],
        temperature=expert.temperature,
    )
    return {
        "expert": expert.name,
        "model": expert.model,
        "response": response.content[0].text,
    }

if __name__ == "__main__":
    queries = [
        "Can you debug this Python function for me?",
        "Summarize the indemnification clause in this contract.",
        "Write a short story about a lighthouse keeper.",
        "What's the capital of France?",
    ]
    for q in queries:
        result = ask(q)
        print(f"Q: {q}")
        print(f"Expert: {result['expert']} | Model: {result['model']}")
        print(f"A: {result['response'][:120]}...\n")
Expected output:
[Router] Selected expert: code (claude-opus-4-6)
[Router] Selected expert: legal (claude-opus-4-6)
[Router] Selected expert: creative (claude-sonnet-4-6)
[Router] Selected expert: general (claude-haiku-4-5-20251001)
If it fails:
- AuthenticationError: Set your API key with export ANTHROPIC_API_KEY=your_key_here
- temperature parameter rejected: Some models don't accept temperature at the API level. Remove it and control randomness via system prompt wording instead.
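Because ask() makes real API calls, unit-testing the dispatch logic directly is slow and flaky. One way to test it offline is a stub client that mimics the messages.create shape. This is a sketch: FakeClient and the inlined ask are hypothetical stand-ins for illustration, not part of the SDK.

```python
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass
class Expert:
    name: str
    model: str
    system_prompt: str
    temperature: float = 0.7

class FakeMessages:
    def create(self, *, model, max_tokens, system, messages, temperature=None):
        # Echo the routing decision back so tests can inspect which model was chosen
        text = f"[{model}] {messages[0]['content']}"
        return SimpleNamespace(content=[SimpleNamespace(text=text)])

class FakeClient:
    messages = FakeMessages()

def ask(query: str, expert: Expert, client) -> dict:
    # Same shape as app.py's ask(), but the client is injected instead of global
    response = client.messages.create(
        model=expert.model,
        max_tokens=1024,
        system=expert.system_prompt,
        messages=[{"role": "user", "content": query}],
        temperature=expert.temperature,
    )
    return {"expert": expert.name, "model": expert.model,
            "response": response.content[0].text}

expert = Expert("code", "claude-opus-4-6", "You are a senior software engineer.", 0.2)
result = ask("Can you debug this?", expert, FakeClient())
print(result["response"])
```

Injecting the client also makes the fallback-chain logic in Step 4 testable: swap in a stub whose create raises to exercise the except branch.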
Step 4: Add Confidence Logging and Fallback Chains
For production, you want visibility into routing decisions and a chain of fallbacks if an expert fails.
# router_v2.py — production-hardened version
import logging

from experts import Expert, EXPERTS
from router import tokenize, score_expert

logger = logging.getLogger(__name__)

def route_with_scores(query: str, threshold: float = 0.05) -> tuple[Expert, dict[str, float]]:
    tokens = tokenize(query)
    scored = {expert.name: score_expert(tokens, expert) for expert in EXPERTS}
    logger.info(f"Routing scores: {scored}")
    best_name = max(scored, key=lambda k: scored[k])
    best_score = scored[best_name]
    if best_score < threshold:
        logger.warning(f"No confident match (best: {best_name}={best_score:.3f}). Using fallback.")
        fallback = next(e for e in EXPERTS if e.name == "general")
        return fallback, scored
    best_expert = next(e for e in EXPERTS if e.name == best_name)
    return best_expert, scored

def ask_with_fallback(query: str, client) -> dict:
    expert, scores = route_with_scores(query)
    try:
        response = client.messages.create(
            model=expert.model,
            max_tokens=1024,
            system=expert.system_prompt,
            messages=[{"role": "user", "content": query}],
        )
        return {"expert": expert.name, "scores": scores, "response": response.content[0].text}
    except Exception as e:
        logger.error(f"Expert {expert.name} failed: {e}. Falling back to general.")
        # Retry with the general fallback
        fallback = next(ex for ex in EXPERTS if ex.name == "general")
        response = client.messages.create(
            model=fallback.model,
            max_tokens=1024,
            system=fallback.system_prompt,
            messages=[{"role": "user", "content": query}],
        )
        return {"expert": "general (fallback)", "scores": scores, "response": response.content[0].text}
Expected: confidence scores per expert logged for each request — useful for tuning thresholds.
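Once scores are being logged, you can replay them offline to pick a threshold instead of guessing. A sketch of that tuning loop, where best_scores is hypothetical sample data standing in for the best-match score of each logged request:

```python
# Hypothetical best-match scores harvested from production routing logs
best_scores = [0.0, 0.125, 0.375, 0.0, 0.25, 0.04, 0.125, 0.0]

def fallback_rate(scores: list[float], threshold: float) -> float:
    """Fraction of requests that would have hit the general fallback at this threshold."""
    return sum(s < threshold for s in scores) / len(scores)

# Sweep candidate thresholds and see how much traffic each sends to the fallback
for t in (0.02, 0.05, 0.10, 0.20):
    print(f"threshold={t:.2f} -> fallback rate {fallback_rate(best_scores, t):.0%}")
```

Pick the largest threshold whose fallback rate is still acceptable: too low and weak matches get misrouted to specialists, too high and everything lands on the general model.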
Verification
python app.py
Check that each of the four test queries routes to the expected expert. Then run with a vague query to verify fallback:
result = ask("Hey, what do you think?")
print(result["expert"]) # Should print "general"
You should see: general for low-signal queries and domain-specific experts for clear intent.
Going Further: Embedding-Based Routing
Keyword overlap is fast but brittle. For higher accuracy, replace score_expert with cosine similarity over sentence embeddings.
# embedding_router.py — higher accuracy, higher latency
import numpy as np

from experts import Expert, EXPERTS

def embed(text: str) -> list[float]:
    # Use any embedding model — this shows the pattern with a placeholder
    # In practice: use sentence-transformers, OpenAI embeddings, or Voyage AI
    raise NotImplementedError("Plug in your embedding model here")

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pre-compute expert embeddings at startup (not per-request)
EXPERT_EMBEDDINGS = {
    expert.name: embed(" ".join(expert.keywords))
    for expert in EXPERTS
    if expert.keywords
}

def route_by_embedding(query: str) -> Expert:
    query_vec = embed(query)
    similarities = {
        name: cosine_similarity(query_vec, vec)
        for name, vec in EXPERT_EMBEDDINGS.items()
    }
    best = max(similarities, key=similarities.get)
    return next(e for e in EXPERTS if e.name == best)
Use this when keyword matching misroutes edge cases. The tradeoff: ~50–100ms extra latency per routing decision.
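Before plugging in a real embedding model, it's worth sanity-checking the similarity helper on toy vectors. A self-contained sketch with the function inlined:

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Same helper as embedding_router.py
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ~1.0 (cosine ignores magnitude)
```

The last case is why cosine beats raw dot product here: expert keyword strings of different lengths produce embeddings of different magnitudes, and cosine compares direction only.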
What You Learned
- MoE routing gives you cost and quality control by matching request type to the right model
- Keyword overlap is a solid starting point — embedding similarity is better but slower
- Always implement a fallback expert so no request is left unhandled
- Log routing scores in production; your threshold will need tuning once real traffic hits
Limitations to know:
- Keyword routing fails on ambiguous or multi-topic queries — embedding-based routing handles these better
- This approach routes the whole request to one expert. For complex tasks, you may want cascaded routing — run a cheap model first, escalate to a stronger one if confidence is low
- Not a replacement for fine-tuning when you need domain depth that a system prompt can't provide
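The cascaded-routing idea mentioned in the limitations can be sketched in a few lines. Everything here is a hypothetical stand-in: cascade, fake_call, and fake_confidence are illustrative, and real confidence estimation (e.g. asking the model to self-rate) is the hard part.

```python
# Cascaded routing sketch: try a cheap model first, escalate on low confidence
CHEAP_MODEL = "claude-haiku-4-5-20251001"  # model IDs reused from the expert configs above
STRONG_MODEL = "claude-opus-4-6"

def cascade(query: str, call, confidence_of, threshold: float = 0.5) -> str:
    """call(model, query) -> answer text; confidence_of(answer) -> float in [0, 1]."""
    answer = call(CHEAP_MODEL, query)
    if confidence_of(answer) < threshold:  # escalate only when the cheap answer looks weak
        answer = call(STRONG_MODEL, query)
    return answer

# Toy stand-ins so the control flow runs without an API key
def fake_call(model: str, query: str) -> str:
    return f"{model}: answer to {query!r}"

def fake_confidence(answer: str) -> float:
    return 0.2 if "hard" in answer else 0.9

print(cascade("an easy question", fake_call, fake_confidence))
print(cascade("a hard question", fake_call, fake_confidence))
```

The appeal over pure MoE routing: most requests pay only the cheap-model price, and the strong model runs exactly when the cheap one admits defeat.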
Tested on Python 3.12, anthropic SDK 0.40+, macOS & Ubuntu 24.04