Problem: One Model Can't Do Everything Well
You're building a custom AI app and routing every request through the same general-purpose LLM. Code questions, legal summaries, and casual chat all hit the same endpoint. It's slow, expensive, and the outputs feel generic.
Mixture of Experts (MoE) routing fixes this by sending each request to the model best suited for it. MoE architectures like Mixtral apply the same idea internally, routing each token to specialized sub-networks; here you apply it one level up, across whole models, and under your control.
You'll learn:
- How to classify user intent and route it to specialized experts
- How to build a lightweight Python router without a framework
- How to add fallback logic and confidence thresholds
Time: 45 min | Level: Advanced
Why This Happens
General-purpose LLMs are trained to be decent at everything. That generality costs you at inference time — you're paying for capability you don't need on most requests.
MoE solves this with a gating network: a fast classifier that reads the input and decides which expert (model, prompt config, or API endpoint) handles it. Only the winning expert runs. The rest don't.
Common symptoms that tell you to add routing:
- Latency varies wildly across request types
- Costs scale with request volume, not complexity
- Output quality is inconsistent — great for some tasks, mediocre for others
Router classifies intent → selects expert → expert generates response
Solution
Step 1: Define Your Experts
Start by mapping your use cases to specialists. Each expert is a config object that defines the model, system prompt, and the kinds of queries it handles.
# experts.py
from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    model: str
    system_prompt: str
    keywords: list[str]
    temperature: float = 0.7

EXPERTS: list[Expert] = [
    Expert(
        name="code",
        model="claude-opus-4-6",
        system_prompt="You are a senior software engineer. Give concise, working code with explanations.",
        keywords=["code", "function", "bug", "error", "implement", "python", "typescript", "debug"],
        temperature=0.2,  # Low temp = deterministic code output
    ),
    Expert(
        name="legal",
        model="claude-opus-4-6",
        system_prompt="You are a legal analyst. Summarize clearly, flag risks, avoid giving direct legal advice.",
        keywords=["contract", "liability", "clause", "legal", "compliance", "regulation", "lawsuit"],
        temperature=0.3,
    ),
    Expert(
        name="creative",
        model="claude-sonnet-4-6",
        system_prompt="You are a creative writing partner. Be imaginative, engaging, and vivid.",
        keywords=["write", "story", "poem", "creative", "imagine", "describe", "draft"],
        temperature=0.9,  # High temp = more creative variation
    ),
    Expert(
        name="general",
        model="claude-haiku-4-5-20251001",
        system_prompt="You are a helpful assistant. Answer clearly and concisely.",
        keywords=[],  # Fallback — catches everything else
        temperature=0.7,
    ),
]
Expected: Four expert configs covering code, legal, creative, and general queries.
If it fails:
- Import error on dataclass: you need Python 3.10+. Run python --version to check.
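One subtle failure mode worth guarding against: the tokenizer in the next step lowercases the query, so an uppercase or multi-word keyword can never match. A small startup check catches bad configs early. This is a sketch with the Expert dataclass inlined so it runs standalone; validate_experts is a hypothetical helper, not part of the files above.

```python
from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    model: str
    system_prompt: str
    keywords: list[str]
    temperature: float = 0.7

def validate_experts(experts: list[Expert]) -> None:
    # Exactly one fallback: the expert with an empty keyword list
    fallbacks = [e for e in experts if not e.keywords]
    assert len(fallbacks) == 1, "need exactly one fallback expert"
    for e in experts:
        for kw in e.keywords:
            # Keywords must be lowercase single tokens, or tokenize() output can never match them
            assert kw == kw.lower(), f"{e.name}: keyword {kw!r} is not lowercase"
            assert " " not in kw, f"{e.name}: keyword {kw!r} contains whitespace"

experts = [
    Expert("code", "claude-opus-4-6", "You are an engineer.", ["code", "debug"], 0.2),
    Expert("general", "claude-haiku-4-5-20251001", "You are helpful.", []),
]
validate_experts(experts)  # Raises AssertionError on a bad config, silent otherwise
```

Call this once at import time so a misconfigured expert fails loudly at startup rather than silently never winning a route.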
Step 2: Build the Router
The router scores each expert against the incoming query and picks the best match. This version uses plain keyword overlap; a higher-accuracy embedding-based variant is covered under Going Further.
# router.py
import re

from experts import Expert, EXPERTS

def tokenize(text: str) -> set[str]:
    # Lowercase, strip punctuation, split on whitespace
    return set(re.sub(r"[^\w\s]", "", text.lower()).split())

def score_expert(query_tokens: set[str], expert: Expert) -> float:
    if not expert.keywords:
        return 0.0  # General fallback always scores zero — selected only if others also zero
    keyword_set = set(expert.keywords)
    overlap = query_tokens & keyword_set
    # Coverage score: fraction of this expert's keywords that appear in the query
    return len(overlap) / len(keyword_set)

def route(query: str, threshold: float = 0.05) -> Expert:
    tokens = tokenize(query)
    scores = [(score_expert(tokens, expert), expert) for expert in EXPERTS]
    # Sort descending by score
    scores.sort(key=lambda x: x[0], reverse=True)
    best_score, best_expert = scores[0]
    if best_score < threshold:
        # Nothing matched well enough — use the general fallback
        return next(e for e in EXPERTS if e.name == "general")
    return best_expert
Why threshold matters: without it, a query that shares only a single incidental token with some keyword list (or no tokens at all, leaving every score tied at zero) still gets assigned to whichever expert happens to sort first. The threshold enforces a minimum confidence before committing to a specialist; anything below it goes to the general fallback.
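To see the scoring concretely, here is a self-contained sketch that inlines the same tokenize and scoring logic (so it runs without the other files) and walks the first test query against the code expert's keywords:

```python
import re

def tokenize(text: str) -> set[str]:
    # Same logic as router.py: lowercase, strip punctuation, split on whitespace
    return set(re.sub(r"[^\w\s]", "", text.lower()).split())

def score(query_tokens: set[str], keywords: list[str]) -> float:
    if not keywords:
        return 0.0
    keyword_set = set(keywords)
    return len(query_tokens & keyword_set) / len(keyword_set)

code_keywords = ["code", "function", "bug", "error", "implement", "python", "typescript", "debug"]
tokens = tokenize("Can you debug this Python function for me?")
# Matches: debug, python, function -> 3 of 8 keywords
print(score(tokens, code_keywords))  # 0.375
```

With the default threshold of 0.05, a 0.375 score routes confidently to the code expert, while a query matching nothing scores 0.0 and falls through to general.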
Step 3: Call the Winning Expert
Now wire the router to the Anthropic SDK. The router's output tells you which model and system prompt to use.
# app.py
import anthropic

from router import route

client = anthropic.Anthropic()

def ask(query: str) -> dict:
    expert = route(query)
    print(f"[Router] Selected expert: {expert.name} ({expert.model})")
    response = client.messages.create(
        model=expert.model,
        max_tokens=1024,
        system=expert.system_prompt,
        messages=[{"role": "user", "content": query}],
        temperature=expert.temperature,
    )
    return {
        "expert": expert.name,
        "model": expert.model,
        "response": response.content[0].text,
    }

if __name__ == "__main__":
    queries = [
        "Can you debug this Python function for me?",
        "Summarize the indemnification clause in this contract.",
        "Write a short story about a lighthouse keeper.",
        "What's the capital of France?",
    ]
    for q in queries:
        result = ask(q)
        print(f"Q: {q}")
        print(f"Expert: {result['expert']} | Model: {result['model']}")
        print(f"A: {result['response'][:120]}...\n")
Expected output:
[Router] Selected expert: code (claude-opus-4-6)
[Router] Selected expert: legal (claude-opus-4-6)
[Router] Selected expert: creative (claude-sonnet-4-6)
[Router] Selected expert: general (claude-haiku-4-5-20251001)
If it fails:
- AuthenticationError: Set your API key with export ANTHROPIC_API_KEY=your_key_here
- temperature parameter rejected: Some models don't accept temperature at the API level. Remove it and control randomness via system prompt wording instead.
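Because ask() makes real API calls, unit-testing the dispatch logic directly is slow and flaky. One way to test it offline is a stub client that mimics the messages.create shape. This is a sketch: FakeClient and the inlined ask are hypothetical stand-ins for illustration, not part of the SDK.

```python
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass
class Expert:
    name: str
    model: str
    system_prompt: str
    temperature: float = 0.7

class FakeMessages:
    def create(self, *, model, max_tokens, system, messages, temperature=None):
        # Echo the routing decision back so tests can inspect which model was chosen
        text = f"[{model}] {messages[0]['content']}"
        return SimpleNamespace(content=[SimpleNamespace(text=text)])

class FakeClient:
    messages = FakeMessages()

def ask(query: str, expert: Expert, client) -> dict:
    # Same shape as app.py's ask(), but the client is injected instead of global
    response = client.messages.create(
        model=expert.model,
        max_tokens=1024,
        system=expert.system_prompt,
        messages=[{"role": "user", "content": query}],
        temperature=expert.temperature,
    )
    return {"expert": expert.name, "model": expert.model,
            "response": response.content[0].text}

expert = Expert("code", "claude-opus-4-6", "You are a senior software engineer.", 0.2)
result = ask("Can you debug this?", expert, FakeClient())
print(result["response"])
```

Injecting the client also makes the fallback-chain logic in Step 4 testable: swap in a stub whose create raises to exercise the except branch.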
Step 4: Add Confidence Logging and Fallback Chains
For production, you want visibility into routing decisions and a chain of fallbacks if an expert fails.
# router_v2.py — production-hardened version
import logging

from experts import Expert, EXPERTS
from router import tokenize, score_expert

logger = logging.getLogger(__name__)

def route_with_scores(query: str, threshold: float = 0.05) -> tuple[Expert, dict[str, float]]:
    tokens = tokenize(query)
    scored = {expert.name: score_expert(tokens, expert) for expert in EXPERTS}
    logger.info(f"Routing scores: {scored}")
    best_name = max(scored, key=lambda k: scored[k])
    best_score = scored[best_name]
    if best_score < threshold:
        logger.warning(f"No confident match (best: {best_name}={best_score:.3f}). Using fallback.")
        fallback = next(e for e in EXPERTS if e.name == "general")
        return fallback, scored
    best_expert = next(e for e in EXPERTS if e.name == best_name)
    return best_expert, scored

def ask_with_fallback(query: str, client) -> dict:
    expert, scores = route_with_scores(query)
    try:
        response = client.messages.create(
            model=expert.model,
            max_tokens=1024,
            system=expert.system_prompt,
            messages=[{"role": "user", "content": query}],
        )
        return {"expert": expert.name, "scores": scores, "response": response.content[0].text}
    except Exception as e:
        logger.error(f"Expert {expert.name} failed: {e}. Falling back to general.")
        # Retry with the general fallback
        fallback = next(ex for ex in EXPERTS if ex.name == "general")
        response = client.messages.create(
            model=fallback.model,
            max_tokens=1024,
            system=fallback.system_prompt,
            messages=[{"role": "user", "content": query}],
        )
        return {"expert": "general (fallback)", "scores": scores, "response": response.content[0].text}
Expected: confidence scores per expert logged for each request — useful for tuning thresholds.
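Once scores are being logged, you can replay them offline to pick a threshold instead of guessing. A sketch of that tuning loop, where best_scores is hypothetical sample data standing in for the best-match score of each logged request:

```python
# Hypothetical best-match scores harvested from production routing logs
best_scores = [0.0, 0.125, 0.375, 0.0, 0.25, 0.04, 0.125, 0.0]

def fallback_rate(scores: list[float], threshold: float) -> float:
    """Fraction of requests that would have hit the general fallback at this threshold."""
    return sum(s < threshold for s in scores) / len(scores)

# Sweep candidate thresholds and see how much traffic each sends to the fallback
for t in (0.02, 0.05, 0.10, 0.20):
    print(f"threshold={t:.2f} -> fallback rate {fallback_rate(best_scores, t):.0%}")
```

Pick the largest threshold whose fallback rate is still acceptable: too low and weak matches get misrouted to specialists, too high and everything lands on the general model.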
Verification
python app.py
Check that each of the four test queries routes to the expected expert. Then run with a vague query to verify fallback:
result = ask("Hey, what do you think?")
print(result["expert"]) # Should print "general"
You should see: general for low-signal queries and domain-specific experts for clear intent.
Going Further: Embedding-Based Routing
Keyword overlap is fast but brittle. For higher accuracy, replace score_expert with cosine similarity over sentence embeddings.
# embedding_router.py — higher accuracy, higher latency
import numpy as np

from experts import Expert, EXPERTS

def embed(text: str) -> list[float]:
    # Use any embedding model — this shows the pattern with a placeholder
    # In practice: use sentence-transformers, OpenAI embeddings, or Voyage AI
    raise NotImplementedError("Plug in your embedding model here")

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pre-compute expert embeddings at startup (not per-request)
EXPERT_EMBEDDINGS = {
    expert.name: embed(" ".join(expert.keywords))
    for expert in EXPERTS
    if expert.keywords
}

def route_by_embedding(query: str) -> Expert:
    query_vec = embed(query)
    similarities = {
        name: cosine_similarity(query_vec, vec)
        for name, vec in EXPERT_EMBEDDINGS.items()
    }
    best = max(similarities, key=similarities.get)
    return next(e for e in EXPERTS if e.name == best)
Use this when keyword matching misroutes edge cases. The tradeoff: ~50–100ms extra latency per routing decision.
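Before plugging in a real embedding model, it's worth sanity-checking the similarity helper on toy vectors. A self-contained sketch with the function inlined:

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Same helper as embedding_router.py
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ~1.0 (cosine ignores magnitude)
```

The last case is why cosine beats raw dot product here: expert keyword strings of different lengths produce embeddings of different magnitudes, and cosine compares direction only.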
What You Learned
- MoE routing gives you cost and quality control by matching request type to the right model
- Keyword overlap is a solid starting point — embedding similarity is better but slower
- Always implement a fallback expert so no request is left unhandled
- Log routing scores in production; your threshold will need tuning once real traffic hits
Limitations to know:
- Keyword routing fails on ambiguous or multi-topic queries — embedding-based routing handles these better
- This approach routes the whole request to one expert. For complex tasks, you may want cascaded routing — run a cheap model first, escalate to a stronger one if confidence is low
- Not a replacement for fine-tuning when you need domain depth that a system prompt can't provide
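The cascaded-routing idea mentioned in the limitations can be sketched in a few lines. Everything here is a hypothetical stand-in: cascade, fake_call, and fake_confidence are illustrative, and real confidence estimation (e.g. asking the model to self-rate) is the hard part.

```python
# Cascaded routing sketch: try a cheap model first, escalate on low confidence
CHEAP_MODEL = "claude-haiku-4-5-20251001"  # model IDs reused from the expert configs above
STRONG_MODEL = "claude-opus-4-6"

def cascade(query: str, call, confidence_of, threshold: float = 0.5) -> str:
    """call(model, query) -> answer text; confidence_of(answer) -> float in [0, 1]."""
    answer = call(CHEAP_MODEL, query)
    if confidence_of(answer) < threshold:  # escalate only when the cheap answer looks weak
        answer = call(STRONG_MODEL, query)
    return answer

# Toy stand-ins so the control flow runs without an API key
def fake_call(model: str, query: str) -> str:
    return f"{model}: answer to {query!r}"

def fake_confidence(answer: str) -> float:
    return 0.2 if "hard" in answer else 0.9

print(cascade("an easy question", fake_call, fake_confidence))
print(cascade("a hard question", fake_call, fake_confidence))
```

The appeal over pure MoE routing: most requests pay only the cheap-model price, and the strong model runs exactly when the cheap one admits defeat.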
Tested on Python 3.12, anthropic SDK 0.40+, macOS & Ubuntu 24.04