Groq Compound AI with Mixture-of-Agents (MoA) inference lets you run multiple LLMs in parallel on Groq's LPU hardware and aggregate their outputs into a single, higher-quality response — all in under two seconds on free-tier API keys.
Single-model calls plateau. No matter how large the model, one forward pass misses reasoning paths another model would catch. MoA fixes this by running several "proposer" models concurrently, then feeding all their drafts to an "aggregator" model that synthesizes the best answer. Groq's LPU makes this practical: parallel calls that would stall on GPU-bound APIs finish in milliseconds here.
You'll learn:
- How the MoA proposer → aggregator pipeline works on Groq
- How to implement concurrent proposer calls with `asyncio` and the Groq Python SDK
- How to tune model selection, temperature, and aggregator prompt for production use
- When MoA improves quality vs. when a single large model is cheaper and sufficient
Time: 20 min | Difficulty: Intermediate
Why Single-Model Inference Hits a Ceiling
Every LLM samples from a probability distribution. One sample = one reasoning path. That path may be confidently wrong, especially on multi-step problems, ambiguous instructions, or adversarial inputs.
Symptoms of single-model ceiling:
- Correct 80–90% of the time but fails on edge cases you can't predict
- Chain-of-thought improves results but adds latency without ensemble diversity
- Larger models reduce errors but multiply cost linearly with no quality guarantee
MoA breaks this by treating inference as an ensemble problem. Several smaller, fast models explore different reasoning paths in parallel. An aggregator — usually a stronger model — reads all drafts and synthesizes a final answer that's empirically better than any single proposer alone.
MoA pipeline: proposers run in parallel on Groq LPUs → drafts collected → aggregator synthesizes final answer
How Groq's LPU Makes MoA Practical
GPU inference stacks requests in a queue. Parallel calls to the same GPU-backed API often don't truly run in parallel — they serialize behind each other, so MoA on GPU APIs can multiply latency by the number of proposers.
Groq's LPU (Language Processing Unit) is a deterministic, streaming compute unit with no memory bandwidth bottleneck. Each request gets dedicated silicon. Three parallel proposer calls take roughly the same wall-clock time as one. That's the architecture assumption MoA depends on — and why Groq is the natural backend.
Groq's free tier (as of March 2026) gives you 14,400 requests/day on llama-3.1-8b-instant and 6,000 requests/day on llama-3.3-70b-versatile, both at no cost — enough to run MoA experiments without a billing profile.
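As a back-of-the-envelope check, those limits bound how many full MoA runs you get per day. A quick estimator, using the request limits quoted above (it ignores `gemma2-9b-it`, whose free-tier limit isn't stated here, and treats the limits as a snapshot that may change):

```python
# Estimate how many complete MoA runs/day the free-tier limits allow.
# One run in this tutorial makes 2 calls to llama-3.1-8b-instant
# (proposers) and 1 call to llama-3.3-70b-versatile (aggregator).
def daily_moa_budget(limits: dict[str, int], calls_per_run: dict[str, int]) -> int:
    """Return the number of complete runs before any model's daily limit is hit."""
    return min(limits[m] // n for m, n in calls_per_run.items())

limits = {
    "llama-3.1-8b-instant": 14_400,    # requests/day, from the text above
    "llama-3.3-70b-versatile": 6_000,  # requests/day, from the text above
}
calls_per_run = {
    "llama-3.1-8b-instant": 2,
    "llama-3.3-70b-versatile": 1,
}
print(daily_moa_budget(limits, calls_per_run))  # → 6000
```

The aggregator is the binding constraint here: 6,000 runs/day, not the proposers' 7,200.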
Prerequisites
- Python 3.12+
- Groq API key — get one free at console.groq.com
- `groq` and `asyncio` (stdlib) — no other dependencies required
pip install groq
Set your key:
export GROQ_API_KEY="gsk_your_key_here"
Solution
Step 1: Define Proposer and Aggregator Models
MoA uses two roles. Proposers are fast, cheap models that run in parallel. The aggregator is a stronger model that reads all proposer drafts and writes the final answer.
# moa_config.py
PROPOSER_MODELS = [
"llama-3.1-8b-instant", # 8B — fastest on Groq LPU, ~150ms TTFT
"llama-3.1-8b-instant", # Same model, different temperature = diverse samples
"gemma2-9b-it", # Different architecture = different reasoning paths
]
AGGREGATOR_MODEL = "llama-3.3-70b-versatile" # 70B — best quality available on Groq free tier
PROPOSER_TEMPERATURE = 0.7 # High enough for diversity, low enough for coherence
AGGREGATOR_TEMPERATURE = 0.3 # Aggregator should be conservative — it's synthesizing, not exploring
MAX_TOKENS = 1024
Why use the same model twice? Temperature-driven diversity. The same 8B model at temperature=0.7 samples different reasoning paths on each call. Pair that with a different architecture (Gemma vs. Llama) and you get both stochastic and structural diversity.
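If you want the two 8B calls to explore visibly different paths, one option is to pair each proposer with its own temperature instead of sharing one value. This is a hypothetical variant of the config above (`PROPOSER_SPECS` and `expand_specs` are not part of the tutorial's files):

```python
# Hypothetical config variant: each proposer carries its own temperature,
# so the two llama-3.1-8b-instant entries sample at different settings.
PROPOSER_SPECS = [
    ("llama-3.1-8b-instant", 0.5),  # same model, lower temperature
    ("llama-3.1-8b-instant", 0.9),  # same model, higher temperature
    ("gemma2-9b-it", 0.7),          # different architecture
]

def expand_specs(specs: list[tuple[str, float]]) -> list[dict]:
    """Turn (model, temperature) pairs into per-call keyword arguments."""
    return [{"model": m, "temperature": t} for m, t in specs]

for kwargs in expand_specs(PROPOSER_SPECS):
    print(kwargs["model"], kwargs["temperature"])
```

Each dict can then be splatted into a `chat.completions.create(**kwargs, ...)` call, replacing the single shared `PROPOSER_TEMPERATURE`.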
Step 2: Run Proposers Concurrently with asyncio
# moa_proposers.py
import asyncio
import os
from groq import AsyncGroq
client = AsyncGroq(api_key=os.environ["GROQ_API_KEY"])
async def call_proposer(model: str, user_prompt: str, temperature: float) -> str:
"""Call a single proposer model and return its text response."""
response = await client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": user_prompt}],
temperature=temperature,
max_tokens=1024,
)
return response.choices[0].message.content
async def gather_proposals(user_prompt: str, models: list[str], temperature: float) -> list[str]:
"""Run all proposer models concurrently — Groq LPU parallelism makes this ~= 1x latency."""
tasks = [
call_proposer(model, user_prompt, temperature)
for model in models
]
# asyncio.gather fires all coroutines simultaneously
proposals = await asyncio.gather(*tasks)
return list(proposals)
asyncio.gather issues all proposer calls at the same time. On GPU APIs this wouldn't help — requests queue server-side. On Groq's LPU each request gets its own compute path, so three calls genuinely run in parallel.
Expected latency: ~300–600ms for three parallel 8B/9B proposer calls on Groq free tier.
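The concurrency claim is easy to sanity-check locally with stub coroutines, where `asyncio.sleep` stands in for network latency (no API calls involved):

```python
import asyncio
import time

async def fake_proposer(delay: float) -> str:
    # Stand-in for a network call: sleep instead of hitting the API.
    await asyncio.sleep(delay)
    return f"draft after {delay}s"

async def main() -> tuple[float, float]:
    delays = [0.1, 0.1, 0.1]

    # Concurrent: all three "calls" overlap, total ≈ max(delays).
    start = time.perf_counter()
    await asyncio.gather(*(fake_proposer(d) for d in delays))
    concurrent = time.perf_counter() - start

    # Sequential: each awaited in turn, total ≈ sum(delays).
    start = time.perf_counter()
    for d in delays:
        await fake_proposer(d)
    sequential = time.perf_counter() - start
    return concurrent, sequential

concurrent, sequential = asyncio.run(main())
print(f"concurrent={concurrent:.2f}s sequential={sequential:.2f}s")
```

The same 3× gap appears with real Groq calls, provided the backend actually serves the requests in parallel.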
Step 3: Build the Aggregator Prompt
The aggregator prompt is the most important tuning surface in MoA. It must instruct the model to synthesize, not just copy the longest draft.
# moa_aggregator.py
AGGREGATOR_SYSTEM_PROMPT = """You are a synthesis expert. You will receive multiple draft answers to the same question, each written by a different AI model.
Your task:
1. Identify the strongest reasoning in each draft
2. Resolve any contradictions by applying logical consistency
3. Write a single final answer that incorporates the best elements of all drafts
4. Do not mention that you received multiple drafts — output only the final answer
Be concise. Do not pad. If drafts agree, confirm and tighten. If they disagree, reason through the conflict and pick the defensible position."""
def build_aggregator_prompt(user_prompt: str, proposals: list[str]) -> str:
"""Format proposer outputs into a structured aggregator input."""
drafts_block = "\n\n".join(
f"--- Draft {i+1} ---\n{proposal}"
for i, proposal in enumerate(proposals)
)
return f"""Original question:
{user_prompt}
Proposer drafts:
{drafts_block}
Synthesize the best final answer."""
Step 4: Call the Aggregator and Return Final Answer
# moa_aggregator.py (continued)
import os
from groq import AsyncGroq
client = AsyncGroq(api_key=os.environ["GROQ_API_KEY"])
async def aggregate(user_prompt: str, proposals: list[str], model: str, temperature: float) -> str:
"""Feed all proposer drafts to the aggregator and return the synthesized answer."""
aggregator_user_content = build_aggregator_prompt(user_prompt, proposals)
response = await client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": AGGREGATOR_SYSTEM_PROMPT},
{"role": "user", "content": aggregator_user_content},
],
temperature=temperature,
max_tokens=1024,
)
return response.choices[0].message.content
Step 5: Wire the Full MoA Pipeline
# moa_pipeline.py
import asyncio
import time
from moa_config import (
PROPOSER_MODELS, AGGREGATOR_MODEL,
PROPOSER_TEMPERATURE, AGGREGATOR_TEMPERATURE,
)
from moa_proposers import gather_proposals
from moa_aggregator import aggregate
async def run_moa(user_prompt: str) -> dict:
"""
Full Mixture-of-Agents pipeline:
1. Run proposers in parallel
2. Pass drafts to aggregator
3. Return final answer + timing metadata
"""
start = time.perf_counter()
# Phase 1 — parallel proposer inference
proposals = await gather_proposals(user_prompt, PROPOSER_MODELS, PROPOSER_TEMPERATURE)
proposer_ms = int((time.perf_counter() - start) * 1000)
# Phase 2 — sequential aggregator synthesis
final_answer = await aggregate(user_prompt, proposals, AGGREGATOR_MODEL, AGGREGATOR_TEMPERATURE)
total_ms = int((time.perf_counter() - start) * 1000)
return {
"answer": final_answer,
"proposals": proposals,
"proposer_latency_ms": proposer_ms,
"total_latency_ms": total_ms,
}
if __name__ == "__main__":
question = "Explain the tradeoffs between B-tree and LSM-tree indexes for write-heavy workloads."
result = asyncio.run(run_moa(question))
print(f"Proposers completed in {result['proposer_latency_ms']}ms")
print(f"Total pipeline: {result['total_latency_ms']}ms")
print("\n=== Final Answer ===")
print(result["answer"])
Verification
Run the pipeline:
python moa_pipeline.py
You should see:
Proposers completed in 420ms
Total pipeline: 1180ms
=== Final Answer ===
B-trees favor read-heavy workloads because...
Total wall-clock under 1.5 seconds for a three-proposer + 70B aggregator pipeline is normal on Groq. If proposer latency exceeds 1,500ms, check your GROQ_API_KEY rate limit tier at console.groq.com/settings/limits.
If it fails:
- `AuthenticationError` → `GROQ_API_KEY` not exported in the current shell. Run `export GROQ_API_KEY="gsk_..."` and retry.
- `RateLimitError` → Free tier exhausted. Wait 60 seconds or reduce `PROPOSER_MODELS` to two entries.
- `model not found` → Model name changed. Check current model IDs at console.groq.com/docs/models.
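For transient rate-limit failures, wrapping calls in a generic retry-with-backoff helper is one option. This is a sketch under assumptions: the flaky call below is a stub, and in real use you would pass `groq.RateLimitError` as the exception type rather than the stand-in `TimeoutError`:

```python
import asyncio

async def with_retries(coro_fn, *, retries: int = 3, base_delay: float = 1.0,
                       retry_on: type[Exception] = Exception):
    """Re-run coro_fn with exponential backoff on the given exception type."""
    for attempt in range(retries):
        try:
            return await coro_fn()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            await asyncio.sleep(base_delay * 2 ** attempt)

# Usage sketch with a hypothetical flaky call that succeeds on the 3rd try:
attempts = 0
async def flaky():
    global attempts
    attempts += 1
    if attempts < 3:
        raise TimeoutError("simulated transient failure")
    return "ok"

result = asyncio.run(with_retries(flaky, base_delay=0.01, retry_on=TimeoutError))
print(result)  # → ok
```

A wrapper like this fits naturally around `call_proposer` in Step 2, so one rate-limited proposer doesn't fail the whole pipeline.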
Tuning MoA for Production
Model Selection
| Proposer mix | When to use |
|---|---|
| 3× `llama-3.1-8b-instant` at different temps | Maximum speed, lowest cost, good for factual Q&A |
| `llama-3.1-8b-instant` + `gemma2-9b-it` + `llama-3.1-8b-instant` | Balanced diversity — recommended default |
| 3× different architectures | Best quality, slightly higher latency, use for reasoning tasks |
Number of Proposers
Two proposers is the minimum for meaningful diversity. Three is the practical optimum on Groq free tier — four or more starts hitting per-minute token limits before quality gains justify cost.
Aggregator Temperature
Keep the aggregator at temperature=0.2–0.4. Higher values introduce noise at exactly the stage where you want precision. Proposers handle exploration; the aggregator handles consolidation.
When NOT to Use MoA
- Simple lookup tasks — "What's the capital of France?" needs one model, not three.
- Latency-critical paths under 200ms — MoA always adds aggregator latency on top of proposer latency.
- Streaming UX — MoA must collect all proposals before the aggregator starts. You can't stream proposer output to the user mid-pipeline.
Groq Compound AI vs. Single 70B Model
| | MoA (3× 8B + 70B aggregator) | Single `llama-3.3-70b-versatile` |
|---|---|---|
| Reasoning quality | Higher on multi-step problems | Good, single reasoning path |
| Latency | ~1,000–1,500ms | ~400–700ms |
| Token cost | ~3–4× more tokens total | Baseline |
| Failure mode | Aggregator can over-smooth | Single confident wrong answer |
| Best for | Complex reasoning, evaluation, synthesis | Speed-sensitive, straightforward queries |
For tasks scored on benchmarks like MMLU, GSM8K, or HumanEval, MoA setups typically outperform single-70B calls. For production APIs where p95 latency matters more than accuracy percentiles, single-70B wins.
What You Learned
- MoA runs proposers in parallel then synthesizes with an aggregator — quality improves because diverse reasoning paths cover more of the solution space
- Groq's LPU lets parallel proposer calls run without multiplying wall-clock latency, even on the free tier — the property MoA depends on
- The aggregator system prompt is the highest-leverage tuning variable — bad prompts make the aggregator copy the longest draft instead of synthesizing
- Three proposers at mixed temperatures and architectures is the practical optimum for cost-quality tradeoff on Groq
Tested on Python 3.12, groq SDK 0.13.x, llama-3.1-8b-instant, gemma2-9b-it, llama-3.3-70b-versatile — March 2026
FAQ
Q: Does Groq MoA work without an async framework — can I use plain requests?
A: Yes, but you lose parallelism. Synchronous calls serialize proposers and multiply latency by the number of models. Use asyncio or concurrent.futures.ThreadPoolExecutor to get actual parallel execution.
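A minimal sketch of the threaded route, with the executor logic separated from the API call. The worker below is a stub; in real use it would call the synchronous `groq.Groq` client's `chat.completions.create`:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def parallel_map(fn, args_list, max_workers: int = 8) -> list:
    """Run fn over args_list in parallel threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, args_list))

def fake_sync_proposer(model: str) -> str:
    # Stub for a blocking Groq call; sleep stands in for network latency.
    time.sleep(0.1)
    return f"draft from {model}"

start = time.perf_counter()
drafts = parallel_map(fake_sync_proposer, ["8b-a", "8b-b", "gemma"])
elapsed = time.perf_counter() - start
print(drafts, f"{elapsed:.2f}s")  # ~0.1s total, not 0.3s
```

Threads work here because the proposer calls are I/O-bound: each thread spends its time waiting on the network, so the GIL is not a bottleneck.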
Q: What is the difference between Mixture-of-Agents and Mixture-of-Experts?
A: MoE (Mixture-of-Experts) is a single-model architecture where different parameter subsets activate per token — it's internal to one model. MoA is an inference-time ensemble where multiple separate models run independently and their outputs are merged by a coordinator.
Q: How many tokens does a three-proposer MoA pipeline consume?
A: Roughly 3× proposer input tokens + 3× proposer output tokens + aggregator input (which includes all drafts) + aggregator output. For a 200-token question with 400-token proposer answers, expect ~3,000–3,500 total tokens per MoA call.
Q: Can I run MoA with Groq and store results for evaluation?
A: Yes. Log result["proposals"] and result["answer"] to any database. Pair with LangSmith or Langfuse to trace proposer vs. aggregator contributions across runs. This is the recommended setup for measuring whether MoA actually improves your specific task.
Q: What is the minimum RAM required to run this pipeline locally?
A: The Python client itself needs under 100MB — all inference runs on Groq's servers. No local GPU or VRAM required. A free-tier cloud VM (512MB RAM) is sufficient to run the orchestration code.