Groq Compound AI with Mixture-of-Agents (MoA) inference lets you run multiple LLMs in parallel on Groq's LPU hardware and aggregate their outputs into a single, higher-quality response — all in under two seconds on free-tier API keys.
Single-model calls plateau. No matter how large the model, one forward pass misses reasoning paths another model would catch. MoA fixes this by running several "proposer" models concurrently, then feeding all their drafts to an "aggregator" model that synthesizes the best answer. Groq's LPU makes this practical: parallel calls that would stall on GPU-bound APIs finish in milliseconds here.
You'll learn:
- How the MoA proposer → aggregator pipeline works on Groq
- How to implement concurrent proposer calls with `asyncio` and the Groq Python SDK
- How to tune model selection, temperature, and aggregator prompt for production use
- When MoA improves quality vs. when a single large model is cheaper and sufficient
Time: 20 min | Difficulty: Intermediate
Why Single-Model Inference Hits a Ceiling
Every LLM samples from a probability distribution. One sample = one reasoning path. That path may be confidently wrong, especially on multi-step problems, ambiguous instructions, or adversarial inputs.
Symptoms of single-model ceiling:
- Correct 80–90% of the time but fails on edge cases you can't predict
- Chain-of-thought improves results but adds latency without ensemble diversity
- Larger models reduce errors but multiply cost linearly with no quality guarantee
MoA breaks this by treating inference as an ensemble problem. Several smaller, fast models explore different reasoning paths in parallel. An aggregator — usually a stronger model — reads all drafts and synthesizes a final answer that's empirically better than any single proposer alone.
MoA pipeline: proposers run in parallel on Groq LPUs → drafts collected → aggregator synthesizes final answer
How Groq's LPU Makes MoA Practical
GPU inference stacks requests in a queue. Parallel calls to the same GPU-backed API often don't truly run in parallel — they serialize behind each other, so MoA on GPU APIs can multiply latency by the number of proposers.
Groq's LPU (Language Processing Unit) is a deterministic, streaming compute unit with no memory bandwidth bottleneck. Each request gets dedicated silicon. Three parallel proposer calls take roughly the same wall-clock time as one. That's the architecture assumption MoA depends on — and why Groq is the natural backend.
Groq's free tier (as of March 2026) gives you 14,400 requests/day on llama-3.1-8b-instant and 6,000 requests/day on llama-3.3-70b-versatile, both at no cost — enough to run MoA experiments without a billing profile.
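As a back-of-the-envelope check, those limits bound how many full MoA runs you get per day. A quick estimator, using the request limits quoted above (it ignores `gemma2-9b-it`, whose free-tier limit isn't stated here, and treats the limits as a snapshot that may change):

```python
# Estimate how many complete MoA runs/day the free-tier limits allow.
# One run in this tutorial makes 2 calls to llama-3.1-8b-instant
# (proposers) and 1 call to llama-3.3-70b-versatile (aggregator).
def daily_moa_budget(limits: dict[str, int], calls_per_run: dict[str, int]) -> int:
    """Return the number of complete runs before any model's daily limit is hit."""
    return min(limits[m] // n for m, n in calls_per_run.items())

limits = {
    "llama-3.1-8b-instant": 14_400,    # requests/day, from the text above
    "llama-3.3-70b-versatile": 6_000,  # requests/day, from the text above
}
calls_per_run = {
    "llama-3.1-8b-instant": 2,
    "llama-3.3-70b-versatile": 1,
}
print(daily_moa_budget(limits, calls_per_run))  # → 6000
```

The aggregator is the binding constraint here: 6,000 runs/day, not the proposers' 7,200.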
Prerequisites
- Python 3.12+
- Groq API key — get one free at console.groq.com
- `groq` and `asyncio` (stdlib) — no other dependencies required
pip install groq
Set your key:
export GROQ_API_KEY="gsk_your_key_here"
Solution
Step 1: Define Proposer and Aggregator Models
MoA uses two roles. Proposers are fast, cheap models that run in parallel. The aggregator is a stronger model that reads all proposer drafts and writes the final answer.
# moa_config.py
PROPOSER_MODELS = [
"llama-3.1-8b-instant", # 8B — fastest on Groq LPU, ~150ms TTFT
"llama-3.1-8b-instant", # Same model, different temperature = diverse samples
"gemma2-9b-it", # Different architecture = different reasoning paths
]
AGGREGATOR_MODEL = "llama-3.3-70b-versatile" # 70B — best quality available on Groq free tier
PROPOSER_TEMPERATURE = 0.7 # High enough for diversity, low enough for coherence
AGGREGATOR_TEMPERATURE = 0.3 # Aggregator should be conservative — it's synthesizing, not exploring
MAX_TOKENS = 1024
Why use the same model twice? Temperature-driven diversity. The same 8B model at temperature=0.7 samples different reasoning paths on each call. Pair that with a different architecture (Gemma vs. Llama) and you get both stochastic and structural diversity.
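If you want the two 8B calls to explore visibly different paths, one option is to pair each proposer with its own temperature instead of sharing one value. This is a hypothetical variant of the config above (`PROPOSER_SPECS` and `expand_specs` are not part of the tutorial's files):

```python
# Hypothetical config variant: each proposer carries its own temperature,
# so the two llama-3.1-8b-instant entries sample at different settings.
PROPOSER_SPECS = [
    ("llama-3.1-8b-instant", 0.5),  # same model, lower temperature
    ("llama-3.1-8b-instant", 0.9),  # same model, higher temperature
    ("gemma2-9b-it", 0.7),          # different architecture
]

def expand_specs(specs: list[tuple[str, float]]) -> list[dict]:
    """Turn (model, temperature) pairs into per-call keyword arguments."""
    return [{"model": m, "temperature": t} for m, t in specs]

for kwargs in expand_specs(PROPOSER_SPECS):
    print(kwargs["model"], kwargs["temperature"])
```

Each dict can then be splatted into a `chat.completions.create(**kwargs, ...)` call, replacing the single shared `PROPOSER_TEMPERATURE`.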
Step 2: Run Proposers Concurrently with asyncio
# moa_proposers.py
import asyncio
import os
from groq import AsyncGroq
client = AsyncGroq(api_key=os.environ["GROQ_API_KEY"])
async def call_proposer(model: str, user_prompt: str, temperature: float) -> str:
"""Call a single proposer model and return its text response."""
response = await client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": user_prompt}],
temperature=temperature,
max_tokens=1024,
)
return response.choices[0].message.content
async def gather_proposals(user_prompt: str, models: list[str], temperature: float) -> list[str]:
"""Run all proposer models concurrently — Groq LPU parallelism makes this ~= 1x latency."""
tasks = [
call_proposer(model, user_prompt, temperature)
for model in models
]
# asyncio.gather fires all coroutines simultaneously
proposals = await asyncio.gather(*tasks)
return list(proposals)
asyncio.gather issues all proposer calls at the same time. On GPU APIs this wouldn't help — requests queue server-side. On Groq's LPU each request gets its own compute path, so three calls genuinely run in parallel.
Expected latency: ~300–600ms for three parallel 8B/9B proposer calls on Groq free tier.
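The concurrency claim is easy to sanity-check locally with stub coroutines, where `asyncio.sleep` stands in for network latency (no API calls involved):

```python
import asyncio
import time

async def fake_proposer(delay: float) -> str:
    # Stand-in for a network call: sleep instead of hitting the API.
    await asyncio.sleep(delay)
    return f"draft after {delay}s"

async def main() -> tuple[float, float]:
    delays = [0.1, 0.1, 0.1]

    # Concurrent: all three "calls" overlap, total ≈ max(delays).
    start = time.perf_counter()
    await asyncio.gather(*(fake_proposer(d) for d in delays))
    concurrent = time.perf_counter() - start

    # Sequential: each awaited in turn, total ≈ sum(delays).
    start = time.perf_counter()
    for d in delays:
        await fake_proposer(d)
    sequential = time.perf_counter() - start
    return concurrent, sequential

concurrent, sequential = asyncio.run(main())
print(f"concurrent={concurrent:.2f}s sequential={sequential:.2f}s")
```

The same 3× gap appears with real Groq calls, provided the backend actually serves the requests in parallel.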
Step 3: Build the Aggregator Prompt
The aggregator prompt is the most important tuning surface in MoA. It must instruct the model to synthesize, not just copy the longest draft.
# moa_aggregator.py
AGGREGATOR_SYSTEM_PROMPT = """You are a synthesis expert. You will receive multiple draft answers to the same question, each written by a different AI model.
Your task:
1. Identify the strongest reasoning in each draft
2. Resolve any contradictions by applying logical consistency
3. Write a single final answer that incorporates the best elements of all drafts
4. Do not mention that you received multiple drafts — output only the final answer
Be concise. Do not pad. If drafts agree, confirm and tighten. If they disagree, reason through the conflict and pick the defensible position."""
def build_aggregator_prompt(user_prompt: str, proposals: list[str]) -> str:
"""Format proposer outputs into a structured aggregator input."""
drafts_block = "\n\n".join(
f"--- Draft {i+1} ---\n{proposal}"
for i, proposal in enumerate(proposals)
)
return f"""Original question:
{user_prompt}
Proposer drafts:
{drafts_block}
Synthesize the best final answer."""
Step 4: Call the Aggregator and Return Final Answer
# moa_aggregator.py (continued)
import os
from groq import AsyncGroq
client = AsyncGroq(api_key=os.environ["GROQ_API_KEY"])
async def aggregate(user_prompt: str, proposals: list[str], model: str, temperature: float) -> str:
"""Feed all proposer drafts to the aggregator and return the synthesized answer."""
aggregator_user_content = build_aggregator_prompt(user_prompt, proposals)
response = await client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": AGGREGATOR_SYSTEM_PROMPT},
{"role": "user", "content": aggregator_user_content},
],
temperature=temperature,
max_tokens=1024,
)
return response.choices[0].message.content
Step 5: Wire the Full MoA Pipeline
# moa_pipeline.py
import asyncio
import time
from moa_config import (
PROPOSER_MODELS, AGGREGATOR_MODEL,
PROPOSER_TEMPERATURE, AGGREGATOR_TEMPERATURE,
)
from moa_proposers import gather_proposals
from moa_aggregator import aggregate
async def run_moa(user_prompt: str) -> dict:
"""
Full Mixture-of-Agents pipeline:
1. Run proposers in parallel
2. Pass drafts to aggregator
3. Return final answer + timing metadata
"""
start = time.perf_counter()
# Phase 1 — parallel proposer inference
proposals = await gather_proposals(user_prompt, PROPOSER_MODELS, PROPOSER_TEMPERATURE)
proposer_ms = int((time.perf_counter() - start) * 1000)
# Phase 2 — sequential aggregator synthesis
final_answer = await aggregate(user_prompt, proposals, AGGREGATOR_MODEL, AGGREGATOR_TEMPERATURE)
total_ms = int((time.perf_counter() - start) * 1000)
return {
"answer": final_answer,
"proposals": proposals,
"proposer_latency_ms": proposer_ms,
"total_latency_ms": total_ms,
}
if __name__ == "__main__":
question = "Explain the tradeoffs between B-tree and LSM-tree indexes for write-heavy workloads."
result = asyncio.run(run_moa(question))
print(f"Proposers completed in {result['proposer_latency_ms']}ms")
print(f"Total pipeline: {result['total_latency_ms']}ms")
print("\n=== Final Answer ===")
print(result["answer"])
Verification
Run the pipeline:
python moa_pipeline.py
You should see:
Proposers completed in 420ms
Total pipeline: 1180ms
=== Final Answer ===
B-trees favor read-heavy workloads because...
Total wall-clock under 1.5 seconds for a three-proposer + 70B aggregator pipeline is normal on Groq. If proposer latency exceeds 1,500ms, check your GROQ_API_KEY rate limit tier at console.groq.com/settings/limits.
If it fails:
- `AuthenticationError` → `GROQ_API_KEY` not exported in the current shell. Run `export GROQ_API_KEY="gsk_..."` and retry.
- `RateLimitError` → Free tier exhausted. Wait 60 seconds or reduce `PROPOSER_MODELS` to two entries.
- `model not found` → Model name changed. Check current model IDs at console.groq.com/docs/models.
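For transient rate-limit failures, wrapping calls in a generic retry-with-backoff helper is one option. This is a sketch under assumptions: the flaky call below is a stub, and in real use you would pass `groq.RateLimitError` as the exception type rather than the stand-in `TimeoutError`:

```python
import asyncio

async def with_retries(coro_fn, *, retries: int = 3, base_delay: float = 1.0,
                       retry_on: type[Exception] = Exception):
    """Re-run coro_fn with exponential backoff on the given exception type."""
    for attempt in range(retries):
        try:
            return await coro_fn()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            await asyncio.sleep(base_delay * 2 ** attempt)

# Usage sketch with a hypothetical flaky call that succeeds on the 3rd try:
attempts = 0
async def flaky():
    global attempts
    attempts += 1
    if attempts < 3:
        raise TimeoutError("simulated transient failure")
    return "ok"

result = asyncio.run(with_retries(flaky, base_delay=0.01, retry_on=TimeoutError))
print(result)  # → ok
```

A wrapper like this fits naturally around `call_proposer` in Step 2, so one rate-limited proposer doesn't fail the whole pipeline.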
Tuning MoA for Production
Model Selection
| Proposer mix | When to use |
|---|---|
| 3× `llama-3.1-8b-instant` at different temps | Maximum speed, lowest cost, good for factual Q&A |
| `llama-3.1-8b-instant` + `gemma2-9b-it` + `llama-3.1-8b-instant` | Balanced diversity — recommended default |
| 3× different architectures | Best quality, slightly higher latency, use for reasoning tasks |
Number of Proposers
Two proposers is the minimum for meaningful diversity. Three is the practical optimum on Groq free tier — four or more starts hitting per-minute token limits before quality gains justify cost.
Aggregator Temperature
Keep the aggregator at temperature=0.2–0.4. Higher values introduce noise at exactly the stage where you want precision. Proposers handle exploration; the aggregator handles consolidation.
When NOT to Use MoA
- Simple lookup tasks — "What's the capital of France?" needs one model, not three.
- Latency-critical paths under 200ms — MoA always adds aggregator latency on top of proposer latency.
- Streaming UX — MoA must collect all proposals before the aggregator starts. You can't stream proposer output to the user mid-pipeline.
Groq Compound AI vs. Single 70B Model
| | MoA (3× 8B + 70B aggregator) | Single `llama-3.3-70b-versatile` |
|---|---|---|
| Reasoning quality | Higher on multi-step problems | Good, single reasoning path |
| Latency | ~1,000–1,500ms | ~400–700ms |
| Token cost | ~3–4× more tokens total | Baseline |
| Failure mode | Aggregator can over-smooth | Single confident wrong answer |
| Best for | Complex reasoning, evaluation, synthesis | Speed-sensitive, straightforward queries |
For tasks scored on benchmarks like MMLU, GSM8K, or HumanEval, MoA setups typically outperform single-70B calls. For production APIs where p95 latency matters more than accuracy percentiles, single-70B wins.
What You Learned
- MoA runs proposers in parallel then synthesizes with an aggregator — quality improves because diverse reasoning paths cover more of the solution space
- Groq's LPU lets parallel proposer calls run without multiplying wall-clock latency, even on the free tier — the property MoA depends on
- The aggregator system prompt is the highest-leverage tuning variable — bad prompts make the aggregator copy the longest draft instead of synthesizing
- Three proposers at mixed temperatures and architectures is the practical optimum for cost-quality tradeoff on Groq
Tested on Python 3.12, groq SDK 0.13.x, llama-3.1-8b-instant, gemma2-9b-it, llama-3.3-70b-versatile — March 2026
FAQ
Q: Does Groq MoA work without an async framework — can I use plain requests?
A: Yes, but you lose parallelism. Synchronous calls serialize proposers and multiply latency by the number of models. Use asyncio or concurrent.futures.ThreadPoolExecutor to get actual parallel execution.
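A minimal sketch of the threaded route, with the executor logic separated from the API call. The worker below is a stub; in real use it would call the synchronous `groq.Groq` client's `chat.completions.create`:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def parallel_map(fn, args_list, max_workers: int = 8) -> list:
    """Run fn over args_list in parallel threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, args_list))

def fake_sync_proposer(model: str) -> str:
    # Stub for a blocking Groq call; sleep stands in for network latency.
    time.sleep(0.1)
    return f"draft from {model}"

start = time.perf_counter()
drafts = parallel_map(fake_sync_proposer, ["8b-a", "8b-b", "gemma"])
elapsed = time.perf_counter() - start
print(drafts, f"{elapsed:.2f}s")  # ~0.1s total, not 0.3s
```

Threads work here because the proposer calls are I/O-bound: each thread spends its time waiting on the network, so the GIL is not a bottleneck.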
Q: What is the difference between Mixture-of-Agents and Mixture-of-Experts?
A: MoE (Mixture-of-Experts) is a single-model architecture where different parameter subsets activate per token — it's internal to one model. MoA is an inference-time ensemble where multiple separate models run independently and their outputs are merged by a coordinator.
Q: How many tokens does a three-proposer MoA pipeline consume?
A: Roughly 3× proposer input tokens + 3× proposer output tokens + aggregator input (which includes all drafts) + aggregator output. For a 200-token question with 400-token proposer answers, expect ~3,000–3,500 total tokens per MoA call.
Q: Can I run MoA with Groq and store results for evaluation?
A: Yes. Log result["proposals"] and result["answer"] to any database. Pair with LangSmith or Langfuse to trace proposer vs. aggregator contributions across runs. This is the recommended setup for measuring whether MoA actually improves your specific task.
Q: What is the minimum RAM required to run this pipeline locally?
A: The Python client itself needs under 100MB — all inference runs on Groq's servers. No local GPU or VRAM required. A free-tier cloud VM (512MB RAM) is sufficient to run the orchestration code.