Multi-Agent Coding: One Agent Writes Tests, Another Writes the Code

Speed up development and reduce bias with two AI agents: one writes tests first, the other implements code to pass them. Here's how to set it up.

Problem: One AI Agent Knows Too Much

When a single AI agent writes both tests and implementation, it cheats. It writes tests that fit the code it already plans to write — not tests that define what the code should do.

You end up with 100% coverage and zero confidence.

You'll learn:

  • Why separating test-writing and code-writing agents produces better software
  • How to orchestrate two agents with a shared test contract
  • A working Python example using the Anthropic API

Time: 20 min | Level: Intermediate


Why This Happens

A single agent has full context. When it writes def add(a, b) and then writes assert add(2, 3) == 5, it's not testing — it's confirming. The tests are shaped by the implementation, not the other way around.

Multi-agent TDD breaks this loop:

  • Agent A (Test Writer): Reads the spec. Writes failing tests. Has no implementation context.
  • Agent B (Coder): Reads the tests. Writes code to pass them. Never sees the spec directly.

The tests become the contract. Agent B can only pass them by actually satisfying the requirements.

Common symptoms of single-agent test bias:

  • Tests pass immediately with no iteration
  • Edge cases are never covered
  • Refactoring breaks tests that "should still work"
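Here is that bias in miniature, using a small password check (illustrative code, not actual model output):

```python
# Implementation-first: the agent has already settled on a length-only check.
def is_valid_password(password: str) -> bool:
    return len(password) >= 8  # uppercase, digit, and space rules forgotten

# Tests written afterward mirror that decision. They pass on the first run
# and never probe the rules the implementation skipped.
def test_short_password_rejected():
    assert not is_valid_password("Ab1")

def test_long_password_accepted():
    assert is_valid_password("abcdefgh")  # a spec-driven test would reject this
```

Both tests go green immediately, which looks like success and proves nothing.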

Solution

Step 1: Define the Contract (The Spec)

Write a plain-language spec. This is what Agent A reads — and the only thing it reads.

# Spec: Password Validator

A password is valid if:
- At least 8 characters long
- Contains at least one uppercase letter
- Contains at least one digit
- Does not contain spaces

Save this as spec.md. Agent B never sees this file.


Step 2: Agent A Writes the Tests

Agent A receives the spec and outputs a test file. It knows nothing about how the validator will be implemented.

import anthropic

def run_test_writer_agent(spec: str) -> str:
    client = anthropic.Anthropic()

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        system="""You are a test-writing agent. 
You receive a software specification and write pytest tests for it.
Output ONLY valid Python test code. No explanations.
Do not write any implementation — only tests.""",
        messages=[
            {"role": "user", "content": f"Write tests for this spec:\n\n{spec}"}
        ]
    )

    return response.content[0].text

Expected output from Agent A:

import pytest
from validator import is_valid_password

def test_minimum_length():
    assert not is_valid_password("Ab1")

def test_exact_minimum_length():
    assert is_valid_password("Abcdef1g")  # 8 chars, meets all rules

def test_missing_uppercase():
    assert not is_valid_password("abcdefg1")

def test_missing_digit():
    assert not is_valid_password("Abcdefgh")

def test_contains_space():
    assert not is_valid_password("Abcdef 1")

def test_empty_string():
    assert not is_valid_password("")

def test_all_uppercase_no_digit():
    assert not is_valid_password("ABCDEFGH")

If Agent A outputs prose instead of code:

  • System prompt too weak: Add "Your entire response must be a Python code block only."
  • Model wandered: Use stop_sequences=["```\n\n"] to cut off after the code block.
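Even with a hardened system prompt, models sometimes wrap their answer in a markdown fence anyway. A small post-processing helper (my own addition, not part of the Anthropic SDK) makes the pipeline robust to that:

```python
import re

def extract_code(text: str) -> str:
    """Return the body of the first ```python fence, or the text unchanged."""
    match = re.search(r"```(?:python)?\n(.*?)```", text, re.DOTALL)
    return match.group(1) if match else text
```

Run both agents' outputs through it before writing them to disk, e.g. f.write(extract_code(test_code)).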

Step 3: Agent B Writes the Implementation

Agent B receives only the test file. It writes code to pass every test.

def run_coder_agent(test_code: str) -> str:
    client = anthropic.Anthropic()

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        system="""You are a coding agent.
You receive pytest tests and write the implementation that passes them.
Output ONLY valid Python implementation code. No test code. No explanations.""",
        messages=[
            {"role": "user", "content": f"Write the implementation for these tests:\n\n{test_code}"}
        ]
    )

    return response.content[0].text

Expected output from Agent B:

# validator.py

def is_valid_password(password: str) -> bool:
    if len(password) < 8:
        return False
    if " " in password:
        return False
    if not any(c.isupper() for c in password):
        return False
    if not any(c.isdigit() for c in password):
        return False
    return True

Step 4: Wire the Orchestrator

import os

# Assumes run_test_writer_agent and run_coder_agent from the previous
# steps are defined in (or imported into) this file.

def main():
    # Load spec
    with open("spec.md") as f:
        spec = f.read()

    print("Agent A: Writing tests...")
    test_code = run_test_writer_agent(spec)

    # Save tests
    with open("test_validator.py", "w") as f:
        f.write(test_code)
    print("Tests written to test_validator.py")

    print("\nAgent B: Writing implementation...")
    impl_code = run_coder_agent(test_code)

    # Save implementation
    with open("validator.py", "w") as f:
        f.write(impl_code)
    print("Implementation written to validator.py")

    # Run tests automatically
    print("\nRunning pytest...")
    os.system("pytest test_validator.py -v")

if __name__ == "__main__":
    main()
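os.system is fine for a demo, but it discards the test results. If you want the orchestrator to branch on the outcome, subprocess.run exposes both the exit code and the captured output. A sketch:

```python
import subprocess

def run_tests(cmd: list[str]) -> tuple[int, str]:
    """Run a test command and return (exit_code, combined output)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr

# Usage: code, output = run_tests(["pytest", "test_validator.py", "-v"])
# pytest exits 0 only when every test passes.
```

The exit code is a more reliable signal than scanning the output for "passed" or "failed".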

Verification

python orchestrator.py

You should see:

Agent A: Writing tests...
Tests written to test_validator.py

Agent B: Writing implementation...
Implementation written to validator.py

Running pytest...
test_validator.py::test_minimum_length PASSED
test_validator.py::test_exact_minimum_length PASSED
test_validator.py::test_missing_uppercase PASSED
test_validator.py::test_missing_digit PASSED
test_validator.py::test_contains_space PASSED
test_validator.py::test_empty_string PASSED
test_validator.py::test_all_uppercase_no_digit PASSED

7 passed in 0.12s

If tests fail, feed the failure output back to Agent B as a follow-up message. This turns the orchestrator into an iterative loop until all tests pass.


Adding an Iteration Loop

If Agent B's first pass fails, add a retry loop:

import subprocess

def run_coder_agent_with_retry(test_code: str, max_retries: int = 3) -> str:
    client = anthropic.Anthropic()
    messages = [
        {"role": "user", "content": f"Write the implementation for these tests:\n\n{test_code}"}
    ]

    impl = ""
    for attempt in range(max_retries):
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=1024,
            system="You are a coding agent. Output ONLY valid Python implementation code.",
            messages=messages
        )
        impl = response.content[0].text

        # Write the candidate implementation and run the tests
        with open("validator.py", "w") as f:
            f.write(impl)

        result = subprocess.run(
            ["pytest", "test_validator.py"],
            capture_output=True, text=True
        )

        # pytest exits 0 only when every test passes, so there is
        # no fragile string-matching on its output
        if result.returncode == 0:
            return impl

        # Feed the failure back to Agent B
        messages.append({"role": "assistant", "content": impl})
        messages.append({"role": "user", "content": f"Tests failed:\n{result.stdout}\n\nFix the implementation."})

    return impl  # Return the last attempt even if it still fails

What You Learned

  • Separating test-writing and implementation agents eliminates confirmation bias in AI-generated code
  • Agent A should only see the spec; Agent B should only see the tests
  • A simple orchestrator can wire them together and automate the test run
  • Iteration loops with failure feedback close the gap when Agent B's first attempt misses edge cases

Limitation: This works best for pure logic functions. For agents that touch databases, APIs, or UIs, you'll need mocking strategies before Agent A can write meaningful tests.

When NOT to use this: Throwaway scripts, one-off data transforms, or anything where the spec is so loose that tests would be meaningless.


Tested with anthropic==0.40.0, Python 3.12, pytest 8.x