Problem: One AI Agent Knows Too Much
When a single AI agent writes both tests and implementation, it cheats. It writes tests that fit the code it already plans to write — not tests that define what the code should do.
You end up with 100% coverage and zero confidence.
You'll learn:
- Why separating test-writing and code-writing agents produces better software
- How to orchestrate two agents with a shared test contract
- A working Python example using the Anthropic API
Time: 20 min | Level: Intermediate
Why This Happens
A single agent has full context. When it writes `def add(a, b)` and then writes `assert add(2, 3) == 5`, it's not testing — it's confirming. The tests are shaped by the implementation, not the other way around.
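The difference is easy to see in miniature. All three tests below pass against the same stub, but only the last two were derived from the spec's boundaries rather than from the implementation's happy path (the `add` stub here is purely illustrative):

```python
def add(a, b):  # stand-in implementation, for illustration only
    return a + b

# Confirmation-style: mirrors the code the agent already planned to write.
def test_add_confirms_implementation():
    assert add(2, 3) == 5

# Contract-style: written from the spec alone, so it probes boundaries.
def test_add_zero_identity():
    assert add(0, 7) == 7

def test_add_negative_operands():
    assert add(-2, -3) == -5
```

A confirmation-style suite never exercises inputs the implementer didn't already have in mind; a contract-style suite does.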
Multi-agent TDD breaks this loop:
- Agent A (Test Writer): Reads the spec. Writes failing tests. Has no implementation context.
- Agent B (Coder): Reads the tests. Writes code to pass them. Never sees the spec directly.
The tests become the contract. Agent B can only pass them by actually satisfying the requirements.
Common symptoms of single-agent test bias:
- Tests pass immediately with no iteration
- Edge cases are never covered
- Refactoring breaks tests that "should still work"
Solution
Step 1: Define the Contract (The Spec)
Write a plain-language spec. This is what Agent A reads — and the only thing it reads.
```markdown
# Spec: Password Validator

A password is valid if:

- At least 8 characters long
- Contains at least one uppercase letter
- Contains at least one digit
- Does not contain spaces
```
Save this as `spec.md`. Agent B never sees this file.
Step 2: Agent A Writes the Tests
Agent A receives the spec and outputs a test file. It knows nothing about how the validator will be implemented.
```python
import anthropic

def run_test_writer_agent(spec: str) -> str:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        system="""You are a test-writing agent.
You receive a software specification and write pytest tests for it.
Output ONLY valid Python test code. No explanations.
Do not write any implementation — only tests.""",
        messages=[
            {"role": "user", "content": f"Write tests for this spec:\n\n{spec}"}
        ],
    )
    return response.content[0].text
```
Expected output from Agent A:
```python
import pytest
from validator import is_valid_password

def test_minimum_length():
    assert not is_valid_password("Ab1")

def test_exact_minimum_length():
    assert is_valid_password("Abcdef1g")  # 8 chars, meets all rules

def test_missing_uppercase():
    assert not is_valid_password("abcdefg1")

def test_missing_digit():
    assert not is_valid_password("Abcdefgh")

def test_contains_space():
    assert not is_valid_password("Abcdef 1")

def test_empty_string():
    assert not is_valid_password("")

def test_all_uppercase_no_digit():
    assert not is_valid_password("ABCDEFGH")
```
If Agent A outputs prose instead of code:

- System prompt too weak: add "Your entire response must be a Python code block only."
- Model wandered: pass ``stop_sequences=["```\n\n"]`` to `messages.create` to cut off generation after the code block.
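Even with a strict system prompt, models sometimes wrap their output in a Markdown fence anyway. A small defensive helper (my addition, not part of the Anthropic SDK) strips the fence before the text is written to disk:

```python
import re

def extract_code(response_text: str) -> str:
    """Return the body of the first Markdown code fence in the response,
    or the raw text unchanged if the model emitted bare code."""
    match = re.search(r"```(?:python)?\s*\n(.*?)```", response_text, re.DOTALL)
    return match.group(1) if match else response_text
```

Run each agent's output through this before saving `test_validator.py` or `validator.py`; a stray fence in either file is a syntax error at import time.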
Step 3: Agent B Writes the Implementation
Agent B receives only the test file. It writes code to pass every test.
```python
def run_coder_agent(test_code: str) -> str:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        system="""You are a coding agent.
You receive pytest tests and write the implementation that passes them.
Output ONLY valid Python implementation code. No test code. No explanations.""",
        messages=[
            {"role": "user", "content": f"Write the implementation for these tests:\n\n{test_code}"}
        ],
    )
    return response.content[0].text
```
Expected output from Agent B:
```python
# validator.py
def is_valid_password(password: str) -> bool:
    if len(password) < 8:
        return False
    if " " in password:
        return False
    if not any(c.isupper() for c in password):
        return False
    if not any(c.isdigit() for c in password):
        return False
    return True
```
Step 4: Wire the Orchestrator
```python
import os

def main():
    # Load spec
    with open("spec.md") as f:
        spec = f.read()

    print("Agent A: Writing tests...")
    test_code = run_test_writer_agent(spec)

    # Save tests
    with open("test_validator.py", "w") as f:
        f.write(test_code)
    print("Tests written to test_validator.py")

    print("\nAgent B: Writing implementation...")
    impl_code = run_coder_agent(test_code)

    # Save implementation
    with open("validator.py", "w") as f:
        f.write(impl_code)
    print("Implementation written to validator.py")

    # Run tests automatically
    print("\nRunning pytest...")
    os.system("pytest test_validator.py -v")

if __name__ == "__main__":
    main()
```
Verification
```bash
python orchestrator.py
```
You should see:
```
Agent A: Writing tests...
Tests written to test_validator.py

Agent B: Writing implementation...
Implementation written to validator.py

Running pytest...
test_validator.py::test_minimum_length PASSED
test_validator.py::test_exact_minimum_length PASSED
test_validator.py::test_missing_uppercase PASSED
test_validator.py::test_missing_digit PASSED
test_validator.py::test_contains_space PASSED
test_validator.py::test_empty_string PASSED
test_validator.py::test_all_uppercase_no_digit PASSED

7 passed in 0.12s
```
If tests fail, feed the failure output back to Agent B as a follow-up message. This turns the orchestrator into an iterative loop until all tests pass.
Adding an Iteration Loop
If Agent B's first pass fails, add a retry loop:
```python
def run_coder_agent_with_retry(test_code: str, max_retries: int = 3) -> str:
    client = anthropic.Anthropic()
    messages = [
        {"role": "user", "content": f"Write the implementation for these tests:\n\n{test_code}"}
    ]
    for attempt in range(max_retries):
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=1024,
            system="You are a coding agent. Output ONLY valid Python implementation code.",
            messages=messages,
        )
        impl = response.content[0].text

        # Write the candidate implementation and run the suite against it
        with open("validator.py", "w") as f:
            f.write(impl)
        result = os.popen("pytest test_validator.py 2>&1").read()
        if "passed" in result and "failed" not in result:
            return impl  # All tests pass

        # Feed the failure output back to Agent B
        messages.append({"role": "assistant", "content": impl})
        messages.append({"role": "user", "content": f"Tests failed:\n{result}\n\nFix the implementation."})
    return impl  # Return the last attempt
```
What You Learned
- Separating test-writing and implementation agents eliminates confirmation bias in AI-generated code
- Agent A should only see the spec; Agent B should only see the tests
- A simple orchestrator can wire them together and automate the test run
- Iteration loops with failure feedback close the gap when Agent B's first attempt misses edge cases
Limitation: This works best for pure logic functions. For agents that touch databases, APIs, or UIs, you'll need mocking strategies before Agent A can write meaningful tests.
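As a sketch of what that can look like, structure the code so the external dependency is injected as a parameter; Agent A's tests can then pass a stub instead of a real client (`greet_user` and `fetch_user` are hypothetical names, not part of the validator example):

```python
def greet_user(user_id: int, fetch_user) -> str:
    """Pure logic: the network call is a parameter, so tests control it."""
    user = fetch_user(user_id)
    return f"Hello, {user['name']}!"

# In a test, Agent A supplies a fake instead of a real API client:
def fake_fetch(user_id: int) -> dict:
    return {"name": "Ada"}

def test_greet_user():
    assert greet_user(1, fake_fetch) == "Hello, Ada!"
```

With the dependency injected, the spec you hand Agent A can describe the stub's behavior, and the tests stay fast and deterministic.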
When NOT to use this: Throwaway scripts, one-off data transforms, or anything where the spec is so loose that tests would be meaningless.
Tested with `anthropic==0.40.0`, Python 3.12, pytest 8.x