Problem: Manual QA Can't Keep Up With Your Shipping Speed
Your team ships daily, but QA is still a bottleneck. Test coverage is inconsistent, flaky tests go uninvestigated, and bugs slip through because writing good tests takes time nobody has.
Agentic workflows fix this by letting an AI agent generate tests, run them, analyze failures, and open tickets—without a human in the loop for every step.
You'll learn:
- How to structure a QA agent with tool use and a feedback loop
- How to generate meaningful tests from source code automatically
- How to triage failures and route them to the right team
Time: 20 min | Level: Intermediate
Why This Happens
Traditional test automation is passive—you write tests, a CI runner executes them, and a human reads the report. The bottleneck is always the human: writing tests, triaging failures, deciding what matters.
Agentic workflows break that loop. An agent can reason about what needs testing, act by running commands, and adapt based on what it observes—repeatedly, without stopping for approval.
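That reason-act-adapt cycle can be sketched as a loop. This is an illustrative sketch only — the `decide`/`act` callables and the toy failure list are made up, not part of the real agent built below:

```python
def agentic_loop(decide, act, max_steps=10):
    """Run decide/act cycles until the agent signals it is finished."""
    observation = None
    for _ in range(max_steps):
        action = decide(observation)   # the model reasons about what to do next
        if action is None:             # no action chosen -> agent is done
            break
        observation = act(action)      # act, then feed the result back in
    return observation

# Toy example: re-run failing tests one at a time until none remain.
failures = ["test_a", "test_b"]
decide = lambda obs: failures.pop(0) if failures else None
act = lambda test: f"re-ran {test}: passed"
print(agentic_loop(decide, act))  # re-ran test_b: passed
```

The point of the sketch: the loop, not the human, decides when to stop.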
Common symptoms that signal you need this:
- PRs merge without test coverage because nobody had time to write them
- Flaky tests are ignored instead of fixed
- The same class of bug keeps getting through
Solution
You'll build a QA agent using Python, the Anthropic SDK, and pytest. The agent gets three tools: read source code, run tests, and file a GitHub issue.
Step 1: Install Dependencies
```bash
pip install anthropic pytest pytest-json-report gitpython
```
Step 2: Define the Agent's Tools
The agent needs to see code, run tests, and act on failures. Define those as tools the model can call.
```python
# qa_agent.py
import anthropic
import subprocess
import json
from pathlib import Path

client = anthropic.Anthropic()

tools = [
    {
        "name": "read_file",
        "description": "Read source code or test files to understand what needs testing.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Relative file path to read"}
            },
            "required": ["path"]
        }
    },
    {
        "name": "run_tests",
        "description": "Run pytest on a specific file or directory. Returns JSON results.",
        "input_schema": {
            "type": "object",
            "properties": {
                "target": {"type": "string", "description": "File or directory to test"},
                "extra_args": {"type": "string", "description": "Optional extra pytest flags"}
            },
            "required": ["target"]
        }
    },
    {
        "name": "write_test_file",
        "description": "Write a new test file to disk.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Where to save the test file"},
                "content": {"type": "string", "description": "Full test file content"}
            },
            "required": ["path", "content"]
        }
    }
]
```
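A cheap way to catch schema mistakes before the model ever sees them is to check sample inputs against each tool's `required` fields. This is a minimal sketch, not a full JSON Schema validator — the trimmed `tools` list below mirrors only the fields the check needs:

```python
# Trimmed tool definitions: just the names and required fields from above.
tools = [
    {"name": "read_file", "input_schema": {"required": ["path"]}},
    {"name": "run_tests", "input_schema": {"required": ["target"]}},
    {"name": "write_test_file", "input_schema": {"required": ["path", "content"]}},
]

def check_input(tool_name: str, tool_input: dict) -> list[str]:
    """Return the list of required fields missing from tool_input."""
    schema = next(t["input_schema"] for t in tools if t["name"] == tool_name)
    return [key for key in schema["required"] if key not in tool_input]

print(check_input("run_tests", {"target": "tests/"}))    # []
print(check_input("write_test_file", {"path": "t.py"}))  # ['content']
```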
Why three tools only: More tools = more decision overhead. Start minimal. Add a GitHub issue tool once the core loop is stable.
Step 3: Implement Tool Execution
```python
def execute_tool(tool_name: str, tool_input: dict) -> str:
    if tool_name == "read_file":
        path = Path(tool_input["path"]).resolve()
        # Prevent path traversal outside the project directory
        if not path.is_relative_to(Path.cwd()):
            return f"Refusing to read outside the project: {path}"
        if not path.exists():
            return f"File not found: {path}"
        return path.read_text()
    elif tool_name == "run_tests":
        target = tool_input["target"]
        args = tool_input.get("extra_args", "")
        result = subprocess.run(
            f"pytest {target} {args} --json-report --json-report-file=report.json -q",
            shell=True,
            capture_output=True,
            text=True
        )
        # Return structured results so the agent can reason about failures
        try:
            report = json.loads(Path("report.json").read_text())
            summary = report.get("summary", {})
            failures = [
                {"test": t["nodeid"], "message": t["call"]["longrepr"]}
                for t in report.get("tests", [])
                if t["outcome"] == "failed"
            ]
            return json.dumps({"summary": summary, "failures": failures[:5]})
        except Exception:
            return result.stdout + result.stderr
    elif tool_name == "write_test_file":
        path = Path(tool_input["path"])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(tool_input["content"])
        return f"Written: {path}"
    return f"Unknown tool: {tool_name}"
```
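Before wiring the tools to the model, you can exercise them directly. The harness below is a self-contained, simplified copy of just the two file-handling branches (no path checks), so it runs standalone — `execute_file_tool` is a name for this sketch, not part of the agent:

```python
import tempfile
from pathlib import Path

def execute_file_tool(tool_name: str, tool_input: dict) -> str:
    """Simplified copy of the read/write branches, for a quick round-trip test."""
    if tool_name == "read_file":
        path = Path(tool_input["path"])
        if not path.exists():
            return f"File not found: {path}"
        return path.read_text()
    if tool_name == "write_test_file":
        path = Path(tool_input["path"])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(tool_input["content"])
        return f"Written: {path}"
    return f"Unknown tool: {tool_name}"

# Round trip: write a test file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    target = f"{tmp}/tests/test_demo.py"
    execute_file_tool("write_test_file",
                      {"path": target, "content": "def test_ok():\n    assert True\n"})
    assert execute_file_tool("read_file", {"path": target}).startswith("def test_ok")
    print("round trip OK")  # round trip OK
```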
If it fails:
- `report.json` not found: Make sure `pytest-json-report` is installed and the test runner has write access to the working directory.
- Subprocess hangs: Add a `timeout=60` parameter to `subprocess.run`.
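The timeout fix is worth wrapping so a hung run comes back to the agent as a readable string instead of an unhandled exception. A sketch — `run_with_timeout` is a helper name for this example, and the 60-second default is arbitrary:

```python
import subprocess

def run_with_timeout(cmd: str, timeout: int = 60) -> str:
    """Run a shell command, converting hangs into text the agent can read."""
    try:
        result = subprocess.run(cmd, shell=True, capture_output=True,
                                text=True, timeout=timeout)
        return result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return f"Command timed out after {timeout}s: {cmd}"

run_with_timeout("echo hello")          # returns "hello\n"
run_with_timeout("sleep 5", timeout=1)  # returns "Command timed out after 1s: sleep 5"
```

Returning the timeout as a tool result lets the agent decide what to do next (retry, narrow the test target) instead of crashing the loop.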
Step 4: Build the Agent Loop
This is the core agentic pattern: send a message, handle tool calls, feed results back, repeat until done.
```python
def run_qa_agent(source_file: str):
    system_prompt = """You are a QA engineer agent. Your job is to:
1. Read the source file provided
2. Write comprehensive pytest tests covering happy paths, edge cases, and error states
3. Run the tests and check results
4. If tests fail, diagnose the issue and fix your tests (not the source)
5. Report a summary of coverage gaps when done
Be direct. Write real tests, not placeholder tests."""

    messages = [
        {"role": "user", "content": f"Generate and run tests for: {source_file}"}
    ]
    print(f"Starting QA agent for {source_file}...")

    # Agent loop - runs until the model stops calling tools
    while True:
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=4096,
            system=system_prompt,
            tools=tools,
            messages=messages
        )
        # Append assistant response to message history
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "end_turn":
            # Print the final text response, then exit the loop
            for block in response.content:
                if hasattr(block, "text"):
                    print("\n--- Agent Report ---")
                    print(block.text)
            break

        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    print(f"  → {block.name}({list(block.input.keys())})")
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })
            # Feed results back so the agent can continue reasoning
            messages.append({"role": "user", "content": tool_results})
```
Why feed results back: The agent needs to see what happened to decide what to do next. Without feeding tool results into messages, it's flying blind.
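Concretely, here is the shape of `messages` after one round trip through the loop. The `toolu_01` id and the file contents are illustrative; what matters is that the `tool_result` block echoes the `tool_use` block's id:

```python
messages = [
    {"role": "user", "content": "Generate and run tests for: src/utils.py"},
    {   # assistant turn: the model asked to call a tool
        "role": "assistant",
        "content": [
            {"type": "tool_use", "id": "toolu_01", "name": "read_file",
             "input": {"path": "src/utils.py"}},
        ],
    },
    {   # user turn: our code returns the tool result, keyed by the same id
        "role": "user",
        "content": [
            {"type": "tool_result", "tool_use_id": "toolu_01",
             "content": "def add(a, b): return a + b"},
        ],
    },
]

# The tool_use_id must match the tool_use id, or the API rejects the turn.
assert messages[2]["content"][0]["tool_use_id"] == messages[1]["content"][0]["id"]
```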
Step 5: Run It
Add an entry point at the bottom of qa_agent.py so the script can be pointed at a specific module:
```python
# At the bottom of qa_agent.py
if __name__ == "__main__":
    import sys
    target = sys.argv[1] if len(sys.argv) > 1 else "src/utils.py"
    run_qa_agent(target)
```
```bash
python qa_agent.py src/payments.py
```
The agent reads source, writes tests, runs pytest, and reports failures—no manual steps in between.
Verification
```bash
# Run the agent against a real module in your codebase
python qa_agent.py src/your_module.py

# Check generated tests were written
ls tests/test_your_module.py

# Confirm they pass independently
pytest tests/test_your_module.py -v
```
You should see: A new test file under tests/, a passing pytest run, and a printed summary from the agent noting any coverage gaps it identified.
Agent-generated tests covering edge cases you likely would have missed
What You Learned
- Agentic QA works because the model can reason about failures and adapt, not just execute a script
- The tool loop is the core primitive: call tool → get result → decide next action
- Keep tools minimal at first—`read`, `run`, `write` covers 80% of QA automation needs
- This approach works best for pure logic modules; UI testing still needs human-defined selectors
Limitation: The agent doesn't know your domain—it writes tests based on code structure, not business rules. Always review generated tests before merging to main.
When NOT to use this: For security-critical code paths, write tests yourself. You understand the threat model; the agent doesn't.
Tested on Python 3.12, anthropic SDK 0.25+, pytest 8.x, macOS & Ubuntu 24.04