Problem: Manual QA Can't Keep Up With Your Shipping Speed
Your team ships daily, but QA is still a bottleneck. Test coverage is inconsistent, flaky tests go uninvestigated, and bugs slip through because writing good tests takes time nobody has.
Agentic workflows fix this by letting an AI agent generate tests, run them, analyze failures, and open tickets—without a human in the loop for every step.
You'll learn:
- How to structure a QA agent with tool use and a feedback loop
- How to generate meaningful tests from source code automatically
- How to triage failures and route them to the right team
Time: 20 min | Level: Intermediate
Why This Happens
Traditional test automation is passive—you write tests, a CI runner executes them, and a human reads the report. The bottleneck is always the human: writing tests, triaging failures, deciding what matters.
Agentic workflows break that loop. An agent can reason about what needs testing, act by running commands, and adapt based on what it observes—repeatedly, without stopping for approval.
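That reason-act-adapt cycle can be sketched as a loop. This is an illustrative sketch only — the `decide`/`act` callables and the toy failure list are made up, not part of the real agent built below:

```python
def agentic_loop(decide, act, max_steps=10):
    """Run decide/act cycles until the agent signals it is finished."""
    observation = None
    for _ in range(max_steps):
        action = decide(observation)   # the model reasons about what to do next
        if action is None:             # no action chosen -> agent is done
            break
        observation = act(action)      # act, then feed the result back in
    return observation

# Toy example: re-run failing tests one at a time until none remain.
failures = ["test_a", "test_b"]
decide = lambda obs: failures.pop(0) if failures else None
act = lambda test: f"re-ran {test}: passed"
print(agentic_loop(decide, act))  # re-ran test_b: passed
```

The point of the sketch: the loop, not the human, decides when to stop.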
Common symptoms that signal you need this:
- PRs merge without test coverage because nobody had time to write them
- Flaky tests are ignored instead of fixed
- The same class of bug keeps getting through
Solution
You'll build a QA agent using Python, the Anthropic SDK, and pytest. The agent gets three tools: read source code, run tests, and file a GitHub issue.
Step 1: Install Dependencies
```bash
pip install anthropic pytest pytest-json-report gitpython
```
Step 2: Define the Agent's Tools
The agent needs to see code, run tests, and act on failures. Define those as tools the model can call.
```python
# qa_agent.py
import anthropic
import subprocess
import json
from pathlib import Path

client = anthropic.Anthropic()

tools = [
    {
        "name": "read_file",
        "description": "Read source code or test files to understand what needs testing.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Relative file path to read"}
            },
            "required": ["path"]
        }
    },
    {
        "name": "run_tests",
        "description": "Run pytest on a specific file or directory. Returns JSON results.",
        "input_schema": {
            "type": "object",
            "properties": {
                "target": {"type": "string", "description": "File or directory to test"},
                "extra_args": {"type": "string", "description": "Optional extra pytest flags"}
            },
            "required": ["target"]
        }
    },
    {
        "name": "write_test_file",
        "description": "Write a new test file to disk.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Where to save the test file"},
                "content": {"type": "string", "description": "Full test file content"}
            },
            "required": ["path", "content"]
        }
    }
]
```
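A cheap way to catch schema mistakes before the model ever sees them is to check sample inputs against each tool's `required` fields. This is a minimal sketch, not a full JSON Schema validator — the trimmed `tools` list below mirrors only the fields the check needs:

```python
# Trimmed tool definitions: just the names and required fields from above.
tools = [
    {"name": "read_file", "input_schema": {"required": ["path"]}},
    {"name": "run_tests", "input_schema": {"required": ["target"]}},
    {"name": "write_test_file", "input_schema": {"required": ["path", "content"]}},
]

def check_input(tool_name: str, tool_input: dict) -> list[str]:
    """Return the list of required fields missing from tool_input."""
    schema = next(t["input_schema"] for t in tools if t["name"] == tool_name)
    return [key for key in schema["required"] if key not in tool_input]

print(check_input("run_tests", {"target": "tests/"}))    # []
print(check_input("write_test_file", {"path": "t.py"}))  # ['content']
```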
Why three tools only: More tools = more decision overhead. Start minimal. Add a GitHub issue tool once the core loop is stable.
Step 3: Implement Tool Execution
```python
def execute_tool(tool_name: str, tool_input: dict) -> str:
    if tool_name == "read_file":
        path = Path(tool_input["path"]).resolve()
        # Prevent path traversal outside the project directory
        if not path.is_relative_to(Path.cwd()):
            return f"Refusing to read outside the project: {path}"
        if not path.exists():
            return f"File not found: {path}"
        return path.read_text()
    elif tool_name == "run_tests":
        target = tool_input["target"]
        args = tool_input.get("extra_args", "")
        result = subprocess.run(
            f"pytest {target} {args} --json-report --json-report-file=report.json -q",
            shell=True,
            capture_output=True,
            text=True
        )
        # Return structured results so the agent can reason about failures
        try:
            report = json.loads(Path("report.json").read_text())
            summary = report.get("summary", {})
            failures = [
                {"test": t["nodeid"], "message": t["call"]["longrepr"]}
                for t in report.get("tests", [])
                if t["outcome"] == "failed"
            ]
            return json.dumps({"summary": summary, "failures": failures[:5]})
        except Exception:
            return result.stdout + result.stderr
    elif tool_name == "write_test_file":
        path = Path(tool_input["path"])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(tool_input["content"])
        return f"Written: {path}"
    return f"Unknown tool: {tool_name}"
```
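Before wiring the tools to the model, you can exercise them directly. The harness below is a self-contained, simplified copy of just the two file-handling branches (no path checks), so it runs standalone — `execute_file_tool` is a name for this sketch, not part of the agent:

```python
import tempfile
from pathlib import Path

def execute_file_tool(tool_name: str, tool_input: dict) -> str:
    """Simplified copy of the read/write branches, for a quick round-trip test."""
    if tool_name == "read_file":
        path = Path(tool_input["path"])
        if not path.exists():
            return f"File not found: {path}"
        return path.read_text()
    if tool_name == "write_test_file":
        path = Path(tool_input["path"])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(tool_input["content"])
        return f"Written: {path}"
    return f"Unknown tool: {tool_name}"

# Round trip: write a test file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    target = f"{tmp}/tests/test_demo.py"
    execute_file_tool("write_test_file",
                      {"path": target, "content": "def test_ok():\n    assert True\n"})
    assert execute_file_tool("read_file", {"path": target}).startswith("def test_ok")
    print("round trip OK")  # round trip OK
```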
If it fails:
- `report.json` not found: Make sure `pytest-json-report` is installed and the test runner has write access to the working directory.
- Subprocess hangs: Add a `timeout=60` parameter to `subprocess.run`.
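The timeout fix is worth wrapping so a hung run comes back to the agent as a readable string instead of an unhandled exception. A sketch — `run_with_timeout` is a helper name for this example, and the 60-second default is arbitrary:

```python
import subprocess

def run_with_timeout(cmd: str, timeout: int = 60) -> str:
    """Run a shell command, converting hangs into text the agent can read."""
    try:
        result = subprocess.run(cmd, shell=True, capture_output=True,
                                text=True, timeout=timeout)
        return result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return f"Command timed out after {timeout}s: {cmd}"

run_with_timeout("echo hello")          # returns "hello\n"
run_with_timeout("sleep 5", timeout=1)  # returns "Command timed out after 1s: sleep 5"
```

Returning the timeout as a tool result lets the agent decide what to do next (retry, narrow the test target) instead of crashing the loop.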
Step 4: Build the Agent Loop
This is the core agentic pattern: send a message, handle tool calls, feed results back, repeat until done.
```python
def run_qa_agent(source_file: str):
    system_prompt = """You are a QA engineer agent. Your job is to:
1. Read the source file provided
2. Write comprehensive pytest tests covering happy paths, edge cases, and error states
3. Run the tests and check results
4. If tests fail, diagnose the issue and fix your tests (not the source)
5. Report a summary of coverage gaps when done
Be direct. Write real tests, not placeholder tests."""

    messages = [
        {"role": "user", "content": f"Generate and run tests for: {source_file}"}
    ]
    print(f"Starting QA agent for {source_file}...")

    # Agent loop - runs until the model stops calling tools
    while True:
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=4096,
            system=system_prompt,
            tools=tools,
            messages=messages
        )
        # Append assistant response to message history
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "end_turn":
            # Print the final text response, then exit the loop
            for block in response.content:
                if hasattr(block, "text"):
                    print("\n--- Agent Report ---")
                    print(block.text)
            break

        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    print(f"  → {block.name}({list(block.input.keys())})")
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })
            # Feed results back so the agent can continue reasoning
            messages.append({"role": "user", "content": tool_results})
```
Why feed results back: The agent needs to see what happened to decide what to do next. Without feeding tool results into messages, it's flying blind.
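Concretely, here is the shape of `messages` after one round trip through the loop. The `toolu_01` id and the file contents are illustrative; what matters is that the `tool_result` block echoes the `tool_use` block's id:

```python
messages = [
    {"role": "user", "content": "Generate and run tests for: src/utils.py"},
    {   # assistant turn: the model asked to call a tool
        "role": "assistant",
        "content": [
            {"type": "tool_use", "id": "toolu_01", "name": "read_file",
             "input": {"path": "src/utils.py"}},
        ],
    },
    {   # user turn: our code returns the tool result, keyed by the same id
        "role": "user",
        "content": [
            {"type": "tool_result", "tool_use_id": "toolu_01",
             "content": "def add(a, b): return a + b"},
        ],
    },
]

# The tool_use_id must match the tool_use id, or the API rejects the turn.
assert messages[2]["content"][0]["tool_use_id"] == messages[1]["content"][0]["id"]
```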
Step 5: Run It
Add an entry point at the bottom of qa_agent.py so the script can be pointed at a specific module:
```python
# At the bottom of qa_agent.py
if __name__ == "__main__":
    import sys
    target = sys.argv[1] if len(sys.argv) > 1 else "src/utils.py"
    run_qa_agent(target)
```
```bash
python qa_agent.py src/payments.py
```
The agent reads source, writes tests, runs pytest, and reports failures—no manual steps in between.
Verification
```bash
# Run the agent against a real module in your codebase
python qa_agent.py src/your_module.py

# Check generated tests were written
ls tests/test_your_module.py

# Confirm they pass independently
pytest tests/test_your_module.py -v
```
You should see: A new test file under tests/, a passing pytest run, and a printed summary from the agent noting any coverage gaps it identified.
Agent-generated tests covering edge cases you likely would have missed
What You Learned
- Agentic QA works because the model can reason about failures and adapt, not just execute a script
- The tool loop is the core primitive: call tool → get result → decide next action
- Keep tools minimal at first—`read`, `run`, `write` covers 80% of QA automation needs
- This approach works best for pure logic modules; UI testing still needs human-defined selectors
Limitation: The agent doesn't know your domain—it writes tests based on code structure, not business rules. Always review generated tests before merging to main.
When NOT to use this: For security-critical code paths, write tests yourself. You understand the threat model; the agent doesn't.
Tested on Python 3.12, anthropic SDK 0.25+, pytest 8.x, macOS & Ubuntu 24.04