Your legacy Python service has 0% test coverage and a deploy on Friday. AI can't write perfect tests — but it can get you to 80% coverage before standup. Forget the hand-wringing about "thoughtful test design." Right now, you need a tactical assault on your untested codebase using the AI agents already sitting in your IDE. This isn't about philosophical purity; it's about creating a functional safety net that stops you from shipping a TypeError: 'NoneType' object is not subscriptable to production at 5 PM.
We'll use AI not as an oracle, but as a force multiplier, automating the tedious 80% so you can focus on the critical 20%. The goal is a passing CI check and the confidence to refactor. Let's move.
Why 80% Coverage Is a Lie You Can Trust
Chasing 100% test coverage is like polishing a bicycle in a drag race—a pointless distraction that makes you feel productive. The real metric isn't the percentage; it's risk reduction per line of test code. A single, well-placed test on a core payment calculation function does more for your system's health than 100 tests checking that your Pydantic models can be instantiated.
With pytest being used by 84% of Python developers for testing (Python Developers Survey 2025), you're already in the right ecosystem. The problem is inertia. AI cuts through that. It doesn't get bored generating the fiftieth test for a utility module. Your job is to direct its fire.
First, know what not to measure. Coverage tells you what was executed, not if it was asserted correctly. A test that calls a function but makes no assertions gives you coverage. It also gives you a false sense of security. We'll use coverage as a scouting report, not a scorecard.
Triage First: Let AI Read the Map with coverage.py
Before you generate a single test, you need to know what you're dealing with. Blindly asking an AI to "write tests" for your project is a waste of tokens. You must triage.
1. Generate the Initial Report: Use `pytest` with the `pytest-cov` plugin to get a baseline. Don't install anything manually; use `uv` because it's 10–100x faster than pip for cold installs.

   ```shell
   # In your project root
   uv pip install pytest pytest-cov
   uv run pytest --cov=your_package --cov-report=html --cov-report=term-missing
   ```

   This creates an `htmlcov/` directory. Open `index.html`. You'll see a depressing sea of red (0% coverage). This is your starting point.

2. AI-Assisted Gap Analysis: This is your first AI prompt. Don't ask for code yet. Ask for analysis. Copy the terminal output from `--cov-report=term-missing` or take a screenshot of the HTML report and feed it to your AI agent (GitHub Copilot Chat, Continue.dev, etc.).
--cov-report=term-missingor take a screenshot of the HTML report and feed it to your AI agent (GitHub Copilot Chat, Continue.dev, etc.).Prompt: "Here's a coverage report for my legacy Python package. List the top 5 modules by uncovered line count. For each, categorize the code: is it mostly pure functions, class definitions, database logic (SQLAlchemy), or API endpoints (FastAPI)? Prioritize a list for test generation based on business criticality and testability."
The AI will give you a battle plan. Start with the modules full of pure, side-effect-free functions. They're the low-hanging fruit where AI excels.
Generating Tests for Pure Functions: AI's Happy Place
Pure functions (same input, same output, no side-effects) are AI test generation's killer app. Find a file like calculations.py or utils.py.
Real Error & Fix: You run the AI-generated tests and immediately hit:
ModuleNotFoundError: No module named 'pandas'
Fix: Your AI is generating tests assuming global dependencies. You must constrain it. Use uv to install the needed package in your active environment: uv add pandas. Then, ensure your test runner is using the correct interpreter (in VS Code, Ctrl+Shift+P and select "Python: Select Interpreter").
Here's a real example. Suppose you have this legacy function:
```python
# legacy/price_calc.py
def calculate_discounted_price(base_price: float, discount_pct: float, tax_rate: float) -> float:
    """Calculate final price after discount and tax. No one has touched this since 2019."""
    if discount_pct < 0 or discount_pct > 100:
        raise ValueError("Discount must be between 0 and 100")
    discounted = base_price * (1 - discount_pct / 100)
    final = discounted * (1 + tax_rate / 100)
    return round(final, 2)
```
Feed this function to your AI agent with a focused prompt.
Prompt: "Generate a complete pytest file for this function. Use parameterized tests (@pytest.mark.parametrize) to cover: 1) normal cases, 2) edge cases (zero discount, 100% discount), 3) invalid inputs that should raise ValueError. Assume the file is test_price_calc.py. Use clear, descriptive test names."
You'll get back something like this, which is 90% correct out of the gate:
```python
# tests/test_price_calc.py
import pytest

from legacy.price_calc import calculate_discounted_price


@pytest.mark.parametrize(
    "base, discount, tax, expected",
    [
        (100.0, 10.0, 8.0, 97.2),    # Normal case
        (100.0, 0.0, 10.0, 110.0),   # No discount
        (100.0, 100.0, 5.0, 0.0),    # Free product, still tax? (Might need logic check!)
        (55.55, 12.5, 7.25, 52.13),  # More complex numbers
    ],
)
def test_calculate_discounted_price_normal(base, discount, tax, expected):
    """Test standard calculation paths."""
    assert calculate_discounted_price(base, discount, tax) == expected


@pytest.mark.parametrize(
    "base, discount, tax",
    [
        (100.0, -5.0, 10.0),
        (100.0, 150.0, 10.0),
    ],
)
def test_calculate_discounted_price_invalid_discount(base, discount, tax):
    """Test that invalid discounts raise ValueError."""
    with pytest.raises(ValueError, match="Discount must be between 0 and 100"):
        calculate_discounted_price(base, discount, tax)
```
Run it with uv run pytest tests/test_price_calc.py -v. It passes. You've just covered a key business function in minutes. The AI might miss a subtle edge (e.g., is tax applied on a zero-price item?), but it's given you a robust foundation to edit. This is the flywheel: AI writes the boilerplate, you apply the critical thinking.
Tackling the Side-Effect Hell: AI-Generated Mocks
Your legacy code isn't all pure functions. It's probably littered with database calls and HTTP requests. This is where you must teach AI to use unittest.mock.
Take a function that talks to a database:
```python
# legacy/user_repo.py
from sqlalchemy.orm import Session


def get_active_users(db: Session) -> list:
    """Fetches all active users. Directly couples logic to DB."""
    # Legacy code often has raw SQL or complex ORM queries
    users = db.execute("SELECT * FROM users WHERE status = 'ACTIVE'").fetchall()
    return [dict(u) for u in users]
```
You cannot run this test without a live database. So you mock. Prompt your AI agent precisely.
Prompt: "Generate a pytest test for get_active_users. Use unittest.mock.patch to mock the db Session object's execute method. Have the mock return a fake result set that mimics SQLAlchemy's result proxy. Test that the function returns the correct list of dictionaries. Assume from unittest.mock import Mock, patch."
The AI will generate the scaffolding for you:
```python
# tests/test_user_repo.py
from unittest.mock import Mock

from legacy.user_repo import get_active_users


def test_get_active_users():
    # 1. Create a fake result set. Use a real mapping so dict(row) works;
    #    a bare Mock would blow up when the function calls dict() on it.
    fake_row = {'id': 1, 'name': 'Alice', 'status': 'ACTIVE'}
    fake_result = Mock()
    fake_result.fetchall.return_value = [fake_row]

    # 2. Create a mock db session
    mock_db = Mock()
    mock_db.execute.return_value = fake_result

    # 3. Call the function with the mock
    result = get_active_users(mock_db)

    # 4. Assert the interactions and output
    mock_db.execute.assert_called_once_with("SELECT * FROM users WHERE status = 'ACTIVE'")
    assert result == [{'id': 1, 'name': 'Alice', 'status': 'ACTIVE'}]
```
This test validates the integration point: does the function call the right SQL? Does it transform the result correctly? The mock is generated by AI, but you must verify it matches SQLAlchemy's real API. Run it, see it pass, and you've neutered a side-effect.
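One way to keep AI-generated mocks honest is `unittest.mock.create_autospec`: an autospecced mock rejects attributes and signatures that don't exist on the real class, so a hallucinated method fails loudly instead of silently returning another Mock. A minimal sketch (the `Session` class here is a stand-in for `sqlalchemy.orm.Session`, inlined so the snippet is self-contained):

```python
from unittest.mock import create_autospec


class Session:
    """Stand-in for sqlalchemy.orm.Session, to keep the sketch self-contained."""

    def execute(self, statement):
        raise NotImplementedError


mock_db = create_autospec(Session, instance=True)
mock_db.execute("SELECT 1")      # Fine: execute() exists on the spec
try:
    mock_db.exceute("SELECT 1")  # Typo: autospec raises AttributeError immediately
except AttributeError:
    print("autospec caught the typo")
```

Plain `Mock()` would happily accept the typo and return another Mock, letting a broken test pass; autospec turns that drift into a hard failure.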
Mutation Testing: The AI Test Quality Audit
AI can generate tests that pass but are vacuously weak. How do you know? Enter mutmut, a mutation testing tool. It makes small, logical changes (mutations) to your source code and sees if your tests fail. If a mutation survives, your tests didn't catch it.
Let's audit our AI-generated tests for calculate_discounted_price.
```shell
uv pip install mutmut
uv run mutmut run --paths-to-mutate legacy/price_calc.py
```
mutmut might change the line return round(final, 2) to return round(final, 3). If your tests still pass, they're not checking the rounding precision closely enough—a potentially critical bug for financial data. The AI's parameterized test gave us numbers, but maybe not the right edge case. You see the report, go back, and add a specific test for rounding behavior. You're now using AI to write tests and another tool to pressure-test the AI's work.
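A test that kills that specific mutant needs an input where rounding to two decimals and rounding to three give different answers. A sketch, with `calculate_discounted_price` inlined from earlier so the snippet stands alone:

```python
def calculate_discounted_price(base_price: float, discount_pct: float, tax_rate: float) -> float:
    # Inlined from legacy/price_calc.py for a standalone snippet
    if discount_pct < 0 or discount_pct > 100:
        raise ValueError("Discount must be between 0 and 100")
    discounted = base_price * (1 - discount_pct / 100)
    final = discounted * (1 + tax_rate / 100)
    return round(final, 2)


def test_rounds_to_exactly_two_decimals():
    # 10.0 * 1.0018 = 10.018 -> 10.02 at two decimals, but 10.018 at three.
    # The round(final, 3) mutant returns 10.018 and fails this assert.
    assert calculate_discounted_price(10.0, 0.0, 0.18) == 10.02
```

Pick values deliberately off the two-decimal grid; round numbers like 100.0 and 10% can never distinguish precision mutants.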
Benchmark: AI-Generated vs. Human-Written Tests
Is this approach actually faster? Let's be empirical. Take one moderately complex module (~200 LOC) from your codebase.
| Task | AI-Assisted Workflow (with prompts) | Manual Writing |
|---|---|---|
| Initial Test Scaffolding | ~5 minutes (writing prompts, running output) | ~30 minutes (boilerplate, imports) |
| Achieving 80% Line Coverage | ~25 minutes | ~90 minutes |
| Mock Setup for 3 External Dependencies | ~10 minutes (AI generates mock structure) | ~30 minutes (reading API docs, crafting mocks) |
| Catching Logic Edge Cases | Requires human review (AI misses subtle business rules) | Human-driven (built into the process) |
| Total Time to Robust Suite | ~40 min + human review time | ~120+ minutes |
The table shows the trade-off. AI obliterates the boilerplate and speed of initial coverage, but you must remain the domain expert in the loop. It's a partnership, not a replacement.
Integrating the Gate: CI That Enforces 80%
The final step is to make this stick. You need a CI pipeline that fails if coverage dips below your 80% target, locking in the gains. Here's a minimal GitHub Actions workflow (.github/workflows/test.yml):
```yaml
name: Test & Enforce Coverage
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3 # The blazing-fast installer
      # --system: GitHub runners have no active venv for uv to target
      - run: uv pip install --system pytest pytest-cov ruff
      # Lint first, because clean code is easier to test (no --fix in CI;
      # it would modify files nobody commits)
      - run: ruff check .
      - run: pytest --cov=your_package --cov-fail-under=80 --cov-report=term-missing
```
The key flag is --cov-fail-under=80. Now, any new code that lowers coverage below your hard-won 80% will break the build. The team is forced to write tests alongside features, or the AI-assisted tests you've generated become the new baseline.
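The same threshold can live in configuration instead of the command line, so local runs and CI enforce identical rules. A sketch for `pyproject.toml` (section names per the pytest and coverage.py docs; `your_package` is a placeholder):

```toml
# pyproject.toml
[tool.pytest.ini_options]
addopts = "--cov=your_package --cov-report=term-missing"

[tool.coverage.report]
fail_under = 80
show_missing = true
```

With this in place, a bare `uv run pytest` anywhere behaves exactly like the CI gate.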
Real Error & Fix: Your CI fails with:
MemoryError with large DataFrames in a legacy pandas transformation.
Fix: This is where you upgrade your approach. Don't just write a test; fix the code. Use chunked reading: pd.read_csv('huge.csv', chunksize=10000). Or, better yet, replace the component with Polars, which is designed for efficiency. The test coverage highlighted a real performance risk, which is its highest purpose.
Next Steps: From 80% to Sustainable
You've gone from 0% to 80% using AI as a turbocharger. The silicon has done its job. Now, shift gears:
- Targeted Human Review: Use the coverage HTML report to manually inspect the remaining 20% of uncovered lines. These are often tricky integration points, error handlers, or deprecated code paths. Write these tests yourself; they're the most valuable.
- Shift from Coverage to Property-Based Testing: For your core logic, use Hypothesis. Give AI the prompt: "Generate Hypothesis strategies for this function that fuzz the input space." This finds bugs coverage misses.
- Refactor with Confidence: That gnarly 50-line function you now have 80% coverage on? Refactor it. Split it. The tests you just generated will tell you if you broke something. This is the true payoff: turning legacy code into maintainable code.
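What comes back for that Hypothesis prompt typically looks like the sketch below (requires `uv add hypothesis`; the chosen invariant — price never goes negative for any valid input — is an assumption about the business rule, and the function is inlined so the snippet stands alone):

```python
from hypothesis import given, strategies as st


def calculate_discounted_price(base_price: float, discount_pct: float, tax_rate: float) -> float:
    # Inlined from legacy/price_calc.py for a standalone snippet
    if discount_pct < 0 or discount_pct > 100:
        raise ValueError("Discount must be between 0 and 100")
    discounted = base_price * (1 - discount_pct / 100)
    final = discounted * (1 + tax_rate / 100)
    return round(final, 2)


@given(
    base=st.floats(min_value=0, max_value=1e6, allow_nan=False),
    discount=st.floats(min_value=0, max_value=100, allow_nan=False),
    tax=st.floats(min_value=0, max_value=50, allow_nan=False),
)
def test_price_is_never_negative(base, discount, tax):
    # Property: for every valid input combination, the price can't go below zero
    assert calculate_discounted_price(base, discount, tax) >= 0
```

One property test like this exercises hundreds of generated inputs per run, probing corners your parameterized cases never listed.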
The goal was never a perfect test suite. It was a functional shield that lets you deploy on Friday without dread. AI got you the shield in hours, not days. Now go fix the actual code.