Problem: Your 100% Coverage Doesn't Catch Bugs
Your CI shows 100% test coverage, but production bugs still slip through. The AI-generated tests execute every line but don't actually validate behavior.
You'll learn:
- Why line coverage misleads teams
- How to use AI for behavior-driven test generation
- A 3-step workflow that catches real bugs
Time: 30 min | Level: Intermediate
Why This Happens
Coverage tools measure lines executed, not logic validated. A test that calls a function without assertions gives you 100% coverage and 0% confidence.
Common symptoms:
- Tests pass but features break in production
- Changing implementation breaks tests that shouldn't care
- Coverage reports look great but you're afraid to refactor
- AI generates assertion-light tests like `expect(result).toBeDefined()`
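To see the difference concretely, here's a minimal sketch (the `calculateTotal` function is hypothetical): the coverage-only assertion executes every line while tolerating any wrong answer, whereas the behavior assertions pin the actual contract.

```typescript
// Hypothetical function under test.
function calculateTotal(prices: number[]): number {
  return prices.reduce((sum, p) => sum + p, 0);
}

// Coverage-only "test": every line runs, nothing is validated.
// This passes even if calculateTotal returned the wrong sum.
console.assert(calculateTotal([10, 20]) !== undefined);

// Behavior tests: these fail the moment the logic is wrong.
console.assert(calculateTotal([10, 20]) === 30);
console.assert(calculateTotal([]) === 0); // boundary: empty cart
```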
Solution
Step 1: Define Test Boundaries
Before generating any tests, identify what actually matters.
```typescript
// payment-processor.ts
export class PaymentProcessor {
  constructor(private db: Database) {}

  async processPayment(amount: number, currency: string): Promise<Receipt> {
    // 50 lines of implementation
  }
}
```
Ask the AI this instead of "write tests":

```text
Analyze this PaymentProcessor class. What are:
1. Edge cases that could cause real money errors?
2. Error conditions users will hit?
3. State changes that could corrupt data?

Don't write tests yet. List scenarios.
```
Expected: You get a list like "negative amounts", "unsupported currency", "network timeouts", "duplicate transaction IDs"
If it fails:
- AI jumps to code: Explicitly say "no code, scenarios only"
- Generic answers: Provide context: "this handles $2M daily, chargebacks cost $50 each"
Step 2: Generate Behavior Tests
Now ask for tests that validate those scenarios.
```text
// Prompt to AI
Based on the scenario "negative amounts should be rejected", write a test that:
- Tries amount: -100
- Verifies the specific error thrown
- Checks that no database writes happened
- Validates the error includes the amount for debugging

Use vitest. Show the complete test with setup/teardown.
```
AI generates:
```typescript
import { describe, it, expect, beforeEach, vi } from 'vitest';
import { PaymentProcessor } from './payment-processor';

describe('PaymentProcessor - negative amount handling', () => {
  let processor: PaymentProcessor;
  let mockDb: { write: ReturnType<typeof vi.fn> };

  beforeEach(() => {
    mockDb = { write: vi.fn() };
    processor = new PaymentProcessor(mockDb);
  });

  it('rejects negative amounts before touching database', async () => {
    // This validates the actual business requirement
    await expect(
      processor.processPayment(-100, 'USD')
    ).rejects.toThrow('Amount must be positive, received: -100');

    // This ensures atomicity - critical for financial code
    expect(mockDb.write).not.toHaveBeenCalled();
  });
});
```
Why this works: The test validates business rules (positive amounts) and side effects (no DB writes). Line coverage is a byproduct.
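For reference, here is a minimal implementation sketch that passes that test. The `Db` and `Receipt` shapes are assumptions for illustration; the article elides the real implementation. The key design point is that validation precedes every side effect.

```typescript
// Assumed shapes, not the article's actual types.
interface Receipt { id: string; amount: number; currency: string }
interface Db { write(record: Receipt): Promise<void> }

class PaymentProcessor {
  constructor(private db: Db) {}

  async processPayment(amount: number, currency: string): Promise<Receipt> {
    // Guard before any side effect: a rejection must leave no partial state.
    if (amount <= 0) {
      throw new Error(`Amount must be positive, received: ${amount}`);
    }
    const receipt: Receipt = { id: `rcpt-${Date.now()}`, amount, currency };
    await this.db.write(receipt);
    return receipt;
  }
}
```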
Step 3: Validate with Mutation Testing
Coverage says you tested the code. Mutation testing proves your tests catch bugs.
```bash
# Install Stryker for mutation testing
npm install --save-dev @stryker-mutator/core @stryker-mutator/vitest-runner

# Run it
npx stryker run
```
What happens: Stryker modifies your code (e.g., changes `amount > 0` to `amount >= 0`) and reruns tests. If tests still pass, you have weak tests.
```typescript
// Stryker found this survived mutation
if (amount > 0) { ... } // Changed to >= 0, tests passed

// Fix: Add a boundary test
it('rejects zero amount', async () => {
  await expect(
    processor.processPayment(0, 'USD')
  ).rejects.toThrow('Amount must be positive, received: 0');
});
```
Expected: Stryker reports an 80%+ mutation score (the percentage of mutants your tests kill)
If it fails:
- Too slow: Configure Stryker to test only modified files in CI
- Low score (<60%): Focus on boundary conditions AI might miss
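For the speed and scoping problems above, Stryker's config file lets you restrict mutation to specific files and enforce a minimum score. A `stryker.config.json` along these lines (paths and thresholds are illustrative; `break` fails the run below that score):

```json
{
  "testRunner": "vitest",
  "mutate": ["src/payment-processor.ts"],
  "reporters": ["clear-text", "progress"],
  "thresholds": { "high": 80, "low": 60, "break": 60 }
}
```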
Verification
Test it:
```bash
# Check coverage
npm run test -- --coverage

# Check mutation score
npx stryker run

# Verify both metrics
```
You should see:
- Line coverage: 95-100%
- Mutation score: 75-85% (perfect 100% is often impractical)
- Tests describe behaviors, not implementation
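One way to wire both gates into CI (GitHub Actions syntax; the script names are assumed to match your `package.json`):

```yaml
# Illustrative CI job: fail the build on weak tests, not just low coverage.
test-quality:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-node@v4
      with: { node-version: 20 }
    - run: npm ci
    - run: npm run test -- --coverage   # line coverage gate
    - run: npx stryker run              # mutation score gate (via "break" threshold)
```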
Real Example: Before/After
Before (AI default behavior)
```typescript
// AI prompt: "write tests for calculateDiscount"
it('should calculate discount', () => {
  const result = calculateDiscount(100, 'SAVE10');
  expect(result).toBeDefined(); // Useless assertion
  expect(typeof result).toBe('number'); // Still useless
});
```
Coverage: 100% | Mutation score: 20% | Bugs caught: 0
After (behavior-driven prompt)
```typescript
// AI prompt: "write tests for: invalid coupon should return full price,
// expired coupon throws error, percentage coupons cap at 90% off"
it('returns full price for invalid coupon code', () => {
  expect(calculateDiscount(100, 'FAKE')).toBe(100);
});

it('throws for expired coupon with expiry date in error', () => {
  expect(() => calculateDiscount(100, 'EXPIRED2025'))
    .toThrow(/expired on 2025-01-01/);
});

it('caps percentage discounts at 90% to prevent negative prices', () => {
  // Catches bug: 100% off coupon caused negative prices
  expect(calculateDiscount(100, 'EVERYTHING_FREE')).toBe(10);
});
```
Coverage: 100% | Mutation score: 78% | Bugs caught: 3 production issues
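For comparison, here is a `calculateDiscount` sketch that satisfies all three behaviors. The coupon table and cap value are invented for illustration; the real implementation is not shown in the article.

```typescript
// Hypothetical coupon table; real code would load this from a store.
const coupons: Record<string, { percentOff: number; expires?: string }> = {
  SAVE10: { percentOff: 10 },
  EXPIRED2025: { percentOff: 20, expires: '2025-01-01' },
  EVERYTHING_FREE: { percentOff: 100 },
};

function calculateDiscount(price: number, code: string): number {
  const coupon = coupons[code];
  if (!coupon) return price; // invalid coupon: full price, no error

  // Assumes the current date is past the expiry for the EXPIRED2025 example.
  if (coupon.expires && new Date(coupon.expires) < new Date()) {
    throw new Error(`Coupon ${code} expired on ${coupon.expires}`);
  }

  // Cap at 90% off so a misconfigured coupon can never yield a negative price.
  const percent = Math.min(coupon.percentOff, 90);
  return price - (price * percent) / 100;
}
```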
What You Learned
- Coverage measures execution, not correctness
- Ask AI for scenarios before code
- Mutation testing validates your test quality
Limitation: High mutation scores take time. Focus on critical paths first (payments, auth, data writes).
When NOT to use this:
- Simple getters/setters (skip them in coverage)
- Third-party library wrappers (integration tests better)
- UI component snapshots (different testing strategy)
Language-Specific Quick Starts
Python (pytest + mutmut)
```bash
# Coverage
pytest --cov=src --cov-report=html

# Mutation testing
pip install mutmut
mutmut run
mutmut results  # See which mutants survived
```
Go (go test + go-mutesting)
```bash
# Coverage
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out

# Mutation testing
go install github.com/zimmski/go-mutesting/cmd/go-mutesting@latest
go-mutesting ./...
```
Rust (cargo tarpaulin + cargo-mutants)
```bash
# Coverage
cargo tarpaulin --out Html

# Mutation testing
cargo install cargo-mutants
cargo mutants
```
AI Prompting Cheat Sheet
❌ Bad prompt: "Write unit tests for this code"
✅ Good prompt: "List 5 scenarios where this payment function could lose money or corrupt data. Include edge cases around currency conversion and timeouts. No code yet."
Then:
"Write a vitest test for scenario #2 (duplicate transaction IDs). The test should verify we return the original receipt without charging twice, and that we log the duplicate attempt."
Why: Specificity forces AI to think about behavior, not just syntax.
Tested with Claude Code 1.0, GPT-4, TypeScript 5.5, Python 3.12, Go 1.23, Rust 1.75