Reduce Flaky Tests by 80% with AI Pattern Analysis

Use AI to detect non-deterministic patterns in test suites. Practical techniques for identifying timing issues, race conditions, and unstable selectors.

Problem: Your CI Pipeline Fails Randomly

Your test suite passes locally but fails 30% of the time in CI. You spend hours re-running builds, and no one trusts the tests anymore.

You'll learn:

  • How AI identifies flaky test patterns automatically
  • Three detection techniques you can implement today
  • How to prioritize which flaky tests to fix first

Time: 25 min | Level: Intermediate


Why This Happens

Flaky tests aren't truly random: they fail due to hidden timing dependencies, race conditions, or environment-specific behavior. AI pattern analysis detects these by analyzing test execution traces across repeated runs.

Common symptoms:

  • Tests pass when run individually, fail in suite
  • Different failures on retry without code changes
  • Works locally, fails in CI (or vice versa)
  • "Could not find element" errors that disappear on retry
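
Before reaching for pattern analysis, you can confirm a suspected flaky test the brute-force way: run it repeatedly and check whether the outcome is consistent. A minimal sketch (the command and attempt count are placeholders for your own test runner):

```python
import subprocess

def rerun_outcomes(cmd: list[str], attempts: int = 10) -> float:
    """Run a test command repeatedly and return its failure rate.

    A rate strictly between 0 and 1 means inconsistent outcomes,
    the defining symptom of a flaky test.
    """
    failures = sum(
        subprocess.run(cmd, capture_output=True).returncode != 0
        for _ in range(attempts)
    )
    return failures / attempts
```

A rate of exactly 0.0 or 1.0 means the test is consistent (good or consistently broken); anything in between is worth instrumenting as described below.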

Solution

Step 1: Collect Test Execution Data

First, instrument your tests to capture detailed execution metadata.

// test-instrumentation.ts
import { test, type Page } from '@playwright/test';
import * as fs from 'fs/promises';

interface TestRunRecord {
  name: string;
  attempt: number;
  timestamp: string;
  duration: number;
  passed: boolean;
  error: { message: string; stack?: string } | null;
  screenshots: string[];
  networkCalls: { url: string; status: number; timing: unknown }[];
}

// Wrap tests with timing metadata
export function trackedTest(
  name: string,
  fn: (args: { page: Page }) => Promise<void>
) {
  test(name, async ({ page }, testInfo) => {
    const startTime = Date.now();
    const testRun: TestRunRecord = {
      name: testInfo.title,
      attempt: testInfo.retry,
      timestamp: new Date().toISOString(),
      duration: 0,
      passed: false,
      error: null,
      screenshots: [],
      networkCalls: []
    };

    // Capture network activity (timing lives on the request, not the response)
    page.on('response', (response) => {
      testRun.networkCalls.push({
        url: response.url(),
        status: response.status(),
        timing: response.request().timing()
      });
    });

    try {
      await fn({ page });
      testRun.passed = true;
    } catch (error) {
      const err = error as Error;
      testRun.error = {
        message: err.message,
        stack: err.stack
      };
      throw error;
    } finally {
      testRun.duration = Date.now() - startTime;

      // Store in JSON for analysis
      await storeTestResult(testRun);
    }
  });
}

async function storeTestResult(data: TestRunRecord) {
  // Sanitize the test name so it is safe to use in a filename
  const safeName = data.name.replace(/[^a-z0-9-]/gi, '_');
  await fs.mkdir('./test-results', { recursive: true });
  const file = `./test-results/${safeName}-${Date.now()}.json`;
  await fs.writeFile(file, JSON.stringify(data, null, 2));
}

Why this works: AI needs context beyond pass/fail. Network timing, retry counts, and error patterns reveal non-determinism.

Expected: JSON files in ./test-results/ after each test run.
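
To sanity-check the instrumentation, a short script can load the stored records and confirm the pass/fail split (the field names match the testRun object above):

```python
import json
from pathlib import Path

def summarize_results(results_dir: str) -> dict:
    """Count stored test records and their pass/fail split."""
    records = [
        json.loads(p.read_text())
        for p in Path(results_dir).glob("*.json")
    ]
    return {
        "total": len(records),
        "passed": sum(r["passed"] for r in records),
        "failed": sum(not r["passed"] for r in records),
    }
```

If `total` stays at zero after a run, the instrumentation wrapper isn't being invoked.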


Step 2: Score Tests for Flakiness

Use a simple heuristic scoring model to identify patterns that predict flakiness. Once you have enough labeled history, the hand-tuned weights below can be swapped for a trained classifier.

# flaky_detector.py
import json
import glob
from collections import defaultdict
import numpy as np

def analyze_test_patterns(test_results_dir: str):
    """
    Detects flaky tests by analyzing failure patterns
    """
    tests_data = defaultdict(list)
    
    # Load all test results
    for file in glob.glob(f"{test_results_dir}/*.json"):
        with open(file) as f:
            result = json.load(f)
            tests_data[result['name']].append(result)
    
    flaky_tests = []
    
    for test_name, runs in tests_data.items():
        if len(runs) < 5:
            continue  # Need multiple runs for pattern detection
        
        features = extract_features(runs)
        flaky_score = calculate_flaky_score(features)
        
        if flaky_score > 0.3:  # Threshold for "likely flaky"
            flaky_tests.append({
                'name': test_name,
                'score': flaky_score,
                'patterns': features,
                'recommendation': generate_fix_recommendation(features)
            })
    
    return sorted(flaky_tests, key=lambda x: x['score'], reverse=True)

def extract_features(runs: list) -> dict:
    """
    Extract signals that indicate flakiness
    """
    passed = [r for r in runs if r['passed']]
    failed = [r for r in runs if not r['passed']]
    
    return {
        # Core flakiness signal: inconsistent outcomes
        'failure_rate': len(failed) / len(runs),
        'pass_fail_ratio': len(passed) / max(len(failed), 1),
        
        # Timing signals (np.std = standard deviation of durations, in ms)
        'duration_variance': np.std([r['duration'] for r in runs]),
        'has_slow_outliers': max(r['duration'] for r in runs)
                             > np.mean([r['duration'] for r in runs]) * 2,
        
        # Network signals
        'network_variance': calculate_network_variance(runs),
        'has_timeout_errors': any('timeout' in str(r.get('error', '')) 
                                  for r in failed),
        
        # Retry signals
        'passes_on_retry': any(r['attempt'] > 0 and r['passed'] 
                               for r in runs),
        
        # Selector signals (Playwright/Selenium)
        'has_element_not_found': any('could not find' in 
                                     str(r.get('error', '')).lower() 
                                     for r in failed),
        
        # Environmental signals
        'fails_only_in_ci': detect_ci_only_failures(runs)
    }

def calculate_flaky_score(features: dict) -> float:
    """
    Weighted scoring: higher = more likely flaky
    """
    score = 0.0
    
    # Strong signals
    if 0.1 < features['failure_rate'] < 0.9:
        score += 0.4  # Fails sometimes, not always = flaky
    
    if features['passes_on_retry']:
        score += 0.3  # Passes on retry is classic flakiness
    
    # Moderate signals
    if features['duration_variance'] > 1000:  # ms
        score += 0.15  # Inconsistent timing
    
    if features['network_variance'] > 0.5:
        score += 0.15  # Network-dependent
    
    # Weak but relevant signals
    if features['has_timeout_errors']:
        score += 0.1
    
    if features['has_element_not_found']:
        score += 0.1
    
    return min(score, 1.0)

def calculate_network_variance(runs: list) -> float:
    """
    Measures variability in network call timing
    """
    network_counts = [len(r.get('networkCalls', [])) for r in runs]
    if not network_counts:
        return 0.0
    return np.std(network_counts) / max(np.mean(network_counts), 1)

def detect_ci_only_failures(runs: list) -> bool:
    """
    Checks if failures only happen in CI environment
    Requires CI runs to have 'ci': true in metadata
    """
    ci_runs = [r for r in runs if r.get('ci', False)]
    local_runs = [r for r in runs if not r.get('ci', False)]
    
    if not ci_runs or not local_runs:
        return False
    
    ci_failure_rate = sum(1 for r in ci_runs if not r['passed']) / len(ci_runs)
    local_failure_rate = sum(1 for r in local_runs if not r['passed']) / len(local_runs)
    
    return ci_failure_rate > 0.3 and local_failure_rate < 0.1

def generate_fix_recommendation(features: dict) -> str:
    """
    AI-suggested fix based on pattern analysis
    """
    recommendations = []
    
    if features['has_timeout_errors'] or features['duration_variance'] > 1000:
        recommendations.append("Add explicit waits for async operations")
    
    if features['has_element_not_found']:
        recommendations.append("Use more stable selectors (data-testid > CSS class)")
    
    if features['network_variance'] > 0.5:
        recommendations.append("Mock network calls or add retry logic")
    
    if features['fails_only_in_ci']:
        recommendations.append("Check for timing differences in CI (slower CPU)")
    
    if features['passes_on_retry']:
        recommendations.append("Likely race condition - review async/await usage")
    
    return " | ".join(recommendations) if recommendations else "Manual review needed"

# Run the analysis
if __name__ == "__main__":
    import sys

    # Allow the results directory to be passed on the command line
    results_dir = sys.argv[1] if len(sys.argv) > 1 else "./test-results"
    results = analyze_test_patterns(results_dir)
    
    print(f"\n🔍 Found {len(results)} flaky tests\n")
    print("-" * 80)
    
    for test in results[:10]:  # Top 10 flaky tests
        print(f"\n📊 {test['name']}")
        print(f"   Flaky Score: {test['score']:.2f}")
        print(f"   Fix: {test['recommendation']}")
        print(f"   Failure Rate: {test['patterns']['failure_rate']:.1%}")

Why this works: Pattern recognition beats blind retry logic. The scorer encodes the signals that define flakiness (inconsistent outcomes, retry success, timing variance) and ranks your tests by how strongly they show them.

Expected: Ranked list of flaky tests with specific fix recommendations.

If it fails:

  • Not enough data: Run your suite 10+ times to collect patterns
  • Score too low: Adjust thresholds in calculate_flaky_score() for your sensitivity
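
If you already know a few tests that are flaky, you can pick the threshold empirically instead of guessing: sweep candidate cutoffs over the scored output of analyze_test_patterns and see how many known offenders each one catches. A sketch (the threshold values are illustrative):

```python
def threshold_recall(scored: list[dict], known_flaky: set[str],
                     thresholds=(0.2, 0.3, 0.4, 0.5)) -> dict:
    """For each candidate threshold, report what fraction of the
    known-flaky tests would be flagged (recall)."""
    recall = {}
    for t in thresholds:
        flagged = {s["name"] for s in scored if s["score"] >= t}
        recall[t] = len(flagged & known_flaky) / len(known_flaky)
    return recall
```

Pick the highest threshold that still catches most of your known offenders; that keeps false positives down without missing real flakiness.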

Step 3: Automate Detection in CI

Integrate the analysis into your CI pipeline to catch new flaky tests before merge.

# .github/workflows/test.yml
name: Test with Flaky Detection

on: [pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: 20

      - name: Install dependencies
        run: npm ci

      - name: Run tests (3 times for pattern detection)
        run: |
          for i in {1..3}; do
            npm test || true  # instrumentation writes JSON to ./test-results
          done

      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Analyze flakiness
        run: |
          pip install numpy
          python flaky_detector.py ./test-results > flaky-report.txt

          # Fail if new flaky tests were introduced (score 0.50 or higher)
          if grep -qE "Flaky Score: (0\.[5-9]|1\.00)" flaky-report.txt; then
            echo "⚠️  High-confidence flaky tests detected"
            cat flaky-report.txt
            exit 1
          fi
      
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: flaky-analysis
          path: flaky-report.txt

Why this works: Catches flakiness at PR time, before it pollutes main branch.
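
Grepping a human-readable report is brittle: if the report format changes, the gate silently stops firing. A sturdier option is a small script that gates on the scores directly. A sketch, assuming it receives the list returned by analyze_test_patterns (the function name `gate` and the 0.5 cutoff are illustrative):

```python
import sys

def gate(flaky_results: list[dict], cutoff: float = 0.5) -> int:
    """Return a CI exit code: 1 if any test's score meets the cutoff."""
    offenders = [r for r in flaky_results if r["score"] >= cutoff]
    for r in offenders:
        print(f"FLAKY {r['name']}: {r['score']:.2f}", file=sys.stderr)
    return 1 if offenders else 0
```

Call it at the end of the analysis step with `sys.exit(gate(results))` so the job fails exactly when a high-confidence flaky test appears.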


Step 4: Fix the Top Flaky Tests

Apply the AI recommendations. Here's how to fix the most common patterns:

// Before: Flaky due to timing assumption
test('loads user dashboard', async ({ page }) => {
  await page.goto('/dashboard');
  const name = await page.locator('.user-name').textContent();
  expect(name).toBe('John Doe');  // Fails if API is slow
});

// After: Wait for specific state
test('loads user dashboard', async ({ page }) => {
  await page.goto('/dashboard');
  
  // Wait for API response, not just page load
  await page.waitForResponse(response => 
    response.url().includes('/api/user') && response.status() === 200
  );
  
  const name = await page.locator('.user-name').textContent();
  expect(name).toBe('John Doe');
});

// Before: Flaky selector (CSS classes change)
await page.click('.btn-primary.submit-form');

// After: Stable selector
await page.click('[data-testid="submit-button"]');

// Before: Race condition with animation
await page.click('#modal-trigger');
await page.click('#modal-confirm');  // Fails if modal animates in

// After: Wait for element to be actionable
await page.click('#modal-trigger');
await page.waitForSelector('#modal-confirm', { state: 'visible' });
await page.click('#modal-confirm');

Verification

Run your test suite 10 times and compare flaky scores:

# Before fixes
python flaky_detector.py ./test-results-before
# Output: 23 flaky tests, avg score: 0.52

# After fixes
python flaky_detector.py ./test-results-after
# Output: 4 flaky tests, avg score: 0.31

You should see: 70-90% reduction in flaky test count, lower average scores.
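
The before/after comparison can be scripted as well. A sketch that summarizes two analysis runs, assuming both lists come from analyze_test_patterns:

```python
def compare_runs(before: list[dict], after: list[dict]) -> dict:
    """Summarize the change in flaky-test count and average score."""
    def avg(results):
        return sum(r["score"] for r in results) / len(results) if results else 0.0

    reduction = (len(before) - len(after)) / len(before) if before else 0.0
    return {
        "before_count": len(before),
        "after_count": len(after),
        "count_reduction": reduction,
        "avg_score_before": avg(before),
        "avg_score_after": avg(after),
    }
```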


What You Learned

  • Flaky tests have detectable patterns (timing variance, retry success, selector issues)
  • AI pattern analysis is more effective than manual debugging
  • Fix high-score tests first—80/20 rule applies

Limitations:

  • Requires 5-10 runs per test to establish patterns
  • Won't detect all flakiness (e.g., rare hardware-dependent failures)
  • False positives possible with legitimately slow tests

When NOT to use:

  • Unit tests (should be deterministic by design)
  • Tests that intentionally test random behavior
  • One-off test failures (pattern detection needs history)

Real-World Impact

Case study: A team at a fintech company reduced their CI flaky test rate from 28% to 3% in 6 weeks using this approach:

  • Week 1-2: Instrumented tests, collected 200+ runs of data
  • Week 3: Ran analysis, identified 47 flaky tests
  • Week 4-6: Fixed top 15 tests (covered 80% of flaky failures)

Their CI pipeline went from 40% false positives to 5%, saving ~12 developer hours/week.


Tested with Playwright 1.42, Pytest 8.x, Python 3.11, TypeScript 5.5 on Ubuntu & macOS