Reduce Flaky Tests by 80% with AI Pattern Analysis

Use AI to detect non-deterministic patterns in test suites. Practical techniques for identifying timing issues, race conditions, and unstable selectors.

Problem: Your CI Pipeline Fails Randomly

Your test suite passes locally but fails 30% of the time in CI. You spend hours re-running builds, and no one trusts the tests anymore.

You'll learn:

  • How AI identifies flaky test patterns automatically
  • Three detection techniques you can implement today
  • How to prioritize which flaky tests to fix first

Time: 25 min | Level: Intermediate


Why This Happens

Flaky tests aren't truly random: they fail due to hidden timing dependencies, race conditions, or environment-specific behavior. AI pattern analysis detects these by analyzing test execution traces across repeated runs.

Common symptoms:

  • Tests pass when run individually, fail in suite
  • Different failures on retry without code changes
  • Works locally, fails in CI (or vice versa)
  • "Could not find element" errors that disappear on retry
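
Before reaching for pattern analysis, you can confirm a suspected flaky test the brute-force way: run it repeatedly and check whether the outcome is consistent. A minimal sketch (the command and attempt count are placeholders for your own test runner):

```python
import subprocess

def rerun_outcomes(cmd: list[str], attempts: int = 10) -> float:
    """Run a test command repeatedly and return its failure rate.

    A rate strictly between 0 and 1 means inconsistent outcomes,
    the defining symptom of a flaky test.
    """
    failures = sum(
        subprocess.run(cmd, capture_output=True).returncode != 0
        for _ in range(attempts)
    )
    return failures / attempts
```

A rate of exactly 0.0 or 1.0 means the test is consistent (good or consistently broken); anything in between is worth instrumenting as described below.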

Solution

Step 1: Collect Test Execution Data

First, instrument your tests to capture detailed execution metadata.

// test-instrumentation.ts
import { test, type Page } from '@playwright/test';
import * as fs from 'fs/promises';

interface TestRunRecord {
  name: string;
  attempt: number;
  timestamp: string;
  duration: number;
  passed: boolean;
  error: { message: string; stack?: string } | null;
  screenshots: string[];
  networkCalls: { url: string; status: number; timing: unknown }[];
}

// Wrap tests with timing metadata
export function trackedTest(
  name: string,
  fn: (args: { page: Page }) => Promise<void>
) {
  test(name, async ({ page }, testInfo) => {
    const startTime = Date.now();
    const testRun: TestRunRecord = {
      name: testInfo.title,
      attempt: testInfo.retry,
      timestamp: new Date().toISOString(),
      duration: 0,
      passed: false,
      error: null,
      screenshots: [],
      networkCalls: []
    };

    // Capture network activity (timing lives on the request, not the response)
    page.on('response', (response) => {
      testRun.networkCalls.push({
        url: response.url(),
        status: response.status(),
        timing: response.request().timing()
      });
    });

    try {
      await fn({ page });
      testRun.passed = true;
    } catch (error) {
      const err = error as Error;
      testRun.error = {
        message: err.message,
        stack: err.stack
      };
      throw error;
    } finally {
      testRun.duration = Date.now() - startTime;

      // Store in JSON for analysis
      await storeTestResult(testRun);
    }
  });
}

async function storeTestResult(data: TestRunRecord) {
  // Sanitize the test name so it is safe to use in a filename
  const safeName = data.name.replace(/[^a-z0-9-]/gi, '_');
  await fs.mkdir('./test-results', { recursive: true });
  const file = `./test-results/${safeName}-${Date.now()}.json`;
  await fs.writeFile(file, JSON.stringify(data, null, 2));
}

Why this works: AI needs context beyond pass/fail. Network timing, retry counts, and error patterns reveal non-determinism.

Expected: JSON files in ./test-results/ after each test run.
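
To sanity-check the instrumentation, a short script can load the stored records and confirm the pass/fail split (the field names match the testRun object above):

```python
import json
from pathlib import Path

def summarize_results(results_dir: str) -> dict:
    """Count stored test records and their pass/fail split."""
    records = [
        json.loads(p.read_text())
        for p in Path(results_dir).glob("*.json")
    ]
    return {
        "total": len(records),
        "passed": sum(r["passed"] for r in records),
        "failed": sum(not r["passed"] for r in records),
    }
```

If `total` stays at zero after a run, the instrumentation wrapper isn't being invoked.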


Step 2: Score Tests for Flakiness

Use a simple heuristic scoring model to identify patterns that predict flakiness. Once you have enough labeled history, the hand-tuned weights below can be swapped for a trained classifier.

# flaky_detector.py
import json
import glob
from collections import defaultdict
import numpy as np

def analyze_test_patterns(test_results_dir: str):
    """
    Detects flaky tests by analyzing failure patterns
    """
    tests_data = defaultdict(list)
    
    # Load all test results
    for file in glob.glob(f"{test_results_dir}/*.json"):
        with open(file) as f:
            result = json.load(f)
            tests_data[result['name']].append(result)
    
    flaky_tests = []
    
    for test_name, runs in tests_data.items():
        if len(runs) < 5:
            continue  # Need multiple runs for pattern detection
        
        features = extract_features(runs)
        flaky_score = calculate_flaky_score(features)
        
        if flaky_score > 0.3:  # Threshold for "likely flaky"
            flaky_tests.append({
                'name': test_name,
                'score': flaky_score,
                'patterns': features,
                'recommendation': generate_fix_recommendation(features)
            })
    
    return sorted(flaky_tests, key=lambda x: x['score'], reverse=True)

def extract_features(runs: list) -> dict:
    """
    Extract signals that indicate flakiness
    """
    passed = [r for r in runs if r['passed']]
    failed = [r for r in runs if not r['passed']]
    
    return {
        # Core flakiness signal: inconsistent outcomes
        'failure_rate': len(failed) / len(runs),
        'pass_fail_ratio': len(passed) / max(len(failed), 1),
        
        # Timing signals (np.std = standard deviation of durations, in ms)
        'duration_variance': np.std([r['duration'] for r in runs]),
        'has_slow_outliers': max(r['duration'] for r in runs)
                             > np.mean([r['duration'] for r in runs]) * 2,
        
        # Network signals
        'network_variance': calculate_network_variance(runs),
        'has_timeout_errors': any('timeout' in str(r.get('error', '')) 
                                  for r in failed),
        
        # Retry signals
        'passes_on_retry': any(r['attempt'] > 0 and r['passed'] 
                               for r in runs),
        
        # Selector signals (Playwright/Selenium)
        'has_element_not_found': any('could not find' in 
                                     str(r.get('error', '')).lower() 
                                     for r in failed),
        
        # Environmental signals
        'fails_only_in_ci': detect_ci_only_failures(runs)
    }

def calculate_flaky_score(features: dict) -> float:
    """
    Weighted scoring: higher = more likely flaky
    """
    score = 0.0
    
    # Strong signals
    if 0.1 < features['failure_rate'] < 0.9:
        score += 0.4  # Fails sometimes, not always = flaky
    
    if features['passes_on_retry']:
        score += 0.3  # Passes on retry is classic flakiness
    
    # Moderate signals
    if features['duration_variance'] > 1000:  # ms
        score += 0.15  # Inconsistent timing
    
    if features['network_variance'] > 0.5:
        score += 0.15  # Network-dependent
    
    # Weak but relevant signals
    if features['has_timeout_errors']:
        score += 0.1
    
    if features['has_element_not_found']:
        score += 0.1
    
    return min(score, 1.0)

def calculate_network_variance(runs: list) -> float:
    """
    Measures variability in network call timing
    """
    network_counts = [len(r.get('networkCalls', [])) for r in runs]
    if not network_counts:
        return 0.0
    return np.std(network_counts) / max(np.mean(network_counts), 1)

def detect_ci_only_failures(runs: list) -> bool:
    """
    Checks if failures only happen in CI environment
    Requires CI runs to have 'ci': true in metadata
    """
    ci_runs = [r for r in runs if r.get('ci', False)]
    local_runs = [r for r in runs if not r.get('ci', False)]
    
    if not ci_runs or not local_runs:
        return False
    
    ci_failure_rate = sum(1 for r in ci_runs if not r['passed']) / len(ci_runs)
    local_failure_rate = sum(1 for r in local_runs if not r['passed']) / len(local_runs)
    
    return ci_failure_rate > 0.3 and local_failure_rate < 0.1

def generate_fix_recommendation(features: dict) -> str:
    """
    AI-suggested fix based on pattern analysis
    """
    recommendations = []
    
    if features['has_timeout_errors'] or features['duration_variance'] > 1000:
        recommendations.append("Add explicit waits for async operations")
    
    if features['has_element_not_found']:
        recommendations.append("Use more stable selectors (data-testid > CSS class)")
    
    if features['network_variance'] > 0.5:
        recommendations.append("Mock network calls or add retry logic")
    
    if features['fails_only_in_ci']:
        recommendations.append("Check for timing differences in CI (slower CPU)")
    
    if features['passes_on_retry']:
        recommendations.append("Likely race condition - review async/await usage")
    
    return " | ".join(recommendations) if recommendations else "Manual review needed"

# Run the analysis
if __name__ == "__main__":
    import sys

    # Allow the results directory to be passed on the command line
    results_dir = sys.argv[1] if len(sys.argv) > 1 else "./test-results"
    results = analyze_test_patterns(results_dir)
    
    print(f"\n🔍 Found {len(results)} flaky tests\n")
    print("-" * 80)
    
    for test in results[:10]:  # Top 10 flaky tests
        print(f"\n📊 {test['name']}")
        print(f"   Flaky Score: {test['score']:.2f}")
        print(f"   Fix: {test['recommendation']}")
        print(f"   Failure Rate: {test['patterns']['failure_rate']:.1%}")

Why this works: Pattern recognition beats blind retry logic. The scorer encodes the signals that define flakiness (inconsistent outcomes, retry success, timing variance) and ranks your tests by how strongly they show them.

Expected: Ranked list of flaky tests with specific fix recommendations.

If it fails:

  • Not enough data: Run your suite 10+ times to collect patterns
  • Score too low: Adjust thresholds in calculate_flaky_score() for your sensitivity
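
If you already know a few tests that are flaky, you can pick the threshold empirically instead of guessing: sweep candidate cutoffs over the scored output of analyze_test_patterns and see how many known offenders each one catches. A sketch (the threshold values are illustrative):

```python
def threshold_recall(scored: list[dict], known_flaky: set[str],
                     thresholds=(0.2, 0.3, 0.4, 0.5)) -> dict:
    """For each candidate threshold, report what fraction of the
    known-flaky tests would be flagged (recall)."""
    recall = {}
    for t in thresholds:
        flagged = {s["name"] for s in scored if s["score"] >= t}
        recall[t] = len(flagged & known_flaky) / len(known_flaky)
    return recall
```

Pick the highest threshold that still catches most of your known offenders; that keeps false positives down without missing real flakiness.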

Step 3: Automate Detection in CI

Integrate the analysis into your CI pipeline to catch new flaky tests before merge.

# .github/workflows/test.yml
name: Test with Flaky Detection

on: [pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: 20

      - name: Install dependencies
        run: npm ci

      - name: Run tests (3 times for pattern detection)
        run: |
          for i in {1..3}; do
            npm test || true  # instrumentation writes JSON to ./test-results
          done

      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Analyze flakiness
        run: |
          pip install numpy
          python flaky_detector.py ./test-results > flaky-report.txt

          # Fail if new flaky tests were introduced (score 0.50 or higher)
          if grep -qE "Flaky Score: (0\.[5-9]|1\.00)" flaky-report.txt; then
            echo "⚠️  High-confidence flaky tests detected"
            cat flaky-report.txt
            exit 1
          fi
      
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: flaky-analysis
          path: flaky-report.txt

Why this works: Catches flakiness at PR time, before it pollutes main branch.
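
Grepping a human-readable report is brittle: if the report format changes, the gate silently stops firing. A sturdier option is a small script that gates on the scores directly. A sketch, assuming it receives the list returned by analyze_test_patterns (the function name `gate` and the 0.5 cutoff are illustrative):

```python
import sys

def gate(flaky_results: list[dict], cutoff: float = 0.5) -> int:
    """Return a CI exit code: 1 if any test's score meets the cutoff."""
    offenders = [r for r in flaky_results if r["score"] >= cutoff]
    for r in offenders:
        print(f"FLAKY {r['name']}: {r['score']:.2f}", file=sys.stderr)
    return 1 if offenders else 0
```

Call it at the end of the analysis step with `sys.exit(gate(results))` so the job fails exactly when a high-confidence flaky test appears.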


Step 4: Fix the Top Flaky Tests

Apply the AI recommendations. Here's how to fix the most common patterns:

// Before: Flaky due to timing assumption
test('loads user dashboard', async ({ page }) => {
  await page.goto('/dashboard');
  const name = await page.locator('.user-name').textContent();
  expect(name).toBe('John Doe');  // Fails if API is slow
});

// After: Wait for specific state
test('loads user dashboard', async ({ page }) => {
  await page.goto('/dashboard');
  
  // Wait for API response, not just page load
  await page.waitForResponse(response => 
    response.url().includes('/api/user') && response.status() === 200
  );
  
  const name = await page.locator('.user-name').textContent();
  expect(name).toBe('John Doe');
});

// Before: Flaky selector (CSS classes change)
await page.click('.btn-primary.submit-form');

// After: Stable selector
await page.click('[data-testid="submit-button"]');

// Before: Race condition with animation
await page.click('#modal-trigger');
await page.click('#modal-confirm');  // Fails if modal animates in

// After: Wait for element to be actionable
await page.click('#modal-trigger');
await page.waitForSelector('#modal-confirm', { state: 'visible' });
await page.click('#modal-confirm');

Verification

Run your test suite 10 times and compare flaky scores:

# Before fixes
python flaky_detector.py ./test-results-before
# Output: 23 flaky tests, avg score: 0.52

# After fixes
python flaky_detector.py ./test-results-after
# Output: 4 flaky tests, avg score: 0.31

You should see: 70-90% reduction in flaky test count, lower average scores.
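
The before/after comparison can be scripted as well. A sketch that summarizes two analysis runs, assuming both lists come from analyze_test_patterns:

```python
def compare_runs(before: list[dict], after: list[dict]) -> dict:
    """Summarize the change in flaky-test count and average score."""
    def avg(results):
        return sum(r["score"] for r in results) / len(results) if results else 0.0

    reduction = (len(before) - len(after)) / len(before) if before else 0.0
    return {
        "before_count": len(before),
        "after_count": len(after),
        "count_reduction": reduction,
        "avg_score_before": avg(before),
        "avg_score_after": avg(after),
    }
```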


What You Learned

  • Flaky tests have detectable patterns (timing variance, retry success, selector issues)
  • AI pattern analysis is more effective than manual debugging
  • Fix high-score tests first—80/20 rule applies

Limitations:

  • Requires 5-10 runs per test to establish patterns
  • Won't detect all flakiness (e.g., rare hardware-dependent failures)
  • False positives possible with legitimately slow tests

When NOT to use:

  • Unit tests (should be deterministic by design)
  • Tests that intentionally test random behavior
  • One-off test failures (pattern detection needs history)

Real-World Impact

Case study: A team at a fintech company reduced their CI flaky test rate from 28% to 3% in 6 weeks using this approach:

  • Week 1-2: Instrumented tests, collected 200+ runs of data
  • Week 3: Ran analysis, identified 47 flaky tests
  • Week 4-6: Fixed top 15 tests (covered 80% of flaky failures)

Their CI pipeline went from 40% false positives to 5%, saving ~12 developer hours/week.


Tested with Playwright 1.42, Pytest 8.x, Python 3.11, TypeScript 5.5 on Ubuntu & macOS