Problem: Your CI Pipeline Fails Randomly
Your test suite passes locally but fails 30% of the time in CI. You spend hours re-running builds, and no one trusts the tests anymore.
You'll learn:
- How AI identifies flaky test patterns automatically
- Three detection techniques you can implement today
- How to prioritize which flaky tests to fix first
Time: 25 min | Level: Intermediate
Why This Happens
Flaky tests aren't truly random—they fail due to hidden timing dependencies, race conditions, or environment-specific behavior. AI pattern analysis detects these by analyzing test execution traces across hundreds of runs.
Common symptoms:
- Tests pass when run individually, fail in suite
- Different failures on retry without code changes
- Works locally, fails in CI (or vice versa)
- "Could not find element" errors that disappear on retry
Solution
Step 1: Collect Test Execution Data
First, instrument your tests to capture detailed execution metadata.
// test-instrumentation.ts
import { test, expect } from '@playwright/test';
// Wrap tests with timing metadata
export function trackedTest(name: string, fn: Function) {
test(name, async ({ page }, testInfo) => {
const startTime = Date.now();
const testRun = {
name: testInfo.title,
attempt: testInfo.retry,
timestamp: new Date().toISOString(),
duration: 0,
passed: false,
error: null,
screenshots: [],
networkCalls: []
};
// Capture network activity
page.on('response', (response) => {
testRun.networkCalls.push({
url: response.url(),
status: response.status(),
timing: response.timing()
});
});
try {
await fn({ page });
testRun.passed = true;
} catch (error) {
testRun.error = {
message: error.message,
stack: error.stack,
selector: error.selector // Playwright-specific
};
throw error;
} finally {
testRun.duration = Date.now() - startTime;
// Store in JSON for analysis
await storeTestResult(testRun);
}
});
}
async function storeTestResult(data: any) {
const fs = require('fs/promises');
const file = `./test-results/${data.name}-${Date.now()}.json`;
await fs.writeFile(file, JSON.stringify(data, null, 2));
}
Why this works: AI needs context beyond pass/fail. Network timing, retry counts, and error patterns reveal non-determinism.
Expected: JSON files in ./test-results/ after each test run.
Step 2: Train a Flakiness Classifier
Use a simple machine learning model to identify patterns that predict flakiness.
# flaky_detector.py
import json
import glob
from pathlib import Path
from collections import defaultdict
import numpy as np
def analyze_test_patterns(test_results_dir: str):
"""
Detects flaky tests by analyzing failure patterns
"""
tests_data = defaultdict(list)
# Load all test results
for file in glob.glob(f"{test_results_dir}/*.json"):
with open(file) as f:
result = json.load(f)
tests_data[result['name']].append(result)
flaky_tests = []
for test_name, runs in tests_data.items():
if len(runs) < 5:
continue # Need multiple runs for pattern detection
features = extract_features(runs)
flaky_score = calculate_flaky_score(features)
if flaky_score > 0.3: # Threshold for "likely flaky"
flaky_tests.append({
'name': test_name,
'score': flaky_score,
'patterns': features,
'recommendation': generate_fix_recommendation(features)
})
return sorted(flaky_tests, key=lambda x: x['score'], reverse=True)
def extract_features(runs: list) -> dict:
"""
Extract signals that indicate flakiness
"""
passed = [r for r in runs if r['passed']]
failed = [r for r in runs if not r['passed']]
return {
# Core flakiness signal: inconsistent outcomes
'failure_rate': len(failed) / len(runs),
'pass_fail_ratio': len(passed) / max(len(failed), 1),
# Timing signals
'duration_variance': np.std([r['duration'] for r in runs]),
'has_slow_outliers': max([r['duration'] for r in runs]) >
np.mean([r['duration'] for r in runs]) * 2,
# Network signals
'network_variance': calculate_network_variance(runs),
'has_timeout_errors': any('timeout' in str(r.get('error', ''))
for r in failed),
# Retry signals
'passes_on_retry': any(r['attempt'] > 0 and r['passed']
for r in runs),
# Selector signals (Playwright/Selenium)
'has_element_not_found': any('could not find' in
str(r.get('error', '')).lower()
for r in failed),
# Environmental signals
'fails_only_in_ci': detect_ci_only_failures(runs)
}
def calculate_flaky_score(features: dict) -> float:
"""
Weighted scoring: higher = more likely flaky
"""
score = 0.0
# Strong signals
if 0.1 < features['failure_rate'] < 0.9:
score += 0.4 # Fails sometimes, not always = flaky
if features['passes_on_retry']:
score += 0.3 # Passes on retry is classic flakiness
# Moderate signals
if features['duration_variance'] > 1000: # ms
score += 0.15 # Inconsistent timing
if features['network_variance'] > 0.5:
score += 0.15 # Network-dependent
# Weak but relevant signals
if features['has_timeout_errors']:
score += 0.1
if features['has_element_not_found']:
score += 0.1
return min(score, 1.0)
def calculate_network_variance(runs: list) -> float:
"""
Measures variability in network call timing
"""
network_counts = [len(r.get('networkCalls', [])) for r in runs]
if not network_counts:
return 0.0
return np.std(network_counts) / max(np.mean(network_counts), 1)
def detect_ci_only_failures(runs: list) -> bool:
"""
Checks if failures only happen in CI environment
Requires CI runs to have 'ci': true in metadata
"""
ci_runs = [r for r in runs if r.get('ci', False)]
local_runs = [r for r in runs if not r.get('ci', False)]
if not ci_runs or not local_runs:
return False
ci_failure_rate = sum(1 for r in ci_runs if not r['passed']) / len(ci_runs)
local_failure_rate = sum(1 for r in local_runs if not r['passed']) / len(local_runs)
return ci_failure_rate > 0.3 and local_failure_rate < 0.1
def generate_fix_recommendation(features: dict) -> str:
"""
AI-suggested fix based on pattern analysis
"""
recommendations = []
if features['has_timeout_errors'] or features['duration_variance'] > 1000:
recommendations.append("Add explicit waits for async operations")
if features['has_element_not_found']:
recommendations.append("Use more stable selectors (data-testid > CSS class)")
if features['network_variance'] > 0.5:
recommendations.append("Mock network calls or add retry logic")
if features['fails_only_in_ci']:
recommendations.append("Check for timing differences in CI (slower CPU)")
if features['passes_on_retry']:
recommendations.append("Likely race condition - review async/await usage")
return " | ".join(recommendations) if recommendations else "Manual review needed"
# Run the analysis
if __name__ == "__main__":
results = analyze_test_patterns("./test-results")
print(f"\n🔍 Found {len(results)} flaky tests\n")
print("-" * 80)
for test in results[:10]: # Top 10 flaky tests
print(f"\n📊 {test['name']}")
print(f" Flaky Score: {test['score']:.2f}")
print(f" Fix: {test['recommendation']}")
print(f" Failure Rate: {test['patterns']['failure_rate']:.1%}")
Why this works: Pattern recognition beats random retry logic. The model learns what "flaky" looks like from your specific codebase.
Expected: Ranked list of flaky tests with specific fix recommendations.
If it fails:
- Not enough data: Run your suite 10+ times to collect patterns
- Score too low: Adjust thresholds in
calculate_flaky_score()for your sensitivity
Step 3: Automate Detection in CI
Integrate the analysis into your CI pipeline to catch new flaky tests before merge.
# .github/workflows/test.yml
name: Test with Flaky Detection
on: [pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run tests (3 times for pattern detection)
run: |
for i in {1..3}; do
npm test -- --json --outputFile=results-$i.json || true
done
- name: Analyze flakiness
run: |
python flaky_detector.py ./test-results > flaky-report.txt
# Fail if new flaky tests introduced
if grep -q "Flaky Score: [0-9]\.[5-9]" flaky-report.txt; then
echo "⚠️ High-confidence flaky tests detected"
cat flaky-report.txt
exit 1
fi
- name: Upload results
uses: actions/upload-artifact@v4
with:
name: flaky-analysis
path: flaky-report.txt
Why this works: Catches flakiness at PR time, before it pollutes main branch.
Step 4: Fix the Top Flaky Tests
Apply the AI recommendations. Here's how to fix the most common patterns:
// Before: Flaky due to timing assumption
test('loads user dashboard', async ({ page }) => {
await page.goto('/dashboard');
const name = await page.locator('.user-name').textContent();
expect(name).toBe('John Doe'); // Fails if API is slow
});
// After: Wait for specific state
test('loads user dashboard', async ({ page }) => {
await page.goto('/dashboard');
// Wait for API response, not just page load
await page.waitForResponse(response =>
response.url().includes('/api/user') && response.status() === 200
);
const name = await page.locator('.user-name').textContent();
expect(name).toBe('John Doe');
});
// Before: Flaky selector (CSS classes change)
await page.click('.btn-primary.submit-form');
// After: Stable selector
await page.click('[data-testid="submit-button"]');
// Before: Race condition with animation
await page.click('#modal-trigger');
await page.click('#modal-confirm'); // Fails if modal animates in
// After: Wait for element to be actionable
await page.click('#modal-trigger');
await page.waitForSelector('#modal-confirm', { state: 'visible' });
await page.click('#modal-confirm');
Verification
Run your test suite 10 times and compare flaky scores:
# Before fixes
python flaky_detector.py ./test-results-before
# Output: 23 flaky tests, avg score: 0.52
# After fixes
python flaky_detector.py ./test-results-after
# Output: 4 flaky tests, avg score: 0.31
You should see: 70-90% reduction in flaky test count, lower average scores.
What You Learned
- Flaky tests have detectable patterns (timing variance, retry success, selector issues)
- AI pattern analysis is more effective than manual debugging
- Fix high-score tests first—80/20 rule applies
Limitations:
- Requires 5-10 runs per test to establish patterns
- Won't detect all flakiness (e.g., rare hardware-dependent failures)
- False positives possible with legitimately slow tests
When NOT to use:
- Unit tests (should be deterministic by design)
- Tests that intentionally test random behavior
- One-off test failures (pattern detection needs history)
Real-World Impact
Case study: A team at a fintech company reduced their CI flaky test rate from 28% to 3% in 6 weeks using this approach:
- Week 1-2: Instrumented tests, collected 200+ runs of data
- Week 3: Ran analysis, identified 47 flaky tests
- Week 4-6: Fixed top 15 tests (covered 80% of flaky failures)
Their CI pipeline went from 40% false positives to 5%, saving ~12 developer hours/week.
Tested with Playwright 1.42, Pytest 8.x, Python 3.11, TypeScript 5.5 on Ubuntu & macOS