Problem: Your Model Works in Dev, Fails in Production
Your ML model had 95% accuracy last month. Today it's at 78% and you only noticed when users complained. Production data shifted, but you had no way to detect it.
You'll learn how to:
- Set up Evidently AI for drift detection
- Monitor feature and prediction drift automatically
- Build alerts before accuracy degrades
Time: 15 min | Level: Intermediate
Why This Happens
Production data changes over time (seasonality, user behavior, market shifts). Your model trained on historical data becomes outdated, but traditional metrics don't catch drift until performance tanks.
Common symptoms:
- Accuracy drops with no code changes
- Predictions skew toward one class
- Feature distributions shift silently
- Alerts trigger only after user impact
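The "silent shift" above is exactly what a two-sample statistical test surfaces. As a sketch on synthetic data (column name and numbers are illustrative), a plain Kolmogorov-Smirnov test from scipy flags the shift long before accuracy metrics would:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
train_ages = rng.normal(loc=35, scale=8, size=5000)  # reference (training) distribution
prod_ages = rng.normal(loc=42, scale=8, size=5000)   # shifted production distribution

# Two-sample KS test: a small p-value means the distributions differ (drift)
statistic, p_value = stats.ks_2samp(train_ages, prod_ages)
print(f"KS statistic={statistic:.3f}, p-value={p_value:.2e}")
print("drift" if p_value < 0.05 else "no drift")
```

Evidently runs tests like this per feature automatically, which is what the rest of this guide sets up.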
Solution
Step 1: Install Evidently AI
```
pip install evidently
```
(Install inside a virtual environment rather than forcing it onto the system Python with --break-system-packages.)
Expected: Version 0.4.x or higher (supports production monitoring)
Step 2: Create Your First Drift Report
```python
# drift_detector.py
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
import pandas as pd

# Load your reference data (training set)
reference_data = pd.read_csv('train_data.csv')

# Load current production data
current_data = pd.read_csv('production_week_6.csv')

# Generate drift report
report = Report(metrics=[
    DataDriftPreset(),
])
report.run(
    reference_data=reference_data,
    current_data=current_data,
    column_mapping=None,  # Auto-detect feature types
)

# Save as interactive HTML
report.save_html('drift_report.html')
```
Why this works: Evidently compares statistical distributions between reference (training) and current (production) data using multiple drift detection algorithms.
If it fails:
- Error: "Column mismatch": Ensure both datasets have identical column names
- Empty report: Check that current_data has >100 rows for statistical significance
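A quick fix for the column-mismatch error is to align the production frame to the reference schema before running the report. This is a sketch (the helper name and toy columns are mine); it assumes any production-only columns are safe to drop from the comparison:

```python
import pandas as pd

def align_columns(reference_data: pd.DataFrame, current_data: pd.DataFrame) -> pd.DataFrame:
    """Keep only the columns both frames share, in the reference order."""
    shared = [col for col in reference_data.columns if col in current_data.columns]
    dropped = sorted(set(current_data.columns) - set(shared))
    if dropped:
        print(f"Ignoring production-only columns: {dropped}")
    return current_data[shared]

# Demo with toy frames (in practice: reference_data, current_data from Step 2)
ref = pd.DataFrame({'age': [35], 'amount': [120.0]})
cur = pd.DataFrame({'amount': [80.0], 'age': [51], 'debug_flag': [1]})
aligned = align_columns(ref, cur)
print(list(aligned.columns))  # ['age', 'amount']
```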
Step 3: Interpret Drift Metrics
Open drift_report.html in your browser. Key sections:
Dataset Drift:
- ✅ No drift detected: share of drifted features is below the drift_share threshold (50% by default)
- ⚠️ Drift detected: ≥50% of features drifted
- 🚨 Critical drift: the prediction column itself drifted
Per-Feature Drift:
```python
# Programmatic access to drift metrics
drift_info = report.as_dict()
for metric in drift_info['metrics']:
    result = metric.get('result', {})
    # drift_by_columns maps each column name to its per-feature drift result
    for name, info in result.get('drift_by_columns', {}).items():
        if info['drift_detected']:
            print(f"⚠️ {name}: drift score {info['drift_score']:.3f}")
```
Expected output:
```
⚠️ customer_age: drift score 0.842
⚠️ transaction_amount: drift score 0.651
```
Step 4: Set Up Continuous Monitoring
```python
# monitor.py
import time

import pandas as pd
import schedule
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset
from evidently.ui.workspace import Workspace

# Create workspace for production monitoring
ws = Workspace.create("production_monitoring")
project = ws.create_project("fraud_detection_model")
project.description = "Monitors drift for fraud detection in prod"
project.save()

reference_data = pd.read_csv('train_data.csv')

def check_drift():
    """Run every hour in production"""
    current_data = fetch_latest_predictions()  # Your data pipeline
    report = Report(metrics=[
        DataDriftPreset(),
        DataQualityPreset(),  # Catches missing values, type changes
    ])
    report.run(
        reference_data=reference_data,
        current_data=current_data,
    )
    # Save to workspace for historical tracking
    ws.add_report(project.id, report)

    # Alert on drift (DatasetDriftMetric is the first metric in the preset)
    drift_result = report.as_dict()['metrics'][0]['result']
    if drift_result['dataset_drift']:
        send_alert(f"🚨 Data drift detected: {drift_result['number_of_drifted_columns']} features")  # Your alerting hook

# Run every hour
schedule.every().hour.do(check_drift)
while True:
    schedule.run_pending()
    time.sleep(60)
```
Why hourly: Catches drift early without overwhelming compute. Adjust based on data volume.
Step 5: Configure Drift Detection Method
```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Customize for your data type via preset parameters (Evidently 0.4.x API)
report = Report(metrics=[
    DataDriftPreset(
        # For continuous features: Kolmogorov-Smirnov test
        num_stattest="ks",
        num_stattest_threshold=0.05,  # p-value < 0.05 = drift
        # For categorical features: Chi-squared test
        cat_stattest="chisquare",
        cat_stattest_threshold=0.05,
        # Dataset drifts if >50% of features drift
        drift_share=0.5,
    ),
])
```
When to adjust thresholds (drift is flagged when the test's p-value falls below the threshold):
- Higher threshold (0.1): flags drift on weaker evidence; for high-stakes models (medical, financial) where missing drift is costly
- Lower threshold (0.01): requires stronger evidence; for noisy data where you tolerate more variation
- Custom tests: time-series data needs different stat tests (autocorrelation breaks the i.i.d. assumption)
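On very large or noisy samples, p-value tests can flag statistically significant but practically trivial shifts. A common alternative statistic is the Population Stability Index (PSI); here is a self-contained numpy sketch (the 0.1/0.25 cut-offs are the conventional rule of thumb, not Evidently defaults):

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two samples of one numeric feature."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Clip so empty bins don't produce log(0) or division by zero
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(42)
stable = psi(rng.normal(0, 1, 10_000), rng.normal(0, 1, 10_000))
shifted = psi(rng.normal(0, 1, 10_000), rng.normal(1, 1, 10_000))
print(f"stable: {stable:.3f}, shifted: {shifted:.3f}")
# Rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant shift
```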
Step 6: Monitor Prediction Drift
```python
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import TargetDriftPreset

# Monitors the target/prediction distribution. Works on predictions alone;
# ground-truth labels are only needed to monitor the target column.
report = Report(metrics=[
    TargetDriftPreset(),
])
report.run(
    reference_data=train_data,
    current_data=prod_data,
    column_mapping=ColumnMapping(target='is_fraud', prediction='fraud_score'),
)
```
Critical insight: Prediction drift often precedes accuracy drop by days/weeks. Catch it here.
Verification
Test your setup:
```
python drift_detector.py
```
You should see:
- drift_report.html generated (open in browser)
- Clear visualization of drifted features
- Dataset-level drift verdict
Smoke test with synthetic drift:
```python
# Force drift for testing
test_data = reference_data.copy()
test_data['feature_1'] = test_data['feature_1'] * 2  # Artificial drift
report.run(reference_data=reference_data, current_data=test_data)
# Should detect drift in feature_1
```
What You Learned
- Evidently AI compares production vs training data distributions
- Statistical tests (KS, Chi-squared) quantify drift automatically
- Monitor continuously, alert before accuracy degrades
Limitations:
- Requires labeled data for target drift (prediction drift works without)
- Statistical tests need >100 samples for reliability
- Doesn't explain why drift occurred, just that it did
When NOT to use this:
- Natural drift is expected (e.g., seasonal models)
- Data volume too low (<50 samples/day)
- Cost of false alerts exceeds drift risk
Production Deployment Tips
Integrate with Existing Stack
```python
# Export to Prometheus for Grafana dashboards
from prometheus_client import Gauge, start_http_server

drift_gauge = Gauge('model_drift_score', 'Current drift score', ['feature'])
start_http_server(8000)  # Prometheus scrapes http://host:8000/metrics

# drift_scores: {column_name: drift_score}, extracted from report.as_dict() (see Step 3)
for feature, score in drift_scores.items():
    drift_gauge.labels(feature=feature).set(score)
```
Storage Optimization
```python
# For high-throughput models, sample production data
current_data = prod_data.sample(n=10000, random_state=42)
```
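A global sample can under-represent quiet periods. One sketch of a fairer scheme is to cap rows per calendar day instead (assumes your production frame has a datetime `timestamp` column; adjust to your schema):

```python
import pandas as pd

def sample_per_day(prod_data: pd.DataFrame, per_day: int = 500) -> pd.DataFrame:
    """Cap rows per calendar day so quiet days stay represented in the sample."""
    return (
        prod_data.groupby(prod_data['timestamp'].dt.date, group_keys=False)
        .apply(lambda day: day.sample(n=min(len(day), per_day), random_state=42))
    )

# Demo: 1,000 rows on a busy day, 40 on a quiet one
ts = pd.to_datetime(['2024-06-01'] * 1000 + ['2024-06-02'] * 40)
demo = pd.DataFrame({'timestamp': ts, 'amount': range(1040)})
sampled = sample_per_day(demo, per_day=500)
print(len(sampled))  # 540: 500 from the busy day, all 40 from the quiet one
```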
Alert Thresholds
- Warning (30% drift): Review in next sprint
- Critical (50% drift): Investigate within 24h
- Emergency (prediction drift): Page on-call engineer
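The escalation rules above can be sketched as a small helper (a hypothetical function of mine encoding those thresholds; wire it into check_drift in place of a bare send_alert call):

```python
def drift_severity(drifted_share: float, prediction_drifted: bool) -> str:
    """Map the report's drift measurements onto the escalation levels above."""
    if prediction_drifted:
        return "emergency"   # page on-call engineer
    if drifted_share >= 0.5:
        return "critical"    # investigate within 24h
    if drifted_share >= 0.3:
        return "warning"     # review in next sprint
    return "ok"

print(drift_severity(0.35, False))  # warning
print(drift_severity(0.20, True))  # emergency
```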
Tested on Evidently 0.4.25, Python 3.11, with 500K+ production predictions