Detecting ML Model Drift Before Your Users Do: Evidently, Data Checks, and Automated Retraining

Set up a production ML monitoring system with Evidently — detecting data drift, prediction drift, and model degradation — with automated alerts and a retraining trigger when performance drops below threshold.

Your model was accurate in January. By March, accuracy has quietly dropped 15 points, and nobody noticed because nobody was monitoring it. You deployed your gradient booster, patted yourself on the back, and moved on to the next Jira ticket. Meanwhile, the real world—a chaotic mess of changing user behavior, economic shifts, and silent data pipeline bugs—has been steadily dismantling your hard-earned performance metrics. Your model is now confidently wrong, and your users are slowly losing faith. This isn't a hypothetical; it's the default outcome for any model left unsupervised.

Monitoring isn't about dashboards; it's about survival. Let's build a system that spots drift before your users do, triggers alerts that someone actually reads, and kicks off automated retraining without you lifting a finger. We'll use tools that don't require a PhD in statistics to operate, like Evidently AI, and bake the checks into a pipeline you can run from a cron job or a GitHub Action.

Data Drift, Concept Drift, and Label Drift: What’s Actually Breaking?

First, stop saying "drift" like it's one thing. You need to know which specific flavor of failure you're dealing with, because the fix for each is different.

  • Data Drift (Covariate Shift): The input data's statistical properties change. The distribution of age, transaction_amount, or click_rate in production today looks different from the data you trained on. This is the most common and easiest to detect because you don't need ground truth labels—you just compare old and new features. Think: a marketing campaign suddenly attracts a younger demographic, skewing your user_age feature.
  • Concept Drift: The relationship between your features (X) and the target (y) changes. The data distribution might look the same, but the underlying pattern has shifted. Your model's assumptions are now invalid. This requires ground truth labels to detect, which is why it often goes unnoticed. Example: During an economic recession, the same income and credit_score features now correlate differently with loan_default. Your old model is blind to the new reality.
  • Label Drift: The distribution of the target variable itself changes. If you're predicting churn, the overall churn rate in the population might increase from 5% to 15%. This can cause performance metrics to drop even if your model's relative ranking of customers is still perfect.
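
Concept drift's invisibility to feature-only checks is easy to demonstrate. Here's a minimal synthetic sketch (a toy setup of our own, not from any real system): the feature distribution is identical in both periods, so a KS test sees nothing, but the feature-target relationship flips and accuracy collapses.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Reference period: y depends positively on the feature
X_ref = rng.normal(size=(5000, 1))
y_ref = (X_ref[:, 0] + rng.normal(scale=0.5, size=5000) > 0).astype(int)

# Current period: identical feature distribution, but the relationship flips
X_cur = rng.normal(size=(5000, 1))
y_cur = (-X_cur[:, 0] + rng.normal(scale=0.5, size=5000) > 0).astype(int)

model = LogisticRegression().fit(X_ref, y_ref)

# A feature-only drift test sees nothing: the inputs look the same
ks_pvalue = ks_2samp(X_ref[:, 0], X_cur[:, 0]).pvalue

# But accuracy collapses, because P(y|X) changed underneath the model
acc_ref = model.score(X_ref, y_ref)
acc_cur = model.score(X_cur, y_cur)
print(f"KS p-value on the feature: {ks_pvalue:.2f}")
print(f"Accuracy, reference: {acc_ref:.2f} | current: {acc_cur:.2f}")
```

This is exactly why concept drift needs ground truth (or at least a prediction-distribution proxy) to catch: the inputs alone tell you nothing.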

Which to monitor? All of them, but start with Data Drift because you can do it immediately, without waiting for labels. Use PSI (we'll get to it) on your key features. For Concept Drift, you need a process to capture ground truth, even if it's delayed (e.g., user feedback loops, monthly label reconciliation). Monitor prediction distributions in the meantime as a proxy.

Generating the "Oh Crap" Report with Evidently

Evidently AI turns statistical tests into readable HTML reports. It’s the quickest way to go from "hmm, something feels off" to "here are the 3 features that have statistically significant drift."

Let's generate a data drift report. We'll assume you have a reference dataset (your clean, curated training data) and a current production dataset.

import pandas as pd
from sklearn.datasets import fetch_california_housing
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset


# In reality, you'd load these from your data warehouse or model registry
data = fetch_california_housing(as_frame=True)
reference_data = data.frame.sample(n=5000, random_state=42)
current_data = data.frame.sample(n=5000, random_state=99)
# Simulate a drift: artificially inflate 'MedInc' in current data
current_data['MedInc'] = current_data['MedInc'] * 1.5

# Create and run the data drift report
data_drift_report = Report(metrics=[DataDriftPreset()])
data_drift_report.run(reference_data=reference_data, current_data=current_data)

# Save the HTML report. This is the file you'll automate and check daily.
data_drift_report.save_html("data_drift_report.html")

Open that HTML file. You'll get a clean dashboard showing:

  • A summary: "4 out of 8 features have drifted."
  • Per-feature details: MedInc will be flagged with a high drift score.
  • Visualizations: Distribution plots that make the shift obvious.

This is your first line of defense. Schedule this report to run daily on a sample of your production inferences.

Watching Predictions When You Have No Labels

Ground truth labels can take weeks to arrive. You can't wait that long. Instead, monitor your model's prediction distribution.

A sudden change in the mean of your predicted probabilities, or in the spread of your regression outputs, is a screaming red flag that often precedes a drop in accuracy. If a model that usually predicts an average churn probability of 2% suddenly starts outputting a 10% average, something has changed—either in the data (Data Drift) or in the world (Concept Drift).

With Evidently, you can track this using the DataDriftPreset on your model's output column in addition to your inputs.

PSI: The Single Metric That Doesn't Lie

Forget complex statistical tests for a moment. The Population Stability Index (PSI) is the workhorse metric of drift detection in finance and tech. It's simple, interpretable, and catches most real-world data drift.

  • PSI < 0.1: No significant drift. Sleep well.
  • PSI 0.1 – 0.25: Moderate drift. Investigate.
  • PSI > 0.25: Significant drift. Sound the alarms.

It works by binning a feature (or model score) in both your reference and current datasets and comparing the percentage of observations in each bin. Unlike a p-value—which flags ever-smaller shifts as your sample size grows—PSI measures the size of the shift itself, which is what you actually care about operationally. Here's how to calculate it:

import numpy as np
import pandas as pd

def calculate_psi(expected, actual, buckets=10):
    """Calculate PSI between a reference (expected) and current (actual) sample.

    Assumes a continuous feature; bucket edges come from the reference distribution.
    """
    # Breakpoints based on the expected (reference) distribution
    breakpoints = np.percentile(expected, np.linspace(0, 100, buckets + 1))
    breakpoints[0] = -np.inf   # Capture actual values below the reference minimum
    breakpoints[-1] = np.inf   # Capture actual values above the reference maximum

    # Fraction of observations per bucket in each sample
    expected_percents = np.histogram(expected, breakpoints)[0] / len(expected)
    actual_percents = np.histogram(actual, breakpoints)[0] / len(actual)

    # Replace zeros to avoid log(0) and division by zero
    expected_percents = np.clip(expected_percents, a_min=1e-10, a_max=None)
    actual_percents = np.clip(actual_percents, a_min=1e-10, a_max=None)

    # PSI is a symmetric sum over buckets of (diff in share) * log-ratio of shares
    psi = np.sum((expected_percents - actual_percents) * np.log(expected_percents / actual_percents))
    return psi

# Example: Check PSI for our drifted 'MedInc' feature
psi_value = calculate_psi(reference_data['MedInc'], current_data['MedInc'])
print(f"PSI for MedInc: {psi_value:.4f}")  # This will be high (>0.25)

PSI's advantage is its stability and its direct link to model performance impact. A high PSI on a feature your model actually relies on (check your SHAP values!) means trouble: the drift is hitting exactly the signal your model depends on.
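
Putting that into practice, here's a sketch that sweeps PSI across several features and flags the drifted ones. The synthetic data and feature names are placeholders for your training set and yesterday's inferences; the PSI function is the same calculation as above, condensed:

```python
import numpy as np
import pandas as pd

def psi(expected, actual, buckets=10):
    # Condensed version of the PSI calculation shown earlier
    edges = np.percentile(expected, np.linspace(0, 100, buckets + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e = np.clip(np.histogram(expected, edges)[0] / len(expected), 1e-10, None)
    a = np.clip(np.histogram(actual, edges)[0] / len(actual), 1e-10, None)
    return float(np.sum((e - a) * np.log(e / a)))

rng = np.random.default_rng(1)
# Synthetic stand-ins; in practice, load your reference set and production sample
reference = pd.DataFrame({"MedInc": rng.normal(5, 2, 5000),
                          "HouseAge": rng.normal(30, 10, 5000)})
current = pd.DataFrame({"MedInc": rng.normal(7, 2, 5000),      # drifted
                        "HouseAge": rng.normal(30, 10, 5000)})  # stable

# Sweep the features you care about (ideally your top features by SHAP)
psi_by_feature = {col: psi(reference[col], current[col]) for col in reference.columns}
flagged = [col for col, value in psi_by_feature.items() if value > 0.25]
print(psi_by_feature)
print(f"Features over threshold: {flagged}")
```

Only the shifted feature crosses the 0.25 alarm line; the stable one stays comfortably under 0.1.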

From Dashboard to PagerDuty: Automating Alerts

A report in a folder is useless. You need an alert that interrupts someone's day. Here's a pattern: calculate PSI for your top 5 features by SHAP importance. If any exceed a threshold for 2 consecutive days, send a Slack message and create a PagerDuty incident.

You can integrate this directly into your Evidently workflow by using its Python API to get metric values, not just HTML.

from evidently.report import Report
from evidently.metrics import DataDriftTable
import requests

# Build a drift table that uses PSI as the drift statistic for every column
drift_report = Report(metrics=[DataDriftTable(stattest="psi")])
drift_report.run(reference_data=reference_data, current_data=current_data)

# Pull the raw metric values out of the report as a plain dict
results = drift_report.as_dict()
feature_psi = results["metrics"][0]["result"]["drift_by_columns"]["MedInc"]["drift_score"]  # the PSI value

# Alerting Logic
SLACK_WEBHOOK_URL = "your_webhook_url"
if feature_psi > 0.25:
    message = {
        "text": f"🚨 CRITICAL DRIFT ALERT: Feature 'MedInc' PSI = {feature_psi:.3f}. Model performance may be degraded."
    }
    requests.post(SLACK_WEBHOOK_URL, json=message)
    # Additional logic to page the on-call engineer...
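
The "2 consecutive days" rule from above needs a little persisted state. Here's one minimal sketch using a local JSON file; the file path and the streak bookkeeping are our own conventions, and in production you'd keep this in a database or metrics store:

```python
import json
from datetime import date
from pathlib import Path

STATE_FILE = Path("drift_state.json")  # our own convention; use durable storage in prod
PSI_THRESHOLD = 0.25
CONSECUTIVE_DAYS_REQUIRED = 2

def should_alert(feature, psi_value, today=None):
    """Return True only once a feature has breached the threshold on
    this run AND the previously recorded day."""
    today = today or date.today().isoformat()
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    streak = state.get(feature, {"streak": 0, "last_day": None})

    if psi_value > PSI_THRESHOLD:
        # Don't double-count two runs on the same day
        if streak["last_day"] != today:
            streak = {"streak": streak["streak"] + 1, "last_day": today}
    else:
        streak = {"streak": 0, "last_day": today}  # any clean day resets the streak

    state[feature] = streak
    STATE_FILE.write_text(json.dumps(state))
    return streak["streak"] >= CONSECUTIVE_DAYS_REQUIRED

# Day 1 breach: no alert yet. Day 2 breach: the alert fires.
print(should_alert("MedInc", 0.31, today="2024-03-01"))
print(should_alert("MedInc", 0.29, today="2024-03-02"))
```

This suppresses one-off blips (a weird batch, a holiday) while still catching sustained drift within 48 hours.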

The Auto-Retrain Trigger: GitHub Actions on Drift

When high drift is confirmed, the system should prepare a new model candidate automatically. Don't retrain on all new data blindly—you might just learn the drift. The strategy is to retrain on a recent, high-quality slice of data and test it against the champion.

Here’s the skeleton of a GitHub Actions workflow (.github/workflows/retrain_on_drift.yml) that triggers on a high-PSI alert:

name: Retrain Model on Drift
on:
  workflow_dispatch: # Triggered manually or via API call from your alert system
    inputs:
      trigger_feature:
        description: 'Feature that tripped the drift alert'
        required: false
        default: 'unknown'
  schedule:
    - cron: '0 9 * * 1' # Also run every Monday at 9 AM as a safety check

jobs:
  retrain:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install scikit-learn xgboost evidently optuna pandas
      - name: Run Retraining Pipeline
        env:
          DRIFT_ALERT_FEATURE: ${{ github.event.inputs.trigger_feature }} # Passed from the dispatch call
        run: python scripts/retrain_pipeline.py --trigger-feature "$DRIFT_ALERT_FEATURE"
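
To close the loop, your alerting code can fire that workflow_dispatch trigger through GitHub's REST API (the documented POST /repos/{owner}/{repo}/actions/workflows/{workflow_id}/dispatches endpoint). Here's a standard-library-only sketch; the owner, repo, and token variable names are placeholders, and note that passing inputs this way requires them to be declared under on.workflow_dispatch.inputs in the workflow file:

```python
import json
import os
import urllib.request

OWNER, REPO = "your-org", "your-repo"      # placeholders for your repository
WORKFLOW_FILE = "retrain_on_drift.yml"

def build_dispatch_request(feature, ref="main"):
    """Build (but don't send) the workflow_dispatch API call."""
    url = (f"https://api.github.com/repos/{OWNER}/{REPO}"
           f"/actions/workflows/{WORKFLOW_FILE}/dispatches")
    payload = {"ref": ref, "inputs": {"trigger_feature": feature}}
    return url, payload

def dispatch_retrain(feature):
    url, payload = build_dispatch_request(feature)
    # RETRAIN_DISPATCH_TOKEN is our own placeholder env var for a GitHub PAT
    token = os.environ.get("RETRAIN_DISPATCH_TOKEN")
    if not token:  # keep dry runs (and this sketch) offline
        print(f"[dry run] would POST to {url}: {payload}")
        return
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
        method="POST",
    )
    urllib.request.urlopen(req)  # GitHub returns 204 No Content on success

dispatch_retrain("MedInc")
```

Call dispatch_retrain from the same place you send the Slack message, and your drift alert becomes a retraining trigger instead of just a notification.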

The retrain_pipeline.py script would:

  1. Fetch fresh training data (last 3-6 months).
  2. Run your feature engineering pipeline (wrapped in a scikit-learn Pipeline, so every transform is fitted on training data only and leakage can't creep in through manual preprocessing).
  3. Tune hyperparameters with Optuna, which typically reaches good configurations in far fewer trials than random search.
  4. Train a new model (LightGBM is a solid choice when training speed on large tabular datasets matters).
  5. Validate the new model against a holdout set and, crucially, against the current production model using a champion/challenger test.

Champion vs. Challenger: The Safe Deployment Gate

You never deploy a model just because it's new. You deploy it because it's better and safe. The champion/challenger pattern runs both models in parallel on a small percentage of live traffic (e.g., 5%) and compares business metrics.

| Metric | Champion Model (Production) | Challenger Model (New) | Winner |
| --- | --- | --- | --- |
| AUC (Holdout Set) | 0.912 | 0.927 | Challenger |
| Inference Latency (p95, CPU) | 8.7 ms | 3.2 ms | Challenger |
| Business Metric (e.g., Conversion Lift) | Baseline | +1.4% | Challenger |
| Stability Check (PSI on predictions) | < 0.1 | < 0.1 | Tie |

Benchmark Note: Latency gains like the one shown typically come from exporting the model to ONNX Runtime, which often speeds up CPU inference by a few multiples compared with the native framework. The exact factor depends heavily on the model and hardware, so benchmark on your own workload before promising numbers.

If the challenger wins on primary metrics and doesn't fail on fairness or stability checks, you promote it to champion and ramp up traffic to 100%. This entire process can be orchestrated with MLflow for model registry and staging.
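
The promotion gate itself reduces to a small, testable decision function. Here's a sketch with illustrative thresholds; the metric names and margins are our own choices, not a standard:

```python
def promote_challenger(champion_metrics, challenger_metrics,
                       min_auc_gain=0.005, max_prediction_psi=0.1):
    """Champion/challenger promotion gate. Returns (decision, reason).
    Thresholds are illustrative; tune them to your risk tolerance."""
    gain = challenger_metrics["auc"] - champion_metrics["auc"]
    if gain < min_auc_gain:
        return False, f"AUC gain {gain:.4f} below required {min_auc_gain}"
    if challenger_metrics["prediction_psi"] > max_prediction_psi:
        return False, "challenger prediction distribution is unstable"
    if challenger_metrics.get("fairness_check_passed") is False:
        return False, "challenger failed fairness checks"
    # In an MLflow setup, this is where you'd move the challenger's model
    # version to the Production stage and archive the old champion.
    return True, f"challenger wins by {gain:.4f} AUC and passes safety checks"

decision, reason = promote_challenger(
    {"auc": 0.912},
    {"auc": 0.927, "prediction_psi": 0.04, "fairness_check_passed": True},
)
print(decision, reason)
```

Keeping the gate as a pure function means you can unit-test the promotion logic itself, independent of the registry and serving plumbing.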

Real Errors You Will Hit (And How to Fix Them)

Even with automation, your pipeline will throw errors. Here are the classics:

  1. ValueError: Input contains NaN — Your production data has missing values your training set didn't.

    • Exact Fix: Never trust your data. Add a SimpleImputer(strategy='median') for numerics and 'most_frequent' for categoricals as the first step inside your scikit-learn Pipeline, before any other transformations. This ensures it's fitted during cross-validation and applied consistently.
  2. Class imbalance: Your beautiful AUC is 0.92, but F1=0.12 on the minority class — The model is ignoring the class you care about.

    • Exact Fix: Don't just use class_weight='balanced'. Combine it with threshold tuning on predict_proba. Use from sklearn.metrics import precision_recall_curve to find the probability threshold that maximizes F1 on your validation set, then apply it during inference.
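
Both fixes together look something like this sketch, run here on synthetic imbalanced data with injected missing values:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Imbalanced synthetic data (~5% positives) with ~2% missing values injected
X, y = make_classification(n_samples=8000, weights=[0.95], flip_y=0.02,
                           random_state=0)
X[np.random.default_rng(0).random(X.shape) < 0.02] = np.nan

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Fix 1: imputation lives inside the Pipeline, so it is fitted on training
# data only and applied identically at inference time
clf = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
]).fit(X_train, y_train)

# Fix 2: pick the probability threshold that maximizes F1 on validation data
probs = clf.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, probs)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-10, None)
best_threshold = thresholds[np.argmax(f1[:-1])]  # last P/R point has no threshold

y_pred = (probs >= best_threshold).astype(int)  # use this threshold at inference
print(f"tuned threshold: {best_threshold:.3f}")
```

Persist best_threshold alongside the model artifact; a model without its operating threshold is only half a deployment.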

Next Steps: Building Your Monitoring Stack

Start simple. Tomorrow, do these three things:

  1. Instrument PSI: Pick your model's top 3 features. Write a script that calculates PSI daily between your training set distribution and yesterday's production inferences. Log it.
  2. Create One Alert: Set a threshold (PSI > 0.2) for one of those features. Hook it up to a Slack channel. Get it annoying you.
  3. Version Your Data: Use DVC to version your reference dataset. When you retrain, your new reference becomes the version that was used to train the new champion. This is audit trail 101.

This isn't academic. Practitioners consistently report that data and feature work consumes the majority of production ML project time. Letting that investment rot due to drift is professional malpractice. Your model isn't a painting you hang on the wall and admire; it's an engine that needs constant tuning. Build the gauges and the automatic shut-off valves. Your users—and your on-call schedule—will thank you.