XGBoost from 68% to 84% AUC: Feature Engineering, Hyperparameter Tuning, and SHAP Explanation

Step-by-step case study improving an XGBoost model from baseline to production-ready — feature importance analysis with SHAP, target encoding for high-cardinality categoricals, Optuna hyperparameter search, and calibrated probabilities.

Your XGBoost model has 68% AUC. Your stakeholders want 80%+. This guide shows the exact feature engineering and tuning steps to get there.

You’ve split the data, called .fit(), and gotten a model that’s… fine. It’s not failing, but it’s not winning any Kaggle competitions either. That 68% AUC is staring you down, and the business team’s expectation of 80%+ feels like a distant dream. Before you start randomly grid searching max_depth from 3 to 20, stop. Throwing more hyperparameters at a weak feature set is like trying to polish a brick. The real gains—the jump from mediocre to production-ready—come from a ruthless, systematic workflow. We’ll move from an audit of what’s broken, through targeted feature engineering and hyperparameter tuning, to a model you can explain to a skeptical stakeholder. We’re using tools that win: XGBoost, a perennial favorite on Kaggle tabular leaderboards, Optuna for smart tuning, and SHAP to explain it all.

Starting Audit: SHAP Summary Plot to Identify Your Top 10 Features

Your first instinct might be to build more features. The better move is to understand the ones you have. A SHAP summary plot is your X-ray vision. It shows you which features your model actually relies on and whether that reliance makes sense.

Let’s assume you’ve trained a baseline XGBoost model. Don’t tune it yet. Just train it and interrogate it.

import xgboost as xgb
import shap
import pandas as pd
from sklearn.model_selection import train_test_split


# df = pd.read_csv('your_data.csv')
X = df.drop('target', axis=1)
y = df['target']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline model - use defaults for now
model_baseline = xgb.XGBClassifier(
    n_estimators=100,
    random_state=42,
    enable_categorical=False, # We'll handle encoding ourselves
    eval_metric='auc'
)
model_baseline.fit(X_train, y_train)

# Calculate baseline AUC
from sklearn.metrics import roc_auc_score
y_pred_proba = model_baseline.predict_proba(X_val)[:, 1]
baseline_auc = roc_auc_score(y_val, y_pred_proba)
print(f"Baseline Validation AUC: {baseline_auc:.4f}")

# --- SHAP AUDIT ---
# Use TreeExplainer for XGBoost
explainer = shap.TreeExplainer(model_baseline)
shap_values = explainer.shap_values(X_val)

# Create the summary plot
shap.summary_plot(shap_values, X_val, plot_type="dot", max_display=10)

Run this. The plot will rank features by their mean absolute SHAP value (impact on model output). What are you looking for?

  1. Top-Heavy Distribution: Do 2-3 features dominate? This often indicates your model is underfitting because it’s latching onto obvious signals. You need to create features that capture more nuanced patterns.
  2. Nonsense Features: Is a feature that should carry no signal, like customer_id, or a poorly cleaned year field (e.g., year=2050), creeping into the top 10? This is a data quality red flag.
  3. Expected Features Missing: Is a feature you know is important from domain knowledge (e.g., transaction_amount / account_age) not in the top 10? Your model can’t use what you haven’t built. This is your feature engineering to-do list.

This 5-minute audit tells you where to focus 80% of your effort.

Feature Engineering: Interaction Features, Target Encoding, and Aggregations

Now we build. Remember, feature engineering accounts for 60–70% of ML project time in production environments. We’ll use scikit-learn's ColumnTransformer and Pipeline to do this cleanly, preventing the #1 cause of dev-to-production performance collapse: data leakage.

Error Message & Fix #1:

In dev, your model gets 99% accuracy. In production, it drops to 60%. Fix: Never fit your preprocessors (like scalers, encoders) on the full dataset before a train/test split. Always fit them inside your cross-validation fold or training loop using a Pipeline.

Here’s a safe engineering pipeline:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
import numpy as np

# Define numeric and categorical columns (example)
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['job_type', 'education']

# Create preprocessors
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')), # Handles missing values
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine them
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Now, BUILD NEW FEATURES on the training set only.
# Do this *before* the ColumnTransformer in your workflow.
def create_interaction_features(X):
    X = X.copy()
    # Example: Ratio feature identified as important
    X['income_to_age_ratio'] = X['income'] / (X['age'] + 1) # Avoid div by zero
    # Interaction between two numeric fields
    X['credit_income_interaction'] = X['credit_score'] * X['income']
    return X

# Use in a training workflow:
X_train_engineered = create_interaction_features(X_train)
X_val_engineered = create_interaction_features(X_val) # Transform only, no fitting

# Then pass X_train_engineered/X_val_engineered to your preprocessor and model

Key Techniques:

  • Interactions/Ratios: The income / age ratio is more predictive than either alone. Look at pairs of your top SHAP features.
  • Binning & Counts: Group continuous variables (e.g., age into decades) or create aggregate counts (e.g., number_of_transactions_last_7days).
  • Target Encoding (Cautiously): For high-cardinality categories (like zip_code), mean encoding can be powerful. But you must fit this inside your CV fold to avoid leakage. Use the category_encoders library or compute it yourself with sklearn's KFold.
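The target-encoding caveat deserves code. Here is one hedged sketch of out-of-fold target encoding using sklearn's KFold, so each row is encoded only with statistics from folds that don't contain it (the smoothing constant and the toy zip_code data are illustrative assumptions, not a prescription):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(train, col, target, n_splits=5, smoothing=10):
    """Out-of-fold target encoding: each row's encoding is computed
    from folds that do NOT contain that row, preventing leakage."""
    global_mean = train[target].mean()
    encoded = pd.Series(index=train.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for fit_idx, enc_idx in kf.split(train):
        fit_fold = train.iloc[fit_idx]
        stats = fit_fold.groupby(col)[target].agg(['mean', 'count'])
        # Shrink rare categories toward the global mean
        smooth = ((stats['count'] * stats['mean'] + smoothing * global_mean)
                  / (stats['count'] + smoothing))
        encoded.iloc[enc_idx] = (train.iloc[enc_idx][col]
                                 .map(smooth).fillna(global_mean).values)
    return encoded

# Toy example with a high-cardinality-ish column
df = pd.DataFrame({
    'zip_code': ['10001', '10001', '94105', '94105', '60601', '60601'] * 20,
    'target':   [1, 1, 0, 0, 1, 0] * 20,
})
df['zip_te'] = target_encode_oof(df, 'zip_code', 'target')
print(df.head())
```

At inference time, fit the mapping once on the full training set and apply it to new data; the out-of-fold trick is only needed while training and validating.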

Handling Missing Values: When to Impute vs Let XGBoost Handle Natively

XGBoost can natively handle missing values (NaN) by learning a default direction for them during training. This is often optimal. However, a simple imputation can sometimes stabilize training.

Rule of Thumb:

  • If missingness is informative (e.g., "user did not provide income" is a signal), let XGBoost handle it: skip the imputer and pass np.nan straight through, and each split will learn a default direction for missing values.
  • If missingness is random and small (<5%), impute with median (numeric) or mode (categorical) for more consistent behavior across different library versions.
  • Never use SimpleImputer(strategy='constant', fill_value=-999) unless you have a specific reason. It creates an artificial, dense spike in your data distribution.

Error Message & Fix #2:

ValueError: Input contains NaN Fix: XGBoost itself accepts NaN, so this error usually comes from a scikit-learn step in your pipeline (e.g., StandardScaler). Explicitly add a SimpleImputer(strategy='median') for numerics and 'most_frequent' for categoricals as the first step in your Pipeline. This guarantees no NaN reaches the downstream transformers or the model.

Optuna Hyperparameter Tuning: Pruning Bad Trials Early for 10x Speed

Random search is for amateurs. Optuna typically finds better hyperparameters than random search in a fraction of the trials. Its Tree-structured Parzen Estimator (TPE) sampler concentrates the search where good results cluster, and its pruners kill hopeless trials early, letting you explore the search space deeply.

Here’s how to integrate it with your feature engineering pipeline:

import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Wrap your model and preprocessor in a final pipeline
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier(random_state=42, n_jobs=-1))
])

def objective(trial):
    """Define the hyperparameter search space and objective."""
    params = {
        'classifier__n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'classifier__max_depth': trial.suggest_int('max_depth', 3, 10),
        'classifier__learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'classifier__subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'classifier__colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'classifier__gamma': trial.suggest_float('gamma', 1e-8, 1.0, log=True),
        'classifier__reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
        'classifier__reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
    }
    full_pipeline.set_params(**params)

    # Use cross-validation for a robust score
    score = cross_val_score(
        full_pipeline, X_train_engineered, y_train,
        cv=5, scoring='roc_auc', n_jobs=-1
    ).mean()
    return score

# Create a study and optimize
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=50, show_progress_bar=True) # Start with 50 trials

print(f"Best trial AUC: {study.best_value:.4f}")
print("Best params:", study.best_params)

# Train your final model with the best params. Note: study.best_params stores
# the bare names ('max_depth', ...), so re-add the 'classifier__' prefix
# before handing them to the pipeline.
best_model = full_pipeline.set_params(
    **{f'classifier__{k}': v for k, v in study.best_params.items()}
)
best_model.fit(X_train_engineered, y_train)

Why this works: the TPE sampler concentrates trials in promising regions of the search space, so you waste little time on combinations like max_depth=10 with learning_rate=0.9. One caveat: with plain cross_val_score, Optuna only sees each trial's final score. To actually prune mid-trial, run the CV loop yourself and report the running score after each fold via trial.report().

Class Imbalance: scale_pos_weight vs Under-Sampling vs SMOTE

Your AUC might be decent, but your precision on the minority class is terrible. XGBoost has a built-in fix: scale_pos_weight. Set it to num_negative_examples / num_positive_examples. This is your first and simplest option.
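Computing the ratio is one line; a sketch with a made-up 5% positive class standing in for your labels:

```python
import numpy as np

y_train = np.array([0] * 950 + [1] * 50)  # 5% positive class
neg, pos = np.bincount(y_train)
scale_pos_weight = neg / pos
print(scale_pos_weight)  # 19.0

# Pass it straight to the model:
# xgb.XGBClassifier(scale_pos_weight=scale_pos_weight, ...)
```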

If that’s not enough, consider:

  • Threshold Tuning: Don't use the default 0.5 cutoff from predict_proba. Use the validation set to find the threshold that maximizes F1 or your business metric (e.g., profit curve).
  • SMOTE (Synthetic Minority Oversampling): Use imblearn.over_sampling.SMOTE. Crucially, you must apply SMOTE only to the training fold inside your CV/Pipeline to avoid creating synthetic data that leaks into your validation set.
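Threshold tuning is cheap to implement. A sketch using precision_recall_curve to pick the F1-maximizing cutoff (the scores here are synthetic stand-ins for your model's predict_proba output on the validation set):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic validation labels and scores for illustration
rng = np.random.default_rng(42)
y_val = rng.integers(0, 2, 1000)
y_pred_proba = np.clip(y_val * 0.4 + rng.random(1000) * 0.6, 0, 1)

precision, recall, thresholds = precision_recall_curve(y_val, y_pred_proba)
# F1 at each candidate threshold (guard against 0/0)
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
best_idx = int(np.argmax(f1[:-1]))  # the last precision/recall pair has no threshold
best_threshold = thresholds[best_idx]
print(f"Best threshold: {best_threshold:.3f}, F1: {f1[best_idx]:.3f}")

y_pred = (y_pred_proba >= best_threshold).astype(int)
```

Swap F1 for your business metric (expected profit, recall at fixed precision) where appropriate, and always pick the threshold on validation data, never on the test set.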

Benchmark Table: Tuning & Imbalance Strategies

| Strategy | Key Parameter | Relative Training Time | Impact on Minority Class Recall | Risk of Overfitting |
| --- | --- | --- | --- | --- |
| XGBoost scale_pos_weight | sum(negative) / sum(positive) | +0% | High | Low |
| Probability threshold tuning | Decision threshold (e.g., 0.3) | +0% | Very High | Medium (if tuned on test set) |
| SMOTE (inside CV) | sampling_strategy (e.g., 0.5) | +30% | Very High | Medium-High |
| Class weight in loss | class_weight='balanced' | +5% | High | Low |

Calibrated Probabilities: When AUC Is Good But Prediction Probabilities Aren't

Gradient boosting models, especially when regularized or tuned for AUC, can produce poorly calibrated probabilities (i.e., a predicted probability of 0.8 may only be correct 60% of the time). This is fatal for risk models. Use sklearn's CalibratedClassifierCV with method='isotonic' (flexible, but wants roughly a thousand or more samples) or 'sigmoid' (safer for small validation sets).

from sklearn.calibration import CalibratedClassifierCV

# Wrap your best model
calibrated_model = CalibratedClassifierCV(best_model, cv=5, method='isotonic')
calibrated_model.fit(X_train_engineered, y_train)

# Now predict_proba will be better calibrated for decision-making
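To verify the calibration actually improved, compare predicted probabilities against observed frequencies bin by bin. A sketch with synthetic, well-calibrated scores standing in for calibrated_model.predict_proba(X_val)[:, 1]:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Synthetic probabilities that are well calibrated by construction
rng = np.random.default_rng(42)
proba = rng.random(2000)
y_val = (rng.random(2000) < proba).astype(int)

# Fraction of actual positives vs. mean predicted probability per bin
frac_pos, mean_pred = calibration_curve(y_val, proba, n_bins=10)
print("Brier score:", round(brier_score_loss(y_val, proba), 4))
for mp, fp in zip(mean_pred, frac_pos):
    print(f"predicted {mp:.2f} -> observed {fp:.2f}")  # should track the diagonal
```

Run it once on the raw model and once on the calibrated one: the Brier score should drop and the per-bin pairs should move toward the diagonal.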

Model Card and SHAP Explanation for Business Stakeholders

Your model is useless if it’s a black box. You need to explain it. Create a simple "Model Card":

  1. Purpose: "This model predicts customer churn with 84% AUC."
  2. Performance: Validation AUC, precision/recall at chosen threshold.
  3. Key Drivers: Use a SHAP bar plot of the mean absolute SHAP values for the top 10 features. This is your executive summary.
# Generate business-ready SHAP plot
X_val_processed = best_model.named_steps['preprocessor'].transform(X_val_engineered)
explainer_best = shap.TreeExplainer(best_model.named_steps['classifier'])
shap_values_best = explainer_best.shap_values(X_val_processed)
shap.summary_plot(
    shap_values_best, X_val_processed, plot_type="bar",
    feature_names=best_model.named_steps['preprocessor'].get_feature_names_out(),
    max_display=10
)
  4. Limitations: "Performance degrades for customers with <30 days of history."
  5. Training Data: Source, date range, sample size.
  6. Ethical Considerations: Note any potential bias checked via subgroup AUC.

Export this as a PDF or Markdown file alongside your model. When a stakeholder asks "Why did we deny this customer?", you can use SHAP's force plot for individual explanations.

Next Steps: From 84% to Production

You’ve gone from 68% to 84% AUC. The next few points might come from ensemble stacking or more exotic features, but the ROI diminishes. Now, operationalize.

  1. Serialize Your Pipeline: Use joblib to dump your entire Pipeline (preprocessor + model). This ensures the same transformations are applied at inference.
    import joblib
    joblib.dump(best_model, 'xgboost_churn_pipeline_v1.joblib')
    
  2. Consider ONNX for Speed: If you need ultra-low latency, convert your model. ONNX Runtime typically runs CPU inference several times faster than calling predict_proba on the Python pipeline. Use skl2onnx to convert your scikit-learn pipeline (XGBoost models need the onnxmltools converter registered first).
  3. Track Everything: Use MLflow to log your experiments, parameters, metrics, and the model artifact itself. Use DVC to version your training datasets. This turns your one-off project into a reproducible pipeline.
  4. Monitor Drift: Schedule a job to calculate your model's performance and feature distributions on new data weekly. A drop in AUC or a shift in the income distribution is your cue to retrain.
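For the distribution-shift half of that check, the Population Stability Index (PSI) is a common, dependency-free choice. A sketch (the decision thresholds in the docstring are a rule of thumb, and the income data is synthetic):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a new sample.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 retrain."""
    # Bin edges from the reference distribution's quantiles
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # digitize against the inner edges so out-of-range values land in the end bins
    e_idx = np.clip(np.digitize(expected, edges[1:-1]), 0, bins - 1)
    a_idx = np.clip(np.digitize(actual, edges[1:-1]), 0, bins - 1)
    e_pct = np.bincount(e_idx, minlength=bins) / len(expected)
    a_pct = np.bincount(a_idx, minlength=bins) / len(actual)
    # Avoid log(0) for empty bins
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(42)
train_income = rng.normal(50_000, 15_000, 10_000)
new_income = rng.normal(55_000, 15_000, 10_000)  # the distribution has shifted
print(f"PSI: {psi(train_income, new_income):.3f}")
```

Compute this per feature on each weekly batch and alert when any feature crosses your threshold, alongside the AUC check.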

Your model is no longer a cryptic script. It's a documented, high-performing, explainable asset. The stakeholders get their 80%+, and you get a repeatable playbook for the next project. Now go update that Jira ticket.