Your XGBoost model has 68% AUC. Your stakeholders want 80%+. This guide shows the exact feature engineering and tuning steps to get there.
You’ve split the data, called .fit(), and gotten a model that’s… fine. It’s not failing, but it’s not winning any Kaggle competitions either. That 68% AUC is staring you down, and the business team’s expectation of 80%+ feels like a distant dream. Before you start randomly grid searching max_depth from 3 to 20, stop. Throwing more hyperparameters at a weak feature set is like trying to polish a brick. The real gains—the jump from mediocre to production-ready—come from a ruthless, systematic workflow. We’ll move from an audit of what’s broken, through targeted feature engineering and hyperparameter tuning, to a model you can explain to a skeptical stakeholder. We’re using tools with a track record: XGBoost, a perennial winner of Kaggle tabular competitions; Optuna for smart tuning; and SHAP to explain it all.
Starting Audit: SHAP Summary Plot to Identify Your Top 10 Features
Your first instinct might be to build more features. The better move is to understand the ones you have. A SHAP summary plot is your X-ray vision. It shows you which features your model actually relies on and whether that reliance makes sense.
Let’s assume you’ve trained a baseline XGBoost model. Don’t tune it yet. Just train it and interrogate it.
import xgboost as xgb
import shap
import pandas as pd
from sklearn.model_selection import train_test_split
# df = pd.read_csv('your_data.csv')
X = df.drop('target', axis=1)
y = df['target']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# Baseline model - use defaults for now
model_baseline = xgb.XGBClassifier(
n_estimators=100,
random_state=42,
enable_categorical=False, # We'll handle encoding ourselves
eval_metric='auc'
)
model_baseline.fit(X_train, y_train)
# Calculate baseline AUC
from sklearn.metrics import roc_auc_score
y_pred_proba = model_baseline.predict_proba(X_val)[:, 1]
baseline_auc = roc_auc_score(y_val, y_pred_proba)
print(f"Baseline Validation AUC: {baseline_auc:.4f}")
# --- SHAP AUDIT ---
# Use TreeExplainer for XGBoost
explainer = shap.TreeExplainer(model_baseline)
shap_values = explainer.shap_values(X_val)
# Create the summary plot
shap.summary_plot(shap_values, X_val, plot_type="dot", max_display=10)
Run this. The plot will rank features by their mean absolute SHAP value (impact on model output). What are you looking for?
- Top-Heavy Distribution: Do 2-3 features dominate? This often indicates your model is underfitting because it’s latching onto obvious signals. You need to create features that capture more nuanced patterns.
- Nonsense Features: Is a low-importance feature like `customer_id` or a poorly cleaned `year` field (e.g., `year=2050`) creeping into the top 10? This is a data quality red flag.
- Expected Features Missing: Is a feature you know is important from domain knowledge (e.g., `transaction_amount / account_age`) not in the top 10? Your model can’t use what you haven’t built. This is your feature engineering to-do list.
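If you want the same ranking as a table instead of a plot, the mean absolute SHAP ordering is a couple of lines of numpy. A sketch using a stand-in SHAP matrix (in the audit above you’d pass the real `shap_values` array from `TreeExplainer`):

```python
import numpy as np
import pandas as pd

# Stand-in SHAP matrix: rows = validation samples, cols = features.
# In the audit above this would be the `shap_values` array from TreeExplainer.
rng = np.random.default_rng(42)
shap_matrix = rng.normal(scale=[2.0, 1.0, 0.5, 0.1, 0.05], size=(500, 5))
feature_names = ['income', 'credit_score', 'age', 'customer_id', 'year']

# Rank features by mean absolute SHAP value -- the same ordering
# the summary plot uses on its y-axis
importance = (
    pd.Series(np.abs(shap_matrix).mean(axis=0), index=feature_names)
    .sort_values(ascending=False)
)
print(importance)
```

Scanning this table for identifiers or impossible values is the same red-flag check as eyeballing the plot, but it’s scriptable.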
This 5-minute audit tells you where to focus 80% of your effort.
Feature Engineering: Interaction Features, Target Encoding, and Aggregations
Now we build. Remember, feature engineering commonly consumes the bulk of ML project time in production environments. We’ll use scikit-learn's ColumnTransformer and Pipeline to do this cleanly, preventing the #1 cause of model collapse: data leakage.
Error Message & Fix #1: In dev, your model gets 99% accuracy; in production, it drops to 60%. Fix: never fit your preprocessors (like scalers, encoders) on the full dataset before a train/test split. Always fit them inside your cross-validation fold or training loop using a `Pipeline`.
Here’s a safe engineering pipeline:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
import numpy as np
# Define numeric and categorical columns (example)
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['job_type', 'education']
# Create preprocessors
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')), # Handles missing values
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
# Combine them
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Now, BUILD NEW FEATURES on the training set only.
# Do this *before* the ColumnTransformer in your workflow.
def create_interaction_features(X):
X = X.copy()
# Example: Ratio feature identified as important
X['income_to_age_ratio'] = X['income'] / (X['age'] + 1) # Avoid div by zero
# Interaction between two numeric fields
X['credit_income_interaction'] = X['credit_score'] * X['income']
return X
# Use in a training workflow:
X_train_engineered = create_interaction_features(X_train)
X_val_engineered = create_interaction_features(X_val) # Transform only, no fitting
# Then pass X_train_engineered/X_val_engineered to your preprocessor and model
Key Techniques:
- Interactions/Ratios: The `income / age` ratio can be more predictive than either feature alone. Look at pairs of your top SHAP features.
- Binning & Counts: Group continuous variables (e.g., `age` into decades) or create aggregate counts (e.g., `number_of_transactions_last_7days`).
- Target Encoding (Cautiously): For high-cardinality categories (like `zip_code`), mean encoding can be powerful. But you must fit it inside your CV fold to avoid leakage. Use the `category_encoders` library or compute it with sklearn's `KFold`.
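For the leak-free target encoding, here’s a minimal sketch using plain `KFold`: each row is encoded with category means computed from the other folds, so its own label never contributes to its encoding (the toy `zip_code` data below is invented):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(train, col, target, n_splits=5, seed=42):
    """Out-of-fold mean target encoding: each row is encoded using
    category means computed from the *other* folds, so its own label
    never leaks into its encoding."""
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    global_mean = train[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train):
        fold_means = train.iloc[fit_idx].groupby(col)[target].mean()
        encoded.iloc[enc_idx] = train.iloc[enc_idx][col].map(fold_means).to_numpy()
    return encoded.fillna(global_mean)  # unseen categories fall back to the prior

# Toy data standing in for a high-cardinality categorical like zip_code
df_train = pd.DataFrame({
    'zip_code': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'A', 'B', 'C'],
    'target':   [1,   0,   1,   1,   0,   0,   0,   1,   1,   0],
})
df_train['zip_code_te'] = target_encode_oof(df_train, 'zip_code', 'target')
print(df_train)
```

For the validation and test sets, encode with means computed from the full training set; the out-of-fold trick is only needed where the labels being encoded are the same labels you train on.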
Handling Missing Values: When to Impute vs Let XGBoost Handle Natively
XGBoost can natively handle missing values (NaN) by learning a default direction for them during training. This is often optimal. However, a simple imputation can sometimes stabilize training.
Rule of Thumb:
- If missingness is informative (e.g., "user did not provide income" is a signal), let XGBoost handle it. Pass `np.nan` and ensure `enable_categorical=False`.
- If missingness is random and small (<5%), impute with median (numeric) or mode (categorical) for more consistent behavior across different library versions.
- Never use `SimpleImputer(strategy='constant', fill_value=-999)` unless you have a specific reason. It creates an artificial, dense spike in your data distribution.
Error Message & Fix #2:
`ValueError: Input contains NaN`. Fix: this usually comes from a scikit-learn step (like `StandardScaler`) in your pipeline, not from XGBoost itself. Explicitly add a `SimpleImputer(strategy='median')` for numerics and `'most_frequent'` for categoricals as the first step in your `Pipeline`. This guarantees no `NaN` reaches the offending step.
Optuna Hyperparameter Tuning: Pruning Bad Trials Early for 10x Speed
Random search is for amateurs. Optuna typically finds better hyperparameters than random search in far fewer trials. Its Tree-structured Parzen Estimator (TPE) sampler concentrates new trials in promising regions of the search space, and its pruners (e.g., `MedianPruner`) kill hopeless trials early, letting you explore the space deeply.
Here’s how to integrate it with your feature engineering pipeline:
import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
# Wrap your model and preprocessor in a final pipeline
full_pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', XGBClassifier(random_state=42, n_jobs=-1))
])
def objective(trial):
"""Define the hyperparameter search space and objective."""
params = {
'classifier__n_estimators': trial.suggest_int('n_estimators', 100, 1000),
'classifier__max_depth': trial.suggest_int('max_depth', 3, 10),
# suggest_float(..., log=True) replaces the deprecated suggest_loguniform/suggest_uniform
'classifier__learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
'classifier__subsample': trial.suggest_float('subsample', 0.6, 1.0),
'classifier__colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
'classifier__gamma': trial.suggest_float('gamma', 1e-8, 1.0, log=True),
'classifier__reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
'classifier__reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
}
full_pipeline.set_params(**params)
# Use cross-validation for a robust score
score = cross_val_score(
full_pipeline, X_train_engineered, y_train,
cv=5, scoring='roc_auc', n_jobs=-1
).mean()
return score
# Create a study and optimize
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=50, show_progress_bar=True) # Start with 50 trials
print(f"Best trial AUC: {study.best_value:.4f}")
print("Best params:", study.best_params)
# Train your final model with the best params
# (study.best_params uses the bare names, so re-add the 'classifier__' prefix)
best_params = {f'classifier__{k}': v for k, v in study.best_params.items()}
best_model = full_pipeline.set_params(**best_params)
best_model.fit(X_train_engineered, y_train)
Why this works: TPE steers each new trial toward regions of the search space that have scored well so far, so you waste little time on hopeless combinations like `max_depth=10` with `learning_rate=0.9`. One caveat: `cross_val_score` returns only a single final number, so Optuna has no intermediate values to prune on. To get actual mid-trial pruning, report a score per fold with `trial.report()` and check `trial.should_prune()`.
Class Imbalance: scale_pos_weight vs Under-Sampling vs SMOTE
Your AUC might be decent, but your precision on the minority class is terrible. XGBoost has a built-in fix: scale_pos_weight. Set it to num_negative_examples / num_positive_examples. This is your first and simplest option.
If that’s not enough, consider:
- Threshold Tuning: Don't use the default 0.5 cutoff from `predict_proba`. Use the validation set to find the threshold that maximizes F1 or your business metric (e.g., a profit curve).
- SMOTE (Synthetic Minority Oversampling): Use `imblearn.over_sampling.SMOTE`. Crucially, you must apply SMOTE only to the training fold inside your CV/Pipeline to avoid creating synthetic data that leaks into your validation set.
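Threshold tuning takes only a few lines with `precision_recall_curve`: sweep every candidate cutoff on the validation predictions and keep the one that maximizes F1. A sketch on stand-in labels and probabilities (your own `y_val` and predicted probabilities would replace these):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(42)
# Stand-in validation labels (~10% positive) and model probabilities
y_val = (rng.random(2000) < 0.10).astype(int)
y_proba = np.clip(0.10 + 0.50 * y_val + rng.normal(0, 0.15, 2000), 0, 1)

# precision/recall at every candidate cutoff, computed in one pass
precision, recall, thresholds = precision_recall_curve(y_val, y_proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_idx = int(np.argmax(f1[:-1]))  # last precision/recall point has no threshold

print(f"best threshold: {thresholds[best_idx]:.2f} (F1 = {f1[best_idx]:.3f})")
```

Swap the F1 expression for a profit calculation if your business metric has asymmetric costs; the sweep logic is identical.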
Benchmark Table: Tuning & Imbalance Strategies
| Strategy | Key Parameter | Relative Training Time | Impact on Minority Class Recall | Risk of Overfitting |
|---|---|---|---|---|
| XGBoost scale_pos_weight | sum(negative) / sum(positive) | +0% | High | Low |
| Probability Threshold Tuning | Decision Threshold (e.g., 0.3) | +0% | Very High | Medium (if tuned on test set) |
| SMOTE (inside CV) | sampling_strategy (e.g., 0.5) | +30% | Very High | Medium-High |
| Class Weight in Loss | class_weight='balanced' | +5% | High | Low |
Calibrated Probabilities: When AUC Is Good But Prediction Probabilities Aren't
Gradient boosting models, especially when regularized or tuned for AUC, can produce poorly calibrated probabilities (i.e., a predicted probability of 0.8 may only be correct 60% of the time). This is fatal for risk models. Use sklearn's CalibratedClassifierCV with method='isotonic' or 'sigmoid'.
from sklearn.calibration import CalibratedClassifierCV
# Wrap your best model
calibrated_model = CalibratedClassifierCV(best_model, cv=5, method='isotonic')
calibrated_model.fit(X_train_engineered, y_train)
# Now predict_proba will be better calibrated for decision-making
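To check whether calibration actually helped, compare Brier scores (lower is better) before and after wrapping. A self-contained sketch on synthetic data, with `GradientBoostingClassifier` standing in for your tuned XGBoost pipeline:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Uncalibrated model vs the same model wrapped in isotonic calibration
raw = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
cal = CalibratedClassifierCV(GradientBoostingClassifier(random_state=42),
                             cv=5, method='isotonic').fit(X_tr, y_tr)

for name, model in [('raw', raw), ('calibrated', cal)]:
    proba = model.predict_proba(X_te)[:, 1]
    print(f"{name}: Brier score = {brier_score_loss(y_te, proba):.4f}")
```

Plotting `sklearn.calibration.calibration_curve` for both models gives the visual version: a well-calibrated model hugs the diagonal.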
Model Card and SHAP Explanation for Business Stakeholders
Your model is useless if it’s a black box. You need to explain it. Create a simple "Model Card":
- Purpose: "This model predicts customer churn with 84% AUC."
- Performance: Validation AUC, precision/recall at chosen threshold.
- Key Drivers: Use a SHAP bar plot of the mean absolute SHAP values for the top 10 features. This is your executive summary.
# Generate business-ready SHAP plot using the pipeline's *fitted* pieces
fitted_preprocessor = best_model.named_steps['preprocessor']
X_val_processed = fitted_preprocessor.transform(X_val_engineered)
explainer_best = shap.TreeExplainer(best_model.named_steps['classifier'])
shap_values_best = explainer_best.shap_values(X_val_processed)
shap.summary_plot(
    shap_values_best,
    X_val_processed,
    plot_type="bar",
    feature_names=best_model[:-1].get_feature_names_out()
)
- Limitations: "Performance degrades for customers with <30 days of history."
- Training Data: Source, date range, sample size.
- Ethical Considerations: Note any potential bias checked via subgroup AUC.
Export this as a PDF or Markdown file alongside your model. When a stakeholder asks "Why did we deny this customer?", you can use SHAP's force plot for individual explanations.
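Under the hood, a force plot just draws the additive SHAP identity: the model's raw output for one customer equals the explainer's base value plus that row's per-feature contributions. A sketch with made-up numbers showing the arithmetic (in practice the values come from `explainer.expected_value` and `explainer.shap_values` on a single row):

```python
import numpy as np

# Made-up SHAP decomposition for a single customer (log-odds scale)
base_value = -1.2                              # model's average raw output
shap_row = np.array([0.8, -0.5, 0.3, -0.1])    # per-feature contributions for this row
feature_names = ['credit_score', 'income', 'age', 'job_type']

margin = base_value + shap_row.sum()           # additive SHAP identity
prob = 1 / (1 + np.exp(-margin))               # log-odds -> probability

# Sort by absolute impact -- this is exactly what the force plot draws
for name, contrib in sorted(zip(feature_names, shap_row), key=lambda t: -abs(t[1])):
    print(f"{name}: {contrib:+.2f}")
print(f"predicted probability: {prob:.3f}")
```

This numeric breakdown is often more useful than the plot itself when a stakeholder asks for the "why" behind one specific decision in writing.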
Next Steps: From 84% to Production
You’ve gone from 68% to 84% AUC. Further gains might come from ensemble stacking or more exotic features, but the ROI diminishes from here. Now, operationalize.
- Serialize Your Pipeline: Use `joblib` to dump your entire `Pipeline` (preprocessor + model). This ensures the same transformations are applied at inference: `import joblib; joblib.dump(best_model, 'xgboost_churn_pipeline_v1.joblib')`.
- Consider ONNX for Speed: If you need ultra-low latency, convert your model. ONNX Runtime can be substantially faster than calling the Python pipeline for CPU inference. Use `skl2onnx` to convert your scikit-learn pipeline.
- Track Everything: Use MLflow to log your experiments, parameters, metrics, and the model artifact itself. Use DVC to version your training datasets. This turns your one-off project into a reproducible pipeline.
- Monitor Drift: Schedule a job to calculate your model's performance and feature distributions on new data weekly. A drop in AUC or a shift in the `income` distribution is your cue to retrain.
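For the drift check, the Population Stability Index (PSI) is a common choice: bin the training distribution, compare the new data's bin frequencies, and alert above a rule-of-thumb cutoff (often 0.2). A numpy-only sketch with invented income distributions:

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a reference sample (training data)
    and a new sample, using quantile bins derived from the reference."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range values
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)   # avoid log(0) on empty bins
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(42)
train_income = rng.normal(50_000, 15_000, 10_000)
drifted_income = rng.normal(42_000, 15_000, 10_000)  # mean shifted down

print(f"PSI (no drift):   {psi(train_income, rng.normal(50_000, 15_000, 10_000)):.3f}")
print(f"PSI (with drift): {psi(train_income, drifted_income):.3f}")
```

Run this per feature in the weekly job; a PSI that crosses your cutoff on a top SHAP feature is a stronger retrain signal than one on a feature the model barely uses.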
Your model is no longer a cryptic script. It's a documented, high-performing, explainable asset. The stakeholders get their 80%+, and you get a repeatable playbook for the next project. Now go update that Jira ticket.