Real-Time Fraud Detection Engine: XGBoost Feature Store and Sub-100ms Scoring

Build a production fraud detection pipeline that scores transactions in under 100ms — Redis feature store, XGBoost model, streaming feature computation, and A/B testing new rules safely.

Fraud Detection Latency Budget: Breaking Down the 100ms SLA

Visa's fraud detection has a 100ms SLA. Your model takes 340ms because it computes features on-demand from PostgreSQL. A Redis feature store pre-computes everything — here's the architecture that gets you under 50ms.

That 100ms isn't a suggestion; it's the physical and financial reality of a card swipe. The breakdown is brutal: network transit (15ms), authorization logic (10ms), and database lookups (25ms) leave you a 50ms window for feature retrieval and the actual ML inference. Blow past 100ms and the transaction times out at the terminal, the customer retries, and you've just created a terrible user experience and a potential false positive. Payment fraud will cost an estimated $40B globally in 2025 (Nilson Report), and while ML models catch 94% of fraud versus 76% for rule-based systems, that accuracy is worthless if your predictions arrive late.

Your latency budget isn't for the model alone. It's for the entire scoring pipeline: fetching the customer's historical features, enriching the raw transaction, running inference, and applying post-model business rules. If you're computing a 30-day transaction velocity or a graph-based feature like "number of connections to known mule accounts" on-demand, you're already dead in the water. This is where the architecture shifts from "model-first" to "feature-first."

Feature Store Architecture: Pre-Computing vs On-Demand Feature Retrieval

On-demand feature computation is the architectural equivalent of baking a cake every time someone asks for a slice. For a single transaction, you query the raw transaction logs, join tables, apply window functions, and pray your database indices are perfect. The p99 latency for this in a moderately loaded PostgreSQL instance? About 45ms for a complex fraud feature vector. That's nearly your entire budget gone before you even call model.predict().

A feature store inverts this. It's a pre-baked, sliced, and plated cake service. Core transactional data (amount, merchant, timestamp) streams in. A separate, asynchronous process—decoupled from the critical path—continuously computes and updates the heavy features: rolling windows, aggregates, graph metrics. These computed features are written to a low-latency serving layer, like Redis. At scoring time, your model service performs a simple key-value lookup. The benchmark is stark: Redis feature store <2ms vs on-demand computation 45ms for fraud scoring.

Here’s the async computation job using a pseudo-streaming approach with Pandas. This runs in a separate service, not in the request path.

import pandas as pd
import redis
from datetime import datetime, timedelta


def compute_user_velocity_features(user_id: str, transaction_df: pd.DataFrame):
    """Compute features asynchronously and store in Redis."""
    now = datetime.now()
    past_24h = now - timedelta(hours=24)
    
    user_tx = transaction_df[transaction_df['user_id'] == user_id]
    recent_tx = user_tx[user_tx['timestamp'] >= past_24h]
    
    features = {
        'tx_count_24h': len(recent_tx),
        'total_amount_24h': recent_tx['amount'].sum(),
        'avg_amount_24h': recent_tx['amount'].mean(),
        'std_amount_24h': recent_tx['amount'].std(),
        'unique_merchants_24h': recent_tx['merchant_id'].nunique(),
        # Time since last transaction
        'hours_since_last_tx': (now - user_tx['timestamp'].max()).total_seconds() / 3600 if not user_tx.empty else 24.0
    }
    # Fill NaNs for new users
    for key, val in features.items():
        if pd.isna(val):
            features[key] = 0.0 if 'count' in key or 'amount' in key else 24.0
    
    return features

# Connect to Redis feature store
r = redis.Redis(host='localhost', port=6379, decode_responses=True)

# Assume `batch_transactions` is a DataFrame of new transactions since last run
# This would typically be triggered by a Kafka event or a scheduled job
for user_id in batch_transactions['user_id'].unique():
    user_features = compute_user_velocity_features(user_id, batch_transactions)
    # Store with a TTL slightly longer than your feature freshness requirement
    r.hset(f"user_features:{user_id}", mapping=user_features)
    r.expire(f"user_features:{user_id}", 300)  # 5-minute TTL for freshness

Redis Schema for Transaction Features: Velocity, History, and Graph Features

Throwing JSON blobs into Redis keys is a start, but it's how you structure them that determines whether you hit 2ms or 20ms. Use Redis Hashes (HSET) for feature vectors—they're memory efficient and allow partial updates. Your key schema must be intuitive and partitionable.

For a transaction scoring event, you need to fetch features for multiple entities: the user, the card, the device, and the merchant. Each lookup must be a single O(1) operation.

Recommended Schema:

  • user_features:{user_id}: Hash containing tx_count_24h, total_amount_24h, avg_amount_7d, chargeback_90d_flag.
  • card_features:{card_hash}: Hash containing card_velocity_1h, country_change_flag, card_age_days.
  • device_graph:{device_fingerprint}: Set containing associated_user_ids (last 10). Cardinality of this set is a powerful graph feature ("device sharing score").
  • model_metadata:current: String holding the model version or A/B test cohort. This allows for seamless model hot-swapping.

The device graph feature is a classic example. Checking SCARD device_graph:{fingerprint} tells you how many unique users have transacted on this device recently. A number >2 is a massive red flag. Computing this on-demand requires a costly GROUP BY and COUNT DISTINCT. Pre-computed as a Redis Set cardinality, it's a sub-millisecond lookup.
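
On the scoring path, the lookups above collapse into a single pipelined round trip. Here is a minimal sketch following the schema above; the feature-order list and helper names are illustrative, and the Redis client is passed in rather than constructed, so anything duck-typed to redis-py works:

```python
def fetch_feature_vector(r, user_id, card_hash, device_fp):
    """Fetch user, card, and device-graph features in one pipelined
    round trip. `r` is a redis-py client (decode_responses=True assumed)."""
    pipe = r.pipeline(transaction=False)
    pipe.hgetall(f"user_features:{user_id}")
    pipe.hgetall(f"card_features:{card_hash}")
    pipe.scard(f"device_graph:{device_fp}")  # device-sharing score
    user_f, card_f, device_share = pipe.execute()
    return user_f, card_f, device_share


def assemble_vector(user_f, card_f, device_share, feature_order):
    """Flatten entity features into the ordered vector the model expects.
    Missing features (cold-start users or cards) default to 0.0."""
    merged = {**user_f, **card_f, 'device_share_score': device_share}
    return [float(merged.get(name, 0.0)) for name in feature_order]
```

With decode_responses=True, hash values come back as strings, which is why assemble_vector coerces everything to float before handing the vector to the model.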

XGBoost Model Training on Imbalanced Data: SMOTE and Class Weighting

Fraud data is imbalanced. You might have 0.1% fraud. Training XGBoost naively on this leads to a model that predicts "not fraud" 99.9% of the time and achieves wonderful, useless accuracy. You need to force the model to look at the fraudsters. XGBoost wins on tabular data with AUC 0.97 vs 0.94 for deep learning (Kaggle benchmark), but only if you handle the imbalance.

Two primary methods:

  1. Class Weighting (scale_pos_weight): Tell XGBoost that a fraud instance is, say, 999 times more important than a non-fraud instance. This is simple and built-in.
  2. Synthetic Data Generation (SMOTE): Create synthetic fraud examples by interpolating between real fraud cases. SMOTE oversampling improves minority class recall from 67% to 84% on imbalanced fraud datasets.

Use both. Start with scale_pos_weight and if recall on the fraud class is still poor, apply SMOTE to the training set only (never the validation/test set, or you'll create a fantasy world).

import xgboost as xgb
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report, roc_auc_score
import pandas as pd
import numpy as np

# Load your feature matrix `X` and labels `y` (1=fraud)
# Assume `X` is a DataFrame of features from your feature store

# 1. Split data BEFORE applying SMOTE to avoid data leakage
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# 2. Apply SMOTE only to the training data
smote = SMOTE(sampling_strategy=0.1, random_state=42)  # Boost fraud class to 10% of majority
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print(f"Class distribution after SMOTE: {pd.Series(y_train_resampled).value_counts().to_dict()}")

# 3. Calculate scale_pos_weight (recommended: sqrt of imbalance ratio is a good start)
fraud_ratio = len(y_train[y_train==1]) / len(y_train[y_train==0])
scale_pos_weight = round(np.sqrt(1 / fraud_ratio))  # e.g., ~32 for 0.1% fraud

# 4. Train XGBoost
dtrain = xgb.DMatrix(X_train_resampled, label=y_train_resampled)
dval = xgb.DMatrix(X_val, label=y_val)

params = {
    'objective': 'binary:logistic',
    'eval_metric': 'aucpr',  # Use AUC-PR for imbalanced data, not AUC-ROC
    'max_depth': 6,
    'learning_rate': 0.05,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'scale_pos_weight': scale_pos_weight,
    'seed': 42
}

evals = [(dtrain, 'train'), (dval, 'eval')]
model = xgb.train(params, dtrain, num_boost_round=500, evals=evals, early_stopping_rounds=20, verbose_eval=50)

# 5. Evaluate
y_pred_proba = model.predict(dval)
y_pred = (y_pred_proba > 0.5).astype(int)  # Adjust threshold based on business cost
print(classification_report(y_val, y_pred))
from sklearn.metrics import average_precision_score  # AUC-PR, to match the aucpr eval metric
print(f"Validation AUC-PR: {average_precision_score(y_val, y_pred_proba):.4f}")

Real Error Message & Fix:

Error: Model drift detected: AUC dropped from 0.97 to 0.89
Fix: Retrain with the last 30 days of data, and add concept drift monitoring using a library like alibi-detect to trigger retraining automatically when feature distributions shift beyond a threshold.
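
alibi-detect gives you this out of the box; the underlying idea is just a two-sample test between a feature's training-time distribution and its live distribution. A hand-rolled sketch using the Kolmogorov-Smirnov statistic (the 0.1 threshold is an illustrative starting point, not a universal constant):

```python
import bisect

def ks_statistic(reference, current):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    d = 0.0
    for x in ref + cur:
        d = max(d, abs(bisect.bisect_right(ref, x) / n
                       - bisect.bisect_right(cur, x) / m))
    return d

def should_retrain(reference, live, threshold=0.1):
    """Flag retraining when a feature's live distribution has drifted."""
    return ks_statistic(reference, live) > threshold
```

Run this per feature over a sliding window of live transactions; when should_retrain fires, kick off the retraining job and alert, rather than waiting for the AUC drop to show up in chargebacks.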

Streaming Feature Updates with Kafka: Real-Time Feature Freshness

A feature store with stale data is a fraudster's best friend. If your 24-hour velocity counter only updates hourly, a rapid burst of fraudulent transactions in the 59th minute goes unnoticed. You need streaming updates.

The pattern: Transaction events publish to a Kafka topic raw_transactions. A Spark Streaming or Faust (Python) application consumes this stream, computes the incremental update to features (e.g., tx_count_24h increments by 1, total_amount_24h adds this transaction's amount), and issues an HINCRBY command to the Redis hash. For windowed features, you also need to expire old contributions. This is often done with a second, time-ordered stream and periodic compaction.

The goal is feature freshness under 5 seconds. This ensures your model scores based on the transaction that just happened, not the world as it was a minute ago.
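
Here is a sketch of the consumer side of this pattern. The incremental update logic is separated from the Kafka loop so it can be tested without a broker; the topic name, the event fields, and the use of the kafka-python client are assumptions, and the sorted set is the bookkeeping for the window compaction described above:

```python
import json

def apply_transaction(r, event):
    """Incrementally fold one raw_transactions event into the feature
    hashes. `r` is a redis-py client; `event` needs user_id, amount, ts."""
    user_key = f"user_features:{event['user_id']}"
    pipe = r.pipeline(transaction=False)
    pipe.hincrby(user_key, 'tx_count_24h', 1)
    pipe.hincrbyfloat(user_key, 'total_amount_24h', float(event['amount']))
    # Record each contribution (score = event timestamp) so a periodic
    # compaction job can subtract amounts older than 24h.
    pipe.zadd(f"user_tx_window:{event['user_id']}",
              {f"{event['ts']}:{event['amount']}": event['ts']})
    pipe.execute()

def run_consumer(bootstrap='localhost:9092'):
    """Consume raw_transactions and fold each event into Redis.
    Assumes the kafka-python client and a local Redis instance."""
    import redis
    from kafka import KafkaConsumer
    r = redis.Redis(host='localhost', port=6379, decode_responses=True)
    for msg in KafkaConsumer('raw_transactions', bootstrap_servers=bootstrap):
        apply_transaction(r, json.loads(msg.value))
```

Because HINCRBY and HINCRBYFLOAT are atomic, concurrent consumers can update the same hash safely; idempotence on redelivery still needs a dedup key per event.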

A/B Testing New Rules in Shadow Mode Before Full Deployment

Your new graph feature is brilliant. Your model's AUC improved. But deploying it straight to production is how you cause a false-positive spike. A false positive rate of 0.1% on 1M daily transactions equals 1,000 wrongly declined legitimate purchases per day: a customer service and revenue disaster.

Shadow Mode Deployment:

  1. Dual-Write: In your scoring service, run both the new model (with the new feature) and the current champion model. Store both predictions.
  2. Log, Don't Act: Only use the champion model's score to make the real decision (approve/decline). Log the new model's score and features to an analytics database.
  3. Analyze Discrepancies: After a week, analyze the cases where the models disagreed. Did the new model catch fraud the old one missed (true positive)? Did it flag good customers (false positive)? Calculate the new model's business metrics as if it had been live.
  4. Promote to Champion: Only if the new model shows a net improvement (factoring in the cost of false positives) do you switch the traffic.

This is controlled risk-taking. It's also essential for regulatory compliance. You can demonstrate due diligence in model changes.
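
The dual-write step reduces to a few lines in the scoring service. A minimal sketch, where the 0.5 threshold and the predict interface are placeholders for your real model wrappers; the key property is that the challenger path can never affect, or break, the live decision:

```python
import time

def score_transaction(features, champion, challenger, shadow_log):
    """Decide with the champion; log the challenger for offline analysis."""
    champion_score = champion.predict(features)
    decision = 'decline' if champion_score > 0.5 else 'approve'
    try:
        # Shadow path: logged, never acted on.
        shadow_log.append({
            'ts': time.time(),
            'champion_score': champion_score,
            'challenger_score': challenger.predict(features),
            'decision': decision,
        })
    except Exception:
        pass  # a broken challenger must never break production scoring
    return decision
```

In production, shadow_log would be an async writer to your analytics store; swallowing challenger exceptions is deliberate, since a crash in the experimental path must not decline a real customer.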

Real Error Message & Fix:

Error: PCI-DSS violation: card numbers in model training data
Fix: Tokenize the PAN (Primary Account Number) before feature extraction using a vault tokenization service. Never let raw PAN enter your feature engineering pipeline. Use salted hash-based tokens for joining datasets if necessary.

Benchmark: Feature Store vs On-Demand Computation

The theoretical argument is over. Here are the measured latencies that determine if you meet the SLA or break it. This table compares the two architectural approaches for a single fraud scoring request requiring 25 features, including 5 rolling-window aggregates and 1 graph feature.

| Operation | On-Demand Computation (PostgreSQL) | Redis Feature Store | Notes |
|---|---|---|---|
| Feature Retrieval (p99) | 45 ms | <2 ms | Includes 5x LATERAL JOIN window queries & 1 graph query. |
| XGBoost Inference | 0.8 ms | 0.8 ms | Model is constant; using libxgboost with DMatrix from array. |
| Network + Serialization | ~5 ms | ~5 ms | Assumes scoring service co-located with data store. |
| Total Scoring Latency | ~51 ms | <8 ms | Feature store is 6x faster, consuming only 15% of latency budget. |
| System Load at 1k TPS | DB CPU >80% (bottleneck) | Redis CPU <20% | On-demand design requires massive DB scaling. |
| Data Freshness | Perfect | ~1-5 second lag | Async stream processing introduces minimal, managed lag. |

The conclusion is inescapable. The on-demand approach uses the entire latency budget on feature retrieval alone, leaving no room for error and forcing expensive database over-provisioning. The feature store architecture reduces the critical path to a few key-value lookups, making the 100ms SLA comfortable and scalable.

Next Steps: Building Your Fraud Pipeline

Start by instrumenting your current scoring endpoint. Measure exactly where the milliseconds are going. You'll likely find the database is the culprit. Then, implement incrementally:

  1. Phase 1: Cache the Heavy Features. Pick your two most expensive features (e.g., 30-day velocity, device graph). Compute them in a nightly job and store them in Redis. Modify your scoring service to check Redis first, fall back to SQL. You'll see an immediate latency drop.
  2. Phase 2: Build the Async Stream. Introduce Kafka. Shift the computation of all features from the request path to a streaming consumer. Ensure idempotent writes to Redis.
  3. Phase 3: Implement Shadow Testing. Before you add any new feature or model from this point on, test it in shadow mode. Build the analytics dashboard to compare champion/challenger performance.
  4. Phase 4: Automate Retraining & Monitoring. Use MLflow to track experiments and manage model staging. Set up Great Expectations data quality checks on your feature pipeline. Automate retraining when drift is detected.
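
For Phase 1's "measure where the milliseconds are going", a tiny stage timer is enough to start. A sketch (stage names are whatever you choose; nearest-rank p99 keeps it dependency-free):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_ms = defaultdict(list)

@contextmanager
def timed(stage):
    """Record wall-clock milliseconds for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_ms[stage].append((time.perf_counter() - start) * 1000.0)

def p99(samples):
    """Nearest-rank p99 over the recorded samples."""
    ranked = sorted(samples)
    return ranked[min(len(ranked) - 1, int(len(ranked) * 0.99))]
```

Wrap each stage of the scoring endpoint in `with timed('feature_fetch'):` and log `p99(stage_ms['feature_fetch'])` periodically; whichever stage dominates is where your first cache belongs.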

Your goal isn't just to beat 100ms. It's to build a system where latency is predictable, accuracy is high and measurable, and changes can be made without sweating. The fraudsters aren't waiting, and neither is the customer at the checkout. Your architecture needs to ensure neither of them has time to notice you're there.