# Problem: Your LLM Can Turn Toxic Without Warning
Your model returns fine 99% of the time. Then a user crafts a clever prompt and your app spits out hate speech, discriminatory content, or confidently biased output — live, in production, in front of real users.
Logging it after the fact isn't enough. You need to catch it as it streams.
You'll learn:
- How to add a real-time toxicity and bias detection layer between your LLM and your users
- How to use open-source classifiers without adding noticeable latency
- How to build a streaming guardrail that blocks, flags, or rewrites bad output
Time: 25 min | Level: Intermediate
## Why This Happens
LLMs don't have a stable "safe mode." They're probability machines — given the right context, even well-tuned models will produce outputs that are toxic, biased toward certain demographics, or confidently wrong in harmful ways.
Common symptoms:
- Hate speech or slurs in edge-case completions
- Outputs that stereotype or demean based on race, gender, or religion
- Jailbreaks that bypass system prompt instructions
- Sentiment or advice that skews negatively toward certain user groups
Fine-tuning and RLHF reduce these — they don't eliminate them. You still need a runtime layer.
## Solution
The architecture here uses a classifier running in parallel with your LLM stream. Output is buffered in sliding windows, scored, and either passed through, flagged, or blocked — all before the final token hits the client.
*Token stream: LLM → buffer → classifier → gateway → client*
### Step 1: Install the Detection Stack
We'll use detoxify for toxicity scoring and transformers for bias classification. Both run locally — no external API calls on the hot path.
```bash
pip install detoxify transformers torch
pip install fastapi uvicorn httpx  # for the gateway
```
Verify GPU availability (optional but recommended for low latency):
```bash
python -c "import torch; print(torch.cuda.is_available())"
```
Expected: `True` if a GPU is available, `False` if CPU-only. CPU works fine for < 50 req/min.
### Step 2: Build the Scoring Module
```python
# guardrail/scorer.py
from detoxify import Detoxify
from transformers import pipeline
import torch


class OutputScorer:
    def __init__(self):
        # Detoxify expects a torch device string; transformers pipelines
        # take a device index (0 = first GPU, -1 = CPU)
        torch_device = "cuda" if torch.cuda.is_available() else "cpu"
        pipeline_device = 0 if torch.cuda.is_available() else -1
        # Detoxify scores: toxicity, severe_toxicity, obscene,
        # identity_attack, insult, threat (the multilingual model
        # adds sexual_explicit)
        self.toxicity_model = Detoxify('multilingual', device=torch_device)
        # Bias classifier — catches demographic stereotyping
        self.bias_model = pipeline(
            "text-classification",
            model="d4data/bias-detection-model",
            device=pipeline_device
        )

    def score(self, text: str) -> dict:
        if len(text.strip()) < 10:
            # Too short to classify reliably
            return {"toxic": 0.0, "biased": 0.0, "flagged": False}

        tox = self.toxicity_model.predict(text)
        bias = self.bias_model(text[:512])[0]  # Truncate to model max

        # Aggregate toxicity score — worst-case across categories
        max_toxic = max(
            tox["toxicity"],
            tox["identity_attack"],
            tox["insult"],
            tox["threat"]
        )
        bias_score = bias["score"] if bias["label"] == "Biased" else 0.0

        return {
            "toxic": round(float(max_toxic), 3),
            "biased": round(bias_score, 3),
            "flagged": max_toxic > 0.7 or bias_score > 0.75,
            # Cast numpy scalars so the dict stays JSON-serializable
            "details": {k: round(float(v), 3) for k, v in tox.items()}
        }
```
Why this works: Scoring each token is wasteful. Scoring complete sentences gives the models enough context to work with. The flagged threshold of 0.7 is deliberately conservative — tune it up if you're getting false positives.
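The aggregation step is worth seeing in isolation: worst-case over the toxicity categories, then two independent threshold checks. A minimal standalone sketch using the thresholds from the module above — the category scores below are invented for illustration, not real model output:

```python
# Standalone sketch of the score-aggregation logic.
TOXICITY_THRESHOLD = 0.7
BIAS_THRESHOLD = 0.75

def aggregate(tox: dict, bias_label: str, bias_score: float) -> dict:
    # Worst case across the categories that matter most
    max_toxic = max(tox["toxicity"], tox["identity_attack"],
                    tox["insult"], tox["threat"])
    # Only count the bias score when the classifier says "Biased"
    biased = bias_score if bias_label == "Biased" else 0.0
    return {
        "toxic": round(max_toxic, 3),
        "biased": round(biased, 3),
        "flagged": max_toxic > TOXICITY_THRESHOLD or biased > BIAS_THRESHOLD,
    }

result = aggregate(
    {"toxicity": 0.91, "identity_attack": 0.42, "insult": 0.88, "threat": 0.05},
    "Non-biased", 0.97,
)
print(result)
# flagged is True: max_toxic (0.91) exceeds 0.7, even though the
# bias label is benign and its score is therefore ignored
```

Note the asymmetry: a high bias score attached to a "Non-biased" label contributes nothing, so the two thresholds never interact.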
### Step 3: Add the Streaming Gateway
This wraps your existing LLM client. It buffers output into sentence chunks and scores each one before forwarding.
```python
# guardrail/gateway.py
import re
from typing import AsyncGenerator

from .scorer import OutputScorer

scorer = OutputScorer()  # Load once at startup

SENTENCE_ENDINGS = re.compile(r'(?<=[.!?])\s+')


async def guarded_stream(
    llm_stream: AsyncGenerator[str, None],
    on_flag=None  # Optional callback: async fn(chunk, scores)
) -> AsyncGenerator[str, None]:
    buffer = ""
    async for token in llm_stream:
        buffer += token
        # Score when we have a complete sentence
        sentences = SENTENCE_ENDINGS.split(buffer)
        # Keep the last incomplete fragment in the buffer
        if len(sentences) > 1:
            to_score = " ".join(sentences[:-1])
            buffer = sentences[-1]
            scores = scorer.score(to_score)
            if scores["flagged"]:
                if on_flag:
                    await on_flag(to_score, scores)
                # Block the flagged chunk — yield nothing
                # You can also yield a placeholder: "[Content removed]"
                continue
            yield to_score + " "

    # Flush remaining buffer
    if buffer.strip():
        scores = scorer.score(buffer)
        if not scores["flagged"]:
            yield buffer
Expected: Flagged sentences are dropped silently. The rest of the stream continues uninterrupted.
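The buffering behavior is easiest to see without the classifier in the loop. This standalone sketch uses the same regex and buffer logic, but swaps the scorer for a stub that flags any sentence containing a marker word — the stub and the token stream are invented for illustration:

```python
import asyncio
import re

SENTENCE_ENDINGS = re.compile(r'(?<=[.!?])\s+')

def stub_score(text: str) -> dict:
    # Stand-in for OutputScorer.score: flag sentences containing "BAD"
    return {"flagged": "BAD" in text}

async def fake_llm_stream():
    # Tokens arrive in fragments, as they would from a real LLM stream
    for token in ["Hello ", "there. ", "This is ", "BAD content. ", "Goodbye", "."]:
        yield token

async def guarded(llm_stream):
    buffer = ""
    async for token in llm_stream:
        buffer += token
        sentences = SENTENCE_ENDINGS.split(buffer)
        if len(sentences) > 1:
            # Complete sentences get scored; the tail stays buffered
            to_score, buffer = " ".join(sentences[:-1]), sentences[-1]
            if not stub_score(to_score)["flagged"]:
                yield to_score + " "
    if buffer.strip() and not stub_score(buffer)["flagged"]:
        yield buffer

async def main():
    return "".join([chunk async for chunk in guarded(fake_llm_stream())])

print(asyncio.run(main()))
# prints: Hello there. Goodbye.
```

The flagged middle sentence is dropped whole; surrounding sentences pass through untouched, which is exactly the behavior the real gateway gives you.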
If it fails:
- High false positive rate: Raise thresholds to 0.82+ or add a domain-specific allowlist
- Latency spikes > 200 ms: Run `scorer.py` on GPU, or switch to `Detoxify('original')` (smaller model)
- Stream breaks mid-word: Increase the buffer before scoring — check for `\n` in addition to sentence endings
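For the last failure mode, one option (an assumption, not the only fix) is to extend the boundary regex so runs of newlines also close a chunk, keeping paragraph breaks from leaving text stuck in the buffer:

```python
import re

# Split on sentence endings OR one-plus newlines, so a paragraph
# break triggers scoring even without terminal punctuation.
CHUNK_BOUNDARIES = re.compile(r'(?<=[.!?])\s+|\n+')

text = "First point.\nSecond line has no period yet"
print(CHUNK_BOUNDARIES.split(text))
# ['First point.', 'Second line has no period yet']
```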
### Step 4: Wire Up Logging and Alerting
Silent blocking is a start, but you need visibility.
```python
# guardrail/logger.py
import json
import time
from pathlib import Path

LOG_PATH = Path("logs/guardrail.jsonl")
LOG_PATH.parent.mkdir(parents=True, exist_ok=True)


async def log_flag(chunk: str, scores: dict, user_id: str = "anonymous"):
    entry = {
        "ts": time.time(),
        "user_id": user_id,
        "chunk": chunk[:200],  # Truncate to limit PII exposure
        "toxic": scores["toxic"],
        "biased": scores["biased"],
        "details": scores.get("details", {})
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")
```
Pipe `guardrail.jsonl` into your existing log aggregator (Datadog, Loki, CloudWatch). Set an alert if `toxic > 0.9` events exceed 5 per hour — that's a signal someone is actively probing your system.
*Flagged events over 24 hours — the spike at 14:00 shows an active probing session*
### Step 5: Add Demographic Bias Auditing (Offline)
Real-time scoring catches acute toxicity. Demographic bias is subtler — you need periodic offline analysis across your full output corpus.
```python
# scripts/bias_audit.py
import json

from guardrail.scorer import OutputScorer

scorer = OutputScorer()


def audit_logs(log_path: str, sample_size: int = 500):
    entries = []
    with open(log_path) as f:
        for line in f:
            entries.append(json.loads(line))

    # Sample the most recent outputs and re-score for bias patterns.
    # Point this at a log of *all* outputs, not just flagged ones,
    # or the averages will be skewed high.
    sample = entries[-sample_size:]
    if not sample:
        return 0.0, []

    bias_scores = [scorer.score(e["chunk"])["biased"] for e in sample]
    avg_bias = sum(bias_scores) / len(bias_scores)
    high_bias = [s for s in bias_scores if s > 0.5]

    print(f"Avg bias score: {avg_bias:.3f}")
    print(f"High-bias outputs (>0.5): {len(high_bias)} / {len(sample)}")
    if avg_bias > 0.15:
        print("⚠️ Average bias elevated — review prompt templates and system instructions")

    return avg_bias, high_bias


if __name__ == "__main__":
    import sys
    audit_logs(sys.argv[1])
```
Run this weekly via cron. If average bias trends up, your system prompt or few-shot examples may be the source — not just edge-case user inputs.
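A crontab entry for the weekly run might look like this — the paths, virtualenv location, and schedule are assumptions; adjust for your deployment:

```shell
# Mondays at 06:00 — audit last week's output for bias drift
0 6 * * 1 cd /opt/myapp && .venv/bin/python scripts/bias_audit.py logs/guardrail.jsonl >> logs/bias_audit.log 2>&1
```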
## Verification
```bash
# Run a quick smoke test
python -c "
from guardrail.scorer import OutputScorer
s = OutputScorer()
print(s.score('I hate all people from that country'))
print(s.score('Here is how to bake a chocolate cake'))
"
```
You should see:
- First call: `flagged: True`, `toxic` > 0.7
- Second call: `flagged: False`, scores near 0.0
```bash
# Test the streaming gateway
python -m pytest tests/test_gateway.py -v
```
## What You Learned
- A sentence-buffered streaming guardrail adds 20–80 ms of latency overhead — acceptable for most chat UIs
- `detoxify` covers acute toxicity well; demographic bias needs a second model like `d4data/bias-detection-model`
- Thresholds matter more than model choice — start conservative, tune based on your domain
- Real-time blocking + offline auditing cover different threat surfaces — you need both
Limitation: These classifiers are English-first. Multilingual content needs `Detoxify('multilingual')` and separate bias models per language. Coverage drops significantly for low-resource languages.
When NOT to use this: If you're running a medical, legal, or crisis support application, this layer is necessary but not sufficient. You'll also need human review queues and domain-specific classifiers.
Tested on Python 3.12, PyTorch 2.3, detoxify 0.5.2, transformers 4.40. Works with any async LLM client (OpenAI, Anthropic, local via Ollama).