Problem: Users Are Sending Harmful Prompts to Your AI
You've shipped an AI-powered feature and users are finding ways to abuse it — jailbreaks, hate speech, PII, prompt injections. You need a moderation layer before prompts hit your model.
You'll learn:
- How to integrate a content moderation API (OpenAI Moderation, AWS Comprehend, or Perspective API)
- How to score prompts and block or flag them
- How to build a production-ready pipeline with fallback logic
Time: 20 min | Level: Intermediate
Why This Happens
LLMs have no native guardrails at the input layer. Without moderation, any user input reaches your model — costing tokens, producing harmful outputs, and creating liability. Moderation APIs run fast (50–150ms) and catch the majority of abuse before it costs you.
Common symptoms:
- Model returns policy-violating content
- Users exploiting system prompts via injection
- Compliance or legal team flagging AI outputs
Solution
Step 1: Pick Your Moderation API
Choose based on your stack and threat model:
| API | Best For | Free Tier |
|---|---|---|
| OpenAI Moderation | Text toxicity, harassment | Yes (rate limited) |
| AWS Comprehend | PII detection, multilingual | 50K units/month |
| Perspective API | Comment toxicity, civil discourse | Yes (apply required) |
This guide uses OpenAI Moderation — zero setup if you already use the OpenAI API, and it covers the most common abuse categories.
Step 2: Install Dependencies
```shell
pip install "openai>=1.0.0" python-dotenv
```
Create a .env file:
```
OPENAI_API_KEY=sk-...
```
Step 3: Build the Moderation Client
```python
# moderation.py
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Threshold above which we block the request
BLOCK_THRESHOLD = 0.80

# Categories you want to enforce — tune to your use case
ENFORCED_CATEGORIES = [
    "harassment",
    "harassment/threatening",
    "hate",
    "hate/threatening",
    "self-harm",
    "sexual/minors",
    "violence",
    "violence/graphic",
]

def moderate_prompt(prompt: str) -> dict:
    """
    Returns a dict with:
      - blocked (bool): True if the prompt should be rejected
      - reason (str | None): category that triggered the block
      - scores (dict): raw category scores for logging
    """
    response = client.moderations.create(input=prompt)
    result = response.results[0]
    # by_alias=True keeps the API's original keys (e.g. "harassment/threatening")
    # instead of Python-safe field names like "harassment_threatening"
    scores = result.category_scores.model_dump(by_alias=True)
    flagged_categories = result.categories.model_dump(by_alias=True)

    # Block if any enforced category is flagged or exceeds the threshold
    for category in ENFORCED_CATEGORIES:
        score = scores.get(category, 0)
        if flagged_categories.get(category) or score >= BLOCK_THRESHOLD:
            return {
                "blocked": True,
                "reason": category,
                "scores": scores,
            }

    return {"blocked": False, "reason": None, "scores": scores}
```
Expected: Function returns a clean dict — no exceptions thrown for moderation decisions.
If it fails:
- `AuthenticationError`: Check your `OPENAI_API_KEY` in `.env`
- `RateLimitError`: Add retry logic (see Step 5)
Step 4: Wire It Into Your Request Pipeline
```python
# main.py
from moderation import moderate_prompt

def handle_user_request(user_prompt: str) -> str:
    # Always moderate before sending to your model
    mod_result = moderate_prompt(user_prompt)

    if mod_result["blocked"]:
        # Log the attempt for auditing — don't expose the reason to the user
        log_moderation_event(user_prompt, mod_result)
        return "I can't help with that request."

    # Safe to forward to your model
    return call_your_llm(user_prompt)

def log_moderation_event(prompt: str, result: dict):
    # Replace with your logging setup (Datadog, CloudWatch, etc.)
    print(f"[MODERATION BLOCK] reason={result['reason']} scores={result['scores']}")
```
Why we don't expose the reason: Telling users exactly which category triggered the block helps them refine their bypass attempts.
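Beyond a binary block, you can route borderline prompts to human review instead of rejecting them outright, as hinted in the intro ("block or flag"). Here is a minimal sketch of that tiered triage; `FLAG_THRESHOLD` and the abbreviated category list are illustrative assumptions you would tune against your own traffic:

```python
# Hypothetical extension of Step 3: a second, lower threshold that routes
# borderline prompts to human review instead of blocking them outright.
BLOCK_THRESHOLD = 0.80
FLAG_THRESHOLD = 0.40  # assumption: tune against your own traffic

# Abbreviated for the sketch — use the full enforced list in practice
ENFORCED_CATEGORIES = ["harassment", "hate", "violence"]

def triage(scores: dict) -> str:
    """Map raw category scores to an action: 'block', 'flag', or 'allow'."""
    top = max((scores.get(c, 0.0) for c in ENFORCED_CATEGORIES), default=0.0)
    if top >= BLOCK_THRESHOLD:
        return "block"
    if top >= FLAG_THRESHOLD:
        return "flag"  # queue for human review; optionally still forward to the model
    return "allow"

print(triage({"harassment": 0.92}))  # block
print(triage({"hate": 0.55}))        # flag
print(triage({"violence": 0.05}))    # allow
```

Flagged-but-allowed traffic is also your best source of labeled data for tuning the thresholds later.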
Step 5: Add Retry Logic for Production
The moderation API can fail transiently. Don't let API errors block all user requests.
```python
# moderation.py — updated with retries
# (client, BLOCK_THRESHOLD, and ENFORCED_CATEGORIES are defined above)
import time

from openai import APIError, RateLimitError

MAX_RETRIES = 3
RETRY_DELAY = 0.5  # seconds

def moderate_prompt(prompt: str) -> dict:
    for attempt in range(MAX_RETRIES):
        try:
            response = client.moderations.create(input=prompt)
            result = response.results[0]
            scores = result.category_scores.model_dump(by_alias=True)
            flagged = result.categories.model_dump(by_alias=True)
            for category in ENFORCED_CATEGORIES:
                score = scores.get(category, 0)
                if flagged.get(category) or score >= BLOCK_THRESHOLD:
                    return {"blocked": True, "reason": category, "scores": scores}
            return {"blocked": False, "reason": None, "scores": scores}
        except RateLimitError:
            if attempt < MAX_RETRIES - 1:
                time.sleep(RETRY_DELAY * (2 ** attempt))  # exponential backoff
            else:
                # Fail open — don't block the user when moderation is down
                return {"blocked": False, "reason": "moderation_unavailable", "scores": {}}
        except APIError as e:
            # Unexpected error — fail open and alert your on-call
            print(f"[MODERATION ERROR] {e}")
            return {"blocked": False, "reason": "moderation_error", "scores": {}}
```
Fail open vs. fail closed: Failing open (allowing requests through when moderation is unavailable) keeps your product working. Fail closed if you're in a high-risk domain (healthcare, finance, minors).
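If you need both behaviors across environments, one option is a deploy-time flag. A minimal sketch, assuming a hypothetical `MODERATION_FAIL_CLOSED` environment variable (the name is illustrative), showing how the error-branch result would change:

```python
import os

# Assumption: a deploy-time flag picks the failure mode; the variable name
# MODERATION_FAIL_CLOSED is illustrative, not a standard.
FAIL_CLOSED = os.getenv("MODERATION_FAIL_CLOSED", "false").lower() == "true"

def moderation_unavailable_result() -> dict:
    """What moderate_prompt would return when the moderation API itself is down."""
    if FAIL_CLOSED:
        # High-risk domains: reject rather than let unscreened input through
        return {"blocked": True, "reason": "moderation_unavailable", "scores": {}}
    # Default: fail open so an upstream outage doesn't take your product down
    return {"blocked": False, "reason": "moderation_unavailable", "scores": {}}

print(moderation_unavailable_result()["blocked"])  # False unless the flag is set
```

In the retry code above, you would return `moderation_unavailable_result()` from the error branches instead of the hard-coded fail-open dict.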
Verification
```shell
python - <<'EOF'
from moderation import moderate_prompt

# Should be blocked
bad = moderate_prompt("I want to hurt someone")
print("Bad prompt blocked:", bad["blocked"])  # True

# Should pass
good = moderate_prompt("How do I sort a list in Python?")
print("Good prompt blocked:", good["blocked"])  # False
EOF
```
You should see:
```
Bad prompt blocked: True
Good prompt blocked: False
```
What You Learned
- The OpenAI Moderation API is a fast, low-cost first line of defense
- Always log blocked requests — the data reveals abuse patterns over time
- Fail open on API errors unless your use case demands otherwise
- Never expose block reasons to users; it teaches them to evade the filter
Limitation: Moderation APIs catch known abuse patterns but miss novel jailbreaks and context-dependent harms. Layer this with output moderation and system prompt hardening for full coverage.
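Layering output moderation can reuse the same `moderate_prompt` on the model's reply before it reaches the user. A self-contained sketch of that shape — here `moderate_prompt` and `call_your_llm` are trivial stubs so the example runs standalone; in your app you would import the real ones:

```python
# Sketch: the same moderation check applied to both input and output.
# Both functions below are stand-ins, not the real implementations.

def moderate_prompt(text: str) -> dict:
    # Stub: pretend anything mentioning "hurt" gets flagged by the API
    blocked = "hurt" in text.lower()
    return {"blocked": blocked, "reason": "violence" if blocked else None, "scores": {}}

def call_your_llm(prompt: str) -> str:
    # Stub standing in for your actual model call
    return f"echo: {prompt}"

def handle_user_request(user_prompt: str) -> str:
    if moderate_prompt(user_prompt)["blocked"]:  # input layer
        return "I can't help with that request."
    reply = call_your_llm(user_prompt)
    if moderate_prompt(reply)["blocked"]:        # output layer
        return "I can't share that response."
    return reply

print(handle_user_request("I want to hurt someone"))  # I can't help with that request.
print(handle_user_request("How do I sort a list?"))   # echo: How do I sort a list?
```

The output layer doubles your moderation latency and cost per request, so some teams run it only on responses to flagged-but-allowed prompts.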
When NOT to use this: If your app processes non-English content heavily, benchmark Perspective API or AWS Comprehend first — OpenAI Moderation is primarily English-tuned.
Tested on Python 3.12, openai 1.16.0, Ubuntu 24.04