Set Up Content Moderation APIs for User Prompts in 20 Minutes

Step-by-step guide to integrating content moderation APIs to filter user prompts before they reach your AI model. Covers setup, scoring, and fallback logic.

Problem: Users Are Sending Harmful Prompts to Your AI

You've shipped an AI-powered feature and users are finding ways to abuse it — jailbreaks, hate speech, PII, prompt injections. You need a moderation layer before prompts hit your model.

You'll learn:

  • How to integrate a content moderation API (OpenAI Moderation, AWS Comprehend, or Perspective API)
  • How to score prompts and block or flag them
  • How to build a production-ready pipeline with fallback logic

Time: 20 min | Level: Intermediate


Why This Happens

LLMs have no native guardrails at the input layer. Without moderation, any user input reaches your model — costing tokens, producing harmful outputs, and creating liability. Moderation APIs run fast (typically 50–150 ms) and catch the majority of abuse before it costs you.

Common symptoms:

  • Model returns policy-violating content
  • Users exploiting system prompts via injection
  • Compliance or legal team flagging AI outputs

Solution

Step 1: Pick Your Moderation API

Choose based on your stack and threat model:

API                  Best For                             Free Tier
OpenAI Moderation    Text toxicity, harassment            Yes (rate limited)
AWS Comprehend       PII detection, multilingual          50K units/month
Perspective API      Comment toxicity, civil discourse    Yes (application required)

This guide uses OpenAI Moderation — zero setup if you already use the OpenAI API, and it covers the most common abuse categories.


Step 2: Install Dependencies

pip install "openai>=1.0.0" python-dotenv

Create a .env file:

OPENAI_API_KEY=sk-...
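Before wiring up the SDK, it's worth confirming the key actually loads. A minimal stdlib-only sketch of what python-dotenv does (the `load_env_file` helper is introduced here for illustration; use `load_dotenv()` in the real code):

```python
# env_check.py — sanity-check the .env file without importing the SDK.
# python-dotenv does this and more in production; this is just a quick probe.
import pathlib


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines, skipping comments and blanks."""
    values = {}
    p = pathlib.Path(path)
    if p.exists():
        for line in p.read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                values[key.strip()] = value.strip()
    return values


if __name__ == "__main__":
    env = load_env_file()
    print("OPENAI_API_KEY set:", "OPENAI_API_KEY" in env)
```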

Step 3: Build the Moderation Client

# moderation.py
import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Threshold above which we block the request
BLOCK_THRESHOLD = 0.80

# Categories you want to enforce — tune to your use case
ENFORCED_CATEGORIES = [
    "harassment",
    "harassment/threatening",
    "hate",
    "hate/threatening",
    "self-harm",
    "sexual/minors",
    "violence",
    "violence/graphic",
]


def moderate_prompt(prompt: str) -> dict:
    """
    Returns a dict with:
      - blocked (bool): True if the prompt should be rejected
      - reason (str | None): Category that triggered the block
      - scores (dict): Raw category scores for logging
    """
    response = client.moderations.create(input=prompt)
    result = response.results[0]

    scores = result.category_scores.model_dump()
    flagged_categories = result.categories.model_dump()

    # Check if any enforced category exceeds threshold
    for category in ENFORCED_CATEGORIES:
        score = scores.get(category, 0)
        if flagged_categories.get(category) or score >= BLOCK_THRESHOLD:
            return {
                "blocked": True,
                "reason": category,
                "scores": scores,
            }

    return {"blocked": False, "reason": None, "scores": scores}

Expected: Function returns a clean dict — no exceptions thrown for moderation decisions.

If it fails:

  • AuthenticationError: Check your OPENAI_API_KEY in .env
  • RateLimitError: Add retry logic (see Step 5)
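The blocking decision itself is pure logic, so you can unit-test it offline by extracting it from the API call. A sketch (`evaluate_scores` is a name introduced here, not part of the SDK or the code above; categories abridged):

```python
# The decision rule from moderate_prompt(), separated from the network call
# so it can be tested without an API key.
BLOCK_THRESHOLD = 0.80
ENFORCED_CATEGORIES = ["harassment", "hate", "violence"]  # abridged for the sketch


def evaluate_scores(scores: dict, flagged: dict) -> dict:
    for category in ENFORCED_CATEGORIES:
        if flagged.get(category) or scores.get(category, 0) >= BLOCK_THRESHOLD:
            return {"blocked": True, "reason": category, "scores": scores}
    return {"blocked": False, "reason": None, "scores": scores}


# A high score trips the block even when the API's boolean flag is False
print(evaluate_scores({"violence": 0.95}, {}))
# {'blocked': True, 'reason': 'violence', 'scores': {'violence': 0.95}}
```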

Step 4: Wire It Into Your Request Pipeline

# main.py
from moderation import moderate_prompt

def handle_user_request(user_prompt: str) -> str:
    # Always moderate before sending to your model
    mod_result = moderate_prompt(user_prompt)

    if mod_result["blocked"]:
        # Log the attempt for auditing — don't expose reason to user
        log_moderation_event(user_prompt, mod_result)
        return "I can't help with that request."

    # Safe to forward to your model
    return call_your_llm(user_prompt)


def log_moderation_event(prompt: str, result: dict):
    # Replace with your logging setup (Datadog, CloudWatch, etc.)
    print(f"[MODERATION BLOCK] reason={result['reason']} scores={result['scores']}")

Why we don't expose the reason: Telling users exactly which category triggered the block helps them refine their bypass attempts.
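The print stub above is fine for development; in production, stdlib logging with a JSON payload is a small step up and drops into most log pipelines. A sketch (returning the record string here is only for testability):

```python
import json
import logging

logger = logging.getLogger("moderation")
logging.basicConfig(level=logging.INFO)


def log_moderation_event(prompt: str, result: dict) -> str:
    # One JSON line per block. Truncate the prompt so raw user input
    # (and any PII in it) doesn't land verbatim in your log store.
    record = json.dumps({
        "event": "moderation_block",
        "reason": result["reason"],
        "scores": result["scores"],
        "prompt_preview": prompt[:80],
    })
    logger.info(record)
    return record
```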


Step 5: Add Retry Logic for Production

The moderation API can fail transiently. Don't let API errors block all user requests.

# moderation.py — updated with retries
import time
from openai import OpenAI, APIError, RateLimitError

MAX_RETRIES = 3
RETRY_DELAY = 0.5  # seconds


def moderate_prompt(prompt: str) -> dict:
    for attempt in range(MAX_RETRIES):
        try:
            response = client.moderations.create(input=prompt)
            result = response.results[0]
            scores = result.category_scores.model_dump()
            flagged = result.categories.model_dump()

            for category in ENFORCED_CATEGORIES:
                score = scores.get(category, 0)
                if flagged.get(category) or score >= BLOCK_THRESHOLD:
                    return {"blocked": True, "reason": category, "scores": scores}

            return {"blocked": False, "reason": None, "scores": scores}

        except RateLimitError:
            if attempt < MAX_RETRIES - 1:
                time.sleep(RETRY_DELAY * 2 ** attempt)  # Exponential backoff: 0.5s, 1s
            else:
                # Fail open — don't block the user when moderation is down
                return {"blocked": False, "reason": "moderation_unavailable", "scores": {}}

        except APIError as e:
            # Unexpected error — fail open and alert your on-call
            print(f"[MODERATION ERROR] {e}")
            return {"blocked": False, "reason": "moderation_error", "scores": {}}

Fail open vs. fail closed: Failing open (allowing requests through when moderation is unavailable) keeps your product working. Fail closed if you're in a high-risk domain (healthcare, finance, minors).
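If you are in one of those high-risk domains, the two fail-open branches above can be routed through a single policy flag instead of hardcoding `"blocked": False`. A sketch (`FAIL_CLOSED` and `fallback_result` are assumptions introduced here, not part of the code above):

```python
FAIL_CLOSED = True  # set per deployment; True rejects requests while moderation is down


def fallback_result(reason: str) -> dict:
    # Shared fallback for the except branches in moderate_prompt():
    # fail closed blocks the request, fail open waves it through.
    return {"blocked": FAIL_CLOSED, "reason": reason, "scores": {}}


print(fallback_result("moderation_unavailable"))
# {'blocked': True, 'reason': 'moderation_unavailable', 'scores': {}}
```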


Verification

python - <<EOF
from moderation import moderate_prompt

# Should be blocked
bad = moderate_prompt("I want to hurt someone")
print("Bad prompt blocked:", bad["blocked"])  # True

# Should pass
good = moderate_prompt("How do I sort a list in Python?")
print("Good prompt blocked:", good["blocked"])  # False
EOF

You should see:

Bad prompt blocked: True
Good prompt blocked: False

What You Learned

  • The OpenAI Moderation API is a fast, low-cost first line of defense
  • Always log blocked requests — the data reveals abuse patterns over time
  • Fail open on API errors unless your use case demands otherwise
  • Never expose block reasons to users; it teaches them to evade the filter

Limitation: Moderation APIs catch known abuse patterns but miss novel jailbreaks and context-dependent harms. Layer this with output moderation and system prompt hardening for full coverage.
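Output moderation can reuse the same pipeline: run the model's reply through the moderation check before returning it. A sketch with the model and moderator injected as callables (both stubs below are for illustration only, not real moderation):

```python
def moderated_reply(user_prompt: str, llm, moderate) -> str:
    # Check the input, call the model, then check the output with the same function.
    if moderate(user_prompt)["blocked"]:
        return "I can't help with that request."
    reply = llm(user_prompt)
    if moderate(reply)["blocked"]:
        return "I can't share that response."
    return reply


# Stub demo: the fake moderator blocks any text containing "attack"
fake_moderate = lambda text: {"blocked": "attack" in text}
fake_llm = lambda prompt: "step one of the attack plan"
print(moderated_reply("tell me a story", fake_llm, fake_moderate))
# I can't share that response.
```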

When NOT to use this: If your app processes non-English content heavily, benchmark Perspective API or AWS Comprehend first — OpenAI Moderation is primarily English-tuned.


Tested on Python 3.12, openai 1.16.0, Ubuntu 24.04