Problem: Your AI Safety Filter Is Blocking Legitimate Requests
You deployed a safety filter in front of your LLM, and now it's refusing valid user queries — medical questions flagged as self-harm, coding help flagged as malware instructions, customer support messages flagged as threats.
You'll learn:
- How to identify which classifier layer is triggering false positives
- How to tune confidence thresholds without weakening safety
- How to inject context so the filter makes better decisions
Time: 20 min | Level: Intermediate
Why This Happens
Most AI safety filters combine keyword matching, embedding-based classifiers, and sometimes a secondary LLM judge. False positives happen when one layer lacks context — a phrase like "kill the process" triggers violence classifiers, or "how do I get rid of someone" reads as a threat without knowing it's an HR question.
Common symptoms:
- Legitimate queries blocked with a generic "I can't help with that" response
- Error logs showing safety_score > threshold for obviously benign input
- Inconsistent behavior — same query blocks sometimes but not others
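To see how a single layer misfires, here's a minimal sketch of a keyword layer flagging "kill the process" as violence. The word list and scoring rule are illustrative only, not a real filter:

```python
# Illustrative sketch: a naive keyword layer scoring a benign query.
# The keyword set and saturation rule are made up for demonstration.
VIOLENCE_KEYWORDS = {"kill", "attack", "destroy", "threat"}

def keyword_violence_score(text: str) -> float:
    words = text.lower().split()
    hits = sum(1 for w in words if w.strip(".,?!") in VIOLENCE_KEYWORDS)
    return min(1.0, hits / 2)  # crude: two keyword hits saturate the score

# A benign support question still scores 0.5 because of "kill":
print(keyword_violence_score("How do I kill the background process?"))  # 0.5
```

Without the surrounding context, this layer has no way to tell a technical question from a threat, which is exactly the gap Steps 1–3 close.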
Solution
Step 1: Identify Which Layer Is Triggering
Before tuning anything, find the actual source of the block. Most safety pipelines expose a reason code or score breakdown in their response metadata.
import anthropic

client = anthropic.Anthropic()

your_query = "How do I kill the background process that's hanging my app?"  # example input

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": your_query}],
)

# Check stop reason and log the full response for analysis
print(response.stop_reason)  # e.g., "end_turn" vs "max_tokens"
print(response.model_dump_json(indent=2))
If you're running your own classifier stack (e.g., a fine-tuned BERT classifier + LLM judge), log the score from each layer separately:
def classify_with_debug(text: str) -> dict:
    keyword_score = keyword_classifier.score(text)
    embedding_score = embedding_classifier.score(text)
    # Log before combining — this is where false positives hide
    print(f"keyword={keyword_score:.3f}, embedding={embedding_score:.3f}")
    combined = max(keyword_score, embedding_score)  # conservative: take max
    return {"score": combined, "blocked": combined > THRESHOLD}
Expected: You'll see one layer scoring high while the other is near zero. That tells you exactly what to fix.
If it fails:
- No metadata returned: Your filter may be swallowing scores — add explicit logging before the blocking condition
- Both scores high: The query genuinely resembles harmful content in your training data; move to Step 3
Step 2: Tune Thresholds Per Category
A single global threshold causes most false positives. Different harm categories need different sensitivity levels.
# Instead of one global threshold
THRESHOLD = 0.75 # too blunt
# Use per-category thresholds based on your false positive analysis
THRESHOLDS = {
    "violence": 0.85,          # High FP rate for action game queries
    "self_harm": 0.70,         # Keep sensitive — lower threshold acceptable
    "hate_speech": 0.80,
    "illegal_activity": 0.75,
    "medical_advice": 0.90,    # Medical questions often flagged incorrectly
}

def should_block(category: str, score: float) -> bool:
    threshold = THRESHOLDS.get(category, 0.75)
    return score > threshold
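A quick usage sketch shows the difference in practice. The thresholds below repeat two entries from the table above so the snippet is self-contained:

```python
# Condensed copy of the per-category table above, for a runnable demo
THRESHOLDS = {"violence": 0.85, "self_harm": 0.70}

def should_block(category: str, score: float) -> bool:
    return score > THRESHOLDS.get(category, 0.75)

# The same 0.80 violence score a global 0.75 threshold would have
# blocked now passes under the per-category 0.85 cutoff:
print(should_block("violence", 0.80))     # False
print(should_block("self_harm", 0.72))    # True: stricter category
print(should_block("unknown_cat", 0.80))  # True: falls back to the 0.75 default
```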
To find the right threshold for each category, plot your FP rate against the threshold using a labeled validation set:
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
# precision_recall_curve needs true labels and predicted scores
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
# Pick the lowest threshold where precision meets your target while
# recall (true positive rate) stays acceptable
for p, r, t in zip(precision, recall, thresholds):
    if p >= 0.95:  # at most 5% of blocked requests are false positives
        print(f"Threshold: {t:.3f} | Precision: {p:.3f} | Recall: {r:.3f}")
        break
Expected: You'll find that loosening the medical/coding categories by 0.05–0.10 eliminates the bulk of false positives with minimal safety loss.
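The same search can be run per category without sklearn. This sketch sweeps candidate thresholds and returns the lowest one meeting a precision target; the scores are synthetic, for illustration only:

```python
def pick_threshold(y_true, y_scores, min_precision=0.95):
    """Return the lowest threshold whose precision meets the target.

    y_true: 1 = genuinely harmful, 0 = benign; y_scores: classifier scores.
    """
    for t in sorted(set(y_scores)):
        flagged = [(yt, ys) for yt, ys in zip(y_true, y_scores) if ys >= t]
        if not flagged:
            break
        precision = sum(yt for yt, _ in flagged) / len(flagged)
        if precision >= min_precision:
            return t
    return None  # no threshold meets the target

# Synthetic example: benign queries cluster low, harmful cluster high
y_true   = [0, 0, 0, 0, 1, 1, 1, 1]
y_scores = [0.1, 0.3, 0.5, 0.7, 0.72, 0.8, 0.9, 0.95]
print(pick_threshold(y_true, y_scores, min_precision=1.0))  # 0.72
```

Run it once per harm category on that category's labeled examples to populate the THRESHOLDS table.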
Step 3: Inject System Context Before Classification
The highest-leverage fix is giving your classifier the same context the LLM has. A bare user message lacks intent signals — the system prompt often resolves ambiguity.
def classify_with_context(system_prompt: str, user_message: str) -> dict:
    # Classify the full conversation context, not just the message
    context_for_classifier = f"""
System context: {system_prompt}
User message: {user_message}
"""
    return classifier.score(context_for_classifier)
# Example: customer support bot
system_prompt = "You are a customer support agent for Acme Software. Help users with technical issues."
user_message = "How do I kill the background process that's hanging my app?"
# Without context: "kill" + "process" might score high for violence
# With context: classifier understands this is a technical support scenario
score = classify_with_context(system_prompt, user_message)
If your classifier is an LLM judge, make the system context explicit in the judging prompt:
judge_prompt = f"""
You are evaluating whether a message is harmful in context.
APPLICATION CONTEXT: {system_prompt}
USER MESSAGE: {user_message}
Is this message harmful given the application context?
Respond with JSON: {{"harmful": true/false, "confidence": 0.0-1.0, "reason": "..."}}
"""
Expected: False positive rate drops 40–70% for domain-specific applications when context is included.
If it fails:
- Classifier ignores context: Your model may need fine-tuning on contextualized examples — compile the false positive cases as negative training examples
- Latency too high: Cache classifier results for identical (system_prompt_hash, message) pairs
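The caching suggestion above can be sketched with a hash-keyed in-process dict; for multi-worker deployments you'd swap in a shared store like Redis. The stub classifier below just counts invocations to show the cache working:

```python
import hashlib

_cache: dict[tuple[str, str], float] = {}

def cached_classify(system_prompt: str, message: str) -> float:
    """Cache scores by (system_prompt_hash, message) to avoid re-scoring."""
    key = (hashlib.sha256(system_prompt.encode()).hexdigest(), message)
    if key not in _cache:
        _cache[key] = expensive_classify(system_prompt, message)
    return _cache[key]

# Stub classifier that counts calls, standing in for the real stack
calls = 0
def expensive_classify(system_prompt: str, message: str) -> float:
    global calls
    calls += 1
    return 0.12  # placeholder score

cached_classify("support bot", "kill the process")
cached_classify("support bot", "kill the process")  # served from cache
print(calls)  # 1
```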
Step 4: Build a False Positive Review Pipeline
Tuning thresholds once isn't enough. Build a lightweight review loop so false positives surface automatically.
import json
from datetime import datetime, timezone

def log_blocked_request(user_message: str, scores: dict, threshold: float):
    """Log every block for offline review."""
    entry = {
        # datetime.utcnow() is deprecated in Python 3.12; use an aware UTC timestamp
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message": user_message,
        "scores": scores,
        "threshold": threshold,
        "review_status": "pending",  # human reviews this queue
    }
    with open("blocked_requests.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")
Review this queue weekly. When you find a false positive, add it to your validation set — this creates a feedback loop that catches threshold drift over time.
Verification
Run your validation set against the updated configuration:
python evaluate_classifier.py \
--dataset data/labeled_validation.jsonl \
--config config/thresholds_v2.json \
--report reports/fp_analysis.html
You should see: False positive rate below your target (typically <5%) while true positive rate stays above 90%. The report will break results down by category so you can spot remaining problem areas.
What You Learned
- False positives almost always come from a single classifier layer lacking context — log each layer separately before tuning
- Per-category thresholds outperform a single global value; find the right cutoff with precision-recall analysis
- Injecting the system prompt into the classifier resolves most domain-specific false positives without loosening safety
- A logged review queue turns false positive tuning from a one-time fix into an ongoing process
Limitation: These techniques reduce false positives for well-defined application contexts. General-purpose deployments with unpredictable user intent need a different approach — consider a two-stage system where flagged requests are re-evaluated with additional context before being hard-blocked.
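A two-stage gate along those lines can be sketched as follows; both scorer functions are placeholders standing in for your real keyword/embedding stack and context-aware classifier, and the 0.75 cutoffs are illustrative:

```python
def two_stage_block(message: str, context: str) -> bool:
    """Hard-block only when the fast filter AND the context-aware
    re-check both flag the message; otherwise let it through."""
    if fast_score(message) <= 0.75:                    # stage 1: cheap filter
        return False
    return contextual_score(context, message) > 0.75   # stage 2: re-evaluate

# Placeholder scorers, standing in for a real classifier stack
def fast_score(message: str) -> float:
    return 0.9 if "kill" in message else 0.1

def contextual_score(context: str, message: str) -> float:
    return 0.2 if "support" in context else 0.9

print(two_stage_block("kill the process", "tech support bot"))  # False
```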
When NOT to use threshold relaxation: If you're building a children's platform or a regulated medical/legal product, err on the side of more false positives and invest in better classifier training data instead.
Tested with Python 3.12, scikit-learn 1.4, Anthropic SDK 0.23. Classifier examples use a generic scoring interface — adapt to your specific stack.