Problem: Your Logs Are Leaking Secrets to OpenAI
You pipe application logs into OpenAI for analysis or debugging — and you're probably sending emails, API keys, and passwords along with them.
You'll learn:
- How to detect and redact PII (emails, phone numbers, SSNs, credit cards)
- How to mask secrets like API keys and tokens before they leave your system
- How to build a reusable masking pipeline in Python
Time: 20 min | Level: Intermediate
Why This Happens
Application logs are designed for humans reading in a controlled environment. They weren't built with third-party AI APIs in mind. Structured log formats like JSON often include full request/response bodies, headers, and user data — all of which get forwarded verbatim when you pass logs to OpenAI.
Common symptoms:
- User emails visible in OpenAI Playground history
- Auth tokens appearing in model context
- GDPR/CCPA exposure risk from log-based AI workflows
Solution
Step 1: Install Dependencies
pip install presidio-analyzer presidio-anonymizer spacy
python -m spacy download en_core_web_lg
Microsoft Presidio handles PII detection with high accuracy. It runs locally — nothing is sent anywhere during analysis.
Expected: No errors. Presidio installs its analyzer and anonymizer engines separately.
If it fails:
- spacy model not found: Run python -m spacy download en_core_web_lg explicitly
- Presidio import error: Ensure both presidio-analyzer and presidio-anonymizer are installed
Step 2: Build a Regex Layer for Secrets
Presidio handles PII well, but secrets like API keys need a pattern-based approach first.
import re
# Patterns for common secrets
SECRET_PATTERNS = [
# OpenAI, Anthropic, generic prefixed API keys (hyphens/underscores allowed so sk-ant-... also matches)
(r'sk-[A-Za-z0-9_-]{20,}', '[API_KEY]'),
# Bearer tokens in headers
(r'Bearer\s+[A-Za-z0-9\-._~+/]+=*', 'Bearer [TOKEN]'),
# AWS access keys
(r'AKIA[0-9A-Z]{16}', '[AWS_KEY]'),
# Generic hex secrets (32+ chars)
(r'\b[0-9a-f]{32,}\b', '[HEX_SECRET]'),
# Connection strings
(r'(mongodb|postgres|mysql):\/\/[^\s"]+', '[CONNECTION_STRING]'),
]
def mask_secrets(text: str) -> str:
for pattern, replacement in SECRET_PATTERNS:
        # Case-insensitive so header variants like "bearer"/"BEARER" are caught;
        # over-matching is the safe direction when masking
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
return text
Why regex first: Secrets follow predictable structures (known prefixes like sk- and AKIA, fixed lengths), so deterministic patterns catch them reliably and run far faster than NLP, which isn't trained to recognize key formats at all.
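To see the regex layer in isolation, here's a self-contained sketch (patterns abbreviated from the full list above; the sample values are made up):

```python
import re

# Abbreviated copy of SECRET_PATTERNS, enough for a demo
SECRET_PATTERNS = [
    (r'sk-[A-Za-z0-9_-]{20,}', '[API_KEY]'),
    (r'Bearer\s+[A-Za-z0-9\-._~+/]+=*', 'Bearer [TOKEN]'),
    (r'AKIA[0-9A-Z]{16}', '[AWS_KEY]'),
]

def mask_secrets(text: str) -> str:
    # Apply each pattern in order; replacements only touch the matched span
    for pattern, replacement in SECRET_PATTERNS:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

line = "auth=Bearer abc.def.ghi key=sk-live1234567890abcdefghij"
print(mask_secrets(line))  # auth=Bearer [TOKEN] key=[API_KEY]
```

Note that the replacement runs on the whole log string at once, so one pass per pattern is enough regardless of how many secrets a log line contains.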
Step 3: Add PII Detection with Presidio
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
# Entities to redact — add or remove based on your compliance requirements
PII_ENTITIES = [
"EMAIL_ADDRESS",
"PHONE_NUMBER",
"CREDIT_CARD",
"US_SSN",
"IP_ADDRESS",
"PERSON",
"LOCATION",
]
def mask_pii(text: str) -> str:
results = analyzer.analyze(
text=text,
entities=PII_ENTITIES,
language="en"
)
if not results:
return text
anonymized = anonymizer.anonymize(
text=text,
analyzer_results=results,
operators={
# Replace each entity type with a labeled placeholder
"EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "[EMAIL]"}),
"PHONE_NUMBER": OperatorConfig("replace", {"new_value": "[PHONE]"}),
"CREDIT_CARD": OperatorConfig("replace", {"new_value": "[CREDIT_CARD]"}),
"US_SSN": OperatorConfig("replace", {"new_value": "[SSN]"}),
"IP_ADDRESS": OperatorConfig("replace", {"new_value": "[IP]"}),
"PERSON": OperatorConfig("replace", {"new_value": "[PERSON]"}),
"LOCATION": OperatorConfig("replace", {"new_value": "[LOCATION]"}),
}
)
return anonymized.text
Why labeled placeholders over hashes: Placeholders like [EMAIL] preserve the semantic structure of the log. The model still understands "a user's email was here" without seeing the actual address.
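The difference is easy to see side by side. This sketch uses a simplified email regex (not Presidio's detector) purely to contrast the two redaction styles:

```python
import hashlib
import re

# Simplified email pattern for illustration only; Presidio's detector is more robust
EMAIL_RE = r'[\w.+-]+@[\w-]+\.[\w.]+'
log = "login failed for jane@corp.example after 3 attempts"

# Option A: labeled placeholder. The model still sees "an email was here".
placeholder = re.sub(EMAIL_RE, '[EMAIL]', log)

# Option B: hash. Irreversible, but reads as noise to the model.
hashed = re.sub(
    EMAIL_RE,
    lambda m: hashlib.sha256(m.group().encode()).hexdigest()[:12],
    log,
)

print(placeholder)  # login failed for [EMAIL] after 3 attempts
print(hashed)
```

One trade-off worth knowing: hashes do let you correlate the same user across log lines, which a bare [EMAIL] token cannot. If you need both anonymity and correlation, numbered placeholders like [EMAIL_1] give you each.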
Step 4: Combine Into a Single Pipeline
def sanitize_log(log_text: str) -> str:
# Order matters: secrets first (fast regex), then PII (slower NLP)
log_text = mask_secrets(log_text)
log_text = mask_pii(log_text)
return log_text
Now wrap your OpenAI call:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def analyze_log_with_openai(raw_log: str) -> str:
    # Always sanitize before the log leaves your system
    clean_log = sanitize_log(raw_log)
    response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a log analysis assistant."},
{"role": "user", "content": f"Analyze this log:\n\n{clean_log}"}
]
)
return response.choices[0].message.content
Expected: OpenAI receives the log with all secrets and PII replaced by labeled tokens.
If it fails:
- Presidio misses a value: Add a custom regex pattern to SECRET_PATTERNS for that format
- Performance too slow: Cache the AnalyzerEngine instance by initializing it once at startup, not per call
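The caching fix can be sketched with functools.lru_cache. The ExpensiveEngine class below is a stand-in so the example runs anywhere; in your code, return AnalyzerEngine() instead:

```python
from functools import lru_cache

# Placeholder for an expensive-to-construct engine such as Presidio's
# AnalyzerEngine, which loads a full spacy model on init
class ExpensiveEngine:
    instances = 0

    def __init__(self):
        ExpensiveEngine.instances += 1

@lru_cache(maxsize=1)
def get_engine() -> ExpensiveEngine:
    # First call constructs; every later call returns the same cached instance
    return ExpensiveEngine()

a = get_engine()
b = get_engine()
print(a is b, ExpensiveEngine.instances)  # True 1
```

A module-level `analyzer = AnalyzerEngine()` achieves the same thing; the lru_cache variant just delays the (slow) model load until first use.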
Step 5: Validate the Masking
# Quick smoke test
test_log = """
2026-02-24 ERROR user john.doe@example.com called /api/payments
Authorization: Bearer eyJhbGc.fake.token
Card: 4111 1111 1111 1111, IP: 192.168.1.1
API Key used: sk-abc123XYZ789longkeyhere
"""
result = sanitize_log(test_log)
print(result)
Expected output:
2026-02-24 ERROR user [EMAIL] called /api/payments
Authorization: Bearer [TOKEN]
Card: [CREDIT_CARD], IP: [IP]
API Key used: [API_KEY]
Verification
python -c "
from your_module import sanitize_log
sample = 'Error for user test@example.com with key sk-test12345678901234567890'
out = sanitize_log(sample)
assert '[EMAIL]' in out, 'Email not masked'
assert '[API_KEY]' in out, 'API key not masked'
print('All checks passed:', out)
"
You should see: All checks passed: followed by the sanitized string with no raw PII.
What You Learned
- Layering regex (for secrets) and NLP (for PII) gives better coverage than either alone
- Presidio runs locally — no data leaves your system during analysis
- Labeled placeholders like [EMAIL] are more useful to LLMs than hashes or empty strings
Limitation: Presidio's PERSON entity has false positives on uncommon names. Tune confidence thresholds with score_threshold on the analyze() call if you see over-redaction.
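The effect of score_threshold is just a confidence cutoff on the analyzer's detections. This sketch mimics it with a hypothetical result list (hand-picked scores, not real Presidio output):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    entity_type: str
    score: float

# Hypothetical analyzer output: an unusual token misread as a PERSON at low confidence
results = [
    Detection("PERSON", 0.45),        # likely false positive
    Detection("EMAIL_ADDRESS", 0.99), # solid detection
]

SCORE_THRESHOLD = 0.6  # same effect as analyze(..., score_threshold=0.6)

kept = [r for r in results if r.score >= SCORE_THRESHOLD]
print([r.entity_type for r in kept])  # ['EMAIL_ADDRESS']
```

Raising the threshold trades recall for precision: fewer odd names get redacted, but genuinely low-confidence PII can slip through, so tune it against a sample of your own logs.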
When NOT to use this: If your logs already go through a SIEM or log management tool with built-in masking (Datadog, Splunk), apply masking there instead to keep it centralized.
Tested on Python 3.12, presidio-analyzer 2.2.355, spacy 3.7.x, Ubuntu 22.04 & macOS Sequoia