Mask Sensitive Data Before Sending Logs to OpenAI

Stop leaking PII and secrets to OpenAI. Learn regex, hashing, and tokenization techniques to sanitize logs before every API call.

Problem: Your Logs Are Leaking Secrets to OpenAI

You pipe application logs into OpenAI for analysis or debugging — and you're probably sending emails, API keys, and passwords along with them.

You'll learn:

  • How to detect and redact PII (emails, phone numbers, SSNs, credit cards)
  • How to mask secrets like API keys and tokens before they leave your system
  • How to build a reusable masking pipeline in Python

Time: 20 min | Level: Intermediate


Why This Happens

Application logs are written for humans reviewing them in a controlled environment; they weren't built with third-party AI APIs in mind. Structured log formats like JSON often include full request/response bodies, headers, and user data, all of which get forwarded verbatim when you pass logs to OpenAI.

Common symptoms:

  • User emails visible in OpenAI Playground history
  • Auth tokens appearing in model context
  • GDPR/CCPA exposure risk from log-based AI workflows

Solution

Step 1: Install Dependencies

pip install presidio-analyzer presidio-anonymizer spacy
python -m spacy download en_core_web_lg

Microsoft Presidio handles PII detection with high accuracy. It runs locally — nothing is sent anywhere during analysis.

Expected: No errors. Presidio installs its analyzer and anonymizer engines separately.

If it fails:

  • spacy model not found: Run python -m spacy download en_core_web_lg explicitly
  • Presidio import error: Ensure both presidio-analyzer and presidio-anonymizer are installed

Step 2: Build a Regex Layer for Secrets

Presidio handles PII well, but secrets like API keys need a pattern-based approach first.

import re

# Patterns for common secrets
SECRET_PATTERNS = [
    # OpenAI, Anthropic, and similar "sk-" prefixed API keys
    # (hyphens/underscores allowed so sk-ant-... formats match too)
    (r'sk-[A-Za-z0-9_\-]{20,}', '[API_KEY]'),
    # Bearer tokens in headers
    (r'Bearer\s+[A-Za-z0-9\-._~+/]+=*', 'Bearer [TOKEN]'),
    # AWS access keys
    (r'AKIA[0-9A-Z]{16}', '[AWS_KEY]'),
    # Generic hex secrets (32+ chars)
    (r'\b[0-9a-f]{32,}\b', '[HEX_SECRET]'),
    # Connection strings
    (r'(mongodb|postgres(?:ql)?|mysql)://[^\s"]+', '[CONNECTION_STRING]'),
]

def mask_secrets(text: str) -> str:
    for pattern, replacement in SECRET_PATTERNS:
        # IGNORECASE widens coverage (e.g. mixed-case hex); for masking,
        # an occasional false positive is the safer failure mode
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

Why regex first: Secrets are structurally predictable. Regex is faster than NLP and avoids false negatives on unusual key formats.
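To sanity-check the patterns in isolation, here is a minimal standalone sketch. Two of the patterns above are repeated so it runs without the rest of the module:

```python
import re

# Subset of SECRET_PATTERNS, repeated here so this check is self-contained
PATTERNS = [
    (r'sk-[A-Za-z0-9]{20,}', '[API_KEY]'),
    (r'Bearer\s+[A-Za-z0-9\-._~+/]+=*', 'Bearer [TOKEN]'),
]

def mask(text: str) -> str:
    # Apply each pattern in order, replacing matches in place
    for pattern, replacement in PATTERNS:
        text = re.sub(pattern, replacement, text)
    return text

line = 'auth=Bearer abc.def.ghi key=sk-abcdefghij0123456789XYZ'
print(mask(line))  # auth=Bearer [TOKEN] key=[API_KEY]
```

Running a quick check like this before wiring up the full pipeline catches pattern typos early, when they are cheap to fix.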


Step 3: Add PII Detection with Presidio

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

# Entities to redact — add or remove based on your compliance requirements
PII_ENTITIES = [
    "EMAIL_ADDRESS",
    "PHONE_NUMBER",
    "CREDIT_CARD",
    "US_SSN",
    "IP_ADDRESS",
    "PERSON",
    "LOCATION",
]

def mask_pii(text: str) -> str:
    results = analyzer.analyze(
        text=text,
        entities=PII_ENTITIES,
        language="en"
    )

    if not results:
        return text

    anonymized = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        operators={
            # Replace each entity type with a labeled placeholder
            "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "[EMAIL]"}),
            "PHONE_NUMBER": OperatorConfig("replace", {"new_value": "[PHONE]"}),
            "CREDIT_CARD": OperatorConfig("replace", {"new_value": "[CREDIT_CARD]"}),
            "US_SSN": OperatorConfig("replace", {"new_value": "[SSN]"}),
            "IP_ADDRESS": OperatorConfig("replace", {"new_value": "[IP]"}),
            "PERSON": OperatorConfig("replace", {"new_value": "[PERSON]"}),
            "LOCATION": OperatorConfig("replace", {"new_value": "[LOCATION]"}),
        }
    )
    return anonymized.text

Why labeled placeholders over hashes: Placeholders like [EMAIL] preserve the semantic structure of the log. The model still understands "a user's email was here" without seeing the actual address.


Step 4: Combine Into a Single Pipeline

def sanitize_log(log_text: str) -> str:
    # Order matters: secrets first (fast regex), then PII (slower NLP)
    log_text = mask_secrets(log_text)
    log_text = mask_pii(log_text)
    return log_text
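The same pipeline can also be attached at the logging layer itself, so records are scrubbed before they are ever written. A sketch using the stdlib logging.Filter; the inline _mask function is a stand-in for sanitize_log so the example runs on its own:

```python
import io
import logging
import re

# Stand-in for sanitize_log, so this sketch is self-contained;
# in the real pipeline, call sanitize_log here instead
def _mask(text: str) -> str:
    return re.sub(r'sk-[A-Za-z0-9]{20,}', '[API_KEY]', text)

class SanitizeFilter(logging.Filter):
    # Filters run before the handler emits, so the raw secret
    # never reaches the log output
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = _mask(record.getMessage())
        record.args = None  # message is already fully formatted
        return True

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(SanitizeFilter())
logger = logging.getLogger("sanitized")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("key=%s", "sk-abcdefghij0123456789XYZ")
print(stream.getvalue().strip())  # key=[API_KEY]
```

Scrubbing at the handler means downstream consumers (files, shippers, AI calls) all see the sanitized form, at the cost of losing the raw value everywhere.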

Now wrap your OpenAI call:

from openai import OpenAI

# The v1 client reads OPENAI_API_KEY from the environment
client = OpenAI()

def analyze_log_with_openai(raw_log: str) -> str:
    # Always sanitize before the log leaves your system
    clean_log = sanitize_log(raw_log)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a log analysis assistant."},
            {"role": "user", "content": f"Analyze this log:\n\n{clean_log}"}
        ]
    )
    return response.choices[0].message.content

Expected: OpenAI receives the log with all secrets and PII replaced by labeled tokens.

If it fails:

  • Presidio misses a value: Add a custom regex pattern to SECRET_PATTERNS for that format
  • Performance too slow: Cache the AnalyzerEngine instance — initialize it once at startup, not per call
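Extending SECRET_PATTERNS is usually enough for a missed value. A sketch adding a GitHub-style token pattern; the ghp_ prefix and 36-character length are an assumption here, so verify against the formats you actually see:

```python
import re

# Repeated so the sketch is standalone; in your module, append to the
# existing SECRET_PATTERNS list instead
SECRET_PATTERNS = [
    (r'sk-[A-Za-z0-9]{20,}', '[API_KEY]'),
]

# Hypothetical extra pattern: GitHub personal access tokens
# (ghp_ + 36 alphanumerics is an assumption, not a verified spec)
SECRET_PATTERNS.append((r'ghp_[A-Za-z0-9]{36}', '[GITHUB_TOKEN]'))

def mask_secrets(text: str) -> str:
    for pattern, replacement in SECRET_PATTERNS:
        text = re.sub(pattern, replacement, text)
    return text

sample = 'push failed, token=ghp_' + 'a' * 36
print(mask_secrets(sample))  # push failed, token=[GITHUB_TOKEN]
```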

Step 5: Validate the Masking

# Quick smoke test
test_log = """
2026-02-24 ERROR user john.doe@example.com called /api/payments
Authorization: Bearer eyJhbGc.fake.token
Card: 4111 1111 1111 1111, IP: 192.168.1.1
API Key used: sk-abc123XYZ789longkeyhere
"""

result = sanitize_log(test_log)
print(result)

Expected output:

2026-02-24 ERROR user [EMAIL] called /api/payments
Authorization: Bearer [TOKEN]
Card: [CREDIT_CARD], IP: [IP]
API Key used: [API_KEY]

Verification

python -c "
from your_module import sanitize_log
sample = 'Error for user test@example.com with key sk-test1234567890abcdef1234'
out = sanitize_log(sample)
assert '[EMAIL]' in out, 'Email not masked'
assert '[API_KEY]' in out, 'API key not masked'
print('All checks passed:', out)
"

You should see: All checks passed: followed by the sanitized string with no raw PII.


What You Learned

  • Layering regex (for secrets) and NLP (for PII) gives better coverage than either alone
  • Presidio runs locally — no data leaves your system during analysis
  • Labeled placeholders ([EMAIL]) are more useful to LLMs than hashes or empty strings

Limitation: Presidio's PERSON entity has false positives on uncommon names. Tune confidence thresholds with score_threshold on the analyze() call if you see over-redaction.

When NOT to use this: If your logs already go through a SIEM or log management tool with built-in masking (Datadog, Splunk), apply masking there instead to keep it centralized.


Tested on Python 3.12, presidio-analyzer 2.2.355, spacy 3.7.x, Ubuntu 22.04 & macOS Sequoia