Problem: Your AI Pipeline Might Be Shipping Copyrighted Content
You're using an LLM to generate articles, code, or marketing copy — and you have no idea if the output contains lifted text from copyrighted sources. One legal notice later, and it's a very expensive problem.
You'll learn:
- How to detect near-verbatim text reproduction programmatically
- Which tools catch copyright risk in generated code
- How to set up a lightweight compliance gate before content ships
Time: 20 min | Level: Intermediate
Why This Happens
LLMs are trained on vast corpora of internet text, books, and code. They don't "memorize" content in a simple lookup sense, but they can reproduce long passages verbatim — especially for highly repeated content like song lyrics, legal boilerplate, or popular open-source code snippets.
Common symptoms:
- Generated blog posts contain suspiciously polished paragraphs that feel "too good"
- Code completions reproduce GPL-licensed functions without attribution
- Marketing copy matches existing product descriptions almost word-for-word
The risk is highest for: creative writing prompts, code generation, and summarization tasks where the source material is narrow and well-known.
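Before reaching for an external API, you can get a cheap local signal: long word n-grams shared between an output and a suspect source are a strong hint of verbatim reproduction. A minimal sketch, with the caveat that the 8-word window and the helper name `shared_ngrams` are arbitrary starting points, not an established standard:

```python
def shared_ngrams(generated: str, reference: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return word n-grams that appear in both texts.

    Long shared n-grams (8+ words) rarely occur independently;
    short ones match stock phrases by chance.
    """
    def ngrams(text: str) -> set[tuple[str, ...]]:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    return ngrams(generated) & ngrams(reference)

# Any non-empty result warrants a closer look
overlap = shared_ngrams(
    "the mitochondria is the powerhouse of the cell and produces ATP",
    "biologists say the mitochondria is the powerhouse of the cell today",
)
```

This catches only literal reuse, so treat it as a pre-filter, not a replacement for the checks below.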
Solution
Step 1: Run Text Through a Similarity Detection API
For prose and documentation, similarity search against indexed web content is your first line of defense.
```python
import requests

def check_similarity(text: str, api_key: str) -> dict:
    # Copyleaks and Winston AI are two options with API access.
    # This example uses a generic POST pattern.
    response = requests.post(
        "https://api.copyleaks.com/v3/businesses/submit/url",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "url": "inline",
            "content": text,
            # Flag results above 20% similarity for human review
            "sensitivityLevel": 2,
        },
    )
    return response.json()
```
Expected: A report object with a score (0–100) and matched source URLs.
If it fails:
- 401 Unauthorized: Regenerate your API token — they expire every 24 hours on free tiers
- 429 Too Many Requests: You're hitting rate limits; batch submissions with a 1-second delay
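That batching advice can be folded into a small retry helper. The sketch below takes a zero-argument callable (for example, a lambda wrapping the `requests.post` call above) so it works with any HTTP client; the function name and retry counts are illustrative, not part of any SDK:

```python
import time
from typing import Any, Callable

def with_rate_limit_retry(send: Callable[[], Any],
                          max_retries: int = 3, delay: float = 1.0) -> Any:
    """Retry `send` on HTTP 429 up to `max_retries` times, sleeping
    `delay` seconds between attempts (the 1-second spacing suggested above)."""
    response = send()
    for _ in range(max_retries):
        if getattr(response, "status_code", None) != 429:
            break
        time.sleep(delay)
        response = send()
    return response

# Usage: with_rate_limit_retry(lambda: requests.post(url, headers=h, json=body))
```

Production APIs often return a `Retry-After` header; honoring it is a better choice than a fixed delay when it's available.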
Step 2: Check Generated Code for License Violations
Code is a distinct problem. A function can be nearly identical to GPL or AGPL-licensed code without being a literal copy — and that still creates legal exposure.
```python
from difflib import SequenceMatcher

def similarity_ratio(generated: str, reference: str) -> float:
    # SequenceMatcher measures character-level similarity; near-copies
    # still score high even after light edits or renamed variables
    return SequenceMatcher(None, generated, reference).ratio()

def flag_risky_code(generated_code: str, known_snippets: list[str],
                    threshold: float = 0.75) -> list[dict]:
    flagged = []
    for snippet in known_snippets:
        ratio = similarity_ratio(generated_code, snippet)
        if ratio >= threshold:
            # 0.75 catches near-copies; lower to 0.6 for stricter enforcement
            flagged.append({"snippet": snippet[:80], "similarity": round(ratio, 3)})
    return flagged
```
Expected: An empty list for clean outputs. Any matches above 0.75 warrant manual review.
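To go a step beyond raw string comparison, you can parse both snippets and blank out identifiers before comparing, so a pure rename can't mask a structural copy. A sketch using Python's standard `ast` module; `_NameBlinder` and `structural_similarity` are hypothetical helpers written for this article, not from any library:

```python
import ast
from difflib import SequenceMatcher

class _NameBlinder(ast.NodeTransformer):
    """Replace identifiers with a placeholder so renamed variables
    can't mask structural similarity."""
    def visit_Name(self, node: ast.Name) -> ast.AST:
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = "_"
        return node

def structural_similarity(code_a: str, code_b: str) -> float:
    """Compare two Python snippets after blanking variable and
    argument names, so only the code's shape is measured."""
    def normalized(src: str) -> str:
        return ast.dump(_NameBlinder().visit(ast.parse(src)))
    return SequenceMatcher(None, normalized(code_a), normalized(code_b)).ratio()

# A pure rename no longer hides the copy
original = "def add(x, y):\n    return x + y"
renamed = "def add(first, second):\n    return first + second"
```

This only handles Python source; for other languages you would need that language's parser, and function names and attributes are left untouched in this sketch.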
For a more complete solution, enable GitHub Copilot's filter for suggestions matching public code if you're already on that stack — it's built specifically for this.
[Screenshot: terminal output showing flagged snippets and their similarity scores]
Step 3: Build a Pre-Publish Compliance Gate
Don't rely on manual spot-checks. Wire the detection into your content pipeline so nothing publishes without passing a threshold.
```python
import json
from dataclasses import dataclass

@dataclass
class ComplianceResult:
    passed: bool
    similarity_score: float
    flagged_sources: list[str]
    action: str  # "publish" | "review" | "reject"

def compliance_gate(text: str, similarity_score: float, sources: list[str]) -> ComplianceResult:
    if similarity_score < 0.20:
        return ComplianceResult(True, similarity_score, [], "publish")
    elif similarity_score < 0.40:
        # Route to human review queue rather than auto-reject
        return ComplianceResult(False, similarity_score, sources, "review")
    else:
        return ComplianceResult(False, similarity_score, sources, "reject")

# Example integration in a content generation script
def generate_and_check(prompt: str, llm_client, checker_api_key: str) -> dict:
    output = llm_client.generate(prompt)
    check = check_similarity(output, checker_api_key)
    score = check.get("score", 0) / 100  # Normalize to 0–1
    sources = [m["url"] for m in check.get("matches", [])]
    result = compliance_gate(output, score, sources)
    return {
        "content": output if result.passed else None,
        "compliance": result,
    }
```
If it fails:
- All outputs flagged as "review": Raise your similarity threshold (that is, lower the check's sensitivity) or narrow the scope of comparison — common phrases like "click here" will always match
- Gate never triggers: Your API may be returning normalized scores differently; print the raw response and adjust the division factor
Verification
Run the gate against a known-problematic output to confirm it's working:
```python
# Test with a sentence copied directly from Wikipedia
test_text = "The mitochondria is the powerhouse of the cell."
result = check_similarity(test_text, api_key="YOUR_KEY")
print(json.dumps(result, indent=2))
```
You should see: A high similarity score (>60) and a reference to the Wikipedia source or biology textbooks.
```shell
# Run your full pipeline end to end
python generate_content.py --prompt "Write about neural networks" --check-compliance
```
You should see: Either a published output or a review/reject decision with source URLs attached.
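You can also unit-test the gate's tier boundaries without touching the similarity API at all. The sketch below restates `compliance_gate` from Step 3 (condensed, same thresholds) so it runs standalone:

```python
from dataclasses import dataclass

@dataclass
class ComplianceResult:
    passed: bool
    similarity_score: float
    flagged_sources: list[str]
    action: str

def compliance_gate(text: str, similarity_score: float, sources: list[str]) -> ComplianceResult:
    if similarity_score < 0.20:
        return ComplianceResult(True, similarity_score, [], "publish")
    if similarity_score < 0.40:
        return ComplianceResult(False, similarity_score, sources, "review")
    return ComplianceResult(False, similarity_score, sources, "reject")

# Boundary values land in the stricter tier: 0.20 is "review", 0.40 is "reject"
assert compliance_gate("", 0.05, []).action == "publish"
assert compliance_gate("", 0.20, ["https://example.com"]).action == "review"
assert compliance_gate("", 0.40, ["https://example.com"]).action == "reject"
```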
What You Learned
- LLMs reproduce verbatim content most often with well-known, repetitive text — lyrics, legal clauses, popular code
- Similarity APIs catch prose issues; `SequenceMatcher` or AST comparison handles code
- A three-tier gate (publish / review / reject) avoids both false positives and legal exposure
- Setting your threshold too low floods your review queue; 20% for prose and 75% structural similarity for code are reasonable starting points
Limitation: These tools catch surface similarity, not semantic reproduction. An LLM can paraphrase copyrighted arguments closely enough to still be legally risky — that requires more nuanced legal review, not a script.
When NOT to use this: Internal tooling that never leaves your organization. Copyright law generally applies to distribution, not private use.
Tested with Python 3.12, Copyleaks API v3, and outputs from Claude Sonnet 4.6 and GPT-4o — February 2026