Problem: Your AI Pipeline Might Be Shipping Copyrighted Content
You're using an LLM to generate articles, code, or marketing copy — and you have no idea if the output contains lifted text from copyrighted sources. One legal notice later, and it's a very expensive problem.
You'll learn:
- How to detect near-verbatim text reproduction programmatically
- Which tools catch copyright risk in generated code
- How to set up a lightweight compliance gate before content ships
Time: 20 min | Level: Intermediate
Why This Happens
LLMs are trained on vast corpora of internet text, books, and code. They don't "memorize" content in a simple lookup sense, but they can reproduce long passages verbatim — especially for highly repeated content like song lyrics, legal boilerplate, or popular open-source code snippets.
Common symptoms:
- Generated blog posts contain suspiciously polished paragraphs that feel "too good"
- Code completions reproduce GPL-licensed functions without attribution
- Marketing copy matches existing product descriptions almost word-for-word
The risk is highest for: creative writing prompts, code generation, and summarization tasks where the source material is narrow and well-known.
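Before reaching for an external API, you can get a cheap local signal: long word n-grams shared between an output and a suspect source are a strong hint of verbatim reproduction. A minimal sketch, with the caveat that the 8-word window and the helper name `shared_ngrams` are arbitrary starting points, not an established standard:

```python
def shared_ngrams(generated: str, reference: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return word n-grams that appear in both texts.

    Long shared n-grams (8+ words) rarely occur independently;
    short ones match stock phrases by chance.
    """
    def ngrams(text: str) -> set[tuple[str, ...]]:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    return ngrams(generated) & ngrams(reference)

# Any non-empty result warrants a closer look
overlap = shared_ngrams(
    "the mitochondria is the powerhouse of the cell and produces ATP",
    "biologists say the mitochondria is the powerhouse of the cell today",
)
```

This catches only literal reuse, so treat it as a pre-filter, not a replacement for the checks below.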
Solution
Step 1: Run Text Through a Similarity Detection API
For prose and documentation, similarity search against indexed web content is your first line of defense.
```python
import requests

def check_similarity(text: str, api_key: str) -> dict:
    # Copyleaks and Winston AI are two options with API access.
    # This example uses a generic POST pattern.
    response = requests.post(
        "https://api.copyleaks.com/v3/businesses/submit/url",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "url": "inline",
            "content": text,
            # Flag results above 20% similarity for human review
            "sensitivityLevel": 2,
        },
    )
    return response.json()
```
Expected: A report object with a score (0–100) and matched source URLs.
If it fails:
- 401 Unauthorized: Regenerate your API token — they expire every 24 hours on free tiers
- 429 Too Many Requests: You're hitting rate limits; batch submissions with a 1-second delay
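That batching advice can be folded into a small retry helper. The sketch below takes a zero-argument callable (for example, a lambda wrapping the `requests.post` call above) so it works with any HTTP client; the function name and retry counts are illustrative, not part of any SDK:

```python
import time
from typing import Any, Callable

def with_rate_limit_retry(send: Callable[[], Any],
                          max_retries: int = 3, delay: float = 1.0) -> Any:
    """Retry `send` on HTTP 429 up to `max_retries` times, sleeping
    `delay` seconds between attempts (the 1-second spacing suggested above)."""
    response = send()
    for _ in range(max_retries):
        if getattr(response, "status_code", None) != 429:
            break
        time.sleep(delay)
        response = send()
    return response

# Usage: with_rate_limit_retry(lambda: requests.post(url, headers=h, json=body))
```

Production APIs often return a `Retry-After` header; honoring it is a better choice than a fixed delay when it's available.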
Step 2: Check Generated Code for License Violations
Code is a distinct problem. A function can be nearly identical to GPL or AGPL-licensed code without being a literal copy — and that still creates legal exposure.
```python
from difflib import SequenceMatcher

def similarity_ratio(generated: str, reference: str) -> float:
    # SequenceMatcher measures character-level similarity; near-copies
    # still score high even after light edits or renamed variables
    return SequenceMatcher(None, generated, reference).ratio()

def flag_risky_code(generated_code: str, known_snippets: list[str],
                    threshold: float = 0.75) -> list[dict]:
    flagged = []
    for snippet in known_snippets:
        ratio = similarity_ratio(generated_code, snippet)
        if ratio >= threshold:
            # 0.75 catches near-copies; lower to 0.6 for stricter enforcement
            flagged.append({"snippet": snippet[:80], "similarity": round(ratio, 3)})
    return flagged
```
Expected: An empty list for clean outputs. Any matches above 0.75 warrant manual review.
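To go a step beyond raw string comparison, you can parse both snippets and blank out identifiers before comparing, so a pure rename can't mask a structural copy. A sketch using Python's standard `ast` module; `_NameBlinder` and `structural_similarity` are hypothetical helpers written for this article, not from any library:

```python
import ast
from difflib import SequenceMatcher

class _NameBlinder(ast.NodeTransformer):
    """Replace identifiers with a placeholder so renamed variables
    can't mask structural similarity."""
    def visit_Name(self, node: ast.Name) -> ast.AST:
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = "_"
        return node

def structural_similarity(code_a: str, code_b: str) -> float:
    """Compare two Python snippets after blanking variable and
    argument names, so only the code's shape is measured."""
    def normalized(src: str) -> str:
        return ast.dump(_NameBlinder().visit(ast.parse(src)))
    return SequenceMatcher(None, normalized(code_a), normalized(code_b)).ratio()

# A pure rename no longer hides the copy
original = "def add(x, y):\n    return x + y"
renamed = "def add(first, second):\n    return first + second"
```

This only handles Python source; for other languages you would need that language's parser, and function names and attributes are left untouched in this sketch.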
For a more complete solution, enable GitHub Copilot's filter for suggestions matching public code if you're already on that stack — it's built specifically for this.
[Screenshot: terminal output showing flagged snippets and their similarity scores]
Step 3: Build a Pre-Publish Compliance Gate
Don't rely on manual spot-checks. Wire the detection into your content pipeline so nothing publishes without passing a threshold.
```python
import json
from dataclasses import dataclass

@dataclass
class ComplianceResult:
    passed: bool
    similarity_score: float
    flagged_sources: list[str]
    action: str  # "publish" | "review" | "reject"

def compliance_gate(text: str, similarity_score: float, sources: list[str]) -> ComplianceResult:
    if similarity_score < 0.20:
        return ComplianceResult(True, similarity_score, [], "publish")
    elif similarity_score < 0.40:
        # Route to human review queue rather than auto-reject
        return ComplianceResult(False, similarity_score, sources, "review")
    else:
        return ComplianceResult(False, similarity_score, sources, "reject")

# Example integration in a content generation script
def generate_and_check(prompt: str, llm_client, checker_api_key: str) -> dict:
    output = llm_client.generate(prompt)
    check = check_similarity(output, checker_api_key)
    score = check.get("score", 0) / 100  # Normalize to 0–1
    sources = [m["url"] for m in check.get("matches", [])]
    result = compliance_gate(output, score, sources)
    return {
        "content": output if result.passed else None,
        "compliance": result,
    }
```
If it fails:
- All outputs flagged as "review": Raise your similarity threshold (that is, lower the check's sensitivity) or narrow the scope of comparison — common phrases like "click here" will always match
- Gate never triggers: Your API may be returning normalized scores differently; print the raw response and adjust the division factor
Verification
Run the gate against a known-problematic output to confirm it's working:
```python
# Test with a sentence copied directly from Wikipedia
test_text = "The mitochondria is the powerhouse of the cell."
result = check_similarity(test_text, api_key="YOUR_KEY")
print(json.dumps(result, indent=2))
```
You should see: A high similarity score (>60) and a reference to the Wikipedia source or biology textbooks.
```shell
# Run your full pipeline end to end
python generate_content.py --prompt "Write about neural networks" --check-compliance
```
You should see: Either a published output or a review/reject decision with source URLs attached.
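You can also unit-test the gate's tier boundaries without touching the similarity API at all. The sketch below restates `compliance_gate` from Step 3 (condensed, same thresholds) so it runs standalone:

```python
from dataclasses import dataclass

@dataclass
class ComplianceResult:
    passed: bool
    similarity_score: float
    flagged_sources: list[str]
    action: str

def compliance_gate(text: str, similarity_score: float, sources: list[str]) -> ComplianceResult:
    if similarity_score < 0.20:
        return ComplianceResult(True, similarity_score, [], "publish")
    if similarity_score < 0.40:
        return ComplianceResult(False, similarity_score, sources, "review")
    return ComplianceResult(False, similarity_score, sources, "reject")

# Boundary values land in the stricter tier: 0.20 is "review", 0.40 is "reject"
assert compliance_gate("", 0.05, []).action == "publish"
assert compliance_gate("", 0.20, ["https://example.com"]).action == "review"
assert compliance_gate("", 0.40, ["https://example.com"]).action == "reject"
```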
What You Learned
- LLMs reproduce verbatim content most often with well-known, repetitive text — lyrics, legal clauses, popular code
- Similarity APIs catch prose issues; `SequenceMatcher` or AST comparison handles code
- A three-tier gate (publish / review / reject) avoids both false positives and legal exposure
- Setting your threshold too low floods your review queue; 20% for prose and 75% structural similarity for code are reasonable starting points
Limitation: These tools catch surface similarity, not semantic reproduction. An LLM can paraphrase copyrighted arguments closely enough to still be legally risky — that requires more nuanced legal review, not a script.
When NOT to use this: Internal tooling that never leaves your organization. Copyright law generally applies to distribution, not private use.
Tested with Python 3.12, Copyleaks API v3, and outputs from Claude Sonnet 4.6 and GPT-4o — February 2026