Watermarking AI Text: Is It Actually Possible? (Technical Analysis)

Problem: You Can't Reliably Tell If Text Was AI-Generated

You need to know whether a document, essay, or report was written by a human or an LLM. Existing detectors give false positives on non-native English writers and fail after minor edits. Watermarking promises a better answer — but the reality is complicated.

You'll learn:

How the three main AI watermarking approaches work under the hood
Why each method breaks down in practice
What the current state of the art actually delivers (and where it fails)

Time: 20 min | Level: Intermediate

Why This Happens

Detecting AI text after the fact is a fundamentally hard problem: LLMs produce fluent, human-like output, and any statistical fingerprint can be erased by paraphrasing.

Watermarking approaches this differently: instead of detecting AI text forensically, you embed a signal during generation that is designed to survive reasonable edits. The challenge is making that signal robust without degrading output quality.

Three approaches dominate current research:

Token-level statistical watermarks (embed signal in token selection)
Semantic watermarks (embed signal in meaning/structure)
Post-hoc steganography (add signal after generation)

Common symptoms that motivate this problem:

Detectors like GPTZero flag human-written technical prose as AI
Simple paraphrasing defeats most watermarking schemes
No watermark survives translation and back-translation intact

How Token-Level Watermarking Works

Step 1: The "Green/Red List" Method

The most influential technique comes from a 2023 paper by Kirchenbauer et al. (UMD). The idea is elegant:

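# Simplified sketch (NumPy), not the reference implementation
import hashlib
import numpy as np

def generate_token_with_watermark(logits, prev_token, key, gamma=0.25, delta=2.0):
    # Seed a PRNG from the secret key and previous token so the detector can rebuild the split
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))

    # Deterministically split vocabulary: first gamma fraction of a seeded permutation is green
    green_list = rng.permutation(len(logits))[: int(gamma * len(logits))]

    # Boost logits for green-list tokens
    biased = np.array(logits, dtype=float)
    biased[green_list] += delta  # delta ~= 2.0

    # Sample as normal - green tokens now more likely
    probs = np.exp(biased - biased.max())
    probs /= probs.sum()
    return int(np.random.default_rng().choice(len(probs), p=probs))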

The key insight: watermarked LLM output contains statistically too many "green" tokens. You can detect this with a z-score test that needs no access to the model, only the secret key plus the same tokenizer and hashing scheme used at generation time.
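The z-score test can be sketched in a few lines. This is a minimal sketch assuming the one-proportion z-test described by Kirchenbauer et al.: without a watermark, each scored token lands in the green list with probability gamma. The function name and the example counts are illustrative assumptions, not values from a real run.

```python
import math

def watermark_z_score(green_count: int, total_tokens: int, gamma: float = 0.25):
    """One-proportion z-test: are there more green tokens than chance predicts?"""
    expected = gamma * total_tokens
    std_dev = math.sqrt(total_tokens * gamma * (1 - gamma))
    z = (green_count - expected) / std_dev
    # One-sided p-value under a normal approximation
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Illustrative counts only: 200 scored tokens, 90 of them green
z, p = watermark_z_score(green_count=90, total_tokens=200)
print(f"z = {z:.2f}, p = {p:.2e}")
```

The detector only needs the green count, which it can recompute from the text and the key, so the model itself is never queried.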

Expected: Detection accuracy >95% on unmodified text. At a threshold of z=4, the theoretical one-sided false positive rate is about 3 in 100,000, well under 1%.

If detection fails:

Low z-score on genuine watermarked text: Delta was set too low, or text is very short (<200 tokens)
False positives on human text: Key collision — try a different key or raise the threshold
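The short-text failure mode follows from the test statistic: for a fixed green-token rate, z grows with the square root of the token count. A rough sketch, assuming the one-proportion z-test with gamma = 0.25. The green rates passed in are hypothetical inputs, not measurements.

```python
import math

def tokens_needed(green_rate: float, gamma: float = 0.25, z_threshold: float = 4.0) -> int:
    """Smallest token count T at which an observed green rate reaches z_threshold.

    Derived from z = (green_rate - gamma) * sqrt(T) / sqrt(gamma * (1 - gamma)).
    """
    if green_rate <= gamma:
        raise ValueError("green_rate must exceed gamma for detection to be possible")
    root_t = z_threshold * math.sqrt(gamma * (1 - gamma)) / (green_rate - gamma)
    return math.ceil(root_t ** 2)

# Hypothetical green rates: a weaker bias needs far more text
for rate in (0.60, 0.40, 0.31):
    print(f"green rate {rate:.2f}: about {tokens_needed(rate)} tokens")
```

A low delta or low-entropy text pulls the green rate toward gamma, which is why both show up as "low z-score" symptoms.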

Step 2: Why It Breaks Under Paraphrasing

# Attack: paraphrase with a different LLM
# (pseudocode: generate_with_watermark and paraphrase_llm are placeholders)
watermarked_text = generate_with_watermark(prompt)
attacked_text = paraphrase_llm(watermarked_text)

# Result: z-score drops from ~8 to ~1.5
# Watermark is effectively destroyed

Even light paraphrasing (synonym substitution, sentence reordering) disrupts the token-level signal. The green/red list assignment is tied to specific tokens — change the tokens, lose the watermark.

This is the core weakness of all token-level approaches.

Figure: Z-score distribution before and after paraphrasing. The z-score drops below the detection threshold after ~30% of tokens are substituted.
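The effect of token substitution can be reproduced with a toy simulation. This is a sketch under strong assumptions: the "model" has uniform logits over a 1000-token vocabulary, green membership comes from a keyed hash of the (previous token, token) pair rather than a permuted list, and random replacement stands in for paraphrasing. All names and constants are hypothetical, and the absolute z-scores will not match a real LLM. Only the downward trend is the point.

```python
import hashlib
import math
import random

GAMMA, DELTA, VOCAB, KEY = 0.25, 2.0, 1000, "secret"

def is_green(prev_token: int, token: int) -> bool:
    """Keyed hash of (previous token, token) decides green-list membership."""
    digest = hashlib.sha256(f"{KEY}:{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA

def generate(length: int, rng: random.Random) -> list[int]:
    """Toy model with uniform logits: a green token is drawn with probability
    of roughly gamma*e^delta / (gamma*e^delta + 1 - gamma)."""
    p_green = GAMMA * math.exp(DELTA) / (GAMMA * math.exp(DELTA) + 1 - GAMMA)
    tokens = [rng.randrange(VOCAB)]
    while len(tokens) < length:
        want_green = rng.random() < p_green
        candidate = rng.randrange(VOCAB)
        while is_green(tokens[-1], candidate) != want_green:
            candidate = rng.randrange(VOCAB)
        tokens.append(candidate)
    return tokens

def substitute(tokens, fraction, rng):
    """Crude stand-in for paraphrasing: replace a fraction of tokens at random."""
    return [rng.randrange(VOCAB) if rng.random() < fraction else t for t in tokens]

def z_score(tokens):
    scored = len(tokens) - 1  # the first token has no predecessor
    green = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    return (green - GAMMA * scored) / math.sqrt(scored * GAMMA * (1 - GAMMA))

rng = random.Random(0)
text = generate(300, rng)
for fraction in (0.0, 0.3, 0.6, 0.9):
    attacked = substitute(text, fraction, rng)
    print(f"{fraction:.0%} substituted: z = {z_score(attacked):.1f}")
```

Each substitution damages two scored pairs (the replaced token and the one after it), so the signal decays faster than the raw substitution rate suggests.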

Semantic Watermarking

Step 1: Embed Signal in Meaning, Not Tokens

Semantic watermarks encode information in choices that are meaning-preserving but detectable:

# Conceptual example: encode watermark bit in sentence structure
def encode_bit(sentence, bit):
    if bit == 0:
        # Use active voice
        return to_active_voice(sentence)
    else:
        # Use passive voice  
        return to_passive_voice(sentence)

# Decode: check voice distribution across document
def detect_semantic_watermark(text):
    active_count = count_active_sentences(text)
    passive_count = count_passive_sentences(text)
    # Significant imbalance suggests watermark
    return chi_squared_test(active_count, passive_count)

Real implementations use dozens of such binary choices (synonym selection, clause ordering, anaphora resolution) to encode a multi-bit payload.
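A multi-bit payload built from binary choices can be sketched with synonym selection alone. This is a deliberately naive illustration: the synonym table, the function names, and the whitespace tokenization are all assumptions, and a real system would need context-aware substitution.

```python
# Hypothetical synonym pairs: index 0 encodes bit 0, index 1 encodes bit 1
SYNONYM_PAIRS = [("big", "large"), ("quick", "fast"), ("begin", "start"), ("help", "assist")]
LOOKUP = {word: (pair, bit) for pair in SYNONYM_PAIRS for bit, word in enumerate(pair)}

def encode(text: str, bits: list[int]) -> str:
    """Rewrite each synonym slot so that it carries the next payload bit."""
    remaining = list(bits)
    out = []
    for word in text.split():
        if word in LOOKUP and remaining:
            pair, _ = LOOKUP[word]
            word = pair[remaining.pop(0)]
        out.append(word)
    return " ".join(out)

def decode(text: str) -> list[int]:
    """Read one bit from every synonym slot found in the text."""
    return [LOOKUP[word][1] for word in text.split() if word in LOOKUP]

cover = "we begin with a big dataset and a quick script to help users"
marked = encode(cover, [1, 0, 1, 1])
print(marked)
print(decode(marked))
```

A paraphraser that swaps "fast" back to "quick" flips exactly one bit of the payload, which is the failure mode that rewriting exploits.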

Expected: Survives paraphrasing better than token-level — roughly 60-70% bit recovery after moderate paraphrasing vs. ~10% for token-level.

If recovery rate is too low:

Document is too short: Need ~500+ words for reliable multi-bit payload
Aggressive paraphrasing: Semantic watermarks still fail against full rewriting

Step 2: The Quality vs. Robustness Tradeoff

# Strong watermarks require constrained generation
# This hurts output quality

# Weak watermark (high quality, easy to remove)
weak = {"watermark_strength": 0.1,    # Small preference for watermarked choices
        "perplexity_increase": 0.02}  # ~2%, barely noticeable

# Strong watermark (detectable after edits, lower quality)
strong = {"watermark_strength": 0.8,    # Heavy preference for watermarked choices
          "perplexity_increase": 0.15}  # ~15%, noticeably worse text

There's no free lunch: robustness comes at the cost of text quality, and a determined attacker with a strong paraphrase model can strip most semantic signals.
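The tradeoff can be made concrete with a small calculation. Semantic "watermark strength" has no standard definition, so this sketch borrows the token-level logit bias delta as a proxy and assumes a uniform base distribution. Under those assumptions, the green-token rate (detectability) and the KL divergence from the unbiased distribution (a rough stand-in for quality loss) both have closed forms.

```python
import math

def green_probability(delta: float, gamma: float = 0.25) -> float:
    """Chance of sampling a green token when the base distribution is uniform."""
    boosted = gamma * math.exp(delta)
    return boosted / (boosted + 1 - gamma)

def kl_distortion(delta: float, gamma: float = 0.25) -> float:
    """KL divergence (nats per token) of the biased distribution from the uniform base."""
    normalizer = gamma * math.exp(delta) + 1 - gamma
    return green_probability(delta, gamma) * delta - math.log(normalizer)

for delta in (0.5, 1.0, 2.0, 4.0):
    print(f"delta={delta}: green rate {green_probability(delta):.2f}, "
          f"distortion {kl_distortion(delta):.3f} nats/token")
```

Both quantities rise together as delta grows: there is no setting in this toy model that increases the detectable signal without moving the output distribution further from the original.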

Cryptographic and Post-Hoc Approaches

What Works (And What Doesn't)

Post-hoc steganography (embedding a watermark after generation using whitespace, Unicode variation selectors, zero-width characters) is trivially defeated:

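# Strip zero-width/format characters and variation selectors
echo "$text" | python3 -c "
import sys, unicodedata

def invisible(c):
    # Cf covers zero-width spaces/joiners; variation selectors are Mn, so match them by range
    return (unicodedata.category(c) == 'Cf'
            or 0xFE00 <= ord(c) <= 0xFE0F or 0xE0100 <= ord(c) <= 0xE01EF)

sys.stdout.write(''.join(c for c in sys.stdin.read() if not invisible(c)))
"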

Any copy-paste through a plain text editor, HTML sanitizer, or PDF render will strip the signal.
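To see how fragile this is, here is a minimal round trip: hide a few bits with zero-width characters, read them back, then sanitize. The encoding (one invisible character after each word) and the function names are assumptions made for illustration, not a specific published scheme.

```python
import unicodedata

ZERO, ONE = "\u200b", "\u200c"  # zero-width space / zero-width non-joiner

def embed(cover: str, bits: list[int]) -> str:
    """Append one invisible character per bit after successive words."""
    words = cover.split(" ")
    if len(bits) > len(words):
        raise ValueError("cover text too short for payload")
    marked = [w + (ONE if bits[i] else ZERO) if i < len(bits) else w
              for i, w in enumerate(words)]
    return " ".join(marked)

def extract(text: str) -> list[int]:
    """Recover the payload from whatever invisible markers remain."""
    return [1 if c == ONE else 0 for c in text if c in (ZERO, ONE)]

def strip_invisible(text: str) -> str:
    """Drop Unicode format (Cf) characters, as a sanitizer might."""
    return "".join(c for c in text if unicodedata.category(c) != "Cf")

cover = "This report was generated automatically"
marked = embed(cover, [1, 0, 1, 1])
print(extract(marked))                   # payload is recoverable
print(extract(strip_invisible(marked)))  # payload is gone
print(strip_invisible(marked) == cover)  # visible text is unchanged
```

The attacker does not need the key or any knowledge of the scheme, since the stripping step is generic.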

Cryptographic provenance (signing model outputs with a private key) does work, but only if you control the distribution channel. If the LLM provider signs outputs and the consumer verifies them against the provider's public key, the origin of an unmodified output can be verified. This is the approach promoted by C2PA, the Coalition for Content Provenance and Authenticity.

The problem: it requires cooperation from the generator plus a trust infrastructure for publishing and checking keys. A valid signature proves that the key holder signed this exact text, but it covers the exact bytes, so any edit invalidates it.
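The "exact bytes" property is easy to demonstrate. Real provenance systems use public-key signatures; to keep this sketch dependency-free it swaps in an HMAC from the Python standard library, which is a symmetric stand-in (signer and verifier share one secret) and not what C2PA specifies. The key and function names are hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical shared key

def sign(text: str) -> str:
    """Compute a tag over the exact UTF-8 bytes of the text."""
    return hmac.new(SECRET_KEY, text.encode("utf-8"), hashlib.sha256).hexdigest()

def verify(text: str, tag: str) -> bool:
    """Constant-time comparison of the recomputed tag with the supplied one."""
    return hmac.compare_digest(sign(text), tag)

output = "The quarterly forecast shows steady growth."
tag = sign(output)
print(verify(output, tag))                              # unmodified text verifies
print(verify(output.replace("steady", "strong"), tag))  # a one-word edit fails
```

Unlike the statistical watermarks above, there is no partial credit: the check either passes on identical bytes or fails completely.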

Figure: C2PA provenance chain. The model signs at generation; the verifier checks against the published key.

Verification

What You Can Actually Test Today

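# Test the UMD reference implementation (open source).
# Assumption: run from a clone of the authors' repo, where
# extended_watermark_processor.py sits at the repo root. Check the repo
# README for current install steps and argument names.
git clone https://github.com/jwkirchenbauer/lm-watermarking.git
cd lm-watermarking
# Install the dependencies listed in the repo README before continuing

python3 -c "
from transformers import AutoTokenizer
from extended_watermark_processor import WatermarkDetector

# GPT-2 tokenizer (50257-token vocab); must match the generating model
tokenizer = AutoTokenizer.from_pretrained('gpt2')
text = open('sample.txt').read()  # placeholder path: the text to test

# Detection example
detector = WatermarkDetector(
    vocab=list(tokenizer.get_vocab().values()),
    gamma=0.25,        # Fraction of vocab in green list
    seeding_scheme='simple_1',  # Must match the generation setting
    device='cpu',
    tokenizer=tokenizer,
    z_threshold=4.0    # Detection threshold
)

score = detector.detect(text)
print(f'z-score: {score[\"z_score\"]:.2f}, detected: {score[\"prediction\"]}')
"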

You should see: A z-score >4.0 on unmodified watermarked text, <2.0 on human text (with caveats on short documents).

What You Learned

Token-level watermarks (UMD green/red list) are the most practical today but fail after paraphrasing
Semantic watermarks are more robust but degrade output quality and still don't survive aggressive rewriting
Cryptographic/C2PA approaches give strong provenance guarantees but require infrastructure and don't survive editing
No current method reliably watermarks text that will be edited, translated, or paraphrased

Limitation: All watermarking research assumes the generator cooperates. An adversarial actor using a local, unwatermarked model produces undetectable output by definition.

When NOT to use watermarking: Don't rely on it for high-stakes decisions (academic integrity, legal evidence) without understanding its failure modes. It's a probabilistic signal, not a cryptographic proof — except in the C2PA case.

Based on research current as of early 2026. Key papers: Kirchenbauer et al. 2023 (UMD), Christ et al. 2023 (undetectable watermarks), Kuditipudi et al. 2023 (robust watermarks). C2PA spec v2.1.