POC 05 · Reproducible now

LLM Watermark Detection Reproducibility

Under SolvNum, the Kirchenbauer watermark detector’s z-score is bit-identical across platforms, and detection decisions stay stable on every borderline case.

0: SolvNum receipt mismatches
2: Float64 receipt mismatches (expected)
1,296: Detector calls per implementation
≤1 bit: Math-library divergence absorbed

The scenario

Set the picture

Meta and peers have published watermarking schemes for LLM outputs. Every scheme has a detector: a statistical test that computes a z-score and applies a threshold to decide “watermark present” or “watermark absent.”

When the z-score is far from the threshold, drift doesn’t matter. When it’s borderline, platform-dependent arithmetic drift can flip the detection decision — one machine says “AI-generated,” another says “not AI-generated” on the same input.
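As a minimal sketch of the test described above, the following computes the one-proportion z-score from Kirchenbauer et al. (2023), z = (green − γT) / sqrt(Tγ(1 − γ)), where γ is the green-list fraction in that paper's notation, and applies a threshold. The function names and the threshold of 4.0 are illustrative assumptions, not taken from the benchmark suite. The last lines show why a borderline score is fragile: two values one unit in the last place apart land on opposite sides of the threshold.

```python
import math

def z_score(green_count: int, total_tokens: int, gamma: float) -> float:
    """One-proportion z-test: (green - gamma*T) / sqrt(T * gamma * (1 - gamma))."""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std

def is_watermarked(z: float, threshold: float = 4.0) -> bool:
    """Illustrative decision rule; the threshold value is an assumption."""
    return z > threshold

# Far from the threshold: a last-bit difference cannot change the answer.
clear_case = z_score(green_count=120, total_tokens=200, gamma=0.25)
print(is_watermarked(clear_case))

# Borderline: two scores one ulp apart give opposite decisions.
borderline = 4.0
one_ulp_up = math.nextafter(borderline, math.inf)  # requires Python 3.9+
print(is_watermarked(borderline), is_watermarked(one_ulp_up))
```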

Cost today

The published detector papers do not analyze detection-decision stability across platforms. There is a gap in the literature.

EU AI Act Article 50 requires transparency about AI-generated content. If the detector that enforces this obligation produces different answers on different machines, the regulatory artifact is unreliable.

What changes with SolvNum

The SolvNum-backed Kirchenbauer 2023 z-score detector produces a bit-identical receipt hash on Windows and Linux. So does the SolvNum-backed Kirchenbauer 2024 weighted detector.

The float64 weighted detector produces different receipt hashes on the two hosts. Its per-token multiplication and reduction accumulate rounding differences that surface as receipt mismatches and, on borderline texts, as flipped detection decisions.
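To make the mechanism concrete, here is a small sketch of how a last-bit difference in a float64 reduction becomes a different receipt. This page does not say which operation diverges between the two hosts, so the sketch uses a change in summation order as a stand-in. The `float_receipt` helper and its byte encoding are hypothetical, not the benchmark's receipt format.

```python
import hashlib
import struct

def float_receipt(value: float) -> str:
    """Hypothetical receipt: SHA-256 over the little-endian IEEE 754 bytes."""
    return hashlib.sha256(struct.pack("<d", value)).hexdigest()

# Three per-token contributions, reduced in two different orders.
contributions = [0.1, 0.2, 0.3]
left_to_right = (contributions[0] + contributions[1]) + contributions[2]
right_to_left = contributions[0] + (contributions[1] + contributions[2])

# Float addition is not associative, so the two totals differ in the last bit.
print(left_to_right == right_to_left)
print(float_receipt(left_to_right)[:8], float_receipt(right_to_left)[:8])
```

The two totals agree to about fifteen significant digits, yet any hash over their bytes treats them as unrelated values.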

The SolvNum detector returns integer triples (sign, q, e) for hashing and never converts them back to float. That is why the receipt is stable: the hash path never touches libm.
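The idea of hashing integers rather than floats can be sketched as follows. This is not SolvNum's actual encoding: the page does not define what q and e mean, so the sketch treats each triple as three opaque integers, and both `receipt_hash` and its text serialization are assumptions made for illustration.

```python
import hashlib
from typing import Iterable, Tuple

Triple = Tuple[int, int, int]  # (sign, q, e), treated here as opaque integers

def receipt_hash(triples: Iterable[Triple]) -> str:
    """Hash a canonical ASCII rendering of integer triples (hypothetical format)."""
    digest = hashlib.sha256()
    for sign, q, e in triples:
        # Integer-to-decimal formatting is exact, so no float rounding is involved.
        digest.update(f"{sign},{q},{e};".encode("ascii"))
    return digest.hexdigest()

# Made-up example triples, not benchmark output.
print(receipt_hash([(1, 5, -3), (-1, 12, 0)]))
```

Because every step operates on integers and bytes, the digest depends only on the triples themselves, not on the host's math library.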

Measurable outcome

What we claim — and how it survives review

Each line below maps to a captured number in the demo section. Every number is reproducible from the benchmark suite.

SolvNum k2023 detector: receipt hash match across Windows and Linux.
SolvNum k2024 weighted detector: receipt hash match across Windows and Linux.
Float64 k2024 weighted (naive + Kahan): receipt hash MISMATCH across Windows and Linux.
SolvNum mismatches: 0. Float mismatches: 2 (both expected on cross-platform runs).
1,296 detector calls (216 sequences × 3 reduction orders × 2 sqrt paths) per implementation.
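The call count in the last line is the size of a full grid over sequences, reduction orders, and sqrt paths. A short sketch of that enumeration follows; the labels for the reduction orders and sqrt paths are placeholders, since the page does not name them.

```python
from itertools import product

sequences = range(216)
reduction_orders = ("order_a", "order_b", "order_c")  # placeholder labels
sqrt_paths = ("sqrt_a", "sqrt_b")                     # placeholder labels

# One detector call per (sequence, reduction order, sqrt path) combination.
calls = list(product(sequences, reduction_orders, sqrt_paths))
print(len(calls))  # 216 * 3 * 2 = 1296
```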

The demo

What was tested. How. What the script printed.

Synthetic corpus: 216 sequences at varying watermark strengths (γ = 0.25, 0.5, 0.75) and lengths (50, 200, 1000 tokens). Three implementations: float64_naive, float64_kahan, solvnum_backed. Two detectors: Kirchenbauer 2023 z-score, Kirchenbauer 2024 weighted.
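For readers unfamiliar with the two float64 variants named above, here is a generic sketch of naive versus Kahan (compensated) summation. It is a textbook version, not the code in bench.py, and the function names are assumptions. Compensation narrows the accumulated error, but each operation still rounds, so it does not by itself guarantee bit-identical results across hosts.

```python
import math
from typing import Sequence

def naive_sum(values: Sequence[float]) -> float:
    """Plain left-to-right accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total

def kahan_sum(values: Sequence[float]) -> float:
    """Kahan summation: carry the rounding error of each add into the next one."""
    total = 0.0
    compensation = 0.0
    for v in values:
        y = v - compensation            # apply the error recovered last step
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was lost
        total = t
    return total

values = [0.1] * 10
# math.fsum serves as a correctly rounded reference for comparison.
print(naive_sum(values), kahan_sum(values), math.fsum(values))
```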

Cross-platform verification: the bench script runs on both hosts and writes an anchor file on each. The verify script then compares the anchor hashes. SolvNum mismatches must be 0; float mismatches are expected and informational.
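The comparison rule can be sketched as below. This is not verify.py: the anchor file layout is not shown on this page, so the sketch assumes each host's anchors reduce to a mapping from a "detector/implementation" key to a hash string, and the hash values are placeholders.

```python
from typing import Dict, List, Tuple

def compare_anchors(
    win: Dict[str, str],
    linux: Dict[str, str],
    must_match_suffix: str = "solvnum_backed",
) -> Tuple[List[str], List[str]]:
    """Return (failures, informational) lists of keys whose hashes differ."""
    failures: List[str] = []
    informational: List[str] = []
    for key in sorted(win.keys() | linux.keys()):
        if win.get(key) == linux.get(key):
            continue
        # A SolvNum mismatch fails the run; a float mismatch is only reported.
        if key.endswith(must_match_suffix):
            failures.append(key)
        else:
            informational.append(key)
    return failures, informational

# Placeholder hashes, not the captured values.
win = {"k2023/solvnum_backed": "aaaa", "k2024_weighted/float64_naive": "bbbb"}
linux = {"k2023/solvnum_backed": "aaaa", "k2024_weighted/float64_naive": "cccc"}
failures, informational = compare_anchors(win, linux)
print(failures, informational)
```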

Captured benchmark output

The numbers the script actually printed.

Cross-platform watermark detector receipt comparison

Detector × Implementation	Windows hash	Linux hash	Match
k2023 / solvnum_backed	9c0a6229…	9c0a6229…	✓ MATCH
k2024_weighted / solvnum_backed	27c76c7a…	27c76c7a…	✓ MATCH
k2024_weighted / float64_naive	ad6e9f35…	7b9bcf49…	✗ DIFFER
k2024_weighted / float64_kahan	9870edff…	cd551bf5…	✗ DIFFER

The k2023 float64 detector happens to match on these two x86_64 hosts because its only floating-point work is a sqrt over integers. That is not a property to rely on.
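One way to see why the unweighted decision need not depend on sqrt at all: for a non-negative threshold t, z > t holds exactly when the numerator is positive and its square exceeds t² times the variance. The sketch below checks that with exact rationals. It is an illustration of the arithmetic only, not SolvNum's method, and the function name is an assumption.

```python
from fractions import Fraction

def exceeds_threshold(
    green_count: int,
    total_tokens: int,
    gamma: Fraction,
    threshold: Fraction,
) -> bool:
    """Decide z > threshold exactly, without computing a square root.

    Assumes threshold >= 0 and 0 < gamma < 1.
    """
    numerator = green_count - gamma * total_tokens
    if numerator <= 0:
        return False
    variance = total_tokens * gamma * (1 - gamma)
    # z > t  <=>  numerator**2 > t**2 * variance  (both sides are positive)
    return numerator * numerator > threshold * threshold * variance

# With T = 48 and gamma = 1/4, 24 green tokens give z exactly equal to 4.
print(exceeds_threshold(24, 48, Fraction(1, 4), Fraction(4)))  # False: not strictly above
print(exceeds_threshold(25, 48, Fraction(1, 4), Fraction(4)))  # True
```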

Composes with

Where this POC sits in the suite

POC 01

AI Decision-Layer Reproducibility

Both POCs address ML pipeline reproducibility. POC 01 targets the decision layer; POC 05 targets the detection statistic.

POC 03

Ads Attribution Reconciliation

Attribution and watermark detection both involve long arithmetic chains where float drift accumulates.

Evidence pointers

Where the claims live in the repo

These are the files a reviewer should run to re-derive every number on this page.

tools/solvnum/buyer_pocs/llm_watermark/bench.py
tools/solvnum/buyer_pocs/llm_watermark/verify.py
tools/solvnum/buyer_pocs/reports/llm_watermark_anchors_win.json
tools/solvnum/buyer_pocs/reports/llm_watermark_anchors_wsl.json
docs/poc/05_llm_watermark.md
docs/poc/05_llm_watermark_xplat_evidence.md


Want to see these receipts on your pipeline?

Run the benchmark against your actual decision pipeline.

Two weeks, $25K, fully credited. No production integration, no data leaving your premises. Every claim above traces back to a script you can run locally.
