POC 05 · Reproducible now

LLM Watermark Detection Reproducibility

Under SolvNum, the Kirchenbauer watermark detector’s z-score is bit-identical across platforms, so detection decisions stay stable on every borderline case.

  • SolvNum receipt mismatches: 0
  • Float64 receipt mismatches (expected): 2
  • Detector calls per implementation: 1,296
  • Math-library divergence absorbed: ≤1 bit

The scenario

Set the picture

Meta and peers have published watermarking schemes for LLM outputs. Every scheme has a detector: a statistical test that computes a z-score and applies a threshold to decide “watermark present” or “watermark absent.”
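
For readers who have not seen the test, here is a minimal sketch of the z-score in the form given by Kirchenbauer et al. (2023): count the green-list tokens, subtract the count expected by chance, and divide by the binomial standard deviation. The function names and the threshold of 4.0 are illustrative assumptions, not taken from the benchmark scripts.

```python
import math

def kirchenbauer_z(green_count: int, total_tokens: int, gamma: float) -> float:
    """One-proportion z-test: how far the green-token count sits above chance."""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std

def is_watermarked(z: float, threshold: float = 4.0) -> bool:
    """Assumed decision rule: flag the text when z exceeds the threshold."""
    return z > threshold

# 100 green tokens out of 200 with gamma = 0.25 (50 expected by chance).
z = kirchenbauer_z(green_count=100, total_tokens=200, gamma=0.25)
print(round(z, 3), is_watermarked(z))  # 8.165 True
```

A score this far above the threshold is safe from drift. The cases that matter are the ones where z lands within a rounding error of the cut-off.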

When the z-score is far from the threshold, drift doesn’t matter. When it’s borderline, platform-dependent arithmetic drift can flip the detection decision — one machine says “AI-generated,” another says “not AI-generated” on the same input.

Cost today

The published detector papers do not analyze detection-decision stability across platforms. There is a gap in the literature.

EU AI Act Article 50 requires transparency about AI-generated content. If the detector that enforces this obligation produces different answers on different machines, the regulatory artifact is unreliable.

What changes with SolvNum

The SolvNum-backed Kirchenbauer 2023 z-score detector produces a bit-identical receipt hash across Windows and Linux. The SolvNum-backed Kirchenbauer 2024 weighted detector does the same.

Float64 weighted detector: different receipt hashes across the two hosts. The per-token multiplication and reduction accumulate rounding differences that surface as receipt mismatches and, on borderline texts, flipped detection decisions.
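
The mechanism can be reproduced on a single machine with a toy example. This sketch is not the benchmark code: it uses reduction order as a stand-in for cross-platform drift, and the three weights and the threshold of 0.6 are made up for illustration. An explicit loop is used because Python's built-in `sum` applies compensated summation to floats from version 3.12 onward, which would hide the effect.

```python
import hashlib
import struct

def sequential_sum(values):
    """Plain left-to-right float64 accumulation (no compensation)."""
    total = 0.0
    for v in values:
        total += v
    return total

def float_receipt(x: float) -> str:
    """Hash the raw IEEE 754 bit pattern, so a 1-ULP change alters the receipt."""
    return hashlib.sha256(struct.pack("<d", x)).hexdigest()

weights = [0.1, 0.2, 0.3]   # made-up per-token weights
threshold = 0.6             # made-up decision threshold

forward = sequential_sum(weights)             # 0.6000000000000001
backward = sequential_sum(reversed(weights))  # 0.6

print(forward > threshold, backward > threshold)          # True False
print(float_receipt(forward) == float_receipt(backward))  # False
```

The two sums differ by a single unit in the last place, yet that is enough to change both the hash and the decision at the threshold.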

The SolvNum detector returns integer triples (sign, q, e) for hashing and never converts them back to float. That’s why the receipt is stable: the hash path never touches libm.
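
As a rough sketch of why integer-only hashing is stable, the snippet below serializes (sign, q, e) triples into a fixed byte layout and hashes the bytes. The encoding and the `receipt_hash` name are hypothetical and are not SolvNum's actual receipt format. The point is only that no floating-point operation occurs anywhere on the path.

```python
import hashlib
import struct

def receipt_hash(triples):
    """SHA-256 over a canonical big-endian encoding of (sign, q, e) integers.

    Hypothetical layout: sign as int8, e as int64, then q as a
    length-prefixed unsigned big-endian magnitude (q must be >= 0).
    """
    h = hashlib.sha256()
    for sign, q, e in triples:
        q_bytes = q.to_bytes((q.bit_length() + 7) // 8 or 1, "big")
        h.update(struct.pack(">bqI", sign, e, len(q_bytes)))
        h.update(q_bytes)
    return h.hexdigest()

# Made-up triples standing in for detector outputs.
a = receipt_hash([(1, 816496580927726, -14), (-1, 0, 0)])
b = receipt_hash([(1, 816496580927726, -14), (-1, 0, 0)])
print(a == b)  # True
```

Because every field is an integer with a fixed byte order, the same inputs hash to the same bytes on any host, regardless of the math library installed.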

Measurable outcome

What we claim — and how it survives review

Each line below maps to a captured number in the demo section. Every number is reproducible from the benchmark suite.

  • SolvNum k2023 detector: receipt hash match across Windows and Linux.
  • SolvNum k2024 weighted detector: receipt hash match across Windows and Linux.
  • Float64 k2024 weighted (naive + Kahan): receipt hash MISMATCH across Windows and Linux.
  • SolvNum mismatches: 0. Float mismatches: 2 (both expected on cross-platform runs).
  • 1,296 detector calls (216 sequences × 3 reduction orders × 2 sqrt paths) per implementation.
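
The call count in the last bullet is a plain Cartesian product, which can be checked in a few lines. The names of the reduction orders and sqrt paths below are placeholders chosen for illustration. The page gives only their counts.

```python
from itertools import product

SEQUENCES = range(216)                                   # 216 synthetic sequences
REDUCTION_ORDERS = ("order_a", "order_b", "order_c")     # placeholder names
SQRT_PATHS = ("sqrt_path_a", "sqrt_path_b")              # placeholder names

# One detector call per (sequence, reduction order, sqrt path) combination.
calls = list(product(SEQUENCES, REDUCTION_ORDERS, SQRT_PATHS))
print(len(calls))  # 1296
```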

The demo

What was tested. How. What the script printed.

Synthetic corpus: 216 sequences at varying watermark strengths (γ = 0.25, 0.5, 0.75) and lengths (50, 200, 1000 tokens). Three implementations: float64_naive, float64_kahan, solvnum_backed. Two detectors: Kirchenbauer 2023 z-score, Kirchenbauer 2024 weighted.

Cross-platform verification: the bench script runs on both hosts and writes an anchor file on each. The verify script then compares the anchor hashes. SolvNum mismatches must be 0; float mismatches are expected and informational.
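
The comparison step might look roughly like the sketch below. It assumes each anchor file reduces to a flat mapping from "detector / implementation" to a hash string. That schema, the `compare_anchors` name, and the sample hashes are assumptions for illustration, so the real verify.py may be organized differently.

```python
def compare_anchors(win: dict, linux: dict):
    """Split cross-host hash mismatches into SolvNum (fatal) and float (informational)."""
    solvnum_mismatches, float_mismatches = [], []
    for key in sorted(win.keys() & linux.keys()):
        if win[key] == linux[key]:
            continue
        if "solvnum" in key:
            solvnum_mismatches.append(key)
        else:
            float_mismatches.append(key)
    return solvnum_mismatches, float_mismatches

# Made-up hashes; in practice these would be loaded from the two anchor files.
win = {"k2024_weighted / solvnum_backed": "aaaa", "k2024_weighted / float64_naive": "bbbb"}
linux = {"k2024_weighted / solvnum_backed": "aaaa", "k2024_weighted / float64_naive": "cccc"}

solvnum_bad, float_bad = compare_anchors(win, linux)
print(len(solvnum_bad), len(float_bad))  # 0 1
```

A run passes when the first list is empty. Entries in the second list are reported but do not fail the check.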

Captured benchmark output

The numbers the script actually printed.

Cross-platform watermark detector receipt comparison
Detector × Implementation        | Windows hash | Linux hash | Match
k2023 / solvnum_backed           | 9c0a6229…    | 9c0a6229…  | ✓ MATCH
k2024_weighted / solvnum_backed  | 27c76c7a…    | 27c76c7a…  | ✓ MATCH
k2024_weighted / float64_naive   | ad6e9f35…    | 7b9bcf49…  | ✗ DIFFER
k2024_weighted / float64_kahan   | 9870edff…    | cd551bf5…  | ✗ DIFFER

The k2023 float64 implementations happen to match on these two x86_64 hosts because their only floating-point work is a square root over integers. That match is incidental and is not a property to rely on.

Evidence pointers

Where the claims live in the repo

These are the files a reviewer should run to re-derive every number on this page.

  • tools/solvnum/buyer_pocs/llm_watermark/bench.py
  • tools/solvnum/buyer_pocs/llm_watermark/verify.py
  • tools/solvnum/buyer_pocs/reports/llm_watermark_anchors_win.json
  • tools/solvnum/buyer_pocs/reports/llm_watermark_anchors_wsl.json
  • docs/poc/05_llm_watermark.md
  • docs/poc/05_llm_watermark_xplat_evidence.md

Want to see these receipts on your pipeline?

Run the benchmark against your actual decision pipeline.

Two weeks, $25K, fully credited. No production integration, no data leaving your premises. Every claim above traces back to a script you can run locally.
