Robust Detection and Proportion Estimation of Statistical Language Watermarks

Speaker: Xiang Li

Tuesday, Aug 5: 2:30 PM - 2:55 PM
Invited Paper Session
Music City Center
Since the introduction of ChatGPT in November 2022, embedding (nearly) imperceptible statistical signals into text generated by large language models (LLMs)—a technique known as watermarking—has emerged as a principled approach for provably distinguishing LLM-generated text from human-written content. This talk addresses two critical challenges in this domain: (1) robust detection of watermarks and (2) provable estimation of the proportion of watermarked content when users edit the watermarked text generated by LLMs.

For robust detection, we introduce a statistical framework that models the problem as a mixture detection task. Our approach employs a family of truncated goodness-of-fit (Tr-GoF) tests, which we show to be optimally robust in two key ways: (i) achieving the optimal detection boundary as the watermark signal asymptotically diminishes, and (ii) attaining the highest detection efficiency under constant-level modifications. In contrast, existing sum-based detection methods for Gumbel-max watermarks fail to meet these benchmarks without relying on additional problem-specific information. Simulations validate these theoretical guarantees, and real-world experiments confirm that our method maintains watermark detectability at a level superior or comparable to existing approaches, particularly in low-temperature settings.
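To make the detection setup concrete, the following is a minimal illustrative sketch, not the paper's exact procedure. It assumes the Gumbel-max watermark's per-token pivotal statistics Y_t, which are i.i.d. Uniform(0,1) for human text and stochastically larger for watermarked text, and uses a truncated higher-criticism statistic as one instance of a truncated goodness-of-fit test; the truncation fraction `alpha0` and the simulated watermark distribution (max of 4 uniforms) are arbitrary choices for the demo.

```python
import numpy as np

def pivotal_pvalues(Y):
    # Under H0 (human text), Y_t ~ Unif(0,1); watermarking pushes Y_t
    # toward 1, so small p_t = 1 - Y_t is evidence of a watermark.
    return 1.0 - np.asarray(Y, dtype=float)

def truncated_hc(p, alpha0=0.5):
    # Truncated higher-criticism statistic (one member of the truncated
    # goodness-of-fit family): scan the smallest alpha0-fraction of the
    # sorted p-values and return the largest standardized deviation of
    # the empirical CDF from the Uniform(0,1) CDF.
    p = np.sort(np.asarray(p, dtype=float))
    n = len(p)
    k = np.arange(1, n + 1)
    keep = (p > 1.0 / n) & (k <= alpha0 * n)  # truncation for robustness
    z = np.sqrt(n) * (k / n - p) / np.sqrt(p * (1.0 - p))
    return float(z[keep].max())

# Demo: null (human) text vs. a stylized watermarked sample.
rng = np.random.default_rng(0)
n = 500
Y_null = rng.uniform(size=n)                    # H0: Y_t ~ Unif(0,1)
Y_wm = rng.uniform(size=(n, 4)).max(axis=1)     # stochastically larger Y_t

hc_null = truncated_hc(pivotal_pvalues(Y_null))
hc_wm = truncated_hc(pivotal_pvalues(Y_wm))
```

A large value of the statistic rejects the null of purely human text; the truncation step is what confers robustness, since it prevents a handful of edited tokens from dominating the scan.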

For estimating the proportion of watermarked content, we establish the identifiability of this proportion under three watermarking schemes and propose efficient estimators that quantify the proportion of watermarked subtexts within mixed-source texts. These estimators achieve minimax optimality and perform strongly on both simulated and real datasets.
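As a toy illustration of why the proportion is estimable, one can view the pivotal statistics of a mixed-source text as draws from the mixture (1 - eps) * Unif(0,1) + eps * F1, where F1 is the watermarked distribution. The sketch below is a simple method-of-moments estimator under the strong (assumed) simplification that the watermarked mean mu1 = E_{F1}[Y] is known; the paper's estimators are more refined and do not require this.

```python
import numpy as np

def estimate_proportion(Y, mu1):
    # Method-of-moments sketch: E[Y] = (1 - eps) * 1/2 + eps * mu1,
    # so matching first moments gives
    #   eps = (mean(Y) - 1/2) / (mu1 - 1/2),
    # clipped to the valid range [0, 1].
    eps = (np.mean(Y) - 0.5) / (mu1 - 0.5)
    return float(np.clip(eps, 0.0, 1.0))

# Demo: 30% of tokens watermarked, F1 = max of 4 uniforms (mean 4/5).
rng = np.random.default_rng(1)
n = 20_000
is_wm = rng.random(n) < 0.3
Y = np.where(is_wm, rng.random((n, 4)).max(axis=1), rng.random(n))

eps_hat = estimate_proportion(Y, mu1=0.8)
```

With n = 20,000 tokens the estimate concentrates tightly around the true value 0.3; the hard part addressed in the talk is doing this provably and minimax-optimally without such idealized knowledge of F1.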

Keywords: Watermark; LLM; Robust detection; Mixture model; Proportion estimation