Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

LLMs can perform well on various tasks
Unregulated use of LLMs can lead to malicious consequences
Detection of AI-generated text is critical
Recent works attempt to detect AI-generated text
Paraphrasing attacks can break detectors
A theoretical impossibility result indicates that, for a sufficiently advanced language model, even the best-possible detector can perform only marginally better than a random classifier
LLMs protected by watermarking schemes can be vulnerable to spoofing attacks

Artificial Intelligence (AI) has made advances in computer vision and natural language processing
Large Language Models (LLMs) can generate texts of high quality with potential applications
AI tools can be misused for unethical purposes such as plagiarism, fake news, and social engineering
Recent research focuses on detecting AI-generated texts
Detection works study this problem as a binary classification problem
Zero-shot AI text detection without additional training overhead is also studied
Watermarking is used to ease the detection of LLM output text
AI-text detectors are not reliable in practical scenarios
Paraphrasing attack can evade various types of detectors
Best-possible detector can perform only marginally better than a random classifier
Spoofing attacks on text generative models are possible
Identifying AI-generated text is important to avoid misuse
Vulnerable detectors can cause damages such as falsely accusing a human of plagiarism

Detecting AI-generated text is important for LLM security
AI text detectors can identify LLM signatures in text
Paraphrasing attacks can remove these signatures without changing the meaning of the text
Detectors face a trade-off between minimizing type-I and type-II errors
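
The trade-off between type-I and type-II errors can be illustrated with a minimal sketch. The scores below are made up for illustration, and the thresholding detector (flag a text as AI-generated when its score is at or above a threshold) is an assumption, not a detector from the paper.

```python
def error_rates(human_scores, ai_scores, threshold):
    """Return (type_I, type_II) for a detector that flags score >= threshold as AI."""
    # Type-I error: human text wrongly flagged as AI-generated.
    type_1 = sum(s >= threshold for s in human_scores) / len(human_scores)
    # Type-II error: AI-generated text that is not flagged.
    type_2 = sum(s < threshold for s in ai_scores) / len(ai_scores)
    return type_1, type_2

# Hypothetical detector scores, for illustration only.
human_scores = [0.1, 0.2, 0.4, 0.6]
ai_scores = [0.3, 0.5, 0.7, 0.9]

for t in (0.25, 0.45, 0.65):
    print(t, error_rates(human_scores, ai_scores, t))
```

Raising the threshold lowers the type-I error but raises the type-II error, and vice versa. When the two score distributions overlap, no threshold drives both to zero.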

Experiments performed on soft watermarking scheme proposed in Kirchenbauer et al. [2023]
Output token of LLM selected from green list determined by its prefix
Paraphrasing used to remove watermark signature from target LLM’s output
Target AI text generator uses transformer-based OPT-1.3B [Zhang et al., 2022] architecture
T5-based [Raffel et al., 2019] paraphrasing model [Damodaran, 2021] with 222M parameters and PEGASUS-based [Zhang et al., 2019] paraphrasing model with 568M parameters used
Both paraphrasing models are lighter than the target OPT-based model
An attacker could use such paraphrasing to spread fake news or misinformation without detection
100 passages from Extreme Summarization (XSum) dataset [Narayan et al., 2018] used for evaluations
PEGASUS-based paraphraser reduces detector accuracy from 97% to 80% at a cost of only 3.5 in perplexity score
T5-based paraphraser reduces detector accuracy from 97% to 57%
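
The green-list idea and its sensitivity to rewriting can be sketched on a toy vocabulary. Everything below is an illustrative assumption rather than the paper's setup: the 50-token vocabulary, the SHA-256 seeding, the green fraction `GAMMA`, the "LLM" that picks a green token with a fixed probability (the real scheme biases logits instead), and `crude_rewrite`, which replaces tokens at random as a crude stand-in for a T5 or PEGASUS paraphraser. The detector is a one-proportion z-test on the count of green tokens. Perplexity is not modeled.

```python
import hashlib
import math
import random

VOCAB = [f"w{i}" for i in range(50)]  # toy vocabulary (assumption)
GAMMA = 0.5                           # fraction of the vocabulary in each green list (assumption)

def green_list(prev_token):
    """Pseudo-random green list determined by the previous token (the prefix)."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    return set(random.Random(seed).sample(VOCAB, int(GAMMA * len(VOCAB))))

def generate(n_tokens, green_bias, rng):
    """Toy 'LLM': uniform over VOCAB, but restricted to green tokens with prob. green_bias."""
    tokens = ["w0"]
    for _ in range(n_tokens):
        if rng.random() < green_bias:
            pool = sorted(green_list(tokens[-1]))
        else:
            pool = VOCAB
        tokens.append(rng.choice(pool))
    return tokens

def z_score(tokens):
    """Detector: z-test on how many tokens fall in the green list of their prefix."""
    t = len(tokens) - 1
    hits = sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))
    return (hits - GAMMA * t) / math.sqrt(t * GAMMA * (1 - GAMMA))

def crude_rewrite(tokens, frac, rng):
    """Stand-in for paraphrasing: replace each token with a random one with prob. frac."""
    out = list(tokens)
    for i in range(1, len(out)):
        if rng.random() < frac:
            out[i] = rng.choice(VOCAB)
    return out

rng = random.Random(0)
watermarked = generate(200, green_bias=0.9, rng=rng)
unmarked = generate(200, green_bias=0.0, rng=rng)
rewritten = crude_rewrite(watermarked, frac=0.8, rng=rng)

print("watermarked z:", round(z_score(watermarked), 2))
print("unmarked    z:", round(z_score(unmarked), 2))
print("rewritten   z:", round(z_score(rewritten), 2))
```

In this toy setting the watermarked sequence has a large z-score, while the rewritten sequence falls back toward the range expected of unmarked text, because replacing a token breaks the green-list relationship with both of its neighbors.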

Non-watermarking detectors use LLM-specific signatures to detect AI-generated texts
Neural network-based detectors are trained or fine-tuned for binary classification
Zero-shot classifiers leverage specific statistical properties of the source LLM outputs
Experiments show non-watermarking detectors are vulnerable to paraphrasing attack
Attack uses pre-trained GPT-2 Medium model and T5-based paraphrasing model
True positive rate of OpenAI’s RoBERTa-Large-Detector drops from 100% to around 80%
Multiple queries to detector can bring down true positive rate to 60%
Perplexity of GPT-2 output text only degrades by 2 with multiple queries to detector

Detecting misuse of language models is difficult
AI-generated text looks increasingly similar to human text
Total variation distance between AI and human-generated text diminishes as language models become more sophisticated
Even the most effective detector performs only marginally better than a random classifier when dealing with a sufficiently advanced language model
Theorem 1 formalizes this statement by showing an upper bound on the area under the ROC curve of an arbitrary detector
As the distance between AI and human-generated text diminishes, the AUROC bound approaches 1/2
Corollary 1 shows that AI-generated text can be made difficult to detect by passing it through a paraphrasing tool
Corollaries 2 and 3 indicate that certain people’s writing may be detected falsely as AI-generated or watermarked
These results demonstrate fundamental limitations for AI-text detectors, with and without watermarking schemes
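
To see how quickly the bound of Theorem 1 collapses toward chance, it helps to evaluate it numerically. In the paper the bound takes the form AUROC(D) ≤ 1/2 + TV(M, H) − TV(M, H)²/2. The function name and the sample TV values below are illustrative choices.

```python
def auroc_upper_bound(tv):
    """Upper bound on AUROC from Theorem 1: 1/2 + TV - TV^2 / 2, for TV in [0, 1]."""
    if not 0.0 <= tv <= 1.0:
        raise ValueError("total variation distance must lie in [0, 1]")
    return 0.5 + tv - tv * tv / 2.0

# Sample total variation distances (illustrative values).
for tv in (1.0, 0.5, 0.1, 0.01, 0.0):
    print(f"TV = {tv:<4}  AUROC <= {auroc_upper_bound(tv):.4f}")
```

At TV = 0.1 the bound is already below 0.6, and at TV = 0 it equals 1/2, the AUROC of a random classifier.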

Theorem 1 is tight
Constructed an AI-text distribution M and a detector D such that the bound holds with equality
Defined sublevel sets of the probability density function of the distribution of human-generated text
Probability of a sequence drawn from M falling in $\Omega_H(0)$ is $\mathrm{TV}(M, H)$
Detector D maps each sequence in $\Omega$ to the negative of the probability density function of H
$\mathrm{TPR}_\gamma = \min(\mathrm{FPR}_\gamma + \mathrm{TV}(M, H),\ 1)$, similar to Equation 3
Calculating the AUROC in a similar fashion as in the previous section shows that the bound of Theorem 1 is attained with equality
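
The AUROC calculation can be written out as a short derivation. Here $x$ stands for the false positive rate, which sweeps from 0 to 1 as the threshold $\gamma$ varies. The notation is an assumption chosen to match the TPR/FPR relation above, and $\mathrm{TV}$ abbreviates $\mathrm{TV}(M, H)$ in the intermediate step.

```latex
\begin{align*}
\mathrm{AUROC}(D)
  &= \int_0^1 \min\bigl(x + \mathrm{TV}(M, H),\ 1\bigr)\,dx \\
  &= \int_0^{1-\mathrm{TV}(M, H)} \bigl(x + \mathrm{TV}(M, H)\bigr)\,dx
     + \int_{1-\mathrm{TV}(M, H)}^{1} 1\,dx \\
  &= \frac{(1-\mathrm{TV})^2}{2} + \mathrm{TV}\,(1-\mathrm{TV}) + \mathrm{TV} \\
  &= \frac{1}{2} + \mathrm{TV}(M, H) - \frac{\mathrm{TV}(M, H)^2}{2}.
\end{align*}
```

The last line matches the upper bound of Theorem 1, so the constructed pair (M, D) attains it.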

AI text detection should have low type-I and type-II errors
Type-I error is when human text is detected as AI-generated
Type-II error is when AI-generated text is not detected
An attacker can generate non-AI text that is detected as AI-generated (spoofing attack)
This can be used to affect the reputation of the target LLM’s developers
Proof-of-concept shows soft watermarking detectors can be spoofed
Soft watermarked texts are composed of green list tokens
An attacker can learn the green lists and use them to compose human-written text that is detected as watermarked
Nonsense sentences generated by an adversarial human can be detected as watermarked
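
The spoofing idea can be sketched on a toy model: the attacker only observes watermarked output, counts which tokens tend to follow each prefix token, and treats the most frequent successors as a proxy green list. All specifics here are assumptions for illustration: the 20-token vocabulary, the SHA-256 seeding, the toy generator, the number of observed tokens, and the attacker knowing the green fraction `GAMMA`. It is not the paper's experimental setup.

```python
import hashlib
import math
import random
from collections import Counter, defaultdict
from functools import lru_cache

VOCAB = [f"w{i}" for i in range(20)]  # toy vocabulary (assumption)
GAMMA = 0.5                           # green-list fraction, assumed known to the attacker

@lru_cache(maxsize=None)
def green_list(prev_token):
    """Secret green list seeded by the previous token (never called by the attacker)."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    return frozenset(random.Random(seed).sample(VOCAB, int(GAMMA * len(VOCAB))))

def watermarked_stream(n_tokens, rng, green_bias=0.9):
    """Toy watermarked 'LLM' that restricts itself to green tokens with prob. green_bias."""
    tokens = ["w0"]
    for _ in range(n_tokens):
        pool = sorted(green_list(tokens[-1])) if rng.random() < green_bias else VOCAB
        tokens.append(rng.choice(pool))
    return tokens

def z_score(tokens):
    """Detector: z-test on how many tokens fall in the green list of their prefix."""
    t = len(tokens) - 1
    hits = sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))
    return (hits - GAMMA * t) / math.sqrt(t * GAMMA * (1 - GAMMA))

def estimate_green_lists(observed):
    """Attacker: for each prefix token, guess that the most frequent successors are green."""
    counts = defaultdict(Counter)
    for prev, tok in zip(observed, observed[1:]):
        counts[prev][tok] += 1
    k = int(GAMMA * len(VOCAB))
    return {prev: [tok for tok, _ in c.most_common(k)] for prev, c in counts.items()}

def spoof(proxy, n_tokens, rng):
    """Compose (nonsense) text by picking tokens from the proxy green lists only."""
    tokens = ["w0"]
    for _ in range(n_tokens):
        candidates = proxy.get(tokens[-1], VOCAB)
        tokens.append(rng.choice(candidates))
    return tokens

rng = random.Random(0)
observed = watermarked_stream(40_000, rng)       # attacker's view of the LLM output
proxy = estimate_green_lists(observed)
spoofed = spoof(proxy, 50, rng)                  # not produced by the 'LLM'
random_text = [rng.choice(VOCAB) for _ in range(51)]

print("spoofed z:", round(z_score(spoofed), 2))
print("random  z:", round(z_score(random_text), 2))
```

In this sketch the spoofed sequence is never produced by the watermarked generator, yet the detector scores it as strongly watermarked, while random text of the same length scores near zero.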

Recent advancements in NLP show that LLMs can generate human-like texts
This can create challenges such as misuse for plagiarism, spamming, or social engineering
This creates demand for efficient LLM text detectors to reduce exploitation
Recent works propose AI text detectors using watermarking, zero-shot methods, and trained neural network-based classifiers
Experiments show that paraphrasing LLM outputs helps evade detectors
Theory demonstrates that for a sufficiently advanced language model, even the best detector can only perform marginally better than a random classifier
Watermarking-based detectors can be spoofed to make human-composed text detected as watermarked
Attackers might choose to break AI detectors in the future using improved paraphrasing models and smart prompting
Future research on AI text detectors must be cautious about these vulnerabilities