Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- LLMs can perform well on various tasks
- Unregulated use of LLMs can lead to malicious consequences
- Detection of AI-generated text is critical
- Recent works attempt to detect AI-generated text
- Paraphrasing attacks can break detectors
- Theoretical impossibility result indicates best-possible detector can only perform marginally better than random classifier
- LLMs protected by watermarking schemes can be vulnerable to spoofing attacks
Paper Content
Introduction
- Artificial Intelligence (AI) has made advances in computer vision and natural language processing
- Large Language Models (LLMs) can generate texts of high quality with potential applications
- AI tools can be misused for unethical purposes such as plagiarism, fake news, and social engineering
- Recent research focuses on detecting AI-generated texts
- Detection works study this problem as a binary classification problem
- Zero-shot AI text detection without additional training overhead is also studied
- Watermarking is used to ease the detection of LLM output text
- AI-text detectors are not reliable in practical scenarios
- Paraphrasing attack can evade various types of detectors
- Best-possible detector can perform only marginally better than a random classifier
- Spoofing attacks on text generative models are possible
- Identifying AI-generated text is important to avoid misuse
- Vulnerable detectors can cause damages such as falsely accusing a human of plagiarism
Evading ai-detectors using paraphrasing attacks
- Detecting AI-generated text is important for LLM security
- AI text detectors can identify LLM signatures in text
- Paraphrasing attacks can remove these signatures without changing the meaning of the text
- Detectors face a trade-off between minimizing type-I and type-II errors
Paraphrasing attacks on watermarked ai-generated text
- Experiments performed on soft watermarking scheme proposed in Kirchenbauer et al. [2023]
- Output token of LLM selected from green list determined by its prefix
- Paraphrasing used to remove watermark signature from target LLM’s output
- Target AI text generator uses transformer-based OPT-1.3B [Zhang et al., 2022] architecture
- T5-based [Raffel et al., 2019] paraphrasing model [Damodaran, 2021] with 222M parameters and PEGASUS-based [Zhang et al., 2019] paraphrasing model with 568M parameters used
- Paraphrasing model lighter than target OPT-based model
- Paraphrasing used to spread fake news or misinformation without detection
- 100 passages from Extreme Summarization (XSum) dataset [Narayan et al., 2018] used for evaluations
- Paraphrasing reduces detector accuracy from 97% to 80% with 3.5 in perplexity score
- T5-based paraphraser reduces detector accuracy from 97% to 57%
Paraphrasing attacks on non-watermarked ai-generated texts
- Non-watermarking detectors use LLM-specific signatures to detect AI-generated texts
- Neural network-based detectors are trained or fine-tuned for binary classification
- Zero-shot classifiers leverage specific statistical properties of the source LLM outputs
- Experiments show non-watermarking detectors are vulnerable to paraphrasing attack
- Attack uses pre-trained GPT-2 Medium7 model and T5-based paraphrasing model
- True positive rate of OpenAI’s RoBERTa-Large-Detector drops from 100% to around 80%
- Multiple queries to detector can bring down true positive rate to 60%
- Perplexity of GPT-2 output text only degrades by 2 with multiple queries to detector
Impossibility results for reliable detection of ai-generated text
- Detecting misuse of language models is difficult
- AI-generated text looks increasingly similar to human text
- Total variation distance between AI and human-generated text diminishes as language models become more sophisticated
- Even the most effective detector performs only marginally better than a random classifier when dealing with a sufficiently advanced language model
- Theorem 1 formalizes this statement by showing an upper bound on the area under the ROC curve of an arbitrary detector
- As the distance between AI and human-generated text diminishes, the AUROC bound approaches 1/2
- Corollary 1 shows that AI-generated text can be made difficult to detect by passing it through a paraphrasing tool
- Corollary 2 and 3 indicate that certain people’s writing may be detected falsely as AI-generated or watermarked
- These results demonstrate fundamental limitations for AI-text detectors, with and without watermarking schemes
Tightness analysis
- Theorem 1 is tight
- Constructed an AI-text distribution M and a detector D such that the bound holds with equality
- Defined sublevel sets of the probability density function of the distribution of human-generated text
- Probability of a sequence drawn from M falling in Ω H (0) is TV(M, H)
- Detector D maps each sequence in Ω to the negative of the probability density function of H
- TPR γ = min(FPR γ + TV(M, H), 1) similar to Equation 3
- Calculating the AUROC in a similar fashion as in the previous section
Spoofing attacks on ai-text generative models
- AI text detection should have low type-I and type-II errors
- Type-I error is when human text is detected as AI-generated
- Type-II error is when AI-generated text is not detected
- An attacker can generate non-AI text that is detected as AI-generated (spoofing attack)
- This can be used to affect the reputation of the target LLM’s developers
- Proof-of-concept shows soft watermarking detectors can be spoofed
- Soft watermarked texts are composed of green list tokens
- Attacker can learn green lists to generate human-written texts that are detected as watermarked
- Nonsense sentences generated by an adversarial human can be detected as watermarked
Discussion
- Recent advancements in NLP show that LLMs can generate human-like texts
- This can create challenges such as misuse for plagiarism, spamming, or social engineering
- Demand for efficient LLM text detectors to reduce exploitation
- Recent works propose AI text detectors using watermarking, zero-shot methods, and trained neural network-based classifiers
- Experiments show that paraphrasing LLM outputs helps evade detectors
- Theory demonstrates that for a sufficiently advanced language model, even the best detector can only perform marginally better than a random classifier
- Watermarking-based detectors can be spoofed to make human-composed text detected as watermarked
- Attackers might choose to break AI detectors in the future using improved paraphrasing models and smart prompting
- Future research on AI text detectors must be cautious about these vulnerabilities