Abstract

  • LLMs can perform well on various tasks
  • Unregulated use of LLMs can lead to malicious consequences
  • Detection of AI-generated text is critical
  • Recent works attempt to detect AI-generated text
  • Paraphrasing attacks can break detectors
  • A theoretical impossibility result indicates that, for a sufficiently advanced language model, even the best-possible detector can perform only marginally better than a random classifier
  • LLMs protected by watermarking schemes can be vulnerable to spoofing attacks

Paper Content

Introduction

  • Artificial Intelligence (AI) has made advances in computer vision and natural language processing
  • Large Language Models (LLMs) can generate texts of high quality with potential applications
  • AI tools can be misused for unethical purposes such as plagiarism, fake news, and social engineering
  • Recent research focuses on detecting AI-generated texts
  • Detection works study this problem as a binary classification problem
  • Zero-shot AI text detection without additional training overhead is also studied
  • Watermarking is used to ease the detection of LLM output text
  • AI-text detectors are not reliable in practical scenarios
  • Paraphrasing attack can evade various types of detectors
  • For a sufficiently advanced language model, even the best-possible detector can perform only marginally better than a random classifier
  • Spoofing attacks on text generative models are possible
  • Identifying AI-generated text is important to avoid misuse
  • Vulnerable detectors can cause damages such as falsely accusing a human of plagiarism

Evading AI-detectors using paraphrasing attacks

  • Detecting AI-generated text is important for LLM security
  • AI text detectors can identify LLM signatures in text
  • Paraphrasing attacks can remove these signatures without changing the meaning of the text
  • Detectors face a trade-off between minimizing type-I and type-II errors
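
The trade-off in the last bullet can be made concrete with a small sketch. It assumes a detector that outputs a score and flags text as AI-generated when the score reaches a threshold; the scores below are made up for illustration and do not come from the paper.

```python
def error_rates(human_scores, ai_scores, threshold):
    """Return (type_I, type_II) for a detector that flags score >= threshold as AI."""
    # Type-I error: human text wrongly flagged as AI-generated
    type_i = sum(s >= threshold for s in human_scores) / len(human_scores)
    # Type-II error: AI-generated text that is not flagged
    type_ii = sum(s < threshold for s in ai_scores) / len(ai_scores)
    return type_i, type_ii

# Hypothetical detector scores, for illustration only
human_scores = [0.1, 0.2, 0.4, 0.6]
ai_scores = [0.3, 0.5, 0.7, 0.9]

for threshold in (0.25, 0.45, 0.65):
    print(threshold, error_rates(human_scores, ai_scores, threshold))
```

Raising the threshold lowers the type-I error but raises the type-II error, and lowering it does the opposite. When the two score distributions overlap, no threshold drives both to zero.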

Paraphrasing attacks on watermarked AI-generated text

  • Experiments performed on soft watermarking scheme proposed in Kirchenbauer et al. [2023]
  • Each output token of the LLM is preferentially selected from a green list determined by its prefix
  • Paraphrasing used to remove watermark signature from target LLM’s output
  • Target AI text generator uses transformer-based OPT-1.3B [Zhang et al., 2022] architecture
  • T5-based [Raffel et al., 2019] paraphrasing model [Damodaran, 2021] with 222M parameters and PEGASUS-based [Zhang et al., 2019] paraphrasing model with 568M parameters used
  • Paraphrasing model lighter than target OPT-based model
  • An attacker could use such paraphrasing to spread fake news or misinformation without detection
  • 100 passages from Extreme Summarization (XSum) dataset [Narayan et al., 2018] used for evaluations
  • PEGASUS-based paraphraser reduces detector accuracy from 97% to 80% at a cost of only 3.5 in perplexity score
  • T5-based paraphraser reduces detector accuracy from 97% to 57%

Paraphrasing attacks on non-watermarked AI-generated texts

  • Non-watermarking detectors use LLM-specific signatures to detect AI-generated texts
  • Neural network-based detectors are trained or fine-tuned for binary classification
  • Zero-shot classifiers leverage specific statistical properties of the source LLM outputs
  • Experiments show non-watermarking detectors are vulnerable to paraphrasing attack
  • Attack uses pre-trained GPT-2 Medium model and T5-based paraphrasing model
  • True positive rate of OpenAI’s RoBERTa-Large-Detector drops from 100% to around 80%
  • Multiple queries to detector can bring down true positive rate to 60%
  • Perplexity of GPT-2 output text only degrades by 2 with multiple queries to detector

Impossibility results for reliable detection of AI-generated text

  • Detecting misuse of language models is difficult
  • AI-generated text looks increasingly similar to human text
  • Total variation distance between AI and human-generated text diminishes as language models become more sophisticated
  • Even the most effective detector performs only marginally better than a random classifier when dealing with a sufficiently advanced language model
  • Theorem 1 formalizes this statement by showing an upper bound on the area under the ROC curve of an arbitrary detector
  • As the distance between AI and human-generated text diminishes, the AUROC bound approaches 1/2
  • Corollary 1 shows that AI-generated text can be made difficult to detect by passing it through a paraphrasing tool
  • Corollaries 2 and 3 indicate that certain people’s writing may be detected falsely as AI-generated or watermarked
  • These results demonstrate fundamental limitations for AI-text detectors, with and without watermarking schemes
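
To see how quickly the bound in Theorem 1 approaches 1/2, it can be evaluated numerically. The paper states the bound as AUROC(D) ≤ 1/2 + TV(M, H) − TV(M, H)²/2; the function name `auroc_upper_bound` and the sample distances below are illustrative choices.

```python
def auroc_upper_bound(tv: float) -> float:
    """Upper bound on any detector's AUROC given total variation distance tv."""
    if not 0.0 <= tv <= 1.0:
        raise ValueError("total variation distance must lie in [0, 1]")
    return 0.5 + tv - tv**2 / 2

for tv in (0.0, 0.1, 0.5, 1.0):
    print(f"TV = {tv:.1f} -> AUROC <= {auroc_upper_bound(tv):.3f}")
```

At TV = 0 the bound is exactly 1/2, the AUROC of a random classifier, and only at TV = 1 does it allow a perfect detector.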

Tightness analysis

  • Theorem 1 is tight
  • Constructed an AI-text distribution M and a detector D such that the bound holds with equality
  • Defined sublevel sets of the probability density function of the distribution of human-generated text
  • Probability of a sequence drawn from M falling in Ω_H(0) is TV(M, H)
  • Detector D maps each sequence in Ω to the negative of the probability density function of H
  • TPR_γ = min(FPR_γ + TV(M, H), 1), similar to Equation 3
  • Computing the AUROC as in the previous section shows that D attains the bound of Theorem 1 with equality
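
The AUROC calculation for this construction can be written out as follows, integrating the true positive rate over the false positive rate x and abbreviating TV(M, H) as TV. This is a reconstruction from the TPR expression above, not a quotation of the paper's proof.

```latex
\begin{aligned}
\mathrm{AUROC}(D)
  &= \int_0^1 \min\bigl(x + \mathrm{TV},\, 1\bigr)\,dx \\
  &= \int_0^{1-\mathrm{TV}} \bigl(x + \mathrm{TV}\bigr)\,dx
     + \int_{1-\mathrm{TV}}^{1} 1\,dx \\
  &= \frac{(1-\mathrm{TV})^2}{2} + \mathrm{TV}\,(1-\mathrm{TV}) + \mathrm{TV} \\
  &= \frac{1}{2} + \mathrm{TV} - \frac{\mathrm{TV}^2}{2}
\end{aligned}
```

The final expression equals the upper bound of Theorem 1, so no tighter bound that depends only on TV(M, H) is possible.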

Spoofing attacks on AI-text generative models

  • AI text detection should have low type-I and type-II errors
  • Type-I error is when human text is detected as AI-generated
  • Type-II error is when AI-generated text is not detected
  • An attacker can generate non-AI text that is detected as AI-generated (spoofing attack)
  • This can be used to affect the reputation of the target LLM’s developers
  • Proof-of-concept shows soft watermarking detectors can be spoofed
  • Soft watermarked texts are composed mostly of green list tokens
  • Attacker can learn green lists to generate human-written texts that are detected as watermarked
  • Nonsense sentences generated by an adversarial human can be detected as watermarked
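
A minimal sketch of the spoofing idea, under strong simplifying assumptions: a 20-word toy vocabulary, a hard watermark that only ever emits green tokens, and an attacker who records which tokens follow each prefix in the model's output. The paper's proof of concept targets a soft watermark and estimates green lists from token-pair frequencies, so all names and parameters here are hypothetical.

```python
import hashlib
import random
from collections import defaultdict

VOCAB = [f"w{i}" for i in range(20)]  # hypothetical toy vocabulary
GAMMA = 0.5  # assumed green-list fraction

def green_list(prev):
    """The watermark's secret: green tokens for a given prefix token."""
    seed = int(hashlib.sha256(prev.encode()).hexdigest(), 16)
    return set(random.Random(seed).sample(VOCAB, int(GAMMA * len(VOCAB))))

def watermarked_next(prev, rng):
    """Toy watermarked model: samples the next token from the green list only."""
    return rng.choice(sorted(green_list(prev)))

def learn_green_lists(num_queries, rng):
    """Attacker: record which tokens the model emits after each prefix."""
    observed = defaultdict(set)
    for _ in range(num_queries):
        prev = rng.choice(VOCAB)
        observed[prev].add(watermarked_next(prev, rng))
    return observed

def spoof(length, learned, rng):
    """Compose text from the learned green tokens, without using the model."""
    tokens = [rng.choice(VOCAB)]
    for _ in range(length):
        options = learned.get(tokens[-1]) or VOCAB  # fall back if prefix unseen
        tokens.append(rng.choice(sorted(options)))
    return tokens

def green_fraction(tokens):
    """Share of tokens that the real watermark detector would count as green."""
    hits = sum(tokens[i] in green_list(tokens[i - 1]) for i in range(1, len(tokens)))
    return hits / (len(tokens) - 1)

rng = random.Random(0)
learned = learn_green_lists(5000, rng)
spoofed = spoof(100, learned, rng)
print(green_fraction(spoofed))
```

The attacker never calls `green_list` directly. It is used only by the toy model and by the final check, which shows that the hand-composed sequence would look watermarked to the detector.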

Discussion

  • Recent advancements in NLP show that LLMs can generate human-like texts
  • This can create challenges such as misuse for plagiarism, spamming, or social engineering
  • This creates demand for efficient LLM text detectors to reduce exploitation
  • Recent works propose AI text detectors using watermarking, zero-shot methods, and trained neural network-based classifiers
  • Experiments show that paraphrasing LLM outputs helps evade detectors
  • Theory demonstrates that for a sufficiently advanced language model, even the best detector can only perform marginally better than a random classifier
  • Watermarking-based detectors can be spoofed to make human-composed text detected as watermarked
  • Attackers might choose to break AI detectors in the future using improved paraphrasing models and smart prompting
  • Future research on AI text detectors must be cautious about these vulnerabilities