Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Detection algorithms have been proposed to identify AI-generated text
  • A 11B parameter paraphrase generation model (DIPPER) was trained to paraphrase paragraphs
  • Several detectors were tested and found to be evaded by DIPPER
  • A defense was introduced to increase robustness of AI-generated text detection to paraphrase attacks
  • Code, model and data will be open sourced for future research

Paper Content

Introduction

  • LLMs can write coherent and relevant longform text
  • Fears of malicious applications such as fake news and homework answers
  • Algorithms proposed to detect machine-generated text
  • Unclear how robust these algorithms are to paraphrase attacks
  • Demonstrate vulnerability of existing detectors to paraphrase attacks
  • Train 11B parameter paraphrase generation model called DIPPER
  • DIPPER can paraphrase paragraph-length texts
  • DIPPER has two features to help evade AI-generated text detectors
  • Attack several recently proposed AI-generated text detection algorithms
  • Experiments show all detection algorithms misclassify AI-generated texts
  • Propose to use retrieval methods to detect AI-generated text
  • 97.3% of PG19 paraphrases and 80.4% of Wikipedia paraphrases detected
  • Release DIPPER model, training dataset, and codebase

Watermarking language model outputs

  • A watermark is a modification to generated text that can be detected by a computer but not by humans.
  • Watermarks are difficult to remove and don’t affect the quality of the text.
  • Previous research has attempted to watermark natural language using syntax tree manipulations.
  • Kirchenbauer et al. (2023) proposed an algorithm to add watermarks using logits and a hash function.

Statistical outlier detection methods

  • Outlier detection algorithms attempt to distinguish between human-written and machine-generated text.
  • Early methods detect statistical irregularities in measures such as entropy, perplexity, and n-gram frequencies.
  • GLTR, GPTZero, and Detect-GPT are tools developed to assist in detecting machine-generated text.

Classifiers

  • Detection methods rely on classifiers to distinguish human-written text from machine-generated text.
  • Early efforts used classifiers to detect fake reviews and fake news.
  • Studies examine classification performance and decoding strategies.

Comparison to sadasivan et al. (2023)

  • Sadasivan et al. (2023) demonstrated the utility of paraphrasing attacks against AI-generated text detectors
  • DIPPER has advanced discourse-level rewriting capabilities and fine-grained diversity control
  • Experiments encompass more tasks, datasets, and detection algorithms, and evaluate larger language models
  • Retrieval-based defense directly contradicts the “impossibility result” of Sadasivan et al. (2023)

Building a controllable discourse paraphraser

  • Paraphrasing can be used to fool existing machine-generated text detection techniques.
  • Paraphrasing can change the statistical properties of model generated text.
  • Paraphrasing must be able to handle context and be controllable.
  • Paraphrasing must not change the semantics of the input.
  • Paraphrasing must be implemented with a model different from the watermarked model.

Constructing paraphrase data

  • Process involves fine-tuning a large language model on a parallel dataset of paragraph-level paraphrases
  • Objective is to develop a model capable of paraphrasing any given sequence of sentences
  • Leverage PAR3 dataset to train DIPPER
  • Align sentences of two translations using semantic similarity
  • Choose subset of alignments
  • Shuffle sentences, compute control codes
  • Map shuffled sentences to original sentences using context and control codes
  • Inference time: paraphrase arbitrary sequence of sentences by marking with tags and assigning values to control codes

Model and training details

  • Model is a Transformer neural network
  • Model is initialized with T5-XXL checkpoint
  • Model is fine-tuned on paraphrase generation data
  • Maximum of 3 consecutive sentences paraphrased at a time
  • Model implemented in JAX using T5X library
  • Nucleus sampling used at inference time
  • Experiments attacking existing AI-generated text detectors
  • Metrics, models, and detectors detailed in Sections 4.1 and 4.2
  • Results discussed in Section 4.3
  • Paraphrasing easily evades all detectors across 3 language models and 2 tasks

Evaluation metrics

  • Used two metrics to evaluate attack success rates after paraphrasing: detection accuracy and semantic similarity
  • Detection accuracy measured by AUC-ROC metric with false positive rate set to 1%
  • Semantic similarity measured by P-SP model from Wieting et al. (2022)
  • Semantics considered to be preserved if P-SP score is higher than 0.76

Models, datasets & detection algorithms

  • Conducted attacks on 3 language models of varying sizes
  • Sampled generations 300 tokens long
  • Experimented with 2 text generation tasks with long-form outputs
  • Used 5 AI detection algorithms
  • Paraphrased text using L = 20, 40, 60 and O = 0, 60
  • Truncated machine-generated, paraphrased, and human-written sequences to the length of the shortest
  • Paraphrased long sequences 3 sentences at a time

Attacking ai-generated text detectors

  • Paraphrasing significantly lowers detection accuracy across all diversity control codes.
  • Paraphrasing reduces watermark detection accuracy from 100% to 57.2%.
  • Paraphrasing reduces OpenAI’s text classifier accuracy from 30.0% to 15.6%.
  • Non-watermarking detectors are generally ineffective.

Alternative paraphrasing attacks

  • Paraphrasing multiple times can improve the effectiveness of a paraphrase attack.
  • Non-DIPPER paraphrasers can be used to evade detection, but have lower quality.
  • Large language models can be used to perform few-shot contextual paraphrasing, but may be detectable.
  • Retrieval over previously-generated sequences can be used as a defense against paraphrase attacks.
  • Controlled comparisons and a large retrieval corpus of 15M generations were used to evaluate the method.

Formulating the retrieval defense

  • LLM API takes prompt x as input and returns a continuation y
  • Database Y is dynamically updated and stored on the API side
  • Querying the database to check if y was generated by the API
  • Detection score is computed using maximum similarity score
  • Two choices for retriever f ret: P-SP and BM25
  • Detection accuracy of paraphrases is high at 1% FPR

Controlled comparisons of retrieval with other ai-generated text detectors

  • Conducted a controlled comparison between detection algorithms and retrieval method
  • Constructed two retrieval corpora for experiment
  • Performed retrieval using original AI-generated text, its paraphrase, and human-written text as queries
  • Retrieval is a much more effective detector than baseline detectors
  • Retrieval has a 100% detection accuracy on unperturbed machine-generated text
  • Retrieval with BM25 is quite effective on paraphrased text
  • BM25 is a more effective retriever than P-SP

Is retrieval an effective detector with a large retrieval corpus?

  • Retrieval-based detectors scale well with larger corpus sizes.
  • Detection accuracy remains consistently high across different corpus sizes.
  • BM25 outperforms P-SP across different retrieval corpus sizes.
  • Detection works best with 50 or more tokens of generated text.

Ideas to make retrieval detection work well at an even larger scale

Limitations of retrieval for detection

  • Detection is specific to an API
  • Retrieval infrastructure needed for large databases
  • False positives due to training data memorization
  • Privacy concerns
  • Slight reduction in accuracy with large databases
  • Tasks with constrained output space or short outputs difficult to detect

Experiments measuring intrinsic paraphrase generation quality

  • Experiments focused on attacking AI-generated text detectors with paraphrases and defending against these attacks
  • DIPPER used as underlying paraphrase generation model
  • Ablation experiments and human evaluations conducted to validate effectiveness of DIPPER
  • DIPPER leverages context from outside of text to be paraphrased
  • DIPPER can paraphrase multiple sentences at once
  • Ablated version of DIPPER (DIPPER-no-ctx) used to compare paraphrasing 3 sentences at a time vs 1 sentence at a time
  • Paraphrasing multiple sentences at a time produces higher quality paraphrases
  • Contextual paraphrasing leads to higher quality paraphrases
  • GPT3.5 and RANKGEN usually prefer multi-sentence paraphrases over single-sentence
  • Human evaluation used to evaluate semantic fidelity of paraphrases
  • Over 80% of the time, annotators rate DIPPER’s paraphrases as nearly equivalent or approximately equivalent
  • DIPPER leverages information from context to increase diversity while maintaining coherence
  • Shortcoming of DIPPER is that it can modify unique proper nouns when using high lexical code

Conclusion

  • We present DIPPER, a discourse paraphrase generation model that can rewrite multiple sentences of text and use context.
  • We use DIPPER to test current AI-generated text detectors and find that DIPPER paraphrases easily evade these detectors.
  • We propose a retrieval-based mechanism to defend against such paraphrase attacks.
  • Experiments show that this defense significantly outperforms baseline detectors on paraphrased text.
  • We survey 25 papers on paraphrase generation from 2018 to 2022 and find that only 3 can paraphrase multiple sentences at once, none can merge or split sentences, none use context, and 14 offer ways to customize diversity.
  • DIPPER combines all desiderata into one model and offers intuitive control knobs for lexical and syntactic diversity.
  • Automatic and human evaluation show that DIPPER can efficiently leverage context information and reorganize sentences while having high fidelity in meaning.
  • We discuss related work on contextual machine translation, text simplification, and machine translation.
  • We provide an appendix section with further information on our human evaluation study.