Abstract
- Detection algorithms have been proposed to identify AI-generated text
- An 11B-parameter paraphrase generation model (DIPPER) was trained to paraphrase paragraphs
- Several detectors were tested and found to be evaded by DIPPER
- A defense was introduced to increase robustness of AI-generated text detection to paraphrase attacks
- Code, model and data will be open sourced for future research
Paper Content
Introduction
- LLMs can write coherent and relevant longform text
- Fears of malicious applications such as fake news and homework answers
- Algorithms proposed to detect machine-generated text
- Unclear how robust these algorithms are to paraphrase attacks
- Demonstrate vulnerability of existing detectors to paraphrase attacks
- Train 11B parameter paraphrase generation model called DIPPER
- DIPPER can paraphrase paragraph-length texts
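- DIPPER has two features that help it evade AI-generated text detectors: it paraphrases multi-sentence text using surrounding context, and it offers control over output diversity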
- Attack several recently proposed AI-generated text detection algorithms
- Experiments show that all tested detection algorithms misclassify paraphrased AI-generated texts
- Propose to use retrieval methods to detect AI-generated text
- 97.3% of PG19 paraphrases and 80.4% of Wikipedia paraphrases detected
- Release DIPPER model, training dataset, and codebase
Watermarking language model outputs
- A watermark is a modification to generated text that can be detected by a computer but not by humans.
- Effective watermarks should be difficult to remove and should not affect the quality of the text.
- Previous research has attempted to watermark natural language using syntax tree manipulations.
- Kirchenbauer et al. (2023) proposed an algorithm to add watermarks using logits and a hash function.
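To make the "logits and a hash function" idea concrete, here is a minimal toy sketch of a green-list watermark in the spirit of Kirchenbauer et al. (2023). It is not their implementation: the vocabulary size, the green-list fraction `GAMMA`, the SHA-256 hashing scheme, and all function names are assumptions chosen for illustration, and the toy generator always picks green tokens instead of adding a bias to real model logits.

```python
from __future__ import annotations

import hashlib
import math
import random

VOCAB_SIZE = 1000  # hypothetical toy vocabulary size
GAMMA = 0.5        # assumed fraction of the vocabulary placed on the green list


def green_list(prev_token: int) -> set[int]:
    """Seed a PRNG with a hash of the previous token and pick the green tokens."""
    digest = hashlib.sha256(str(prev_token).encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(VOCAB_SIZE), int(GAMMA * VOCAB_SIZE)))


def watermark_z_score(tokens: list[int]) -> float:
    """One-proportion z-test on the number of tokens that fall on the green list."""
    scored = len(tokens) - 1  # the first token has no predecessor to hash
    hits = sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))
    return (hits - GAMMA * scored) / math.sqrt(scored * GAMMA * (1 - GAMMA))


def generate_watermarked(length: int, seed: int = 0) -> list[int]:
    """Toy generator that only ever emits green tokens (a 'hard' watermark)."""
    rng = random.Random(seed)
    tokens = [rng.randrange(VOCAB_SIZE)]
    for _ in range(length - 1):
        tokens.append(rng.choice(sorted(green_list(tokens[-1]))))
    return tokens
```

A detector would flag text whose z-score exceeds some cutoff. Under this scheme, a paraphraser that replaces many tokens pushes the green-token count back toward chance, which lowers the z-score.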
Statistical outlier detection methods
- Outlier detection algorithms attempt to distinguish between human-written and machine-generated text.
- Early methods detect statistical irregularities in measures such as entropy, perplexity, and n-gram frequencies.
- GLTR, GPTZero, and Detect-GPT are tools developed to assist in detecting machine-generated text.
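As a rough illustration of the perplexity signal mentioned above, the sketch below computes perplexity from per-token log-probabilities and flags text whose perplexity falls below a cutoff. The function names and the cutoff value are hypothetical, and the tools named above use more elaborate statistics; this only shows the basic intuition that text a scoring model finds very predictable is more likely to be machine-generated.

```python
import math


def perplexity(token_logprobs):
    """Perplexity is the exponential of the average negative log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def looks_machine_generated(token_logprobs, cutoff=10.0):
    """Hypothetical rule: flag text that the scoring model finds very predictable."""
    return perplexity(token_logprobs) < cutoff
```

In practice the log-probabilities would come from a language model scoring the candidate text, and the cutoff would be tuned on held-out human-written text.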
Classifiers
- Another class of detection methods trains classifiers to distinguish human-written text from machine-generated text.
- Early efforts used classifiers to detect fake reviews and fake news.
- Studies examine classification performance and decoding strategies.
Comparison to Sadasivan et al. (2023)
- Sadasivan et al. (2023) demonstrated the utility of paraphrasing attacks against AI-generated text detectors
- DIPPER has advanced discourse-level rewriting capabilities and fine-grained diversity control
- Experiments encompass more tasks, datasets, and detection algorithms, and evaluate larger language models
- Retrieval-based defense directly contradicts the “impossibility result” of Sadasivan et al. (2023)
Building a controllable discourse paraphraser
- Paraphrasing can be used to fool existing machine-generated text detection techniques.
- Paraphrasing can change the statistical properties of model generated text.
- Paraphrasing must be able to handle context and be controllable.
- Paraphrasing must not change the semantics of the input.
- Paraphrasing must be implemented with a model different from the watermarked model.
Constructing paraphrase data
- Process involves fine-tuning a large language model on a parallel dataset of paragraph-level paraphrases
- Objective is to develop a model capable of paraphrasing any given sequence of sentences
- Leverage PAR3 dataset to train DIPPER
- Align sentences of two translations using semantic similarity
- Choose subset of alignments
- Shuffle sentences, compute control codes
- Map shuffled sentences to original sentences using context and control codes
- Inference time: paraphrase arbitrary sequence of sentences by marking with tags and assigning values to control codes
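The steps above can be sketched as follows. This is an illustration under stated assumptions, not the released preprocessing code: the `<p>` tag, the `lexical = ..., order = ...` prefix layout, the word-overlap proxy for the lexical code, and the bucket size of 20 are all hypothetical choices made to show the shape of the data.

```python
def lexical_code(source: str, target: str, bucket: int = 20) -> int:
    """Assumed proxy: percentage of source words missing from the target,
    rounded to the nearest bucket."""
    src = set(source.lower().split())
    tgt = set(target.lower().split())
    if not src:
        return 0
    changed = 100 * len(src - tgt) / len(src)
    return int(bucket * round(changed / bucket))


def build_paraphraser_input(sentences, start, end, lexical, order):
    """Mark sentences[start:end] with tags and keep the rest as context."""
    if not (0 <= lexical <= 100 and 0 <= order <= 100):
        raise ValueError("control codes are assumed to lie in [0, 100]")
    left = " ".join(sentences[:start])
    target = " ".join(sentences[start:end])
    right = " ".join(sentences[end:])
    parts = [f"lexical = {lexical}, order = {order}", left, f"<p> {target} </p>", right]
    return " ".join(part for part in parts if part)
```

During training the control codes would be computed from each aligned pair, while at inference time the user picks them directly to request more or less diverse paraphrases.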
Model and training details
- Model is a Transformer neural network
- Model is initialized with T5-XXL checkpoint
- Model is fine-tuned on paraphrase generation data
- Maximum of 3 consecutive sentences paraphrased at a time
- Model implemented in JAX using T5X library
- Nucleus sampling used at inference time
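For readers unfamiliar with nucleus (top-p) sampling, here is a minimal sketch over a toy next-token distribution. The function name, the dictionary-based distribution, and the `top_p` value used in the usage example are assumptions for illustration; the summary above does not state which value was used.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample from the smallest set of top tokens whose total probability
    reaches top_p, renormalizing within that set."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus, mass = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        mass += p
        if mass >= top_p:
            break
    tokens, weights = zip(*nucleus)
    # random.choices normalizes the weights, so no explicit rescaling is needed
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)
toy_distribution = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
samples = [nucleus_sample(toy_distribution, top_p=0.75, rng=rng) for _ in range(200)]
```

With `top_p=0.75` only "a" and "b" remain in the nucleus, so low-probability tokens such as "d" are never sampled.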
Experiments attacking existing AI-generated text detectors
- Metrics, models, and detectors detailed in Sections 4.1 and 4.2
- Results discussed in Section 4.3
- Paraphrasing easily evades all detectors across 3 language models and 2 tasks
Evaluation metrics
- Used two metrics to evaluate attack success rates after paraphrasing: detection accuracy and semantic similarity
- Detection accuracy measured as the true positive rate at a fixed false positive rate of 1%, used in place of the AUC-ROC metric
- Semantic similarity measured by P-SP model from Wieting et al. (2022)
- Semantics considered to be preserved if P-SP score is higher than 0.76
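The two metrics can be sketched as below, assuming each detector outputs a score where higher means "more likely AI-generated". The function names and the way the threshold is picked from human-written scores are illustrative assumptions; only the 1% false positive rate and the 0.76 P-SP cutoff come from the text.

```python
import math

P_SP_THRESHOLD = 0.76  # from the text: semantics count as preserved above this


def threshold_at_fpr(human_scores, fpr=0.01):
    """Pick a threshold so that at most `fpr` of human texts score above it."""
    ranked = sorted(human_scores, reverse=True)
    allowed = math.floor(fpr * len(ranked))  # human texts we may wrongly flag
    return ranked[allowed]


def detection_accuracy(ai_scores, human_scores, fpr=0.01):
    """True positive rate on AI-generated text at the fixed false positive rate."""
    threshold = threshold_at_fpr(human_scores, fpr)
    return sum(score > threshold for score in ai_scores) / len(ai_scores)


def semantics_preserved(p_sp_score):
    """A paraphrase counts as meaning-preserving if its P-SP score beats 0.76."""
    return p_sp_score > P_SP_THRESHOLD
```

An attack is then considered successful when the paraphrase lowers detection accuracy while `semantics_preserved` still holds.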
Models, datasets & detection algorithms
- Conducted attacks on 3 language models of varying sizes
- Sampled generations 300 tokens long
- Experimented with 2 text generation tasks with long-form outputs
- Used 5 AI detection algorithms
- Paraphrased text using lexical diversity codes L = 20, 40, 60 and order diversity codes O = 0, 60
- Truncated machine-generated, paraphrased, and human-written sequences to the length of the shortest
- Paraphrased long sequences 3 sentences at a time
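Paraphrasing a long sequence three sentences at a time might look like the sketch below. `paraphrase_fn` is a hypothetical stand-in for a call to the paraphraser, and passing the already-paraphrased prefix as context is an assumption about how the windows are chained.

```python
def paraphrase_in_windows(sentences, paraphrase_fn, window=3):
    """Paraphrase `sentences` in consecutive chunks of `window` sentences.

    paraphrase_fn(context, chunk) is a hypothetical callable that returns a
    paraphrase of `chunk`, given the text produced so far as `context`.
    """
    output = []
    for start in range(0, len(sentences), window):
        chunk = " ".join(sentences[start:start + window])
        context = " ".join(output)  # everything paraphrased so far
        output.append(paraphrase_fn(context, chunk))
    return " ".join(output)
```

A sequence of seven sentences would therefore be processed in three calls covering three, three, and one sentence.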
Attacking AI-generated text detectors
- Paraphrasing significantly lowers detection accuracy across all diversity control codes.
- Paraphrasing reduces watermark detection accuracy from 100% to 57.2%.
- Paraphrasing reduces OpenAI’s text classifier accuracy from 30.0% to 15.6%.
- Non-watermarking detectors are generally ineffective.
Alternative paraphrasing attacks
- Paraphrasing multiple times can improve the effectiveness of a paraphrase attack.
- Non-DIPPER paraphrasers can be used to evade detection, but have lower quality.
- Large language models can be used to perform few-shot contextual paraphrasing, but may be detectable.
- Retrieval over previously-generated sequences can be used as a defense against paraphrase attacks.
- Controlled comparisons and a large retrieval corpus of 15M generations were used to evaluate the method.
Formulating the retrieval defense
- LLM API takes prompt x as input and returns a continuation y
- Database Y is dynamically updated and stored on the API side
- Querying the database to check if y was generated by the API
- Detection score is computed using maximum similarity score
- Two choices for retriever f_ret: P-SP and BM25
- Detection accuracy of paraphrases is high at 1% FPR
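The formulation above can be sketched as a small class. Note the swap: unigram F1 overlap is used here as a simple stand-in similarity, not P-SP or BM25, and the class name, method names, and threshold value are hypothetical. In a real deployment the threshold would be tuned to reach the target false positive rate.

```python
from collections import Counter


def unigram_f1(a, b):
    """Stand-in similarity: F1 over shared word counts (not P-SP or BM25)."""
    tokens_a, tokens_b = Counter(a.lower().split()), Counter(b.lower().split())
    overlap = sum((tokens_a & tokens_b).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(tokens_a.values())
    recall = overlap / sum(tokens_b.values())
    return 2 * precision * recall / (precision + recall)


class RetrievalDetector:
    """Keeps every generation the API returns and scores candidates against it."""

    def __init__(self, threshold):
        self.database = []  # plays the role of the database Y
        self.threshold = threshold

    def record(self, generation):
        """Called by the API each time it returns a continuation y."""
        self.database.append(generation)

    def score(self, candidate):
        """Detection score: maximum similarity to any stored generation."""
        return max((unigram_f1(candidate, y) for y in self.database), default=0.0)

    def is_generated(self, candidate):
        return self.score(candidate) > self.threshold


detector = RetrievalDetector(threshold=0.5)
detector.record("the quick brown fox jumps over the lazy dog")
paraphrase = "the fast brown fox leaps over the lazy dog"
unrelated = "stock markets fell sharply on monday"
```

Here `detector.is_generated(paraphrase)` is true because most words survive the rewrite, while the unrelated sentence shares no words with the database and scores 0.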
Controlled comparisons of retrieval with other AI-generated text detectors
- Conducted a controlled comparison between detection algorithms and retrieval method
- Constructed two retrieval corpora for experiment
- Performed retrieval using original AI-generated text, its paraphrase, and human-written text as queries
- Retrieval is a much more effective detector than baseline detectors
- Retrieval has a 100% detection accuracy on unperturbed machine-generated text
- Retrieval with BM25 is quite effective on paraphrased text
- BM25 is a more effective retriever than P-SP
Is retrieval an effective detector with a large retrieval corpus?
- Retrieval-based detectors scale well with larger corpus sizes.
- Detection accuracy remains consistently high across different corpus sizes.
- BM25 outperforms P-SP across different retrieval corpus sizes.
- Detection works best with 50 or more tokens of generated text.
Ideas to make retrieval detection work well at an even larger scale
Limitations of retrieval for detection
- Detection is specific to an API
- Retrieval infrastructure needed for large databases
- False positives due to training data memorization
- Privacy concerns
- Slight reduction in accuracy with large databases
- Tasks with constrained output space or short outputs difficult to detect
Experiments measuring intrinsic paraphrase generation quality
- Experiments focused on attacking AI-generated text detectors with paraphrases and defending against these attacks
- DIPPER used as underlying paraphrase generation model
- Ablation experiments and human evaluations conducted to validate effectiveness of DIPPER
- DIPPER leverages context from outside of text to be paraphrased
- DIPPER can paraphrase multiple sentences at once
- Ablated version of DIPPER (DIPPER-no-ctx) used to compare paraphrasing 3 sentences at a time vs 1 sentence at a time
- Paraphrasing multiple sentences at a time produces higher quality paraphrases
- Contextual paraphrasing leads to higher quality paraphrases
- GPT3.5 and RANKGEN usually prefer multi-sentence paraphrases over single-sentence
- Human evaluation used to evaluate semantic fidelity of paraphrases
- Over 80% of the time, annotators rate DIPPER’s paraphrases as nearly equivalent or approximately equivalent
- DIPPER leverages information from context to increase diversity while maintaining coherence
- Shortcoming of DIPPER is that it can modify unique proper nouns when using high lexical code
Conclusion
- We present DIPPER, a discourse paraphrase generation model that can rewrite multiple sentences of text and use context.
- We use DIPPER to test current AI-generated text detectors and find that DIPPER paraphrases easily evade these detectors.
- We propose a retrieval-based mechanism to defend against such paraphrase attacks.
- Experiments show that this defense significantly outperforms baseline detectors on paraphrased text.
- We survey 25 papers on paraphrase generation from 2018 to 2022 and find that only 3 can paraphrase multiple sentences at once, none can merge or split sentences, none use context, and 14 offer ways to customize diversity.
- DIPPER combines all desiderata into one model and offers intuitive control knobs for lexical and syntactic diversity.
- Automatic and human evaluation show that DIPPER can efficiently leverage context information and reorganize sentences while having high fidelity in meaning.
- We discuss related work on contextual machine translation, text simplification, and machine translation.
- We provide an appendix section with further information on our human evaluation study.