Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Detection algorithms have been proposed to identify AI-generated text
An 11B-parameter paraphrase generation model (DIPPER) was trained to paraphrase paragraphs
Several detectors were tested and found to be evaded by DIPPER
A defense was introduced to increase robustness of AI-generated text detection to paraphrase attacks
Code, model and data will be open sourced for future research

Paper Content

Introduction

LLMs can write coherent and relevant longform text
Fears of malicious applications such as fake news and homework answers
Algorithms proposed to detect machine-generated text
Unclear how robust these algorithms are to paraphrase attacks
Demonstrate vulnerability of existing detectors to paraphrase attacks
Train 11B parameter paraphrase generation model called DIPPER
DIPPER can paraphrase paragraph-length texts
DIPPER has two features to help evade AI-generated text detectors
Attack several recently proposed AI-generated text detection algorithms
Experiments show paraphrasing causes all tested detection algorithms to misclassify AI-generated texts
Propose to use retrieval methods to detect AI-generated text
97.3% of PG19 paraphrases and 80.4% of Wikipedia paraphrases detected
Release DIPPER model, training dataset, and codebase

Watermarking language model outputs

A watermark is a modification to generated text that can be detected by a computer but not by humans.
Watermarks are difficult to remove and don’t affect the quality of the text.
Previous research has attempted to watermark natural language using syntax tree manipulations.
Kirchenbauer et al. (2023) proposed an algorithm to add watermarks using logits and a hash function.
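
The hash-based scheme above can be illustrated with a minimal sketch. It assumes a simplified variant in which each token is assigned to a "green list" by hashing it together with the previous token, and in which detection is a one-sided z-test on the number of green tokens. The function names, the SHA-256 hash, and the green fraction `gamma = 0.5` are illustrative assumptions, not the authors' implementation.

```python
import hashlib
import math


def is_green(prev_token: str, token: str, gamma: float = 0.5) -> bool:
    """Pseudo-randomly place `token` on the green list, seeded by the previous token."""
    digest = hashlib.sha256(f"{prev_token}\x00{token}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return fraction < gamma


def watermark_z_score(tokens: list[str], gamma: float = 0.5) -> float:
    """z-score of the green-token count under the 'no watermark' null hypothesis."""
    scored = len(tokens) - 1  # the first token has no predecessor to hash
    if scored < 1:
        raise ValueError("need at least two tokens")
    green = sum(is_green(p, t, gamma) for p, t in zip(tokens, tokens[1:]))
    return (green - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))


def generate_green_sequence(vocab: list[str], length: int, start: str,
                            gamma: float = 0.5) -> list[str]:
    """Toy generator that always picks a green token (stands in for a strong logit bias)."""
    tokens = [start]
    for _ in range(length):
        tokens.append(next(t for t in vocab if is_green(tokens[-1], t, gamma)))
    return tokens
```

Unwatermarked text should score near zero on average, while text whose tokens were steered toward the green list scores far above it. A paraphraser that replaces tokens pushes the green count back toward chance, which is the weakness the attack exploits.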

Statistical outlier detection methods

Outlier detection algorithms attempt to distinguish between human-written and machine-generated text.
Early methods detect statistical irregularities in measures such as entropy, perplexity, and n-gram frequencies.
GLTR, GPTZero, and DetectGPT are tools developed to assist in detecting machine-generated text.
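
As a toy illustration of a perplexity-style statistic, the sketch below scores a token sequence under an add-one-smoothed unigram model. This is an assumption made for brevity: real detectors score text with a neural language model, and the function name and smoothing choice here are hypothetical.

```python
import math
from collections import Counter


def unigram_perplexity(tokens: list[str], reference_counts: Counter,
                       vocab_size: int) -> float:
    """Perplexity of `tokens` under an add-one-smoothed unigram model."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    total = sum(reference_counts.values())
    log_prob = 0.0
    for tok in tokens:
        # Add-one smoothing keeps unseen tokens from getting zero probability.
        p = (reference_counts.get(tok, 0) + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(tokens))
```

An outlier detector would compare such a score against a threshold calibrated on human-written text. Unusually low perplexity is one signal that text was sampled from a model.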

Classifiers

Detection methods rely on classifiers to distinguish human-written text from machine-generated text.
Early efforts used classifiers to detect fake reviews and fake news.
Studies examine classification performance and decoding strategies.

Comparison to Sadasivan et al. (2023)

Sadasivan et al. (2023) demonstrated the utility of paraphrasing attacks against AI-generated text detectors
DIPPER has advanced discourse-level rewriting capabilities and fine-grained diversity control
Experiments encompass more tasks, datasets, and detection algorithms, and evaluate larger language models
Retrieval-based defense directly contradicts the “impossibility result” of Sadasivan et al. (2023)

Building a controllable discourse paraphraser

Paraphrasing can be used to fool existing machine-generated text detection techniques.
Paraphrasing can change the statistical properties of model generated text.
Paraphrasing must be able to handle context and be controllable.
Paraphrasing must not change the semantics of the input.
Paraphrasing must be implemented with a model different from the watermarked model.

Constructing paraphrase data

Process involves fine-tuning a large language model on a parallel dataset of paragraph-level paraphrases
Objective is to develop a model capable of paraphrasing any given sequence of sentences
Leverage PAR3 dataset to train DIPPER
Align sentences of two translations using semantic similarity
Choose subset of alignments
Shuffle sentences, compute control codes
Map shuffled sentences to original sentences using context and control codes
Inference time: paraphrase arbitrary sequence of sentences by marking with tags and assigning values to control codes
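
The inference-time step can be sketched as a small input-building helper. The template below (control codes first, then the left context, the tagged sentences, and the right context) and the `<sent>` tag name are assumptions for illustration; the summary above only states that the target sentences are marked with tags and that control codes are assigned values.

```python
def build_paraphraser_input(sentences: list[str], start: int, window: int,
                            lexical: int, order: int) -> str:
    """Mark `window` sentences beginning at `start` for paraphrasing, keeping context."""
    for name, code in (("lexical", lexical), ("order", order)):
        if not 0 <= code <= 100:
            raise ValueError(f"{name} code must be between 0 and 100")
    if not 0 <= start < len(sentences) or window < 1:
        raise ValueError("window must select at least one existing sentence")
    prefix = " ".join(sentences[:start])                 # left context, not rewritten
    target = " ".join(sentences[start:start + window])   # sentences to paraphrase
    suffix = " ".join(sentences[start + window:])        # right context, not rewritten
    pieces = [f"lexical = {lexical}, order = {order}", prefix,
              f"<sent> {target} </sent>", suffix]
    return " ".join(piece for piece in pieces if piece)
```

For example, `build_paraphraser_input(["A.", "B.", "C.", "D."], 1, 2, 40, 60)` yields a single string in which only `B. C.` sits inside the tags, so a model can read the surrounding sentences as context without rewriting them.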

Model and training details

Model is a Transformer neural network
Model is initialized with T5-XXL checkpoint
Model is fine-tuned on paraphrase generation data
Maximum of 3 consecutive sentences paraphrased at a time
Model implemented in JAX using T5X library
Nucleus sampling used at inference time
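
Nucleus (top-p) sampling can be sketched over a single next-token distribution as follows. The dictionary representation and the `top_p` values in the usage note are assumptions for illustration; the summary does not state the value used.

```python
import random


def nucleus_sample(probs: dict[str, float], top_p: float, rng: random.Random) -> str:
    """Sample from the smallest set of top tokens whose total probability reaches top_p."""
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, mass = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        mass += p
        if mass >= top_p:
            break  # the nucleus now covers at least top_p of the probability mass
    tokens, weights = zip(*nucleus)
    # random.choices renormalizes the weights within the nucleus.
    return rng.choices(tokens, weights=weights, k=1)[0]
```

With `probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}` and `top_p = 0.75`, only `a` and `b` can be drawn. Truncating the low-probability tail this way keeps sampled text diverse while avoiding unlikely tokens.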

Experiments attacking existing AI-generated text detectors

Metrics, models, and detectors detailed in Sections 4.1 and 4.2
Results discussed in Section 4.3
Paraphrasing easily evades all detectors across 3 language models and 2 tasks

Evaluation metrics

Used two metrics to evaluate attack success rates after paraphrasing: detection accuracy and semantic similarity
Detection accuracy measured at a fixed false positive rate of 1% (a single operating point on the ROC curve)
Semantic similarity measured by P-SP model from Wieting et al. (2022)
Semantics considered to be preserved if P-SP score is higher than 0.76
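
Both metrics can be sketched in a few lines, assuming each detector emits a score where higher means "more likely AI-generated". The threshold is calibrated on human-written text so that at most 1% of it is flagged. The function names are hypothetical; the 0.76 cutoff is the P-SP value stated above.

```python
import math

P_SP_THRESHOLD = 0.76  # semantic-similarity cutoff stated in the summary


def threshold_at_fpr(human_scores: list[float], target_fpr: float = 0.01) -> float:
    """Threshold such that at most `target_fpr` of human scores lie strictly above it."""
    if not human_scores or not 0 <= target_fpr < 1:
        raise ValueError("need human scores and a target FPR in [0, 1)")
    ranked = sorted(human_scores, reverse=True)
    allowed = math.floor(target_fpr * len(ranked))  # human texts we may wrongly flag
    return ranked[allowed]


def detection_accuracy(ai_scores: list[float], threshold: float) -> float:
    """Fraction of AI-generated texts whose score is strictly above the threshold."""
    return sum(score > threshold for score in ai_scores) / len(ai_scores)


def semantics_preserved(p_sp_score: float) -> bool:
    """A paraphrase counts as meaning-preserving when P-SP exceeds the cutoff."""
    return p_sp_score > P_SP_THRESHOLD
```

Fixing the false positive rate first makes detectors comparable: each one gets the same budget for wrongly flagging human text, and only the fraction of AI-generated text it catches differs.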

Models, datasets & detection algorithms

Conducted attacks on 3 language models of varying sizes
Sampled generations 300 tokens long
Experimented with 2 text generation tasks with long-form outputs
Used 5 AI detection algorithms
Paraphrased text using lexical diversity codes L = 20, 40, 60 and order diversity codes O = 0, 60
Truncated machine-generated, paraphrased, and human-written sequences to the length of the shortest
Paraphrased long sequences 3 sentences at a time
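
The last two preprocessing steps can be sketched as below. Treating the inputs as already-tokenized lists and using non-overlapping windows are assumptions; the summary does not say how tokens are counted or whether windows overlap.

```python
def truncate_to_shortest(*token_lists: list[str]) -> list[list[str]]:
    """Cut every sequence to the length of the shortest so comparisons are length-matched."""
    if not token_lists:
        return []
    shortest = min(len(tokens) for tokens in token_lists)
    return [tokens[:shortest] for tokens in token_lists]


def sentence_windows(sentences: list[str], size: int = 3) -> list[list[str]]:
    """Split a long sequence into consecutive windows of at most `size` sentences."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]
```

Length matching matters because many detectors become more confident as text gets longer, so unequal lengths would bias the comparison between machine-generated, paraphrased, and human-written text.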

Attacking AI-generated text detectors

Paraphrasing significantly lowers detection accuracy across all diversity control codes.
Paraphrasing reduces watermark detection accuracy from 100% to 57.2%.
Paraphrasing reduces OpenAI’s text classifier accuracy from 30.0% to 15.6%.
Non-watermarking detectors are generally ineffective.

Alternative paraphrasing attacks

Paraphrasing multiple times can improve the effectiveness of a paraphrase attack.
Non-DIPPER paraphrasers can be used to evade detection, but have lower quality.
Large language models can be used to perform few-shot contextual paraphrasing, but may be detectable.
Retrieval over previously-generated sequences can be used as a defense against paraphrase attacks.
Controlled comparisons and a large retrieval corpus of 15M generations were used to evaluate the method.

Formulating the retrieval defense

LLM API takes prompt x as input and returns a continuation y
Database Y is dynamically updated and stored on the API side
Querying the database to check if y was generated by the API
Detection score is computed using maximum similarity score
Two choices for retriever f_ret: P-SP and BM25
Detection accuracy of paraphrases is high at 1% FPR
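
The formulation above can be sketched as an API-side store with a maximum-similarity lookup. Unigram F1 overlap is used here as a simple stand-in for the similarity function; it is not P-SP or BM25, and the class name, method names, and any threshold passed in are hypothetical.

```python
from collections import Counter


def unigram_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two texts (stand-in for the retriever's similarity)."""
    tokens_a, tokens_b = Counter(a.lower().split()), Counter(b.lower().split())
    overlap = sum((tokens_a & tokens_b).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(tokens_a.values())
    recall = overlap / sum(tokens_b.values())
    return 2 * precision * recall / (precision + recall)


class GenerationDatabase:
    """Database Y of every continuation y the API has returned."""

    def __init__(self) -> None:
        self._generations: list[str] = []

    def record(self, y: str) -> None:
        """Store a continuation at generation time."""
        self._generations.append(y)

    def detection_score(self, candidate: str) -> float:
        """Maximum similarity between the candidate and any stored generation."""
        return max((unigram_f1(candidate, y) for y in self._generations), default=0.0)

    def is_generated(self, candidate: str, threshold: float) -> bool:
        """Flag the candidate when its best match reaches the threshold."""
        return self.detection_score(candidate) >= threshold
```

A paraphrase usually still shares much of its content with the stored original, so its best-match score stays well above that of unrelated human text. In practice the threshold would be calibrated to the 1% false positive rate mentioned above.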

Controlled comparisons of retrieval with other AI-generated text detectors

Conducted a controlled comparison between detection algorithms and retrieval method
Constructed two retrieval corpora for experiment
Performed retrieval using original AI-generated text, its paraphrase, and human-written text as queries
Retrieval is a much more effective detector than baseline detectors
Retrieval has a 100% detection accuracy on unperturbed machine-generated text
Retrieval with BM25 is quite effective on paraphrased text
BM25 is a more effective retriever than P-SP

Is retrieval an effective detector with a large retrieval corpus?

Retrieval-based detectors scale well with larger corpus sizes.
Detection accuracy remains consistently high across different corpus sizes.
BM25 outperforms P-SP across different retrieval corpus sizes.
Detection works best with 50 or more tokens of generated text.

Ideas to make retrieval detection work well at an even larger scale

Limitations of retrieval for detection

Detection is specific to an API
Retrieval infrastructure needed for large databases
False positives due to training data memorization
Privacy concerns
Slight reduction in accuracy with large databases
Tasks with constrained output space or short outputs difficult to detect

Experiments measuring intrinsic paraphrase generation quality

Experiments focused on attacking AI-generated text detectors with paraphrases and defending against these attacks
DIPPER used as underlying paraphrase generation model
Ablation experiments and human evaluations conducted to validate effectiveness of DIPPER
DIPPER leverages context from outside of text to be paraphrased
DIPPER can paraphrase multiple sentences at once
Ablated version of DIPPER (DIPPER-no-ctx) used to compare paraphrasing 3 sentences at a time vs 1 sentence at a time
Paraphrasing multiple sentences at a time produces higher quality paraphrases
Contextual paraphrasing leads to higher quality paraphrases
GPT3.5 and RANKGEN usually prefer multi-sentence paraphrases over single-sentence
Human evaluation used to evaluate semantic fidelity of paraphrases
Over 80% of the time, annotators rate DIPPER’s paraphrases as nearly equivalent or approximately equivalent
DIPPER leverages information from context to increase diversity while maintaining coherence
A shortcoming of DIPPER is that it can modify unique proper nouns when a high lexical diversity code is used

Conclusion

We present DIPPER, a discourse paraphrase generation model that can rewrite multiple sentences of text and use context.
We use DIPPER to test current AI-generated text detectors and find that DIPPER paraphrases easily evade these detectors.
We propose a retrieval-based mechanism to defend against such paraphrase attacks.
Experiments show that this defense significantly outperforms baseline detectors on paraphrased text.
We survey 25 papers on paraphrase generation from 2018 to 2022 and find that only 3 can paraphrase multiple sentences at once, none can merge or split sentences, none use context, and 14 offer ways to customize diversity.
DIPPER combines all desiderata into one model and offers intuitive control knobs for lexical and syntactic diversity.
Automatic and human evaluation show that DIPPER can efficiently leverage context information and reorganize sentences while having high fidelity in meaning.
We discuss related work on contextual machine translation, text simplification, and machine translation.
We provide an appendix section with further information on our human evaluation study.
