Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Text detoxification can reduce the harms of toxicity by changing the text to remove offensive meaning.
- MaRCo is an algorithm that combines controllable generation and text rewriting methods to mask and replace words.
- MaRCo outperforms baselines on automatic metrics and is preferred 2.1 times more in human evaluation.
- MaRCo is especially useful for addressing subtle toxicity and online hate.
Paper Content
Introduction
- Toxic language is increasingly prevalent and can cause harm.
- NLP systems have difficulty detecting subtle biases.
- Text detoxification can rewrite text to be less toxic while preserving meaning.
- MARCO is a new algorithm for text detoxification that combines mask-and-replace text denoising with controllable text generation.
- MARCO outperforms state-of-the-art detoxification baselines.
Background: text detoxification
- Text detoxification is a form of stylistic rewriting
- Goal is to produce a non-toxic rewrite given a toxic input sentence
- Difficult task as it requires both detoxification and preservation of non-toxic meaning
- Unsupervised masking-and-reconstructing approaches are often used
- Framed as a translation or paraphrasing task, using a classifier to steer away from toxic content
- Method for unsupervised controlled revision is based on denoising autoencoder LMs
Contextual masking
- Identify locations which could convey toxic meaning
- Mask token and generate probability distributions over vocabulary
- Compute distance between probability distributions using Jensen-Shannon divergence
- Normalize distances by mean
- Mask tokens whose distance is above threshold
Contextual replacing
- MARCO replaces potentially toxic locations with more benign tokens
- KL divergence is used to determine if locations are toxic
- DEXPERTS framework is transformed to enable rewriting
- Logits from base, expert and anti-expert AE-LMs are ensembled
- Hyperparameters are used to control impact of expert and anti-expert
- Rewrites have minimal but meaningful edits on toxic tokens
Detoxification experiments & results
- Focused on rewriting sentences from three toxicity datasets
- Used both automatic and human evaluations to measure performance
Datasets
- Rewrite English sentences that are known to be or annotated as toxic
- Focus on sentences with subtle or implicit biases
- Use three out-of-domain datasets with subtle toxicity
- Datasets include Microagressions.com, Social Bias Frames, and DynaHate
Baselines
- MARCO is compared to two baseline approaches from Dale et al. (2021).
- ParaGeDi uses a class-conditioned language model and paraphrasing language model.
- CondBERT uses a pointwise editing setup and a lexicon-based approach to masking words.
Evaluation setup
- Perform automatic and human evaluations
- Rewrite toxic example from SBF
- MARCO detects and masks “cotton” as toxicity indicator
- Assess quality of rewrites with automatic metrics
- Measure fluency and meaning similarity of rewrites
- Conduct head-to-head human evaluation of toxicity of rewrites
Results
- MARCO is better at detoxification than baselines by 10.3% on average
- MARCO is rated as less toxic than CondBERT 2.2 times more often than vice versa
- MARCO is on par with CondBERT in terms of meaning preservation
- MARCO works especially well on subtle toxicity
Conclusion
- MARCO is a novel method for text detoxification
- MARCO utilizes auto-encoder language model experts in a mask and reconstruct process
- MARCO outperforms strong baselines in automatic and human evaluations
- MARCO demonstrates the effectiveness of controllable generation mixed with text rewriting methods for controllable revision
- MARCO highlights the usefulness of using LMs for capturing toxicity
- MARCO has several limitations, ethical considerations, and broader impacts
- MARCO could be used for malicious purposes
- MARCO requires finetuning two pretrained LMs
- MARCO uses the Perspective API to automatically assess toxicity
- MARCO uses the Jigsaw Unintended Bias in Toxicity Classification dataset
- MARCO uses the HuggingFace Transformers library
- MARCO uses the Tumblr API
- MARCO uses GPT2-base to measure fluency
- MARCO uses RoBERTa-large to measure meaning preservation
- MARCO uses annotators from the USA and Canada on Amazon Mechanical Turk
- MARCO pays a median wage of $8/h