Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Text detoxification can reduce the harms of toxicity by changing the text to remove offensive meaning.
  • MARCO is an algorithm that combines controllable generation and text rewriting methods to mask and replace toxic words.
  • MARCO's rewrites outperform baselines on automatic metrics and are preferred 2.1 times more often in human evaluation.
  • MARCO is especially useful for addressing subtle toxicity and online hate.

Paper Content

Introduction

  • Toxic language is increasingly prevalent and can cause harm.
  • NLP systems have difficulty detecting subtle biases.
  • Text detoxification can rewrite text to be less toxic while preserving meaning.
  • MARCO is a new algorithm for text detoxification that combines mask-and-replace text denoising with controllable text generation.
  • MARCO outperforms state-of-the-art detoxification baselines.

Background: text detoxification

  • Text detoxification is a form of stylistic rewriting
  • The goal is to produce a non-toxic rewrite of a toxic input sentence
  • The task is difficult because it requires both removing toxicity and preserving the non-toxic meaning
  • Unsupervised masking-and-reconstructing approaches are often used
  • Other approaches frame detoxification as a translation or paraphrasing task and use a classifier to steer away from toxic content
  • MARCO's method for unsupervised controlled revision is based on denoising autoencoder LMs

Contextual masking

  • Identify locations in the input that could convey toxic meaning
  • Temporarily mask each token and generate probability distributions over the vocabulary for that location from the expert and the anti-expert
  • Compute the distance between the two probability distributions using the Jensen-Shannon divergence
  • Normalize all distances by their mean
  • Mask the tokens whose normalized distance is above a threshold
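
The masking steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the toy distributions, and the threshold value of 1.2 are all assumptions made for the example, and the per-position distributions would in practice come from the expert and anti-expert LMs.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    def kl(a, b):
        # Terms with zero probability in `a` contribute nothing to the sum.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def select_mask_positions(expert_dists, anti_dists, threshold=1.2):
    """Return (positions to mask, mean-normalized distances).

    expert_dists[i] and anti_dists[i] are the two models' distributions
    over the vocabulary for token position i. The threshold is a
    hypothetical value chosen for this example.
    """
    distances = [js_divergence(p, q) for p, q in zip(expert_dists, anti_dists)]
    mean = sum(distances) / len(distances)
    if mean == 0:
        # The two models agree everywhere, so nothing is flagged.
        return [], distances
    normalized = [d / mean for d in distances]
    masked = [i for i, d in enumerate(normalized) if d > threshold]
    return masked, normalized

# Toy input: three token positions, vocabulary of size three.
expert_dists = [[0.5, 0.3, 0.2], [0.9, 0.05, 0.05], [0.4, 0.4, 0.2]]
anti_dists = [[0.5, 0.3, 0.2], [0.05, 0.05, 0.9], [0.4, 0.35, 0.25]]
masked, normalized = select_mask_positions(expert_dists, anti_dists)
print(masked)
```

Only the middle position is flagged here, because it is the one place where the two toy models disagree strongly. Normalizing by the mean makes the threshold relative to the sentence, so a sentence on which the models disagree mildly everywhere does not get fully masked.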

Contextual replacing

  • MARCO replaces the masked, potentially toxic locations with more benign tokens
  • Locations are flagged as toxic in the masking step, using the Jensen-Shannon divergence (a symmetric form of the KL divergence)
  • The DEXPERTS framework is adapted to enable rewriting by using AE-LMs
  • Logits from the base, expert, and anti-expert AE-LMs are ensembled
  • Hyperparameters control the impact of the expert and the anti-expert
  • The resulting rewrites make minimal but meaningful edits to toxic tokens
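
The logit ensembling can be illustrated with a small sketch. It assumes a DExperts-style combination in which the expert's logits are added to the base logits and the anti-expert's logits are subtracted, each with its own weight. The weight names `alpha_expert` and `alpha_anti`, the three-word vocabulary, and the logit values are hypothetical and not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_token_probs(base, expert, anti, alpha_expert=1.0, alpha_anti=1.0):
    """Combine per-token logits from three models for one masked position.

    Assumed form: base + alpha_expert * expert - alpha_anti * anti.
    Larger weights push the result further toward the expert and
    further away from the anti-expert.
    """
    combined = [
        b + alpha_expert * e - alpha_anti * a
        for b, e, a in zip(base, expert, anti)
    ]
    return softmax(combined)

# Hypothetical vocabulary and logits for a single masked position.
vocab = ["people", "idiots", "folks"]
base_logits = [2.0, 2.5, 1.0]
expert_logits = [2.0, 0.0, 1.5]   # non-toxic expert prefers benign words
anti_logits = [1.0, 3.0, 0.5]     # toxic anti-expert prefers the insult

base_probs = softmax(base_logits)
probs = ensemble_token_probs(base_logits, expert_logits, anti_logits)

base_choice = vocab[base_probs.index(max(base_probs))]
ensemble_choice = vocab[probs.index(max(probs))]
print(base_choice, "->", ensemble_choice)
```

In this toy case the base model alone would fill the mask with the insult, while the ensemble picks a benign word. Setting both weights to zero recovers the base model's distribution.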

Detoxification experiments & results

  • The experiments focus on rewriting sentences from three toxicity datasets
  • Both automatic and human evaluations are used to measure performance

Datasets

  • The task is to rewrite English sentences that are known to be, or are annotated as, toxic
  • The focus is on sentences with subtle or implicit biases
  • Three out-of-domain datasets with subtle toxicity are used
  • The datasets are Microaggressions.com, Social Bias Frames (SBF), and DynaHate

Baselines

  • MARCO is compared to two baseline approaches from Dale et al. (2021).
  • ParaGeDi uses a class-conditioned language model and paraphrasing language model.
  • CondBERT uses a pointwise editing setup and a lexicon-based approach to masking words.

Evaluation setup

  • Both automatic and human evaluations are performed
  • As an illustrative example, when rewriting a toxic sentence from SBF, MARCO detects and masks “cotton” as a toxicity indicator
  • Automatic metrics assess the quality of the rewrites, including their fluency and their meaning similarity to the input
  • A head-to-head human evaluation compares the toxicity of the rewrites
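
As a rough illustration of the fluency metric, the sketch below assumes fluency is scored as perplexity under a language model, computed from per-token log-probabilities. The function name and the probability values are hypothetical, and in practice the log-probabilities would come from the scoring LM.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from natural-log token probabilities: exp of the mean negative log-likelihood."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Hypothetical per-token probabilities for two candidate rewrites.
fluent = [math.log(p) for p in (0.4, 0.5, 0.3, 0.6)]
disfluent = [math.log(p) for p in (0.05, 0.5, 0.01, 0.2)]

print(perplexity(fluent) < perplexity(disfluent))
```

Lower perplexity means the scoring model finds the rewrite more predictable, which is used as a proxy for fluency.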

Results

  • MARCO is better at detoxification than baselines by 10.3% on average
  • MARCO is rated as less toxic than CondBERT 2.2 times more often than vice versa
  • MARCO is on par with CondBERT in terms of meaning preservation
  • MARCO works especially well on subtle toxicity
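
A figure such as "N times more often than vice versa" is a ratio of head-to-head wins to losses. The sketch below shows the arithmetic on made-up judgments. The labels, the counts, and the choice to leave ties out of the ratio are assumptions for illustration and are not the paper's data or protocol.

```python
def preference_ratio(judgments, system="marco", baseline="condbert"):
    """Ratio of comparisons won by `system` to comparisons won by `baseline`.

    Each judgment names the rewrite an annotator rated as less toxic.
    Any other label (for example "tie") is ignored in this sketch.
    """
    wins = sum(1 for j in judgments if j == system)
    losses = sum(1 for j in judgments if j == baseline)
    if losses == 0:
        raise ValueError("ratio is undefined when the baseline never wins")
    return wins / losses

# Made-up judgments: 30 wins, 15 losses, 5 ties.
judgments = ["marco"] * 30 + ["condbert"] * 15 + ["tie"] * 5
ratio = preference_ratio(judgments)
print(ratio)
```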

Conclusion

  • MARCO is a novel method for text detoxification
  • MARCO utilizes auto-encoder language model experts in a mask and reconstruct process
  • MARCO outperforms strong baselines in automatic and human evaluations
  • MARCO demonstrates the effectiveness of controllable generation mixed with text rewriting methods for controllable revision
  • MARCO highlights the usefulness of using LMs for capturing toxicity
  • The paper discusses several limitations, ethical considerations, and broader impacts of MARCO
  • MARCO could be misused for malicious purposes
  • MARCO requires finetuning two pretrained LMs (the expert and the anti-expert)
  • Toxicity is assessed automatically with the Perspective API
  • The work uses the Jigsaw Unintended Bias in Toxicity Classification dataset
  • The work uses the HuggingFace Transformers library
  • The work uses the Tumblr API
  • Fluency is measured with GPT2-base
  • Meaning preservation is measured with RoBERTa-large
  • Human annotators were recruited from the USA and Canada on Amazon Mechanical Turk
  • Annotators were paid a median wage of $8/h