Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Text detoxification can reduce the harms of toxicity by changing the text to remove offensive meaning.
MaRCo is an algorithm that combines controllable generation and text rewriting methods to mask and replace words.
MaRCo outperforms baselines on automatic metrics and is preferred 2.1 times more in human evaluation.
MaRCo is especially useful for addressing subtle toxicity and online hate.

Paper Content

Introduction

Toxic language is increasingly prevalent and can cause harm.
NLP systems have difficulty detecting subtle biases.
Text detoxification can rewrite text to be less toxic while preserving meaning.
MARCO is a new algorithm for text detoxification that combines mask-and-replace text denoising with controllable text generation.
MARCO outperforms state-of-the-art detoxification baselines.

Background: text detoxification

Text detoxification is a form of stylistic rewriting
Goal is to produce a non-toxic rewrite given a toxic input sentence
Difficult task as it requires both detoxification and preservation of non-toxic meaning
Unsupervised masking-and-reconstructing approaches are often used
Other approaches frame detoxification as a translation or paraphrasing task, using a classifier to steer away from toxic content
MARCO's method for unsupervised controlled revision is based on denoising autoencoder LMs

Contextual masking

Identify locations which could convey toxic meaning
Mask each token in turn and have the expert and anti-expert LMs generate probability distributions over the vocabulary at that position
Compute the distance between the two probability distributions using Jensen-Shannon divergence
Normalize distances by mean
Mask tokens whose distance is above threshold
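
The masking steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the threshold value of 1.2, and the toy three-word distributions are all assumptions made for the example. In practice the distributions would come from the expert and anti-expert LMs.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def choose_masks(expert_dists, anti_dists, threshold=1.2):
    """Return token positions whose mean-normalized divergence exceeds threshold.

    `threshold` is a hypothetical value chosen for this sketch.
    """
    distances = [js_divergence(p, q) for p, q in zip(expert_dists, anti_dists)]
    mean = sum(distances) / len(distances)
    if mean == 0:
        return []  # the two models agree everywhere, so nothing is masked
    return [i for i, d in enumerate(distances) if d / mean > threshold]

# Toy per-position distributions over a 3-word vocabulary (made up).
expert_dists = [[0.5, 0.3, 0.2], [0.9, 0.05, 0.05], [0.4, 0.4, 0.2]]
anti_dists = [[0.5, 0.3, 0.2], [0.05, 0.05, 0.9], [0.4, 0.3, 0.3]]

masked_positions = choose_masks(expert_dists, anti_dists)
print(masked_positions)
```

Only the position where the two models disagree strongly is selected. Normalizing by the mean makes the threshold relative to the sentence rather than an absolute divergence value.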

Contextual replacing

MARCO replaces potentially toxic locations with more benign tokens
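Divergence between the expert and anti-expert predictions (Jensen-Shannon divergence, a symmetrized form of KL divergence) is used to determine if locations are toxic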
DEXPERTS framework is adapted from left-to-right generation to enable rewriting
Logits from base, expert and anti-expert AE-LMs are ensembled
Hyperparameters are used to control impact of expert and anti-expert
Rewrites have minimal but meaningful edits on toxic tokens
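
The logit ensembling described above can be illustrated as follows. This is a sketch under stated assumptions: the exact combination rule (base logits plus a weighted expert term minus a weighted anti-expert term), the parameter names alpha_expert and alpha_anti, and the toy vocabulary and logits are chosen for illustration and are not taken from the summary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_distribution(base, expert, anti, alpha_expert=1.0, alpha_anti=1.0):
    """Combine logits for one masked position.

    Assumed rule: base + alpha_expert * expert - alpha_anti * anti.
    The two alpha weights are hypothetical hyperparameter names.
    """
    combined = [
        b + alpha_expert * e - alpha_anti * a
        for b, e, a in zip(base, expert, anti)
    ]
    return softmax(combined)

# Toy logits for a single masked position (made up for illustration).
vocab = ["kind", "rude", "people"]
base = [1.0, 1.2, 0.8]
expert = [2.0, -1.0, 0.5]
anti = [-1.0, 2.5, 0.5]

probs = ensemble_distribution(base, expert, anti)
best = vocab[probs.index(max(probs))]
print(best)
```

With both weights set to 0 the result reduces to the base model's own distribution, which in this toy case prefers "rude". Raising the weights shifts probability toward tokens the expert favors and the anti-expert disfavors.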

Detoxification experiments & results

Focused on rewriting sentences from three toxicity datasets
Used both automatic and human evaluations to measure performance

Datasets

Rewrite English sentences that are known to be or annotated as toxic
Focus on sentences with subtle or implicit biases
Use three out-of-domain datasets with subtle toxicity
Datasets include Microaggressions.com, Social Bias Frames, and DynaHate

Baselines

MARCO is compared to two baseline approaches from Dale et al. (2021).
ParaGeDi uses a class-conditioned language model and paraphrasing language model.
CondBERT uses a pointwise editing setup and a lexicon-based approach to masking words.

Evaluation setup

Perform automatic and human evaluations
Rewrite toxic example from SBF
MARCO detects and masks “cotton” as toxicity indicator
Assess quality of rewrites with automatic metrics
Measure fluency and meaning similarity of rewrites
Conduct head-to-head human evaluation of toxicity of rewrites

Results

MARCO is better at detoxification than baselines by 10.3% on average
MARCO is rated as less toxic than CondBERT 2.2 times more often than vice versa
MARCO is on par with CondBERT in terms of meaning preservation
MARCO works especially well on subtle toxicity
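
A statement such as "rated as less toxic N times more often than vice versa" is a ratio of head-to-head wins. The sketch below shows one way such a ratio could be computed. The judgment labels and counts are made up, and ignoring ties is an assumption of this example, not a detail confirmed by the summary.

```python
def preference_ratio(judgments, system_a, system_b):
    """Ratio of pairwise wins for system_a over system_b.

    Any other label (for example "tie") is ignored in this sketch.
    """
    a_wins = sum(1 for j in judgments if j == system_a)
    b_wins = sum(1 for j in judgments if j == system_b)
    if b_wins == 0:
        raise ValueError("system_b has no wins, so the ratio is undefined")
    return a_wins / b_wins

# Hypothetical annotator judgments of which rewrite is less toxic.
judgments = ["A"] * 6 + ["B"] * 3 + ["tie"] * 1

ratio = preference_ratio(judgments, "A", "B")
print(ratio)
```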

Conclusion

MARCO is a novel method for text detoxification
MARCO utilizes auto-encoder language model experts in a mask and reconstruct process
MARCO outperforms strong baselines in automatic and human evaluations
MARCO demonstrates the effectiveness of controllable generation mixed with text rewriting methods for controllable revision
MARCO highlights the usefulness of using LMs for capturing toxicity
MARCO has several limitations, ethical considerations, and broader impacts
MARCO could be used for malicious purposes
MARCO requires finetuning two pretrained LMs
The evaluation uses the Perspective API to automatically assess toxicity
The authors use the Jigsaw Unintended Bias in Toxicity Classification dataset
The authors use the HuggingFace Transformers library
The authors use the Tumblr API
GPT2-base is used to measure fluency
RoBERTa-large is used to measure meaning preservation
Human annotators are recruited from the USA and Canada on Amazon Mechanical Turk
Annotators are paid a median wage of $8/h
