Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

It is common to have multiple annotators label text and to derive ground-truth labels from the agreement of the majority of annotators
NLP systems need to represent people’s diverse voices on subjective matters and to predict the level of diversity
The paper examines whether the text of the task and annotators’ demographic background information can be used to estimate the level of disagreement

Paper Content

Introduction

Supervised AI systems are trained on annotated datasets with labels determined by consensus among multiple annotators.
Annotators often disagree on the final labels due to subjective opinions.
Majority-based aggregation often fails to learn the true distribution of annotators’ voices in more subjective tasks.
Annotators’ disagreement can be caused by limited representations of the annotator group or the natural controversy of the text.
This paper explores the relationship between the annotator group and natural controversy in text.

Related works

Tasks like toxicity detection, sentiment analysis, and social or ethical labeling are subjective and controversial
Disagreement between annotators can lead to labels of differing reliability
Agreement is measured using metrics like Cohen’s Kappa and Fleiss’ Kappa
Aggregating labels can conceal informative disagreement
Acceptability of answers is important in subjective tasks
Annotator demographics can improve annotation quality
Jury learning can recommend certain people or groups for certain tasks
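
The agreement metrics mentioned above can be made concrete with a small sketch. The function below computes Fleiss' Kappa using its standard textbook definition; the function name and the input layout (one list of labels per text, with the same number of annotators for every text) are assumptions made for illustration, not something taken from the paper.

```python
from collections import Counter


def fleiss_kappa(annotations, categories):
    """Fleiss' Kappa for a list of per-text label lists.

    Assumes every text was labeled by the same number of annotators.
    """
    n_items = len(annotations)
    n_raters = len(annotations[0])
    # counts[i][c] = number of annotators who gave category c to text i
    counts = [Counter(item) for item in annotations]

    # Observed agreement per text, then averaged over texts
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Agreement expected by chance, from overall category proportions
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p ** 2 for p in proportions)

    if p_expected == 1.0:
        # Every annotation uses one category: Kappa is undefined
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)
```

A value of 1 means perfect agreement, while values at or below 0 mean agreement no better than chance.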

Methods

Quantifying subjective annotation disagreements using demographic information of each annotator
Modeling annotation disagreement with pre-trained language model, e.g. RoBERTa

Preliminaries

Problem setup is a text classification scenario with K classes
Dataset D consists of texts X and annotation matrix Y
Each entry of Y represents an annotation assigned to a text
N different annotations for each text
T types of demographic information are available for each of the N annotators
Majority voting assigns each text the label that the largest number of its annotators agree on
Binary disagreement label indicates if there are different opinions
Continuous disagreement label measures degree of disagreement
Text with highest disagreement is very controversial
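
The three label types above can be sketched for a single text as follows. The helper names are hypothetical, and the continuous rate is assumed here to be the fraction of annotations that differ from the majority label; the paper's exact formula may differ.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label chosen by the most annotators (first seen wins ties)."""
    return Counter(labels).most_common(1)[0][0]


def binary_disagreement(labels):
    """1 if at least two annotators gave different labels, else 0."""
    return int(len(set(labels)) > 1)


def disagreement_rate(labels):
    """Assumed definition: share of annotations that differ from the majority label."""
    top_count = Counter(labels).most_common(1)[0][1]
    return 1.0 - top_count / len(labels)
```

Under this assumed definition, a text labeled identically by all annotators gets a rate of 0, and the rate grows as the annotations spread over more classes.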

Disagreement prediction with demographic information

Goal is to predict disagreement of given text
Annotation could be labeled differently between texts
Utilize pre-trained language model to train predictor
Incorporate demographic information of annotators
Two ways to incorporate demographic information: group and personal
Two formats to combine demographic information and text: templated and sentence
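
As a rough illustration of the two input formats, the sketch below builds one input string per annotator profile. The exact templates, the separator token, and the field names and values are all hypothetical; the summary does not specify the wording the paper uses.

```python
# Hypothetical separator; RoBERTa tokenizers use "</s>" between segments.
SEP = "</s>"


def templated_input(text, profile):
    """Templated format: "key: value" pairs, a separator, then the text."""
    fields = ", ".join(f"{key}: {value}" for key, value in profile.items())
    return f"{fields} {SEP} {text}"


def sentence_input(text, profile):
    """Sentence format: demographics written as a natural-language sentence."""
    described = " and ".join(f"{key} is {value}" for key, value in profile.items())
    return f"For this annotator, {described}. {text}"
```

A personal-level input would use one profile per annotator as shown here, while a group-level input would presumably combine the profiles of all annotators of a text into a single string.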

Simulation of demographic information

Proposed simulation of demographic information to analyze how different annotator groups impact disagreement prediction
Simulated demographic information combined with given text and annotations to simulate scenario with different annotators
Gender and ethnicity demographic types with 4 and 7 options respectively, for a total of 28 combinations
Predicted disagreement evaluated to distinguish between disagreement from controversy of text or uncertainty from annotators
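
The count of 28 combinations follows from crossing the 4 gender options with the 7 ethnicity options. A minimal sketch, using placeholder option names because the summary does not list the actual values:

```python
from itertools import product

# Placeholder option names (hypothetical); only the counts 4 and 7 come from the text.
GENDER_OPTIONS = [f"gender_{i}" for i in range(4)]
ETHNICITY_OPTIONS = [f"ethnicity_{i}" for i in range(7)]


def simulated_profiles():
    """Every gender/ethnicity pairing as one artificial annotator profile."""
    return [
        {"gender": gender, "ethnicity": ethnicity}
        for gender, ethnicity in product(GENDER_OPTIONS, ETHNICITY_OPTIONS)
    ]
```

Each profile can then be paired with a given text and passed to the disagreement predictor.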

Benchmark datasets

SBIC contains 150k structured annotations of social media posts
SChem101 is a corpus of cultural norms via free-text rules-of-thumb
SCruples-dilemmas is a resource for normatively ranking actions
Dyna-Sentiment is an English language benchmark task for ternary sentiment analysis
Wikipedia Politeness is a collection of requests from Wikipedia Talk pages

Experimental details

Fine-tuned RoBERTa-base with Adam optimizer and fixed learning rate
Compared with different input types and disagreement labeling setups
RoBERTa performed best
Evaluated performance with F1 and MSE
Compared binary disagreement label and continuous disagreement rate
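
The two metrics pair naturally with the two label types: F1 for the binary disagreement label and MSE for the continuous disagreement rate. The sketch below uses the standard definitions; computing F1 on the positive (disagreement) class only is an assumption, since the summary does not say which averaging the paper uses.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class (assumed to be the "disagreement" label)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted rates."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```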

Main results

Continuous disagreement achieves better prediction than binary disagreement for most datasets.
Binary label prediction is close to continuous prediction for the SBIC and Wikipedia Politeness datasets.
Binary label is not reliable for the SChem101 and SCruples-dilemmas datasets.
Binary label has inconsistent performance on the Dyna-Sentiment dataset.

Simulation of everyone’s voices with artificial demographics

Exploring how to reflect diverse opinions on annotation tasks
Simulated different combinations of artificial demographic groups
Motivated by Intersectionality theory
Used disagreement predictor to predict disagreement of simulated demographic information and text

Conclusion

Proposed disagreement prediction framework
Measures annotators’ disagreement in subjective tasks
Predicts disagreement with/without demographic information
Simulates 140 artificial annotators to build annotation pool
Results show disagreement can be predicted from text and better with demographic information
Evaluation results of predictions with/without demographic information
