Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Common to have multiple annotators label text and obtain ground truth labels based on agreement of major annotators
- Need NLP systems to represent people’s diverse voices on subjective matters and predict level of diversity
- Examines whether text of task and annotators’ demographic background info can be used to estimate level of disagreement
Paper Content
Introduction
- Supervised AI systems are trained on annotated datasets with labels determined by consensus among multiple annotators.
- Annotators often disagree on the final labels due to subjective opinions.
- Majority-based aggregation often fails to learn the true distribution of annotators’ voices in more subjective tasks.
- Annotators’ disagreement can be caused by limited representations of the annotator group or the natural controversy of the text.
- This paper explores the relationship between the annotator group and natural controversy in text.
Related works
- Tasks like toxicity detection, sentiment analysis, and social, ethical labeling are subjective and controversial
- Disagreement between annotators can result in different reliabilities
- Agreement is measured using metrics like Cohen’s Kappa and Fleiss’ Kappa
- Aggregating labels can conceal informative disagreement
- Acceptability of answers is important in subjective tasks
- Annotator demographics can improve annotation quality
- Jury learning can recommend certain people or groups for certain tasks
Methods
- Quantifying subjective annotation disagreements using demographic information of each annotator
- Modeling annotation disagreement with pre-trained language model, e.g. RoBERTa
Preliminaries
- Problem setup is a text classification scenario with K classes
- Dataset D consists of texts X and annotation matrix Y
- Each entry of Y represents an annotation assigned to a text
- N different annotations for each text
- T different demographic information of all N annotators available
- Majority voting assigns label from multiple annotations to maximally agreed label
- Binary disagreement label indicates if there are different opinions
- Continuous disagreement label measures degree of disagreement
- Text with highest disagreement is very controversial
Disagreement prediction with demographic information
- Goal is to predict disagreement of given text
- Annotation could be labeled differently between texts
- Utilize pre-trained language model to train predictor
- Incorporate demographic information of annotators
- Two ways to incorporate demographic information: group and personal
- Two formats to combine demographic information and text: templated and sentence
Simulation of demographic information
- Proposed simulation of demographic information to analyze how different annotator groups impact disagreement prediction
- Simulated demographic information combined with given text and annotations to simulate scenario with different annotators
- Gender and ethnicity demographic types with 4 and 7 options respectively, for a total of 28 combinations
- Predicted disagreement evaluated to distinguish between disagreement from controversy of text or uncertainty from annotators
Benchmark datasets
- SBIC contains 150k structured annotations of social media posts
- SChem101 is a corpus of cultural norms via free-text rules-of-thumb
- SCruples-dilemmas is a resource for normative ranking actions
- Dyna-Sentiment is an English language benchmark task for ternary sentiment analysis
- Wikipedia Politeness is a collection of requests from Wikipedia Talk pages
Experimental details
- Fine-tuned RoBERTabase with Adam optimizer and fixed learning rate
- Compared with different input types and disagreement labeling setups
- RoBERTa performed best
- Evaluated performance with F1 and MSE
- Compared binary disagreement label and continuous disagreement rate
Main results
- Continuous disagreement achieves better prediction than binary disagreement for most datasets.
- Binary label prediction is close to continuous prediction for SBIC and Politeness datasets.
- Binary label is not reliable for SChem and Dilemmas datasets.
- Binary label has an inconsistent performance for Dynasent dataset.
Simulation of everyone’s voices with artificial demographics
- Exploring how to reflect diverse opinions on annotation tasks
- Simulated different combinations of artificial demographic groups
- Motivated by Intersectionality theory
- Used disagreement predictor to predict disagreement of simulated demographic information and text
Conclusion
- Proposed disagreement prediction framework
- Measures annotators’ disagreement in subjective tasks
- Predicts disagreement with/without demographic information
- Simulates 140 artificial annotators to build annotation pool
- Results show disagreement can be predicted from text and better with demographic information
- Evaluation results of predictions with/without demographic information