Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Selecting an effective training signal for natural language processing tasks is difficult
Expert annotations are expensive; crowd-sourced annotations may not be reliable
Recent work in machine learning shows learning from soft-labels acquired from crowd annotations can be effective
Proposes new methods for acquiring soft-labels from crowd-annotations by aggregating distributions
Results in best or near-best performance and uncertainty estimation across tasks

Paper Content

Introduction

Supervised machine learning requires labels as training data
Tradeoffs exist when deciding how to collect and use labels
Recent work has focused on soft-labeling schemes to improve accuracy and uncertainty estimation
Little work has compared how different soft-labeling schemes affect out-of-distribution performance and uncertainty estimation in NLP
This paper seeks to fill this gap by providing an in-depth study into soft-labeling techniques
It proposes four multiview aggregation methods to generate aggregated soft-labels
It also looks at knowledge distillation to build compact but robust models
It uses multi-view structure to improve test set performance of the final distilled model

Methods

Learning from crowd annotations is a topic with a rich literature.
Soft-labeling schemes provide a distribution over possible classes in a dataset.
Soft-labeling schemes can help regularize a downstream classifier.
Multiple soft-labeling schemes have been demonstrated to provide good training signals on different NLP tasks.
This paper provides a systematic analysis of performance and best practices when considering generalization to out-of-domain data.
This paper proposes several methods for aggregating the distributions over class labels.

Soft labeling methods

Standard normalization scheme transforms crowd-sourced labels into a probability distribution
Softmax function used to obtain soft labels
Dawid & Skene model used to aggregate crowd-sourced labels into a single ground-truth label
Model designed to explain away inconsistencies of individual annotators
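
As a rough illustration of the first two schemes above, the sketch below turns per-class vote counts for one item into soft labels. The function names and the example votes are hypothetical, and applying softmax directly to raw counts is an assumption about how the scheme is set up. The Dawid & Skene model is omitted because it requires fitting per-annotator confusion matrices.

```python
import math

def normalize_votes(counts):
    """Standard normalization: divide each class count by the total number of votes."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("need at least one annotation")
    return [c / total for c in counts]

def softmax_votes(counts):
    """Softmax over raw vote counts; every class keeps some probability mass."""
    peak = max(counts)  # subtract the max for numerical stability
    exps = [math.exp(c - peak) for c in counts]
    norm = sum(exps)
    return [e / norm for e in exps]

# Hypothetical item: 3 annotators chose class 0, 1 chose class 1, none chose class 2.
votes = [3, 1, 0]
normalized = normalize_votes(votes)
softmaxed = softmax_votes(votes)
print(normalized)
print([round(p, 3) for p in softmaxed])
```

Normalization leaves unvoted classes at exactly zero, while softmax assigns them a small non-zero probability, which is one reason the two views of the same annotations can differ.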

Combining soft labels

Acquire labels that capture multiple views of annotations
Combine soft-labels from different views
Inexpensive, requires no additional annotations
Goal is to produce a categorical distribution from a set of distributions
Averaging is the most basic model
Jensen-Shannon centroid (JSC) is the distribution that minimizes the sum of Jensen-Shannon divergences to the input distributions
Temperature scaling softens each distribution
Hybrid approach combines temperature scaling and centroid
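
The four aggregation ideas listed above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the paper's implementation: the centroid is approximated with a simple exponentiated-gradient loop (a swapped-in optimizer), temperature scaling is assumed to raise each probability to the power 1/T and renormalize, and the hybrid is assumed to soften each view before taking the centroid. All names, the smoothing constant, and the example views are hypothetical.

```python
import math

EPS = 1e-12  # smoothing constant (an assumption) that keeps logarithms finite

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q):
    """KL(p || q) with the convention 0 * log(0 / q) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two categorical distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def average(dists):
    """Averaging: the arithmetic mean of the input distributions."""
    n = len(dists)
    return [sum(d[k] for d in dists) / n for k in range(len(dists[0]))]

def temper(p, temperature):
    """Temperature scaling (assumed form): temperature > 1 flattens p."""
    return normalize([pi ** (1.0 / temperature) for pi in p])

def total_jsd(dists, q):
    """Objective minimized by the centroid: sum of JSDs from each input to q."""
    return sum(jsd(p, q) for p in dists)

def js_centroid(dists, steps=200):
    """Approximate the Jensen-Shannon centroid by exponentiated-gradient descent."""
    q = normalize([x + EPS for x in average(dists)])  # start from the smoothed mean
    n = len(dists)
    for _ in range(steps):
        # d JSD(p, q) / d q_k = 0.5 * log(q_k / m_k), with m = (p + q) / 2
        grad = [
            sum(0.5 * math.log(q[k] / ((p[k] + q[k]) / 2)) for p in dists) / n
            for k in range(len(q))
        ]
        q = normalize([q[k] * math.exp(-grad[k]) for k in range(len(q))])
    return q

def hybrid(dists, temperature=2.0):
    """Hybrid (assumed order): soften each view first, then take the centroid."""
    return js_centroid([temper(p, temperature) for p in dists])

# Hypothetical soft labels for one item from three different labeling schemes.
views = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.9, 0.05, 0.05]]
mean_q = average(views)
centroid_q = js_centroid(views)
hybrid_q = hybrid(views)
print([round(x, 3) for x in mean_q])
print([round(x, 3) for x in centroid_q])
print([round(x, 3) for x in hybrid_q])
```

Because the loop starts from the mean and only moves downhill on the summed divergence, the centroid's objective value should never be worse than that of plain averaging on the same inputs.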

Learning from soft labels

Learning from soft labels requires minimizing divergence between classifier probability distribution and label distribution
Kullback-Leibler divergence used as loss function between classifier probability distribution and the soft labels described under "Soft labeling methods" and "Combining soft labels"
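
To make the loss concrete, here is a minimal sketch of a KL-divergence loss for a single example, written in plain Python rather than a deep learning framework. The direction KL(soft label || prediction) and the function names are assumptions; the summary above does not state which direction is used.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_norm for z in logits]

def kl_soft_label_loss(logits, soft_label):
    """KL(soft_label || softmax(logits)); classes with zero label mass contribute 0."""
    log_q = log_softmax(logits)
    return sum(p * (math.log(p) - lq) for p, lq in zip(soft_label, log_q) if p > 0)

# A uniform prediction matches a uniform soft label exactly ...
loss_match = kl_soft_label_loss([0.0, 0.0], [0.5, 0.5])
# ... but is penalized against a one-hot (hard) label.
loss_hard = kl_soft_label_loss([0.0, 0.0], [1.0, 0.0])
print(loss_match, loss_hard)
```

With a one-hot label this loss reduces to ordinary cross-entropy, so hard-label training is a special case of the same objective.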

Experimental setup

Focus experiments on 3 research questions related to learning from crowd-sourced labels
Experiments focus on out-of-domain setting with distribution shift between training and test data
Use RoBERTa as base network with same training hyperparameters
Experiments measure impact of crowd-sourced soft targets on model generalization
Tasks include recognizing textual entailment, relation extraction, part-of-speech tagging, and toxicity detection
Training datasets include Pascal RTE-1, MRE, Gimpel, and Google Jigsaw
Test datasets include SNLI, causal claim-strength, Penn Treebank POS, and Civil Comments

Results and discussion

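Evaluated soft-labeling methods using two metrics: F1 score for raw performance and calibrated log-likelihood for uncertainty estimation
Calibrated log-likelihood scores predictions after temperature calibration, which gives a fair comparison of uncertainty estimates across methods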
Showed average performance of individual methods and results using only expert annotations and majority vote
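
The calibrated log-likelihood metric can be sketched as follows, assuming the common recipe of fitting a single softmax temperature on held-out data and then reporting the mean test log-likelihood at that temperature. The grid search, the function names, and the toy logits are all hypothetical and are not taken from the paper.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_norm for z in scaled]

def mean_log_likelihood(logit_rows, labels, temperature=1.0):
    """Average log-probability assigned to the true class."""
    total = sum(log_softmax(row, temperature)[y] for row, y in zip(logit_rows, labels))
    return total / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on a grid that maximizes held-out log-likelihood."""
    grid = grid or [0.25 * k for k in range(1, 41)]  # 0.25 .. 10.0, includes 1.0
    return max(grid, key=lambda t: mean_log_likelihood(logit_rows, labels, t))

def calibrated_log_likelihood(val_logits, val_labels, test_logits, test_labels):
    """Fit the temperature on validation data, then score the test data with it."""
    t = fit_temperature(val_logits, val_labels)
    return mean_log_likelihood(test_logits, test_labels, t), t

# Made-up toy logits: an overconfident model that is wrong on some items.
val_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 0, 0, 0]
test_logits = [[4.0, 0.0], [0.0, 4.0]]
test_labels = [0, 0]

cll, best_t = calibrated_log_likelihood(val_logits, val_labels, test_logits, test_labels)
uncalibrated = mean_log_likelihood(test_logits, test_labels)
print(best_t, cll, uncalibrated)
```

Calibrating first means a method is not rewarded or punished merely for being over- or under-confident by a constant factor, which is what makes the comparison across methods fair.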

Raw performance

RTE and MRE datasets are difficult to generalize from
Gold labels yield worse performance than soft labels in out-of-domain setting
POS tagging sees best performance with gold labels
Toxicity task benefits from both gold and crowd labels
Aggregation methods consistently perform better than individual soft-labeling methods
Adding gold labels for RTE and MRE tasks leads to worse performance

Uncertainty estimation

Uncertainty estimation can be improved with the addition of soft-labels, except for POS tagging.
Benefits are more pronounced for RTE and MRE tasks.
JSC aggregation method provides most consistent results across tasks.
Hybrid method offers good uncertainty estimation, especially in large-data regime.
Including gold labels yields better uncertainty estimation when labeled data is abundant.

Research questions

Performance of individual soft-labeling techniques is inconsistent in out-of-domain setting.
Aggregating soft-labels can improve performance.
Aggregating with JSC leads to best or near-best performance.
Aggregating across soft-labeling methods using JSC is consistently high performing.
Different individual soft-labeling methods are inconsistent in their uncertainty estimation.
Aggregating these different views of the crowd-sourced labels mitigates these fluctuations.
Jensen-Shannon centroid is a sensible and consistent choice across tasks.

Analysis

Jensen-Shannon divergence (JSD) of aggregation methods compared to individual methods
Statistically significant correlation between performance and JSD for RTE dataset
JSC aggregation method produces distributions closer to hubs of ensemble
JSC aggregation method may ignore divergent views of data
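
One way to picture this analysis is to measure how far an aggregated soft label sits from each individual view. The sketch below computes that distance with the Jensen-Shannon divergence; the helper names and the example distributions are hypothetical and only illustrate the measurement, not the paper's results.

```python
import math

def kl(p, q):
    """KL(p || q) with the convention 0 * log(0 / q) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mean_jsd_to_views(aggregate, views):
    """Average JSD between an aggregated label and each individual view."""
    return sum(jsd(aggregate, v) for v in views) / len(views)

# Hypothetical views: two that agree closely and one divergent view.
views = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
aggregate = [0.6, 0.15, 0.25]

per_view = [jsd(aggregate, v) for v in views]
print([round(d, 4) for d in per_view])
print(round(mean_jsd_to_views(aggregate, views), 4))
```

In this toy case the aggregate lies much closer to the two agreeing views than to the divergent one, which mirrors the concern that an aggregate near the hubs of an ensemble may underweight minority views.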

Conclusion

Systematic comparison of soft-labeling techniques from crowd-sourced labels
Out-of-domain setting tests generalization to unseen data
Four novel methods for aggregating multiple views of crowd-sourced labels
JSC yields consistently high raw performance and good uncertainty estimation
