Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Difficult to select effective training signal for natural language processing tasks
  • Expert annotations expensive, crowd-sourced annotations may not be reliable
  • Recent work in machine learning shows learning from soft-labels acquired from crowd annotations can be effective
  • Proposes new methods for acquiring soft-labels from crowd-annotations by aggregating distributions
  • Results in best or near-best performance and uncertainty estimation across tasks

Paper Content

Introduction

  • Supervised machine learning requires labels as training data
  • Tradeoffs exist when deciding how to collect and use labels
  • Recent work has focused on soft-labeling schemes to improve accuracy and uncertainty estimation
  • Little work has compared how different soft-labeling schemes affect out-of-distribution performance and uncertainty estimation in NLP
  • This paper seeks to fill this gap by providing an in-depth study into soft-labeling techniques
  • It proposes four multiview aggregation methods to generate aggregated soft-labels
  • It also looks at knowledge distillation to build compact but robust models
  • It uses multi-view structure to improve test set performance of the final distilled model

Methods

  • Learning from crowd annotations is a topic with a rich literature.
  • Soft-labeling schemes provide a distribution over possible classes in a dataset.
  • Soft-labeling schemes can help regularize a downstream classifier.
  • Multiple soft-labeling schemes have been demonstrated to provide good training signals on different NLP tasks.
  • This paper provides a systematic analysis of performance and best practices when considering generalization to out of domain data.
  • This paper proposes several methods for aggregating the distributions over class labels.

Soft labeling methods

  • Standard normalization scheme transforms crowd-sourced labels into a probability distribution
  • Softmax function used to obtain soft labels
  • Dawid & Skene model used to aggregate crowd-sourced labels into a single groundtruth label
  • Model designed to explain away inconsistencies of individual annotators

Combining soft labels

  • Acquire labels that capture multiple views of annotations
  • Combine soft-labels from different views
  • Inexpensive, requires no additional annotations
  • Goal is to produce a categorical distribution from a set of distributions
  • Averaging is the most basic model
  • Centroid is the minimizer of the sum of Jensen-Shannon divergences
  • Temperature scaling softens each distribution
  • Hybrid approach combines temperature scaling and centroid

Learning from soft labels

  • Learning from soft labels requires minimizing divergence between classifier probability distribution and label distribution
  • Kullback-Leibler divergence used as loss function between classifier probability distribution and soft labels from Sections 3.1 and 3.2

Experimental setup

  • Focus experiments on 3 research questions related to learning from crowd-sourced labels
  • Experiments focus on out-of-domain setting with distribution shift between training and test data
  • Use RoBERTa as base network with same training hyperparameters
  • Experiments measure impact of crowd-sourced soft targets on model generalization
  • Tasks include recognizing textual entailment, relation extraction, part-of-speech tagging, and toxicity detection
  • Training datasets include Pascal RTE-1, MRE, Gimpel, and Google Jigsaw
  • Test datasets include SNLI, causal claim-strength, Penn Treebank POS, and Civil Comments

Results and discussion

  • Evaluated performance of soft-labeling methods using two metrics: F1 score and calibrated log-likelihood
  • Compared methods using calibrated log-likelihood to obtain fair comparison
  • Showed average performance of individual methods and results using only expert annotations and majority vote

Raw performance

  • RTE and MRE datasets are difficult to generalize from
  • Gold labels yield worse performance than soft labels in out-of-domain setting
  • POS tagging sees best performance with gold labels
  • Toxicity task benefits from both gold and crowd labels
  • Aggregation methods consistently perform better than individual soft-labeling methods
  • Adding gold labels for RTE and MRE tasks leads to worse performance

Uncertainty estimation

  • Uncertainty estimation can be improved with the addition of soft-labels, except for POS tagging.
  • Benefits are more pronounced for RTE and MRE tasks.
  • JSC aggregation method provides most consistent results across tasks.
  • Hybrid method offers good uncertainty estimation, especially in large-data regime.
  • Including gold labels yields better uncertainty estimation when labeled data is abundant.

Research questions

  • Performance of individual soft-labeling techniques is inconsistent in out-of-domain setting.
  • Aggregating soft-labels can improve performance.
  • Aggregating with JSC leads to best or near-best performance.
  • Aggregating across soft-labeling methods using JSC is consistently high performing.
  • Different individual soft-labeling methods are inconsistent in their uncertainty estimation.
  • Aggregating these different views of the crowd-sourced labels mitigates these fluctuations.
  • Jenson-Shannon centroid is a sensible and consistent choice across tasks.

Analysis

  • JSD of aggregation methods compared to individual methods
  • Statistically significant correlation between performance and JSD for RTE dataset
  • JSC aggregation method produces distributions closer to hubs of ensemble
  • JSC aggregation method may ignore divergent views of data

Conclusion

  • Systematic comparison of soft-labeling techniques from crowd-sourced labels
  • Out-of-domain setting allows for generalization to unseen data
  • Four novel methods for aggregating multiple views of crowd-sourced labels
  • JSC yields consistently high raw performance and good uncertainty estimation