Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Labels for real-world data are often given by multiple annotators.
  • CROWDLAB is an approach to estimate a consensus label, a confidence score, and a rating for each annotator.
  • CROWDLAB is based on simple weighted ensembling and utilizes a classifier model trained on the features of the examples.
  • CROWDLAB provides superior estimates than alternative algorithms in evaluations on real-world data.

Paper Content

Introduction

  • Training data for multiclass classification is often labeled by multiple annotators
  • Each example in the dataset is labeled by at least one annotator
  • Labels are aggregated into a single consensus label
  • CROWD-LAB is a method that leverages a trained classifier to establish accurate consensus labels, estimate their quality, and estimate the quality of each annotator
  • CROWD-LAB is easy to implement, computationally efficient, and flexible
  • Real-world multi-annotator datasets have a large disparity in annotator quality and many examples whose consensus label will be incorrect with majority vote
  • Leveraging a trained classifier can help estimate label quality for examples with fewer annotations
  • CROWD-LAB accounts for the number of annotations an example has received, the quality of the annotators, and the accuracy and confidence of the classifier predictions
  • Notation includes examples, annotators, classes, consensus labels, consensus quality scores, annotator quality scores, and label quality scores

Methods

  • Classifier model M is used to predict labels based on feature values
  • CROWD-LAB can be used with any type of classifier
  • Cross-validation is used to avoid overfit predictions
  • Consensus labels are used to train M
  • Performance of methods leveraging M will benefit from improved classifier accuracy
  • CROWDLAB aims to account for classifier prediction shortcomings

Consensus quality scoring methods

  • Estimate confidence of given consensus label for each example
  • Choose consensus label with highest consensus quality score
  • Fraction of annotators who agree with consensus label
  • Label Quality Score estimated by trained classifier model
  • CROWDLAB uses weighted ensemble to modify prediction output
  • Likelihood parameter P set as average annotator agreement
  • Annotator predicted probability vector used in CROWDLAB
  • Dawid-Skene specifies generative model of dataset annotations
  • GLAD specifies more complex generative model of dataset annotations
  • Dawid-Skene with Model uses classifier to produce class predictions
  • GLAD with Model adds model’s predicted labels as additional annotator
  • Empirical Bayes uses classifier-derived prior distribution and likelihoods
  • Active Label Cleaning subtracts cross-entropy between classifier predicted probabilities and individual annotations

Annotator quality scoring methods

  • Estimate consensus labels and their quality
  • Rank which annotators provide the best/worst labels
  • Agreement-based scores rate annotators based on accuracy of labels
  • Label Quality Score uses classifier predictions to rate annotator quality
  • CROWDLAB takes into account label quality and agreement with consensus
  • Dawid-Skene scores annotators based on probability of agreement with true label
  • GLAD estimates expertise of each annotator
  • CROWDLAB estimates single likelihood parameter and per annotator statistic

Experiments

  • Evaluate various methods using real-world multi-annotator data with label errors
  • Three benchmarks based on CIFAR-10H data
  • Ground truth labels from original CIFAR-10 dataset
  • Evaluate methods using ResNet-18 and Swin Transformer models
  • Metrics to measure estimation tasks: accuracy, precision/recall, Spearman correlation
  • AUROC, AUPRC, Lift at various cutoffs to evaluate consensus quality scores

Results

  • CROWDLAB performs best across evaluations for consensus and annotator quality scores
  • All evaluation metrics improve when used with Swin Transformer vs. ResNet-18 model
  • Label Quality Score estimates consensus quality when classifier is accurate
  • Label Quality Score performs worse with lower accuracy classifier
  • CROWDLAB outperforms other methods regardless of classifier accuracy
  • CROWDLAB retains strong performance on datasets with varying numbers of annotations

Discussion

  • CROWDLAB considers a model’s estimated confidence and accuracy relative to individual annotators.
  • CROWDLAB is compatible with any classifier and training strategy.
  • CROWDLAB has a limitation in settings where every example is labeled by a large number of annotators.

A experiment details

  • Two popular architectures used for image classification
  • Training done using 5-fold cross-validation
  • Annotator quality scores not evaluated
  • CIFAR-10 dataset is easy to label
  • CIFAR-10H dataset has unrealistically high annotator agreement
  • Hardest dataset used as primary benchmark, with 511 annotators in total
  • Uniform and Complete datasets also evaluated
  • Results based on separate classifier models trained for each dataset
  • Majority-vote consensus labels and true labels used for evaluation
  • Lift@T used to measure precision of consensus quality scoring methods