Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Labels for real-world data are often given by multiple annotators.
CROWDLAB is an approach to estimate a consensus label for each example, a confidence score for that label, and a rating for each annotator.
CROWDLAB is based on simple weighted ensembling and utilizes a classifier model trained on the features of the examples.
CROWDLAB provides better estimates than alternative algorithms in evaluations on real-world data.

Training data for multiclass classification is often labeled by multiple annotators
Each example in the dataset is labeled by at least one annotator
Labels are aggregated into a single consensus label
CROWDLAB is a method that leverages a trained classifier to establish accurate consensus labels, estimate their quality, and estimate the quality of each annotator
CROWDLAB is easy to implement, computationally efficient, and flexible
Real-world multi-annotator datasets have a large disparity in annotator quality and many examples for which majority vote yields an incorrect consensus label
Leveraging a trained classifier can help estimate label quality for examples with fewer annotations
CROWDLAB accounts for the number of annotations an example has received, the quality of the annotators, and the accuracy and confidence of the classifier predictions
Notation includes examples, annotators, classes, consensus labels, consensus quality scores, annotator quality scores, and label quality scores
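
As a minimal sketch of the setup above, the snippet below stores each example's annotations as a mapping from annotator to chosen class and aggregates them by majority vote. The toy data, the annotator names, and the tie-breaking rule (smallest class index wins) are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

# Hypothetical toy data: one dict per example, mapping annotator id -> class.
# Every example has at least one annotation, as in the setup described above.
annotations = [
    {"a1": 0, "a2": 0, "a3": 1},
    {"a1": 2},
    {"a2": 1, "a3": 1},
    {"a1": 0, "a3": 2},
]


def majority_vote(example_labels):
    """Return the most frequent class for one example.

    Ties are broken by taking the smallest class index, which is an
    arbitrary choice made only for this illustration.
    """
    counts = Counter(example_labels.values())
    top = max(counts.values())
    return min(c for c, n in counts.items() if n == top)


consensus = [majority_vote(ex) for ex in annotations]
print(consensus)
```

The last example shows why majority vote alone is fragile: with two disagreeing annotators the outcome depends entirely on the tie-breaking rule.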

Classifier model M is used to predict labels based on feature values
CROWDLAB can be used with any type of classifier
Cross-validation is used to obtain out-of-sample predictions and avoid overfitting
Consensus labels are used to train M
Performance of methods leveraging M will benefit from improved classifier accuracy
CROWDLAB aims to account for classifier prediction shortcomings
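
The cross-validation step above can be sketched as follows: each example receives predicted probabilities from a copy of the model that did not see it during training. The nearest-centroid classifier, the one-dimensional features, and the interleaved fold assignment are all stand-ins chosen for brevity; in practice M could be any classifier trained on the consensus labels.

```python
def fit_centroids(features, labels, num_classes):
    """Toy stand-in for M: mean feature value per class (1-D features)."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for x, y in zip(features, labels):
        sums[y] += x
        counts[y] += 1
    return [sums[c] / counts[c] if counts[c] else None
            for c in range(num_classes)]


def predict_proba(centroids, x):
    """Normalize inverse distances to the centroids into probabilities."""
    scores = [0.0 if c is None else 1.0 / (1e-9 + abs(x - c))
              for c in centroids]
    total = sum(scores)
    return [s / total for s in scores]


def out_of_fold_probs(features, labels, num_classes, k=2):
    """Predict each example with a model trained on the other folds."""
    n = len(features)
    probs = [None] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held_out = [i for i in range(n) if i % k == fold]
        model = fit_centroids([features[i] for i in train],
                              [labels[i] for i in train], num_classes)
        for i in held_out:
            probs[i] = predict_proba(model, features[i])
    return probs


# Hypothetical data: the labels play the role of the consensus labels.
features = [0.1, 0.2, 0.0, 0.9, 1.0, 1.1]
labels = [0, 0, 0, 1, 1, 1]
probs = out_of_fold_probs(features, labels, num_classes=2, k=2)
```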

Estimate confidence of given consensus label for each example
Choose consensus label with highest consensus quality score
Agreement: fraction of annotators who agree with consensus label
Label Quality Score estimated by trained classifier model
CROWDLAB uses weighted ensemble of classifier predictions and annotator labels to modify prediction output
Likelihood parameter P set as average annotator agreement
Annotator predicted probability vector used in CROWDLAB
Dawid-Skene specifies generative model of dataset annotations
GLAD specifies more complex generative model of dataset annotations
Dawid-Skene with Model uses classifier to produce class predictions
GLAD with Model adds model’s predicted labels as additional annotator
Empirical Bayes uses classifier-derived prior distribution and likelihoods
Active Label Cleaning subtracts cross-entropy between classifier predicted probabilities and individual annotations
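
To make the weighted-ensemble idea concrete, here is a simplified sketch for a single example. It assumes that each annotator's label is turned into a probability vector that puts P on the chosen class and spreads the remainder evenly over the other classes, and that the quality of the consensus label is read off as its ensemble probability. The weights are simply passed in as inputs here; how CROWDLAB estimates them is not reproduced, and all function names and numbers are hypothetical.

```python
def agreement(example_labels, consensus):
    """Fraction of this example's annotators who chose the consensus label."""
    votes = list(example_labels.values())
    return sum(v == consensus for v in votes) / len(votes)


def annotator_vector(chosen, num_classes, p):
    """Assumed form: probability p on the chosen class, rest shared evenly."""
    rest = (1.0 - p) / (num_classes - 1)
    return [p if c == chosen else rest for c in range(num_classes)]


def ensemble_probs(model_probs, example_labels, p, w_model, w_annotator):
    """Weighted average of the classifier's and annotators' vectors."""
    num_classes = len(model_probs)
    total = [w_model * q for q in model_probs]
    weight_sum = w_model
    for annotator, chosen in example_labels.items():
        w = w_annotator[annotator]
        vec = annotator_vector(chosen, num_classes, p)
        total = [t + w * v for t, v in zip(total, vec)]
        weight_sum += w
    return [t / weight_sum for t in total]


# Hypothetical inputs for one example with three classes.
model_probs = [0.6, 0.3, 0.1]          # out-of-sample classifier output
example_labels = {"a1": 1, "a2": 1}    # both annotators chose class 1
weights = {"a1": 1.0, "a2": 0.5}       # illustrative annotator weights

probs = ensemble_probs(model_probs, example_labels, p=0.8,
                       w_model=1.0, w_annotator=weights)
consensus = probs.index(max(probs))    # label with the highest score
quality = probs[consensus]             # its consensus quality score
```

In this toy case the classifier favors class 0 but the two annotators outweigh it, so the ensemble settles on class 1 with a moderate rather than a near-certain score.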

Evaluate various methods using real-world multi-annotator data with label errors
Three benchmarks based on CIFAR-10H data
Ground truth labels from original CIFAR-10 dataset
Evaluate methods using ResNet-18 and Swin Transformer models
Metrics for the estimation tasks: accuracy, precision/recall, Spearman correlation
AUROC, AUPRC, Lift at various cutoffs to evaluate consensus quality scores
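
The ranking metrics above can be illustrated with a small sketch. It assumes that a lower consensus quality score should flag a consensus label that disagrees with the ground truth, computes AUROC as a pairwise ranking probability, and defines Lift at cutoff k as the error rate among the k lowest-scored examples divided by the overall error rate. These are common textbook definitions and may differ in detail from the paper's evaluation code; the scores and error flags are made up.

```python
def auroc(scores, is_error):
    """Chance that a random error scores below a random correct label.

    Ties between an error and a correct label count as half a win.
    """
    pos = [s for s, e in zip(scores, is_error) if e]
    neg = [s for s, e in zip(scores, is_error) if not e]
    wins = sum((p < n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def lift_at_k(scores, is_error, k):
    """Error rate among the k lowest scores relative to the base rate."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    top_rate = sum(is_error[i] for i in order[:k]) / k
    base_rate = sum(is_error) / len(is_error)
    return top_rate / base_rate


# Hypothetical quality scores and whether each consensus label is wrong.
scores = [0.9, 0.2, 0.8, 0.25, 0.95, 0.3]
is_error = [False, True, False, False, False, True]

auc = auroc(scores, is_error)
lift2 = lift_at_k(scores, is_error, k=2)
```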

CROWDLAB performs best across evaluations for consensus and annotator quality scores
All evaluation metrics improve when the Swin Transformer is used instead of the ResNet-18 model
Label Quality Score estimates consensus quality well when classifier is accurate
Label Quality Score performs worse with lower accuracy classifier
CROWDLAB outperforms other methods regardless of classifier accuracy
CROWDLAB retains strong performance on datasets with varying numbers of annotations

CROWDLAB considers a model’s estimated confidence and accuracy relative to individual annotators.
CROWDLAB is compatible with any classifier and training strategy.
CROWDLAB has a limitation in settings where every example is labeled by a large number of annotators.