Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Labels for real-world data are often given by multiple annotators.
- CROWDLAB is an approach to estimate a consensus label, a confidence score, and a rating for each annotator.
- CROWDLAB is based on simple weighted ensembling and utilizes a classifier model trained on the features of the examples.
- CROWDLAB provides better estimates than alternative algorithms in evaluations on real-world data.
Paper Content
Introduction
- Training data for multiclass classification is often labeled by multiple annotators
- Each example in the dataset is labeled by at least one annotator
- Labels are aggregated into a single consensus label
- CROWDLAB is a method that leverages a trained classifier to establish accurate consensus labels, estimate their quality, and estimate the quality of each annotator
- CROWDLAB is easy to implement, computationally efficient, and flexible
- Real-world multi-annotator datasets show large disparities in annotator quality and contain many examples whose majority-vote consensus label is incorrect
- Leveraging a trained classifier can help estimate label quality for examples with fewer annotations
- CROWDLAB accounts for the number of annotations an example has received, the quality of the annotators, and the accuracy and confidence of the classifier predictions
- Notation includes examples, annotators, classes, consensus labels, consensus quality scores, annotator quality scores, and label quality scores
Methods
- Classifier model M is used to predict labels based on feature values
- CROWDLAB can be used with any type of classifier
- Cross-validation is used to avoid overfit predictions
- Consensus labels are used to train M
- Performance of methods leveraging M will benefit from improved classifier accuracy
- CROWDLAB aims to account for classifier prediction shortcomings
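The cross-validation step above can be sketched as follows. This is a minimal illustration of how out-of-sample predicted probabilities are obtained, not the paper's training code. The helper names (`kfold_indices`, `fit_prior`, `out_of_sample_probs`) are hypothetical, and the "classifier" is a toy stand-in that ignores features and returns smoothed class frequencies; in practice M would be a real model and the folds would usually be shuffled.

```python
from collections import Counter


def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_prior(labels, num_classes):
    """Toy stand-in for M (assumption): smoothed class frequencies, no features."""
    counts = Counter(labels)
    total = len(labels) + num_classes  # add-one smoothing
    return [(counts.get(c, 0) + 1) / total for c in range(num_classes)]


def out_of_sample_probs(labels, num_classes, k=5):
    """Give each example probabilities from a model that never saw its label."""
    n = len(labels)
    probs = [None] * n
    for held_out in kfold_indices(n, k):
        held = set(held_out)
        train_labels = [labels[i] for i in range(n) if i not in held]
        model = fit_prior(train_labels, num_classes)
        for i in held_out:
            probs[i] = model
    return probs


consensus_labels = [0, 0, 1, 1, 0, 1]
pred_probs = out_of_sample_probs(consensus_labels, num_classes=2, k=3)
```

Because every example is scored by a model fit on the other folds, its predicted probabilities are not inflated by the model having memorized that example's label.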
Consensus quality scoring methods
- Estimate confidence of given consensus label for each example
- Choose consensus label with highest consensus quality score
- Agreement: fraction of annotators who agree with the consensus label
- Label Quality Score estimated by trained classifier model
- CROWDLAB uses weighted ensemble to modify prediction output
- Likelihood parameter P set as average annotator agreement
- Annotator predicted probability vector used in CROWDLAB
- Dawid-Skene specifies generative model of dataset annotations
- GLAD specifies more complex generative model of dataset annotations
- Dawid-Skene with Model uses classifier to produce class predictions
- GLAD with Model adds model’s predicted labels as additional annotator
- Empirical Bayes uses classifier-derived prior distribution and likelihoods
- Active Label Cleaning subtracts cross-entropy between classifier predicted probabilities and individual annotations
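To make the agreement baseline and the weighted-ensemble idea above concrete, here is a hedged sketch. It is not the paper's exact formulation: the ensemble weights are passed in as plain inputs (how CROWDLAB derives them is not covered in this summary), and the annotator probability vector is assumed to place mass `p` on the chosen class and spread the remainder uniformly over the other classes. All function names are hypothetical.

```python
from collections import Counter


def majority_vote(annotations):
    """Most common label for one example (ties go to the smallest label)."""
    counts = Counter(annotations)
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)


def agreement_score(annotations, consensus):
    """Fraction of annotators whose label matches the consensus label."""
    return sum(a == consensus for a in annotations) / len(annotations)


def annotator_prob_vector(label, num_classes, p):
    """Assumed form: mass p on the chosen class, the rest spread uniformly."""
    rest = (1.0 - p) / (num_classes - 1)
    return [p if c == label else rest for c in range(num_classes)]


def ensemble_probs(model_probs, annotations, model_weight, annotator_weights, p):
    """Weighted average of the classifier vector and each annotator vector."""
    k = len(model_probs)
    total = [model_weight * q for q in model_probs]
    for label, w in zip(annotations, annotator_weights):
        vec = annotator_prob_vector(label, k, p)
        total = [t + w * v for t, v in zip(total, vec)]
    norm = model_weight + sum(annotator_weights)
    return [t / norm for t in total]


def consensus_and_quality(model_probs, annotations, model_weight, annotator_weights, p):
    """Consensus = argmax of the ensemble; quality = its ensemble probability."""
    probs = ensemble_probs(model_probs, annotations, model_weight, annotator_weights, p)
    consensus = max(range(len(probs)), key=probs.__getitem__)
    return consensus, probs[consensus]


model_probs = [0.7, 0.2, 0.1]   # classifier favors class 0
annotations = [1, 1]            # both annotators chose class 1
equal = consensus_and_quality(model_probs, annotations, 1.0, [1.0, 1.0], p=0.8)
trusted_model = consensus_and_quality(model_probs, annotations, 4.0, [1.0, 1.0], p=0.8)
```

With equal weights the two annotators outvote the classifier; with a much larger classifier weight the consensus flips to the classifier's class. This illustrates why the choice of weights matters when an example has few annotations.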
Annotator quality scoring methods
- Estimate consensus labels and their quality
- Rank which annotators provide the best/worst labels
- Agreement-based scores rate annotators based on accuracy of labels
- Label Quality Score uses classifier predictions to rate annotator quality
- CROWDLAB takes into account label quality and agreement with consensus
- Dawid-Skene scores annotators based on probability of agreement with true label
- GLAD estimates expertise of each annotator
- CROWDLAB estimates single likelihood parameter and per annotator statistic
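The annotator scores above can be illustrated with a small sketch that combines agreement with consensus and classifier-based label quality. This is an assumption-laden illustration, not the paper's formula: label quality is taken here to be the classifier's predicted probability of the label the annotator chose, and the mixing weight `mix` is a hypothetical free parameter rather than a value estimated from data.

```python
def agreement_with_consensus(annotator_labels, consensus):
    """Fraction of one annotator's labels that match the consensus labels."""
    matches = [consensus[i] == y for i, y in annotator_labels.items()]
    return sum(matches) / len(matches)


def mean_label_quality(annotator_labels, pred_probs):
    """Average classifier probability of the labels this annotator chose."""
    scores = [pred_probs[i][y] for i, y in annotator_labels.items()]
    return sum(scores) / len(scores)


def combined_annotator_score(annotator_labels, consensus, pred_probs, mix):
    """Blend label quality and agreement; `mix` is a hypothetical weight in [0, 1]."""
    q = mean_label_quality(annotator_labels, pred_probs)
    a = agreement_with_consensus(annotator_labels, consensus)
    return mix * q + (1.0 - mix) * a


# Example ids map to consensus labels and classifier probabilities.
consensus = {0: 0, 1: 1, 2: 1}
pred_probs = {0: [0.9, 0.1], 1: [0.2, 0.8], 2: [0.4, 0.6]}

# Each annotator labels only a subset of examples: {example_id: label}.
annotator_a = {0: 0, 1: 1, 2: 0}
annotator_b = {0: 1, 2: 1}

score_a = combined_annotator_score(annotator_a, consensus, pred_probs, mix=0.5)
score_b = combined_annotator_score(annotator_b, consensus, pred_probs, mix=0.5)
ranking = sorted(["a", "b"], key={"a": score_a, "b": score_b}.get, reverse=True)
```

Sorting annotators by such a score gives the best-to-worst ranking that the annotator quality methods are meant to produce.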
Experiments
- Evaluate various methods using real-world multi-annotator data with label errors
- Three benchmarks based on CIFAR-10H data
- Ground truth labels from original CIFAR-10 dataset
- Evaluate methods using ResNet-18 and Swin Transformer models
- Metrics to measure estimation tasks: accuracy, precision/recall, Spearman correlation
- AUROC, AUPRC, Lift at various cutoffs to evaluate consensus quality scores
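As an illustration of how a consensus quality score might be evaluated with AUROC, the sketch below treats "the consensus label is wrong" as the positive class and checks whether wrong labels receive lower quality scores than correct ones. The function name and the toy data are hypothetical; a library routine would normally be used instead of this quadratic pairwise loop.

```python
def auroc_for_error_detection(quality_scores, is_error):
    """Probability that a wrong consensus label scores below a correct one.

    Ties count as half. Requires at least one error and one non-error.
    """
    err = [s for s, e in zip(quality_scores, is_error) if e]
    ok = [s for s, e in zip(quality_scores, is_error) if not e]
    if not err or not ok:
        raise ValueError("need at least one error and one non-error")
    wins = 0.0
    for s_err in err:
        for s_ok in ok:
            if s_err < s_ok:
                wins += 1.0
            elif s_err == s_ok:
                wins += 0.5
    return wins / (len(err) * len(ok))


# Toy data: quality score per example, and whether its consensus label
# disagrees with the ground truth.
scores = [0.9, 0.5, 0.8, 0.4]
errors = [False, True, False, False]
auroc = auroc_for_error_detection(scores, errors)
```

A value of 1.0 means every incorrect consensus label is ranked below every correct one, while 0.5 corresponds to a score that carries no information about label correctness.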
Results
- CROWDLAB performs best across evaluations for consensus and annotator quality scores
- All evaluation metrics improve when used with Swin Transformer vs. ResNet-18 model
- Label Quality Score estimates consensus quality well when the classifier is accurate
- Label Quality Score performs worse with lower accuracy classifier
- CROWDLAB outperforms other methods regardless of classifier accuracy
- CROWDLAB retains strong performance on datasets with varying numbers of annotations
Discussion
- CROWDLAB considers a model’s estimated confidence and accuracy relative to individual annotators.
- CROWDLAB is compatible with any classifier and training strategy.
- CROWDLAB has a limitation in settings where every example is labeled by a large number of annotators.
Appendix A: Experiment details
- Two popular architectures used for image classification
- Training done using 5-fold cross-validation
- Annotator quality scores not evaluated
- CIFAR-10 dataset is easy to label
- CIFAR-10H dataset has unrealistically high annotator agreement
- Hardest dataset used as primary benchmark, with 511 annotators in total
- Uniform and Complete datasets also evaluated
- Results based on separate classifier models trained for each dataset
- Majority-vote consensus labels and true labels used for evaluation
- Lift@T used to measure precision of consensus quality scoring methods
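One common reading of Lift@T, assumed here, is the fraction of truly incorrect consensus labels among the T lowest-scoring examples, divided by the fraction of incorrect consensus labels in the whole dataset. The sketch below follows that reading; the function name and the toy data are hypothetical.

```python
def lift_at_t(quality_scores, is_error, t):
    """Precision among the t lowest-scoring examples, relative to the base error rate."""
    n = len(quality_scores)
    if not 0 < t <= n:
        raise ValueError("t must be between 1 and the number of examples")
    base_rate = sum(is_error) / n
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no errors")
    # Lowest quality scores first: these are the examples flagged as most suspect.
    order = sorted(range(n), key=lambda i: quality_scores[i])
    precision = sum(is_error[i] for i in order[:t]) / t
    return precision / base_rate


scores = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.95, 0.1, 0.5]
errors = [False, True, False, False, True, False, False, False, False, False]
lift_2 = lift_at_t(scores, errors, t=2)
lift_3 = lift_at_t(scores, errors, t=3)
```

A lift above 1 means the lowest-scoring examples contain proportionally more errors than the dataset as a whole, so reviewing them first is more efficient than reviewing examples at random.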