Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Human evaluation is the foundation for evaluating summarization systems and automatic metrics.
- Existing protocols and benchmarks have low inter-annotator agreement or lack scale.
- Proposed modified summarization salience protocol with fine-grained semantic units.
- RoSE benchmark with over 22k summary-level annotations.
- Comparing ACU protocol with other protocols.
- Evaluating existing automatic metrics using collected human annotations.
- Implications for evaluating large language models.
Paper Content
Introduction
- Human evaluation is essential for assessing summarization systems and automatic metrics
- Inter-annotator agreement can be difficult to achieve
- Human evaluation studies need a large enough sample size to find statistically significant results
- ACU protocol proposed to ensure human evaluation is more objective
- RoSE benchmark created with 22k summary-level annotations
- Four human evaluation protocols compared and can lead to different model preferences
- Automatic metrics evaluated across different human evaluation protocols
Related work
Human evaluation benchmarks
- Human judgment data must be collected to analyze summarization progress
- Recent efforts focus on aggregating model outputs and annotating quality dimensions
- Bhandari et al. (2020) annotates summaries according to semantic content units
- Benchmark only covers a single dataset without focus on similarly-performing systems
- Modified protocol for summarization salience evaluation introduced with higher interannotator agreement
- System outputs collected from recently-introduced models over three summarization datasets
Summarization meta-evaluation
- Meta-evaluation of current state of evaluation
- Analysis of ROUGE and its variations
- Analysis of broader set of metrics
- Evaluating zeroshot large language models
- Comparing approaches to human evaluation
- Studying annotation protocols for quality dimensions
Atomic content units for summarization evaluation
- Describes an Atomic Content Unit (ACU) annotation protocol
- Used for reference-based summary salience evaluation
Preliminaries
- Salience is a desired summary quality
- Human evaluation of summary salience can be conducted in either reference-free or reference-based manners
- Focus on reference-based evaluation for dataset collection
Acu annotation protocol
- Main goal of ACU protocol is to reduce subjectivity of reference-based summarization human evaluation
- Inter-annotator agreement is hard to achieve in human evaluation at summary level
- Difficulty in achieving consensus on comparison between two long text sequences
- ACU protocol simplifies basic unit of annotation task
- Annotators only need to decide on presence of single fact in another text sequence
- Evaluation process decomposed into two steps: extracting facts from one text sequence and checking for presence of extracted facts in another sequence
- ACUs constructed with one atomic fact and other minimal, necessary information
- ACU matching annotations can be aggregated into summary-level score
- Normalized ACU score penalizes summaries longer than reference
Acu annotation collection
- Collected human annotations according to ACU annotation protocol on 3 summarization datasets
- Collected and annotated generated summaries of pretrained summarization systems
- Collected around 21.8k ACU-level annotations and around 22k summary-level annotations
- Inter-annotator agreement score of 0.7571 for summary-level annotations and 0.7528 for ACU-level annotations
- Higher agreement scores than prior collections
Power analysis
- Statistical power is the probability that the null hypothesis of a statistical significance test is rejected given there is a real effect.
- Statistical power depends on the number of test examples and the observed system difference.
- High statistical power is difficult to reach when the system performance is similar.
- Increasing the sample size can effectively raise the statistical power.
- The dataset can provide more stable summarization system evaluation thanks to its higher statistical power.
Summarization system analysis
- ACU score has strong correlation with summary length
- Average summary length of different systems can vary greatly
- Summary length difference not always captured by ROUGE F1 scores
- Summary length can be a separate aspect of summary quality in evaluation
Evaluating annotation protocols
- Human evaluation protocol design can affect evaluation results and system/metric evaluation
- Three protocols used: Prior, Ref-free, and Ref-based
- Prior protocol evaluates annotators’ preferences without input document
- Ref-free protocol evaluates if summaries cover salient information of input document
- Ref-based protocol evaluates if generated summaries accurately cover reference summaries
Annotation collection
- Collected three annotations per summary on a 100-example subset of input documents
- Annotators judge all of the summaries within a single HIT with a score from 1 to 5
- Inter-annotator agreement measured by Krippendorff’s alpha for the protocols of 0.3455, 0.2201, 0.2741
- Prior protocol achieves highest agreement despite not being grounded in any other text
- Collected additional set of annotations to understand annotation protocols with large language models
- Annotated summaries for the protocols, including ACU protocol, for summaries from GPT-3, 11 T0, BRIO, and BART
- Reference-free annotations performed before reference-based annotations to avoid potential biases
Results analysis
- Investigated correlation of evaluation results of different protocols
- Calculated correlation at system and summary level
- Used normalized ACU score
- Ref-free protocol has strong correlation with Prior protocol
- Both Prior and Ref-free protocols have strong correlation with summary length
- Ref-free and Ref-based protocols have near-zero correlation at summary level and negative one at system level
- Ref-based protocol has strong correlation with ACU protocol
- ACU protocol has higher statistical power than Ref-based protocol
- GPT-3 receives highest score under Ref-free protocol
- Annotators have prior preference for GPT-3 summaries
- Annotator-based case study found annotator’s own Prior result is better prediction of their Ref-free result than Ref-free result of other annotators
Evaluating automatic metrics
- Analyzed correlations and robustness of metrics
- Used representative subset of recent metrics
- Additional results in Appendix E on 50 automatic metrics
Metrics
- ROUGE, METEOR, CHRF, BERTScore, BARTScore, SummaQA, QAEval, and Lite 2 Pyramid are metrics analyzed in the paper
- ROUGE measures n-gram and sequence overlap between generated summary and reference summaries
- METEOR aligns unigrams stemming and synonym matches
- CHRF calculates character-based overlap between model and reference summaries
- Table 8 shows the correlation between automatic metric scores and ACU scores of system outputs
Metric evaluation with acu annotations
- Automatic metrics can achieve high correlation with ACU scores, especially at the system level
- Automatic metrics perform better at the system level than at the summary level
- Metric performance varies across different datasets
- Correlation between automatic metrics and human evaluation is lower when system performance is more similar
Analysis
- Analyze metric evaluation through correlation confidence intervals, statistical power of system comparisons, and difference in correlations
- Confidence intervals of system-level correlations with ACU scores calculated using bootstrapping
- Confidence intervals are large
- Increasing sample size reduces confidence interval and increases statistical power
Power analysis of metric comparison
- Power analysis of metric comparison conducted with 200 metric pairs
- Difficult to find significant results when metric performance is similar
- Increasing sample size increases chance of finding significant results
- Automatic metric performance differs greatly under different human evaluation protocols
- Reference-based automatic metrics perform better under reference-based evaluation protocols
Discussion
- Different human evaluation protocols can lead to drastically different results
- Reference-based automatic metrics should be evaluated by reference-based human evaluation
- Reference-free human evaluation protocol does not faithfully reflect the metric’s ability to perform its intended reference-based evaluation
- Targeted evaluation of text summarization should be defined by both the source and target texts
- Targeted human evaluation can yield more reliable and objective results
- Reference summaries in existing datasets can still be useful for the appropriate purpose
Conclusion
- We introduce RoSE, a benchmark for summarization evaluation
- RoSE allows for more robust summarization evaluation across three datasets and two domains
- We re-evaluate the current state of human evaluation and its implications for summarization systems and automatic metrics
- Potential data biases in the data annotator and data models are noted
- English-language data only included in benchmark and analysis
- High quality benchmark ensured through spot checks and worker qualifications
- Quality of references questioned
- ACUs not weighted during aggregation
- ACUs written consistently by multiple annotators
- MTurk workers recruited with qualifications and paid $12/hour
- Summarization models annotated on CNNDM, XSum, and SamSum
- Power analysis conducted with bootstrapping test
- Correlations used to analyze similarity between human evaluation protocols and automatic metrics
- 50 different automatic metrics evaluated with ACU benchmark
- Confidence intervals of system-level correlations with ACU scores calculated
- Power analysis of metric comparison conducted with Kendall’s correlations and permutation test
- Survey of human evaluation practices of 55 selected papers conducted