Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Human evaluation is the foundation for evaluating summarization systems and automatic metrics.
Existing protocols and benchmarks have low inter-annotator agreement or lack scale.
A modified summarization salience protocol, Atomic Content Units (ACUs), is proposed, built on fine-grained semantic units.
RoSE benchmark with over 22k summary-level annotations.
Comparing ACU protocol with other protocols.
Evaluating existing automatic metrics using collected human annotations.
Implications for evaluating large language models.

Paper Content

Introduction

Human evaluation is essential for assessing summarization systems and automatic metrics
Inter-annotator agreement can be difficult to achieve
Human evaluation studies need a large enough sample size to find statistically significant results
ACU protocol proposed to ensure human evaluation is more objective
RoSE benchmark created with 22k summary-level annotations
Four human evaluation protocols compared; they can lead to different model preferences
Automatic metrics evaluated across different human evaluation protocols

Related work

Human evaluation benchmarks

Human judgment data must be collected to analyze summarization progress
Recent efforts focus on aggregating model outputs and annotating quality dimensions
Bhandari et al. (2020) annotate summaries according to semantic content units
Their benchmark covers only a single dataset and does not focus on similarly performing systems
Modified protocol for summarization salience evaluation introduced with higher inter-annotator agreement
System outputs collected from recently-introduced models over three summarization datasets

Summarization meta-evaluation

Meta-evaluation of current state of evaluation
Analysis of ROUGE and its variations
Analysis of broader set of metrics
Evaluating zero-shot large language models
Comparing approaches to human evaluation
Studying annotation protocols for quality dimensions

Atomic content units for summarization evaluation

Describes an Atomic Content Unit (ACU) annotation protocol
Used for reference-based summary salience evaluation

Preliminaries

Salience is a desired summary quality
Human evaluation of summary salience can be conducted in either reference-free or reference-based manners
Focus on reference-based evaluation for dataset collection

ACU annotation protocol

Main goal of ACU protocol is to reduce subjectivity of reference-based summarization human evaluation
Inter-annotator agreement is hard to achieve in human evaluation at summary level
Difficulty in achieving consensus on comparison between two long text sequences
ACU protocol simplifies basic unit of annotation task
Annotators only need to decide on presence of single fact in another text sequence
Evaluation process decomposed into two steps: extracting facts from one text sequence and checking for presence of extracted facts in another sequence
ACUs constructed with one atomic fact and other minimal, necessary information
ACU matching annotations can be aggregated into summary-level score
Normalized ACU score penalizes summaries longer than reference
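
The aggregation described above can be sketched as follows. This is a minimal illustration, not the paper's reference implementation: the function names are hypothetical, and the exponential form of the length penalty and the coefficient `alpha` are assumptions. The notes only state that the normalized score penalizes summaries longer than the reference.

```python
import math


def acu_score(matched):
    """Un-normalized ACU score: fraction of reference ACUs found in the summary.

    `matched` holds one boolean per ACU written from the reference summary.
    """
    if not matched:
        raise ValueError("at least one ACU is required")
    return sum(matched) / len(matched)


def normalized_acu_score(matched, summary_len, reference_len, alpha=1.0):
    """ACU score scaled by a length penalty (assumed form, see lead-in).

    The penalty equals 1 when the summary is no longer than the reference
    and decays exponentially as the summary grows beyond it.
    """
    if reference_len <= 0 or alpha <= 0:
        raise ValueError("reference_len and alpha must be positive")
    penalty = math.exp(min(0.0, (1.0 - summary_len / reference_len) / alpha))
    return penalty * acu_score(matched)


# Two of four reference ACUs are present in the system summary.
matches = [True, True, False, False]
print(acu_score(matches))
print(normalized_acu_score(matches, summary_len=20, reference_len=40))
print(normalized_acu_score(matches, summary_len=80, reference_len=40))
```

Under these assumptions a summary shorter than the reference keeps its raw score, while one twice as long as the reference has its score scaled down.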

ACU annotation collection

Collected human annotations according to ACU annotation protocol on 3 summarization datasets
Collected and annotated generated summaries of pretrained summarization systems
Collected around 21.8k ACU-level annotations and around 22k summary-level annotations
Inter-annotator agreement score of 0.7571 for summary-level annotations and 0.7528 for ACU-level annotations
Higher agreement scores than prior collections

Power analysis

Statistical power is the probability that the null hypothesis of a statistical significance test is rejected given there is a real effect.
Statistical power depends on the number of test examples and the observed system difference.
High statistical power is difficult to reach when the system performance is similar.
Increasing the sample size can effectively raise the statistical power.
The dataset can provide more stable summarization system evaluation thanks to its higher statistical power.
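
A simulation-based estimate of statistical power might look like the sketch below. It assumes per-example scores for two systems on the same examples, treats them as the population, and counts how often a paired bootstrap test rejects the null hypothesis at a given sample size. The function names, the percentile-style p-value, and the default trial counts are illustrative assumptions rather than the procedure used in the paper.

```python
import random


def paired_bootstrap_pvalue(diffs, rng, n_resamples=200):
    """Two-sided p-value for 'mean paired difference is zero'."""
    n = len(diffs)
    at_or_below = at_or_above = 0
    for _ in range(n_resamples):
        mean = sum(rng.choices(diffs, k=n)) / n
        at_or_below += mean <= 0
        at_or_above += mean >= 0
    return min(1.0, 2 * min(at_or_below, at_or_above) / n_resamples)


def estimate_power(scores_a, scores_b, sample_size, alpha=0.05, trials=100, seed=0):
    """Fraction of simulated test sets on which the difference is significant."""
    rng = random.Random(seed)
    pairs = list(zip(scores_a, scores_b))
    rejections = 0
    for _ in range(trials):
        # Draw a hypothetical test set of the requested size.
        sample = rng.choices(pairs, k=sample_size)
        diffs = [a - b for a, b in sample]
        if paired_bootstrap_pvalue(diffs, rng) < alpha:
            rejections += 1
    return rejections / trials


# Made-up scores for two systems with a small average gap.
gen = random.Random(1)
system_a = [gen.random() for _ in range(500)]
system_b = [min(1.0, max(0.0, s - 0.03 + gen.gauss(0, 0.2))) for s in system_a]
for size in (50, 200, 500):
    print(size, estimate_power(system_a, system_b, size))
```

Running the loop over increasing sample sizes shows how the estimated power changes as more examples are annotated.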

Summarization system analysis

ACU score has strong correlation with summary length
Average summary length of different systems can vary greatly
Summary length difference not always captured by ROUGE F1 scores
Summary length can be a separate aspect of summary quality in evaluation

Evaluating annotation protocols

Human evaluation protocol design can affect evaluation results and system/metric evaluation
Three protocols used: Prior, Ref-free, and Ref-based
Prior protocol evaluates annotators’ preferences without input document
Ref-free protocol evaluates if summaries cover salient information of input document
Ref-based protocol evaluates if generated summaries accurately cover reference summaries

Annotation collection

Collected three annotations per summary on a 100-example subset of input documents
Annotators judge all of the summaries within a single HIT with a score from 1 to 5
Inter-annotator agreement (Krippendorff’s alpha) is 0.3455, 0.2201, and 0.2741 for the Prior, Ref-free, and Ref-based protocols, respectively
Prior protocol achieves highest agreement despite not being grounded in any other text
Collected additional set of annotations to understand annotation protocols with large language models
Annotated summaries for the protocols, including ACU protocol, for summaries from GPT-3, T0, BRIO, and BART
Reference-free annotations performed before reference-based annotations to avoid potential biases
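
For readers unfamiliar with Krippendorff's alpha, the sketch below computes it from scratch as 1 minus the ratio of observed to expected disagreement. It assumes the interval distance (squared difference) between ratings. The notes do not say which distance was used for the 1 to 5 scores, so that choice and the function name are assumptions.

```python
def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval (squared-difference) distance.

    `units` is a list of lists: each inner list holds the ratings that one
    item (here, one summary) received from different annotators.
    """
    # Items with fewer than two ratings cannot form a pair and are skipped.
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")

    # Observed disagreement: average distance between ratings of the same item.
    observed = sum(
        sum((a - b) ** 2 for i, a in enumerate(u) for j, b in enumerate(u) if i != j)
        / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: average distance between any two ratings overall.
    expected = sum(
        (a - b) ** 2
        for i, a in enumerate(values)
        for j, b in enumerate(values)
        if i != j
    ) / (n * (n - 1))

    if expected == 0:
        raise ValueError("all ratings are identical; alpha is undefined")
    return 1.0 - observed / expected


# Three annotators rate four summaries on a 1-5 scale (made-up numbers).
ratings = [[4, 4, 5], [2, 3, 2], [5, 5, 5], [1, 2, 1]]
print(krippendorff_alpha_interval(ratings))
```

Values near 1 indicate strong agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.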

Results analysis

Investigated correlation of evaluation results of different protocols
Calculated correlation at system and summary level
Used normalized ACU score
Ref-free protocol has strong correlation with Prior protocol
Both Prior and Ref-free protocols have strong correlation with summary length
Ref-free and Ref-based protocols have near-zero correlation at summary level and a negative correlation at system level
Ref-based protocol has strong correlation with ACU protocol
ACU protocol has higher statistical power than Ref-based protocol
GPT-3 receives highest score under Ref-free protocol
Annotators have prior preference for GPT-3 summaries
Annotator-based case study found that an annotator’s own Prior result predicts their Ref-free result better than the Ref-free results of other annotators do

Evaluating automatic metrics

Analyzed correlations and robustness of metrics
Used representative subset of recent metrics
Additional results in Appendix E on 50 automatic metrics

Metrics

ROUGE, METEOR, CHRF, BERTScore, BARTScore, SummaQA, QAEval, and Lite²Pyramid are metrics analyzed in the paper
ROUGE measures n-gram and sequence overlap between generated summary and reference summaries
METEOR aligns unigrams using exact, stem, and synonym matches
CHRF calculates character-based overlap between model and reference summaries
Table 8 shows the correlation between automatic metric scores and ACU scores of system outputs
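
To make the n-gram overlap idea behind ROUGE concrete, here is a simplified ROUGE-N sketch. It is not the official ROUGE toolkit: whitespace tokenization, lowercasing, the absence of stemming, and the function names are all simplifying assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Precision, recall and F1 of clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


print(rouge_n("the cat sat", "the cat sat on the mat", n=1))
```

Recall rewards covering the reference and precision rewards brevity, which is one reason F1 alone can hide differences in summary length.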

Metric evaluation with ACU annotations

Automatic metrics can achieve high correlation with ACU scores, especially at the system level
Automatic metrics perform better at the system level than at the summary level
Metric performance varies across different datasets
Correlation between automatic metrics and human evaluation is lower when system performance is more similar
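
The difference between system-level and summary-level correlation can be sketched as below, using Kendall's tau-b. The data layout (a dict mapping each system to its per-document scores) and the function names are assumptions made for illustration.

```python
import math
from statistics import mean


def kendall_tau_b(x, y):
    """Kendall's tau-b; returns None when one variable is constant."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    denom = math.sqrt((pairs + ties_x) * (pairs + ties_y))
    return (concordant - discordant) / denom if denom else None


def system_level(human, metric):
    """Correlate per-system average scores across systems."""
    systems = sorted(human)
    return kendall_tau_b([mean(human[s]) for s in systems],
                         [mean(metric[s]) for s in systems])


def summary_level(human, metric):
    """Correlate across systems for each document, then average."""
    systems = sorted(human)
    n_docs = len(human[systems[0]])
    taus = []
    for d in range(n_docs):
        tau = kendall_tau_b([human[s][d] for s in systems],
                            [metric[s][d] for s in systems])
        if tau is not None:
            taus.append(tau)
    return mean(taus) if taus else None


# Made-up ACU scores and metric scores for three systems on three documents.
acu = {"A": [0.2, 0.5, 0.4], "B": [0.4, 0.6, 0.3], "C": [0.7, 0.8, 0.6]}
rouge = {"A": [0.3, 0.4, 0.5], "B": [0.35, 0.5, 0.3], "C": [0.6, 0.45, 0.55]}
print(system_level(acu, rouge), summary_level(acu, rouge))
```

Averaging over documents before correlating smooths out per-summary noise, which is consistent with metrics looking stronger at the system level than at the summary level.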

Analysis

Analyze metric evaluation through correlation confidence intervals, statistical power of system comparisons, and difference in correlations
Confidence intervals of system-level correlations with ACU scores calculated using bootstrapping
Confidence intervals are large
Increasing sample size reduces confidence interval and increases statistical power
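
A bootstrap confidence interval for a system-level correlation can be sketched as follows: resample documents with replacement, recompute the system-level correlation on each resample, and read off percentiles. Kendall's tau-a, the percentile method, the data layout, and the function names are assumptions for illustration; the notes only say that bootstrapping was used.

```python
import random


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


def bootstrap_ci(human, metric, n_resamples=1000, level=0.95, seed=0):
    """Percentile interval for the system-level correlation.

    `human` and `metric` map each system name to its per-document scores.
    """
    rng = random.Random(seed)
    systems = sorted(human)
    n_docs = len(human[systems[0]])
    taus = []
    for _ in range(n_resamples):
        # Resample document indices; every system is scored on the same docs.
        docs = rng.choices(range(n_docs), k=n_docs)
        h = [sum(human[s][d] for d in docs) / n_docs for s in systems]
        m = [sum(metric[s][d] for d in docs) / n_docs for s in systems]
        taus.append(kendall_tau(h, m))
    taus.sort()
    lower = taus[int((1 - level) / 2 * n_resamples)]
    upper = taus[min(n_resamples - 1, int((1 + level) / 2 * n_resamples))]
    return lower, upper


# Made-up scores for four systems on thirty documents.
gen = random.Random(3)
human = {s: [gen.random() for _ in range(30)] for s in "ABCD"}
metric = {s: [v + gen.gauss(0, 0.3) for v in human[s]] for s in "ABCD"}
print(bootstrap_ci(human, metric))
```

With few documents or closely matched systems, the ranking of systems flips across resamples and the interval becomes wide.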

Power analysis of metric comparison

Power analysis of metric comparison conducted with 200 metric pairs
Difficult to find significant results when metric performance is similar
Increasing sample size increases chance of finding significant results
Automatic metric performance differs greatly under different human evaluation protocols
Reference-based automatic metrics perform better under reference-based evaluation protocols

Discussion

Different human evaluation protocols can lead to drastically different results
Reference-based automatic metrics should be evaluated by reference-based human evaluation
Reference-free human evaluation protocol does not faithfully reflect the metric’s ability to perform its intended reference-based evaluation
Targeted evaluation of text summarization should be defined by both the source and target texts
Targeted human evaluation can yield more reliable and objective results
Reference summaries in existing datasets can still be useful for the appropriate purpose

Conclusion

We introduce RoSE, a benchmark for summarization evaluation
RoSE allows for more robust summarization evaluation across three datasets and two domains
We re-evaluate the current state of human evaluation and its implications for summarization systems and automatic metrics
Potential biases from the annotators and from the summarization models are noted
Only English-language data is included in the benchmark and analysis
High quality benchmark ensured through spot checks and worker qualifications
Quality of references questioned
ACUs not weighted during aggregation
ACUs written consistently by multiple annotators
MTurk workers recruited with qualifications and paid $12/hour
Summarization models annotated on CNNDM, XSum, and SAMSum
Power analysis conducted with bootstrapping test
Correlations used to analyze similarity between human evaluation protocols and automatic metrics
50 different automatic metrics evaluated with ACU benchmark
Confidence intervals of system-level correlations with ACU scores calculated
Power analysis of metric comparison conducted with Kendall’s correlations and permutation test
Survey of human evaluation practices of 55 selected papers conducted
