Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Human evaluation is the foundation for evaluating summarization systems and automatic metrics.
  • Existing protocols and benchmarks have low inter-annotator agreement or lack scale.
  • Proposed modified summarization salience protocol with fine-grained semantic units.
  • RoSE benchmark with over 22k summary-level annotations.
  • Comparing ACU protocol with other protocols.
  • Evaluating existing automatic metrics using collected human annotations.
  • Implications for evaluating large language models.

Paper Content

Introduction

  • Human evaluation is essential for assessing summarization systems and automatic metrics
  • Inter-annotator agreement can be difficult to achieve
  • Human evaluation studies need a large enough sample size to find statistically significant results
  • ACU protocol proposed to ensure human evaluation is more objective
  • RoSE benchmark created with 22k summary-level annotations
  • Four human evaluation protocols compared and can lead to different model preferences
  • Automatic metrics evaluated across different human evaluation protocols

Human evaluation benchmarks

  • Human judgment data must be collected to analyze summarization progress
  • Recent efforts focus on aggregating model outputs and annotating quality dimensions
  • Bhandari et al. (2020) annotates summaries according to semantic content units
  • Benchmark only covers a single dataset without focus on similarly-performing systems
  • Modified protocol for summarization salience evaluation introduced with higher interannotator agreement
  • System outputs collected from recently-introduced models over three summarization datasets

Summarization meta-evaluation

  • Meta-evaluation of current state of evaluation
  • Analysis of ROUGE and its variations
  • Analysis of broader set of metrics
  • Evaluating zeroshot large language models
  • Comparing approaches to human evaluation
  • Studying annotation protocols for quality dimensions

Atomic content units for summarization evaluation

  • Describes an Atomic Content Unit (ACU) annotation protocol
  • Used for reference-based summary salience evaluation

Preliminaries

  • Salience is a desired summary quality
  • Human evaluation of summary salience can be conducted in either reference-free or reference-based manners
  • Focus on reference-based evaluation for dataset collection

Acu annotation protocol

  • Main goal of ACU protocol is to reduce subjectivity of reference-based summarization human evaluation
  • Inter-annotator agreement is hard to achieve in human evaluation at summary level
  • Difficulty in achieving consensus on comparison between two long text sequences
  • ACU protocol simplifies basic unit of annotation task
  • Annotators only need to decide on presence of single fact in another text sequence
  • Evaluation process decomposed into two steps: extracting facts from one text sequence and checking for presence of extracted facts in another sequence
  • ACUs constructed with one atomic fact and other minimal, necessary information
  • ACU matching annotations can be aggregated into summary-level score
  • Normalized ACU score penalizes summaries longer than reference

Acu annotation collection

  • Collected human annotations according to ACU annotation protocol on 3 summarization datasets
  • Collected and annotated generated summaries of pretrained summarization systems
  • Collected around 21.8k ACU-level annotations and around 22k summary-level annotations
  • Inter-annotator agreement score of 0.7571 for summary-level annotations and 0.7528 for ACU-level annotations
  • Higher agreement scores than prior collections

Power analysis

  • Statistical power is the probability that the null hypothesis of a statistical significance test is rejected given there is a real effect.
  • Statistical power depends on the number of test examples and the observed system difference.
  • High statistical power is difficult to reach when the system performance is similar.
  • Increasing the sample size can effectively raise the statistical power.
  • The dataset can provide more stable summarization system evaluation thanks to its higher statistical power.

Summarization system analysis

  • ACU score has strong correlation with summary length
  • Average summary length of different systems can vary greatly
  • Summary length difference not always captured by ROUGE F1 scores
  • Summary length can be a separate aspect of summary quality in evaluation

Evaluating annotation protocols

  • Human evaluation protocol design can affect evaluation results and system/metric evaluation
  • Three protocols used: Prior, Ref-free, and Ref-based
  • Prior protocol evaluates annotators’ preferences without input document
  • Ref-free protocol evaluates if summaries cover salient information of input document
  • Ref-based protocol evaluates if generated summaries accurately cover reference summaries

Annotation collection

  • Collected three annotations per summary on a 100-example subset of input documents
  • Annotators judge all of the summaries within a single HIT with a score from 1 to 5
  • Inter-annotator agreement measured by Krippendorff’s alpha for the protocols of 0.3455, 0.2201, 0.2741
  • Prior protocol achieves highest agreement despite not being grounded in any other text
  • Collected additional set of annotations to understand annotation protocols with large language models
  • Annotated summaries for the protocols, including ACU protocol, for summaries from GPT-3, 11 T0, BRIO, and BART
  • Reference-free annotations performed before reference-based annotations to avoid potential biases

Results analysis

  • Investigated correlation of evaluation results of different protocols
  • Calculated correlation at system and summary level
  • Used normalized ACU score
  • Ref-free protocol has strong correlation with Prior protocol
  • Both Prior and Ref-free protocols have strong correlation with summary length
  • Ref-free and Ref-based protocols have near-zero correlation at summary level and negative one at system level
  • Ref-based protocol has strong correlation with ACU protocol
  • ACU protocol has higher statistical power than Ref-based protocol
  • GPT-3 receives highest score under Ref-free protocol
  • Annotators have prior preference for GPT-3 summaries
  • Annotator-based case study found annotator’s own Prior result is better prediction of their Ref-free result than Ref-free result of other annotators

Evaluating automatic metrics

  • Analyzed correlations and robustness of metrics
  • Used representative subset of recent metrics
  • Additional results in Appendix E on 50 automatic metrics

Metrics

  • ROUGE, METEOR, CHRF, BERTScore, BARTScore, SummaQA, QAEval, and Lite 2 Pyramid are metrics analyzed in the paper
  • ROUGE measures n-gram and sequence overlap between generated summary and reference summaries
  • METEOR aligns unigrams stemming and synonym matches
  • CHRF calculates character-based overlap between model and reference summaries
  • Table 8 shows the correlation between automatic metric scores and ACU scores of system outputs

Metric evaluation with acu annotations

  • Automatic metrics can achieve high correlation with ACU scores, especially at the system level
  • Automatic metrics perform better at the system level than at the summary level
  • Metric performance varies across different datasets
  • Correlation between automatic metrics and human evaluation is lower when system performance is more similar

Analysis

  • Analyze metric evaluation through correlation confidence intervals, statistical power of system comparisons, and difference in correlations
  • Confidence intervals of system-level correlations with ACU scores calculated using bootstrapping
  • Confidence intervals are large
  • Increasing sample size reduces confidence interval and increases statistical power

Power analysis of metric comparison

  • Power analysis of metric comparison conducted with 200 metric pairs
  • Difficult to find significant results when metric performance is similar
  • Increasing sample size increases chance of finding significant results
  • Automatic metric performance differs greatly under different human evaluation protocols
  • Reference-based automatic metrics perform better under reference-based evaluation protocols

Discussion

  • Different human evaluation protocols can lead to drastically different results
  • Reference-based automatic metrics should be evaluated by reference-based human evaluation
  • Reference-free human evaluation protocol does not faithfully reflect the metric’s ability to perform its intended reference-based evaluation
  • Targeted evaluation of text summarization should be defined by both the source and target texts
  • Targeted human evaluation can yield more reliable and objective results
  • Reference summaries in existing datasets can still be useful for the appropriate purpose

Conclusion

  • We introduce RoSE, a benchmark for summarization evaluation
  • RoSE allows for more robust summarization evaluation across three datasets and two domains
  • We re-evaluate the current state of human evaluation and its implications for summarization systems and automatic metrics
  • Potential data biases in the data annotator and data models are noted
  • English-language data only included in benchmark and analysis
  • High quality benchmark ensured through spot checks and worker qualifications
  • Quality of references questioned
  • ACUs not weighted during aggregation
  • ACUs written consistently by multiple annotators
  • MTurk workers recruited with qualifications and paid $12/hour
  • Summarization models annotated on CNNDM, XSum, and SamSum
  • Power analysis conducted with bootstrapping test
  • Correlations used to analyze similarity between human evaluation protocols and automatic metrics
  • 50 different automatic metrics evaluated with ACU benchmark
  • Confidence intervals of system-level correlations with ACU scores calculated
  • Power analysis of metric comparison conducted with Kendall’s correlations and permutation test
  • Survey of human evaluation practices of 55 selected papers conducted