Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • LLMs have shown promise for automatic summarization
  • Instruction tuning is the key to LLM’s zero-shot summarization capability
  • Existing studies have been limited by low-quality references
  • Human evaluation over high-quality summaries from freelance writers shows LLM summaries are on par with human written summaries

Paper Content

Introduction

  • Large language models (LLMs) have shown promising results in zero-/few-shot tasks across a wide range of domains
  • LLMs have potential for automatic summarization
  • Design decisions contributing to success on summarization remain poorly understood
  • Evaluation of 10 diverse LLMs with human evaluation on news summarization
  • Instruction tuning is key to zero-shot summarization capability
  • Self-supervised learning alone cannot induce strong summarization performance in the zero-shot setting
  • Poor quality reference summaries reduce correlation between metric results and human judgement
  • Recruit freelance writers to re-annotate 100 articles from the test set of CNN/DM and XSUM
  • Best performing LLM is rated as comparable to freelance writers
  • Instruction tuning, not model scale, is key to LLMs’ summarization capability
  • Poor quality of training data makes comparison difficult

News summarization

  • News summarization is the task of producing a concise paragraph that captures the main points of a news article.
  • Two popular news summarization benchmarks are CNN/DM and XSUM.
  • Reference summaries in these datasets are known to have quality issues.

Large language models

  • LLMs have larger scale and don’t require finetuning
  • Instruction-tuning is an effective way to improve LLM prompting performance
  • Goyal and Durrett (2020) showed instruct-tuned GPT-3 Davinci model is better than finetuned LMs
  • Comprehensive benchmark of 10 different LLMs to understand effect of model scale, incontext learning and instruction tuning

Human evaluation on news summarization benchmarks

  • Human evaluation is used to benchmark a set of 10 LLMs on news summarization.
  • Instruction tuning is important for strong summarization capability.
  • Low-quality reference summaries may underestimate few-shot or finetuning performance.

Experimental setup

  • Recruited 6 writers with experience in writing blog posts, landing page introductions, or product descriptions from Upwork
  • Selected writers based on faithfulness, coherence, and relevance of their summaries
  • Estimated time to summarize a CNN/DM or XSUM article is 12-15 minutes, paid writers $4 per article
  • Instructed writers to summarize each article in around 50 words
  • Modified zero-shot prompt to elicit summaries that are around 50 words
  • Evaluated a random subset of 100 summaries using same annotation scheme
  • Evaluated 10 LLMs across different pretraining strategies and model scales
  • Evaluated two state-of-the-art fine-tuned LMs: Pegasus and BRIO
  • Evaluated existing reference summaries in CNN/DM and XSUM validation sets
  • Annotators evaluated summaries based on faithfulness, coherence, and relevance

Evaluation results

  • Instruction tuned models have strong summarization ability
  • Zero-shot instruction-tuned GPT-3 models perform best overall
  • Instruction tuning is more important than scale
  • Non-instruction-tuned LLMs can improve performance with incontext learning
  • Reference summaries in current benchmarks are low quality
  • Most automatic summarization systems score better than reference summaries
  • Instruction-tuned LLMs are better than non-instruction-tuned LLMs
  • Reference summaries are not representative of human performance

Understanding automatic metrics

  • Six popular automatic metrics are evaluated
  • Rouge-L has a 0.72 Kendall’s tau correlation coefficient with relevance on CNN/DM
  • Rouge-L prefers finetuned LMs to LLMs
  • Reference-based metrics correlate better with human judgments when reference summaries have better scores

Comparing the best llm to freelance writers

  • Low-quality reference summaries make studying and benchmarking LLMs difficult
  • Recruited Upwork freelance writers to collect better quality summaries
  • Aim to answer two questions: 1) Has best LLM reached human-level performance? 2) How well do reference-based metrics correlate with human judgments?

Paired comparison between llm and freelance writers

  • LLM summaries and freelance writer summaries have distinctive styles
  • Coverage and density for Instruct Davinci summaries are higher than freelance writer summaries
  • Cut and paste operations used to compare stylistic differences
  • Human evaluation to compare LLM and freelance writer summaries
  • Annotators equally prefer freelance writer and Instruct Davinci summaries
  • Annotators rate more abstractive summaries as more informative 51.1% of the time
  • Interannotator agreement is low
  • Annotator 1 prefers Instruct Davinci 57% of the time

Reevaluating reference-based metrics

  • Performance of automated metrics depends on quality of reference summaries
  • Initial study conducted on effect of using better quality summaries
  • Rouge-L used to evaluate faithfulness on XSUM dataset
  • Existing reference summaries highly unfaithful
  • System-level Rouge-L negatively correlated with human ratings
  • Using better reference summaries leads to more positive correlation

Discussion

  • Instruction tuning contributes most to LLMs’ summarization capability
  • Quality of summarization data used in instruction tuning is important
  • Learning algorithm used for instruction tuning can be important
  • Multi-task learning can be important
  • Human evaluation is limited by reference quality and subjectivity

Conclusion

  • Human evaluation of 10 LLMs across two popular news summarization benchmarks
  • Instruction tuning is key factor for success
  • Reference quality is important for model development and evaluation
  • Significant individual variation from annotator pool
  • Human written summaries contain more lexical paraphrasing and sentence reduction