Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

LLMs have shown promise for automatic summarization
Instruction tuning is the key to LLMs’ zero-shot summarization capability
Existing studies have been limited by low-quality references
Human evaluation over high-quality summaries from freelance writers shows LLM summaries are on par with human-written summaries

Paper Content

Introduction

Large language models (LLMs) have shown promising results in zero-/few-shot tasks across a wide range of domains
LLMs have potential for automatic summarization
Design decisions contributing to success on summarization remain poorly understood
Human evaluation of 10 diverse LLMs on news summarization
Instruction tuning is key to zero-shot summarization capability
Self-supervised learning alone cannot induce strong summarization performance in the zero-shot setting
Poor quality reference summaries reduce correlation between metric results and human judgement
Recruit freelance writers to re-annotate 100 articles from the test set of CNN/DM and XSUM
Best performing LLM is rated as comparable to freelance writers
Instruction tuning, not model scale, is key to LLMs’ summarization capability
Poor quality of training data makes comparison difficult

News summarization

News summarization is the task of producing a concise paragraph that captures the main points of a news article.
Two popular news summarization benchmarks are CNN/DM and XSUM.
Reference summaries in these datasets are known to have quality issues.

Large language models

LLMs are larger in scale than finetuned LMs and can be prompted without task-specific finetuning
Instruction tuning is an effective way to improve LLM prompting performance
Goyal and Durrett (2020) showed that the instruction-tuned GPT-3 Davinci model is better than finetuned LMs
Comprehensive benchmark of 10 different LLMs to understand the effect of model scale, in-context learning, and instruction tuning

Human evaluation on news summarization benchmarks

Human evaluation is used to benchmark a set of 10 LLMs on news summarization.
Instruction tuning is important for strong summarization capability.
Low-quality reference summaries may cause few-shot or finetuning performance to be underestimated.

Experimental setup

Recruited 6 writers with experience in writing blog posts, landing page introductions, or product descriptions from Upwork
Selected writers based on faithfulness, coherence, and relevance of their summaries
Estimated 12-15 minutes to summarize a CNN/DM or XSUM article and paid writers $4 per article
Instructed writers to summarize each article in around 50 words
Modified zero-shot prompt to elicit summaries that are around 50 words
Evaluated a random subset of 100 summaries using same annotation scheme
Evaluated 10 LLMs across different pretraining strategies and model scales
Evaluated two state-of-the-art fine-tuned LMs: Pegasus and BRIO
Evaluated existing reference summaries in CNN/DM and XSUM validation sets
Annotators evaluated summaries based on faithfulness, coherence, and relevance

Evaluation results

Instruction tuned models have strong summarization ability
Zero-shot instruction-tuned GPT-3 models perform best overall
Instruction tuning is more important than scale
Non-instruction-tuned LLMs can improve performance with in-context learning
Reference summaries in current benchmarks are low quality
Most automatic summarization systems score better than reference summaries
Instruction-tuned LLMs are better than non-instruction-tuned LLMs
Reference summaries are not representative of human performance

Understanding automatic metrics

Six popular automatic metrics are evaluated
Rouge-L has a 0.72 Kendall’s tau correlation coefficient with relevance on CNN/DM
Rouge-L prefers finetuned LMs to LLMs
Reference-based metrics correlate better with human judgments when reference summaries have better scores
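
To make the reference-based scoring above concrete, here is a minimal sketch of a sentence-level Rouge-L F1 score built on the longest common subsequence (LCS) between a candidate summary and a reference. It assumes lowercased whitespace tokenization and no stemming, so its numbers will differ from those of standard Rouge toolkits; the function names are illustrative and not taken from the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified Rouge-L F1 (assumption: whitespace tokens, no stemming)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)
```

Because the score depends entirely on overlap with the reference, a low-quality reference lowers the score of a good summary that phrases things differently.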

Comparing the best LLM to freelance writers

Low-quality reference summaries make studying and benchmarking LLMs difficult
Recruited Upwork freelance writers to collect better quality summaries
Aim to answer two questions: 1) Has best LLM reached human-level performance? 2) How well do reference-based metrics correlate with human judgments?

Paired comparison between LLM and freelance writers

LLM summaries and freelance writer summaries have distinctive styles
Coverage and density for Instruct Davinci summaries are higher than for freelance writer summaries
Cut and paste operations used to compare stylistic differences
Human evaluation to compare LLM and freelance writer summaries
Annotators equally prefer freelance writer and Instruct Davinci summaries
Annotators rate more abstractive summaries as more informative 51.1% of the time
Inter-annotator agreement is low
Annotator 1 prefers Instruct Davinci 57% of the time
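
The coverage and density statistics mentioned above can be sketched as follows. This assumes the commonly used extractive-fragment definitions: greedily match the longest token spans a summary shares with its article, then take coverage as the fraction of summary tokens inside such spans and density as the average squared span length. Whitespace tokenization and the function names are assumptions for illustration, not the paper's code.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily collect the longest article spans reused by the summary."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0  # longest match starting at summary position i
        j = 0
        while j < len(article_tokens):
            if summary_tokens[i] == article_tokens[j]:
                k = 0
                while (i + k < len(summary_tokens)
                       and j + k < len(article_tokens)
                       and summary_tokens[i + k] == article_tokens[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
        i += max(best, 1)  # skip past the fragment, or one unmatched token
    return fragments


def coverage_and_density(article, summary):
    """Return (coverage, density) for a summary of an article."""
    art = article.lower().split()
    summ = summary.lower().split()
    if not summ:
        return 0.0, 0.0
    frags = extractive_fragments(art, summ)
    coverage = sum(len(f) for f in frags) / len(summ)
    density = sum(len(f) ** 2 for f in frags) / len(summ)
    return coverage, density
```

Higher values indicate a more extractive summary: high coverage means most words come from the article, and high density means they are copied in long contiguous runs.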

Reevaluating reference-based metrics

Performance of automated metrics depends on quality of reference summaries
Initial study conducted on effect of using better quality summaries
Rouge-L used to evaluate faithfulness on XSUM dataset
Existing reference summaries are highly unfaithful
System-level Rouge-L negatively correlated with human ratings
Using better reference summaries leads to more positive correlation
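
A system-level correlation of the kind described above can be sketched with Kendall's tau: each system gets one average metric score and one average human rating, and tau measures how consistently the two rankings agree. The version below is the simple tau-a variant without tie correction, written in pure Python; the function name is illustrative and the paper's exact computation may differ.

```python
from itertools import combinations


def kendall_tau(metric_scores, human_scores):
    """Kendall's tau-a between two equal-length lists of system scores."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two equal-length lists with at least 2 systems")
    concordant = 0
    discordant = 0
    pairs = combinations(zip(metric_scores, human_scores), 2)
    for (m1, h1), (m2, h2) in pairs:
        direction = (m1 - m2) * (h1 - h2)
        if direction > 0:
            concordant += 1  # metric and humans rank the pair the same way
        elif direction < 0:
            discordant += 1  # metric and humans disagree on the pair
        # ties contribute to neither count in tau-a
    n = len(metric_scores)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A value near 1 means the metric ranks systems the way annotators do, and a negative value means the metric tends to prefer the systems that annotators rate lower, which is the situation reported for Rouge-L with the existing XSUM references.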

Discussion

Instruction tuning contributes most to LLMs’ summarization capability
Quality of summarization data used in instruction tuning is important
Learning algorithm used for instruction tuning can be important
Multi-task learning can be important
Human evaluation is limited by reference quality and subjectivity

Conclusion

Human evaluation of 10 LLMs across two popular news summarization benchmarks
Instruction tuning is key factor for success
Reference quality is important for model development and evaluation
Significant individual variation from annotator pool
Human-written summaries contain more lexical paraphrasing and sentence reduction
