Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Study of how decoding strategies affect faithfulness in abstractive summarization
Beam search with large beam sizes produces most faithful summaries, nucleus sampling produces least faithful
Two faithfulness-aware generation methods proposed to improve faithfulness
Distillation approach allows model to generate faithful summaries with greedy decoding

Paper Content

Introduction

Recent developments in large pre-trained language models have achieved remarkable performance on abstractive summarization
Problem of hallucinations, where generated summary contains facts not present in original document
Prior research has analyzed hallucinations and defined a typology of potential error types
Effect of decoding strategies on faithfulness of abstractive summarization is less understood
Analysis of popular decoding strategies (greedy, beam, nucleus sampling) on two datasets
Beam search provides most faithful summaries
Randomness introduced by sampling hurts faithfulness
Two faithfulness-aware decoding methods proposed
Distillation approach to generate faithful summaries with greedy decoding

Faithfulness behavior of popular decoding strategies

Investigated effect of decoding strategies on faithfulness
Investigated whether better exploration of search space can improve faithfulness
Investigated how randomness impacts faithfulness
Explored three common decoding strategies: greedy, beam search, and nucleus sampling

Faithfulness-aware decoding strategies

Hypothesize that current decoding methods may not explore paths that focus on faithfulness effectively
Propose two faithfulness-aware methods to modify how the space is explored
Method 1: Ranking makes use of beam search and picks the most faithful path
Method 2: Lookahead guides the search process by adding faithfulness heuristics

Ranking with faithfulness metrics

Beam search explores many suitable candidates during the decoding process
We propose to rerank the generated candidates according to faithfulness metrics
Falke et al. (2019) used NLI models to rerank, but it increased the number of unfaithful summaries
We explore using faithfulness metrics directly for ranking
We use a composite metric to aggregate the vote of several popular metrics
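
The reranking idea above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the composite metric is described as trained, whereas this sketch substitutes a plain average over whatever metric functions are supplied. The names `rerank` and `token_overlap` are hypothetical, and `token_overlap` is only a toy stand-in for a real faithfulness metric.

```python
from statistics import mean
from typing import Callable, Dict, List

# A metric maps (document, summary) to a score where higher means more faithful.
Metric = Callable[[str, str], float]


def token_overlap(document: str, summary: str) -> float:
    """Toy stand-in metric: fraction of summary tokens that occur in the document."""
    doc_tokens = set(document.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    return sum(tok in doc_tokens for tok in summary_tokens) / len(summary_tokens)


def rerank(document: str, candidates: List[str], metrics: Dict[str, Metric]) -> List[str]:
    """Sort beam candidates from most to least faithful.

    Assumption: the composite score is an unweighted mean of the metric scores.
    """
    def composite(summary: str) -> float:
        return mean(metric(document, summary) for metric in metrics.values())

    return sorted(candidates, key=composite, reverse=True)


document = "the cat sat on the mat"
beam_candidates = ["the dog sat on the mat", "the cat sat on the mat"]
ranked = rerank(document, beam_candidates, {"overlap": token_overlap})
best = ranked[0]
```

In practice the dictionary would hold several metrics, so that no single metric decides the ranking alone.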

Lookahead

Model score is used to select summary tokens
Reference-free faithfulness evaluation function assigns a score to the summary
Weight and number of tokens to look into the future are also taken into account
Number of summaries generated depends on the decoding strategy used
Full summary is generated instead of partial summary
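
One way to read the lookahead description is as a token score that adds a weighted faithfulness estimate of the completed summary to the model score. The sketch below assumes that form. Taking the maximum over the rolled-out futures, the `rollout` callable, and the toy overlap metric are all assumptions made for illustration and are not confirmed by this summary.

```python
from typing import Callable, Dict, List


def token_overlap(document: str, summary: str) -> float:
    """Toy faithfulness function: fraction of summary tokens found in the document."""
    doc_tokens = set(document.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    return sum(tok in doc_tokens for tok in summary_tokens) / len(summary_tokens)


def lookahead_score(logprob: float, futures: List[str], document: str,
                    faithfulness: Callable[[str, str], float], weight: float) -> float:
    """Model score plus weighted faithfulness of the best future summary (assumed form)."""
    return logprob + weight * max(faithfulness(document, s) for s in futures)


def pick_next_token(prefix: List[str], token_logprobs: Dict[str, float],
                    rollout: Callable[[List[str]], List[str]], document: str,
                    faithfulness: Callable[[str, str], float], weight: float) -> str:
    """Greedy step: choose the token whose lookahead score is highest."""
    scores = {}
    for token, logprob in token_logprobs.items():
        # rollout returns full summaries that continue from prefix + token
        futures = rollout(prefix + [token])
        scores[token] = lookahead_score(logprob, futures, document, faithfulness, weight)
    return max(scores, key=scores.get)


document = "paris is the capital of france"
prefix = ["paris", "is", "the", "capital", "of"]
token_logprobs = {"germany": -0.5, "france": -0.9}


def toy_rollout(tokens: List[str]) -> List[str]:
    # The toy prefix is already one token short of complete, so the "future" is the prefix itself.
    return [" ".join(tokens)]


without_lookahead = pick_next_token(prefix, token_logprobs, toy_rollout, document, token_overlap, 0.0)
with_lookahead = pick_next_token(prefix, token_logprobs, toy_rollout, document, token_overlap, 5.0)
```

With a weight of zero the step reduces to ordinary greedy decoding; a larger weight lets the faithfulness estimate override the model's preference.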

Combining ranking and lookahead

Combining two methods to improve faithfulness
Use BEAM+LOOKAHEAD to generate beam candidates
Select best candidates with ranking

Efficient decoding via distillation

Proposed decoding methods have heavy computational cost
Explore distillation to transfer the knowledge of faithfulness-aware decoding to a student model
Distillation aims to improve decoding time, not model size
New decoding distillation loss proposed
Iterative distillation process proposed
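
The iterative process can be pictured as a pseudo-labeling loop. This is a rough sketch only: the paper's distillation loss is not given in this summary, so training is hidden behind a hypothetical `fine_tune` callable, and reusing each student as the next round's teacher is an assumption. `faithful_decode` stands for any expensive faithfulness-aware decoding routine.

```python
from typing import Any, Callable, List, Tuple


def iterative_distillation(
    documents: List[str],
    teacher: Any,
    faithful_decode: Callable[[Any, str], str],
    fine_tune: Callable[[Any, List[Tuple[str, str]]], Any],
    iterations: int,
) -> Any:
    """Return a student trained on summaries produced by faithfulness-aware decoding.

    Assumption: the student of one round serves as the teacher of the next.
    """
    model = teacher
    for _ in range(iterations):
        # Expensive step: decode pseudo-label summaries with the slow, faithful method.
        pseudo_labels = [(doc, faithful_decode(model, doc)) for doc in documents]
        # Cheap-at-inference step: train a model to reproduce those summaries directly.
        model = fine_tune(model, pseudo_labels)
    return model


# Toy stand-ins so the loop can be run end to end: a "model" is just a counter.
training_log: List[List[Tuple[str, str]]] = []


def toy_decode(model: int, doc: str) -> str:
    return f"{doc}:summary-by-{model}"


def toy_fine_tune(model: int, pairs: List[Tuple[str, str]]) -> int:
    training_log.append(list(pairs))
    return model + 1


final_model = iterative_distillation(["d1", "d2"], 0, toy_decode, toy_fine_tune, 2)
```

The returned model is then decoded greedily, which is where the speedup comes from.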

Experiments

Datasets and models

Performed experiments on two abstractive summarization datasets, CNN/DM and XSum
Used BART-large checkpoint for the two datasets
Experiment also done with PEGASUS

Evaluation metrics

Used F1 measure of ROUGE-L and BERTScore to evaluate summary quality
Used BS-Fact, FactCC, DAE, and QuestEval for faithfulness evaluation
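
For reference, the F1 form of ROUGE-L is built on the longest common subsequence (LCS) between reference and candidate tokens. The sketch below uses whitespace tokenization and lowercasing only; standard ROUGE tooling applies its own tokenization and optional stemming, so scores from this sketch will not exactly match published numbers. The function names are illustrative.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
```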

Human evaluation setup

Evaluated faithfulness and informativeness of summaries using Amazon Mechanical Turk
Faithfulness judged using 3-star rating system
Informativeness judged using best-worst-scaling method
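
Best-worst scaling judgments are commonly turned into a per-system score by counting how often a system is picked as best minus how often it is picked as worst, divided by how often it was shown. The sketch below assumes that counting scheme; this summary does not state the exact scoring the authors used, and the system labels are made up.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# One judgment: (systems shown to the annotator, system chosen as best, system chosen as worst)
Judgment = Tuple[Sequence[str], str, str]


def best_worst_scores(judgments: List[Judgment]) -> Dict[str, float]:
    """Score each system in [-1, 1] as (best count - worst count) / times shown."""
    best, worst, shown = Counter(), Counter(), Counter()
    for systems, chosen_best, chosen_worst in judgments:
        shown.update(systems)
        best[chosen_best] += 1
        worst[chosen_worst] += 1
    return {system: (best[system] - worst[system]) / shown[system] for system in shown}


judgments = [
    (("A", "B", "C"), "A", "C"),
    (("A", "B", "C"), "A", "B"),
]
scores = best_worst_scores(judgments)
```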

Decoding setting details

Basic decoding methods compared using greedy search, beam search (k = 10), and nucleus sampling (p = 0.9).
Composite metric used to rank candidates, trained using FactCC, BS-Fact, DAE, and QuestEval.
Lookahead used BS-Fact as faithfulness metric, applied to both greedy and beam searches.
Distillation used a checkpoint decoded with the two proposed faithfulness-aware methods as the teacher model; the student model was initialized from BART-LARGE.
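
To make the difference between the basic strategies concrete, the sketch below contrasts a greedy pick with nucleus (top-p) sampling over a single next-token distribution. It works on a plain dictionary of probabilities rather than a real model, and the function names are illustrative.

```python
import random
from typing import Dict


def greedy_pick(probs: Dict[str, float]) -> str:
    """Greedy decoding step: always take the most probable token."""
    return max(probs, key=probs.get)


def nucleus_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative mass reaches p, then renormalize."""
    kept: Dict[str, float] = {}
    total = 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return {token: prob / total for token, prob in kept.items()}


def nucleus_sample(probs: Dict[str, float], p: float, rng: random.Random) -> str:
    """Nucleus sampling step: sample from the renormalized top-p set."""
    filtered = nucleus_filter(probs, p)
    return rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]


next_token_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
nucleus = nucleus_filter(next_token_probs, p=0.9)
sampled = nucleus_sample(next_token_probs, p=0.9, rng=random.Random(0))
```

Greedy decoding is deterministic, while nucleus sampling can return any token in the kept set, which is the source of randomness the analysis links to lower faithfulness.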

Baseline decoding results

Common decoding strategies are analyzed in 2022
Beam search is used to explore if larger beam sizes result in more faithful summaries
Increasing the beam size improves all faithfulness scores
Reranking strategy has potential to output more faithful summaries

Faithfulness-aware decoding results

Faithfulness-aware methods are compared to traditional decoding methods
Lookahead improves faithfulness
Base decoding strategy is still the dominating factor
Combination of lookahead and ranking can further improve faithfulness
Applying faithful decoding methods decreases ROUGE score

Human evaluation results

Human evaluation results show that our proposed decoding strategies generate more faithful summaries than baseline decoding methods.
Our proposed methods reduce the percentage of summaries with major factual errors.
Our proposed methods generate the most informative summaries according to human evaluation.

Abstractiveness

Models can become more faithful by becoming more extractive
Experiments conducted on XSum to measure faithfulness and abstractiveness
More faithful models tend to be more extractive
Lookahead method allows balancing of faithfulness and abstractiveness

Distillation

Student models approach the performance of teacher models
Student models generate more faithful summaries than greedy search baseline
Decoding speed improved by 40%
Student model improves DAE by 6.6 points and QuestEval by 1 point compared to greedy search baseline
Student model outperforms teacher model on faithfulness metrics with two iterations

Related work

Decoding methods are used to select the best tokens to form a hypothesis
Different decoding strategies have been analyzed for natural language generation
Distillation is used to compress knowledge from a larger model into a smaller one
Pseudo-labeling is used to reduce computational cost during decoding

Conclusion

Popular decoding strategies analyzed for effect on faithfulness for abstractive summarization
Two newly proposed faithfulness-aware decoding strategies, ranking and lookahead, can improve faithfulness
Distillation trick can be used to improve decoding speed
Human evaluation on informativeness with 100% accuracy
Krippendorff alpha for CNN/DM and XSum annotation is 0.22 and 0.32 respectively
Beam search performs best in terms of faithfulness except for FactCC on XSum dataset
Ranking and lookahead improve faithfulness when combined
Composite metric is robust for another domain
Lookahead heuristic prevents score from dipping
Optimizing for one metric will lead to improvement in other faithfulness metrics
Composite metric is able to achieve a similarly good score for all faithfulness metrics
Lookahead heuristic improves search space
Gain in faithfulness outweighs decrease in abstractiveness
Human faithfulness score shows BEAM+LOOKAHEAD+ABSTR achieves highest MINT score
Average of faithfulness metrics shows BEAM+RANKING achieves highest score
