Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Study of how decoding strategies affect faithfulness in abstractive summarization
  • Beam search with large beam sizes produces the most faithful summaries; nucleus sampling produces the least faithful
  • Two faithfulness-aware generation methods proposed to improve faithfulness
  • Distillation approach allows model to generate faithful summaries with greedy decoding

Paper Content

Introduction

  • Recent developments in large pre-trained language models have achieved remarkable performance on abstractive summarization
  • Problem of hallucinations, where generated summary contains facts not present in original document
  • Prior research has analyzed hallucinations and defined a typology of potential error types
  • Effect of decoding strategies on faithfulness of abstractive summarization is less understood
  • Analysis of popular decoding strategies (greedy, beam, nucleus sampling) on two datasets
  • Beam search provides most faithful summaries
  • Randomness introduced by sampling hurts faithfulness
  • Two faithfulness-aware decoding methods proposed
  • Distillation approach to generate faithful summaries with greedy decoding
  • Investigated effect of decoding strategies on faithfulness
  • Investigated whether better exploration of search space can improve faithfulness
  • Investigated how randomness impacts faithfulness
  • Explored three common decoding strategies: greedy, beam search, and nucleus sampling
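
To make the role of randomness concrete, here is a minimal sketch of greedy selection versus nucleus (top-p) sampling on a single decoding step. The toy distribution, the token names, and the helper names are assumptions made for illustration and do not come from the paper.

```python
import random

def greedy_pick(probs):
    """Return the token with the highest probability."""
    return max(probs, key=probs.get)

def nucleus_filter(probs, p=0.9):
    """Keep the smallest set of top tokens whose cumulative mass reaches p, renormalized."""
    kept, total = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return {token: prob / total for token, prob in kept.items()}

def nucleus_pick(probs, p=0.9, rng=random):
    """Sample one token from the renormalized nucleus."""
    nucleus = nucleus_filter(probs, p)
    tokens = list(nucleus)
    weights = [nucleus[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy next-token distribution (made-up numbers, not model output).
next_token_probs = {"Paris": 0.6, "London": 0.25, "Berlin": 0.1, "Mars": 0.05}

print(greedy_pick(next_token_probs))
print(nucleus_pick(next_token_probs, p=0.9, rng=random.Random(0)))
```

Greedy always returns the single most likely token, while nucleus sampling can return any token inside the nucleus, including lower-probability ones. That extra variance is the randomness the bullets above refer to.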

Faithfulness-aware decoding strategies

  • Hypothesize that current decoding methods may not explore paths that focus on faithfulness effectively
  • Propose two faithfulness-aware methods to modify how the space is explored
  • Method 1: Ranking makes use of beam search and picks the most faithful path
  • Method 2: Lookahead guides the search process by adding faithfulness heuristics

Ranking with faithfulness metrics

  • Beam search explores many suitable candidates during the decoding process
  • We propose to rerank the generated candidates according to faithfulness metrics
  • Falke et al. (2019) used NLI models to rerank, but it increased the number of unfaithful summaries
  • We explore using faithfulness metrics directly for ranking
  • We use a composite metric to aggregate the vote of several popular metrics
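
The reranking idea can be sketched as follows. This is only an illustration: the paper's composite metric is trained, whereas this sketch assumes a plain weighted average of per-metric scores that are already scaled to [0, 1]. The candidate summaries and scores are made up.

```python
def composite_score(metric_scores, weights=None):
    """Weighted average of per-metric faithfulness scores (assumed to lie in [0, 1])."""
    if weights is None:
        weights = {name: 1.0 for name in metric_scores}
    total_weight = sum(weights[name] for name in metric_scores)
    return sum(weights[name] * score for name, score in metric_scores.items()) / total_weight

def rerank(candidates):
    """Sort beam candidates from most to least faithful by composite score."""
    return sorted(candidates, key=lambda c: composite_score(c["metrics"]), reverse=True)

# Hypothetical beam candidates with made-up metric scores.
beam_candidates = [
    {"summary": "candidate A",
     "metrics": {"BS-Fact": 0.90, "FactCC": 0.20, "DAE": 0.50, "QuestEval": 0.40}},
    {"summary": "candidate B",
     "metrics": {"BS-Fact": 0.88, "FactCC": 0.90, "DAE": 0.80, "QuestEval": 0.45}},
]

best = rerank(beam_candidates)[0]
print(best["summary"])
```

Candidate A scores highest on one metric but poorly on the others, so an aggregate prefers candidate B. This is the motivation for combining several metrics rather than ranking by a single one.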

Lookahead

  • Model score is used to select summary tokens
  • Reference-free faithfulness evaluation function assigns a score to the summary
  • Weight and number of tokens to look into the future are also taken into account
  • Number of summaries generated depends on the decoding strategy used
  • Full summary is generated instead of partial summary
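
A minimal sketch of the scoring described above: each candidate next token is scored by the model log-probability plus a weighted faithfulness estimate of the summary rolled out from that token. The rolled-out summaries are supplied by hand here, and `token_overlap` is a toy stand-in for a reference-free metric such as BS-Fact. Both are assumptions for illustration, not the paper's implementation.

```python
import math

def token_overlap(summary, source):
    """Toy faithfulness stand-in: share of summary tokens that appear in the source."""
    summary_tokens = summary.lower().split()
    source_tokens = set(source.lower().split())
    if not summary_tokens:
        return 0.0
    return sum(tok in source_tokens for tok in summary_tokens) / len(summary_tokens)

def lookahead_score(log_prob, future_summary, source, faithfulness_fn, weight=1.0):
    """Model log-probability plus a weighted faithfulness estimate of the rolled-out summary."""
    return log_prob + weight * faithfulness_fn(future_summary, source)

def pick_next_token(candidates, source, weight=1.0):
    """candidates maps each next token to (log_prob, rolled-out future summary)."""
    return max(
        candidates,
        key=lambda tok: lookahead_score(
            candidates[tok][0], candidates[tok][1], source, token_overlap, weight
        ),
    )

source = "the council approved the budget on monday"
# Hypothetical candidates: the model slightly prefers the unfaithful token.
candidates = {
    "rejected": (math.log(0.55), "council rejected budget"),
    "approved": (math.log(0.45), "council approved budget"),
}

print(pick_next_token(candidates, source, weight=0.0))
print(pick_next_token(candidates, source, weight=1.0))
```

With the weight set to zero the choice falls back to the model score alone. Raising the weight lets the faithfulness estimate of the future summary override a small gap in model probability.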

Combining ranking and lookahead

  • Combining two methods to improve faithfulness
  • Use BEAM+LOOKAHEAD to generate beam candidates
  • Select best candidates with ranking

Efficient decoding via distillation

  • Proposed decoding methods have heavy computational cost
  • Exploring using distillation to transfer knowledge of faithfulness-aware decoding to student model
  • Distillation aims to improve decoding time, not model size
  • New decoding distillation loss proposed
  • Iterative distillation process proposed

Experiments

Datasets and models

  • Performed experiments on two datasets for abstractive summarization
  • Used BART-large checkpoint for the two datasets
  • Experiment also done with PEGASUS

Evaluation metrics

  • Used F1 measure of ROUGE-L and BERTScore to evaluate summary quality
  • Used BS-Fact, FactCC, DAE, and QuestEval for faithfulness evaluation
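
For readers unfamiliar with the ROUGE-L F1 measure mentioned above, the following sketch computes it from the longest common subsequence (LCS) of two token sequences. It assumes lowercased whitespace tokenization; standard ROUGE tooling applies its own tokenization and optional stemming, so scores from it may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """F1 of LCS-based precision and recall over whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"))
```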

Human evaluation setup

  • Evaluated faithfulness and informativeness of summaries using Amazon Mechanical Turk
  • Faithfulness judged using 3-star rating system
  • Informativeness judged using best-worst-scaling method
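
Best-worst scaling is commonly scored as the number of times an item was picked as best minus the number of times it was picked as worst, divided by the number of times it was shown. The sketch below follows that common formulation; the judgment format and the example annotations are assumptions for illustration, not the paper's data.

```python
from collections import Counter

def best_worst_scores(judgments):
    """Score each item as (#best - #worst) / #times shown."""
    best, worst, shown = Counter(), Counter(), Counter()
    for judgment in judgments:
        shown.update(judgment["items"])
        best[judgment["best"]] += 1
        worst[judgment["worst"]] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}

# Hypothetical annotations over three systems.
systems = ["greedy", "beam", "lookahead"]
judgments = [
    {"items": systems, "best": "lookahead", "worst": "greedy"},
    {"items": systems, "best": "lookahead", "worst": "beam"},
    {"items": systems, "best": "beam", "worst": "greedy"},
]

scores = best_worst_scores(judgments)
print(scores)
```

Scores range from -1 (always worst) to 1 (always best), which makes systems comparable even when annotators only give relative judgments.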

Decoding setting details

  • Basic decoding methods compared: greedy search, beam search (k = 10), and nucleus sampling (p = 0.9)
  • Composite metric used to rank candidates, trained using FactCC, BS-Fact, DAE, and QuestEval
  • Lookahead used BS-Fact as the faithfulness metric and was applied to both greedy and beam search
  • Distillation used the checkpoint decoded with the two proposed faithfulness-aware methods as the teacher model; the student model was derived from BART-large

Baseline decoding results

  • Common decoding strategies are analyzed
  • Beam search is used to explore if larger beam sizes result in more faithful summaries
  • Increasing the beam size improves all faithfulness scores
  • Reranking strategy has potential to output more faithful summaries
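
As background for the beam-size comparison, this sketch shows how a wider beam explores more of the search space than greedy decoding (beam size 1). It illustrates the search procedure only and says nothing about faithfulness by itself. The toy transition table and all names are assumptions for illustration.

```python
import math

def beam_search(next_probs, start, steps, beam_size):
    """Return the (tokens, log_prob) pair with the highest score found by the beam."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        expanded = []
        for tokens, score in beams:
            for token, prob in next_probs(tokens).items():
                expanded.append((tokens + [token], score + math.log(prob)))
        # Keep only the beam_size highest-scoring partial sequences.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_size]
    return beams[0]

# Toy model: the next-token distribution depends only on the last token.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
}

def toy_next_probs(tokens):
    return TABLE[tokens[-1]]

greedy_tokens, greedy_score = beam_search(toy_next_probs, "<s>", steps=2, beam_size=1)
wide_tokens, wide_score = beam_search(toy_next_probs, "<s>", steps=2, beam_size=2)
print(greedy_tokens, math.exp(greedy_score))
print(wide_tokens, math.exp(wide_score))
```

With beam size 1 the search commits to the locally best first token and ends with a sequence of probability 0.30. With beam size 2 it keeps the second option alive and finds a sequence of probability 0.36.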

Faithfulness-aware decoding results

  • Faithfulness-aware methods are compared to traditional decoding methods
  • Lookahead improves faithfulness
  • Base decoding strategy is still the dominating factor
  • Combination of lookahead and ranking can further improve faithfulness
  • Applying faithful decoding methods decreases ROUGE score

Human evaluation results

  • Human evaluation results show that our proposed decoding strategies generate more faithful summaries than baseline decoding methods.
  • Our proposed methods reduce the percentage of summaries with major factual errors.
  • Our proposed methods generate the most informative summaries according to human evaluation.

Abstractiveness

  • Models can become more faithful by becoming more extractive
  • Experiments conducted on XSum to measure faithfulness and abstractiveness
  • More faithful models tend to be more extractive
  • Lookahead method allows balancing of faithfulness and abstractiveness
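
One simple way to quantify how extractive a summary is: measure the fraction of its n-grams that do not occur in the source. The sketch below uses that novel n-gram fraction as a rough stand-in; it is not the abstractiveness metric used in the paper, and the tokenization and example texts are assumptions.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novel_ngram_fraction(summary, source, n=2):
    """Fraction of summary n-grams absent from the source (0 = fully extractive)."""
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    source_ngrams = set(ngrams(source.lower().split(), n))
    return sum(g not in source_ngrams for g in summary_ngrams) / len(summary_ngrams)

source = "the council approved the budget on monday"
print(novel_ngram_fraction("the council approved the budget", source))
print(novel_ngram_fraction("council passes the budget", source))
```

A summary copied from the source scores 0, while a paraphrase scores higher. Tracking a measure like this next to faithfulness scores shows whether a faithfulness gain comes mostly from copying.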

Distillation

  • Student models approach the performance of teacher models
  • Student models generate more faithful summaries than greedy search baseline
  • Decoding speed improved by 40%
  • Student model improves DAE by 6.6 points and QuestEval by 1 point compared to greedy search baseline
  • Student model outperforms teacher model on faithfulness metrics with two iterations

Related work

  • Decoding methods are used to select the best tokens to form a hypothesis
  • Different decoding strategies have been analyzed for natural language generation
  • Distillation is used to compress knowledge from a larger model into a smaller one
  • Pseudo-labeling is used to reduce computational cost during decoding

Conclusion

  • Popular decoding strategies analyzed for effect on faithfulness for abstractive summarization
  • Two newly proposed faithfulness-aware decoding strategies, ranking and lookahead, can improve faithfulness
  • Distillation trick can be used to improve decoding speed
  • Human evaluation on informativeness with 100% accuracy
  • Krippendorff alpha for CNN/DM and XSum annotation is 0.22 and 0.32 respectively
  • Beam search performs best in terms of faithfulness except for FactCC on XSum dataset
  • Ranking and lookahead improve faithfulness when combined
  • Composite metric is robust for another domain
  • Lookahead heuristic prevents score from dipping
  • Optimizing for one metric will lead to improvement in other faithfulness metrics
  • Composite metric is able to achieve a similarly good score for all faithfulness metrics
  • Lookahead heuristic improves search space
  • Gain in faithfulness outweighs decrease in abstractiveness
  • Human faithfulness score shows BEAM+LOOKAHEAD+ABSTR achieves highest MINT score
  • Average of faithfulness metrics shows BEAM+RANKING achieves highest score