Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Study of how decoding strategies affect faithfulness in abstractive summarization
- Beam search with large beam sizes produces most faithful summaries, nucleus sampling produces least faithful
- Two faithfulness-aware generation methods proposed to improve faithfulness
- Distillation approach allows model to generate faithful summaries with greedy decoding
Paper Content
Introduction
- Recent developments in large pre-trained language models have achieved remarkable performance on abstractive summarization
- Problem of hallucinations, where generated summary contains facts not present in original document
- Prior research has analyzed and defined potential error types and typology
- Effect of decoding strategies on faithfulness of abstractive summarization is less understood
- Analysis of popular decoding strategies (greedy, beam, nucleus sampling) on two datasets
- Beam search provides most faithful summaries
- Randomness introduced by sampling hurts faithfulness
- Two faithfulness-aware decoding methods proposed
- Distillation approach to generate faithful summaries with greedy decoding
Faithfulness behavior of popular decoding strategies
- Investigated effect of decoding strategies on faithfulness
- Investigated whether better exploration of search space can improve faithfulness
- Investigated how randomness impacts faithfulness
- Explored three common decoding strategies: greedy, beam search, and nucleus sampling
Faithfulness-aware decoding strategies
- Hypothesize that current decoding methods may not explore paths that focus on faithfulness effectively
- Propose two faithfulness-aware methods to modify how the space is explored
- Method 1: Ranking makes use of beam search and picks the most faithful path
- Method 2: Lookahead guides the search process by adding faithfulness heuristics
Ranking with faithfulness metrics
- Beam search explores many suitable candidates during the decoding process
- We propose to rerank the generated candidates according to faithfulness metrics
- Falke et al. (2019) used NLI models to rerank, but it increased the number of unfaithful summaries
- We explore using faithfulness metrics directly for ranking
- We use a composite metric to aggregate the vote of several popular metrics
Lookahead
- Model score is used to select summary tokens
- Reference-free faithfulness evaluation function assigns a score to the summary
- Weight and number of tokens to look into the future are also taken into account
- Number of summaries generated depends on the decoding strategy used
- Full summary is generated instead of partial summary
Combining ranking and lookahead
- Combining two methods to improve faithfulness
- Use BEAM+LOOKAHEAD to generate beam candidates
- Select best candidates with ranking
Efficient decoding via distillation
- Proposed decoding methods have heavy computational cost
- Exploring using distillation to transfer knowledge of faithfulness-aware decoding to student model
- Distillation aims to improve decoding time, not model size
- New decoding distillation loss proposed
- Iterative distillation process proposed
Experiments
Datasets and models
- Performed experiments on two datasets for abstractive summarization
- Used BART-large checkpoint for the two datasets
- Experiment also done with PEGASUS
Evaluation metrics
- Used F1 measure of ROUGE-L and BERTScore to evaluate summary quality
- Used BS-Fact, FactCC, DAE, and QuestEval for faithfulness evaluation
Human evaluation setup
- Evaluated faithfulness and informativeness of summaries using Amazon Mechanical Turk
- Faithfulness judged using 3-star rating system
- Informativeness judged using best-worst-scaling method
Decoding setting details
- Basic decoding methods compared using greedy search, beam search (k = 10), and nucleus sampling (p = 0.9).
- Composite metric used to rank candidates, trained using FactCC, BS-Fact, DAE, and QuestEval.
- Lookahead used BS-Fact as faithfulness metric, applied to both greedy and beam searches.
- Distillation used checkpoint of two proposed faithfulness-aware decoding methods as teacher model, student model from BART-LARGE.
Baseline decoding results
- Common decoding strategies are analyzed in 2022
- Beam search is used to explore if larger beam sizes result in more faithful summaries
- Increasing the beam size improves all faithfulness scores
- Reranking strategy has potential to output more faithful summaries
Faithfulness-aware decoding results
- Faithfulness-aware methods are compared to traditional decoding methods
- Lookahead improves faithfulness
- Base decoding strategy is still the dominating factor
- Combination of lookahead and ranking can further improve faithfulness
- Applying faithful decoding methods decreases ROUGE score
Human evaluation results
- Human evaluation results show that our proposed decoding strategies generate more faithful summaries than baseline decoding methods.
- Our proposed methods reduce the percentage of summaries with major factual errors.
- Our proposed methods generate the most informative summaries according to human evaluation.
Abstractiveness
- Models can become more faithful by becoming more extractive
- Experiments conducted on XSum to measure faithfulness and abstractiveness
- More faithful models tend to be more extractive
- Lookahead method allows balancing of faithfulness and abstractiveness
Distillation
- Student models approach the performance of teacher models
- Student models generate more faithful summaries than greedy search baseline
- Decoding speed improved by 40%
- Student model improves DAE by 6.6 points and QuestEval by 1 point compared to greedy search baseline
- Student model outperforms teacher model on faithfulness metrics with two iterations
Related work
- Decoding methods are used to select the best tokens to form a hypothesis
- Different decoding strategies have been analyzed for natural language generation
- Distillation is used to compress knowledge from a larger model into a smaller one
- Pseudo-labeling is used to reduce computational cost during decoding
Conclusion
- Popular decoding strategies analyzed for effect on faithfulness for abstractive summarization
- Two newly proposed faithfulness-aware decoding strategies, ranking and lookahead, can improve faithfulness
- Distillation trick can be used to improve decoding speed
- Human evaluation on informativeness with 100% accuracy
- Krippendorff alpha for CNN/DM and XSum annotation is 0.22 and 0.32 respectively
- Beam search performs best in terms of faithfulness except for FactCC on XSum dataset
- Ranking and lookahead improve faithfulness when combined
- Composite metric is robust for another domain
- Lookahead heuristic prevents score from dipping
- Optimizing for one metric will lead to improvement in other faithfulness metrics
- Composite metric is able to achieve a similarly good score for all faithfulness metrics
- Lookahead heuristic improves search space
- Gain in faithfulness outweighs decrease in abstractiveness
- Human faithfulness score shows BEAM+LOOKAHEAD+ABSTR achieves highest MINT score
- Average of faithfulness metrics shows BEAM+RANKING achieves highest score