Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Pre-trained language models have been successful in NLG tasks.
  • Various decoding methods have been employed, but often produce suboptimal results.
  • A novel method, \textsc{PairReranker}, was proposed to improve reranking for NLG tasks.
  • Experiments on three NLG tasks demonstrated the effectiveness and flexibility of \textsc{PairReranker}.
  • \textsc{PairReranker} can generalize to improve GPT-3 results.

Paper Content

Introduction

  • Pre-trained encoder-decoder language models (LMs) are effective for natural language generation (NLG) tasks.
  • BART and T5 are examples of pre-trained models.
  • Task-specific pre-trained models, such as PEGASUS, have also been used.
  • Various decoding approaches, such as beam search, diverse beam search, top-k sampling, and top-p sampling, are used during inference.
  • These methods often produce suboptimal results.
  • Selecting the best output from the results of multiple decoding methods can improve performance.
  • For example, using the PEGASUS model on the CNNDM dataset, the Rouge-2 score for the top beam search generation can be increased by 57%.
  • The T5-large model on the CommonGen dataset can achieve a 93% gain in CIDEr score.
  • The Opus-MT model can achieve a 79.7% gain in BLEU on the WMT18 (zh-en) translation task.
  • Re-ranking candidates after decoding is a way to mitigate the gap between oracle selections and top-ranked outputs.
  • SimCLS and SummaReranker are examples of re-ranking approaches.
  • PAIRRERANKER is a novel re-ranking method that uses a single encoder and pairwise loss function.
  • Experiments demonstrate that PAIRRERANKER outperforms the baseline methods and is compatible with large language models.

Problem formulation

Methods

  • Previous baselines have methods that can be formulated
  • Limitations of previous baselines exist
  • A pair-based reranker is proposed to address the limitations

Baseline methods

  • SimCLS Liu and Liu (2021) treat reranking as a learning-to-rank problem and use cosine similarity between source and candidate as the predicted score.
  • SummaReranker Ravaut et al. (2022) frame the problem as a binary classification task and use a mixture-of-expert layer for optimization.

Pairwise reranking

  • Candidates are highly homogeneous, making it difficult for the model to learn their difference.
  • Traditional document retrieval has a rich and heterogeneous document corpus.
  • Search space of the problem only contains dozens of generated candidates from a normal language model.
  • Goal is to train a reranker to capture the subtle nuance among the candidates.
  • Method follows the paradigm of two-stage training.
  • Reranker is expected to output two scores for a given pair of candidates.
  • Model is a multitask binary classification problem.
  • In-context attention is used to capture the difference among the highly homogeneous candidate groups.
  • Subsampling strategy is used during training and inference.
  • Single bubble run of comparisons is used to select the best candidate.

Base models

  • Used PEGASUSlarge and BART-large for summarization task on CNN/DailyMail dataset
  • Used T5-large for generative commonsense reasoning task on CommonGen dataset
  • Used opus-mt checkpoint for Chinese-English translation task on WMT2018 dataset

Evaluation setups

  • Construct training dataset for reranker
  • Ensure base model used to generate candidates on training dataset has not seen them
  • Generate candidates on training dataset using decoding method
  • Use public checkpoints for inference
  • Use beam search and diverse beam search for experiments
  • Generate 15 candidates for each decoding method for training and inference
  • Train reranker for 5 epochs

Main results

  • Overall performance in summarization improved by 6.12% in Rouge-1
  • Task generalization improved by 2.90% in CIDEr on CommonGen dataset and 3.87% in BLEU on WMT2018 dataset
  • Transferring re-rankers to GPT-3 improved by 6.45% on CNN/DM dataset and 24.55% on CommonGen dataset
  • Reranker consistent with itself more than 90% of the time
  • Shuffling order of candidates before inference removes bias of initial order
  • Exploiting encoder-decoder LMs for NLG tasks is a focus in NLG research
  • Decoding methods like beam search are used to generate high quality candidates
  • Top-beam candidate is not always the best one
  • Reranking methods are used to improve quality of NLG
  • Pairwise ranking has shown great performance on NLP tasks
  • Attention mechanism captures difference between a pair of data
  • Self-critic algorithm used to train language model and reranker jointly

Conclusion

  • Pre-trained encoder-decoder language models can be used for NLG tasks.
  • Decoding methods like beam search and top-k sampling can produce suboptimal results.
  • Re-ranking candidate outputs after decoding can improve performance.
  • PAIRRERANKER is a novel reranking method that outperforms baseline methods.