Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Pre-trained language models have been successful in NLG tasks.
- Various decoding methods have been employed, but often produce suboptimal results.
- A novel method, \textsc{PairReranker}, was proposed to improve reranking for NLG tasks.
- Experiments on three NLG tasks demonstrated the effectiveness and flexibility of \textsc{PairReranker}.
- \textsc{PairReranker} can generalize to improve GPT-3 results.
Paper Content
Introduction
- Pre-trained encoder-decoder language models (LMs) are effective for natural language generation (NLG) tasks.
- BART and T5 are examples of pre-trained models.
- Task-specific pre-trained models, such as PEGASUS, have also been used.
- Various decoding approaches, such as beam search, diverse beam search, top-k sampling, and top-p sampling, are used during inference.
- These methods often produce suboptimal results.
- Selecting the best output from the results of multiple decoding methods can improve performance.
- For example, using the PEGASUS model on the CNNDM dataset, the Rouge-2 score for the top beam search generation can be increased by 57%.
- The T5-large model on the CommonGen dataset can achieve a 93% gain in CIDEr score.
- The Opus-MT model can achieve a 79.7% gain in BLEU on the WMT18 (zh-en) translation task.
- Re-ranking candidates after decoding is a way to mitigate the gap between oracle selections and top-ranked outputs.
- SimCLS and SummaReranker are examples of re-ranking approaches.
- PAIRRERANKER is a novel re-ranking method that uses a single encoder and pairwise loss function.
- Experiments demonstrate that PAIRRERANKER outperforms the baseline methods and is compatible with large language models.
Problem formulation
Methods
- Previous baselines have methods that can be formulated
- Limitations of previous baselines exist
- A pair-based reranker is proposed to address the limitations
Baseline methods
- SimCLS Liu and Liu (2021) treat reranking as a learning-to-rank problem and use cosine similarity between source and candidate as the predicted score.
- SummaReranker Ravaut et al. (2022) frame the problem as a binary classification task and use a mixture-of-expert layer for optimization.
Pairwise reranking
- Candidates are highly homogeneous, making it difficult for the model to learn their difference.
- Traditional document retrieval has a rich and heterogeneous document corpus.
- Search space of the problem only contains dozens of generated candidates from a normal language model.
- Goal is to train a reranker to capture the subtle nuance among the candidates.
- Method follows the paradigm of two-stage training.
- Reranker is expected to output two scores for a given pair of candidates.
- Model is a multitask binary classification problem.
- In-context attention is used to capture the difference among the highly homogeneous candidate groups.
- Subsampling strategy is used during training and inference.
- Single bubble run of comparisons is used to select the best candidate.
Base models
- Used PEGASUSlarge and BART-large for summarization task on CNN/DailyMail dataset
- Used T5-large for generative commonsense reasoning task on CommonGen dataset
- Used opus-mt checkpoint for Chinese-English translation task on WMT2018 dataset
Evaluation setups
- Construct training dataset for reranker
- Ensure base model used to generate candidates on training dataset has not seen them
- Generate candidates on training dataset using decoding method
- Use public checkpoints for inference
- Use beam search and diverse beam search for experiments
- Generate 15 candidates for each decoding method for training and inference
- Train reranker for 5 epochs
Main results
- Overall performance in summarization improved by 6.12% in Rouge-1
- Task generalization improved by 2.90% in CIDEr on CommonGen dataset and 3.87% in BLEU on WMT2018 dataset
- Transferring re-rankers to GPT-3 improved by 6.45% on CNN/DM dataset and 24.55% on CommonGen dataset
- Reranker consistent with itself more than 90% of the time
- Shuffling order of candidates before inference removes bias of initial order
Related work
- Exploiting encoder-decoder LMs for NLG tasks is a focus in NLG research
- Decoding methods like beam search are used to generate high quality candidates
- Top-beam candidate is not always the best one
- Reranking methods are used to improve quality of NLG
- Pairwise ranking has shown great performance on NLP tasks
- Attention mechanism captures difference between a pair of data
- Self-critic algorithm used to train language model and reranker jointly
Conclusion
- Pre-trained encoder-decoder language models can be used for NLG tasks.
- Decoding methods like beam search and top-k sampling can produce suboptimal results.
- Re-ranking candidate outputs after decoding can improve performance.
- PAIRRERANKER is a novel reranking method that outperforms baseline methods.