Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Current large language models can perform well on complex tasks with few-shot learning
- ALERT is a benchmark and suite of analyses to assess language models’ reasoning ability
- ALERT covers 10 different reasoning skills and 20 datasets
- Finetuning helps language models learn more reasoning skills
- Finetuning can lead to overfitting and generalization problems
Paper Content
Introduction
- Large language models (LLMs) have shown increasing in-context learning capabilities with scaling up the model and data size
- LLMs still struggle with tasks such as commonsense reasoning and math word problems
- Recent work used different prompting methods to improve LLMs’ performance on tasks that require multiple steps of reasoning
- ALERT is a new pipeline to benchmark different LLMs on various reasoning skills
- ALERT covers 10 different reasoning skills including logical, causal, commonsense, abductive, spatial, analogical, argument and deductive reasoning
- Experiments indicate that there is no strong correlation between high vocabulary overlap and performance gain on evaluation datasets
- Finetuning helps to improve certain reasoning capabilities of LLMs but not all of them
- Finetuning can cause overfitting towards prompt templates
- Evaluating reasoning skills with ALERT provides new insights on how models have or have not succeeded in generalizing beyond their experience
Motivation and our benchmark
- ALERT is a computer science paper that focuses on measuring LLMs performance on tasks that require contextual understanding and multi-step operations.
- ALERT tasks involve writing an implausible answer to a question that involves event duration.
- ALERT datasets are selected from NIV2 benchmark and modified to make them easier to analyze.
- ALERT reasoning skills are divided into in-domain skills and held-out skills.
- In-domain skills can be learned from finetuning data, but these in-domain datasets are actually on-held datasets.
- All 20 datasets are used for all other analysis.
Experiment setup
Models
- We compare 3 models: pre-trained, finetuned, and rationale-based finetuned
- We use OPT models of two scales: 1.3B and 13B
- We train all models in Pytorch using Metaseq3
- We initialize model hyperparameters for each model scale following OPT
- We pack training examples into sequences of length 2048
- We use AdamW with 32-bit state and linearly warm up and decay the learning rate
- For 1.3B models, we use batch size of 128, and for 13B models, we use batch size of 256
Finetuning details
- Finetuning Data consists of 10 datasets
- Datasets include ProofWriter, StrategyQA, ECQA, CoQA, GSM8K, AQUA-RAT, Ling
Evaluation
- Variable prompt templates used in experiments
- Five templates used: T1-T5
- Evaluation metrics: ROUGE-L, exact-match, relaxed-match
- Default score is highest score among five templates
Analysis
Does finetuning help?
- Figure 4 shows the performance of 3 models across 2 scales
- OPT-CoT improves performance on both scales, while OPT-FT sometimes yields worse results
- This is likely due to models getting overfitted to the prompt format used in finetuning data, while OPT-CoT avoids this issue
What does llms learn during finetuning?
- CoT-finetuning improves performance on reasoning tasks
- Finetuning from three perspectives: data memorizing, reasoning skills transfer, and prompt template memorizing
- Finetuned models should be evaluated on heldout datasets
- No strong correlation between data similarity and performance
- CoT-finetuning improves in-domain reasoning skills
- Finetuning helps learn reasoning skills not seen during pretraining
- Finetuning can memorize data format and templates
- Finetuned models contain bias towards data formats and templates
- CoT-finetuning is better than OPT-FT, but not as good as pre-trained model
Related works
- Large language models (LLMs) have demonstrated remarkable performance on various NLP tasks, but reasoning tasks remain more challenging.
- Several research trajectories have been proposed to improve LLMs’ reasoning abilities.
- Existing benchmarks are used to evaluate language models’ performance, but none are specifically designed for reasoning tasks.
- Finetuning LLMs on a range of NLP tasks has shown improved performance on held-out downstream tasks.
- It is unclear whether the improvement comes from simply adding more training data or finetuning on rationales.
Conclusions
- Introduce ALERT, a benchmark for assessing LLMs’ reasoning abilities
- Span over 20 datasets and covers 10 different reasoning skills
- Explore what is the meaning of finetuning for these complex tasks
- LLMs do not rely on memorizing training data, but are capable to learn diverse reasoning skills
- Finetuning yields performance improvement, but also has side effects
- LLMs memorize data template representation and templates seen during finetuning
- CoT-finetuning alleviates this problem to a certain extent
- Select checkpoints based on perplexity or loss from validation datasets of finetuning corpus
- Select 3 tasks from NIV2 benchmark and compile a development set
- Measure unigram vocabulary overlaps between finetuning corpus and evaluation corpus
- GPT-3 achieves random results on datasets due to unnatural data expression
- Checkpoint selection can determine the final performance of the LLMs
- Examples from tasks that require reasoning skills and generated outputs from ChatGPT