Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Current large language models can perform well on complex tasks with few-shot learning
ALERT is a benchmark and suite of analyses to assess language models’ reasoning ability
ALERT covers 10 different reasoning skills and 20 datasets
Finetuning helps language models learn more reasoning skills
Finetuning can lead to overfitting and generalization problems

Paper Content

Introduction

Large language models (LLMs) have shown increasing in-context learning capabilities with scaling up the model and data size
LLMs still struggle with tasks such as commonsense reasoning and math word problems
Recent work used different prompting methods to improve LLMs’ performance on tasks that require multiple steps of reasoning
ALERT is a new pipeline to benchmark different LLMs on various reasoning skills
ALERT covers 10 different reasoning skills including logical, causal, commonsense, abductive, spatial, analogical, argument and deductive reasoning
Experiments indicate that there is no strong correlation between high vocabulary overlap and performance gain on evaluation datasets
Finetuning helps to improve certain reasoning capabilities of LLMs but not all of them
Finetuning can cause overfitting towards prompt templates
Evaluating reasoning skills with ALERT provides new insights on how models have or have not succeeded in generalizing beyond their experience

Motivation and our benchmark

ALERT is a computer science paper that focuses on measuring LLMs performance on tasks that require contextual understanding and multi-step operations.
ALERT tasks involve writing an implausible answer to a question that involves event duration.
ALERT datasets are selected from NIV2 benchmark and modified to make them easier to analyze.
ALERT reasoning skills are divided into in-domain skills and held-out skills.
In-domain skills can be learned from finetuning data, but these in-domain datasets are actually on-held datasets.
All 20 datasets are used for all other analysis.

Experiment setup

Models

We compare 3 models: pre-trained, finetuned, and rationale-based finetuned
We use OPT models of two scales: 1.3B and 13B
We train all models in Pytorch using Metaseq3
We initialize model hyperparameters for each model scale following OPT
We pack training examples into sequences of length 2048
We use AdamW with 32-bit state and linearly warm up and decay the learning rate
For 1.3B models, we use batch size of 128, and for 13B models, we use batch size of 256

Finetuning details

Finetuning Data consists of 10 datasets
Datasets include ProofWriter, StrategyQA, ECQA, CoQA, GSM8K, AQUA-RAT, Ling

Evaluation

Variable prompt templates used in experiments
Five templates used: T1-T5
Evaluation metrics: ROUGE-L, exact-match, relaxed-match
Default score is highest score among five templates

Analysis

Does finetuning help?

Figure 4 shows the performance of 3 models across 2 scales
OPT-CoT improves performance on both scales, while OPT-FT sometimes yields worse results
This is likely due to models getting overfitted to the prompt format used in finetuning data, while OPT-CoT avoids this issue

What does llms learn during finetuning?

CoT-finetuning improves performance on reasoning tasks
Finetuning from three perspectives: data memorizing, reasoning skills transfer, and prompt template memorizing
Finetuned models should be evaluated on heldout datasets
No strong correlation between data similarity and performance
CoT-finetuning improves in-domain reasoning skills
Finetuning helps learn reasoning skills not seen during pretraining
Finetuning can memorize data format and templates
Finetuned models contain bias towards data formats and templates
CoT-finetuning is better than OPT-FT, but not as good as pre-trained model

Large language models (LLMs) have demonstrated remarkable performance on various NLP tasks, but reasoning tasks remain more challenging.
Several research trajectories have been proposed to improve LLMs’ reasoning abilities.
Existing benchmarks are used to evaluate language models’ performance, but none are specifically designed for reasoning tasks.
Finetuning LLMs on a range of NLP tasks has shown improved performance on held-out downstream tasks.
It is unclear whether the improvement comes from simply adding more training data or finetuning on rationales.

Conclusions

Introduce ALERT, a benchmark for assessing LLMs’ reasoning abilities
Span over 20 datasets and covers 10 different reasoning skills
Explore what is the meaning of finetuning for these complex tasks
LLMs do not rely on memorizing training data, but are capable to learn diverse reasoning skills
Finetuning yields performance improvement, but also has side effects
LLMs memorize data template representation and templates seen during finetuning
CoT-finetuning alleviates this problem to a certain extent
Select checkpoints based on perplexity or loss from validation datasets of finetuning corpus
Select 3 tasks from NIV2 benchmark and compile a development set
Measure unigram vocabulary overlaps between finetuning corpus and evaluation corpus
GPT-3 achieves random results on datasets due to unnatural data expression
Checkpoint selection can determine the final performance of the LLMs
Examples from tasks that require reasoning skills and generated outputs from ChatGPT

Link to paper#

Abstract#

Paper Content#

Introduction#

Motivation and our benchmark#

Experiment setup#

Models#

Finetuning details#

Evaluation#

Analysis#

Does finetuning help?#

What does llms learn during finetuning?#

Related works#

Conclusions#

Link to paper

Abstract

Paper Content

Introduction

Motivation and our benchmark

Experiment setup

Models

Finetuning details

Evaluation

Analysis

Does finetuning help?

What does llms learn during finetuning?

Related works

Conclusions