Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Language models are increasingly used to solve complex reasoning tasks.
Chain-of-thought (CoT) prompting helps them solve such tasks step by step, but it requires very large models.
This paper proposes a method to enable complex reasoning in smaller models.
The method is evaluated on publicly available language models across a range of tasks and model sizes.
The method enables small models to perform complex reasoning tasks, and in some cases outperform the teacher model.

Language models have demonstrated remarkable performance in a wide range of tasks
Transformer architecture enables scalability
Large language models have demonstrated in-context generalization capabilities
Standard prompting methods are insufficient for tasks requiring multiple reasoning steps
Fine-tune-CoT proposed to utilize CoT reasoning capabilities of large LMs to teach small models
Diverse reasoning proposed to maximize teaching effects of Fine-tune-CoT
Fine-tune-CoT elicits notable reasoning performance in small models on complex tasks
Small models under Fine-tune-CoT outperform their very large teachers in some tasks
Diverse reasoning leads to high sample efficiency and notable reasoning performance
Thorough sample studies and ablations of Fine-tune-CoT conducted

Pre-train and fine-tune paradigm used to enhance large language models’ performance on downstream tasks
Fine-tuning requires large dataset of task-specific labeled examples and does not generalize well
Paradigm shift towards “prompting” the model to predict desired output
Large LMs can exhibit strong performance in this setting
Smaller models require additional engineering to perform similarly
Chain-of-thought (CoT) prompting boosts performance on complex tasks
CoT requires extremely large models for optimal performance
This work brings CoT methods to smaller models by fine-tuning them on rationales generated by a very large model
Knowledge distillation used to reduce model size and latency while preserving accuracy and capacity to generalize

Proposed approach enables chain-of-thought reasoning in small language models
Generate reasoning samples from large teacher models using prompt-based CoT methods
Filter generated samples and reformat into prompt-completion pairs
Fine-tune small pre-trained student model on assembled reasoning samples
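
The filtering and reformatting steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `TeacherSample` type, the exact-match `normalize` rule, and the `###`, `-->` and `END` delimiters are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TeacherSample:
    """One teacher generation (hypothetical container for this sketch)."""
    question: str
    rationale: str  # step-by-step reasoning produced by the teacher
    predicted: str  # final answer extracted from the teacher output
    gold: str       # ground-truth answer from the dataset


def normalize(answer: str) -> str:
    # Assumed normalization: trim whitespace, lowercase, drop a trailing period.
    return answer.strip().lower().rstrip(".")


def is_correct(sample: TeacherSample) -> bool:
    # Answer-based filter: keep a sample only if the teacher's final answer
    # matches the gold answer. The rationale itself is not checked.
    return normalize(sample.predicted) == normalize(sample.gold)


def to_pair(sample: TeacherSample) -> Dict[str, str]:
    # Reformat into a prompt-completion pair. The delimiters are assumptions.
    prompt = f"{sample.question.strip()} ###"
    completion = f" {sample.rationale.strip()} --> {sample.gold.strip()} END"
    return {"prompt": prompt, "completion": completion}


def build_finetune_data(samples: List[TeacherSample]) -> List[Dict[str, str]]:
    return [to_pair(s) for s in samples if is_correct(s)]
```

Because the filter looks only at the final answer, a rationale with flawed reasoning can still pass if it happens to reach the correct answer.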

Generate multiple reasoning explanations for each training sample to augment the fine-tuning data
Use stochastic sampling strategy to obtain multiple generations of reasoning paths and linguistic templates
Diverse reasoning motivated by the intuition that multiple reasoning paths can be used to solve complex tasks
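
A minimal sketch of diverse reasoning, under stated assumptions: `teacher` is a hypothetical callable standing in for a stochastic call to a large model, the default `temperature` of 0.7 is an illustrative value, and `make_toy_teacher` exists only so the example runs without an API.

```python
import random
from typing import Callable, List, Tuple

# One teacher generation: (rationale, final answer).
Generation = Tuple[str, str]


def diverse_reasoning(
    question: str,
    gold: str,
    teacher: Callable[[str, float], Generation],  # hypothetical teacher interface
    degree: int = 4,           # number of generations sampled per question
    temperature: float = 0.7,  # assumed value; any T > 0 gives stochastic samples
) -> List[Generation]:
    kept: List[Generation] = []
    seen = set()
    for _ in range(degree):
        rationale, answer = teacher(question, temperature)
        if answer.strip() != gold.strip():
            continue  # discard generations with a wrong final answer
        if rationale in seen:
            continue  # discard exact duplicate rationales
        seen.add(rationale)
        kept.append((rationale, answer))
    return kept


def make_toy_teacher(seed: int = 0) -> Callable[[str, float], Generation]:
    # Stand-in for a real model: picks one canned generation at random.
    rng = random.Random(seed)
    canned = [
        ("3 + 4 = 7.", "7"),
        ("Start with 3, then add 4 to get 7.", "7"),
        ("3 * 4 = 12.", "12"),
    ]

    def teacher(question: str, temperature: float) -> Generation:
        return rng.choice(canned)

    return teacher
```

Each surviving rationale becomes a separate fine-tuning sample for the same question, which is how the training set grows without new labeled data.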

Evaluated the method on 12 datasets spanning 4 categories of complex reasoning
Datasets include SingleEq, AddSub, MultiArith, GSM8K, AQUA-RAT, SVAMP, CommonsenseQA, StrategyQA, Last Letters, Coin Flip, Date Understanding, Tracking Shuffled Objects
Evaluated on GPT-3 family of models
Compared to 3 baseline methods: Zero-shot, Fine-tune, Zero-shot-CoT
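
To make the difference between the two prompting baselines concrete, here is a rough sketch of how such prompts are commonly built. The `Q:`/`A:` layout and the function names are assumptions for illustration. The trigger phrase "Let's think step by step." and the follow-up "Therefore, the answer is" come from the widely used Zero-shot-CoT recipe, and the exact wording used in this paper's experiments may differ.

```python
def zero_shot_prompt(question: str) -> str:
    # Plain zero-shot: ask for the answer directly.
    return f"Q: {question}\nA:"


def zero_shot_cot_prompt(question: str) -> str:
    # Zero-shot-CoT, stage 1: elicit a step-by-step rationale.
    return f"Q: {question}\nA: Let's think step by step."


def answer_extraction_prompt(question: str, rationale: str) -> str:
    # Zero-shot-CoT, stage 2: append the rationale and ask for the final answer.
    return f"{zero_shot_cot_prompt(question)} {rationale.strip()}\nTherefore, the answer is"
```

The vanilla fine-tuning baseline needs no prompt template of this kind, since it trains the student on question-answer pairs without rationales.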

Datasets contain groups of samples which share common templates
A naive sample-wise data split can leak the same template into both the train and test sets
Fine-tune-CoT outperforms random guessing as well as the zero-shot and vanilla fine-tuning baselines
Teacher model can answer correctly despite incorrect reasoning
Answer-based filtering outperforms human-filtering
The maximum sequence length was initially set to 128, which can be insufficient for many datasets
Increased max length does not necessarily improve student performance
Different tasks require different lengths of rationales
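
The template leakage issue noted above can be avoided by splitting at the level of templates rather than individual samples. The sketch below is one possible way to do this, not the procedure from the paper: `template_key` uses a crude heuristic (masking numbers) as a stand-in for real template detection, and both function names are hypothetical.

```python
import random
import re
from collections import defaultdict
from typing import Dict, List, Tuple


def template_key(question: str) -> str:
    # Crude heuristic (an assumption): questions that differ only in their
    # numbers are treated as sharing a template.
    return re.sub(r"\d+(?:\.\d+)?", "<num>", question.lower()).strip()


def template_split(
    questions: List[str], test_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[str], List[str]]:
    # Group questions by template so a whole group lands on one side only.
    groups: Dict[str, List[str]] = defaultdict(list)
    for q in questions:
        groups[template_key(q)].append(q)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])

    train = [q for k in keys if k not in test_keys for q in groups[k]]
    test = [q for k in keys if k in test_keys for q in groups[k]]
    return train, test
```

The split fraction applies to the number of templates, not the number of samples, so the resulting test set size depends on how large each template group is.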

We demonstrated how large language models can be used to teach smaller student models how to reason step-by-step
We prompted a large model for chain-of-thought rationales and used its completions to fine-tune a smaller model
Our results show that this method improves the performance of small models on a range of tasks with high sample efficiency
We used the OpenAI API to generate reasoning samples and predictions
We used Lr = 128 and Lp = 1024 for maximum sequence lengths
We used a sampling temperature of T = 0 for all generations, except diverse reasoning
We analyzed the performance of Fine-tune-CoT on 50 samples per dataset
We observed that small models have weak arithmetic skills and are sensitive to how a question is formulated
We found that Fine-tune-CoT performs best on text-based datasets
We compared Fine-tune-CoT with zero-shot-CoT and found that Fine-tune-CoT can reason correctly in many cases
We observed that patterns in successful datasets do not lead to overfitting
We found that Fine-tune-CoT combines the advantages of vanilla fine-tuning and CoT reasoning on smaller models