Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Language models have been used to solve complex reasoning tasks.
- Chain-of-thought prompting has been used to help language models solve complex tasks, but it requires very large models.
- This paper proposes a method to enable complex reasoning in smaller models.
- The method is evaluated on publicly available language models across a range of tasks and model sizes.
- The method enables small models to perform complex reasoning tasks, and in some cases outperform the teacher model.
Paper Content
Introduction
- Language models have demonstrated remarkable performance in a wide range of tasks
- Transformer architecture enables scalability
- Large language models have demonstrated in-context generalization capabilities
- Standard prompting methods are insufficient for tasks requiring multiple reasoning steps
- Fine-tune-CoT proposed to utilize CoT reasoning capabilities of large LMs to teach small models
- Diverse reasoning proposed to maximize teaching effects of Fine-tune-CoT
- Fine-tune-CoT elicits notable reasoning performance in small models on complex tasks
- Small models under Fine-tune-CoT outperform their very large teachers in some tasks
- Diverse reasoning leads to high sample efficiency and notable reasoning performance
- Thorough sample studies and ablations of Fine-tune-CoT conducted
Related work
- Pre-train and fine-tune paradigm used to enhance large language models’ performance on downstream tasks
- Fine-tuning requires large dataset of task-specific labeled examples and does not generalize well
- Paradigm shift towards “prompting” the model to predict desired output
- Large LMs can exhibit strong performance in this setting
- Smaller models require additional engineering to perform similarly
- Chain-of-thought (CoT) prompting boosts performance on complex tasks
- CoT requires extremely large models for optimal performance
- Utilizing CoT methods for smaller models by fine-tuning on rationales generated by a very large model
- Knowledge distillation used to reduce model size and latency while preserving accuracy and capacity to generalize
Chain of thought fine-tuning
- Proposed approach enables chain-of-thought reasoning in small language models
- Generate reasoning samples from large teacher models using prompt-based CoT methods
- Filter generated samples and reformat into prompt-completion pairs
- Fine-tune small pre-trained student model on assembled reasoning samples
Diverse reasoning
- Generate multiple reasoning explanations for each training sample to augment the fine-tuning data
- Use stochastic sampling strategy to obtain multiple generations of reasoning paths and linguistic templates
- Diverse reasoning motivated by the intuition that multiple reasoning paths can be used to solve complex tasks
Experiments
- Evaluated method on 12 datasets related to 4 categories of complex reasoning
- Datasets include SingleEq, AddSub, MultiArith, GSM8K, AQUA-RAT, SVAMP, CommonsenseQA, StrategyQA, Last Letters, Coin Flip, Date Understanding, Tracking Shuffled Objects
- Evaluated on GPT-3 family of models
- Compared to 3 baseline methods: Zero-shot, Fine-tune, Zero-shot-CoT
Results
- Fine-tune-CoT enables complex reasoning in small models
- Fine-tune-CoT outperforms Zero-shot-CoT in small models
- Small models can outperform very large teachers in reasoning
- Fine-tune-CoT outperforms vanilla fine-tuning
- Diverse reasoning substantially improves Fine-tune-CoT performance
- Fine-tune-CoT with diverse reasoning is sample efficient
Analysis
- Datasets contain groups of samples which share common templates
- Naive samplewise data split has potential to leak same templates into train and test sets
- Fine-tune-CoT outperforms random guess performance and zero-shot and vanilla fine-tuning baselines
- Teacher model can answer correctly despite incorrect reasoning
- Answer-based filtering outperforms human-filtering
- Max sequence length of 128 initially, but can be insufficient in many datasets
- Increased max length does not necessarily improve student performance
- Different tasks require different lengths of rationales
Discussion
- Versatility and accessibility of Fine-tune-CoT
- Optimizing for efficiency in single tasks
- Towards concise answers
- Vanilla fine-tuning versus Fine-tune-CoT for complex tasks
- Reasoning in small language models
- Limitations and future work
- Can be applied to any complex task
- Can use publicly available APIs
- Can use more accessible hardware
- Can optimize for single tasks
- Can use special-characters to minimize sequence length
- Can train student models to generate concise answers
- Vanilla fine-tuning shows meaningful performance in complex tasks
- Performance scales with diversity of reasoning and amount of available samples
- Can benefit from advances in teacher models
- Can use different prompting methods
- Can use different knowledge distillation methods
Conclusion
- We demonstrated how large language models can be used to teach smaller student models how to reason step-by-step
- We prompted a large model for chain-of-thought rationales and used its completions to fine-tune a smaller model
- Our results show that this method improves the performance of small models on a range of tasks with high sample efficiency
- We used the OpenAI API to generate reasoning samples and predictions
- We used Lr = 128 and Lp = 1024 for maximum sequence lengths
- We used a sampling temperature of T = 0 for all generations, except diverse reasoning
- We analyzed the performance of Fine-tune-CoT on 50 samples per dataset
- We observed that small models have weak arithmetic skills and are sensitive to how a question is formulated
- We found that Fine-tune-CoT performs best on text-based datasets
- We compared Fine-tune-CoT with zero-shot-CoT and found that Fine-tune-CoT can reason correctly in many cases
- We observed that patterns in successful datasets do not lead to overfitting
- We found that Fine-tune-CoT combines the advantages of vanilla fine-tuning and CoT reasoning on smaller models