Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- LLMs can be used for complex tasks such as arithmetic and commonsense reasoning.
- Task-specific prompts are important for LLMs to produce high-quality answers.
- Active-Prompt is a new method to adapt LLMs to different tasks with task-specific example prompts.
- Uncertainty-based active learning is used to select the most uncertain questions for annotation.
- Experimental results show the superiority of the proposed method.
Paper Content
Introduction
- Large language models (LLMs) have achieved great success in recent years
- Typical way of applying LLMs is in-context learning
- Performs well on conventional language understanding and generation tasks, but poor on complex reasoning tasks
- Recent prompting studies found that elaborating the reasoning steps in the exemplars endows LLMs with good reasoning abilities
- Chain-of-thought prompting depends on human engineering
- Human-annotated exemplars are not necessarily the most effective for different tasks
- Key problem is how to determine which questions are the most important and helpful for annotation
- Leverage uncertainty and introduce a few human efforts to annotate a small set of questions
- Introduce several metrics to characterize the uncertainty among the model’s predictions
- Proposed approach outperforms competitive baseline models on multiple reasoning tasks
- Contributions are: judiciously select the most helpful and informative questions for annotation, introduce an effective uncertainty-based question selection strategy, and surpass competitive baseline models
Uncertainty estimation
- Start with manually annotated exemplars to help infer answers in uncertainty estimation stage
- Method not dependent on few-shot prompting, other exemplar-free methods can be applied
- Mainly report performance of disagreement-based and entropy-based methods
- Human annotation needed for selected questions
Selection and annotation
- Establish an uncertainty ranking for each question
- Select top-n uncertain questions for annotation
- Annotate questions with rationale chains and answers by human annotators
Inference
- Prompt questions with annotated exemplars
- Apply self-consistency to infer questions multiple times with a temperature T
Experimental settings
- Describe details of datasets and evaluation metrics
- Describe baseline models
- Describe implementation
Datasets and evaluation metrics
- Experiments conducted on 3 types of datasets: Arithmetic Reasoning, Commonsense Reasoning, and Symbolic Reasoning
- 1000 data randomly sampled from training set to reduce computational cost
- Evaluation metric is exact match accuracy
- Performance of model will increase with more financial support
Baselines
- Four methods used as baselines: Chain-of-thought (CoT), Self-consistency (SC), Auto-CoT, and Active-Prompt
- Experiments conducted on CodeX code-davinci-002 and text-davinci-002
- Results of experiments compared against existing models
- Experiments conducted using OpenAI’s services
Implementation
- Model was evaluated on test data
- Number of exemplars varied by dataset
- Test split data was transferred from GSM8K
- Temperature set to 0.7
- Inference done 40 times for each question, most consistent answer taken
Experimental results
- Active-Prompt (D) outperforms all baseline models by a large margin
- Active-Prompt (D) achieves state-of-the-art results with an average of 7.0% and 1.8% improvement over self-consistency
- Active-Prompt outperforms self-consistency across all three tasks (arithmetic reasoning, commonsense and symbolic reasoning)
Analysis
- Conducted additional experiments to investigate effects of fewshot prompts, active selection, annotators, uncertainty metrics, pool size, and prompt engineering
- Analyzed relationship between uncertainty and accuracy
Ablation study
- Zero-shot setting removes dependency on few exemplars
- Active example selection strategy explored
- Effects of different annotators, uncertainty metrics, and pool sizes explored
- 4-8 manually annotated exemplars used to help infer answers in uncertainty estimation stage
- Zero-Shot-Active-Prompt performs competitively to Active-Prompt
- Annotator A and B results consistently better than baseline models
- Entropy used for StrategyQA
- Self-confidence-based method performs badly
- Disagreement and entropy chosen as primary metric
- Performance increases with increase in pool size, converges at k=10
Comparison with auto-cot
- Auto-CoT and Active-Prompt are two methods for question selection
- Active-Prompt outperforms Auto-CoT by a large margin
- Improvement is attributed to uncertainty-based selection and human annotation
Effects of prompt engineering
- Letter (4) task is to concatenate the last letters of each word.
- Template of Wei et al. (2022b) was used and result was 0.7% higher than previous best baseline model.
- Coding style prompt was applied and achieved 97.1% accuracy with a 23.7% improvement.
Uncertainty analysis
- Motivation of proposed method is to reduce model uncertainty and improve few-shot prompting performance
- Negative correlation between uncertainty and accuracy, decrease of uncertainty leads to increase in accuracy
Related work
- Reasoning ability is reviewed
- Prompt-based learning is discussed
- Chain-of-thought prompting is discussed
- Active learning methods are discussed
Reasoning ability
- Reasoning ability is essential to humans and desired for machine learning models
- Reasoning ability consists of various sub-skills
- Previous efforts in machine learning exploited symbolic systems and pre-training strategies
- Recently, large language models with chain-of-thought prompting demonstrate promising reasoning abilities
Prompt-based learning
- Prompt-based Learning (Prompting) aims to elicit helpful knowledge in large language models.
- Existing prompt tuning methods can be categorized into two types based on their nature: discrete prompts and continuous prompts.
- Research is highly relevant to exemplar-based in-context learning and discrete prompts research.
Chain-of-thought prompting
- Chain-of-thought prompting is a way to use large language models to access reasoning abilities.
- Wei et al. (2022b) proposed the idea of enriching few-shot examples with reasoning steps.
- Many studies have improved CoT in terms of self-consistency, least-to-most prompting, dynamic least-to-most prompting, bootstrapping, self-training, and verifier.
- Auto-CoT (Zhang et al., 2022b) divides the test questions into clusters and generates answers via zero-shot prompting.
- Our method considers the combination of diversity and uncertainty as an important future direction.
Active learning
- Active learning aims to improve data labeling efficiency
- Recent studies show benefits of active learning for fine-tuning language models
- Max-entropy and least confidence algorithms incorporated into in-context learning scenarios
- Chain-of-thought prompting used for complex reasoning tasks
Conclusion
- Proposed Active-Prompt for eliciting reasoning in large language models
- Uncertainty-based active selection strategy to determine which questions are most important
- Four different strategies of uncertainty estimation
- Promising performance on 8 datasets for arithmetic, commonsense, and symbolic reasoning
- Analyses of uncertainty metrics, pool sizes, zero-shot learning, and accuracy-uncertainty relationship
- Ablation study on three arithmetic reasoning tasks
- Comparison with Auto-CoT