Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

LLMs can be used for complex tasks such as arithmetic and commonsense reasoning.
Task-specific prompts are important for LLMs to produce high-quality answers.
Active-Prompt is a new method to adapt LLMs to different tasks with task-specific example prompts.
Uncertainty-based active learning is used to select the most uncertain questions for annotation.
Experimental results show the superiority of the proposed method.

Paper Content

Introduction

Large language models (LLMs) have achieved great success in recent years
Typical way of applying LLMs is in-context learning
Performs well on conventional language understanding and generation tasks, but poor on complex reasoning tasks
Recent prompting studies found that elaborating the reasoning steps in the exemplars endows LLMs with good reasoning abilities
Chain-of-thought prompting depends on human engineering
Human-annotated exemplars are not necessarily the most effective for different tasks
Key problem is how to determine which questions are the most important and helpful for annotation
Leverage uncertainty and introduce a few human efforts to annotate a small set of questions
Introduce several metrics to characterize the uncertainty among the model’s predictions
Proposed approach outperforms competitive baseline models on multiple reasoning tasks
Contributions are: judiciously select the most helpful and informative questions for annotation, introduce an effective uncertainty-based question selection strategy, and surpass competitive baseline models

Uncertainty estimation

Start with manually annotated exemplars to help infer answers in uncertainty estimation stage
Method not dependent on few-shot prompting, other exemplar-free methods can be applied
Mainly report performance of disagreement-based and entropy-based methods
Human annotation needed for selected questions

Selection and annotation

Establish an uncertainty ranking for each question
Select top-n uncertain questions for annotation
Annotate questions with rationale chains and answers by human annotators

Inference

Prompt questions with annotated exemplars
Apply self-consistency to infer questions multiple times with a temperature T

Experimental settings

Describe details of datasets and evaluation metrics
Describe baseline models
Describe implementation

Datasets and evaluation metrics

Experiments conducted on 3 types of datasets: Arithmetic Reasoning, Commonsense Reasoning, and Symbolic Reasoning
1000 data randomly sampled from training set to reduce computational cost
Evaluation metric is exact match accuracy
Performance of model will increase with more financial support

Baselines

Four methods used as baselines: Chain-of-thought (CoT), Self-consistency (SC), Auto-CoT, and Active-Prompt
Experiments conducted on CodeX code-davinci-002 and text-davinci-002
Results of experiments compared against existing models
Experiments conducted using OpenAI’s services

Implementation

Model was evaluated on test data
Number of exemplars varied by dataset
Test split data was transferred from GSM8K
Temperature set to 0.7
Inference done 40 times for each question, most consistent answer taken

Experimental results

Active-Prompt (D) outperforms all baseline models by a large margin
Active-Prompt (D) achieves state-of-the-art results with an average of 7.0% and 1.8% improvement over self-consistency
Active-Prompt outperforms self-consistency across all three tasks (arithmetic reasoning, commonsense and symbolic reasoning)

Analysis

Conducted additional experiments to investigate effects of fewshot prompts, active selection, annotators, uncertainty metrics, pool size, and prompt engineering
Analyzed relationship between uncertainty and accuracy

Ablation study

Zero-shot setting removes dependency on few exemplars
Active example selection strategy explored
Effects of different annotators, uncertainty metrics, and pool sizes explored
4-8 manually annotated exemplars used to help infer answers in uncertainty estimation stage
Zero-Shot-Active-Prompt performs competitively to Active-Prompt
Annotator A and B results consistently better than baseline models
Entropy used for StrategyQA
Self-confidence-based method performs badly
Disagreement and entropy chosen as primary metric
Performance increases with increase in pool size, converges at k=10

Comparison with auto-cot

Auto-CoT and Active-Prompt are two methods for question selection
Active-Prompt outperforms Auto-CoT by a large margin
Improvement is attributed to uncertainty-based selection and human annotation

Effects of prompt engineering

Letter (4) task is to concatenate the last letters of each word.
Template of Wei et al. (2022b) was used and result was 0.7% higher than previous best baseline model.
Coding style prompt was applied and achieved 97.1% accuracy with a 23.7% improvement.

Uncertainty analysis

Motivation of proposed method is to reduce model uncertainty and improve few-shot prompting performance
Negative correlation between uncertainty and accuracy, decrease of uncertainty leads to increase in accuracy

Reasoning ability is reviewed
Prompt-based learning is discussed
Chain-of-thought prompting is discussed
Active learning methods are discussed

Reasoning ability

Reasoning ability is essential to humans and desired for machine learning models
Reasoning ability consists of various sub-skills
Previous efforts in machine learning exploited symbolic systems and pre-training strategies
Recently, large language models with chain-of-thought prompting demonstrate promising reasoning abilities

Prompt-based learning

Prompt-based Learning (Prompting) aims to elicit helpful knowledge in large language models.
Existing prompt tuning methods can be categorized into two types based on their nature: discrete prompts and continuous prompts.
Research is highly relevant to exemplar-based in-context learning and discrete prompts research.

Chain-of-thought prompting

Chain-of-thought prompting is a way to use large language models to access reasoning abilities.
Wei et al. (2022b) proposed the idea of enriching few-shot examples with reasoning steps.
Many studies have improved CoT in terms of self-consistency, least-to-most prompting, dynamic least-to-most prompting, bootstrapping, self-training, and verifier.
Auto-CoT (Zhang et al., 2022b) divides the test questions into clusters and generates answers via zero-shot prompting.
Our method considers the combination of diversity and uncertainty as an important future direction.

Active learning

Active learning aims to improve data labeling efficiency
Recent studies show benefits of active learning for fine-tuning language models
Max-entropy and least confidence algorithms incorporated into in-context learning scenarios
Chain-of-thought prompting used for complex reasoning tasks

Conclusion

Proposed Active-Prompt for eliciting reasoning in large language models
Uncertainty-based active selection strategy to determine which questions are most important
Four different strategies of uncertainty estimation
Promising performance on 8 datasets for arithmetic, commonsense, and symbolic reasoning
Analyses of uncertainty metrics, pool sizes, zero-shot learning, and accuracy-uncertainty relationship
Ablation study on three arithmetic reasoning tasks
Comparison with Auto-CoT

Link to paper#

Abstract#

Paper Content#

Introduction#

Uncertainty estimation#

Selection and annotation#

Inference#

Experimental settings#

Datasets and evaluation metrics#

Baselines#

Implementation#

Experimental results#

Analysis#

Ablation study#

Comparison with auto-cot#

Effects of prompt engineering#

Uncertainty analysis#

Related work#

Reasoning ability#

Prompt-based learning#

Chain-of-thought prompting#

Active learning#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Uncertainty estimation

Selection and annotation

Inference

Experimental settings

Datasets and evaluation metrics

Baselines

Implementation

Experimental results

Analysis

Ablation study

Comparison with auto-cot

Effects of prompt engineering

Uncertainty analysis

Related work

Reasoning ability

Prompt-based learning

Chain-of-thought prompting

Active learning

Conclusion