Abstract

  • LLMs can be used for complex tasks such as arithmetic and commonsense reasoning.
  • Task-specific prompts are important for LLMs to produce high-quality answers.
  • Active-Prompt is a new method to adapt LLMs to different tasks with task-specific example prompts.
  • Uncertainty-based active learning is used to select the most uncertain questions for annotation.
  • Experimental results show the superiority of the proposed method.

Paper Content

Introduction

  • Large language models (LLMs) have achieved great success in recent years
  • The typical way of applying LLMs is in-context learning
  • In-context learning performs well on conventional language understanding and generation tasks, but poorly on complex reasoning tasks
  • Recent prompting studies found that elaborating the reasoning steps in the exemplars endows LLMs with good reasoning abilities
  • Chain-of-thought prompting depends on human engineering
  • Human-annotated exemplars are not necessarily the most effective for different tasks
  • Key problem is how to determine which questions are the most important and helpful for annotation
  • Leverage uncertainty and a small amount of human effort to annotate a small set of questions
  • Introduce several metrics to characterize the uncertainty among the model’s predictions
  • Proposed approach outperforms competitive baseline models on multiple reasoning tasks
  • Contributions are: judiciously select the most helpful and informative questions for annotation, introduce an effective uncertainty-based question selection strategy, and surpass competitive baseline models

Uncertainty estimation

  • Start with manually annotated exemplars to help infer answers in uncertainty estimation stage
  • The method does not depend on few-shot prompting; exemplar-free methods can be applied instead
  • Mainly report performance of disagreement-based and entropy-based methods
  • Human annotation needed for selected questions
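
The two metrics reported above can be sketched as follows. This is a minimal illustration, assuming that disagreement is the fraction of distinct answers among the k answers sampled for a question and that entropy is the Shannon entropy of the empirical answer distribution. The function names and the toy answers are hypothetical.

```python
import math
from collections import Counter


def disagreement(answers):
    """Fraction of distinct answers among the k sampled answers (h / k)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return len(set(answers)) / len(answers)


def entropy(answers):
    """Shannon entropy (natural log) of the empirical answer distribution."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    k = len(answers)
    return -sum((c / k) * math.log(c / k) for c in Counter(answers).values())


# Toy inputs (made up for illustration): five sampled answers per question.
confident = ["18", "18", "18", "18", "18"]
uncertain = ["18", "20", "18", "7", "12"]

print(disagreement(confident), disagreement(uncertain))  # 0.2 0.8
print(entropy(confident) < entropy(uncertain))  # True
```

Both scores grow as the sampled answers spread out, so either can be used to rank questions by uncertainty.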

Selection and annotation

  • Establish an uncertainty ranking for each question
  • Select top-n uncertain questions for annotation
  • Annotate questions with rationale chains and answers by human annotators
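
The ranking and top-n selection can be sketched as below, assuming each question already has an uncertainty score. The question identifiers and scores are made up for illustration, and the function name is hypothetical.

```python
def select_most_uncertain(uncertainty_by_question, n):
    """Return the n questions with the highest uncertainty, most uncertain first."""
    ranked = sorted(
        uncertainty_by_question.items(),
        key=lambda item: item[1],
        reverse=True,
    )
    return [question for question, _ in ranked[:n]]


# Made-up uncertainty scores for four questions.
scores = {"q1": 0.2, "q2": 0.8, "q3": 0.6, "q4": 0.4}
selected = select_most_uncertain(scores, n=2)
print(selected)  # ['q2', 'q3']
```

The selected questions are the ones handed to human annotators for rationale chains and answers.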

Inference

  • Prompt questions with annotated exemplars
  • Apply self-consistency: sample several answers per question at temperature T and keep the most consistent one
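
A rough sketch of this inference stage follows. It assumes a "Q: ... A: ... The answer is X." prompt layout and replaces the LLM call with a canned stand-in, so no real API is involved. `build_prompt`, `self_consistency`, and `fake_sampler` are hypothetical names, and the exemplar is invented.

```python
from collections import Counter
from itertools import cycle


def build_prompt(exemplars, question):
    """Place annotated exemplars (question, rationale, answer) before the test question."""
    blocks = [
        f"Q: {q}\nA: {rationale} The answer is {answer}."
        for q, rationale, answer in exemplars
    ]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def self_consistency(sample_answer, prompt, num_samples):
    """Sample num_samples answers and return the most frequent one."""
    votes = Counter(sample_answer(prompt) for _ in range(num_samples))
    return votes.most_common(1)[0][0]


# Stand-in for an LLM sampled at temperature T: it replays canned answers.
canned = cycle(["8", "9", "8", "8", "7"])


def fake_sampler(prompt):
    return next(canned)


exemplars = [
    (
        "Tom has 3 apples and buys 2 more. How many apples does he have?",
        "He starts with 3 and adds 2, so 3 + 2 = 5.",
        "5",
    )
]
prompt = build_prompt(exemplars, "Ann has 5 pens and buys 3 more. How many pens does she have?")
print(self_consistency(fake_sampler, prompt, num_samples=5))  # 8
```

In a real setup `fake_sampler` would be replaced by a function that queries the model and extracts the final answer from the generated rationale.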

Experimental settings

  • Describe details of datasets and evaluation metrics
  • Describe baseline models
  • Describe implementation

Datasets and evaluation metrics

  • Experiments conducted on 3 types of datasets: Arithmetic Reasoning, Commonsense Reasoning, and Symbolic Reasoning
  • 1,000 examples are randomly sampled from the training set to reduce computational cost
  • Evaluation metric is exact match accuracy
  • The authors expect performance to improve further given a larger budget
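
Exact match accuracy can be computed as in the sketch below. Stripping surrounding whitespace before comparison is an assumption made here, not a detail taken from the paper, and the sample predictions are invented.

```python
def exact_match_accuracy(predictions, references):
    """Share of predictions that equal their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


# Invented predictions: two of the three match their reference.
accuracy = exact_match_accuracy(["18", "yes", "ab "], ["18", "no", "ab"])
print(f"{accuracy:.3f}")
```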

Baselines

  • Four methods used as baselines: Chain-of-thought (CoT), Self-consistency (SC), Auto-CoT, and Random-CoT; Active-Prompt is the proposed method
  • Experiments conducted on Codex code-davinci-002 and text-davinci-002
  • Results of experiments compared against existing models
  • Experiments conducted using OpenAI’s services

Implementation

  • Model was evaluated on test data
  • Number of exemplars varied by dataset
  • For datasets without a training split, exemplars selected and annotated on GSM8K were transferred
  • Temperature set to 0.7
  • Inference done 40 times for each question, most consistent answer taken

Experimental results

  • Active-Prompt (D), the disagreement-based variant, outperforms all baseline models by a large margin
  • Active-Prompt (D) achieves state-of-the-art results, with average improvements of 7.0% and 1.8% over self-consistency on text-davinci-002 and code-davinci-002, respectively
  • Active-Prompt outperforms self-consistency across all three tasks (arithmetic reasoning, commonsense and symbolic reasoning)

Analysis

  • Conducted additional experiments to investigate the effects of few-shot prompts, active selection, annotators, uncertainty metrics, pool size, and prompt engineering
  • Analyzed relationship between uncertainty and accuracy

Ablation study

  • Zero-shot setting removes dependency on few exemplars
  • Active example selection strategy explored
  • Effects of different annotators, uncertainty metrics, and pool sizes explored
  • 4-8 manually annotated exemplars used to help infer answers in uncertainty estimation stage
  • Zero-Shot-Active-Prompt performs competitively with Active-Prompt
  • Results from annotators A and B are both consistently better than the baseline models
  • Entropy used for StrategyQA
  • Self-confidence-based method performs badly
  • Disagreement and entropy chosen as the primary metrics
  • Performance increases with pool size k (the number of answers sampled per question) and converges at k=10
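
The summary does not say why entropy is used for StrategyQA. One plausible reading, given that StrategyQA answers are yes/no, is that disagreement (distinct answers divided by k) can only take the values 1/k or 2/k on binary labels, whereas entropy still separates a 9/1 split from a 5/5 split. The sketch below illustrates this with hypothetical vote counts; the function name is an assumption.

```python
import math


def binary_metrics(yes_votes, k):
    """Disagreement (distinct answers / k) and entropy for k sampled yes/no answers."""
    counts = [c for c in (yes_votes, k - yes_votes) if c > 0]
    disagreement = len(counts) / k
    entropy = -sum((c / k) * math.log(c / k) for c in counts)
    return disagreement, entropy


# Hypothetical vote splits with k = 10 sampled answers per question.
for yes_votes in (9, 7, 5):
    d, h = binary_metrics(yes_votes, k=10)
    print(f"{yes_votes} yes / {10 - yes_votes} no: disagreement={d:.1f}, entropy={h:.3f}")
```

Disagreement is identical for all three splits, while entropy increases as the votes approach an even split.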

Comparison with Auto-CoT

  • Auto-CoT and Active-Prompt are two methods for question selection
  • Active-Prompt outperforms Auto-CoT by a large margin
  • Improvement is attributed to uncertainty-based selection and human annotation

Effects of prompt engineering

  • Letter (4) is the last-letter concatenation task: concatenate the last letter of each word in a four-word input.
  • With the template of Wei et al. (2022b), the result was 0.7% higher than the previous best baseline model.
  • A coding-style prompt achieved 97.1% accuracy, a 23.7% improvement.
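
For reference, the task itself is easy to state in code. The function below is a hypothetical ground-truth helper for checking model outputs, not the coding-style prompt used in the paper, and the input phrase is an invented example.

```python
def last_letter_concat(phrase):
    """Concatenate the last letter of each whitespace-separated word."""
    return "".join(word[-1] for word in phrase.split())


print(last_letter_concat("Elon Musk Bill Gates"))  # nkls
```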

Uncertainty analysis

  • Motivation of proposed method is to reduce model uncertainty and improve few-shot prompting performance
  • Uncertainty and accuracy are negatively correlated: as uncertainty decreases, accuracy increases

Related work

  • Reasoning ability is reviewed
  • Prompt-based learning is discussed
  • Chain-of-thought prompting is discussed
  • Active learning methods are discussed

Reasoning ability

  • Reasoning ability is essential to humans and desired for machine learning models
  • Reasoning ability consists of various sub-skills
  • Previous efforts in machine learning exploited symbolic systems and pre-training strategies
  • Recently, large language models with chain-of-thought prompting demonstrate promising reasoning abilities

Prompt-based learning

  • Prompt-based Learning (Prompting) aims to elicit helpful knowledge in large language models.
  • Existing prompt tuning methods can be categorized into two types based on their nature: discrete prompts and continuous prompts.
  • Research is highly relevant to exemplar-based in-context learning and discrete prompts research.

Chain-of-thought prompting

  • Chain-of-thought prompting is a way to use large language models to access reasoning abilities.
  • Wei et al. (2022b) proposed the idea of enriching few-shot examples with reasoning steps.
  • Many studies have improved CoT in terms of self-consistency, least-to-most prompting, dynamic least-to-most prompting, bootstrapping, self-training, and verifier.
  • Auto-CoT (Zhang et al., 2022b) divides the test questions into clusters and generates answers via zero-shot prompting.
  • The authors regard combining diversity and uncertainty as an important future direction.

Active learning

  • Active learning aims to improve data labeling efficiency
  • Recent studies show benefits of active learning for fine-tuning language models
  • Max-entropy and least confidence algorithms incorporated into in-context learning scenarios
  • This work applies active learning to chain-of-thought prompting for complex reasoning tasks
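
As a point of comparison, least confidence scores an example by one minus the probability of its most likely label. The sketch below estimates that probability from sampled answers, which is an assumption made for illustration; classical active learning would read it from the model's output distribution. The function name and inputs are hypothetical.

```python
from collections import Counter


def least_confidence(answers):
    """One minus the relative frequency of the most common sampled answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    top_count = Counter(answers).most_common(1)[0][1]
    return 1 - top_count / len(answers)


print(least_confidence(["a", "a", "a", "b"]))  # 0.25
```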

Conclusion

  • Proposed Active-Prompt for eliciting reasoning in large language models
  • Uncertainty-based active selection strategy to determine which questions are most important
  • Four uncertainty estimation strategies: disagreement, entropy, variance, and self-confidence
  • Promising performance on 8 datasets for arithmetic, commonsense, and symbolic reasoning
  • Analyses of uncertainty metrics, pool sizes, zero-shot learning, and accuracy-uncertainty relationship
  • Ablation study on three arithmetic reasoning tasks
  • Comparison with Auto-CoT