Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • LLMs can be computationally and financially costly to use.
  • Batch prompting is an alternative prompting approach that reduces costs.
  • Batch prompting can improve downstream performance.

Paper Content

Introduction

  • Large language models (LLMs) have shown strong capabilities in zero/few-shot settings
  • Recent work has made progress in in-context learning
  • LLMs can be costly in terms of token and time usage
  • Batch prompting is an alternative approach that allows the model to perform inference on multiple samples at once
  • Batch prompting reduces token and time costs while still retaining downstream performance
  • Batch prompting works well across different LLMs and reasoning methods

Approach

  • Introduces batch prompting as an efficient alternative to standard prompting
  • Compares token and time costs of batch and standard prompting

Problem setup

  • Conventional paradigm for prompting LLMs for in-context learning involves selecting K in-context few-shot exemplars with both a context and an output.
  • Realistic scenario involves N test samples, which requires N separate calls of the LLM inference.

Batch prompting

  • Batch prompting reduces the time needed for LLM inference from N to N/b.
  • K in-context exemplars are grouped into K/b batches with b exemplars each.
  • Position identifier “[index]” is added to assist the LLM and ease the process of parsing the generated responses.

Token cost

  • LLM call costs scale linearly with number of tokens
  • Most tokens are consumed by prompt tokens
  • Token efficiency is the portion of tokens spent on generated tokens
  • Batch prompting lowers token and time costs as number of samples increases
  • Increasing batch size of batch prompting reduces token costs

Time cost

  • Batch prompting reduces the inference time by decreasing the number of API calls.
  • The cost of Transformer decoding increases with batch prompting.
  • Most end-users only have access to LLM API services, so the time cost is marginal.
  • Reducing the number of calls with batch prompting can lower the time costs.
  • In the future, the time reduction of batch prompting may not be as pronounced.

Experiments

  • Batch prompting was evaluated across 10 datasets
  • Batch prompting can improve token and time efficiency by up to 5x with 6 samples in batches
  • Downstream performance is similar or better with batch prompting

Datasets

  • Evaluated batch prompting on 10 datasets across 3 areas
  • Datasets include CommonsenseQA, StrategyQA, GSM8K, SVAMP, AQuA, AddSub, MultiArith, RTE, MNLI, and SST-5
  • Batch prompting showed comparable or better performance

Experimental setups

  • Showed that our proposed algorithm is more efficient than existing algorithms
  • Demonstrated that our algorithm can be used to solve a variety of problems in computer science

Results

  • Batch prompting reduces token and time costs up to 5x compared to standard prompting
  • Decrease in costs scales inverse linearly with number of samples in each batch
  • Batch prompting performs comparably or better than standard prompting

Analysis

  • Analyzing factors that may affect performance of batch prompting
  • Exploring tradeoff of balancing costs and downstream performance
  • Demonstrating batch prompting can be applied to different LLMs and prompting methods

Number of batch samples

  • Performance decreases as batch size increases
  • Best performance not always achieved with batch size of 2
  • Setting batch size to 3 or 4 usually achieves good performance while saving tokens and time
  • Reducing time/token costs diminishes when batch size is larger

Selection of batch samples

  • Examined whether sample selection affects performance of batch prompting
  • Studied two sample selection methods: grouping similar or diverse samples
  • Table 2 showed no improvements over random grouping
  • Reason may be error propagation from earlier samples without ground truth outputs
  • Future research needed to develop effective strategies for selecting samples

Complexity of tasks

  • Performance of batch prompting is affected by complexity of tasks
  • AQuA dataset has the largest drop in performance
  • Longer input contexts lead to more substantial performance drops with batch prompting
  • Performance is steadier with shorter input contexts

Reasoning methods

  • Used Chain-of-Thought (CoT) for all ten datasets in main experiments
  • Examined whether batch prompting is suitable for other common LLM reasoning methods
  • Experimented with two more reasoning methods: end-to-end and program-based
  • Batch prompting can improve efficiency while maintaining similar or better performance
  • Large language models have improved in-context learning performance
  • Different methods have been proposed to prompt LLMs
  • Other works generate programs to solve reasoning tasks
  • Another line of work focuses on selecting better in-context exemplars
  • This work adds a new dimension to ICL for large-scale real-world applications
  • Recent work has proposed methods for efficient language generation

Conclusion

  • Batch prompting is a new way to prompt LLMs that performs inference on samples in a batched fashion
  • Multiple samples can be handled in one API call, reducing token and time costs
  • LLMs are based on the Transformer decoder-only architecture
  • Time complexity to decode each token is O(nd)
  • Increasing b (number of samples in a batch) increases time cost of one inference
  • Experiments on 10 datasets across commonsense QA, arithmetics, and NLI/NLU
  • Batch prompting can achieve better or similar performance compared to standard prompting
  • OpenAI API overhead and rate limit blocking time make up most proportion of time cost
  • Prompt templates built following CoT, Binder, and Program-of-Thought
  • Input length affects batch prompting performance
  • Accuracy decreases with larger b
  • Time cost per inference with K = 12, C = 100 and various b