Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

LLMs can be computationally and financially costly to use.
Batch prompting is an alternative prompting approach that reduces costs.
Batch prompting can improve downstream performance.

Paper Content

Introduction

Large language models (LLMs) have shown strong capabilities in zero/few-shot settings
Recent work has made progress in in-context learning
LLMs can be costly in terms of token and time usage
Batch prompting is an alternative approach that allows the model to perform inference on multiple samples at once
Batch prompting reduces token and time costs while still retaining downstream performance
Batch prompting works well across different LLMs and reasoning methods

Approach

Introduces batch prompting as an efficient alternative to standard prompting
Compares token and time costs of batch and standard prompting

Problem setup

Conventional paradigm for prompting LLMs for in-context learning involves selecting K in-context few-shot exemplars with both a context and an output.
A realistic scenario involves N test samples, which under standard prompting requires N separate LLM inference calls.

Batch prompting

Batch prompting reduces the number of LLM inference calls needed from N to N/b, where b is the number of samples in each batch.
K in-context exemplars are grouped into K/b batches with b exemplars each.
Position identifier “[index]” is added to assist the LLM and ease the process of parsing the generated responses.
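
The grouping and indexing described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names `build_batch_prompt` and `parse_batch_response` and the exact `Q[i]:` / `A[i]:` line layout are assumptions chosen for the example.

```python
import re

def build_batch_prompt(exemplars, samples, b):
    """Build one prompt for a batch of test samples.

    exemplars: list of (question, answer) pairs (the K in-context exemplars).
    samples:   list of test questions to answer in this single call.
    b:         number of exemplars grouped into each demonstration batch.
    The "Q[i]:" / "A[i]:" layout is an assumed format for illustration.
    """
    lines = []
    # Group the exemplars into batches of b: all questions first, then all answers.
    for start in range(0, len(exemplars), b):
        group = exemplars[start:start + b]
        for i, (question, _) in enumerate(group, 1):
            lines.append(f"Q[{i}]: {question}")
        for i, (_, answer) in enumerate(group, 1):
            lines.append(f"A[{i}]: {answer}")
        lines.append("")
    # Append the test samples with the same position identifiers.
    for i, question in enumerate(samples, 1):
        lines.append(f"Q[{i}]: {question}")
    return "\n".join(lines)

def parse_batch_response(text):
    """Map each position index to the first line of its answer."""
    pattern = re.compile(r"^A\[(\d+)\]:[ \t]*(.*)$", re.MULTILINE)
    return {int(idx): answer.strip() for idx, answer in pattern.findall(text)}
```

The position identifiers are what make the parsing step a simple pattern match. Multi-line answers, such as long reasoning chains, would need a more careful parser than this one.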

Token cost

LLM call costs scale linearly with number of tokens
Most tokens are consumed by prompt tokens
Token efficiency is the portion of tokens spent on generated tokens
Batch prompting lowers token and time costs as the number of samples in each batch increases
Increasing batch size of batch prompting reduces token costs
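
A rough way to see why larger batches help is to model token efficiency directly. The sketch below assumes a simplified accounting in which every in-context exemplar and every generated answer counts as one unit of equal token length, so a call with K exemplars and b batched samples spends K + b units and generates b of them. The function name `token_efficiency` and this accounting are assumptions for illustration, not the paper's exact definition.

```python
def token_efficiency(num_exemplars, batch_size=1):
    """Fraction of a call's tokens spent on generated tokens.

    Assumes each exemplar and each generated answer has the same token
    length, so the common length cancels out of the ratio.
    batch_size=1 corresponds to standard prompting.
    """
    if num_exemplars < 0 or batch_size < 1:
        raise ValueError("need num_exemplars >= 0 and batch_size >= 1")
    return batch_size / (num_exemplars + batch_size)

if __name__ == "__main__":
    # Compare standard prompting (b = 1) with several batch sizes for K = 12.
    for b in (1, 2, 3, 4, 6):
        print(b, round(token_efficiency(12, b), 3))
```

Under this model with K = 12, moving from b = 1 to b = 6 raises the efficiency from 1/13 to 6/18. Real prompts will deviate from this because exemplar and sample lengths vary.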

Time cost

Batch prompting reduces the inference time by decreasing the number of API calls.
The cost of Transformer decoding increases with batch prompting.
Most end-users only have access to LLM API services, so the added decoding time is marginal compared with API call overhead and rate-limit blocking time.
Reducing the number of calls with batch prompting can lower the time costs.
In the future, the time reduction of batch prompting may not be as pronounced.
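
The trade-off above can be illustrated with a toy time model. It assumes that each API call carries a fixed overhead and that decoding time grows linearly with the number of samples answered. Both are simplifying assumptions, and the names `num_calls`, `estimated_total_time`, `overhead_s`, and `decode_s_per_sample` are hypothetical.

```python
import math

def num_calls(num_samples, batch_size=1):
    """API calls needed to cover all samples; the last batch may be smaller."""
    return math.ceil(num_samples / batch_size)

def estimated_total_time(num_samples, batch_size, overhead_s, decode_s_per_sample):
    """Toy estimate: fixed overhead per call plus linear decoding time.

    overhead_s:          assumed fixed cost per API call (network, rate limits).
    decode_s_per_sample: assumed decoding time for one sample's answer.
    """
    calls = num_calls(num_samples, batch_size)
    return calls * overhead_s + num_samples * decode_s_per_sample
```

In this model the decoding term is the same for any batch size, so all of the savings come from paying the per-call overhead fewer times. If that overhead shrinks, the benefit shrinks with it, which matches the caveat about future time reductions.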

Experiments

Batch prompting was evaluated across 10 datasets
Batch prompting can improve token and time efficiency by up to 5x with six samples per batch
Downstream performance is similar or better with batch prompting

Datasets

Evaluated batch prompting on 10 datasets across 3 areas: commonsense QA, arithmetic reasoning, and NLI/NLU
Datasets include CommonsenseQA, StrategyQA, GSM8K, SVAMP, AQuA, AddSub, MultiArith, RTE, MNLI, and SST-5
Batch prompting showed comparable or better performance

Experimental setups

Main experiments use Chain-of-Thought (CoT) prompting on all ten datasets
Batch prompting is compared with standard prompting on token cost, time cost, and downstream performance

Results

Batch prompting reduces token and time costs up to 5x compared to standard prompting
Costs decrease roughly in inverse proportion to the number of samples in each batch
Batch prompting performs comparably or better than standard prompting

Analysis

Analyzing factors that may affect performance of batch prompting
Exploring tradeoff of balancing costs and downstream performance
Demonstrating batch prompting can be applied to different LLMs and prompting methods

Number of batch samples

Performance decreases as batch size increases
Best performance not always achieved with batch size of 2
Setting batch size to 3 or 4 usually achieves good performance while saving tokens and time
The savings in time and token costs diminish as batch size grows

Selection of batch samples

Examined whether sample selection affects performance of batch prompting
Studied two sample selection methods: grouping similar or diverse samples
Table 2 showed no improvements over random grouping
Reason may be error propagation from earlier samples without ground truth outputs
Future research needed to develop effective strategies for selecting samples

Complexity of tasks

Performance of batch prompting is affected by complexity of tasks
AQuA dataset has the largest drop in performance
Longer input contexts lead to more substantial performance drops with batch prompting
Performance is steadier with shorter input contexts

Reasoning methods

Used Chain-of-Thought (CoT) for all ten datasets in main experiments
Examined whether batch prompting is suitable for other common LLM reasoning methods
Experimented with two more reasoning methods: end-to-end and program-based
Batch prompting can improve efficiency while maintaining similar or better performance

Related work

Large language models have improved in-context learning performance
Different methods have been proposed to prompt LLMs
Other works generate programs to solve reasoning tasks
Another line of work focuses on selecting better in-context exemplars
This work adds a new dimension to ICL for large-scale real-world applications
Recent work has proposed methods for efficient language generation

Conclusion

Batch prompting is a new way to prompt LLMs that performs inference on samples in a batched fashion
Multiple samples can be handled in one API call, reducing token and time costs
LLMs are based on the Transformer decoder-only architecture
Time complexity to decode each token is O(nd)
Increasing b (number of samples in a batch) increases time cost of one inference
Experiments cover 10 datasets across commonsense QA, arithmetic reasoning, and NLI/NLU
Batch prompting can achieve better or similar performance compared to standard prompting
OpenAI API overhead and rate-limit blocking time account for most of the time cost
Prompt templates built following CoT, Binder, and Program-of-Thought
Input length affects batch prompting performance
Accuracy decreases with larger b
Time cost per inference with K = 12, C = 100 and various b
