Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- LLMs can be computationally and financially costly to use.
- Batch prompting is an alternative prompting approach that reduces costs.
- Batch prompting can improve downstream performance.
Paper Content
Introduction
- Large language models (LLMs) have shown strong capabilities in zero/few-shot settings
- Recent work has made progress in in-context learning
- LLMs can be costly in terms of token and time usage
- Batch prompting is an alternative approach that allows the model to perform inference on multiple samples at once
- Batch prompting reduces token and time costs while still retaining downstream performance
- Batch prompting works well across different LLMs and reasoning methods
Approach
- Introduces batch prompting as an efficient alternative to standard prompting
- Compares token and time costs of batch and standard prompting
Problem setup
- Conventional paradigm for prompting LLMs for in-context learning involves selecting K in-context few-shot exemplars with both a context and an output.
- Realistic scenario involves N test samples, which requires N separate calls of the LLM inference.
Batch prompting
- Batch prompting reduces the time needed for LLM inference from N to N/b.
- K in-context exemplars are grouped into K/b batches with b exemplars each.
- Position identifier “[index]” is added to assist the LLM and ease the process of parsing the generated responses.
Token cost
- LLM call costs scale linearly with number of tokens
- Most tokens are consumed by prompt tokens
- Token efficiency is the portion of tokens spent on generated tokens
- Batch prompting lowers token and time costs as number of samples increases
- Increasing batch size of batch prompting reduces token costs
Time cost
- Batch prompting reduces the inference time by decreasing the number of API calls.
- The cost of Transformer decoding increases with batch prompting.
- Most end-users only have access to LLM API services, so the time cost is marginal.
- Reducing the number of calls with batch prompting can lower the time costs.
- In the future, the time reduction of batch prompting may not be as pronounced.
Experiments
- Batch prompting was evaluated across 10 datasets
- Batch prompting can improve token and time efficiency by up to 5x with 6 samples in batches
- Downstream performance is similar or better with batch prompting
Datasets
- Evaluated batch prompting on 10 datasets across 3 areas
- Datasets include CommonsenseQA, StrategyQA, GSM8K, SVAMP, AQuA, AddSub, MultiArith, RTE, MNLI, and SST-5
- Batch prompting showed comparable or better performance
Experimental setups
- Showed that our proposed algorithm is more efficient than existing algorithms
- Demonstrated that our algorithm can be used to solve a variety of problems in computer science
Results
- Batch prompting reduces token and time costs up to 5x compared to standard prompting
- Decrease in costs scales inverse linearly with number of samples in each batch
- Batch prompting performs comparably or better than standard prompting
Analysis
- Analyzing factors that may affect performance of batch prompting
- Exploring tradeoff of balancing costs and downstream performance
- Demonstrating batch prompting can be applied to different LLMs and prompting methods
Number of batch samples
- Performance decreases as batch size increases
- Best performance not always achieved with batch size of 2
- Setting batch size to 3 or 4 usually achieves good performance while saving tokens and time
- Reducing time/token costs diminishes when batch size is larger
Selection of batch samples
- Examined whether sample selection affects performance of batch prompting
- Studied two sample selection methods: grouping similar or diverse samples
- Table 2 showed no improvements over random grouping
- Reason may be error propagation from earlier samples without ground truth outputs
- Future research needed to develop effective strategies for selecting samples
Complexity of tasks
- Performance of batch prompting is affected by complexity of tasks
- AQuA dataset has the largest drop in performance
- Longer input contexts lead to more substantial performance drops with batch prompting
- Performance is steadier with shorter input contexts
Reasoning methods
- Used Chain-of-Thought (CoT) for all ten datasets in main experiments
- Examined whether batch prompting is suitable for other common LLM reasoning methods
- Experimented with two more reasoning methods: end-to-end and program-based
- Batch prompting can improve efficiency while maintaining similar or better performance
Related work
- Large language models have improved in-context learning performance
- Different methods have been proposed to prompt LLMs
- Other works generate programs to solve reasoning tasks
- Another line of work focuses on selecting better in-context exemplars
- This work adds a new dimension to ICL for large-scale real-world applications
- Recent work has proposed methods for efficient language generation
Conclusion
- Batch prompting is a new way to prompt LLMs that performs inference on samples in a batched fashion
- Multiple samples can be handled in one API call, reducing token and time costs
- LLMs are based on the Transformer decoder-only architecture
- Time complexity to decode each token is O(nd)
- Increasing b (number of samples in a batch) increases time cost of one inference
- Experiments on 10 datasets across commonsense QA, arithmetics, and NLI/NLU
- Batch prompting can achieve better or similar performance compared to standard prompting
- OpenAI API overhead and rate limit blocking time make up most proportion of time cost
- Prompt templates built following CoT, Binder, and Program-of-Thought
- Input length affects batch prompting performance
- Accuracy decreases with larger b
- Time cost per inference with K = 12, C = 100 and various b