Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- LLMs have been developed for commercial applications
- Inference hyperparameters can affect the utility/cost of text generation
- EcoOptiGen is a framework for economical hyperparameter optimization and cost-based pruning
- Experiments with GPT-3.5 models verify the effectiveness of EcoOptiGen
- EcoOptiGen is implemented in the FLAML library
Paper Content
Introduction
- LLMs have demonstrated impressive capabilities in a range of generative tasks
- LLMs have been used to build powerful user experiences
- Research community has studied the effect of individual hyperparameters on inference performance
- Need to optimize hyperparameters collectively and systematically
- Cost is a concern for most application builders
- Proposed cost-based pruning strategy to improve optimization efficiency
- Evaluated on four datasets, found higher quality hyperparameter settings than default
- Pruning technique increases tuning performance significantly
- Holistic hyperparameter optimization can mitigate idiosyncrasies
Background
Text generation with llms
- LLM Text Generation takes an input prompt and generates one or more output responses
- Input prompt can include multiple examples to demonstrate desired responses
- Output can be used by applications in various ways
- Cost of LLM Text Generation is measured in number of tokens in input and output
- Goal is to maximize utility of generated text under inference budget constraint
- Hyperparameters affect cost and utility of generated text
- Interactions between hyperparameters can be complex
Ecooptigen
- Notations and definitions introduced for EcoOptiGen: Tuning Data D, Utility Function U, Budget Constraints B, Search Space S
- Utility of multiple verifiable responses is defined as the best utility score in all responses
- Budget Constraints B is a tuple of two values: B.i and B.o
- Search Space S is a dictionary with hyperparameter names and their range of values
- Framework outputs configuration x* with maximal average utility U x* (D) subject to average inference budget C x* (D) ≤ B.i
Hyperparameter searcher
- Aim to make framework generically applicable to LLMs and complex utility functions
- Abstraction away from internal process of computing utility
- Variety of blackbox optimization techniques
- Chose BlendSearch due to cost efficiency
- BlendSearch combines Bayesian optimization and local search
Configuration evaluator
- Evaluator takes configuration as input and outputs metric to optimize
- Loops over data examples in D
- Makes request to LLM service using configuration and input fields
- Computes utility and cost consumption
- Compares average cost with user-provided bound
- Presents two improvements to simple evaluator
- Prompt templates replaced by input fields
- Hierarchical search space
- Initial validity check
- Pruning leverages assumption
- Outer loop of algorithm varies number of responses
- Inner loop of algorithm varies number of data examples
- Optimization metric is probabilistic success rate
- Agnostic to optimization metric
Experiments
- Investigate research questions related to text generation tasks
- Describe experiment setting in Section 4.1
- Investigate research questions in Sections 4.2-4.4
- Discuss future work in Section 4.5
Setup
- Evaluate EcoOptiGen on 3 datasets: APPS, HumanEval, MATH and XSum
- 20 examples used for tuning, except for 60 of XSum1
- 100 examples used for testing for APPS, few hundred for other datasets
- Evaluation metric: code generation - pass/fail, MATH - equivalent final answer, XSum - Rouge-2 score
- Comparative methods: HELM, Search, Search + PSR
- Search space follows Table 1, input set to B.i = 1K, B.o = 1M
- HumanEval - 4 templates, MATH - 1 fixed demonstration example, XSum - same prompts, n and max tokens from HELM
Ecooptigen’s performance
- EcoOptiGen outperforms the best untuned GPT-3.5 model in the HELM benchmark
- Jointly tuning all the hyperparameters can be better than simply increasing the number of responses
- EcoOptiGen consistently outperforms the other non-pruning methods
- EcoOptiGen searches for 2-27x more trials under the same optimization budget
Effect of inference budget
- Performance of EcoOptiGen is affected by inference budget
- Performance score increases as inference budget increases from 500 to 2000
- Performance score drops when inference budget is increased from 1500 to 2000
- Hypothesis that performance drop is due to decrease of number of trials within total optimization budget
- Additional experiment supports hypothesis
Effect of model
- Experiments were conducted using models from the GPT-3.5 family
- OpenAI recommended “code-davinci-002” for code generation and “text-davinci-003” for other text generation
- Results showed that the best models after tuning were different from the best models according to the HELM benchmark
- On HumanEval, text-davinci-003 performed the best consistently
- On MATH, code-davinci-002 was superior in the low inference budget range
- Hyperparameter optimization can avoid suboptimal choices
Discussions, limitations, and future work
- Optimized hyperparameter configurations for text-davinci-003 vary across tasks
- APPS has smallest max tokens and n due to higher input length
- APPS has higher top p than HumanEval due to difficulty of dataset
- Future work to develop methods to understand choices and automate tuning
Related work
- Large language models are used for text generation tasks
- Techniques such as fine-tuning, supervision with human feedback and reinforcement learning have improved model performance
- Different sampling algorithms have been compared and found to perform similarly
- No existing work has provided a systematic guide to hyperparameter tuning for large language models
- Automated hyperparameter optimization methods have been studied for a decade
- Automated hyperparameter optimization has been studied specifically for NLP tasks
- Figure 1, 2, 3, 4, 5, 6 and example 3.2 are included in the paper