Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

LLMs have been developed for commercial applications
Inference hyperparameters can affect the utility/cost of text generation
EcoOptiGen is a framework for economical hyperparameter optimization and cost-based pruning
Experiments with GPT-3.5 models verify the effectiveness of EcoOptiGen
EcoOptiGen is implemented in the FLAML library

Paper Content

Introduction

LLMs have demonstrated impressive capabilities in a range of generative tasks
LLMs have been used to build powerful user experiences
Research community has studied the effect of individual hyperparameters on inference performance
Need to optimize hyperparameters collectively and systematically
Cost is a concern for most application builders
Proposed cost-based pruning strategy to improve optimization efficiency
Evaluated on four datasets, found higher quality hyperparameter settings than default
Pruning technique increases tuning performance significantly
Holistic hyperparameter optimization can mitigate idiosyncrasies

Background

Text generation with LLMs

LLM Text Generation takes an input prompt and generates one or more output responses
Input prompt can include multiple examples to demonstrate desired responses
Output can be used by applications in various ways
Cost of LLM Text Generation is measured in number of tokens in input and output
Goal is to maximize utility of generated text under inference budget constraint
Hyperparameters affect cost and utility of generated text
Interactions between hyperparameters can be complex
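The token-based cost model above can be sketched in a few lines. This is a minimal illustration, assuming the prompt is charged once per request and each generated response is charged in full; `count_tokens`, `request_cost`, and `within_budget` are hypothetical names, and the whitespace tokenizer is only a stand-in for the subword tokenizer a real service would use.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Hypothetical tokenizer stand-in: one token per whitespace-separated word."""
    return len(text.split())


def request_cost(prompt: str, responses: List[str]) -> int:
    """Tokens consumed by one request: the prompt once, plus every response."""
    return count_tokens(prompt) + sum(count_tokens(r) for r in responses)


def within_budget(costs: List[int], budget_per_example: float) -> bool:
    """Compare the average per-example cost with an inference budget."""
    return sum(costs) / len(costs) <= budget_per_example


# One prompt, three identical responses (n = 3).
cost = request_cost(
    "Write a function that adds two numbers",
    ["def add(a, b): return a + b"] * 3,
)
```

Requesting more responses or longer outputs raises the cost directly, which is why hyperparameters such as the number of responses and the maximum output length affect both cost and utility.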

EcoOptiGen

Notations and definitions introduced for EcoOptiGen: Tuning Data D, Utility Function U, Budget Constraints B, Search Space S
Utility of multiple verifiable responses is defined as the best utility score in all responses
Budget Constraints B is a tuple of two values: B.i, the average inference budget, and B.o, the total optimization budget
Search Space S is a dictionary with hyperparameter names and their range of values
Framework outputs configuration $x^*$ with maximal average utility $U_{x^*}(D)$ subject to the average inference budget constraint $C_{x^*}(D) \le B.i$
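Written out, the definitions above amount to a constrained optimization problem. The formulation below is a sketch: the symbols $D$, $U$, $B.i$, $S$, and $x^*$ come from the summary, while the per-example cost $c_x(d)$ and the response notation $r_j^{x}(d)$ (the $j$-th of $n$ responses generated for example $d$ under configuration $x$) are introduced here for illustration. The inner maximum applies to the case of multiple verifiable responses.

```latex
\begin{aligned}
x^* &= \arg\max_{x \in S} \; U_x(D)
      \quad \text{subject to} \quad C_x(D) \le B.i, \\
U_x(D) &= \frac{1}{|D|} \sum_{d \in D} \max_{1 \le j \le n} U\bigl(r_j^{x}(d),\, d\bigr), \\
C_x(D) &= \frac{1}{|D|} \sum_{d \in D} c_x(d).
\end{aligned}
```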

Hyperparameter searcher

Aim to make framework generically applicable to LLMs and complex utility functions
Abstraction away from internal process of computing utility
Variety of blackbox optimization techniques
Chose BlendSearch due to cost efficiency
BlendSearch combines Bayesian optimization and local search
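To illustrate what treating the evaluator as a black box means, here is a minimal sketch of a search space dictionary and a searcher that sees only a configuration going in and a score coming out. Plain random search is used as a stand-in; it is not BlendSearch. The hyperparameter ranges are assumptions for illustration and are not taken from the paper's Table 1.

```python
import random
from typing import Any, Callable, Dict, Optional, Tuple

# Illustrative search space: each name maps to a list of choices or a
# (low, high) numeric range. These values are assumptions.
SEARCH_SPACE: Dict[str, Any] = {
    "model": ["text-davinci-003", "code-davinci-002"],
    "max_tokens": (50, 1000),
    "n": (1, 100),
    "top_p": (0.0, 1.0),
}


def sample_config(space: Dict[str, Any], rng: random.Random) -> Dict[str, Any]:
    """Draw one configuration uniformly at random from the search space."""
    config = {}
    for name, domain in space.items():
        if isinstance(domain, list):
            config[name] = rng.choice(domain)       # categorical choice
        elif all(isinstance(v, int) for v in domain):
            config[name] = rng.randint(*domain)     # integer range, inclusive
        else:
            config[name] = rng.uniform(*domain)     # continuous range
    return config


def random_search(
    evaluate: Callable[[Dict[str, Any]], float],
    space: Dict[str, Any],
    num_trials: int,
    seed: int = 0,
) -> Tuple[Optional[Dict[str, Any]], float]:
    """Black-box search: only config -> score is visible to the searcher."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(num_trials):
        config = sample_config(space, rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Toy objective standing in for the real utility computation.
best_config, best_score = random_search(
    lambda c: -abs(c["top_p"] - 0.5), SEARCH_SPACE, num_trials=50
)
```

Because the searcher never inspects how the score is computed, the same loop works for any utility function, which is the property the framework relies on when swapping in a more cost-efficient searcher.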

Configuration evaluator

Evaluator takes configuration as input and outputs metric to optimize
Loops over data examples in D
Makes request to LLM service using configuration and input fields
Computes utility and cost consumption
Compares average cost with user-provided bound
Presents two improvements to simple evaluator
Prompt templates replaced by input fields
Hierarchical search space
Initial validity check
Pruning leverages assumption
Outer loop of algorithm varies number of responses
Inner loop of algorithm varies number of data examples
Optimization metric is probabilistic success rate
Agnostic to optimization metric
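The evaluator loop and the two-level pruning structure can be sketched as follows. This is not the paper's algorithm: it is a minimal illustration under two stated assumptions, namely that cost does not decrease as more responses are requested, and that a trial can be abandoned as soon as the tokens spent so far already force the average cost over D above the budget. `fake_generate` and `exact_match` are hypothetical stand-ins for the LLM service call and the utility function.

```python
def evaluate_with_pruning(config, data, generate, utility, budget_per_example):
    """Return the average utility, or None if the configuration is pruned."""
    n = 1
    while True:  # outer loop: grow the number of responses toward config["n"]
        n = min(n, config["n"])
        total_cost, total_utility = 0, 0.0
        for example in data:  # inner loop: grow the number of examples seen
            responses, cost = generate(example, n, config)
            total_cost += cost
            # Costs are non-negative, so once the running total exceeds
            # budget * |D| the final average must exceed the budget too.
            if total_cost > budget_per_example * len(data):
                return None
            # Utility of multiple verifiable responses: best score among them.
            total_utility += max(utility(r, example) for r in responses)
        if n == config["n"]:
            return total_utility / len(data)
        n *= 2


def fake_generate(example, n, config):
    """Hypothetical LLM stand-in: response i is right once i >= solved_at."""
    responses = ["right" if i >= example["solved_at"] else "wrong" for i in range(n)]
    cost = example["prompt_tokens"] + n * config["max_tokens"]
    return responses, cost


def exact_match(response, example):
    """Hypothetical utility: 1.0 for a correct response, else 0.0."""
    return 1.0 if response == "right" else 0.0


DATA = [
    {"prompt_tokens": 20, "solved_at": 0},
    {"prompt_tokens": 20, "solved_at": 3},
]

score_n4 = evaluate_with_pruning({"n": 4, "max_tokens": 10}, DATA, fake_generate, exact_match, 100)
score_n2 = evaluate_with_pruning({"n": 2, "max_tokens": 10}, DATA, fake_generate, exact_match, 100)
score_n16 = evaluate_with_pruning({"n": 16, "max_tokens": 10}, DATA, fake_generate, exact_match, 100)
```

In this toy setup the configuration with 16 responses is abandoned partway through the data because it cannot satisfy the budget, so the remaining requests are never paid for.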

Experiments

Investigate research questions related to text generation tasks
Describe experiment setting in Section 4.1
Investigate research questions in Sections 4.2-4.4
Discuss future work in Section 4.5

Setup

Evaluate EcoOptiGen on four datasets: APPS, HumanEval, MATH, and XSum
20 examples used for tuning, except for XSum, which uses 60
100 examples used for testing for APPS, few hundred for other datasets
Evaluation metric: code generation - pass/fail, MATH - equivalent final answer, XSum - Rouge-2 score
Comparative methods: HELM, Search, Search + PSR
Search space follows Table 1; budgets are set to B.i = 1K and B.o = 1M
HumanEval - 4 templates, MATH - 1 fixed demonstration example, XSum - same prompts, n and max tokens from HELM

EcoOptiGen’s performance

EcoOptiGen outperforms the best untuned GPT-3.5 model in the HELM benchmark
Jointly tuning all the hyperparameters can be better than simply increasing the number of responses
EcoOptiGen consistently outperforms the other non-pruning methods
EcoOptiGen searches for 2-27x more trials under the same optimization budget

Effect of inference budget

Performance of EcoOptiGen is affected by inference budget
Performance score generally increases as inference budget increases from 500 to 1500
Performance score drops when inference budget is increased from 1500 to 2000
Hypothesis that performance drop is due to decrease of number of trials within total optimization budget
Additional experiment supports hypothesis

Effect of model

Experiments were conducted using models from the GPT-3.5 family
OpenAI recommended “code-davinci-002” for code generation and “text-davinci-003” for other text generation
Results showed that the best models after tuning were different from the best models according to the HELM benchmark
On HumanEval, text-davinci-003 performed the best consistently
On MATH, code-davinci-002 was superior in the low inference budget range
Hyperparameter optimization can avoid suboptimal choices

Discussions, limitations, and future work

Optimized hyperparameter configurations for text-davinci-003 vary across tasks
APPS has smallest max tokens and n due to higher input length
APPS has higher top p than HumanEval due to difficulty of dataset
Future work to develop methods to understand choices and automate tuning

Related work

Large language models are used for text generation tasks
Techniques such as fine-tuning, supervision with human feedback and reinforcement learning have improved model performance
Different sampling algorithms have been compared and found to perform similarly
No existing work has provided a systematic guide to hyperparameter tuning for large language models
Automated hyperparameter optimization methods have been studied for a decade
Automated hyperparameter optimization has been studied specifically for NLP tasks
Figures 1-6 and Example 3.2 are included in the paper
