Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • LLMs have been developed for commercial applications
  • Inference hyperparameters can affect the utility/cost of text generation
  • EcoOptiGen is a framework for economical hyperparameter optimization and cost-based pruning
  • Experiments with GPT-3.5 models verify the effectiveness of EcoOptiGen
  • EcoOptiGen is implemented in the FLAML library

Paper Content

Introduction

  • LLMs have demonstrated impressive capabilities in a range of generative tasks
  • LLMs have been used to build powerful user experiences
  • Research community has studied the effect of individual hyperparameters on inference performance
  • Need to optimize hyperparameters collectively and systematically
  • Cost is a concern for most application builders
  • Proposed cost-based pruning strategy to improve optimization efficiency
  • Evaluated on four datasets, found higher quality hyperparameter settings than default
  • Pruning technique increases tuning performance significantly
  • Holistic hyperparameter optimization can mitigate idiosyncrasies

Background

Text generation with llms

  • LLM Text Generation takes an input prompt and generates one or more output responses
  • Input prompt can include multiple examples to demonstrate desired responses
  • Output can be used by applications in various ways
  • Cost of LLM Text Generation is measured in number of tokens in input and output
  • Goal is to maximize utility of generated text under inference budget constraint
  • Hyperparameters affect cost and utility of generated text
  • Interactions between hyperparameters can be complex

Ecooptigen

  • Notations and definitions introduced for EcoOptiGen: Tuning Data D, Utility Function U, Budget Constraints B, Search Space S
  • Utility of multiple verifiable responses is defined as the best utility score in all responses
  • Budget Constraints B is a tuple of two values: B.i and B.o
  • Search Space S is a dictionary with hyperparameter names and their range of values
  • Framework outputs configuration x* with maximal average utility U x* (D) subject to average inference budget C x* (D) ≤ B.i

Hyperparameter searcher

  • Aim to make framework generically applicable to LLMs and complex utility functions
  • Abstraction away from internal process of computing utility
  • Variety of blackbox optimization techniques
  • Chose BlendSearch due to cost efficiency
  • BlendSearch combines Bayesian optimization and local search

Configuration evaluator

  • Evaluator takes configuration as input and outputs metric to optimize
  • Loops over data examples in D
  • Makes request to LLM service using configuration and input fields
  • Computes utility and cost consumption
  • Compares average cost with user-provided bound
  • Presents two improvements to simple evaluator
  • Prompt templates replaced by input fields
  • Hierarchical search space
  • Initial validity check
  • Pruning leverages assumption
  • Outer loop of algorithm varies number of responses
  • Inner loop of algorithm varies number of data examples
  • Optimization metric is probabilistic success rate
  • Agnostic to optimization metric

Experiments

  • Investigate research questions related to text generation tasks
  • Describe experiment setting in Section 4.1
  • Investigate research questions in Sections 4.2-4.4
  • Discuss future work in Section 4.5

Setup

  • Evaluate EcoOptiGen on 3 datasets: APPS, HumanEval, MATH and XSum
  • 20 examples used for tuning, except for 60 of XSum1
  • 100 examples used for testing for APPS, few hundred for other datasets
  • Evaluation metric: code generation - pass/fail, MATH - equivalent final answer, XSum - Rouge-2 score
  • Comparative methods: HELM, Search, Search + PSR
  • Search space follows Table 1, input set to B.i = 1K, B.o = 1M
  • HumanEval - 4 templates, MATH - 1 fixed demonstration example, XSum - same prompts, n and max tokens from HELM

Ecooptigen’s performance

  • EcoOptiGen outperforms the best untuned GPT-3.5 model in the HELM benchmark
  • Jointly tuning all the hyperparameters can be better than simply increasing the number of responses
  • EcoOptiGen consistently outperforms the other non-pruning methods
  • EcoOptiGen searches for 2-27x more trials under the same optimization budget

Effect of inference budget

  • Performance of EcoOptiGen is affected by inference budget
  • Performance score increases as inference budget increases from 500 to 2000
  • Performance score drops when inference budget is increased from 1500 to 2000
  • Hypothesis that performance drop is due to decrease of number of trials within total optimization budget
  • Additional experiment supports hypothesis

Effect of model

  • Experiments were conducted using models from the GPT-3.5 family
  • OpenAI recommended “code-davinci-002” for code generation and “text-davinci-003” for other text generation
  • Results showed that the best models after tuning were different from the best models according to the HELM benchmark
  • On HumanEval, text-davinci-003 performed the best consistently
  • On MATH, code-davinci-002 was superior in the low inference budget range
  • Hyperparameter optimization can avoid suboptimal choices

Discussions, limitations, and future work

  • Optimized hyperparameter configurations for text-davinci-003 vary across tasks
  • APPS has smallest max tokens and n due to higher input length
  • APPS has higher top p than HumanEval due to difficulty of dataset
  • Future work to develop methods to understand choices and automate tuning
  • Large language models are used for text generation tasks
  • Techniques such as fine-tuning, supervision with human feedback and reinforcement learning have improved model performance
  • Different sampling algorithms have been compared and found to perform similarly
  • No existing work has provided a systematic guide to hyperparameter tuning for large language models
  • Automated hyperparameter optimization methods have been studied for a decade
  • Automated hyperparameter optimization has been studied specifically for NLP tasks
  • Figure 1, 2, 3, 4, 5, 6 and example 3.2 are included in the paper