Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Introduce a new framework, Directional Stimulus Prompting, to provide guidance for black-box frozen large language models on downstream tasks.
- Train a policy LM to generate discrete tokens as ``directional stimulus’’ of each input.
- Policy LM can be trained through supervised learning and reinforcement learning.
- Framework is flexibly applicable to various LMs and tasks.
- Verified effectiveness through summarization and dialogue response generation tasks.
- T5 trained with 2,000 samples from the CNN/Daily Mail dataset improves Codex’s performance by 7.2%.
- 500 dialogues boost the combined score by 52.5%.
Paper Content
Introduction
- Large Language Models (LLMs) have been developed for natural language processing (NLP)
- LLMs are used to perform tasks without parameter updates
- LLMs have achieved success on diverse tasks
- Bob Barker returned to host “The Price Is Right” on April 1, after leaving in 2007 at age 91
Reference
- Bob Barker returned to host “The Price Is Right”
- Barker retired as host in 2007
Directional stimulus prompting
- Bob Barker returned to TV show “The Price Is Right” after 8 years
- Figure 1 shows comparison of proposed Directional Stimulus Prompting with standard prompting method
- Directional Stimulus Prompting uses tuneable policy LM to generate stimulus (keywords) to guide LLM on generating desired summary
- LLMs are expensive to fine-tune, so how to improve LLMs’ performance with few training examples is a challenging problem
- Directional Stimulus Prompting uses small tuneable LM to improve frozen black-box LLM on downstream tasks with reinforcement learning
- Training objective is to maximize reward (downstream performance measure scores)
- Evaluated on summarization and dialogue response generation tasks
- Results show improved performance with keywords as directional stimulus and dialog acts as directional stimulus
Methods
- Presents Directional Stimulus Prompting (DSP), a framework to generate prompts for a black-box frozen LLM
- Uses SFT and RL to optimize the policy LM and minimize rewards defined by evaluation scores of the LLM’s generation
Directional stimulus prompting
- There is an input space X, a data distribution D over X, and an output space Y for a downstream task.
- The LLM can generate output without parameter update given the input x and some demonstrations as input.
- There is a small piece of discrete tokens z named “stimulus” that can provide the LLM hints on generating output that better aligns with human preference or task requirements.
- The output is obtained via the LLM with Directional Stimulus Prompting (DSP).
Supervised fine-tuning
- Perform supervised fine-tuning on a pre-trained language model
- Collect data by heuristically selecting pseudo-stimulus for each input
- Fine-tune policy LM by maximizing log-likelihood
- Further fine-tune policy model using reinforcement learning to optimize LLM’s generation
Reinforcement learning
- Goal is to improve LLM generation
- Measure R can be downstream task performance, human preferences, or quality measures
- Optimization objective is to maximize measure R
- Formulated as an RL problem and solved with PPO
- Reward function is the optimization objective plus KL-divergence penalty
- Coefficient β is dynamically adapted during training
- Beam search decoding used for inference
- NLPO used to mask out less relevant tokens in the vocabulary
Experiments
- Proposed framework DSP can be applied to various types of language models and generation tasks
- Evaluated DSP on summarization and dialogue response generation tasks in few-shot setting
- Used 780M parameter version of pre-trained Flan-T5 and 175B parameter Codex as policy LM and LLM respectively
Summarization
- Summarization is an important task in NLP
- GPT-3 can generate high-quality summaries, but benchmark results are lower than fine-tuned methods
- This paper uses a few training data to improve Codex’s performance
- Evaluated on MultiWOZ dataset with 4 metrics
- Supervised fine-tuning and RL fine-tuning both improve performance
- Performance is closely related to training rewards
- Low quality of dataset leads to superfluous texts
Dialog response generation
- There are two types of studied dialogue systems: chit-chat and task-oriented
- LLMs are usually proficient at chit-chat dialogue systems
- Task-oriented dialogue systems are designed to help users complete specific tasks
- LLMs have been used to deal with dialogue state tracking
- LLMs perform poorly in generating system responses that follow conversation flow
- This work uses a small policy model to control the LLM to generate better system responses
- Evaluation metrics are defined at the dialogue level, BLEU score is computed at corpus level
- Policy network is trained with top-k sampling and beam search decoding
Related work
- Large language models (LLMs) are used in natural language processing (NLP)
- LLMs have many parameters and require a lot of training data
- Most LLMs are not open-sourced and can only be accessed through black-box APIs
- OPT-175B and Bloom are open-sourced LLMs, but require significant computational resources to run and fine-tune
- LLMs need improvement or adjustment on some specific tasks
- Some methods use external knowledge to improve LLMs
- Other methods try to find optimal prompts
- Reinforcement learning has been applied to various NLP tasks
- Proximal Policy Optimization (PPO) is used to optimize a policy model to generate text to guide LLMs
- Natural Language Policy Optimization (NLP) is an extension of PPO for NLP tasks
Conclusion
- Brazilian police arrested Joao Vaccari Neto, treasurer of the ruling Workers’ Party
- Vaccari faces charges of corruption and money laundering
- Allegations of bribery at state-run oil company Petrobras
- Vaccari denies any wrongdoing
- Investigation has not implicated President Dilma Rousseff
- Rousseff was chairwoman of Petrobras when alleged corruption took place
- Investigators looking into whether bribes went towards Rousseff’s election campaigns