Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Introduce a new framework, Directional Stimulus Prompting, to provide guidance for black-box frozen large language models on downstream tasks.
Train a policy LM to generate discrete tokens as a “directional stimulus” for each input.
Policy LM can be trained through supervised learning and reinforcement learning.
Framework is flexibly applicable to various LMs and tasks.
Verified effectiveness through summarization and dialogue response generation tasks.
T5 trained with 2,000 samples from the CNN/Daily Mail dataset improves Codex’s performance by 7.2%.
500 dialogues boost the combined score by 52.5%.

Paper Content

Introduction

Large Language Models (LLMs) have been developed for natural language processing (NLP)
LLMs are used to perform tasks without parameter updates
LLMs have achieved success on diverse tasks
Figure 1 example (summarization; text recovered from the figure):
Input article (gist): Bob Barker returned to host “The Price Is Right” on April 1 at age 91, after leaving in 2007
Reference summary: Bob Barker returned to host “The Price Is Right”; Barker retired as host in 2007
Summary with Directional Stimulus Prompting: Bob Barker returned to TV show “The Price Is Right” after 8 years
Figure 1 shows comparison of proposed Directional Stimulus Prompting with standard prompting method
Directional Stimulus Prompting uses tuneable policy LM to generate stimulus (keywords) to guide LLM on generating desired summary
LLMs are expensive to fine-tune, so improving their performance with only a few training examples is a challenging problem
Directional Stimulus Prompting uses small tuneable LM to improve frozen black-box LLM on downstream tasks with reinforcement learning
Training objective is to maximize reward (downstream performance measure scores)
Evaluated on summarization and dialogue response generation tasks
Results show improved performance with keywords as directional stimulus and dialog acts as directional stimulus

Methods

Presents Directional Stimulus Prompting (DSP), a framework to generate prompts for a black-box frozen LLM
Uses supervised fine-tuning (SFT) and reinforcement learning (RL) to optimize the policy LM and maximize rewards defined by evaluation scores of the LLM’s generation

Directional stimulus prompting

There is an input space X, a data distribution D over X, and an output space Y for a downstream task.
The LLM can generate output without parameter update given the input x and some demonstrations as input.
A short sequence of discrete tokens z, called the “stimulus”, gives the LLM hints for generating output that better aligns with human preference or task requirements.
The output is obtained via the LLM with Directional Stimulus Prompting (DSP).
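
The formulation above can be sketched as a two-step pipeline: the policy LM maps the input x to a stimulus z, and the frozen LLM receives x, z, and any demonstrations in a single prompt. This is only an illustrative sketch. The function names, the `Article`/`Keywords`/`Summary` template, and the callables standing in for the two models are assumptions, not the paper's actual prompt or API.

```python
from typing import Callable, Sequence


def build_dsp_prompt(x: str, z: Sequence[str], demos: Sequence[str] = ()) -> str:
    """Join optional demonstrations, the input x, and the stimulus z into one prompt."""
    hint = "; ".join(z)  # stimulus rendered as a keyword hint (assumed format)
    query = f"Article: {x}\nKeywords: {hint}\nSummary:"
    return "\n\n".join([*demos, query])


def dsp_generate(
    x: str,
    policy_lm: Callable[[str], Sequence[str]],
    llm: Callable[[str], str],
    demos: Sequence[str] = (),
) -> str:
    """z comes from the small tunable policy LM; y comes from the frozen LLM."""
    z = policy_lm(x)
    return llm(build_dsp_prompt(x, z, demos))
```

Because both models are passed in as plain callables, the black-box LLM is only ever called, never updated, which mirrors the frozen-LLM setting described above.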

Supervised fine-tuning

Perform supervised fine-tuning on a pre-trained language model
Collect data by heuristically selecting pseudo-stimulus for each input
Fine-tune policy LM by maximizing log-likelihood
Further fine-tune policy model using reinforcement learning to optimize LLM’s generation
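
A minimal sketch of how pseudo-stimulus labels might be collected for supervised fine-tuning. The notes above do not spell out the heuristic, so the rule used here (keep source words that also occur in the reference output, skipping a tiny stopword list) is an assumption chosen for illustration, as are the function name and the stopword set.

```python
import re

# Tiny illustrative stopword list (assumption; a real setup would use a fuller one).
STOPWORDS = {"a", "an", "the", "to", "of", "in", "on", "and", "after", "as"}


def select_pseudo_stimulus(source: str, reference: str, max_keywords: int = 5) -> list[str]:
    """Return source words that also appear in the reference, in source order."""
    ref_tokens = set(re.findall(r"[a-z0-9']+", reference.lower()))
    seen: set[str] = set()
    keywords: list[str] = []
    for token in re.findall(r"[A-Za-z0-9']+", source):
        low = token.lower()
        if low in STOPWORDS or low in seen or low not in ref_tokens:
            continue
        seen.add(low)
        keywords.append(token)
        if len(keywords) == max_keywords:
            break
    return keywords
```

Each (source, "; ".join(keywords)) pair would then serve as an input/target example when maximizing the log-likelihood of the policy LM.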

Reinforcement learning

Goal is to improve LLM generation
Measure R can be downstream task performance, human preferences, or quality measures
Optimization objective is to maximize measure R
Formulated as an RL problem and solved with PPO
Reward function is the optimization objective plus KL-divergence penalty
Coefficient β is dynamically adapted during training
Beam search decoding used for inference
NLPO used to mask out less relevant tokens in the vocabulary
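
The reward shaping described above can be sketched as follows. The notes say only that a KL-divergence penalty is combined with the task reward and that β is adapted dynamically. The sketch assumes the penalty is subtracted, that the KL term is estimated from log-probabilities under the current policy and a reference policy (assumed to be the supervised fine-tuned policy LM), and that β follows a clipped proportional update, which is one common scheme. All names and default values are hypothetical.

```python
def penalized_reward(task_reward: float, logp_policy: float, logp_ref: float, beta: float) -> float:
    """Task reward minus beta times a per-sample KL estimate (log-prob difference)."""
    kl_estimate = logp_policy - logp_ref
    return task_reward - beta * kl_estimate


def update_beta(beta: float, observed_kl: float, target_kl: float, k_beta: float = 0.1) -> float:
    """Raise beta when the observed KL exceeds the target, lower it otherwise."""
    error = max(-0.2, min(0.2, observed_kl / target_kl - 1.0))  # clipped proportional error
    return beta * (1.0 + k_beta * error)
```

Clipping the error keeps β from changing by more than a small factor per update, so a single noisy KL estimate cannot swing the penalty sharply.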

Experiments

Proposed framework DSP can be applied to various types of language models and generation tasks
Evaluated DSP on summarization and dialogue response generation tasks in few-shot setting
Used 780M parameter version of pre-trained Flan-T5 and 175B parameter Codex as policy LM and LLM respectively

Summarization

Summarization is an important task in NLP
GPT-3 can generate high-quality summaries, but benchmark results are lower than fine-tuned methods
This paper uses a small amount of training data to improve Codex’s performance
Evaluated on the CNN/Daily Mail dataset with 4 metrics
Supervised fine-tuning and RL fine-tuning both improve performance
Performance is closely related to training rewards
Low quality of dataset leads to superfluous texts

Dialog response generation

There are two types of studied dialogue systems: chit-chat and task-oriented
LLMs are usually proficient at chit-chat dialogue systems
Task-oriented dialogue systems are designed to help users complete specific tasks
LLMs have been used to deal with dialogue state tracking
LLMs perform poorly in generating system responses that follow conversation flow
This work uses a small policy model to control the LLM to generate better system responses
Evaluation metrics are defined at the dialogue level, while the BLEU score is computed at the corpus level
Policy network is trained with top-k sampling and beam search decoding
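
To illustrate how dialogue-level metrics and a corpus-level BLEU score can be folded into one number, the sketch below uses the combined score conventionally reported on MultiWOZ, (Inform + Success) × 0.5 + BLEU. The notes above do not state this formula, so treat it as an assumption. The function name and the per-dialogue boolean inputs are hypothetical.

```python
from typing import Sequence


def combined_score(inform: Sequence[bool], success: Sequence[bool], corpus_bleu: float) -> float:
    """Average dialogue-level flags into percentages, then add the corpus-level BLEU."""
    inform_rate = 100.0 * sum(inform) / len(inform)
    success_rate = 100.0 * sum(success) / len(success)
    return 0.5 * (inform_rate + success_rate) + corpus_bleu
```

For example, four dialogues with three informs, two successes, and a corpus BLEU of 10.0 give 0.5 × (75 + 50) + 10 = 72.5.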

Related work

Large language models (LLMs) are used in natural language processing (NLP)
LLMs have many parameters and require a lot of training data
Most LLMs are not open-sourced and can only be accessed through black-box APIs
OPT-175B and Bloom are open-sourced LLMs, but require significant computational resources to run and fine-tune
LLMs need improvement or adjustment on some specific tasks
Some methods use external knowledge to improve LLMs
Other methods try to find optimal prompts
Reinforcement learning has been applied to various NLP tasks
Proximal Policy Optimization (PPO) is used to optimize a policy model to generate text to guide LLMs
Natural Language Policy Optimization (NLPO) is an extension of PPO for NLP tasks

Conclusion

Brazilian police arrested Joao Vaccari Neto, treasurer of the ruling Workers’ Party
Vaccari faces charges of corruption and money laundering
Allegations of bribery at state-run oil company Petrobras
Vaccari denies any wrongdoing
Investigation has not implicated President Dilma Rousseff
Rousseff was chairwoman of Petrobras when alleged corruption took place
Investigators looking into whether bribes went towards Rousseff’s election campaigns
