
  • Introduce a new framework, Directional Stimulus Prompting, to provide guidance for black-box frozen large language models on downstream tasks.
  • Train a policy LM to generate discrete tokens as the “directional stimulus” of each input.
  • Policy LM can be trained through supervised learning and reinforcement learning.
  • Framework is flexibly applicable to various LMs and tasks.
  • Verified effectiveness through summarization and dialogue response generation tasks.
  • T5 trained with 2,000 samples from the CNN/Daily Mail dataset improves Codex’s performance by 7.2%.
  • Training on 500 dialogues boosts the combined score on the MultiWOZ dataset by 52.5%.

Paper Content


  • Large Language Models (LLMs) have been developed for natural language processing (NLP)
  • LLMs are used to perform tasks without parameter updates
  • LLMs have achieved success on diverse tasks
  • Example article: Bob Barker, aged 91, returned to host “The Price Is Right” on April 1, having retired as host in 2007


Directional stimulus prompting

  • Figure 1 compares the proposed Directional Stimulus Prompting with standard prompting, using a news article about Bob Barker’s return to the TV show “The Price Is Right” after 8 years as the example
  • Directional Stimulus Prompting uses a tunable policy LM to generate a stimulus (keywords) that guides the LLM toward the desired summary
  • LLMs are expensive to fine-tune, so improving their performance with few training examples is a challenging problem
  • Directional Stimulus Prompting uses a small tunable LM, trained with reinforcement learning, to improve a frozen black-box LLM on downstream tasks
  • Training objective is to maximize reward (downstream performance measure scores)
  • Evaluated on summarization and dialogue response generation tasks
  • Results show improved performance with keywords as the directional stimulus for summarization and dialogue acts as the directional stimulus for dialogue response generation


  • Presents Directional Stimulus Prompting (DSP), a framework to generate prompts for a black-box frozen LLM
  • Uses supervised fine-tuning (SFT) and reinforcement learning (RL) to optimize the policy LM and maximize rewards defined by evaluation scores of the LLM’s generation

Directional stimulus prompting

  • There is an input space X, a data distribution D over X, and an output space Y for a downstream task.
  • Given the input x and a few demonstrations in the prompt, the LLM can generate output without any parameter update.
  • A short sequence of discrete tokens z, called the “stimulus”, can give the LLM hints for generating output that better aligns with human preference or task requirements.
  • With Directional Stimulus Prompting (DSP), the LLM generates the output conditioned on both the input x and the stimulus z.
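
The formulation above can be sketched as a prompt-assembly step. This is a minimal illustration, not the paper's implementation: the function name `build_dsp_prompt`, the field labels, and the instruction wording are all hypothetical, and the call to the black-box LLM is left out.

```python
from typing import Sequence


def build_dsp_prompt(article: str, stimulus: Sequence[str], demonstrations: str = "") -> str:
    """Assemble a prompt from the input x, the stimulus z, and optional demonstrations.

    The template below is an assumption made for illustration only.
    """
    parts = []
    if demonstrations:
        # Few-shot demonstrations, if any, come first.
        parts.append(demonstrations.strip())
    parts.append(f"Article: {article.strip()}")
    # The stimulus z is inserted as a hint line; here it is a keyword list.
    parts.append("Keywords: " + "; ".join(stimulus))
    parts.append("Write a short summary that incorporates the keywords.")
    parts.append("Summary:")
    return "\n".join(parts)


prompt = build_dsp_prompt(
    "Bob Barker returned to host the show on April 1.",
    ["Bob Barker", "April 1"],
)
print(prompt)
```

Standard prompting corresponds to dropping the `Keywords:` line, so the only difference between the two settings is whether the LLM sees z.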

Supervised fine-tuning

  • Perform supervised fine-tuning on a pre-trained language model
  • Collect data by heuristically selecting pseudo-stimulus for each input
  • Fine-tune policy LM by maximizing log-likelihood
  • Further fine-tune policy model using reinforcement learning to optimize LLM’s generation

Reinforcement learning

  • Goal is to improve LLM generation
  • Measure R can be downstream task performance, human preferences, or quality measures
  • Optimization objective is to maximize measure R
  • Formulated as an RL problem and solved with PPO
  • Reward function is the measure R combined with a KL-divergence penalty, weighted by a coefficient β, that keeps the policy LM from drifting too far from its initial model
  • Coefficient β is dynamically adapted during training
  • Beam search decoding used for inference
  • NLPO used to mask out less relevant tokens in the vocabulary
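
The KL-penalized reward and the adaptive coefficient can be sketched as below. This follows one commonly used adaptive KL controller; the clipping range of ±0.2, the step size `k_beta`, and both function names are assumptions, since the summary above only states that β is adapted dynamically.

```python
def penalized_reward(task_reward: float, logprob_policy: float,
                     logprob_initial: float, beta: float) -> float:
    """Task reward R minus beta times the log-ratio of current to initial policy."""
    return task_reward - beta * (logprob_policy - logprob_initial)


def update_beta(beta: float, observed_kl: float, target_kl: float,
                k_beta: float = 0.1) -> float:
    """Raise beta when the observed KL exceeds the target, lower it otherwise."""
    # Proportional error, clipped so a single update cannot change beta too much.
    error = max(-0.2, min(0.2, observed_kl / target_kl - 1.0))
    return beta * (1.0 + k_beta * error)


beta = 0.2
reward = penalized_reward(task_reward=1.0, logprob_policy=-1.0,
                          logprob_initial=-1.5, beta=beta)
beta = update_beta(beta, observed_kl=0.1, target_kl=0.05)
print(reward, beta)
```

A larger β pulls the policy LM back toward its initial model, while a smaller β lets the task reward dominate.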


  • Proposed framework DSP can be applied to various types of language models and generation tasks
  • Evaluated DSP on summarization and dialogue response generation tasks in few-shot setting
  • Used the 780M-parameter pre-trained Flan-T5 as the policy LM and the 175B-parameter Codex as the LLM


  • Summarization is an important task in NLP
  • GPT-3 can generate high-quality summaries, but benchmark results are lower than fine-tuned methods
  • This paper uses a small amount of training data to improve Codex’s performance
  • Evaluated on MultiWOZ dataset with 4 metrics
  • Supervised fine-tuning and RL fine-tuning both improve performance
  • Performance is closely related to training rewards
  • Low quality of dataset leads to superfluous texts

Dialog response generation

  • There are two types of studied dialogue systems: chit-chat and task-oriented
  • LLMs are usually proficient at chit-chat dialogue systems
  • Task-oriented dialogue systems are designed to help users complete specific tasks
  • LLMs have been used to deal with dialogue state tracking
  • LLMs perform poorly in generating system responses that follow conversation flow
  • This work uses a small policy model to control the LLM to generate better system responses
  • Evaluation metrics are defined at the dialogue level, while the BLEU score is computed at the corpus level
  • Policy network is trained with top-k sampling and beam search decoding
  • Large language models (LLMs) are used in natural language processing (NLP)
  • LLMs have many parameters and require a lot of training data
  • Most LLMs are not open-sourced and can only be accessed through black-box APIs
  • OPT-175B and Bloom are open-sourced LLMs, but require significant computational resources to run and fine-tune
  • LLMs need improvement or adjustment on some specific tasks
  • Some methods use external knowledge to improve LLMs
  • Other methods try to find optimal prompts
  • Reinforcement learning has been applied to various NLP tasks
  • Proximal Policy Optimization (PPO) is used to optimize a policy model to generate text to guide LLMs
  • Natural Language Policy Optimization (NLPO) is an extension of PPO for NLP tasks


  • Brazilian police arrested Joao Vaccari Neto, treasurer of the ruling Workers’ Party
  • Vaccari faces charges of corruption and money laundering
  • Allegations of bribery at state-run oil company Petrobras
  • Vaccari denies any wrongdoing
  • Investigation has not implicated President Dilma Rousseff
  • Rousseff was chairwoman of Petrobras when alleged corruption took place
  • Investigators looking into whether bribes went towards Rousseff’s election campaigns