Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • LLMs are popular for their impressive abilities, but require fine-tuning or prompt engineering to generalize.
  • UPRISE is a lightweight and versatile retriever that automatically retrieves prompts for a given zero-shot task input.
  • UPRISE is universal in a cross-task and cross-model scenario.
  • UPRISE mitigates the hallucination problem in experiments with ChatGPT.

Paper Content

Introduction

  • Large Language Models (LLMs) have shown impressive capabilities across a range of tasks.
  • Two approaches to improve performance: fine-tuning LLMs and developing prompt engineering techniques.
  • Fine-tuning LLMs can be limited by computational resources and unavailable model weights.
  • Multi-task tuning provides an alternative approach to improve zero-shot task generalization.
  • Prompt engineering constructs prompts to guide frozen LLMs.
  • UPRISE proposed to improve zero-shot performance of LLMs in cross-task and cross-model scenarios.
  • UPRISE can benefit different LLMs of much larger scales.
  • UPRISE has potential to improve performance of even strongest LLMs.

Problem definition

  • Aim to improve zero-shot performance of LLMs
  • Decompose prompting process into two steps: retrieve then predict
  • Optimize performance of y P + to match target y
  • Prompt retrieval tunes a retriever to retrieve natural language prompts
  • Cross-task retrieval retrieves for task types not trained on
  • Cross-model retrieval evaluates generalization of small-to-large LLMs

Method

  • UPRISE uses a frozen LLM to supervise the fine-tuning of a prompt retriever
  • UPRISE uses the trained retriever to retrieve prompts for different task types during inference with different LLMs

Data construction

  • Task data is converted into natural language instructions using instruction templates from FLAN.
  • For each data example, one of seven templates is randomly selected.
  • Option suffices and new line characters are removed from instructions to make the format more similar to pre-training corpus.
  • Prompt pool for testing cluster is made of training demonstrations from remaining task clusters.

Prompt scoring

  • We collect positive and negative prompts from a prompt pool to supervise the contrastive learning of the retriever
  • We categorize tasks into two types: text completion and multiple-choice
  • We calculate score of the prompt using an equation
  • For multiple choice tasks, we calculate per-token likelihood of each option
  • We use an equation to calculate the final score
  • We design a prompt filtering mechanism to reduce the number of prompts that need to be scored

Retriever tuning

  • Data is split into two sets: 90% for training and 10% for validation
  • Prompt retriever is a bi-encoder model
  • InfoNCE loss is used to maximize similarity score between encoded prompt and input for positive prompt-input pairs, and minimize it for negative prompt-input pairs
  • Loss function is defined for positive and negative prompts

Inference

  • Fine-tuned prompt encoder is used to encode entire prompt pool
  • Maximum inner-product search is used to retrieve K most similar prompts
  • Prompts are concatenated with task input
  • Model predictions are generated and evaluated using corresponding evaluation metric

Experiment settings

  • Group tasks into clusters
  • Randomly sample up to 10k data examples from each task’s training set
  • Use GPT-Neo-2.7B to tune the retriever
  • Evaluate performance on larger LLMs from various sources
  • Set size of randomly sampled subset to 50 and number of negatives to 20
  • Initialize both encoders of the retriever with BERT BASE
  • Fine-tune for three epochs
  • Set number of concatenated prompts to 3 during inference
  • Report metric scores on test set, or validation set if not available

Main results

  • Evaluated prompt retriever on natural language understanding tasks
  • Generative LLMs need improvement
  • Table 1 compares performance of UPRISE to zero-shot prompting

Cross-task prompt retrieval

  • UPRISE has positive impacts on most of the testing clusters.
  • UPRISE shows consistent performance improvements across all tasks in closed-book QA and natural language inference.
  • UPRISE has negative impacts on tasks in commonsense reasoning and coreference resolution clusters.
  • Alternative techniques such as chain-of-thought prompting may be more effective.

Cross-model prompt retrieval

  • Evaluated cross-task generalization and cross-model ability
  • UPRISE improves performance on reading comprehension, closed-book QA, and paraphrase detection tasks
  • Performance on sentiment analysis is negative with small 2.7B GPT-Neo, but positive with larger models
  • Consistent gains on natural language inference tasks with models that have not been fine-tuned
  • Performance drop on text-davinci-001 due to model being fine-tuned
  • Figure 4 shows consistent performance gains across all LLMs

Hallucination mitigation of chatgpt

  • ChatGPT suffers from a significant issue known as hallucination
  • UPRISE can mitigate the hallucination problem
  • UPRISE outperforms vanilla zero-shot prompting in two fact-checking tasks
  • UPRISE successfully induces a precise answer due to retrieved demonstration
  • UPRISE achieves best results among all universal retrievers

Universal prompt pool

  • We use training demonstrations to construct a prompt pool.
  • The prompt pool outperforms raw texts on all testing clusters.
  • Randomly sampled prompts from the prompt pool can improve performance on four of the five clusters.
  • Prompt engineering works include prompt design, prompt tuning, and prompt search.
  • In-Context Learning (Brown et al., 2020) is a method that helps LLMs transfer to new tasks without gradient updates.
  • Chain-of-Thoughts (CoT) (Wei et al., 2022b) provides LLMs with a series of intermediate reasoning steps as demonstrations.
  • Prompt tuning proposes to learn a prompt represented by continuous parameters rather than discrete natural language tokens (Liu et al., 2021).
  • Prompt search involves searching for prompts from pre-training corpora or downstream task datasets (Gao et al., 2021;Liu et al., 2022;van de Kar et al., 2022;Ye et al., 2023).

Conclusion

  • Proposed UPRISE approach to improve zero-shot performance of LLMs on various tasks
  • Cross-task and cross-model scenario to evaluate universality of retriever
  • Potential to improve even strongest LLMs
  • Figures 4 and 5 show results of cross-task retriever and case of chats on FEVER2.0 dataset
  • Comparison of average performance on GPT-Neo-2.7B with different prompt pools