Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

LLMs are popular for their impressive abilities, but require fine-tuning or prompt engineering to generalize.
UPRISE is a lightweight and versatile retriever that automatically retrieves prompts for a given zero-shot task input.
UPRISE is universal in a cross-task and cross-model scenario.
UPRISE mitigates the hallucination problem in experiments with ChatGPT.

Paper Content

Introduction

Large Language Models (LLMs) have shown impressive capabilities across a range of tasks.
Two approaches can improve their performance: fine-tuning LLMs and developing prompt engineering techniques.
Fine-tuning LLMs can be limited by computational resources and unavailable model weights.
Multi-task tuning provides an alternative approach to improve zero-shot task generalization.
Prompt engineering constructs prompts to guide frozen LLMs.
UPRISE is proposed to improve the zero-shot performance of LLMs in cross-task and cross-model scenarios.
UPRISE can benefit different LLMs of much larger scales.
UPRISE has the potential to improve the performance of even the strongest LLMs.

Problem definition

Aim to improve zero-shot performance of LLMs
Decompose prompting process into two steps: retrieve then predict
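Optimize the prediction $y^{P^+}$, which the LLM produces from the retrieved positive prompts $P^+$ together with the input, to match the target $y$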
Prompt retrieval tunes a retriever to retrieve natural language prompts
Cross-task retrieval retrieves for task types not trained on
Cross-model retrieval evaluates generalization of small-to-large LLMs
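
The retrieve-then-predict decomposition can be sketched as a function that takes the two steps as arguments. This is a minimal illustration, not the authors' code: `retrieve_then_predict`, `toy_retrieve`, and `toy_predict` are hypothetical names, the toy functions stand in for a trained retriever and a frozen LLM, and joining prompts with a newline is an assumption.

```python
from typing import Callable, List


def retrieve_then_predict(
    task_input: str,
    prompt_pool: List[str],
    retrieve: Callable[[str, List[str]], List[str]],
    predict: Callable[[str], str],
) -> str:
    """Step 1: pick prompts for the input. Step 2: let the frozen LLM answer."""
    prompts = retrieve(task_input, prompt_pool)
    # Retrieved prompts are placed in front of the task input (separator is assumed).
    llm_input = "\n".join(prompts + [task_input])
    return predict(llm_input)


# Toy stand-ins (not real models) so the sketch runs end to end.
def toy_retrieve(task_input: str, pool: List[str]) -> List[str]:
    words = set(task_input.lower().split())
    # Rank by word overlap and keep the single best prompt.
    return sorted(pool, key=lambda p: -len(words & set(p.lower().split())))[:1]


def toy_predict(llm_input: str) -> str:
    # Report how many lines were seen; a real LLM would generate text here.
    return f"lines={len(llm_input.splitlines())}"


pool = ["Is the sky blue? yes", "Translate cat to French: chat"]
answer = retrieve_then_predict("Is the grass blue?", pool, toy_retrieve, toy_predict)
```

Passing the retriever and the LLM as callables mirrors the cross-model setting: the same retriever can be paired with a different `predict` function without changing the pipeline.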

Method

UPRISE uses a frozen LLM to supervise the fine-tuning of a prompt retriever
UPRISE uses the trained retriever to retrieve prompts for different task types during inference with different LLMs

Data construction

Task data is converted into natural language instructions using instruction templates from FLAN.
For each data example, one of seven templates is randomly selected.
Option suffixes and newline characters are removed from instructions to make the format more similar to the pre-training corpus.
The prompt pool for each testing cluster is made of training demonstrations from the remaining task clusters.
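
The two data construction steps, rendering examples through a randomly chosen template and building a held-out prompt pool, might look roughly like this. The templates, cluster names, and function names below are invented for illustration; the actual FLAN templates (seven per task, according to the notes above) are not reproduced here.

```python
import random
from typing import Dict, List

# Hypothetical templates; the real FLAN instruction templates differ.
TEMPLATES = [
    "Question: {question} Answer: {answer}",
    "{question} The answer is {answer}",
]


def to_instruction(example: Dict[str, str], rng: random.Random) -> str:
    """Render one example with a randomly chosen template, without newlines."""
    text = rng.choice(TEMPLATES).format(**example)
    return " ".join(text.split())  # collapse newlines and repeated spaces


def build_prompt_pool(clusters: Dict[str, List[str]], test_cluster: str) -> List[str]:
    """Pool for a testing cluster: demonstrations from all other clusters."""
    return [
        demo
        for name, demos in clusters.items()
        if name != test_cluster
        for demo in demos
    ]


rng = random.Random(0)
clusters = {
    "closed_book_qa": [
        to_instruction({"question": "What is the capital\nof France?", "answer": "Paris"}, rng)
    ],
    "nli": [
        to_instruction({"question": "Does 'A dog runs' entail 'An animal moves'?", "answer": "yes"}, rng)
    ],
    "sentiment": [
        to_instruction({"question": "Is 'great movie' positive?", "answer": "yes"}, rng)
    ],
}
pool = build_prompt_pool(clusters, test_cluster="nli")
```

Excluding the testing cluster from its own pool is what makes the evaluation cross-task: no demonstration of the held-out task type can be retrieved.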

Prompt scoring

We collect positive and negative prompts from a prompt pool to supervise the contrastive learning of the retriever
We categorize tasks into two types: text completion and multiple-choice
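For text completion tasks, we score a prompt by comparing the LLM's prediction under that prompt with the target, using the task's evaluation metric
For multiple-choice tasks, we calculate the per-token likelihood of each option
The option with the highest likelihood is taken as the prediction, and the final score is calculated from it with the same metric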
We design a prompt filtering mechanism to reduce the number of prompts that need to be scored
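
A rough sketch of multiple-choice scoring follows. Several details are assumptions rather than facts from the notes above: per-token likelihood is taken to be the mean token log-probability, the score is 1.0 when the most likely option equals the target, and the best-scoring prompts are labelled positive while the rest are labelled negative. The log-probabilities are made up; in practice they would come from the frozen LLM.

```python
from typing import Dict, List, Tuple


def per_token_likelihood(token_logprobs: List[float]) -> float:
    """Mean log-probability per token (assumed definition)."""
    return sum(token_logprobs) / len(token_logprobs)


def score_prompt(option_logprobs: Dict[str, List[float]], target: str) -> float:
    """Return 1.0 if the most likely option equals the target, else 0.0."""
    prediction = max(
        option_logprobs,
        key=lambda option: per_token_likelihood(option_logprobs[option]),
    )
    return 1.0 if prediction == target else 0.0


def split_prompts(scores: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Label best-scoring prompts positive and the rest negative (assumption)."""
    best = max(scores.values())
    positives = [p for p, s in scores.items() if s == best]
    negatives = [p for p, s in scores.items() if s < best]
    return positives, negatives


# Made-up token log-probabilities for the options "yes" and "no" under two prompts.
logprobs_under_prompt = {
    "prompt_a": {"yes": [-0.2, -0.4], "no": [-1.5]},
    "prompt_b": {"yes": [-2.0, -1.0], "no": [-0.5]},
}
scores = {
    name: score_prompt(options, target="yes")
    for name, options in logprobs_under_prompt.items()
}
positives, negatives = split_prompts(scores)
```

Averaging over tokens keeps options of different lengths comparable, since a raw sum of log-probabilities would tend to favour shorter options.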

Retriever tuning

Data is split into two sets: 90% for training and 10% for validation
Prompt retriever is a bi-encoder model
InfoNCE loss is used to maximize similarity score between encoded prompt and input for positive prompt-input pairs, and minimize it for negative prompt-input pairs
Loss function is defined for positive and negative prompts
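
For a single input with one positive and several negative prompts, an InfoNCE-style loss can be written in a few lines. This sketch assumes inner-product similarity between the encoded input and the encoded prompts and omits any temperature term; the function names and the toy vectors are illustrative only.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(
    input_vec: Sequence[float],
    positive_vec: Sequence[float],
    negative_vecs: List[Sequence[float]],
) -> float:
    """-log( exp(s_pos) / (exp(s_pos) + sum_j exp(s_neg_j)) )."""
    pos = dot(input_vec, positive_vec)
    sims = [pos] + [dot(input_vec, n) for n in negative_vecs]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(sims)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in sims))
    return log_denominator - pos


x = [1.0, 0.0]
good = info_nce_loss(x, positive_vec=[2.0, 0.0], negative_vecs=[[0.0, 1.0], [-1.0, 0.0]])
bad = info_nce_loss(x, positive_vec=[-1.0, 0.0], negative_vecs=[[2.0, 0.0]])
```

The loss is small when the positive prompt has the largest similarity to the input (`good`) and large when a negative prompt is closer (`bad`), which is the behaviour the tuning step relies on.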

Inference

Fine-tuned prompt encoder is used to encode entire prompt pool
Maximum inner-product search is used to retrieve K most similar prompts
Prompts are concatenated with task input
Model predictions are generated and evaluated using corresponding evaluation metric
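
Maximum inner-product search over a small pool can be done by brute force, as in the sketch below. The vectors are toy values standing in for encoder outputs, and `retrieve_top_k` is a hypothetical helper; a large pool would normally use an approximate nearest-neighbour index instead.

```python
from typing import List, Sequence


def inner_product(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(
    query_vec: Sequence[float],
    prompt_vecs: List[Sequence[float]],
    k: int,
) -> List[int]:
    """Brute-force maximum inner-product search; returns indices of the top-k prompts."""
    scores = [inner_product(query_vec, p) for p in prompt_vecs]
    ranked = sorted(range(len(prompt_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy encodings: in practice the prompt encoder encodes the pool once, ahead of time.
prompts = ["prompt 0", "prompt 1", "prompt 2", "prompt 3"]
prompt_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query_vec = [1.0, 0.2]  # encoding of the task input

top = retrieve_top_k(query_vec, prompt_vecs, k=3)
retrieved = [prompts[i] for i in top]
```

Because the pool encodings do not depend on the task input, they can be computed once and reused for every query; only the input encoding and the search run per example.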

Experiment settings

Group tasks into clusters
Randomly sample up to 10k data examples from each task’s training set
Use GPT-Neo-2.7B to tune the retriever
Evaluate performance on larger LLMs from various sources
Set size of randomly sampled subset to 50 and number of negatives to 20
Initialize both encoders of the retriever with BERT-base
Fine-tune for three epochs
Set number of concatenated prompts to 3 during inference
Report metric scores on test set, or validation set if not available
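
The settings listed above could be gathered into a single configuration object. The values come from the list; the class and field names are made up for this sketch and do not correspond to any released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpriseSettings:
    """Experiment settings from the list above; field names are illustrative."""

    max_examples_per_task: int = 10_000   # "up to 10k data examples" per training set
    scoring_llm: str = "GPT-Neo-2.7B"     # frozen LLM used to tune the retriever
    sampled_subset_size: int = 50         # size of the randomly sampled subset
    num_negatives: int = 20               # number of negatives
    encoder_init: str = "BERT-base"       # initialization of both encoders
    epochs: int = 3                       # fine-tuning epochs
    prompts_at_inference: int = 3         # prompts concatenated during inference


settings = UpriseSettings()
```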

Main results

Evaluated the prompt retriever on natural language understanding tasks, where generative LLMs are known to need improvement
Table 1 compares performance of UPRISE to zero-shot prompting

Cross-task prompt retrieval

UPRISE has positive impacts on most of the testing clusters.
UPRISE shows consistent performance improvements across all tasks in closed-book QA and natural language inference.
UPRISE has negative impacts on tasks in commonsense reasoning and coreference resolution clusters.
Alternative techniques such as chain-of-thought prompting may be more effective.

Cross-model prompt retrieval

Evaluated cross-task generalization and cross-model ability
UPRISE improves performance on reading comprehension, closed-book QA, and paraphrase detection tasks
Performance on sentiment analysis is negative with small 2.7B GPT-Neo, but positive with larger models
Consistent gains on natural language inference tasks with models that have not been fine-tuned
Performance drop on text-davinci-001 due to model being fine-tuned
Figure 4 shows consistent performance gains across all LLMs

Hallucination mitigation of ChatGPT

ChatGPT suffers from a significant issue known as hallucination
UPRISE can mitigate the hallucination problem
UPRISE outperforms vanilla zero-shot prompting in two fact-checking tasks
UPRISE successfully induces a precise answer due to retrieved demonstration
UPRISE achieves best results among all universal retrievers

Universal prompt pool

We use training demonstrations to construct a prompt pool.
The prompt pool outperforms raw texts on all testing clusters.
Randomly sampled prompts from the prompt pool can improve performance on four of the five clusters.

Related work

Prompt engineering works include prompt design, prompt tuning, and prompt search.
In-Context Learning (Brown et al., 2020) is a method that helps LLMs transfer to new tasks without gradient updates.
Chain-of-Thought (CoT) prompting (Wei et al., 2022b) provides LLMs with a series of intermediate reasoning steps as demonstrations.
Prompt tuning proposes to learn a prompt represented by continuous parameters rather than discrete natural language tokens (Liu et al., 2021).
Prompt search involves searching for prompts from pre-training corpora or downstream task datasets (Gao et al., 2021; Liu et al., 2022; van de Kar et al., 2022; Ye et al., 2023).

Conclusion

Proposed UPRISE approach to improve zero-shot performance of LLMs on various tasks
Cross-task and cross-model scenario to evaluate universality of retriever
Potential to improve even strongest LLMs
Figures 4 and 5 show the results of the cross-task retriever and a case of chats on the FEVER2.0 dataset
Comparison of average performance on GPT-Neo-2.7B with different prompt pools
