Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- LLMs are popular for their impressive abilities, but require fine-tuning or prompt engineering to generalize.
- UPRISE is a lightweight and versatile retriever that automatically retrieves prompts for a given zero-shot task input.
- UPRISE is universal in a cross-task and cross-model scenario.
- UPRISE mitigates the hallucination problem in experiments with ChatGPT.
Paper Content
Introduction
- Large Language Models (LLMs) have shown impressive capabilities across a range of tasks.
- Two approaches to improve performance: fine-tuning LLMs and developing prompt engineering techniques.
- Fine-tuning LLMs can be limited by computational resources and unavailable model weights.
- Multi-task tuning provides an alternative approach to improve zero-shot task generalization.
- Prompt engineering constructs prompts to guide frozen LLMs.
- UPRISE proposed to improve zero-shot performance of LLMs in cross-task and cross-model scenarios.
- UPRISE can benefit different LLMs of much larger scales.
- UPRISE has potential to improve performance of even strongest LLMs.
Problem definition
- Aim to improve zero-shot performance of LLMs
- Decompose prompting process into two steps: retrieve then predict
- Optimize performance of y P + to match target y
- Prompt retrieval tunes a retriever to retrieve natural language prompts
- Cross-task retrieval retrieves for task types not trained on
- Cross-model retrieval evaluates generalization of small-to-large LLMs
Method
- UPRISE uses a frozen LLM to supervise the fine-tuning of a prompt retriever
- UPRISE uses the trained retriever to retrieve prompts for different task types during inference with different LLMs
Data construction
- Task data is converted into natural language instructions using instruction templates from FLAN.
- For each data example, one of seven templates is randomly selected.
- Option suffices and new line characters are removed from instructions to make the format more similar to pre-training corpus.
- Prompt pool for testing cluster is made of training demonstrations from remaining task clusters.
Prompt scoring
- We collect positive and negative prompts from a prompt pool to supervise the contrastive learning of the retriever
- We categorize tasks into two types: text completion and multiple-choice
- We calculate score of the prompt using an equation
- For multiple choice tasks, we calculate per-token likelihood of each option
- We use an equation to calculate the final score
- We design a prompt filtering mechanism to reduce the number of prompts that need to be scored
Retriever tuning
- Data is split into two sets: 90% for training and 10% for validation
- Prompt retriever is a bi-encoder model
- InfoNCE loss is used to maximize similarity score between encoded prompt and input for positive prompt-input pairs, and minimize it for negative prompt-input pairs
- Loss function is defined for positive and negative prompts
Inference
- Fine-tuned prompt encoder is used to encode entire prompt pool
- Maximum inner-product search is used to retrieve K most similar prompts
- Prompts are concatenated with task input
- Model predictions are generated and evaluated using corresponding evaluation metric
Experiment settings
- Group tasks into clusters
- Randomly sample up to 10k data examples from each task’s training set
- Use GPT-Neo-2.7B to tune the retriever
- Evaluate performance on larger LLMs from various sources
- Set size of randomly sampled subset to 50 and number of negatives to 20
- Initialize both encoders of the retriever with BERT BASE
- Fine-tune for three epochs
- Set number of concatenated prompts to 3 during inference
- Report metric scores on test set, or validation set if not available
Main results
- Evaluated prompt retriever on natural language understanding tasks
- Generative LLMs need improvement
- Table 1 compares performance of UPRISE to zero-shot prompting
Cross-task prompt retrieval
- UPRISE has positive impacts on most of the testing clusters.
- UPRISE shows consistent performance improvements across all tasks in closed-book QA and natural language inference.
- UPRISE has negative impacts on tasks in commonsense reasoning and coreference resolution clusters.
- Alternative techniques such as chain-of-thought prompting may be more effective.
Cross-model prompt retrieval
- Evaluated cross-task generalization and cross-model ability
- UPRISE improves performance on reading comprehension, closed-book QA, and paraphrase detection tasks
- Performance on sentiment analysis is negative with small 2.7B GPT-Neo, but positive with larger models
- Consistent gains on natural language inference tasks with models that have not been fine-tuned
- Performance drop on text-davinci-001 due to model being fine-tuned
- Figure 4 shows consistent performance gains across all LLMs
Hallucination mitigation of chatgpt
- ChatGPT suffers from a significant issue known as hallucination
- UPRISE can mitigate the hallucination problem
- UPRISE outperforms vanilla zero-shot prompting in two fact-checking tasks
- UPRISE successfully induces a precise answer due to retrieved demonstration
- UPRISE achieves best results among all universal retrievers
Universal prompt pool
- We use training demonstrations to construct a prompt pool.
- The prompt pool outperforms raw texts on all testing clusters.
- Randomly sampled prompts from the prompt pool can improve performance on four of the five clusters.
Related work
- Prompt engineering works include prompt design, prompt tuning, and prompt search.
- In-Context Learning (Brown et al., 2020) is a method that helps LLMs transfer to new tasks without gradient updates.
- Chain-of-Thoughts (CoT) (Wei et al., 2022b) provides LLMs with a series of intermediate reasoning steps as demonstrations.
- Prompt tuning proposes to learn a prompt represented by continuous parameters rather than discrete natural language tokens (Liu et al., 2021).
- Prompt search involves searching for prompts from pre-training corpora or downstream task datasets (Gao et al., 2021;Liu et al., 2022;van de Kar et al., 2022;Ye et al., 2023).
Conclusion
- Proposed UPRISE approach to improve zero-shot performance of LLMs on various tasks
- Cross-task and cross-model scenario to evaluate universality of retriever
- Potential to improve even strongest LLMs
- Figures 4 and 5 show results of cross-task retriever and case of chats on FEVER2.0 dataset
- Comparison of average performance on GPT-Neo-2.7B with different prompt pools