Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Fine-tuning large language models is becoming impractical
- Prompt tuning (PT) and in-context learning (ICL) are parameter-efficient adaptation methods
- Instruction prompt tuning (IPT) combines PT and ICL
- How these methods interact with each other is unexplored
- This paper empirically studies when and how in-context examples improve PT
Paper Content
Introduction
- LLMs are becoming too large to fine-tune all parameters for new tasks
- Three methods studied to adapt LLMs to downstream tasks: In-context learning (ICL), Prompt tuning (PT), Instruction prompt tuning (IPT)
- ICL struggles on complex and out-of-domain tasks
- PT generally outperforms ICL, but is unstable and difficult to optimize
- IPT combines ICL and PT and is effective at adapting LLMs to medical domain
- PT and IPT consistently outperform ICL across five text generation tasks
- Performance of PT and IPT depends on task and experimental configuration
- IPT outperforms PT on examples with similar test input
- PT exhibits high variance, IPT reduces variance
- Prompt embeddings learned via PT can be transferred to new tasks with in-context demonstrations
Background
- Parameter-efficient fine-tuning methods specialize LLMs to a target task while adjusting a small number of task-specific parameters.
- In-context learning uses in-context instructions and/or demonstrations to solve unseen tasks.
- Prompt tuning prepends soft tunable prompt embeddings to the input tokens.
- Instruction prompt tuning combines soft prompts with hard in-context demonstrations.
- Instruction prompt tuning tunes just the prompt embeddings rather than the full model.
Experimental setup
- Experiments compare performance of ICL, PT, and IPT across different tasks, configurations, and base language models
- Focus on language generation tasks where input or output is out-of-domain for pretrained LLM
- Three kinds of tasks: data-to-text, logic-to-text, and semantic parsing
- Experiments use BLOOM-1.1B, OPT-1.3B, and GPT-2-XL-1.5B models
- Reparameterization trick used for prompt tuning convergence
- Results show best PT and IPT configurations outperform ICL
- IPT is less sensitive to number of prompt tokens
Analysis
- PT and IPT outperform ICL with randomly retrieved in-context demonstration on all five tasks
- ICL performance can be improved with “good” retrieved in-context demonstrations, but still lags behind PT and IPT
- No clear trend in relative performance of PT and IPT, except on ToTTo where IPT is a clear winner
- In-context demonstration included in IPT is helpful when test input and demonstration are semantically similar
- IPT reduces variance across all tasks, indicating additional in-context example improves stability of prompt tuning
- Soft prompts trained on source task can be used for different target task with improvements over ICL
- ICL performs worse than PT and IPT, even when using retrieved demonstrations
- PT and IPT performance depends on task and number of tunable parameters
- IPT helps when in-context demonstration is similar to test input
- IPT is more stable than PT with more soft prompt tokens
- Prompt embeddings are transferable to new tasks with in-context demonstrations
Conclusion
- In-context demonstrations improve prompt tuning for language generation tasks
- Instruction prompt tuning is more stable than prompt tuning
- In-context demonstrations are more effective when closely resembling the test input
- Soft prompts learned for a source task can be transferred to new target tasks
- Experiments were limited to 1B parameter language models