Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Fine-tuning large language models is becoming impractical
Prompt tuning (PT) and in-context learning (ICL) are parameter-efficient adaptation methods
Instruction prompt tuning (IPT) combines PT and ICL
How these methods interact with each other is unexplored
This paper empirically studies when and how in-context examples improve PT

LLMs are becoming too large to fine-tune all parameters for new tasks
Three methods studied to adapt LLMs to downstream tasks: In-context learning (ICL), Prompt tuning (PT), Instruction prompt tuning (IPT)
ICL struggles on complex and out-of-domain tasks
PT generally outperforms ICL, but is unstable and difficult to optimize
IPT combines ICL and PT and is effective at adapting LLMs to the medical domain
PT and IPT consistently outperform ICL across five text generation tasks
Performance of PT and IPT depends on task and experimental configuration
IPT outperforms PT on examples where the in-context demonstration is similar to the test input
PT exhibits high variance, IPT reduces variance
Prompt embeddings learned via PT can be transferred to new tasks with in-context demonstrations
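
The variance claim above can be made concrete with a small sketch. Stability here is read as the spread of a metric across repeated training runs of the same configuration (for example, different random seeds). The helper names `summarize_runs` and `more_stable` are hypothetical, and the sketch contains no numbers from the paper.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of one metric over repeated runs."""
    return mean(scores), stdev(scores)


def more_stable(scores_a, scores_b):
    """True if method A has a smaller spread across runs than method B."""
    return stdev(scores_a) < stdev(scores_b)
```

Passing the per-seed scores of IPT and PT to `more_stable` would be one way to check the "IPT reduces variance" observation on a given task.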

Parameter-efficient fine-tuning methods specialize LLMs to a target task while adjusting a small number of task-specific parameters.
In-context learning uses in-context instructions and/or demonstrations to solve unseen tasks.
Prompt tuning prepends soft tunable prompt embeddings to the input tokens.
Instruction prompt tuning combines soft prompts with hard in-context demonstrations.
Instruction prompt tuning tunes just the prompt embeddings rather than the full model.
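
A minimal sketch of how the two input layouts differ, assuming the model consumes a sequence of embedding vectors. The names (`make_embedding_table`, `init_soft_prompt`, `build_pt_input`, `build_ipt_input`), the tiny embedding size, and the soft prompt / demonstration / test input ordering are illustrative assumptions, not details taken from the paper.

```python
import random

EMBED_DIM = 4  # hypothetical, kept tiny for readability


def make_embedding_table(vocab, dim, seed=0):
    """Stand-in for the frozen token embedding table of a pretrained LLM."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1, 1) for _ in range(dim)] for tok in vocab}


def init_soft_prompt(num_tokens, dim, seed=1):
    """Soft prompt: the only trainable parameters in PT and IPT."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_tokens)]


def embed(tokens, table):
    """Look up the frozen embedding of each hard (discrete) token."""
    return [table[t] for t in tokens]


def build_pt_input(soft_prompt, input_tokens, table):
    """PT: soft prompt embeddings are prepended to the input embeddings."""
    return soft_prompt + embed(input_tokens, table)


def build_ipt_input(soft_prompt, demo_tokens, input_tokens, table):
    """IPT: soft prompt, then a hard in-context demonstration, then the input."""
    return soft_prompt + embed(demo_tokens, table) + embed(input_tokens, table)
```

In both cases only the vectors returned by `init_soft_prompt` would receive gradient updates; the embedding table and the rest of the model stay frozen.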

Experiments compare performance of ICL, PT, and IPT across different tasks, configurations, and base language models
Focus on language generation tasks where input or output is out-of-domain for pretrained LLM
Three kinds of tasks: data-to-text, logic-to-text, and semantic parsing
Experiments use BLOOM-1.1B, OPT-1.3B, and GPT-2-XL-1.5B models
A reparameterization trick is used to help prompt tuning converge
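
The summary does not say which reparameterization is used. One common form, sketched here purely as an assumption, generates the soft prompt from a smaller trainable matrix passed through a small feed-forward network, instead of optimizing the prompt embeddings directly. All names and dimensions below are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows, cols, rng):
    """Small random initialization for a trainable matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def reparameterize(raw_prompt, w_in, w_out):
    """Compute soft prompt = tanh(raw_prompt @ w_in) @ w_out.

    During training, raw_prompt, w_in and w_out would all be updated;
    the LLM only ever sees the returned soft prompt.
    """
    hidden = [[math.tanh(v) for v in row] for row in matmul(raw_prompt, w_in)]
    return matmul(hidden, w_out)


rng = random.Random(0)
NUM_TOKENS, RAW_DIM, HIDDEN_DIM, EMBED_DIM = 5, 3, 8, 4  # assumed sizes
raw_prompt = random_matrix(NUM_TOKENS, RAW_DIM, rng)
w_in = random_matrix(RAW_DIM, HIDDEN_DIM, rng)
w_out = random_matrix(HIDDEN_DIM, EMBED_DIM, rng)
soft_prompt = reparameterize(raw_prompt, w_in, w_out)
```

After training, the network can be discarded and only the resulting `soft_prompt` kept, so this form adds no cost at inference time.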
Results show best PT and IPT configurations outperform ICL
IPT is less sensitive to number of prompt tokens

PT and IPT outperform ICL with a randomly retrieved in-context demonstration on all five tasks
ICL performance can be improved with “good” retrieved in-context demonstrations, but still lags behind PT and IPT
No clear trend in relative performance of PT and IPT, except on ToTTo where IPT is a clear winner
The in-context demonstration included in IPT is helpful when the test input and the demonstration are semantically similar
IPT reduces variance across all tasks, indicating that the additional in-context example improves the stability of prompt tuning
Soft prompts trained on a source task can be reused for a different target task, with improvements over ICL
ICL performs worse than PT and IPT, even when using retrieved demonstrations
PT and IPT performance depends on task and number of tunable parameters
IPT helps when in-context demonstration is similar to test input
IPT is more stable than PT as the number of soft prompt tokens grows
Prompt embeddings are transferable to new tasks with in-context demonstrations
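
Several of the findings above depend on how similar the demonstration is to the test input. The sketch below shows one simple way to retrieve a "good" demonstration: pick the training pair whose input is closest to the test input. The bag-of-words cosine similarity is a stand-in chosen for illustration; the retriever actually used in the paper is not described in this summary. `retrieve_demo` and the `(input, output)` pair format are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector: lowercase whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_demo(test_input, train_pairs):
    """Return the (input, output) training pair most similar to test_input."""
    query = bow(test_input)
    return max(train_pairs, key=lambda pair: cosine(query, bow(pair[0])))
```

The retrieved pair would then be placed in front of the test input, either alone (ICL) or after the soft prompt (IPT).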

In-context demonstrations improve prompt tuning for language generation tasks
Instruction prompt tuning is more stable than prompt tuning
In-context demonstrations are more effective when closely resembling the test input
Soft prompts learned for a source task can be transferred to new target tasks
Experiments were limited to language models with roughly 1B parameters (1.1B to 1.5B)