
  • Fine-tuning large language models is becoming impractical
  • Prompt tuning (PT) and in-context learning (ICL) are parameter-efficient adaptation methods
  • Instruction prompt tuning (IPT) combines PT and ICL
  • How these methods interact with each other is unexplored
  • This paper empirically studies when and how in-context examples improve PT

Paper Content


  • LLMs are becoming too large to fine-tune all parameters for new tasks
  • Three methods for adapting LLMs to downstream tasks are studied: in-context learning (ICL), prompt tuning (PT), and instruction prompt tuning (IPT)
  • ICL struggles on complex and out-of-domain tasks
  • PT generally outperforms ICL, but is unstable and difficult to optimize
  • IPT combines ICL and PT and is effective at adapting LLMs to the medical domain
  • PT and IPT consistently outperform ICL across five text generation tasks
  • Performance of PT and IPT depends on task and experimental configuration
  • IPT outperforms PT when the in-context demonstration is similar to the test input
  • PT exhibits high variance, IPT reduces variance
  • Prompt embeddings learned via PT can be transferred to new tasks with in-context demonstrations


  • Parameter-efficient fine-tuning methods specialize LLMs to a target task while adjusting a small number of task-specific parameters.
  • In-context learning uses in-context instructions and/or demonstrations to solve unseen tasks.
  • Prompt tuning prepends soft tunable prompt embeddings to the input tokens.
  • Instruction prompt tuning combines soft prompts with hard in-context demonstrations.
  • Instruction prompt tuning tunes just the prompt embeddings rather than the full model.
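
The three methods differ mainly in what is placed in front of the test input. The following is a minimal sketch under stated assumptions, not the authors' code: the toy embedding table, the helper names (`embed`, `init_soft_prompt`, `build_icl_input`, `build_pt_input`, `build_ipt_input`), and the ordering of soft prompt, demonstration, and input are all hypothetical. A real setup would use the frozen LLM's own embedding layer and update only the soft prompt by gradient descent.

```python
import random

EMBED_DIM = 4  # toy size (assumption); real models use far larger embeddings

random.seed(0)
_vocab = {}  # frozen toy embedding table, filled lazily


def embed(tokens):
    """Look up frozen embeddings for hard (discrete) tokens."""
    vectors = []
    for tok in tokens:
        if tok not in _vocab:
            _vocab[tok] = [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
        vectors.append(_vocab[tok])
    return vectors


def init_soft_prompt(num_prompt_tokens):
    """Create tunable vectors that do not correspond to any vocabulary item."""
    return [[random.uniform(-1, 1) for _ in range(EMBED_DIM)]
            for _ in range(num_prompt_tokens)]


def build_icl_input(demo_tokens, input_tokens):
    # ICL: hard demonstration followed by the test input; nothing is trained.
    return embed(demo_tokens) + embed(input_tokens)


def build_pt_input(soft_prompt, input_tokens):
    # PT: soft prompt followed by the test input; only soft_prompt is trained.
    return soft_prompt + embed(input_tokens)


def build_ipt_input(soft_prompt, demo_tokens, input_tokens):
    # IPT: soft prompt, then a hard demonstration, then the test input.
    # As in PT, only soft_prompt would be trained.
    return soft_prompt + embed(demo_tokens) + embed(input_tokens)
```

In all three cases the model weights stay frozen; PT and IPT add `num_prompt_tokens` trainable vectors, and IPT additionally spends context length on the demonstration.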

Experimental setup

  • Experiments compare performance of ICL, PT, and IPT across different tasks, configurations, and base language models
  • Focus on language generation tasks where input or output is out-of-domain for pretrained LLM
  • Three kinds of tasks: data-to-text, logic-to-text, and semantic parsing
  • Experiments use BLOOM-1.1B, OPT-1.3B, and GPT-2-XL-1.5B models
  • A reparameterization trick is used to help prompt tuning converge
  • Results show best PT and IPT configurations outperform ICL
  • IPT is less sensitive to number of prompt tokens
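
To make the "number of prompt tokens" knob concrete, the tunable parameter count of a soft prompt is the number of prompt tokens times the embedding width. The sketch below is only illustrative: the hidden size of 2048, the total of roughly 1.3B model parameters, and the prompt lengths 25, 50, and 100 are assumptions chosen for the arithmetic, not settings reported in this summary. It counts only the prompt embeddings and ignores any extra reparameterization network used during training.

```python
def soft_prompt_params(num_prompt_tokens, hidden_size):
    """Number of tunable values in a soft prompt of the given length."""
    return num_prompt_tokens * hidden_size


def tunable_fraction(num_prompt_tokens, hidden_size, total_model_params):
    """Share of the full model that the soft prompt represents."""
    return soft_prompt_params(num_prompt_tokens, hidden_size) / total_model_params


HIDDEN_SIZE = 2048            # assumed embedding width
TOTAL_PARAMS = 1_300_000_000  # assumed model size, roughly 1.3B

for n in (25, 50, 100):       # hypothetical prompt lengths
    count = soft_prompt_params(n, HIDDEN_SIZE)
    share = tunable_fraction(n, HIDDEN_SIZE, TOTAL_PARAMS)
    print(f"{n} prompt tokens: {count} tunable parameters ({share:.6%} of the model)")
```

Even the longest prompt in this illustration stays far below one percent of the model, which is why both PT and IPT count as parameter-efficient.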


  • PT and IPT outperform ICL with a randomly retrieved in-context demonstration on all five tasks
  • ICL performance can be improved with “good” retrieved in-context demonstrations, but still lags behind PT and IPT
  • No clear trend in relative performance of PT and IPT, except on ToTTo where IPT is a clear winner
  • In-context demonstration included in IPT is helpful when test input and demonstration are semantically similar
  • IPT reduces variance across all tasks, indicating additional in-context example improves stability of prompt tuning
  • Soft prompts trained on source task can be used for different target task with improvements over ICL
  • ICL performs worse than PT and IPT, even when using retrieved demonstrations
  • PT and IPT performance depends on task and number of tunable parameters
  • IPT helps when in-context demonstration is similar to test input
  • IPT is more stable than PT as the number of soft prompt tokens grows
  • Prompt embeddings are transferable to new tasks with in-context demonstrations
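
Several of the findings above hinge on picking a demonstration that resembles the test input. One simple way to do that can be sketched as follows, assuming a bag-of-words cosine similarity as a stand-in for whatever retriever the paper actually uses. The function names (`bow`, `cosine`, `retrieve_demonstration`) and the toy data-to-text pairs are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector: lowercase whitespace tokens mapped to counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_demonstration(test_input, train_pairs):
    """Return the (input, output) pair whose input is most similar to test_input."""
    query = bow(test_input)
    return max(train_pairs, key=lambda pair: cosine(query, bow(pair[0])))


# Toy training pairs, made up for illustration only.
train_pairs = [
    ("name: Blue Spice | food: Italian", "Blue Spice serves Italian food."),
    ("name: The Mill | area: riverside", "The Mill is located by the riverside."),
]

demo_input, demo_output = retrieve_demonstration(
    "name: Green Man | food: Italian", train_pairs
)
```

The retrieved pair would then be placed before the test input as the in-context demonstration, either on its own (ICL) or after the soft prompt (IPT). A random baseline corresponds to replacing `max(...)` with a random choice from `train_pairs`.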


  • In-context demonstrations improve prompt tuning for language generation tasks
  • Instruction prompt tuning is more stable than prompt tuning
  • In-context demonstrations are more effective when closely resembling the test input
  • Soft prompts learned for a source task can be transferred to new target tasks
  • Experiments were limited to language models with roughly 1B parameters (1.1B to 1.5B)