
  • Retrieval-augmented in-context learning is a powerful approach for knowledge-intensive tasks.
  • Existing work combines language models and retrieval models in a “retrieve-then-read” pipeline.
  • Demonstrate-Search-Predict (DSP) is a framework that passes natural language texts between an LM and an RM.
  • DSP can express high-level programs that break down problems into small transformations.
  • Novel DSP programs have been written for answering questions in open-domain, multi-hop, and conversational settings.
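  • Early evaluations show new state-of-the-art in-context learning results, with relative gains of 37-200%, 8-40%, and 80-290% over vanilla LMs, a standard retrieve-then-read pipeline, and a contemporaneous self-ask pipeline, respectively.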

Paper Content


  • In-context learning adapts a frozen language model to tasks by conditioning it on a textual prompt
  • Retrieval models are used to augment prompts with relevant information from a large corpus
  • On its own, the language model often makes false assertions
  • Retrieve-then-read pipelines fail when simple search can’t find an answer
  • DSP framework relies entirely on passing natural language text between a frozen RM and LM
  • DSP introduces a number of composable functions to bootstrap training examples, gather information, and generate grounded outputs
  • DSP suggests powerful strategies for knowledge-intensive tasks
  • DSP programs set new state-of-the-art results
  • DSP programs implement transformations like weak supervision, rewriting questions, and generating grounded responses
  • DSP programs deliver 37-200%, 8-40%, and 80-290% relative gains over vanilla LMs, a retrieve-then-read pipeline, and a self-ask pipeline, respectively

Datatypes and control flow

  • Implemented DSP framework in Python
  • Core data types and composable functions provided by the framework
  • Examples contain arbitrary keys and values
  • Human-labeled training data do not include labels for intermediate transformations
  • Program invokes and composes DSP primitives
  • Transformations take an Example as input and return an Example
  • DEMONSTRATE stage bootstraps annotations for intermediate transformations
  • SEARCH stage gathers passages to support LM transformations
  • PREDICT stage uses LM to generate output
  • Sample primitive randomly samples from training set
  • Knn primitive finds k nearest neighbors to input text
  • Crossval primitive selects among sampled sets of demonstrations
  • Generate primitive invokes LM to produce query for each retrieval hop
  • Prompt templates are designed to generate a summary of the context and a query that gathers the information needed to answer a complex question
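
The three stages above can be sketched as plain Python functions that each take an Example and return an Example. This is only an illustrative sketch under stated assumptions, not the DSP library's API: the dict-based `Example` class, the `toy_rm` and `toy_lm` stand-ins, the prompt layout, and the field names (`question`, `demos`, `context`, `answer`) are all hypothetical.

```python
import random
import re

class Example(dict):
    """Hypothetical stand-in for an Example: arbitrary keys and values."""
    def copy_with(self, **updates):
        new = Example(self)
        new.update(updates)
        return new

def tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def toy_rm(query, corpus, k=1):
    """Toy retrieval model: rank passages by word overlap with the query."""
    ranked = sorted(corpus, key=lambda p: len(tokens(query) & tokens(p)), reverse=True)
    return ranked[:k]

def toy_lm(prompt):
    """Toy LM (no real model): echoes the last word of the Context line."""
    for line in prompt.splitlines():
        if line.startswith("Context:"):
            return line.split()[-1].strip(".")
    return ""

def demonstrate(example, train, k=2, seed=0):
    """DEMONSTRATE: attach k demonstrations sampled from the training set."""
    return example.copy_with(demos=random.Random(seed).sample(train, k))

def search(example, corpus, k=1):
    """SEARCH: gather passages that support later LM transformations."""
    return example.copy_with(context=toy_rm(example["question"], corpus, k))

def predict(example):
    """PREDICT: build a prompt from demos and context, then call the LM."""
    demo_text = "\n".join(f"Q: {d['question']}\nA: {d['answer']}" for d in example["demos"])
    prompt = (f"{demo_text}\nContext: {' '.join(example['context'])}\n"
              f"Q: {example['question']}\nA:")
    return example.copy_with(answer=toy_lm(prompt))

def program(example, train, corpus):
    """Compose the three stages; each step maps an Example to an Example."""
    return predict(search(demonstrate(example, train), corpus))

train = [Example(question="Capital of France?", answer="Paris"),
         Example(question="Capital of Italy?", answer="Rome"),
         Example(question="Capital of Spain?", answer="Madrid")]
corpus = ["The capital of Peru is Lima.", "The capital of Japan is Tokyo."]
out = program(Example(question="What is the capital of Japan?"), train, corpus)
print(out["answer"])
```

Because every stage returns a new Example rather than mutating its input, stages can be reordered or repeated (for instance, several SEARCH hops) without changing the other functions.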


  • Three knowledge-intensive NLP tasks are considered: open-domain QA, multi-hop QA, and conversational QA.
  • Systems are given a short question, or take part in a multi-turn conversation, without access to context that answers the question.
  • Intuitive compositions of functions are built and evaluated for each task.
  • Low development effort results in strong quality and empirical gains over vanilla in-context learning.

Evaluation methodology

  • Considered one development dataset for each task
  • Used SQuAD, HotPotQA, and QReCC datasets
  • Reported validation set accuracy
  • Systems given access to 16-shot training examples
  • Subsampled validation and test sets to 1000 questions
  • Dedicated held-out test datasets and tasks for evaluating pre-defined DSP programs
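
A minimal sketch of the evaluation steps listed above: subsample an evaluation set to a fixed size, then score predictions by exact match. The answer normalization (lowercasing, dropping punctuation and articles) follows the common SQuAD-style convention and is an assumption here, as are the helper names and the fixed seed.

```python
import random
import re
import string

def subsample(examples, n=1000, seed=0):
    """Return at most n examples, drawn without replacement with a fixed seed."""
    if len(examples) <= n:
        return list(examples)
    return random.Random(seed).sample(examples, n)

def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """True if the prediction equals any gold answer after normalization."""
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

def em_score(predictions, references):
    """Percentage of predictions that exactly match a gold answer."""
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)

dev = subsample(list(range(5000)))          # stand-in for a validation set
preds = ["The Eiffel Tower", "paris", "Rome"]
refs = [["Eiffel Tower"], ["Paris, France", "Paris"], ["Milan"]]
em = em_score(preds, refs)
```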

Pretrained modules

  • Use ColBERTv2 for its zero-shot search quality and efficient search
  • DSP allows for changing or updating the search index over time
  • Use GPT-3.5 language model with greedy decoding when generating one prediction and sampling with temperature when generating more than one prediction
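
The decoding rule above can be written as a small helper. This is a hedged sketch: `decoding_params` and the keys it returns are hypothetical and not tied to any particular LM client, and the sampling temperature is left as a caller-supplied argument because its value is not stated here.

```python
def decoding_params(n_predictions, sampling_temperature):
    """Pick decoding settings from the number of predictions requested."""
    if n_predictions < 1:
        raise ValueError("n_predictions must be at least 1")
    if n_predictions == 1:
        # Greedy decoding, commonly requested by setting temperature to 0.
        return {"n": 1, "temperature": 0.0}
    # Several predictions: sample with temperature so outputs can differ.
    return {"n": n_predictions, "temperature": sampling_temperature}

single = decoding_params(1, sampling_temperature=0.7)   # 0.7 is an example value
many = decoding_params(20, sampling_temperature=0.7)
```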


  • Vanilla LM randomly samples 16 demonstrations from the training set
  • Retrieve-then-Read uses a retrieval model to support each example with a potentially relevant passage
  • For conversational QA, retrieve-then-read concatenates the first turn and the final question to form its query
  • For multi-hop QA, retrieve-then-read retrieves and concatenates two passages per question
  • Self-ask uses ColBERTv2-style passages in the handcrafted demonstrations
  • Self-ask concatenates 16-shot training examples from the task as a prefix of the prompt
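
The vanilla LM and retrieve-then-read baselines differ mainly in how the prompt is assembled, which can be sketched as follows. The prompt layout, the function names, and the `retrieve` callable are hypothetical; the sketch only illustrates random sampling of demonstrations, the optional supporting passage per example, and the first-turn-plus-final-question query for conversations.

```python
import random

def format_example(question, answer="", passage=None):
    """Render one example; the passage line is added only when given."""
    lines = []
    if passage is not None:
        lines.append(f"Context: {passage}")
    lines.append(f"Question: {question}")
    lines.append(f"Answer: {answer}".rstrip())
    return "\n".join(lines)

def build_prompt(question, train, shots=16, seed=0, retrieve=None):
    """Vanilla LM when retrieve is None; retrieve-then-read otherwise."""
    demos = random.Random(seed).sample(train, min(shots, len(train)))
    lookup = (lambda q: None) if retrieve is None else retrieve
    blocks = [format_example(d["question"], d["answer"], lookup(d["question"]))
              for d in demos]
    blocks.append(format_example(question, passage=lookup(question)))
    return "\n\n".join(blocks)

def conversational_query(turns):
    """Concatenate the first turn and the final question into one query."""
    if len(turns) == 1:
        return turns[0]
    return f"{turns[0]} {turns[-1]}"

train = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(40)]
vanilla = build_prompt("Who wrote Hamlet?", train)
rtr = build_prompt("Who wrote Hamlet?", train,
                   retrieve=lambda q: f"passage for {q}")  # stub retriever
query = conversational_query(["Tell me about Hamlet.", "Is it long?", "Who wrote it?"])
```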

Proposed DSP programs

  • Builds on transformations presented in §2
  • Programs for all three tasks have same structure
  • Greedy decoding used by default
  • SEARCH uses question to retrieve 7 passages
  • PREDICT generates 20 reasoning chains and uses self-consistency
  • DEMONSTRATE uses 3 annotations, k=3, n=10 queries per hop
  • Table 1 compares task-aware DSP program against baselines and other approaches
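
Self-consistency in the PREDICT stage amounts to a majority vote over the final answers of the sampled reasoning chains. A minimal sketch, assuming each chain ends with text of the hypothetical form `Answer: ...`; the parsing convention and the function names are illustrative only.

```python
from collections import Counter

def extract_answer(chain):
    """Return the text after the last 'Answer:' marker, or None if absent."""
    marker = "Answer:"
    idx = chain.rfind(marker)
    if idx == -1:
        return None
    return chain[idx + len(marker):].strip()

def self_consistency(chains):
    """Return the most frequent final answer across the reasoning chains."""
    answers = [a for a in map(extract_answer, chains) if a]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

chains = [
    "The Louvre is in Paris. Answer: Paris",
    "It might be in Lyon. Answer: Lyon",
    "The museum sits on the Seine. Answer: Paris",
    "No conclusion reached.",
]
best = self_consistency(chains)
```

Chains without a parsable answer are simply dropped from the vote, which is one possible design choice rather than a documented behavior.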

Development datasets & results

  • Conducted open-domain version of SQuAD over Wikipedia 2016 corpus
  • Used same train/validation/test splits as Karpukhin et al. (2020) and Khattab et al. (2021b)
  • Task-aware DSP program achieved 36.6% EM, a 126% relative EM gain over the vanilla LM baseline
  • It also delivered relative gains of 8% EM and 6% F1 over the retrieve-then-read pipeline
  • Self-ask pipeline achieved 9.3% EM
  • Si et al. (2022) achieved 20.2% EM without retrieval and 34.0% EM with retrieval
  • Used open-domain “fullwiki” setting of HotPotQA with official Wikipedia 2017 “abstracts” corpus
  • Task-aware DSP program exceeded the vanilla LM, the retrieve-then-read baseline, and the self-ask pipeline by 82%, 39%, and 80% relative EM, respectively
  • Si et al. (2022) achieved 25.2% EM with CoT prompting
  • Sun et al. (2022) achieved 26.5% EM with “recite-and-answer” technique
  • Wang et al. (2022b) achieved 33.8% EM and 44.6 F1 with self-consistency prompt
  • Yao et al. (2022) achieved 35.1% EM using system capable of searching using Wikipedia API
  • Task-aware DSP program achieved 51.4% EM on HotPotQA
  • Used QReCC in open-domain setting over Wikipedia 2018
  • Reported novel-F1 metric (nF1)
  • Task-aware DSP program outperformed baselines and existing work by large margins
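
The relative gains quoted in this section follow the usual definition, (new - baseline) / baseline. As a worked check using only the Open-SQuAD figures listed above, 36.6% EM against the self-ask score of 9.3% EM gives a relative gain of roughly 294%, close to the upper end of the 80-290% range, and a 126% relative gain implies a vanilla LM baseline near 16.2% EM. That implied baseline is derived here from the listed numbers rather than quoted, and the helper names are hypothetical.

```python
def relative_gain(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

def implied_baseline(new, gain_percent):
    """Baseline score implied by a score and its relative gain in percent."""
    return new / (1.0 + gain_percent / 100.0)

gain_vs_self_ask = relative_gain(36.6, 9.3)      # DSP vs. self-ask, Open-SQuAD EM
vanilla_estimate = implied_baseline(36.6, 126)   # derived, not a reported figure
print(round(gain_vs_self_ask, 1), round(vanilla_estimate, 1))
```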


  • Dominant paradigm for building models in AI is multiplication of tensor representations
  • Deep learning era has given rise to layer-wise designs for fast development and exploration
  • Building complex systems from pretrained components requires domain expertise
  • In-context learning allows components to communicate with each other using natural language
  • Natural-language communication between components could broaden participation in AI system development
  • It could allow systems for new domains to be prototyped rapidly
  • It could maximize the value of specialized pretrained components
  • DEMONSTRATE-SEARCH-PREDICT (DSP) framework for retrieval-augmented in-context learning
  • DSP consists of simple, composable functions for implementing in-context learning systems
  • Implemented DSP as a Python library
  • Used DSP to write programs for Open-SQuAD, HotPotQA, and QReCC
  • Programs deliver substantial gains over previous in-context learning approaches
  • Reveal a large space of conceptual possibilities for in-context learning