Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Retrieval-augmented in-context learning is a powerful approach for knowledge-intensive tasks.
Existing work combines language models and retrieval models in a “retrieve-then-read” pipeline.
Demonstrate-Search-Predict (DSP) is a framework that passes natural language texts between an LM and an RM.
DSP can express high-level programs that break down problems into small transformations.
Novel DSP programs have been written for answering questions in open-domain, multi-hop, and conversational settings.
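Early evaluations show new state-of-the-art in-context learning results, with 37-200%, 8-40%, and 80-290% relative gains over the vanilla LM, a standard retrieve-then-read pipeline, and a contemporaneous self-ask pipeline, respectively.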

In-context learning adapts a frozen language model to tasks by conditioning it on a textual prompt
Retrieval models are used to augment prompts with relevant information from a large corpus
On its own, the language model often makes false assertions
Retrieve-then-read pipelines fail when simple search can’t find an answer
DSP framework relies entirely on passing natural language text between a frozen RM and LM
DSP introduces a number of composable functions to bootstrap training examples, gather information, and generate grounded outputs
DSP suggests powerful strategies for knowledge-intensive tasks
DSP programs set new state-of-the-art results
DSP programs implement transformations like weak supervision, rewriting questions, and generating grounded responses
DSP programs deliver 37-200%, 8-40%, and 80-290% relative gains over the vanilla LM, a retrieve-then-read pipeline, and a self-ask pipeline, respectively
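
To make the retrieve-then-read baseline concrete, here is a minimal sketch. It assumes a toy word-overlap retriever in place of a real retrieval model and takes the language model as a plain callable; `CORPUS`, `toy_retrieve`, `build_prompt`, and `retrieve_then_read` are illustrative names, not part of any library. It shows the single retrieval step followed by a single LM call, which is the pattern the notes above say breaks down when one simple search cannot find the answer.

```python
from collections import Counter

# Toy corpus standing in for a large passage collection (illustrative only).
CORPUS = [
    "ColBERTv2 is a late-interaction retrieval model.",
    "HotPotQA is a multi-hop question answering dataset.",
    "SQuAD is a reading comprehension dataset built from Wikipedia.",
]


def tokenize(text):
    """Lowercase whitespace tokens with trailing punctuation stripped."""
    return [tok.strip(".,?!").lower() for tok in text.split()]


def toy_retrieve(query, k=1):
    """Rank passages by word overlap with the query (stand-in for a real RM)."""
    query_counts = Counter(tokenize(query))
    scored = []
    for passage in CORPUS:
        overlap = sum((query_counts & Counter(tokenize(passage))).values())
        scored.append((overlap, passage))
    # Stable sort: ties keep corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:k]]


def build_prompt(question, passages):
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"Context: {p}" for p in passages)
    return f"{context}\nQuestion: {question}\nAnswer:"


def retrieve_then_read(question, lm, k=1):
    """One search, one LM call: the whole pipeline."""
    passages = toy_retrieve(question, k=k)
    return lm(build_prompt(question, passages))
```

Passing an echo function such as `lambda prompt: prompt` as `lm` returns the assembled prompt, which is a convenient way to inspect what a real model would see.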

Implemented DSP framework in Python
Core data types and composable functions provided by the framework
Examples contain arbitrary keys and values
Human-labeled training data do not include labels for intermediate transformations
Program invokes and composes DSP primitives
Transformations take an Example as input and return an Example
DEMONSTRATE stage bootstraps annotations for intermediate transformations
SEARCH stage gathers passages to support LM transformations
PREDICT stage uses LM to generate output
Sample primitive randomly samples from training set
Knn primitive finds k nearest neighbors to input text
Crossval primitive selects among sampled sets of demonstrations
Generate primitive invokes LM to produce query for each retrieval hop
Prompt templates are designed to have the LM summarize the context gathered so far and generate a query that collects the information needed to answer a complex question
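
The stages and primitives listed above can be sketched roughly as follows. This is not the DSP library's API: it assumes an Example is a plain dict, approximates nearest-neighbor search with word overlap, and takes the LM and RM as stub callables. All function names and signatures here are hypothetical, and the crossval primitive is left out for brevity.

```python
import random

# Hypothetical training set: each Example is a dict with arbitrary keys.
train = [
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who painted the Mona Lisa?", "answer": "Leonardo da Vinci"},
]


def sample(train, k, seed=0):
    """Randomly sample k demonstrations from the training set."""
    return random.Random(seed).sample(train, k)


def knn(train, text, k):
    """Return the k training examples sharing the most words with `text`.

    Word overlap is a stand-in for a real similarity measure.
    """
    words = set(text.lower().split())

    def overlap(example):
        return len(words & set(example["question"].lower().split()))

    return sorted(train, key=overlap, reverse=True)[:k]


def demonstrate(example, train, k=2):
    """DEMONSTRATE: attach demonstrations to the example."""
    return {**example, "demos": knn(train, example["question"], k)}


def search(example, lm, retrieve, hops=2):
    """SEARCH: the LM writes one query per hop; passages accumulate."""
    context = []
    for _ in range(hops):
        query = lm(example["question"], context)
        context.extend(retrieve(query))
    return {**example, "context": context}


def predict(example, lm):
    """PREDICT: the LM produces the final output from question and context."""
    return {**example, "answer": lm(example["question"], example["context"])}
```

Each transformation takes an Example and returns a new Example, so a program is a chain of calls such as `predict(search(demonstrate(x, train), lm, retrieve), lm)`, with `lm` and `retrieve` supplied by the caller.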

Three knowledge-intensive NLP tasks are considered: open-domain QA, multi-hop QA, and conversational QA.
In each task, systems are given a short question or take part in a multi-turn conversation, without any accompanying context passages.
Intuitive compositions of functions are built and evaluated for each task.
Despite low development effort, the resulting programs show strong quality and deliver empirical gains over vanilla in-context learning.

Considered one development dataset for each task
Used SQuAD, HotPotQA, and QReCC datasets
Reported validation set accuracy
Systems given access to 16-shot training examples
Subsampled validation and test sets to 1000 questions
Dedicated held-out test datasets and tasks for evaluating pre-defined DSP programs
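
A reproducible way to subsample an evaluation set to 1000 questions might look like the sketch below. The fixed seed and the `subsample` helper are assumptions made for illustration; the notes above do not say how the subsampling was done.

```python
import random


def subsample(examples, n=1000, seed=0):
    """Return at most n examples, drawn without replacement using a fixed seed."""
    if len(examples) <= n:
        return list(examples)
    return random.Random(seed).sample(examples, n)
```

Fixing the seed means every system is scored on the same subset, which keeps comparisons between programs fair.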

Use ColBERTv2 for its zero-shot search quality and efficient search
DSP allows for changing or updating the search index over time
Use GPT-3.5 language model with greedy decoding when generating one prediction and sampling with temperature when generating more than one prediction
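
The decoding rule above can be written as a small helper. This is a sketch only: the `decoding_settings` function, its dictionary keys, and the default sampling temperature of 0.7 are assumptions, not values stated in these notes.

```python
def decoding_settings(n_predictions, sampling_temperature=0.7):
    """Greedy decoding for a single prediction, temperature sampling otherwise.

    The 0.7 default is an assumed placeholder; set it to match your setup.
    """
    if n_predictions < 1:
        raise ValueError("n_predictions must be at least 1")
    if n_predictions == 1:
        # Temperature 0 corresponds to greedy decoding.
        return {"n": 1, "temperature": 0.0}
    return {"n": n_predictions, "temperature": sampling_temperature}
```

Sampling only matters when several predictions are requested, since greedy decoding would return the same completion each time.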

Vanilla LM randomly samples 16 demonstrations from the training set
Retrieve-then-Read uses a retrieval model to support each example with a potentially relevant passage
Conversational QA concatenates the first turn and the final question
Multi-hop QA retrieves and concatenates two passages per question
Self-ask uses ColBERTv2-style passages in the handcrafted demonstrations
Self-ask concatenates 16-shot training examples from the task as a prefix of the prompt
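
Two of the baseline behaviors above are simple enough to sketch directly: the vanilla LM prompt built from 16 randomly sampled demonstrations, and the conversational QA search query formed from the first turn plus the final question. The prompt layout, the seed, and both function names are illustrative assumptions rather than the exact templates used in the paper.

```python
import random


def vanilla_prompt(train, question, shots=16, seed=0):
    """Few-shot prompt: randomly sampled demonstrations, then the new question."""
    rng = random.Random(seed)
    demos = rng.sample(train, min(shots, len(train)))
    blocks = [f"Question: {d['question']}\nAnswer: {d['answer']}" for d in demos]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


def conversational_query(turns):
    """Search query for conversational QA: first turn plus final question."""
    if not turns:
        raise ValueError("turns must not be empty")
    if len(turns) == 1:
        return turns[0]
    return f"{turns[0]} {turns[-1]}"
```

For a three-turn conversation, `conversational_query` drops the middle turn, which shows why this baseline can miss information that only appears mid-conversation.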

Conducted open-domain version of SQuAD over Wikipedia 2016 corpus
Used same train/validation/test splits as Karpukhin et al. (2020) and Khattab et al. (2021b)
Task-aware DSP program achieved 36.6% EM, a 126% relative EM gain over the vanilla LM baseline
It also delivered 8% EM and 6% F1 relative gains over the retrieve-then-read pipeline
Self-ask pipeline achieved 9.3% EM
Si et al. (2022) achieved 20.2% EM without retrieval and 34.0% EM with retrieval
Used open-domain “fullwiki” setting of HotPotQA with official Wikipedia 2017 “abstracts” corpus
Task-aware DSP program outperformed the vanilla LM, the retrieve-then-read baseline, and the self-ask pipeline by 82%, 39%, and 80%, respectively, in EM
Si et al. (2022) achieved 25.2% EM with CoT prompting
Sun et al. (2022) achieved 26.5% EM with “recite-and-answer” technique
Wang et al. (2022b) achieved 33.8% EM and 44.6 F1 with self-consistency prompt
Yao et al. (2022) achieved 35.1% EM using a system that can search via a Wikipedia API
Task-aware DSP program achieved 51.4% EM on HotPotQA
Used QReCC in open-domain setting over Wikipedia 2018
Reported novel-F1 metric (nF1)
Task-aware DSP program outperformed baselines and existing work by large margins
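
Because the results above mix absolute scores with relative gains, a short worked calculation helps in reading them. The sketch assumes the usual definition of relative gain, (new - baseline) / baseline; the helper names are illustrative, and the implied baseline is back-computed from rounded figures, so treat it as approximate.

```python
def relative_gain(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


def implied_baseline(new, gain_percent):
    """Baseline score implied by a score and its reported relative gain."""
    return new / (1.0 + gain_percent / 100.0)


# Si et al. (2022): 20.2% EM without retrieval vs. 34.0% EM with retrieval.
print(round(relative_gain(34.0, 20.2), 1))

# DSP on Open-SQuAD: 36.6% EM reported as a 126% relative gain.
print(round(implied_baseline(36.6, 126), 1))
```

The first value is about 68.3, and the second is about 16.2, meaning the vanilla LM baseline on Open-SQuAD sits near 16% EM under this reading of the numbers.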

The dominant paradigm for building models in AI has long centered on multiplication of tensor representations
Deep learning era has given rise to layer-wise designs for fast development and exploration
Building complex systems from pretrained components requires domain expertise
In-context learning allows components to communicate with each other using natural language
Broaden participation in AI system development
Rapidly prototype systems for new domains
Maximize value of specialized pretrained components
Introduced the DEMONSTRATE-SEARCH-PREDICT (DSP) framework for retrieval-augmented in-context learning
DSP consists of simple, composable functions for implementing in-context learning systems
Implemented DSP as a Python library
Used DSP to write programs for Open-SQuAD, HotPotQA, and QReCC
Programs deliver substantial gains over previous in-context learning approaches
Reveal a large space of conceptual possibilities for in-context learning