Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Retrieval-augmented in-context learning is a powerful approach for knowledge-intensive tasks.
- Existing work combines language models and retrieval models in a “retrieve-then-read” pipeline.
- Demonstrate-Search-Predict (DSP) is a framework that passes natural language texts between an LM and an RM.
- DSP can express high-level programs that break down problems into small transformations.
- Novel DSP programs have been written for answering questions in open-domain, multi-hop, and conversational settings.
- Early evaluations show new state-of-the-art in-context learning results with 37-200%, 8-40%, and 80-290% relative gains.
Paper Content
Introduction
- In-context learning adapts a frozen language model to tasks by conditioning it on a textual prompt
- Retrieval models are used to augment prompts with relevant information from a large corpus
- On its own, the language model often makes false assertions
- Retrieve-then-read pipelines fail when simple search can’t find an answer
- DSP framework relies entirely on passing natural language text between a frozen RM and LM
- DSP introduces a number of composable functions to bootstrap training examples, gather information, and generate grounded outputs
- DSP suggests powerful strategies for knowledge-intensive tasks
- DSP programs set new state-of-the-art results
- DSP programs implement transformations like weak supervision, rewriting questions, and generating grounded responses
- DSP programs deliver 37-200%, 8-40%, and 80-290% relative gains against corresponding vanilla LMs
Datatypes and control flow
- Implemented DSP framework in Python
- Core data types and composable functions provided by the framework
- Examples contain arbitrary keys and values
- Human-labeled training data do not include labels for intermediate transformations
- Program invokes and composes DSP primitives
- Transformations take an Example as input and return an Example
- DEMONSTRATE stage bootstraps annotations for intermediate transformations
- SEARCH stage gathers passages to support LM transformations
- PREDICT stage uses LM to generate output
- Sample primitive randomly samples from training set
- Knn primitive finds k nearest neighbors to input text
- Crossval primitive selects among sampled sets of demonstrations
- Generate primitive invokes LM to produce query for each retrieval hop
- Prompt templates designed to generate summary of context and query to gather information for answering complex question
Evaluation
- Three knowledge-intensive NLP tasks are considered: open-domain QA, multi-hop QA, and conversational QA.
- Systems are given a short question or participate in a multi-turn conversation without context.
- Intuitive compositions of functions are built and evaluated for each task.
- Low development effort results in strong quality and empirical gains over vanilla in-context learning.
Evaluation methodology
- Considered one development dataset for each task
- Used SQuAD, HotPotQA, and QReCC datasets
- Reported validation set accuracy
- Systems given access to 16 shot training examples
- Subsampled validation and test sets to 1000 questions
- Dedicated held-out test datasets and tasks for evaluating pre-defined DSP programs
Pretrained modules
- Use ColBERTv2 for its zero-shot search quality and efficient search
- DSP allows for changing or updating the search index over time
- Use GPT-3.5 language model with greedy decoding when generating one prediction and sampling with temperature when generating more than one prediction
Baselines
- Vanilla LM randomly samples 16 demonstrations from the training set
- Retrieve-then-Read uses a retrieval model to support each example with a potentially relevant passage
- Conversational QA concatenates the first turn and the final question
- Multi-hop QA retrieves and concatenates two passages per question
- Self-ask uses ColBERTv2-style passages in the handcrafted demonstrations
- Self-ask concatenates 16-shot training examples from the task as a prefix of the prompt
Proposed dsp programs
- Builds on transformations presented in §2
- Programs for all three tasks have same structure
- Greedy decoding used by default
- SEARCH uses question to retrieve 7 passages
- PREDICT generates 20 reasoning chains and uses self-consistency
- DEMONSTRATE uses 3 annotations, k=3, n=10 queries per hop
- Table 1 compares task-aware DSP program against baselines and other approaches
Development datasets & results
- Conducted open-domain version of SQuAD over Wikipedia 2016 corpus
- Used same train/validation/test splits as Karpukhin et al. (2020) and Khattab et al. (2021b)
- Task-aware DSP program achieved 36.6% EM, outperforming vanilla LM baseline by 126% EM relative gains
- 8% EM and 6% F1 relative gains over retrieve-then-read pipeline
- Self-ask pipeline achieved 9.3% EM
- Si et al. (2022) achieved 20.2% EM without retrieval and 34.0% EM with retrieval
- Used open-domain “fullwiki” setting of HotPotQA with official Wikipedia 2017 “abstracts” corpus
- Task-aware DSP program outperformed baselines and existing work by 82%, 39%, and 80%, respectively, in EM
- Si et al. (2022) achieved 25.2% EM with CoT prompting
- Sun et al. (2022) achieved 26.5% EM with “recite-and-answer” technique
- Wang et al. (2022b) achieved 33.8% EM and 44.6 F1 with self-consistency prompt
- Yao et al. (2022) achieved 35.1% EM using system capable of searching using Wikipedia API
- Task-aware DSP program achieved 51.4% EM on HotPotQA
- Used QReCC in open-domain setting over Wikipedia 2018
- Reported novel-F1 metric (nF1)
- Task-aware DSP program outperformed baselines and existing work by large margins
Conclusion
- Dominant paradigm for building models in AI is multiplication of tensor representations
- Deep learning era has given rise to layer-wise designs for fast development and exploration
- Building complex systems from pretrained components requires domain expertise
- In-context learning allows components to communicate with each other using natural language
- Broaden participation in AI system development
- Rapidly prototype systems for new domains
- Maximize value of specialized pretrained components
- DEMONSTRATE-SEARCH-PREDICT (DSP) framework for retrieval augmented in-context learning
- DSP consists of simple, composable functions for implementing in-context learning systems
- Implemented DSP as a Python library
- Used DSP to write programs for Open-SQuAD, HotPotQA, and QReCC
- Programs deliver substantial gains over previous in-context learning approaches
- Reveal a large space of conceptual possibilities for in-context learning