Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • REPLUG is a retrieval-augmented language modeling framework.
  • REPLUG treats the language model as a black box and augments it with a tuneable retrieval model.
  • REPLUG prepends retrieved documents to the input for the frozen black-box LM.
  • REPLUG can be applied to any existing retrieval and language models.
  • REPLUG can be used to supervise the retrieval model.
  • REPLUG improves the performance of GPT-3 and Codex.

Paper Content

Introduction

  • Large language models (LLMs) have demonstrated impressive performance on a wide range of language tasks.
  • LLMs store a substantial amount of world or domain knowledge implicitly in their parameters.
  • LLMs are prone to hallucination and cannot represent the full long tail of knowledge from the training corpus.
  • Retrieval-augmented language models can retrieve knowledge from an external datastore when needed, potentially reducing hallucination and increasing coverage.
  • REPLUG treats the language model as a black box and augments it with a frozen or tunable retriever.
  • REPLUG is applicable to large LMs (i.e., >100B parameters).
  • REPLUG can improve the performance of diverse black-box LMs on both language modeling and downstream tasks.
  • REPLUG LSR adapts the retriever to the LM, using the language modeling scores as supervision signals.
  • Large language models are not open-sourced and are only available as black-box APIs
  • Finetuning large language models requires significant computational resources
  • Retrieval-augmented model frameworks traditionally focus on the white-box setting
  • Retrieval-augmentation improves performance on various NLP tasks
  • Retrieval-augmented language models require access to internal LM representations
  • Concurrent work has demonstrated that using a frozen retriever can improve GPT-3 performance
  • REPLUG uses Contriever as the retrieval model

Document retrieval

  • Retriever aims to retrieve documents from a corpus that are relevant to an input context
  • Dual encoder architecture is used to encode both the input context and the document
  • Mean pooling of the last hidden representation is used to map each document to an embedding
  • Query embedding is obtained by applying the same encoder to the input context
  • Cosine similarity is used to compute the similarity between the query embedding and the document embedding
  • FAISS index is constructed over the precomputed document embeddings

Input reformulation

  • Retrieve top-k documents to provide rich information to the LM
  • Ensemble strategy to incorporate documents into the input to the LM
  • REPLUG LSR uses LM to provide supervision for which documents should be retrieved
  • Training algorithm consists of 4 steps: retrieval, scoring, updating, and asynchronous update

Computing retrieval likelihood

  • Retrieve k documents from corpus D given input context x
  • Compute retrieval likelihood of each retrieved document by marginalizing over retrieved documents D

Computing lm likelihood

  • LM is used as a scoring function to measure how much a document can improve LM perplexity
  • LM likelihood of each document is computed using a hyperparameter β

Loss function

  • Compute retrieval and language model likelihoods given input context and ground truth continuation
  • Minimize KL divergence between two distributions by updating retrieval model parameters, LM parameters are fixed

Asynchronous update of the datastore index

  • Parameters in the retriever are updated during training
  • Document embeddings are no longer up to date
  • Document embeddings are recomputed and search index is rebuilt every T training steps
  • New document embeddings and index are used for retrieval

Training setup

  • Model setting described in REPLUG
  • Training procedure for REPLUG LSR
  • Initialize retriever with Contriever model
  • Use GPT-3 Curie as supervision LM
  • Pre-compute document embeddings and create FAISS index
  • Retrieve top 20 documents from FAISS index
  • Compute retrieval and LM likelihood with temperature of 0.1
  • Train retriever with Adam optimizer
  • Re-compute document embeddings every 3k steps
  • Fine-tune retriever for total of 25k steps

Experiments

  • Evaluations performed on language modeling and downstream tasks
  • REPLUG improves performance of black-box language models

Language modeling

  • Pile is a language modeling benchmark consisting of text sources from diverse domains
  • GPT-3 and GPT-2 family language models are used as baselines
  • REPLUG and REPLUG LSR are added to the baselines
  • Results show that both REPLUG and REPLUG LSR significantly outperform the baselines
  • REPLUG LSR performs better than REPLUG by a large margin

Mmlu

  • MMLU is a multiple choice QA dataset with 57 tasks
  • Two groups of strong previous models used as baselines
  • REPLUG and REPLUG LSR improve original Codex model by 4.5% and 5.1%
  • REPLUG LSR outperforms retrieval-augmented language model Atlas
  • REPLUG LSR outperforms original model by 1.9% in STEM category

Open domain qa

  • Evaluated on two open-domain QA datasets: Natural Questions and TriviaQA
  • Questions and answers collected from Wikipedia and the Web
  • Evaluated in few-shot setting with few examples
  • Compared to state-of-the-art baselines
  • Baselines include large language models and retrieval-augmented language models
  • Proposed model lags behind performance of retrieval-augmented language models with full training data

Analysis

  • REPLUG performance gain does not come from ensembling effect
  • Comparing REPLUG to ensembling random documents shows worse performance
  • REPLUG performance improves with more documents
  • REPLUG applicable to diverse language models with different sizes

Qualitative analysis: rare entities benefit from retrieval

  • REPLUG improves language modeling performance when texts contain rare entities.
  • REPLUG is most helpful for rare entity names, likely because the original LM does not have enough information about them.
  • Incorporating the retrieved document during inference improves perplexity of the continuation by 11%.
  • The entity “Li Bai” and the token “greatest” show the most improvement in perplexity (15% and 5% respectively).

Conclusion

  • REPLUG is a retrieval-augmented language modeling paradigm
  • REPLUG can be integrated with any existing language model
  • REPLUG uses a retriever to retrieve documents from an external corpus
  • REPLUG ensembles output probabilities from different passes
  • REPLUG improves performance on language modeling and downstream tasks