Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- REPLUG is a retrieval-augmented language modeling framework.
- REPLUG treats the language model as a black box and augments it with a tunable retrieval model.
- REPLUG prepends retrieved documents to the input for the frozen black-box LM.
- REPLUG can be applied to any existing retrieval and language models.
- The LM can be used to supervise the retrieval model, so that it retrieves documents that help the LM make better predictions.
- With the tuned retriever, REPLUG improves GPT-3 (175B) on language modeling by 6.3% and Codex on five-shot MMLU by 5.1%.
Paper Content
Introduction
- Large language models (LLMs) have demonstrated impressive performance on a wide range of language tasks.
- LLMs store a substantial amount of world or domain knowledge implicitly in their parameters.
- LLMs are prone to hallucination and cannot represent the full long tail of knowledge from the training corpus.
- Retrieval-augmented language models can retrieve knowledge from an external datastore when needed, potentially reducing hallucination and increasing coverage.
- REPLUG treats the language model as a black box and augments it with a frozen or tunable retriever.
- REPLUG is applicable to large LMs (i.e., >100B parameters).
- REPLUG can improve the performance of diverse black-box LMs on both language modeling and downstream tasks.
- REPLUG LSR adapts the retriever to the LM, using the language modeling scores as supervision signals.
Background and related work
- Many of the best-performing large language models are not open-sourced and are only available as black-box APIs
- Finetuning large language models requires significant computational resources
- Retrieval-augmented model frameworks traditionally focus on the white-box setting
- Retrieval-augmentation improves performance on various NLP tasks
- Prior retrieval-augmented language models typically require access to internal LM representations
- Concurrent work has demonstrated that using a frozen retriever can improve GPT-3 performance
- REPLUG uses Contriever as the retrieval model
Document retrieval
- Retriever aims to retrieve documents from a corpus that are relevant to an input context
- Dual encoder architecture is used to encode both the input context and the document
- Mean pooling of the last hidden representation is used to map each document to an embedding
- Query embedding is obtained by applying the same encoder to the input context
- Cosine similarity is used to compute the similarity between the query embedding and the document embedding
- FAISS index is constructed over the precomputed document embeddings
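The retrieval steps above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the encoder's last-layer token representations are already available as arrays, and it swaps the FAISS index for a brute-force NumPy search so the example stays self-contained. The function names `mean_pool` and `cosine_top_k` are hypothetical.

```python
import numpy as np

def mean_pool(token_states: np.ndarray) -> np.ndarray:
    """Average last-layer token representations of shape (num_tokens, dim)."""
    return token_states.mean(axis=0)

def cosine_top_k(query_emb: np.ndarray, doc_embs: np.ndarray, k: int):
    """Return indices and cosine similarities of the k most similar documents.

    Brute-force search stands in for a FAISS index here (an assumption made
    only to keep the sketch dependency-free).
    """
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q                    # cosine similarity per document
    top = np.argsort(-scores)[:k]     # highest similarity first
    return top, scores[top]

# Toy usage with 2-dimensional embeddings.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.1])
top_ids, top_scores = cosine_top_k(query, docs, k=2)
```

Because the document embeddings do not depend on the query, they can be computed once and reused for every query, which is what makes a precomputed index practical.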
Input reformulation
- Retrieve top-k documents to provide rich information to the LM
- Ensemble strategy to incorporate documents into the input to the LM
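- REPLUG LSR uses the LM itself to provide supervision for which documents should be retrieved
- Training algorithm consists of 4 steps: retrieving documents and computing the retrieval likelihood, scoring the retrieved documents with the LM, updating the retriever parameters by minimizing the KL divergence between the two likelihoods, and asynchronously updating the datastore index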
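The ensemble strategy mentioned in this section can be sketched as follows. The sketch assumes that each retrieved document is prepended to the input in its own LM pass, and that the resulting next-token distributions are averaged with weights given by a softmax over the retriever's similarity scores. The function names and the toy numbers are hypothetical.

```python
import numpy as np

def softmax(x) -> np.ndarray:
    """Numerically stable softmax over a 1-D array of scores."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

def ensemble_next_token_probs(per_doc_probs, similarity_scores) -> np.ndarray:
    """Weighted average of per-document next-token distributions.

    per_doc_probs: shape (k, vocab_size); row i is the LM output when
        document i is prepended to the input (assumed to be given).
    similarity_scores: shape (k,); retriever score of each document.
    """
    weights = softmax(similarity_scores)        # one weight per document
    return weights @ np.asarray(per_doc_probs)  # shape (vocab_size,)

# Two documents with equal scores contribute equally.
probs = ensemble_next_token_probs([[0.9, 0.1], [0.5, 0.5]], [0.0, 0.0])
```

Running one pass per document keeps each input within the LM's context window, at the cost of k forward passes per prediction.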
Computing retrieval likelihood
- Retrieve k documents from corpus D given input context x
- Compute retrieval likelihood of each retrieved document by normalizing its similarity score over the top-k retrieved documents, which approximates marginalizing over the full corpus D
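The retrieval likelihood can be written out as a softmax over the retrieved set. The notation is assumed for this sketch: $s(d, x)$ stands for the retriever's similarity score between document $d$ and context $x$, $\mathcal{D}'$ for the set of $k$ retrieved documents, and $\gamma$ for a temperature hyperparameter.

```latex
P_R(d \mid x) = \frac{\exp\big(s(d, x)/\gamma\big)}{\sum_{d' \in \mathcal{D}'} \exp\big(s(d', x)/\gamma\big)}
```

Summing over the $k$ retrieved documents instead of the whole corpus keeps the normalizer cheap to compute.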
Computing LM likelihood
- LM is used as a scoring function to measure how much a document can improve LM perplexity
- LM likelihood of each document is a softmax over the LM's probabilities of the ground truth continuation given each document, with a temperature hyperparameter β
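Written out under assumed notation, with $P_{LM}(y \mid d, x)$ denoting the probability the LM assigns to the ground truth continuation $y$ when document $d$ is prepended to context $x$, and $\mathcal{D}'$ the set of retrieved documents, the LM likelihood of a document would take this form:

```latex
Q_{LM}(d \mid x, y) = \frac{\exp\big(P_{LM}(y \mid d, x)/\beta\big)}{\sum_{d' \in \mathcal{D}'} \exp\big(P_{LM}(y \mid d', x)/\beta\big)}
```

A smaller $\beta$ sharpens the distribution, concentrating mass on the documents that help the LM most.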
Loss function
- Compute retrieval and language model likelihoods given input context and ground truth continuation
- Minimize the KL divergence between the retrieval likelihood and the LM likelihood by updating only the retrieval model parameters; the LM parameters stay fixed
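As a rough numerical illustration of this loss, the sketch below averages the KL divergence between the two distributions over a batch. It assumes both distributions have already been computed over the same retrieved documents, and the names `kl_divergence` and `lsr_loss` are hypothetical. NumPy has no automatic differentiation, so this only evaluates the loss; an actual training setup would compute it in an autodiff framework and backpropagate through the retrieval distribution alone.

```python
import numpy as np

def kl_divergence(p, q, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same documents."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # eps guards against log(0) for documents with zero probability.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def lsr_loss(retrieval_probs, lm_probs) -> float:
    """Mean KL(retrieval || LM) over a batch.

    Each row of the inputs is a distribution over the k retrieved
    documents for one training example.
    """
    return float(np.mean([kl_divergence(p, q)
                          for p, q in zip(retrieval_probs, lm_probs)]))

# The loss is zero when the retriever already matches the LM's preferences.
loss_matched = lsr_loss([[0.7, 0.3]], [[0.7, 0.3]])
loss_mismatched = lsr_loss([[0.5, 0.5]], [[0.9, 0.1]])
```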
Asynchronous update of the datastore index
- Parameters in the retriever are updated during training
- As a result, the precomputed document embeddings are no longer up to date
- Document embeddings are recomputed and search index is rebuilt every T training steps
- New document embeddings and index are used for retrieval
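The refresh schedule described above can be sketched as a training loop that rebuilds the index every T steps. This is an illustrative skeleton only: `train_step` and `build_index` are hypothetical callables standing in for the real retriever update and the re-embedding plus index construction.

```python
from typing import Callable, List

def train_with_index_refresh(
    total_steps: int,
    refresh_every: int,
    train_step: Callable[[int, object], None],
    build_index: Callable[[], object],
) -> List[int]:
    """Train for total_steps, rebuilding the index every refresh_every steps.

    Returns the list of steps after which the index was rebuilt.
    """
    index = build_index()  # initial index from the starting retriever
    refreshed_at: List[int] = []
    for step in range(1, total_steps + 1):
        # Retrieval inside train_step uses the current, possibly stale, index.
        train_step(step, index)
        if step % refresh_every == 0:
            # Re-embed documents with the updated retriever and rebuild.
            index = build_index()
            refreshed_at.append(step)
    return refreshed_at

# Toy usage: 10 steps with a refresh every 3 steps.
seen_steps: List[int] = []
rebuilds = train_with_index_refresh(
    10, 3, lambda step, index: seen_steps.append(step), lambda: object()
)
```

Between refreshes the retriever searches a slightly outdated index, which trades some retrieval accuracy for avoiding a full re-embedding of the corpus at every step.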
Training setup
- Model setting described in REPLUG
- Training procedure for REPLUG LSR
- Initialize retriever with Contriever model
- Use GPT-3 Curie as supervision LM
- Pre-compute document embeddings and create FAISS index
- Retrieve top 20 documents from FAISS index
- Compute retrieval and LM likelihood with temperature of 0.1
- Train retriever with Adam optimizer
- Re-compute document embeddings every 3k steps
- Fine-tune retriever for total of 25k steps
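The settings listed above can be collected into a small configuration object. The class and field names are hypothetical, and only values stated in this section are included; hyperparameters the summary does not mention are deliberately left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LSRTrainingConfig:
    """Hypothetical container for the REPLUG LSR settings listed above."""
    retriever_init: str = "Contriever"
    supervision_lm: str = "GPT-3 Curie"
    top_k: int = 20                    # documents retrieved per query
    temperature: float = 0.1           # used for both likelihoods
    optimizer: str = "Adam"
    reindex_every_steps: int = 3_000   # document embedding refresh interval
    total_steps: int = 25_000

    def num_index_rebuilds(self) -> int:
        """Number of full re-embedding passes implied by the schedule."""
        return self.total_steps // self.reindex_every_steps

config = LSRTrainingConfig()
```

With these values the schedule implies 25,000 // 3,000 = 8 index rebuilds over the course of training.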
Experiments
- Evaluations performed on language modeling and downstream tasks
- REPLUG improves performance of black-box language models
Language modeling
- Pile is a language modeling benchmark consisting of text sources from diverse domains
- GPT-3 and GPT-2 family language models are used as baselines
- REPLUG and REPLUG LSR are added to the baselines
- Results show that both REPLUG and REPLUG LSR significantly outperform the baselines
- REPLUG LSR performs better than REPLUG by a large margin
MMLU
- MMLU is a multiple choice QA dataset with 57 tasks
- Two groups of strong previous models used as baselines
- REPLUG and REPLUG LSR improve the original Codex model by 4.5% and 5.1%, respectively
- REPLUG LSR outperforms retrieval-augmented language model Atlas
- REPLUG LSR outperforms original model by 1.9% in STEM category
Open-domain QA
- Evaluated on two open-domain QA datasets: Natural Questions and TriviaQA
- Questions and answers collected from Wikipedia and the Web
- Evaluated in a few-shot setting, where the model sees only a few training examples
- Compared to state-of-the-art baselines
- Baselines include large language models and retrieval-augmented language models
- REPLUG's few-shot results lag behind retrieval-augmented language models fine-tuned on the full training data
Analysis
- REPLUG's performance gain does not come from the ensembling effect alone
- Ensembling random documents instead of retrieved ones performs worse than REPLUG, so the relevance of the retrieved documents matters
- REPLUG performance improves with more documents
- REPLUG applicable to diverse language models with different sizes
Qualitative analysis: rare entities benefit from retrieval
- REPLUG improves language modeling performance when texts contain rare entities.
- REPLUG is most helpful for rare entity names, likely because the original LM does not have enough information about them.
- Incorporating the retrieved document during inference improves perplexity of the continuation by 11%.
- The entity “Li Bai” and the token “greatest” show the most improvement in perplexity (15% and 5% respectively).
Conclusion
- REPLUG is a retrieval-augmented language modeling paradigm
- REPLUG can be integrated with any existing language model
- REPLUG uses a retriever to retrieve documents from an external corpus
- REPLUG prepends each retrieved document to the input in a separate LM pass and ensembles the output probabilities from the different passes
- REPLUG improves performance on language modeling and downstream tasks