Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Retrieval-Augmented Language Modeling (RALM) methods improve language modeling and provide a source attribution mechanism.
- Existing RALM approaches modify the LM architecture, making deployment complicated.
- This paper proposes an alternative, In-Context RALM, which leaves the LM architecture unchanged and prepends grounding documents to the input.
- In-Context RALM provides large LM gains and the document retrieval and ranking mechanism can be specialized to further boost performance.
- Code is publicly available.
Paper Content
Introduction
- Recent advances in language modeling have increased the usefulness of machine-generated text.
- Mainstream language modeling has limitations in access to external knowledge.
- Factual inaccuracies or errors may be included.
- Retrieval-Augmented Language Modeling (RALM) is a promising approach to address these limitations.
- RALM systems include two components: document retrieval and document reading.
- Leading RALM systems tend to focus on altering the language model architecture.
- In-Context RALM is a simple but powerful RALM framework.
- In-Context RALM can lead to LM performance gains equivalent to increasing the LM’s number of parameters.
- In-Context RALM can help drive wider deployment of RALM systems.
Related work
- RALM approaches can be divided into two families: nearest-neighbor language models and retrieve and read models
- Nearest-neighbor language models were first introduced in Khandelwal et al. (2020)
- Retrieve and read models involve training the LM
- In-Context RALM does not involve further training of the LM and focuses on how to choose documents for improved LM performance
In-context ralm
- Language models define probability distributions over sequences of tokens.
- The probability of a sequence is modeled by predicting the next token given the sequence of tokens preceding it.
- This is usually done with a transformer network.
- Retrieval augmented language models (RALMs) add an operation that retrieves one or more documents from an external corpus and conditions the LM predictions on these documents.
Experimental details
- Described experimental setup
- Included models and implementation details
Datasets
- Evaluated effectiveness of RALM across 5 datasets
- WikiText-103 used to evaluate RALM
- 3 datasets from The Pile: ArXiv, Stack Exchange, FreeLaw
- Investigated Real-News from Zellers et al.
Models
- Four models of GPT-2 were used in experiments
- Models were trained on WebText, excluding Wikipedia documents
- GPT-Neo and OPT models allowed for larger models and Wikipedia documents seen during training
- Maximum sequence length of 1,024 used for all models
- Sparse and dense retrievers used
- Rerankers initialized from RoBERTa-base
Implementation details
- Implemented code base using Transformers library
- Used Pyserini library for sparse retrieval and FAISS for dense retrieval
The effectiveness of in-context ralm with off-the-shelf retrievers
- In-Context RALM leads to substantial LM gains.
- Recommended configuration for applying in-context RALM is a BM25 retriever with 32 query tokens and a retrieval frequency of every 4 tokens.
- Across all examined corpora, In-Context RALM improved LM perplexity to match that of a 2-3x larger model.
Bm25 outperforms off-the-shelf neural retrievers in language modeling
- BM25 outperformed three popular dense (neural) retrievers
- BM25 optimal query length = 32, dense retrievers optimal query length = 64
- BM25 outperformed all dense retrievers
Frequent retrieval improves language modeling
- LM performance improved when retrieval operations became more frequent
- Retrieving with high frequency allows the LM to be grounded in higher resolution
- Experiments used s = 4 to balance performance and runtime
A contextualization vs. recency tradeoff in query length
- Varying query length affects BM25 performance
- Sweet spot for query length is around 32 tokens
Improving in-context ralm with lm-oriented reranking
- In-context RALM uses a fixed document reading component
- There is potential for improvement by specializing the document retrieval mechanism to the LM task
- BM25 is based only on the bag of words signal and does not allow for different degrees of importance to different query tokens
- Performance gains can be obtained by using an LM to perform zero-shot reranking of the top-k BM25 retrieved documents
- Training a specialized bidirectional reranker of the top-k BM25 retrieved documents can provide further LM gains
Lms as zero-shot rerankers
- Used language models as document rerankers
- Found document that maximizes probability of text for generation
- Reranking uses last prefix tokens available at test time
- Reranking does not require same LM used for generation
- Optimal query length is 16,4
- Reranking yields better results than taking first result from retriever
Training lm-dedicated rerankers
- Trained a reranker to choose documents from top-k documents retrieved by BM25
- Reranker learns to choose document to help predict upcoming text
- Training data from target corpus assumed to be available
- Reranker is a fine-tuned RoBERTa-base
- Training process involves sampling prefixes from training data and running BM25
- Results show significant gains from predictive reranking
- Room for further improvements compared to top-16 BM25 Oracle results
Discussion
- Retrieval from external sources is a common practice in knowledge-intensive tasks
- Recent breakthroughs in LM generation capabilities have led to LMs that can generate useful long texts
- Factual inaccuracies remain a common way in which machine-generated text can fall short
- Existing approaches rely upon fine-tuning the LM, which is difficult and costly
- This paper presented the framework of in-context RALM, enabling frozen, off-the-shelf LMs to benefit from retrieval
- Performance gains can be achieved by using general purpose retrievers
- Additional gains can be achieved by tailoring the document selection to the LM setting
- Future work includes adding more documents, retrieving more sparsely, and parallelizing the input sequence