Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Retrieval-Augmented Language Modeling (RALM) methods improve language modeling and provide a source attribution mechanism.
Existing RALM approaches modify the LM architecture, making deployment complicated.
This paper proposes an alternative, In-Context RALM, which leaves the LM architecture unchanged and prepends grounding documents to the input.
In-Context RALM yields large LM gains, and the document retrieval and ranking mechanism can be specialized to boost performance further.
Code is publicly available.

Paper Content

Introduction

Recent advances in language modeling have increased the usefulness of machine-generated text.
Mainstream LMs have limited access to external knowledge, so the text they generate may include factual inaccuracies or errors.
Retrieval-Augmented Language Modeling (RALM) is a promising approach to address these limitations.
RALM systems include two components: document retrieval and document reading.
Leading RALM systems tend to focus on altering the language model architecture.
In-Context RALM is a simple but powerful RALM framework.
In-Context RALM can lead to LM performance gains equivalent to increasing the LM’s number of parameters.
In-Context RALM can help drive wider deployment of RALM systems.

Related work

RALM approaches can be divided into two families: nearest-neighbor language models and retrieve-and-read models.
Nearest-neighbor language models were first introduced in Khandelwal et al. (2020).
Retrieve-and-read models involve training the LM.
In-Context RALM does not involve further training of the LM and focuses on how to choose documents for improved LM performance.

In-Context RALM

Language models define probability distributions over sequences of tokens.
The probability of a sequence is modeled by predicting the next token given the sequence of tokens preceding it.
This is usually done with a transformer network.
Retrieval augmented language models (RALMs) add an operation that retrieves one or more documents from an external corpus and conditions the LM predictions on these documents.
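
The two formulations above can be written side by side. This is a sketch in assumed notation, not taken from the text of this summary: $x_i$ are tokens, $p_\theta$ is the LM, $\mathcal{R}_{\mathcal{C}}$ is the retrieval operation over an external corpus $\mathcal{C}$, and $[a; b]$ denotes concatenation.

```latex
% Standard language model: each token is predicted from its prefix.
p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_\theta\left(x_i \mid x_{<i}\right)

% In-Context RALM: the retrieved document(s) are prepended to the prefix,
% and the LM itself is left unchanged.
p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_\theta\left(x_i \mid \left[\mathcal{R}_{\mathcal{C}}(x_{<i});\, x_{<i}\right]\right)
```

The only difference between the two lines is the conditioning context, which is why no change to the LM architecture is needed.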

Experimental details

This section describes the experimental setup: the datasets, the models, and the implementation details

Datasets

In-Context RALM was evaluated on five datasets
WikiText-103
Three datasets from The Pile: ArXiv, Stack Exchange, FreeLaw
Real-News from Zellers et al.

Models

Four sizes of GPT-2 were used in the experiments
The GPT-2 models were trained on WebText, which excludes Wikipedia documents
GPT-Neo and OPT models were also used, covering larger model sizes and models that saw Wikipedia documents during training
Maximum sequence length of 1,024 used for all models
Sparse and dense retrievers used
Rerankers initialized from RoBERTa-base

Implementation details

Implemented code base using Transformers library
Used Pyserini library for sparse retrieval and FAISS for dense retrieval

The effectiveness of In-Context RALM with off-the-shelf retrievers

In-Context RALM leads to substantial LM gains.
Recommended configuration for applying in-context RALM is a BM25 retriever with 32 query tokens and a retrieval frequency of every 4 tokens.
Across all examined corpora, In-Context RALM improved LM perplexity to match that of a 2-3x larger model.
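
The recommended configuration can be pictured as a simple schedule over the token stream. The following is a minimal sketch, not the paper's code: the function names `retrieval_queries` and `build_input` are hypothetical, skipping retrieval before the first `stride` tokens exist is an assumption, and so is the truncation policy (keep the document, drop the oldest prefix tokens) used to respect the 1,024-token limit.

```python
from typing import Iterator, List, Tuple

Tokens = List[str]


def retrieval_queries(
    tokens: Tokens, stride: int = 4, query_len: int = 32
) -> Iterator[Tuple[int, Tokens]]:
    """Yield (position, query) pairs, one retrieval every `stride` tokens.

    The query is the last `query_len` tokens of the prefix seen so far.
    Assumption: no retrieval happens before `stride` tokens are available.
    """
    for pos in range(stride, len(tokens), stride):
        prefix = tokens[:pos]
        yield pos, prefix[-query_len:]


def build_input(doc: Tokens, prefix: Tokens, max_len: int = 1024) -> Tokens:
    """Prepend the retrieved document to the prefix, within `max_len` tokens.

    Assumption: the document is kept and the oldest prefix tokens are dropped.
    """
    room = max(max_len - len(doc), 0)
    kept_prefix = prefix[-room:] if room > 0 else []
    return (doc + kept_prefix)[:max_len]


# Example: a 10-token text, retrieving every 4 tokens with 3-token queries.
example_tokens = [f"t{i}" for i in range(10)]
example_schedule = list(retrieval_queries(example_tokens, stride=4, query_len=3))
```

A smaller `stride` means more retrieval calls for the same text, which is the performance versus runtime balance the configuration has to strike.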

BM25 outperforms off-the-shelf neural retrievers in language modeling

BM25 outperformed all three popular dense (neural) retrievers it was compared against
Optimal query length was 32 tokens for BM25 and 64 tokens for the dense retrievers
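
For readers unfamiliar with BM25, here is a minimal pure-Python sketch of the standard Okapi BM25 scoring function. It is an illustration only and not the retrieval library used in the experiments; the function name `bm25_scores`, the parameter values `k1 = 1.5` and `b = 0.75`, and the particular smoothed IDF variant are assumptions chosen for the example.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str], docs: List[List[str]], k1: float = 1.5, b: float = 0.75
) -> List[float]:
    """Score every document in `docs` against `query` with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)  # term frequencies within this document
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF (assumed variant); always positive.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


example_docs = [
    ["retrieval", "improves", "language", "models"],
    ["cats", "sleep", "all", "day"],
    ["language", "models", "predict", "tokens"],
]
example_scores = bm25_scores(["retrieval", "language"], example_docs)
```

The score is a sum over query terms, so the order of tokens in the query has no effect on the ranking.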

Frequent retrieval improves language modeling

LM performance improved when retrieval operations became more frequent
Retrieving more frequently lets the LM be grounded at a higher resolution
Experiments used a retrieval stride of s = 4 (one retrieval every 4 tokens) to balance performance and runtime

A contextualization vs. recency tradeoff in query length

Varying the query length changes BM25 performance: longer queries carry more context, while shorter queries focus on the most recent tokens
The sweet spot for query length is around 32 tokens

Improving In-Context RALM with LM-oriented reranking

In-context RALM uses a fixed document reading component
There is potential for improvement by specializing the document retrieval mechanism to the LM task
BM25 is based only on the bag-of-words signal and cannot assign different degrees of importance to different query tokens
Performance gains can be obtained by using an LM to perform zero-shot reranking of the top-k BM25 retrieved documents
Training a specialized bidirectional reranker of the top-k BM25 retrieved documents can provide further LM gains

LMs as zero-shot rerankers

Language models were used as document rerankers
The goal is to find the document that maximizes the probability of the text to be generated
Because that text is not available at test time, reranking scores the last prefix tokens instead
Reranking does not require the same LM that is used for generation
Optimal query length is 16,4
Reranking yields better results than taking first result from retriever
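
The zero-shot reranking idea can be sketched as follows. This is a hedged illustration rather than the paper's implementation: `zero_shot_rerank`, the `log_prob` callable (standing in for any LM that returns the log-probability of `target` given `context`), and the parameter `n_target` are hypothetical names, and `toy_log_prob` is a stand-in scorer that counts token overlap, not a real language model.

```python
from typing import Callable, List, Sequence

Tokens = List[str]
# Hypothetical scorer interface: log p(target | context) under some LM.
LogProbFn = Callable[[Tokens, Tokens], float]


def zero_shot_rerank(
    candidates: Sequence[Tokens], prefix: Tokens, log_prob: LogProbFn, n_target: int
) -> int:
    """Return the index of the candidate under which the last `n_target`
    prefix tokens are most probable, given the candidate and earlier prefix."""
    if n_target < 1:
        raise ValueError("n_target must be at least 1")
    context, target = prefix[:-n_target], prefix[-n_target:]
    scores = [log_prob(list(doc) + context, target) for doc in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def toy_log_prob(context: Tokens, target: Tokens) -> float:
    """Stand-in for an LM: counts target tokens that appear in the context."""
    return float(sum(tok in context for tok in target))


example_candidates = [
    ["paris", "is", "in", "france"],
    ["the", "sky", "is", "blue"],
]
example_prefix = ["tell", "me", "the", "sky", "is"]
example_choice = zero_shot_rerank(example_candidates, example_prefix, toy_log_prob, n_target=2)
```

Because the scorer is passed in as a callable, the reranking LM can differ from the LM used for generation, for example a smaller and cheaper model.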

Training LM-dedicated rerankers

Trained a reranker to choose documents from top-k documents retrieved by BM25
Reranker learns to choose document to help predict upcoming text
Training data from target corpus assumed to be available
Reranker is a fine-tuned RoBERTa-base
Training process involves sampling prefixes from training data and running BM25
Results show significant gains from predictive reranking
Room for further improvements compared to top-16 BM25 Oracle results
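
One way to picture the data behind this setup is to score each BM25 candidate by how well it helps the LM predict the true upcoming text; the best-scoring candidate is what an oracle would pick. The sketch below is an assumption-laden illustration, not the paper's training code: `label_candidates`, the `log_prob` callable, and the overlap-counting `toy_log_prob` are hypothetical, and the loss used to train the reranker on such scores is not described in this summary, so it is not shown.

```python
from typing import Callable, Dict, List, Sequence

Tokens = List[str]
# Hypothetical scorer interface: log p(target | context) under the frozen LM.
LogProbFn = Callable[[Tokens, Tokens], float]


def label_candidates(
    candidates: Sequence[Tokens],
    prefix: Tokens,
    upcoming: Tokens,
    log_prob: LogProbFn,
) -> Dict[str, object]:
    """Score each retrieved candidate (non-empty list) by the LM's
    log-probability of the upcoming text when the candidate is prepended."""
    scores = [log_prob(list(doc) + list(prefix), list(upcoming)) for doc in candidates]
    oracle_index = max(range(len(scores)), key=scores.__getitem__)
    return {"prefix": list(prefix), "scores": scores, "oracle_index": oracle_index}


def toy_log_prob(context: Tokens, target: Tokens) -> float:
    """Stand-in for an LM: counts target tokens that appear in the context."""
    return float(sum(tok in context for tok in target))


example = label_candidates(
    candidates=[["rome", "italy"], ["paris", "france"]],
    prefix=["capital", "of", "france", "is"],
    upcoming=["paris"],
    log_prob=toy_log_prob,
)
```

Unlike zero-shot reranking, this labeling step uses the upcoming text, which is only available in training data; a trained reranker has to approximate the oracle's choice from the prefix alone.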

Discussion

Retrieval from external sources is a common practice in knowledge-intensive tasks
Recent breakthroughs in LM generation capabilities have led to LMs that can generate useful long texts
Factual inaccuracies remain a common way in which machine-generated text can fall short
Existing approaches rely upon fine-tuning the LM, which is difficult and costly
This paper presented the framework of in-context RALM, enabling frozen, off-the-shelf LMs to benefit from retrieval
Performance gains can be achieved by using general purpose retrievers
Additional gains can be achieved by tailoring the document selection to the LM setting
Future work includes adding more documents, retrieving more sparsely, and parallelizing the input sequence
