Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Retrieval-augmented language models are powerful but expensive.
Some work avoids this cost by pre-encoding a text corpus into a memory.
LUMEN is a hybrid between these two extremes.
At a given compute budget, LUMEN outperforms both pure memory and FiD on multiple question-answering tasks.
LUMEN's advantage increases with model size.

Paper Content

Introduction

Retrieval-augmented language models such as Fusion-in-Decoder (Izacard & Grave, 2021) achieve strong performance on knowledge-intensive tasks.
Retrieval-augmented models retrieve related text passages and process the passages along with the input to extract relevant context information.
Encoding retrieved passages can be computationally expensive.
An increasingly common approach to reduce this encoding cost retrieves and extracts information from a memory of pre-computed representations.
LUMEN (Live Update Memory Network) is a middle ground between retrieval and memory.
LUMEN divides the task of encoding passages between a frozen memory encoder and a fine-tuned live encoder.

Background

The goal is to achieve the best performance for a given resource budget
Different types of computational resources
Algorithmic approaches yield trade-offs between resources
Existing retrieval-augmented models
Describe costs of models along different computational dimensions

Computational resources

Pretraining, fine-tuning and inference are stages of the model life-cycle
Each stage has a different cost per sample
Inference can be slower than FLOPs indicate due to decoder memory bandwidth constraints
FLOPs are used as measure of computational cost
Retrieval-augmented models have additional costs
Storage and pre-computation costs are determined by corpus size and size of a single sample
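
As a rough illustration of how corpus size and per-sample size drive storage, the sketch below assumes the memory keeps one vector per token for every passage. The function name, the corpus numbers, and the 2-byte precision are hypothetical choices for this example, not figures from the paper.

```python
def memory_storage_bytes(num_passages: int,
                         tokens_per_passage: int,
                         d_model: int,
                         bytes_per_value: int = 2) -> int:
    """Bytes needed to store one d_model-sized vector per token per passage.

    Assumption: token-level representations are stored for the whole corpus.
    """
    return num_passages * tokens_per_passage * d_model * bytes_per_value


# Hypothetical corpus: 1M passages, 256 tokens each, d_model = 1024, 2 bytes per value.
total_bytes = memory_storage_bytes(1_000_000, 256, 1024)
print(f"{total_bytes / 1e12:.2f} TB")
```

Storage grows linearly in each factor, so doubling either the corpus size or the passage length doubles the memory footprint under these assumptions.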

Fusion-in-Decoder

Fusion-in-Decoder (Izacard & Grave, 2021) uses an encoder and cross-attention
Pre-computation costs are lower than MemoryFiD without finetuning the memory encoder
Storage and bandwidth costs are the same as MemoryFiD without finetuning the memory encoder

Memory

Pre-computing dense representations of retrieval candidates and storing them in a memory reduces the cost of retrieval-augmented models.
MemoryFiD removes encoder costs and only uses cross-attention and other decoder compute.
MemoryFiD incurs a performance penalty relative to normal FiD.
LUMEN uses a two-step process to generate a general representation for each passage and condition it on the input and task.

Architecture

LUMEN is initialized from a pre-trained T5 encoder-decoder model
LUMEN has three encoders: a large memory encoder, a smaller live encoder, and a question encoder
The memory encoder is applied offline to passages in the corpus to pre-compute memory representations
The live encoder updates the memory representations conditioned on input and task
The question encoder is used to ensure that memory representations and input are compatible
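
The data flow between the three encoders can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `mix_layer` is a hypothetical stand-in for a Transformer layer, token vectors are plain Python lists, and the live encoder is assumed to read the question representation concatenated with each retrieved memory.

```python
from typing import Callable, List

Vector = List[float]
Tokens = List[Vector]
Layer = Callable[[Tokens], Tokens]


def mix_layer(tokens: Tokens) -> Tokens:
    """Toy stand-in for self-attention: add the mean token to every token."""
    dim = len(tokens[0])
    mean = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
    return [[t[i] + mean[i] for i in range(dim)] for t in tokens]


def run_layers(layers: List[Layer], tokens: Tokens) -> Tokens:
    """Apply a stack of layers in order."""
    for layer in layers:
        tokens = layer(tokens)
    return tokens


def precompute_memory(memory_encoder: List[Layer], passages: List[Tokens]) -> List[Tokens]:
    """Offline step: encode each passage once, independent of any question."""
    return [run_layers(memory_encoder, p) for p in passages]


def lumen_encode(question: Tokens,
                 retrieved_memory: List[Tokens],
                 question_encoder: List[Layer],
                 live_encoder: List[Layer]) -> List[Tokens]:
    """Online step: encode the question, then update each retrieved memory
    jointly with the question representation."""
    q = run_layers(question_encoder, question)
    # q + mem concatenates the two token lists along the token axis.
    return [run_layers(live_encoder, q + mem) for mem in retrieved_memory]


# Two one-token passages and a one-token question, all with 2-dimensional vectors.
passages = [[[1.0, 0.0]], [[0.0, 1.0]]]
memory = precompute_memory([mix_layer], passages)
outputs = lumen_encode([[1.0, 1.0]], memory, [mix_layer], [mix_layer])
print(outputs)
```

Only `lumen_encode` runs per query; `precompute_memory` is paid once per corpus, which is where the savings over encoding every retrieved passage from scratch come from.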

Computational analysis

LUMEN applies only a proportion of layers during fine-tuning and inference
This leads to a fraction of FLOPs for any given model size
Cross-attention is used
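
A back-of-the-envelope version of this argument is sketched below. It assumes every encoder layer costs the same number of FLOPs per token, that decoder cost is identical for both models, and it ignores the question encoder; the function name and all numbers are hypothetical.

```python
def online_flops(encoder_layers_applied: int,
                 passage_tokens: int,
                 flops_per_token_per_layer: int,
                 decoder_flops: int) -> int:
    """FLOPs spent per query: encoder layers run online plus decoder compute.

    Assumption: uniform per-layer cost and a fixed decoder cost.
    """
    encoder = encoder_layers_applied * passage_tokens * flops_per_token_per_layer
    return encoder + decoder_flops


# Hypothetical setting: 24 encoder layers, 8 of them live, 1000 passage tokens.
fid_cost = online_flops(24, 1000, 1, decoder_flops=2000)
lumen_cost = online_flops(8, 1000, 1, decoder_flops=2000)
print(fid_cost, lumen_cost, round(lumen_cost / fid_cost, 3))
```

Under these assumptions the encoder term shrinks in proportion to the share of live layers, while the decoder term is unchanged, so the overall saving is smaller than the encoder saving alone.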

Experiments

Experiment setup

Experiments use models based on T5.1.1 architecture
Models are initialized from public T5 checkpoints
FiD is trained according to standard recipe
LUMEN's memory encoder is initialized with the first 1-α proportion of T5 encoder layers, and its live encoder with the remaining α proportion
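
One way to read this split is sketched below, assuming the first 1-α share of encoder layers goes to the memory encoder and the remaining α share to the live encoder. The function name and the rounding rule are hypothetical choices for illustration, not details from the paper.

```python
from typing import List, Tuple


def split_encoder_layers(num_layers: int, alpha: float) -> Tuple[List[int], List[int]]:
    """Return (memory_layer_indices, live_layer_indices) for a live proportion alpha.

    Assumption: the live encoder gets round(alpha * num_layers) layers taken
    from the top of the stack; the memory encoder gets the rest.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    num_live = round(alpha * num_layers)
    num_memory = num_layers - num_live
    return list(range(num_memory)), list(range(num_memory, num_layers))


# Hypothetical 24-layer encoder with a quarter of the layers live.
memory_layers, live_layers = split_encoder_layers(24, 0.25)
print(len(memory_layers), len(live_layers))
```

With `alpha=0.0` every layer is pre-computed, which corresponds to a pure memory model, and with `alpha=1.0` every layer runs online, which corresponds to standard FiD.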
Models are fine-tuned with T5X framework based on JAX and FLAX

Main results

Small proportion of live layers is enough to achieve quality close to FiD
Required live proportion decreases as model size increases
Gap between MemoryFiD and FiD increases with scale
LUMEN achieves similar performance at lower FLOPs
LUMEN has stronger performance than MemoryFiD for any FLOP value

Transfer

Transferring the Live Encoder from Natural Questions to other tasks can be beneficial, especially for smaller live encoders.
Transferring both the Live and Memory Encoder from Natural Questions to other tasks can be beneficial for small live encoders, where the Live Encoder does not contain enough layers to fully adapt the memory.
Gains from transfer are higher for smaller live proportion and significantly higher for WebQuestions, a task with a small amount of data.

Memory shape

Retrieval-augmented models are expensive for training and inference
Computational cost of retrieval-augmented models can be partitioned into cost from reading retrieved passages, decoding, and long-range attention
Majority of inference time in FiD is due to memory bandwidth constraints in cross-attention
LUMEN is a hybrid between Fusion-in-Decoder and dense memory
Passage representations are partially pre-encoded into a dense memory, and then reprocessed on the fly by a fine-tuned encoder that conditions on the question
Primary gains from the live encoder in LUMEN result from updating memory representations conditioned on the question
Conditioning the passage on the input is critical, although the passage-conditioned question is still helpful
Fine-tuning the question encoder improves performance significantly
LUMEN uses significantly less compute than FiD for the same performance, and this advantage grows with scale
LUMEN achieves much better performance than MemoryFiD at any compute budget
Transferring memory and especially live encoder from a related dataset can partially close the gap with FiD
Comparison of LUMEN with published results on Natural Questions and TriviaQA test sets
