Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Retrieval-augmented language models are powerful but expensive.
  • Some work avoids cost by pre-encoding a text corpus into a memory.
  • LUMEN is a hybrid between these two extremes.
  • LUMEN outperforms pure memory and FiD on multiple question-answering tasks.
  • LUMEN advantage increases with model size.

Paper Content

Introduction

  • Retrieval-augmented language models such as Fusion-in-Decoder (Izacard & Grave, 2021) achieve strong performance on knowledge intensive tasks.
  • Retrieval-augmented models retrieve related text passages and process the passages along with the input to extract relevant context information.
  • Encoding retrieved passages can be computationally expensive.
  • An increasingly common approach to reduce this encoding cost retrieves and extracts information from a memory of pre-computed representations.
  • LUMEN (Live Update Memory Network) is a middle ground between retrieval and memory.
  • LUMEN divides the task of encoding passages between a frozen memory encoder and a fine-tuned live encoder.

Background

  • Interested in achieving best performance for given resource budget
  • Different types of computational resources
  • Algorithmic approaches yield trade-offs between resources
  • Existing retrieval-augmented models
  • Describe costs of models along different computational dimensions

Computational resources

  • Pretraining, fine-tuning and inference are stages of the model life-cycle
  • Each stage has a different cost per sample
  • Inference can be slower than FLOPs indicate due to decoder memory bandwidth constraints
  • FLOPs are used as measure of computational cost
  • Retrieval-augmented models have additional costs
  • Storage and pre-computation costs are determined by corpus size and size of a single sample

Fusion-in-decoder

  • Fusion-in-Decoder (Izacard & Grave, 2021) uses an encoder and cross-attention
  • Pre-computation costs are lower than MemoryFiD without finetuning the memory encoder
  • Storage and bandwidth costs are the same as MemoryFiD without finetuning the memory encoder

Memory

  • Pre-computing dense representations of retrieval candidates and storing them in a memory reduces the cost of retrieval-augmented models.
  • MemoryFiD removes encoder costs and only uses cross-attention and other decoder compute.
  • MemoryFiD incurs a performance penalty relative to normal FiD.
  • LUMEN uses a two-step process to generate a general representation for each passage and condition it on the input and task.

Architecture

  • LUMEN is initialized from a pre-trained T5 encoder-decoder model
  • LUMEN has three encoders: a large memory encoder, a smaller live encoder, and a question encoder
  • The memory encoder is applied offline to passages in the corpus to pre-compute memory representations
  • The live encoder updates the memory representations conditioned on input and task
  • The question encoder is used to ensure that memory representations and input are compatible

Computational analysis

  • LUMEN applies only a proportion of layers during fine-tuning and inference
  • This leads to a fraction of FLOPs for any given model size
  • Cross-attention is used

Experiments

Experiment setup

  • Experiments use models based on T5.1.1 architecture
  • Models are initialized from public T5 checkpoints
  • FiD is trained according to standard recipe
  • LUMEN is initialized with 1-α proportion of layers of T5 encoder and α proportion of layers of T5 encoder
  • Models are fine-tuned with T5X framework based on JAX and FLAX

Main results

  • Small proportion of live layers is enough to achieve quality close to FiD
  • Required live proportion decreases as model size increases
  • Gap between MemoryFiD and FiD increases with scale
  • LUMEN achieves similar performance at lower FLOPs
  • LUMEN has stronger performance than MemoryFiD for any FLOP value

Transfer

  • Transferring the Live Encoder from Natural Questions to other tasks can be beneficial, especially for smaller live encoders.
  • Transferring both the Live and Memory Encoder from Natural Questions to other tasks can be beneficial for small live encoders, where the Live Encoder does not contain enough layers to fully adapt the memory.
  • Gains from transfer are higher for smaller live proportion and significantly higher for WebQuestions, a task with a small amount of data.

Memory shape

  • Retrieval-augmented models are expensive for training and inference
  • Computational cost of retrieval-augmented models can be partitioned into cost from reading retrieved passages, decoding, and long-range attention
  • Majority of inference time in FiD is due to memory bandwidth constraints in cross-attention
  • LUMEN is a hybrid between Fusion-in-Decoder and dense memory
  • Passage representations are partially pre-encoded into a dense memory, and then reprocessed on the fly by a fine-tuned encoder that conditions on the question
  • Primary gains from the live encoder in LUMEN result from updating memory representations conditioned on the question
  • Conditioning the passage on the input is critical, although the passage-conditioned question is still helpful
  • Fine-tuning the question encoder improves performance significantly
  • LUMEN uses significantly less compute than FiD for the same performance, and this advantage grows with scale
  • LUMEN achieves much better performance than MemoryFiD at any compute budget
  • Transferring memory and especially live encoder from a related dataset can partially close the gap with FiD
  • Comparison of LUMEN with published results on Natural Questions and TriviaQA test sets