Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

DSIs encode documents in a model and use the same model to map queries to relevant documents.
Reindexing the corpus is computationally expensive.
DSI++ is a continual learning challenge to incrementally index new documents while being able to answer queries related to both previously and newly indexed documents.
Model experiences forgetting events during training, leading to unstable learning.
Two approaches are proposed to mitigate these issues: modifying the training dynamics and introducing a generative memory.

Paper Content

Introduction

Differentiable Search Indices (DSI) represent a new modeling paradigm for information retrieval tasks
Leverage Transformer memory to encode all of the information in a corpus of documents and answer user queries directly
Jointly optimize for indexing and retrieval tasks
Significantly outperforms state-of-the-art “retrieve-and-rank” methods
Open questions about applicability in dynamic corpora
Realistic scenario of new documents continually added to the indexed corpus
DSI++ proposed to address this issue
DSI++ is a continual learning challenge for DSI to incrementally index new documents
Novel benchmarks constructed from existing Natural Questions and MS MARCO datasets
Naive solution for DSI++ is to continuously fine-tune the model with an indexing objective
Catastrophic forgetting of previously memorized documents
Significant number of documents experience forgetting events during memorization
Optimize for flatter loss basins using Sharpness-Aware Minimization (SAM)
Generative memory to sample pseudo-queries for already indexed documents and use them to alleviate forgetting of the retrieval task

Problem setup

Initial corpus of documents and user queries corresponding to a subset of them
Two tasks: memorization and retrieval
Representing docids with unstructured atomic and structured string docids
Dynamic corpus scenario with new batches of documents arriving
Goal: Learn a DSI++ system that incrementally indexes new batches of documents while being able to answer queries related to previously and additionally indexed documents

Benchmark for dsi++.

We introduce two benchmarks from the Natural Questions and MS MARCO datasets.
We use the NQ train split to construct train/validation splits and use NQ validation as a test split.
We randomly sample 50K unique articles/passages to constitute the initial D 0 corpus.
We construct five more corpora, each with 10K articles/passages.

Evaluation metrics

Indexing accuracy and Hits@k measure the proportion of documents correctly memorized and the proportion of correct documents ranked in the top k predictions, respectively.
Average performance (A n ), forgetting (F n ) and learning performance (LA n ) metrics are used to summarize the model performance as documents are incrementally indexed.
Naively structured docids underperform unstructured atomic docids across all metrics.
Increasing the model scale improves average A n and learning LA n performance, but forgetting F n is severe across all model scales.
DSI model undergoes severe forgetting and becomes less effective for the retrieval task when documents are continually indexed.

Docid representations.

Unstructured atomic, naively structured, and semantically structured docid representations are considered for studying.
Naively structured approach performs worse than unstructured atomic docids.
Semantic structure helps reduce forgetting, but still underperforms unstructured atomic docids.
Larger models outperform smaller counterparts in terms of average performance and learning performance.
Forgetting is severe across all model scales.

Implicit forgetting during memorization: sam

Memorization is a primary task in the DSI paradigm
Each docid is assigned a unique token/class label
Memorization is an instance of challenging extreme classification setting
For every class, there is only one labeled example
Model performance fluctuates a lot over the course of training
Forgetting events occur when a document is classified correctly then incorrectly
88% of documents undergo forgetting at least once
Geometric properties of minima affect forgetting
Pre-trained initialization and Sharpness-Aware Minimization (SAM) can mitigate forgetting
SAM outperforms Adafactor in terms of indexing accuracy and retrieval task
SAM helps stably memorize documents, but there is room for improvement

Explicit forgetting: generative memory

DSI paradigm consists of two tasks - memorization and retrieval
SAM alleviates implicit forgetting by stably memorizing documents
DSI models undergo severe forgetting when indexing new documents
Humans rely on episodic memory to retain previously learned knowledge
Memory-based approaches use a subset of previous task data to reduce forgetting
Generative language models can be used to generate queries for documents
Generative memory is used to generate pseudo-queries for continual semi-supervised learning

Experimentation

Implementation details

Utilize pre-trained T5-Base model to initialize all models
Train models for 1M steps with 100K warmup for initial indexing, 100K steps with 100 warmup for continual indexing
Follow Tay et al. (2022) for hyper-parameters
Evaluate models every 5K steps and select best performing checkpoint
Co-train on indexing and retrieval tasks for initial training
Use average of all DSI metrics for model selection
Use indexing accuracy for continual learning experiments
Use retrieval dataset R 0 for generative memory model
Set maximum sequence length for document contents to 1024, target length for generated queries to 32
Use beam decoding to generate pseudo-queries
Tune learning rate and linear warmup
Use T5X framework and 4-8 TPUv4 chips for training

Methods

Proposed generative memory-based approach compared to continual indexing and continual indexing with all seen documents.
DSI model is fine-tuned with indexing objective on incoming and updated corpus.
Documents sampled from old and new corpora in equal proportion.

Added

Method

Continual indexing of updated corpus reduces forgetting
Experience replay with either D0 or D1 hurts forward transfer
Augmenting pseudo-queries for all documents with continual indexing reduces forgetting and improves forward transfer
Reduces forgetting of D0 while incrementally indexing in a large corpus setting

Results

Generative memory-based approach helps to alleviate forgetting of older documents.
Augmenting generative memory during continual indexing reduces forgetting and improves average Hits@10.

Does generative memory alleviate forgetting of old documents?

Hits@1 for model after training on D0 is 35.9
Continually indexing both D0 and D1 reduces forgetting of retrieval task
ER with generative memory outperforms episodic memory
Generative memory reduces forgetting of previously indexed documents
ER with generative memory enables forward transfer to newly indexed documents

Does generative memory generalize to different datasets?

MS MARCO dataset Hits@1 is 78.2 after training on D 0 passages
Indexing both D 0 and D 1 corpora reduces forgetting the retrieval task
Incremental learning with indexing objective shows impressive forward transfer
ER with generative memory improves overall performance and reduces forgetting
Results hold across two datasets - Natural Questions and MS MARCO
Sparse regularization updates from ER positively influence backward and forward transfer
Incremental indexer updating outperforms “train from scratch” baseline
Computationally efficient - 6 times fewer updates than re-training the model from scratch

Review relevant prior works related to DSI++ and continual learning methods
Pre-trained language models can be used to extract relational knowledge
Roberts et al. (2020) demonstrate pre-trained models can answer open-domain questions without external knowledge
Non-trivial to update knowledge stored in weights of these models
Zhu et al. (2020) introduces an experimentation setup to update facts stored within pre-trained models
Dhingra et al. (2022) introduces a diagnostic dataset to probe LMs for facts that change over time
Recent works investigate efficient ways to localize and edit facts stored in LMs
Task is to memorize facts in DSI++
Parameter isolation-based approaches assign different dedicated subsets of model parameters to each task
Parameter isolation-based approaches require corpus identity for every user query at test-time

Conclusion

DSI++ presents a novel direction for exploration of DSI models to solve a critical requirement
Experiments show proposed solutions to optimize for flatter loss basins using SAM and generative memory alleviate forgetting
Aim to reduce need to re-train DSI models from scratch when new documents are added
Indexing accuracy of D 0 , D 1 , and D 2 document corpora visualized when continuously indexing new documents
Cumulative histogram of forgetting events over the course of memorization of the initial D0 corpus
Systematic study about forgetting and forward transfer when incrementally indexing new corpus of documents
Investigating effectiveness of SAM and generative memory in mitigating forgetting
SAM increases percentage of examples experiencing zero forgetting events
Generative memory reduces forgetting and improves average Hits@10

Link to paper#

Abstract#

Paper Content#

Introduction#

Problem setup#

Benchmark for dsi++.#

Evaluation metrics#

Docid representations.#

Implicit forgetting during memorization: sam#

Explicit forgetting: generative memory#

Experimentation#

Implementation details#

Methods#

Added#

Method#

Results#

Does generative memory alleviate forgetting of old documents?#

Does generative memory generalize to different datasets?#

Related work#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Problem setup

Benchmark for dsi++.

Evaluation metrics

Docid representations.

Implicit forgetting during memorization: sam

Explicit forgetting: generative memory

Experimentation

Implementation details

Methods

Added

Method

Results

Does generative memory alleviate forgetting of old documents?

Does generative memory generalize to different datasets?

Related work

Conclusion