Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- DSIs encode documents in a model and use the same model to map queries to relevant documents.
- Reindexing the corpus is computationally expensive.
- DSI++ is a continual learning challenge to incrementally index new documents while being able to answer queries related to both previously and newly indexed documents.
- Model experiences forgetting events during training, leading to unstable learning.
- Two approaches are proposed to mitigate these issues: modifying the training dynamics and introducing a generative memory.
Paper Content
Introduction
- Differentiable Search Indices (DSI) represent a new modeling paradigm for information retrieval tasks
- Leverage Transformer memory to encode all of the information in a corpus of documents and answer user queries directly
- Jointly optimize for indexing and retrieval tasks
- Significantly outperforms state-of-the-art “retrieve-and-rank” methods
- Open questions about applicability in dynamic corpora
- Realistic scenario of new documents continually added to the indexed corpus
- DSI++ proposed to address this issue
- DSI++ is a continual learning challenge for DSI to incrementally index new documents
- Novel benchmarks constructed from existing Natural Questions and MS MARCO datasets
- Naive solution for DSI++ is to continuously fine-tune the model with an indexing objective
- Catastrophic forgetting of previously memorized documents
- Significant number of documents experience forgetting events during memorization
- Optimize for flatter loss basins using Sharpness-Aware Minimization (SAM)
- Generative memory to sample pseudo-queries for already indexed documents and use them to alleviate forgetting of the retrieval task
Problem setup
- Initial corpus of documents and user queries corresponding to a subset of them
- Two tasks: memorization and retrieval
- Representing docids with unstructured atomic and structured string docids
- Dynamic corpus scenario with new batches of documents arriving
- Goal: Learn a DSI++ system that incrementally indexes new batches of documents while being able to answer queries related to previously and additionally indexed documents
Benchmark for dsi++.
- We introduce two benchmarks from the Natural Questions and MS MARCO datasets.
- We use the NQ train split to construct train/validation splits and use NQ validation as a test split.
- We randomly sample 50K unique articles/passages to constitute the initial D 0 corpus.
- We construct five more corpora, each with 10K articles/passages.
Evaluation metrics
- Indexing accuracy and Hits@k measure the proportion of documents correctly memorized and the proportion of correct documents ranked in the top k predictions, respectively.
- Average performance (A n ), forgetting (F n ) and learning performance (LA n ) metrics are used to summarize the model performance as documents are incrementally indexed.
- Naively structured docids underperform unstructured atomic docids across all metrics.
- Increasing the model scale improves average A n and learning LA n performance, but forgetting F n is severe across all model scales.
- DSI model undergoes severe forgetting and becomes less effective for the retrieval task when documents are continually indexed.
Docid representations.
- Unstructured atomic, naively structured, and semantically structured docid representations are considered for studying.
- Naively structured approach performs worse than unstructured atomic docids.
- Semantic structure helps reduce forgetting, but still underperforms unstructured atomic docids.
- Larger models outperform smaller counterparts in terms of average performance and learning performance.
- Forgetting is severe across all model scales.
Implicit forgetting during memorization: sam
- Memorization is a primary task in the DSI paradigm
- Each docid is assigned a unique token/class label
- Memorization is an instance of challenging extreme classification setting
- For every class, there is only one labeled example
- Model performance fluctuates a lot over the course of training
- Forgetting events occur when a document is classified correctly then incorrectly
- 88% of documents undergo forgetting at least once
- Geometric properties of minima affect forgetting
- Pre-trained initialization and Sharpness-Aware Minimization (SAM) can mitigate forgetting
- SAM outperforms Adafactor in terms of indexing accuracy and retrieval task
- SAM helps stably memorize documents, but there is room for improvement
Explicit forgetting: generative memory
- DSI paradigm consists of two tasks - memorization and retrieval
- SAM alleviates implicit forgetting by stably memorizing documents
- DSI models undergo severe forgetting when indexing new documents
- Humans rely on episodic memory to retain previously learned knowledge
- Memory-based approaches use a subset of previous task data to reduce forgetting
- Generative language models can be used to generate queries for documents
- Generative memory is used to generate pseudo-queries for continual semi-supervised learning
Experimentation
Implementation details
- Utilize pre-trained T5-Base model to initialize all models
- Train models for 1M steps with 100K warmup for initial indexing, 100K steps with 100 warmup for continual indexing
- Follow Tay et al. (2022) for hyper-parameters
- Evaluate models every 5K steps and select best performing checkpoint
- Co-train on indexing and retrieval tasks for initial training
- Use average of all DSI metrics for model selection
- Use indexing accuracy for continual learning experiments
- Use retrieval dataset R 0 for generative memory model
- Set maximum sequence length for document contents to 1024, target length for generated queries to 32
- Use beam decoding to generate pseudo-queries
- Tune learning rate and linear warmup
- Use T5X framework and 4-8 TPUv4 chips for training
Methods
- Proposed generative memory-based approach compared to continual indexing and continual indexing with all seen documents.
- DSI model is fine-tuned with indexing objective on incoming and updated corpus.
- Documents sampled from old and new corpora in equal proportion.
Added
Method
- Continual indexing of updated corpus reduces forgetting
- Experience replay with either D0 or D1 hurts forward transfer
- Augmenting pseudo-queries for all documents with continual indexing reduces forgetting and improves forward transfer
- Reduces forgetting of D0 while incrementally indexing in a large corpus setting
Results
- Generative memory-based approach helps to alleviate forgetting of older documents.
- Augmenting generative memory during continual indexing reduces forgetting and improves average Hits@10.
Does generative memory alleviate forgetting of old documents?
- Hits@1 for model after training on D0 is 35.9
- Continually indexing both D0 and D1 reduces forgetting of retrieval task
- ER with generative memory outperforms episodic memory
- Generative memory reduces forgetting of previously indexed documents
- ER with generative memory enables forward transfer to newly indexed documents
Does generative memory generalize to different datasets?
- MS MARCO dataset Hits@1 is 78.2 after training on D 0 passages
- Indexing both D 0 and D 1 corpora reduces forgetting the retrieval task
- Incremental learning with indexing objective shows impressive forward transfer
- ER with generative memory improves overall performance and reduces forgetting
- Results hold across two datasets - Natural Questions and MS MARCO
- Sparse regularization updates from ER positively influence backward and forward transfer
- Incremental indexer updating outperforms “train from scratch” baseline
- Computationally efficient - 6 times fewer updates than re-training the model from scratch
Related work
- Review relevant prior works related to DSI++ and continual learning methods
- Pre-trained language models can be used to extract relational knowledge
- Roberts et al. (2020) demonstrate pre-trained models can answer open-domain questions without external knowledge
- Non-trivial to update knowledge stored in weights of these models
- Zhu et al. (2020) introduces an experimentation setup to update facts stored within pre-trained models
- Dhingra et al. (2022) introduces a diagnostic dataset to probe LMs for facts that change over time
- Recent works investigate efficient ways to localize and edit facts stored in LMs
- Task is to memorize facts in DSI++
- Parameter isolation-based approaches assign different dedicated subsets of model parameters to each task
- Parameter isolation-based approaches require corpus identity for every user query at test-time
Conclusion
- DSI++ presents a novel direction for exploration of DSI models to solve a critical requirement
- Experiments show proposed solutions to optimize for flatter loss basins using SAM and generative memory alleviate forgetting
- Aim to reduce need to re-train DSI models from scratch when new documents are added
- Indexing accuracy of D 0 , D 1 , and D 2 document corpora visualized when continuously indexing new documents
- Cumulative histogram of forgetting events over the course of memorization of the initial D0 corpus
- Systematic study about forgetting and forward transfer when incrementally indexing new corpus of documents
- Investigating effectiveness of SAM and generative memory in mitigating forgetting
- SAM increases percentage of examples experiencing zero forgetting events
- Generative memory reduces forgetting and improves average Hits@10