Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Various techniques have been developed to improve dense retrieval.
- Existing DRs often suffer from effectiveness tradeoffs between supervised and zero-shot retrieval.
- A generalizable DR can be trained to achieve high accuracy in both supervised and zero-shot retrieval without increasing model size.
- Common data augmentation practices are often inefficient and sub-optimal.
- DRAGON is the first BERT-base-sized DR to achieve state-of-the-art effectiveness in both supervised and zero-shot evaluations.
Paper Content
Introduction
- Bi-encoder based neural retrievers allow documents to be pre-computed and stored
- Training data is often scarce in real-world scenarios
- SPLADE++ and Col-BERTv2 are more expressive representations
- Dense retrieval (DR) is a simpler bi-encoder retrieval model
- Pretraining, query augmentation and distillation can improve DR effectiveness
- Tradeoff between supervised and zero-shot effectiveness
- GTR-XXL breaks the effectiveness tradeoff but is inefficient
- Unified framework of data augmentation for contrastive learning
- Key to training a generalizable DR is to create diverse relevance labels
- Cheap and large-scale augmented queries can be used instead of neural generative queries
- Progressive label augmentation strategy proposed
- DRAGON breaks the supervised and zero-shot effectiveness tradeoff without increasing model size
Background
- Retrieval task and contrastive learning approach for dense retrieval introduced
- Unified framework for understanding recent approaches to improve dense retrieval training provided
Training dense retrieval models
- Task is to retrieve documents to maximize ranking metrics
- Dense retrieval uses bi-encoder architecture and dot product between encoded vectors as similarity score
- Contrastive learning used to train DR models by contrasting positive pairs against negatives
- Data augmentation used to increase query size and improve data quality
- Query augmentation includes sentence cropping and pseudo query generation
- Supervised label augmentation used on augmented queries
Settings for empirical studies
- Introduce basic experimental settings in Section 3
- Detailed settings in Section 4
- Use MS MARCO and BEIR datasets
- Evaluate models on MS MARCO Dev
- Report MRR@10 and Recall@1000
- Use BEIR for zero-shot evaluations
- Report averaged nDCG@10 over BEIR-13
Pilot studies on data augmentation
- Discuss exploration space of data augmentation
- Conduct empirical studies on how to better train a dense retriever
- Propose data augmentation recipe to train DRAGON
An exploration of data augmentation
- Query augmentation uses sentences from MS MARCO corpus and synthetic queries from doct5query
- Label augmentation uses multiple sources of supervisions from existing sparse, dense and multi-vector retrievers
Training with diverse supervisions
- Introduced searching space for query and label augmentation
- Training dense retriever on augmented data is not trivial
- Need to create supervised training data using a teacher from augmented queries
- Need to train dense retriever to digest multiple supervisions
- Strategies to train dense retriever with diverse supervisions: linear score fusion, uniform supervision, progressive supervision
- Train dense retriever using contrastive loss
Empirical studies
- Empirical studies conducted on how to better train a dense retriever
- Three teachers used for supervised labels: uniCOIL, Contriever, ColBERTv2
- Models trained with single and multiple sources of supervisions
- Progressive supervision used: uniCOIL→ Contriever → Col-BERTv2
- Learning from diverse relevance labels from multiple retrievers is key to gain generalizability capability
- Different trajectories have impact on models’ zero-shot retrieval effectiveness
- Query size is key to successful training
- Mixture of cropped sentences and generative queries yields strong retrieval effectiveness
- Final recipe proposed: 20 epochs, trajectory of progressive supervision, mixture of cropped sentences and synthetic queries
- DRAGON variants: DRAGON-S, DRAGON-Q, DRAGON+
Comparison with the state of the art
Datasets
- Evaluated model supervised effectiveness on TREC DL queries
- TREC DL queries have on average 95 graded relevance labels per query
- Evaluated models on 18 datasets in BEIR
- Evaluated models on LoTTE, which consists of questions and answers from 5 topics
- Reported retrieval effectiveness of Success@5 on search and forum queries
Baseline models
- DRAGON is compared to dense retrievers using bert-base-uncased
- Knowledge Distillation, Contrastive Pre-Training, Masked Auto-Encoding Pre-Training and Domain Adaptation techniques are used
- Different corpus are used for pre-training
- MS MARCO training queries are used for fine-tuning
- Results are reported from original papers and pyserini
Implementation details
- Trained dense retrievers using 32 A100 GPUs with batch size of 64 and learning rate of 3e-5
- Used asymmetric dual encoder with two distinctly parameterized encoders and inbatch negative mining
- Maximum query and passage lengths set to 32 and 128 for MS MARCO training and evaluation
- Maximum input lengths set to 512 for BEIR evaluation
Results
- Baseline dense retrievers perform well on MS MARCO dev set but not on TREC DL queries
- DRAGON variants show strong retrieval effectiveness on MS MARCO dev and TREC DL queries
- Dense retrievers perform better in zero-shot evaluations than MS MARCO dev queries
- DRAGON variants outperform dense retrievers in zero-shot evaluations
- DRAGON+ combined with MAE pre-training sees further improvement on zero-shot evaluations
- Mixing different types of queries can mitigate the issue of queries being far different from human-like queries
Discussions
- DRAGON-S is trained with augmented relevance labels from a cross encoder
- Re-ranking with the cross encoder does not improve retrieval effectiveness
- DRAGON-S benefits from masked autoencoding rather than contrastive pre-training
- It is challenging to normalize relevance scores from a sparse and dense retriever
- Sentence cropping yields a generalizable dense retriever
- Cropped sentences provide diverse queries from a passage
- Cropped sentences have more unique augmented relevant passages than generative queries
Related work
- Knowledge Distillation is a technique used to improve the effectiveness of DR
- Previous work used soft labels from KD and relevant passages labeled by humans
- Recent work mines more positive samples using cross encoders to augment labels
- Curriculum Learning is used to improve machine learning tasks
- Pre-training techniques include contrastive pre-training and masked auto encoding pre-training
- Supervised contrastive learning is used on artificially created text pairs
Conclusion
- We present DRAGON, a dense retriever trained with diverse data augmentation
- We propose a unified framework of data augmentation to understand recent progress of training dense retrievers
- We study how to improve dense retrieval training through query and relevance label augmentation
- We propose a diverse data augmentation recipe, query augmentation with the mixture of sentence cropping and generative queries, and progressive relevance label augmentation with multiple teachers
- We demonstrate that a single BERT-base-sized dense retriever can achieve state-of-the-art effectiveness in both supervised and zero-shot retrieval tasks
- We study the impact of top-k positive sampling
- We observe that treating top-10 passages from each teacher as positives yields the best supervised and zero-shot effectiveness
- We estimate the accuracy and diversity of supervision
- We compare with existing state-of-the-art dense retrievers
- We compare uniform and progressive supervision effectiveness
- We list data statistics of MS MARCO dataset and DRAGONs’ detailed effectiveness on LoTTE